GPU mean memory reductions

Following discussion with miner dev Lolliedieb, it appears possible to make cuckatoo31 run in 11GB, and cuckaroo29 in 6GB with negligible slowdown,
by splitting the first trimming round in two parts, allowing for bufferA and bufferB to overlap.
Furthermore, by splitting the set of buckets in two, and seeding+trimming one half before the other, it appears possible to make cuckatoo31 run in as little as 8GB, and cuckaroo29 in 4GB. The doubled siphashing would cause a roughly 20% slowdown here.

I now have memory reductions implemented for cuckatoo in branch memred.
Additionally, I implemented PART_BITS, which allows cuda31 to work with 64x64 buckets.

The good news is that cuda31 thus runs efficiently in 11 GB (Makefile target cuda31).

The bad news is that the NVIDIA cards I tested don’t actually have 11 GB available to allocate. The 2080 Ti only has about 10.5 GB available, which requires truncating the buckets to such a degree that a small fraction of the solutions are lost (Makefile target cuda31_). The 1080 Ti has closer to 10.8 GB available and needs less truncation, resulting in a tiny solution loss. Unfortunately, the 1080 Ti is over twice as slow as a 2080 Ti…

The NVIDIA V100 cards with >= 16 GB could have a blast. Please let me know if those work now…

Here’s how the more truncated buckets on the 2080 Ti affect the number of cycles found among the first 100 nonces:

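| cycle length | cuda31.0 | cuda31.1 |
|---|---|---|
| 2 | 40 | 41 |
| 4 | 19 | 22 |
| 6 | 11 | 11 |
| 8 | 10 | 10 |
| 10 | 8 | 8 |
| 12 | 6 | 6 |
| 14 | 12 | 15 |
| 16 | 2 | 3 |
| 18 | 2 | 4 |
| 20 | 3 | 3 |
| 22 | 5 | 5 |
| 24 | 2 | 2 |
| 26 | 4 | 4 |
| 28 | 5 | 6 |
| 30 | 1 | 2 |
| 32 | 1 | 1 |
| 34 | 3 | 5 |
| 40 | 2 | 2 |
| 42 | 1 | 1 |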

On this range, cuda31.1 loses no cycles compared to the reference lcuda31 (that’s guaranteed to find all cycles).
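To put a number on the difference, here is a minimal Python sketch that totals the two columns of the table. The counts are copied from the post; the names `CYCLES` and `summarize` are made up for illustration. Note that it counts cycles of every length, while only 42-cycles are actual solutions, so this is a rough indicator of truncation loss rather than a solution-loss rate.

```python
# Cycle counts per cycle length over the first 100 nonces, copied from the post.
# Each tuple is (cycle_length, cuda31.0 count, cuda31.1 count).
CYCLES = [
    (2, 40, 41), (4, 19, 22), (6, 11, 11), (8, 10, 10), (10, 8, 8),
    (12, 6, 6), (14, 12, 15), (16, 2, 3), (18, 2, 4), (20, 3, 3),
    (22, 5, 5), (24, 2, 2), (26, 4, 4), (28, 5, 6), (30, 1, 2),
    (32, 1, 1), (34, 3, 5), (40, 2, 2), (42, 1, 1),
]


def summarize(rows):
    """Return (cuda31.0 total, cuda31.1 total, cycles lost, lost fraction)."""
    total_0 = sum(row[1] for row in rows)
    total_1 = sum(row[2] for row in rows)
    lost = total_1 - total_0
    return total_0, total_1, lost, lost / total_1


total_0, total_1, lost, fraction = summarize(CYCLES)
print(f"cuda31.0 found {total_0} cycles, cuda31.1 found {total_1}")
print(f"cuda31.0 missed {lost} cycles ({fraction:.1%})")
```

On these 100 nonces the more truncated miner finds 137 of 151 cycles overall, and both miners find the single 42-cycle.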

cuda31.0 is the soon-to-be-released miner that can use all 64 KB of shared memory on Turing GPUs (like the 2080 Ti). cuda31.1 is the already released miner that needs to set PART_BITS=1 to cope with only 32/48 KB of shared memory, as on the 1080 Ti.

I have memory reduction on cuckaroo29 done as well (mines full speed in 5.5 GB), but see signs of a possible bug which I will try to track down tomorrow…

To recap the related discussions in Gitter lobby: https://gitter.im/grin_community/Lobby?at=5c2d4f6fbabbc178b21ada89

You are welcome to continue the discussion here in the forum.

Peter Salanki @salanki 07:45
So the regular mean cuckaroo 29 miner will now work on a 6GB GPU? That changes things

Dan Saltman @dbl_twitter 07:55
Im not sure that reducing it to 6gb is a great idea. The amount of nvidia 1060 cards out there with 6gb is fucking massive. Thats literally every big commercial farm.

Peter Salanki @salanki 07:57
@dbl_twitter: Since it’s not an algorithm change someone would have figured it out in private anyway, so better to have it in public. Potentially the algorithm could be changed if the same complexity as before wants to be maintained.

engmsf @engmsf 07:58
@dbl_twitter I agree with you. The Nvidia P106-100 are everywhere. That would cause the initial mainnet total GPS to be massive. Solo mining would be prohibitive. Maybe delay the 6gb for a few weeks/month.

John Tromp @tromp 07:58
@dbl_twitter better that 6GB miner is public than private

engmsf @engmsf 08:03
@tromp I still vote for the 6gb implementation to be delayed for a week or two after mainnet.

Peter Salanki @salanki 08:04
@engmsf: The problem with that is that the cat is out of the bag now. Someone else can find the optimizations and do it only for their huge farm without releasing it to the public.

Dan Saltman @dbl_twitter 08:06
If someone made a 6gb miner when the memory cap was 8gb, doesnt that mean by lowering it now we should expect one for 4gb?

John Tromp @tromp 08:07
4gb will be possible only with nontrivial performance penalty

Dan Saltman @dbl_twitter 08:08
so, to be clear- gone is the mantra that this coin would only be mineable by people with high end cards and gamers.

John Tromp @tromp 08:08
@engmsf no delay. fairness requires optimal miners public at launch

Dan Saltman @dbl_twitter 08:09
1060s is one thing, opening it up to 4gb cards is another ball park

John Tromp @tromp 08:09
@dbl_twitter high end cards have AT31 for themselves

Gary Yu @garyyu 08:10

> Im not sure that reducing it to 6gb is a great idea. The amount of nvidia 1060 cards out there with 6gb is fucking massive. Thats literally every big commercial farm.

It’s really a big topic, whether to let existing big farms in (suppose they have massive numbers of 6gb cards) or kick them out. Hope to hear more voices here, especially with only 12 days left before launching mainnet…

vonneumaniac @vonneumaniac_twitter 08:13
could move the in-depth discussion here
GPU mean memory reductions - #3 by tromp

engmsf @engmsf 08:15
@garyyu My vote is to kick them out for a few weeks or until a block height is reached. Discussion at your next developer meeting. There are a lot of farms with Nvidia P106. All that work the developers have put in will just go to the farms and not the average guy/gal with 1 card or a small rig.

atlanticcrypto @atlanticcrypto 08:16
i think its a bit naive to think that there wont be big farms participating even at the 7GB threshold.

engmsf @engmsf 08:18
@atlanticcrypto I agree, however I believe the 6gb farm is 1000x or more than that of the 8gb farms. Again just a dumb guess and no data to back it up.

Wayne George @waynegeorge 08:18
I guess there will be a lot of idle cards out there, however as was mentioned couldn’t they just tweak the code themselves and mine anyway @engmsf

Gary Yu @garyyu 08:22
@atlanticcrypto I think @engmsf and @dbl_twitter’s key point is:

> That would cause the initial mainnet total GPS to be massive (if existing farm with massive 6gb cards can switch mining into Grin)

Peter Salanki @salanki 08:23
If there was an original point to keep the number of addressable GPUs small, that point is now gone.

engmsf @engmsf 08:30
@garyyu Agree on the initial GPS to be massive. No doubt about this in my mind that it will be if allowed. The 6gb farms should not be able to mine Grin until a certain block height.

Peter Salanki @salanki 08:32
I assume the goal there is to exclude the huge 40k GPU P106 farms from getting the bulk of the mining rewards initially
to ensure a more normal distribution

Gary Yu @garyyu 08:34
Regarding 6gb cards, perhaps we can continue listening to more voices here, and then feel free to join the planned next governance meeting (today: 3rd Jan 2019, 3:00PM UTC), refer to mimblewimble/grin-pm#31. I propose to add this to the agenda.

vonneumaniac @vonneumaniac_twitter 08:35

> @garyyu Agree on the initial GPS to be massive. No doubt about this in my mind that it will be if allowed. The 6gb farms should not be able to mine Grin until a certain block height.

sounds like central planning to me. especially if the cat’s out of the bag and people with proprietary mem optimized implementations will be able to use them anyway.

engmsf @engmsf 08:35
@salanki Well said. Disclosure, I have a 1060 6gb so I am not doing myself any favors.

The purpose of ASIC resistance is to reduce the likelihood of a single party controlling the bulk of the hashrate by having access to something others do not, right? How is C29 now accomplishing that goal if one or two huge low-memory GPU farms will dominate it? By that standard, C31 seems to be the more effective anti-centralization PoW, especially considering it was just made usable by 11GB consumer-grade GPUs. Why not just launch with 100% C31?

Does anyone have credible GPU mining statistics for the whole market? I’ve done some simple research based on this source (http://ethosdistro.com/versions/). The memory reduction to 5.5 GB would have an impact, but maybe not a dramatic one. According to the statistics, GTX 1060/P106-100 (6G) would just double the hashrate.

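AMD (31255)

| GPU | Count | Share | Memory |
|---|---|---|---|
| RX 580 | 85067 | 41.80% | 8G |
| RX 570 | 60970 | 29.96% | 4G |
| RX 470 | 32470 | 15.95% | 4G |
| RX 480 | 14753 | 7.25% | 8G |
| RX 460 | 3980 | 1.96% | 2G |
| RX 550 | 3968 | 1.95% | 2G |
| RX 560 | 2314 | 1.14% | 4G |
| Total | 203522 | 100.00% | |

Nvidia (19695)

| GPU | Count | Share | Memory |
|---|---|---|---|
| P104-100 | 31494 | 22.34% | 4G |
| GTX 1060 6G | 19800 | 14.05% | 6G |
| GTX 1070 | 19529 | 13.85% | 8G |
| P106-100 | 18690 | 13.26% | 6G |
| P102-100 | 14339 | 10.17% | 5G |
| GTX 1060 3G | 13952 | 9.90% | 3G |
| GTX 1050 Ti | 9598 | 6.81% | 4G |
| GTX 1070 Ti | 5909 | 4.19% | 8G |
| GTX 1080 Ti | 4682 | 3.32% | 11G |
| GTX 1080 | 2976 | 2.11% | 8G |
| Total | 140969 | 100.00% | |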
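The "just double" estimate can be checked against the Nvidia figures with a short Python sketch. It rests on two assumptions that are not in the post: that the earlier cutoff was 8 GB cards, and that card counts are a fair stand-in for hashrate (a 6 GB card is likely slower than an 8 GB one, so this probably overstates the effect). The names `NVIDIA` and `cards_with_at_least` are made up for illustration.

```python
# (model, card count, memory in GB), copied from the Nvidia list in the post.
NVIDIA = [
    ("P104-100", 31494, 4), ("GTX 1060 6G", 19800, 6), ("GTX 1070", 19529, 8),
    ("P106-100", 18690, 6), ("P102-100", 14339, 5), ("GTX 1060 3G", 13952, 3),
    ("GTX 1050 Ti", 9598, 4), ("GTX 1070 Ti", 5909, 8),
    ("GTX 1080 Ti", 4682, 11), ("GTX 1080", 2976, 8),
]


def cards_with_at_least(cards, min_gb):
    """Count the cards whose memory is at least min_gb."""
    return sum(count for _, count, gb in cards if gb >= min_gb)


# Assumption: 8 GB was the earlier cutoff; the 5.5 GB miner fits 6 GB cards.
eligible_8gb = cards_with_at_least(NVIDIA, 8)
eligible_6gb = cards_with_at_least(NVIDIA, 6)
ratio = eligible_6gb / eligible_8gb

print(f"8 GB and up: {eligible_8gb} cards")
print(f"6 GB and up: {eligible_6gb} cards")
print(f"growth factor: {ratio:.2f}")
```

By card count, the eligible Nvidia pool grows from 33096 to 71586, a factor of about 2.16, which is in line with the rough doubling described in the post.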

grin-miner now features the 5.5 GB CUDA mean miner for cuckARoo29,
and two 11 GB CUDA mean miners for cuckAToo31, a general one for Pascal cards, and a specialized one for Turing cards, that benefits from using 64 KB of shared memory.

Is this no longer a desired benefit or is it somehow no longer feasible?

There is no effective way to exclude 6 GB GPUs…

Hello! On the page https://github.com/mimblewimble/docs/wiki/GPU-Mining-Stats I see RX 580 cards (which max out at 8 GB, obviously) on the Cuckatoo31 OCL miner, making quite good numbers. Is that true? How is that possible?

They use lean mining by the cuckatoo_lean_cuda_31 plugin.

The page should really be categorised by plugin name to avoid confusion…

I’m getting Errored when I run cuckatoo_mean_cuda_rtx_31 with my RTX 2080 Ti. I’m already driving my display from the motherboard instead of the GPU. When I check with nvidia-smi, the card is already using 398MiB while idling. Is this normal?

No problem when I use cuckatoo_lean_cuda_31.

Anyway, where can I see the logs to find out more about the error?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   48C    P8    17W / 250W |    398MiB / 10989MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+

No; on the Linux system I use, running with integrated graphics, I see

|   1  GeForce GTX 108…    Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   38C    P5    16W / 250W |      0MiB / 11178MiB |      0%      Default |

You need to find out what’s occupying that 398MiB, it’s nothing to do with grin or grin-miner. Do you have other NVIDIA diagnostics?
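As a quick way to check how much memory is really free before starting the miner, here is a hedged Python sketch. It relies on nvidia-smi's `--query-gpu=memory.used,memory.total --format=csv,noheader,nounits` options; the helper names (`parse_memory_csv`, `free_mib`, `query_gpus`) are made up for illustration. The demo at the bottom uses the two readings quoted in this thread rather than calling the tool, so it runs without a GPU.

```python
import subprocess


def parse_memory_csv(text):
    """Parse lines like '398, 10989' (used MiB, total MiB) into tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def free_mib(used, total):
    """Memory still available for allocation, in MiB."""
    return total - used


def query_gpus():
    """Ask nvidia-smi for per-GPU memory use (needs an NVIDIA driver install)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


# The two readings quoted in this thread, in the same CSV shape.
sample = "398, 10989\n0, 11178\n"
for used, total in parse_memory_csv(sample):
    print(f"{used} MiB in use of {total} MiB, {free_mib(used, total)} MiB free")
```

On a machine with the driver installed, `query_gpus()` returns the live values. The process table at the bottom of the plain `nvidia-smi` output shows which processes hold the memory.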

My bad, I was running in GUI mode and even though my monitor was connected to the mobo, the gpu was still automatically running processes for the GUI.

I wasn’t sure how to disable the processes so I resorted to booting in text mode to resolve my issue.

> The purpose of ASIC resistance is to reduce the likelihood of a single party controlling the bulk of the hashrate by having access to something others do not, right?

The purpose of asic resistance is a dream of not having an asic race; for example, in bitcoin there was a period of time where miners were going from top of the line to useless in months; a handful of companies would pre-sell miners, mine on them until they became unprofitable, then ship them to avoid lawsuits over their scam.

However, bitcoin’s asic race has slowed down, and “asic resistant pows” have started to get asics once they gain market share. And sha wasn’t asic friendly; it merely existed without reference to the possibility of asics. I would strongly argue that the whole thing was a mistake.

Update regarding C31 on Tesla V100:

Use the cuckatoo_mean_cuda_gtx_31 plugin with expand = 2 uncommented.
Edit NEPS_A and NEPS_B to 133 / 88 in cuckoo-miner/src/cuckoo_sys/plugins/CMakeLists.txt to eliminate the slight loss in solutions (this does not increase GPS; see @tromp’s comment below) as follows:

build_cuda_target("${AT_MEAN_CUDA_SRC}" cuckatoo_mean_cuda_gtx_31 "-DNEPS_A=133 -DNEPS_B=88 -DPART_BITS=1 -DEDGEBITS=31")

See here: https://github.com/mimblewimble/docs/wiki/GPU-Mining-Stats

Big (very big) thanks to @tromp for taking a chunk out of his day to help chase down the issues I was having with the V100.

It does nothing for GPS but eliminates the slight loss in solutions that was needed to fit in 11 GB.

Oh, gotcha! Edited above.

4GB was never a problem. Maybe only for the public miners. I expect private miners at big farms to do just fine with Cuckoo Cycle.