Solving Cuckatoo with less memory

The current Cuckatoo C31 mean solvers use 8GB of RAM (or more). These solvers work by splitting the edges into buckets and then running a lean phase on each bucket separately. The solvers I’ve seen use 4 bytes per edge in a bucket, so the total memory usage is 2^31 * 4 bytes = 8GB.

However, there is a better way to store edges in a bucket. If the edges in a bucket are sorted, the difference between two consecutive edges is usually small, and the distribution of the expected gap values depends only on the number of buckets. If you write and read the edges sequentially, you can store just the difference bits. With an efficient bit-packing scheme you need fewer than 15 bits per individual diff. See https://en.wikipedia.org/wiki/Arithmetic_coding for an example of such a coding scheme.
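To make the idea concrete, here is a minimal sketch of gap encoding for one bucket using simple Rice-style bit packing. This is only my illustrative example, not the proof of concept linked below; the BitStream helper and the RICE_K parameter are assumptions, and a real solver would tune the coding to the actual gap distribution.

    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Minimal LSB-first bit stream for writing/reading packed gaps.
    struct BitStream {
      std::vector<uint8_t> bytes;
      size_t bitpos = 0;
      void put(uint32_t v, int nbits) {
        for (int i = 0; i < nbits; i++, bitpos++) {
          if (bitpos / 8 == bytes.size()) bytes.push_back(0);
          bytes[bitpos / 8] |= ((v >> i) & 1) << (bitpos % 8);
        }
      }
      uint32_t get(size_t &pos, int nbits) const {
        uint32_t v = 0;
        for (int i = 0; i < nbits; i++, pos++)
          v |= uint32_t((bytes[pos / 8] >> (pos % 8)) & 1) << i;
        return v;
      }
    };

    const int RICE_K = 10;  // remainder width; tuned to the expected mean gap

    // Sort the bucket and write each gap as a unary quotient plus a
    // fixed-width remainder (Rice coding); small gaps take few bits.
    void encode_bucket(std::vector<uint32_t> edges, BitStream &bs) {
      std::sort(edges.begin(), edges.end());
      uint32_t prev = 0;
      for (uint32_t e : edges) {
        uint32_t gap = e - prev;
        prev = e;
        for (uint32_t q = gap >> RICE_K; q > 0; q--) bs.put(1, 1);
        bs.put(0, 1);
        bs.put(gap & ((1u << RICE_K) - 1), RICE_K);
      }
    }

    // Sequentially reconstruct the sorted edges by accumulating the gaps.
    std::vector<uint32_t> decode_bucket(const BitStream &bs, size_t count) {
      std::vector<uint32_t> edges;
      size_t pos = 0;
      uint32_t prev = 0;
      for (size_t i = 0; i < count; i++) {
        uint32_t q = 0;
        while (bs.get(pos, 1)) q++;
        prev += (q << RICE_K) | bs.get(pos, RICE_K);
        edges.push_back(prev);
      }
      return edges;
    }

    int main() {
      std::vector<uint32_t> edges = {7, 3000, 99, 1234567, 4200000};
      BitStream bs;
      encode_bucket(edges, bs);
      std::sort(edges.begin(), edges.end());
      assert(decode_bucket(bs, edges.size()) == edges);
      return 0;
    }

With B buckets the mean gap within a bucket is about B, so each gap needs roughly log2(B) plus a couple of bits on average, well under a full 4-byte slot.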

I’ve implemented a simple encoding as a proof of concept: https://github.com/dzetkulic/ct/blob/master/a.cc#L38

The code above only does Phase 1 of edge trimming. It’s not a full solver, but it can do Phase 1 with 3.7GB of RAM. That’s 0.47x of what’s usually needed.

Interestingly, this is orthogonal to Slean mining (https://forum.grin.mw/t/slean-mining-mining-cuckatoo3x-using-low-memory-gpus/) and you can use the two approaches together for even lower memory usage. For example, if you do Slean with K=4 plus bitpacking, then you only need 2*256MB + (12GB/4)*0.47 = 1.91GB.

The bad news is that I don’t see a good way to implement this efficiently on GPUs, and on CPUs it’s quite slow. The good news is that I think it would work just fine on an ASIC, where you can afford to do the bit manipulations efficiently.

In the future we may see ASICs that use 2GB of DRAM with Slean+bitpacking instead of utilizing huge SRAM.


A simpler form of that technique is used in the CPU mean miner.
In https://github.com/tromp/cuckoo/blob/master/src/cuckatoo/mean.hpp#L696-L700 we have the code fragment

    for (; readbig < endreadbig; readbig += SRCSIZE) {
// bit        43..37    36..22     21..15     14..0
// write      UYYYYY    UZZZZZ     VYYYYY     VZZZZ   within VX partition
      const u64 e = *(u64 *)readbig & SRCSLOTMASK;
      uxyz += ((u32)(e >> YZBITS) - uxyz) & SRCPREFMASK;

where successive edges are assumed to differ by at most 1 in the most significant 7 bits UX of the U endpoint. These missing bits are recovered with the rather subtle bit manipulation on the last line, which basically replaces the 22 least significant bits of the previous U endpoint and produces a carry into the upper 7 bits if the replacement is a smaller 22-bit value.
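To spell the trick out in isolation, here is a tiny self-contained sketch of the same recovery step; the 22/7-bit split mirrors the roles of SRCPREFBITS/SRCPREFMASK above, but the names and test values are just illustrative:

    #include <cassert>
    #include <cstdint>
    #include <cstdio>

    int main() {
      const int PREFBITS = 22;                        // stored low bits
      const uint32_t PREFMASK = (1u << PREFBITS) - 1;

      // Sorted full values; each exceeds the previous one by less than
      // 2^PREFBITS, which is what makes the carry trick work.
      const uint32_t full[] = {0x01F00000, 0x01FFFFFE, 0x0200000A,
                               0x02000300, 0x023FFF00};

      uint32_t recovered = full[0];      // reader starts from a known full value
      for (uint32_t v : full) {
        uint32_t stored = v & PREFMASK;  // the writer drops the upper 7 bits
        // Replace the low PREFBITS of `recovered` with `stored`; if `stored`
        // is smaller than the previous low bits, the masked subtraction wraps
        // and the addition carries 1 into the dropped upper bits.
        recovered += (stored - recovered) & PREFMASK;
        assert(recovered == v);
      }
      printf("all values recovered\n");
      return 0;
    }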

It’s much harder to make ordering assumptions with the high levels of multi threading that we have on GPUs.


Right. As I mentioned, this is slow on a CPU and does not work on a GPU. However, I think it may work on an ASIC. An example ASIC for C32:

8x 1GB DRAM chips. Each DRAM chip stores:

  • 64MB for the edge bitmap.
  • 128x 7.5MB for buckets.

There will be 1024 buckets total, each holding ~4,194,304 (2^22) values on average; with 7.5MB (~63 Mbit) per bucket that is 15 bits per edge on average. The encoding in my link should be able to fit this.

You can then do Slean mining with K=1:

  1. Stream the edge bits, split them into 1024 streams in hardware (based on the 10 highest endpoint bits), bitpack the streams in hardware and stream them into the buckets. With 1024 streams, 15 bits is more than enough to fit the values into the buckets.
  2. Have N * 256kB of SRAM for N-way per-bucket parallelism. For each bucket (see the sketch after this list):
    a) Memset the SRAM to zeros.
    b) Stream the bucket and set the corresponding (effectively random) node bits in SRAM.
    c) Stream the bucket again and record which edges to drop.
  3. Stream the 1024 streams back, merge them and reset bits in the edge bitmap.
  • All DRAM accesses are sequential.
  • Decoder/encoder logic seems very easy to do in silicon.
  • You need only a few MB of SRAM.
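For concreteness, here is one way steps 2a-2c could look for a single bucket. It is only a sketch under my own assumptions: node degrees are counted with a "seen once / seen twice" pair of bitmaps, i.e. 2 bits per node purely for clarity, and the names and sizes are illustrative.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    const uint32_t NODES_PER_BUCKET = 1u << 22;  // 2^32 nodes / 1024 buckets

    struct Edge {
      uint32_t index;  // global edge index
      uint32_t node;   // low 22 bits of the endpoint (the high 10 select the bucket)
    };

    std::vector<uint32_t> trim_bucket(const std::vector<Edge> &bucket) {
      static std::vector<uint64_t> once(NODES_PER_BUCKET / 64);
      static std::vector<uint64_t> twice(NODES_PER_BUCKET / 64);
      // 2a) clear the on-chip bitmaps
      std::fill(once.begin(), once.end(), 0);
      std::fill(twice.begin(), twice.end(), 0);
      // 2b) first streaming pass: bump each touched node's counter
      for (const Edge &e : bucket) {
        uint64_t bit = 1ull << (e.node % 64);
        if (once[e.node / 64] & bit) twice[e.node / 64] |= bit;
        else once[e.node / 64] |= bit;
      }
      // 2c) second streaming pass: an edge whose node was seen only once is dropped
      std::vector<uint32_t> dead;
      for (const Edge &e : bucket)
        if (!(twice[e.node / 64] & (1ull << (e.node % 64))))
          dead.push_back(e.index);
      return dead;
    }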

If 8GB is still too much, you can do K>1 Slean…

I get 2^22 bits, or 512 KB of SRAM, per bucket.
I presume by N-way you mean processing N buckets simultaneously.
Why is that better than just making your buckets N times larger?

Yes; this is a pretty neat approach!

That has the big downside of needing the full 512 MB of SRAM for the node bitmap, since your buckets contain only a fraction 1/k of all edges that will determine the final values of the node bitmap. Curiously, this is exactly what my question to Wilke at Grin Amsterdam was about.

See the slides at 1:27:50 and 1:28:00 in the talk:

There is a big arrow that shows copying data from 512MB DRAM to small SRAM and back. You need to do those copies (k-1) times, but those can be sequential accesses to the DRAM.

You’re right. I meant to say it needs the full 512 MB for k > 1.
But it doesn’t need to be all SRAM.

For k=1 however you don’t need 512 MB. Only N * (512 MB / #BUCKETS) for simultaneous processing of N buckets.
So that’s a big incentive to keep k at 1.

The idea is to keep all DRAM chips busy all the time, so you would probably set N=8 in my example and process 8 buckets in parallel, one from each DRAM chip. If your SRAM latency cannot keep up with the DRAM throughput, you can set N=16 to better utilize the available DRAM throughput (at the price of more SRAM on the chip).

The math is simple: the more DRAM chips you have and the smaller the SRAM you want to use, the more buckets you need. But with fewer buckets you need fewer bits for the bitpacked streams. I don’t have all the data to compute the best parameters, but you get the idea.
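As a rough back-of-the-envelope illustration of that trade-off (my own assumptions, not numbers from above: a 1-bit-per-node bitmap per bucket and ~2 bits of coding overhead per gap):

    #include <cmath>
    #include <cstdio>

    int main() {
      const double nodes = 4294967296.0;  // 2^32 nodes per side for C32
      const double overhead = 2.0;        // assumed coding overhead per gap, in bits
      const int options[] = {256, 1024, 4096};
      for (int buckets : options) {
        double sram_kb = nodes / buckets / 8 / 1024;      // bitmap per bucket
        double gap_bits = std::log2(buckets) + overhead;  // mean gap ~ #buckets
        printf("%5d buckets: ~%6.0f KB SRAM per bucket, ~%4.1f bits per edge\n",
               buckets, sram_kb, gap_bits);
      }
      return 0;
    }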

Right. Going from k=1 to k=2 comes at the price of much more complex logic (multiple passes) and the 512MB requirement. Going to higher k just increases latency further. The options k=1 and k=4 look the best to me: k=2 comes with a cost that is not offset by enough benefits, and k>4 comes with diminishing returns.