Cuckatoo32 feasibility

When setting the ASIC-friendly PoW to be cuckatoo32+, the expectation was that ASICs would be able to search a cuckatoo32 graph within one second, so that mining remains mostly progress-free (wasting less than 1% of hashpower during the expected 1-minute block interval).
Initial feedback from ASIC designers, however, suggests that achieving such speed will be very difficult, and that mining devices should be able to handle cuckatoo33 as well, since cuckatoo32 is planned to be phased out after only 1 year.

I’d like to discuss two possible ASIC designs in detail to clarify the challenge, focusing just on lean edge trimming. Assume N=2^32 and that we have N/2^k bits of SRAM for node degree counting.
Each round of trimming consists of 2^k passes to cover all nodes on either side of the graph. For full speed we would like k=0.
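
To make the trimming loop concrete, here’s a rough sketch in C of one round on one side of the graph (call it once with uorv=0 and once with uorv=1 per round). The sipnode() helper is a hypothetical stand-in for the siphash-based edge-to-node mapping, and for clarity the sketch spends 2 bits per node on a saturating degree counter, whereas the budget above assumes a tighter 1-bit-per-node encoding:

    #include <stdint.h>
    #include <string.h>

    #define EDGEBITS 32
    #define NEDGES   ((uint64_t)1 << EDGEBITS)
    #define K        0                         /* 2^K passes per round */

    /* hypothetical: siphash-based mapping of an edge to its U (uorv=0)
       or V (uorv=1) endpoint */
    uint64_t sipnode(uint64_t edge, int uorv);

    /* alive:    NEDGES-bit edge bitmap, in DRAM
       counters: 2-bit saturating degree counters for the NEDGES >> K
                 nodes covered by the current pass, in SRAM            */
    void trim_round(uint8_t *alive, uint8_t *counters, int uorv) {
      for (uint64_t pass = 0; pass < (uint64_t)1 << K; pass++) {
        memset(counters, 0, (size_t)(NEDGES >> K) / 4);
        for (int phase = 0; phase < 2; phase++) {  /* count, then kill */
          for (uint64_t e = 0; e < NEDGES; e++) {
            if (!(alive[e / 8] >> (e % 8) & 1)) continue;
            uint64_t node = sipnode(e, uorv) & (NEDGES - 1);
            if (node >> (EDGEBITS - K) != pass) continue; /* other pass */
            uint64_t n = node & ((NEDGES >> K) - 1);
            uint8_t cnt = counters[n / 4] >> (2 * (n % 4)) & 3;
            if (phase == 0) {                    /* count node degree  */
              if (cnt < 2) counters[n / 4] += 1 << (2 * (n % 4));
            } else if (cnt < 2) {                /* degree < 2: kill   */
              alive[e / 8] &= ~(uint8_t)(1 << (e % 8));
            }
          }
        }
      }
    }
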
The fraction of remaining edges shrinks as follows in the first 29 rounds:

i  fraction remaining  cumulative fraction
0 1                 1
1 0.632108942954801 1.6321089429548
2 0.296168498345651 1.92827744130045
3 0.175267872633412 2.10354531393386
4 0.116738814278506 2.22028412821237
5 0.0836605401709676 2.30394466838334
6 0.0630388165591285 2.36698348494247
7 0.0492773308651522 2.41626081580762
8 0.0396187112201005 2.45587952702772
9 0.0325690492754802 2.4884485763032
10 0.0272613512352109 2.51570992753841
11 0.0231630356283858 2.5388729631668
12 0.0199300068197772 2.55880296998657
13 0.0173341241898015 2.57613709417637
14 0.0152178194839507 2.59135491366033
15 0.0134690549457446 2.60482396860607
16 0.0120066914241761 2.61683066003025
17 0.0107714419718832 2.62760210200213
18 0.00971851137001067 2.63732061337214
19 0.00881334650330245 2.64613395987544
20 0.00802958302665502 2.6541635429021
21 0.00734643696341664 2.66150997986551
22 0.00674712518230081 2.66825710504781
23 0.00621856655925512 2.67447567160707
24 0.00575009267777205 2.68022576428484
25 0.00533277518115938 2.685558539466
26 0.0049593688454479 2.69051790831145
27 0.00462393509224057 2.69514184340369
28 0.00432150613050908 2.6994633495342
29 0.00404788786545396 2.70351123739965

After 29 rounds, about 1 in 250 edges remain.

...
60 0.00101631530560553 2.76487068936694

After 60 rounds, about 1 in 1000 edges remain.
From these numbers, we see that each edge is processed on average in under 2.8 rounds. Each pass iterates over remaining edges once to count node degrees, and once more to test node degrees and possibly kill edges.
The number of random node bit accesses is thus N * 2.8 * 2^k * 2.

For a single-pass (k=0) ASIC design with external SRAM chips running on a 200 MHz interface, this takes 2^32 * 2.8 * 2^0 * 2 / 200 MHz ≈ 115 chip-seconds.
In other words, we’d need 115 separate SRAM chips to trim a graph within 1 second. If we round that up to 128, then they’d collectively need 128 * (32-7) = 3200 address IO pins to the central Cuckoo ASIC. For cuckatoo33 these numbers would double. That does seem to border on the infeasible.
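
To make the arithmetic explicit, here’s a back-of-the-envelope check in C, taking 2.7 (the cumulative fraction after 29 rounds in the table above) as the average number of rounds per edge:

    #include <stdio.h>

    int main(void) {
      double N = 4294967296.0;     /* 2^32 edges */
      double avg_rounds = 2.7;     /* cumulative fraction after 29 rounds */
      double passes = 1.0;         /* 2^k with k = 0 */
      /* two iterations per pass: one to count, one to test-and-kill */
      double accesses = N * avg_rounds * passes * 2;
      printf("chip-seconds at 200 MHz: %.0f\n", accesses / 200e6); /* ~116 */
      /* 128 chips, each holding 2^(32-7) bits => 25 address pins apiece */
      printf("address IO pins: %d\n", 128 * (32 - 7));             /* 3200 */
      return 0;
    }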

Before we turn to ASICs with no external SRAM, let’s consider the bandwidth requirements on the external 512 MB DRAM holding the edge bitmap. Each pass in each of the first 16 rounds needs to read all of it twice and write all of it once. (After 16 rounds, the edge bitmap becomes sparse enough that we might benefit from representing remaining edges as a list of indices instead.)
To trim within one second, this implies a DRAM bandwidth of 0.5 GB * 16 * 2^k * 3.

For an ASIC holding 128 MB of SRAM, we would need at least 4 passes, i.e. k=2, requiring an impressive 96 GB/s, which would require High Bandwidth Memory (HBM).
But when migrating to cuckatoo33, we need to double both the DRAM and the number of passes, which would push the bandwidth to a formidable 384 GB/s.
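
As a quick sanity check on these figures, here is the bandwidth formula evaluated for both sizes (cuckatoo33 doubles both the bitmap and the number of passes):

    #include <stdio.h>

    int main(void) {
      double bitmap_gb = 0.5;  /* cuckatoo32 edge bitmap */
      int rounds = 16;         /* dense rounds before an edge list wins */
      int accesses = 3;        /* two reads plus one write per pass */
      /* 128 MB of SRAM gives k = 2 for cuckatoo32, k = 3 for cuckatoo33 */
      printf("cuckatoo32: %g GB/s\n", bitmap_gb * rounds * (1 << 2) * accesses);
      printf("cuckatoo33: %g GB/s\n", 2 * bitmap_gb * rounds * (1 << 3) * accesses);
      return 0;  /* 96 GB/s and 384 GB/s */
    }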

While the goal of cuckatoo with the upgrade schedule is to ultimately require such huge amounts of memory bandwidth, we need to start off with manageable parameters at launch.

I wanted to rule out single-chip ASICs that can hold both the edge bitmap and node degree counters on-chip, but cuckatoo31 probably suffices for that, since even with 4 passes it requires 256 + 64 = 320 MB of on-chip memory.

I welcome any feedback on this very crude feasibility study, especially by memory technology experts.


I agree a C32 lean miner is infeasible for a single chip. Going off-chip for SRAM is a non-starter: the whole point of on-chip SRAM is the incredible bandwidth and parallelization possible. Once you hit pins, the logic-board IO becomes a choke point. You also need n log(n) wires for inter-chip communication. More importantly, your power cost scales with the number of computations, which goes like 2^k. Not to mention all the siphash units, which take up a lot of die area you’re ignoring.

It was discussed in another thread, but I think mean-based ASICs are the way to go with current parameters. I wouldn’t even look at lean unless it’s single-chip; then it becomes ASIC friendly.

C32 requires at least 640 MB SRAM, so that can be easily ruled out.
But do you agree that C31 is infeasible for a single chip?

Also, do you agree that Cuckaroo as proposed in https://forum.grin.mw/t/choice-of-asic-resistant-pow-for-gpu-miners/ prevents efficient single-chip ASICs?

320 MB of SRAM is about 75 mm^2 in 7nm, which is feasible, but you also need lots of siphash units and I don’t have a size for those.

Yes, CuckaROO should rule out lean miners altogether because of the power requirements. I can’t see it being worthwhile to recompute siphashes every round. Hashes per watt would be terrible.

Just to clarify, Grin will now have two mining algos? One ASIC-resistant and the other friendly?

What does the emission/block production look like for each? 50/50?

7nm is not really available. It exists, but production is tight and limited to established foundry partners. If your goal is to be ASIC “friendly”, I would suggest targeting 16nm or even 28nm.

Also, for a mean miner to do all u-v siphashes up-front for CuckaRoo-32, that’s 2^32 words of (32 − PARTBITS) bits each, certainly more than 2^31 * 32 bits = 8 GB. That excludes tons of GPUs, so maybe a GPU would have to calculate the UVs twice?

The currently planned Grin mainnet PoWs are:

  • for GPUs: Cuckaroo on 2^29 edges
    • Variant of Cuckoo that enforces Mean mining
    • Takes 7 GB of memory (or half with some slowdown)
    • CPUs run about 10x slower, needing only 2.5 GB
    • Tweaked every 6 months to maintain ASIC resistance
    • 90% of rewards at launch, linearly decreasing to 0 in 2 years
  • for ASICs: [UPDATED] Cuckatoo on 2^31 or more edges
    • Variant of Cuckoo that simplifies ASIC design
    • Takes 512 MB of memory, with random single-bit accesses to half
    • Takes 2^k GB after 2^(1+k) years, by phasing out smaller sizes
    • 10% of rewards at launch, linearly increasing to 100% in 2 years

Why do you mention Cuckaroo32?
The only cuckaroo planned so far is cuckaroo29 for GPUs…

OH sorry. Yes that would make sense.

Thanks for the summary! I glossed over that pesky 2^29 detail for Cuckaroo :)

Thanks for the summary as well! I can see the rationale here but is there a thread somewhere I’ve missed on it?

Did you perhaps miss
https://forum.grin.mw/t/choice-of-asic-resistant-pow-for-gpu-miners
?


You generally quote edges now instead of nodes, right? 2^29 edges is what you would previously have called Cuckoo-30 as in 2^30 nodes, right? Same size? So Cuckatoo32 means 2^33 nodes, yes?

Yes, it’s more about edges than about nodes, so I decided the original naming was a mistake. Cucka[rt]ooN now indeed has 2^N edges and 2^(N+1) nodes.

In yesterday’s governance meeting, it was decided to relax the ASIC Friendly (AF) PoW from cuckatoo32+ to cuckatoo31+, while keeping the linear growth over time. I edited the above post and added an [UPDATED] mark to reflect this.

for ASICs:

  • Takes 2^k GB after 2^(1+k) years, by phasing out smaller sizes

Does it mean that Cuckatoo will increase edges to 2^(31+k) after 2^(1+k) years?

No; after 2^k years, cuckatoo(31+k) starts getting phased out, and once the phase-out completes in 7 to 8 months, cuckatoo(31+k+1) will be the new minimum.
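
Here’s a tiny sketch of that schedule in C (treating the 7-to-8-month phase-out as 7.5 months, purely for illustration):

    #include <stdio.h>

    int main(void) {
      for (int k = 0; k < 4; k++) {
        double start = 1 << k;            /* cuckatoo(31+k) phase-out begins */
        double done  = start + 7.5 / 12;  /* ~7 to 8 months later */
        printf("year %.2f: cuckatoo%d (%g GB) becomes the minimum\n",
               done, 32 + k, (double)(1 << k));
      }
      return 0;
    }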

This isn’t very ASIC-friendly, bricking hardware designed for a smaller graph.

It’s done to ensure that the PoW remains bottlenecked by memory IO.

Early generation ASICs are likely to be obsoleted by subsequent generations, and thus have a limited shelf life anyway. One would hope that an ASIC handles at least one size up, similar to how ethash miners had to cope with an increasing DAG size.

The bricking problem will lessen over time as the size bumps are exponentially spaced…

Increasing the graph size does not change the memory-to-compute ratio at all. No matter the graph size, the number of trimmed edges to/from memory is a fixed percentage of the number of siphashes generated.

Doubling the number of siphashes approximately doubles the die area of the chip and doubles the number of DRAM chips required. Bandwidth is related to the number of DRAM chips and the number of pins you can fit on an ASIC, which is also related to die area. So you’re not really changing anything other than doubling everything evenly. It does not make things any more or less “memory bound.”

These are completely different situations. The DAG grows slowly and only requires a little planning to have enough total DRAM on board. This is cheap and easy, and the fundamental on-chip algorithm doesn’t change. You don’t have to print new chips.

For a lean-mining ASIC, doubling the graph size means you throw away your old hardware and go through the entire fabrication process again, with new chips that are 2x the size, possibly on a different process. Moving from 16nm to 7nm, for example, is not trivial! It requires considerable physical design work, even if the algorithm and RTL don’t change.

Increasing the DAG has no impact on ASICs, which need only a variable for “DAG size” in them. Increasing Cuckoo’s graph size completely bricks a lean-miner ASIC because it needs to double its SRAM for trim counters. There is absolutely no way anyone would put double the SRAM on their chip to support future graph sizes. It’s hella expensive, and those future sizes are not even possible with current fabrication processes.

Why do you care about being memory bound anyway? Requiring large amounts of bandwidth to DRAM is a technique for preventing ASICs, not encouraging them.

If you really want to be ASIC-friendly, why not just use FNV hash? It’s soooo simple; even a junior developer can write an optimized miner for it. Same for ASIC design. Simple solvers mean a low barrier to entry, which means a highly commoditized market where miners, both GPUs and ASICs, have low costs and low dev fees.