When setting the ASIC friendly PoW to be cuckatoo32+, the expectation was that ASICs would be able to search a cuckatoo32 graph within one second, so that mining remains mostly progress-free (wasting less than 1% of hashpower during the expected 1 minute block interval).
Initial feedback from ASIC designers however suggest that achieving such speed is looking very difficult, and mining devices should be able to handle cuckatoo33 as well, as cuckatoo32 is planned to be phased out after only 1 year.
I’d like to discuss two possible ASIC designs in detail to clarify the challenge,
focussing just on lean edge trimming. Assume N=2^32 and we have
N/2^k bits of SRAM for node degree counting.
Each round of trimming consists of 2^k passes to cover all nodes on either side of the graph. For full speed we would like k=0.
The fraction of remaining edges shrinks as follows in the first 29 rounds:
i fraction cumulative fraction
0 1 1
1 0.632108942954801 1.6321089429548
2 0.296168498345651 1.92827744130045
3 0.175267872633412 2.10354531393386
4 0.116738814278506 2.22028412821237
5 0.0836605401709676 2.30394466838334
6 0.0630388165591285 2.36698348494247
7 0.0492773308651522 2.41626081580762
8 0.0396187112201005 2.45587952702772
9 0.0325690492754802 2.4884485763032
10 0.0272613512352109 2.51570992753841
11 0.0231630356283858 2.5388729631668
12 0.0199300068197772 2.55880296998657
13 0.0173341241898015 2.57613709417637
14 0.0152178194839507 2.59135491366033
15 0.0134690549457446 2.60482396860607
16 0.0120066914241761 2.61683066003025
17 0.0107714419718832 2.62760210200213
18 0.00971851137001067 2.63732061337214
19 0.00881334650330245 2.64613395987544
20 0.00802958302665502 2.6541635429021
21 0.00734643696341664 2.66150997986551
22 0.00674712518230081 2.66825710504781
23 0.00621856655925512 2.67447567160707
24 0.00575009267777205 2.68022576428484
25 0.00533277518115938 2.685558539466
26 0.0049593688454479 2.69051790831145
27 0.00462393509224057 2.69514184340369
28 0.00432150613050908 2.6994633495342
29 0.00404788786545396 2.70351123739965
After 29 rounds, about 1 in 250 edges remain.
...
60 0.00101631530560553 2.76487068936694
After 60 rounds, about 1 in 1000 edges remain.
From these numbers, we see that each edge is processed on average in under 2.8 rounds. Each pass iterates over remaining edges once to count node degrees, and once more to test node degrees and possibly kill edges.
The number of random node bit accesses is thus N * 2.8 * 2^k * 2.
For a single pass (k=0) ASIC design with external SRAM chips running on a 200 MhZ interface, this takes 2^32 * 2.8 * 2^0 * 2 / 200 Mhz = 115 chip seconds.
Or in other words, we’d need 115 separate SRAM chips to trim a graph within 1 second. If we round that up to 128, then they’d collectively need 128 * (32-7) = 3200 address IO pins to the central Cuckoo ASIC. For cuckatoo33 these numbers would double. That does seem to border on the infeasible.
Before we turn to ASICs with no external SRAM, let’s consider the bandwidth requirements on the external 512 MB DRAM holding the edge bitmap. Each pass in each of the first 16 rounds needs to read all of it twice and write all of it once. (After 16 rounds, the edge bitmap becomes sparse enough that we might benefit from representing remaining edges as a list of indices instead.)
This implies a DRAM bandwidth of 0.5 GB * 16 * 2^k * 3.
For an ASIC holding 128MB of SRAM, we would need at least 4 passes, i.e. k=2, requiring an impressive 96 GB/s, which would require High Bandwidth memory.
But when migrating to cuckatoo33, we need to double both the DRAM and the number of passes, which would push the bandwidth to a formidable 386 GB / s.
While the goal of cuckatoo with the upgrade schedule is to ultimately require such huge amounts of memory bandwidth, we need to start off with manageable parameters at launch.
I wanted to rule out single-chip ASICs that can hold both the edge bitmap and node degree counters on-chip, but cuckatoo31 probably suffice for that, since even with 4 passes that requires 256 + 64 = 320 MB of on-chip memory.
I welcome any feedback on this very crude feasibility study, especially by memory technology experts.