As most of you know by now, Grin will feature two separate PoWs in its first 2 years.
The “primary” one is Cuckatoo32+, a variant of Cuckoo Cycle designed to simplify ASIC design, as explained in https://forum.grin.mw/t/cuckoo-cycle-weakness-and-possible-fix-cuckatoo-cycle
This will start out getting only 10% of the reward at launch, linearly climbing up to 100% after 2 years. At that time there will hopefully be ASIC designs from multiple manufacturers engaged in a healthy competition. This PoW is ASIC Friendly (AF) in the sense of allowing huge efficiency improvements over a GPU by using hundreds of MB of SRAM, while remaining bottlenecked by memory IO.
Complementing this AF PoW will be an AR (ASIC Resistant) one, to be called Cuckaroo, aimed at GPU miners. That one will start out getting 90% of rewards at launch, dropping linearly over the course of 2 years down to nothing. ASIC resistance will mainly come from scheduled hard-forks every 6 months (Monero style), and to a lesser degree from also being bottlenecked by memory IO.
The mean solver for Cuckoo on 2^29 edges is quite suitable for GPUs, being based on the sorting of multiple GBs of data (somewhat similar to Equihash 144,5). But allowing lean solving is a big risk: lean solving needs only about a bit per edge and per node, so at this size the required bitmaps fit within the SRAM of a single-chip ASIC, which would avoid the memory IO bottleneck and enjoy a huge efficiency advantage. Previously we explored ways of penalizing lean mining in https://forum.grin.mw/t/aes-based-hash-function-for-graph-generation-in-cuckoo-pow
Thanks to a suggestion by Igno, we have since found a much better approach to preventing lean mining. We still verify a 42-cycle on the Cuckoo graph, but we compute edge endpoints differently. Let HASH be a keyed hash function that transforms a suitably large state. We assume that the key commits to a header/nonce combination. Let K be some power of 2.
Then, to compute the endpoints of edges m*K+n, 0 <= n < K, we compute

```
State_0     = m
State_{i+1} = HASH(State_i)   for i = 0..K-1
EdgeState_n = State_{n+1} XOR (n+1 == K ? 0 : State_K)
Edge_n      = extract 2 endpoints from EdgeState_n
```
The idea is that a long sequential computation is required to compute the endpoints of a whole block of edges. The mean solver would do this only once and store all results, one 64-bit int per edge. Thereafter, the remaining trimming rounds and cycle finding would be exactly as in plain Cuckoo, but recovery of the edge indices of any 42-cycle found would again use the blocked edge generation. Altogether this requires limited changes to our already reasonably optimized mean CUDA miner.
Meanwhile, the lean solver would be forced to repeat these computations for every phase of every trimming round, and thus end up with a large latency and energy penalty, leaving it inferior to mean mining.
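To make this more concrete, here is a minimal C++ sketch of the block rule. Details that are not yet fixed (the state layout, how the key and block number m enter State_0, and how the two endpoints are extracted) are filled in with assumptions purely for illustration, and the hash function is just a toy stand-in for the real keyed HASH.

```cpp
// Minimal sketch of the blocked edge generation (C++17).
#include <array>
#include <cstdint>
#include <vector>

constexpr int EDGEBITS = 29;                       // 2^29 edges, as above
constexpr uint64_t EDGEMASK = (1ULL << EDGEBITS) - 1;
constexpr uint64_t K = 64;                         // edges per block, a power of 2

using State = std::array<uint64_t, 4>;             // "suitably large": 256 bits here

// Toy placeholder permutation standing in for the keyed HASH
// (siphash24 or 4-round AES-128; see the sketches further down).
State hash_state(const State &s) {
  State t = s;
  for (auto &w : t) {                              // word-wise mixing, NOT the real HASH
    w ^= w >> 30; w *= 0xbf58476d1ce4e5b9ULL;
    w ^= w >> 27; w *= 0x94d049bb133111ebULL;
    w ^= w >> 31;
  }
  return t;
}

struct Edge { uint32_t u, v; };

// Endpoints of edges m*K+n for 0 <= n < K, per the block rule above.
std::vector<Edge> block_edges(const State &key, uint64_t m) {
  std::vector<State> st(K + 1);
  st[0] = key;                                     // key commits to header/nonce
  st[0][0] ^= m;                                   // State_0 = m, mixed into the keyed state
  for (uint64_t i = 0; i < K; i++)                 // long sequential chain of K hashes
    st[i + 1] = hash_state(st[i]);
  std::vector<Edge> edges(K);
  for (uint64_t n = 0; n < K; n++) {
    State es = st[n + 1];                          // EdgeState_n = State_{n+1} XOR State_K
    if (n + 1 != K)                                //   (the last one XORs with 0 instead)
      for (int w = 0; w < 4; w++) es[w] ^= st[K][w];
    edges[n] = { uint32_t(es[0] & EDGEMASK),       // extract the 2 endpoints
                 uint32_t((es[0] >> 32) & EDGEMASK) };
  }
  return edges;
}
```

A mean solver would call block_edges once per block and store all K results, while a lean solver would have to redo the whole K-long chain every time it revisits an edge's endpoints.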
Possible settings are for example K = 64 and HASH = siphash24, requiring a total of 64 * (2+4) = 384 siphash rounds per block.
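For concreteness, under the (still tentative) reading that each HASH call runs the 2 compression plus 4 finalization rounds of siphash-2-4 over the full 256-bit state, the chain step could look like this:

```cpp
// HASH(State_i) instantiated as the 2+4 = 6 rounds of siphash-2-4 over the
// same 256-bit state as in the sketch above; 64 calls per block give the
// 64 * (2+4) = 384 siphash rounds mentioned.
#include <array>
#include <cstdint>

using State = std::array<uint64_t, 4>;

static inline uint64_t rotl64(uint64_t x, int b) { return (x << b) | (x >> (64 - b)); }

// The standard siphash round, acting on the four 64-bit state words.
static inline void sipround(uint64_t &v0, uint64_t &v1, uint64_t &v2, uint64_t &v3) {
  v0 += v1; v1 = rotl64(v1, 13); v1 ^= v0; v0 = rotl64(v0, 32);
  v2 += v3; v3 = rotl64(v3, 16); v3 ^= v2;
  v0 += v3; v3 = rotl64(v3, 21); v3 ^= v0;
  v2 += v1; v1 = rotl64(v1, 17); v1 ^= v2; v2 = rotl64(v2, 32);
}

// Drop-in replacement for the toy hash_state in the sketch above.
State hash_state(const State &s) {
  State t = s;
  for (int r = 0; r < 2 + 4; r++)
    sipround(t[0], t[1], t[2], t[3]);
  return t;
}
```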
Another option is HASH = 4-round AES-128, which is a little slower for GPUs but much faster on common CPUs. Compared to siphash24, this reduces the state size from 256 to 128 bits, which could weaken security.
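And a similarly illustrative sketch of the AES option, leaving the round-key derivation and exact state layout open:

```cpp
// HASH(State_i) as four AES rounds via AES-NI over a 128-bit state, with
// round keys derived somehow from the header/nonce key; purely illustrative.
// Compile with -maes on x86-64.
#include <immintrin.h>

__m128i hash_state_aes4(__m128i s, const __m128i roundkey[4]) {
  for (int r = 0; r < 4; r++)
    s = _mm_aesenc_si128(s, roundkey[r]);          // one full AES round per call
  return s;
}
```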
So those are our plans for the AR PoW, with some details yet to be decided.
Please let us know what you think…