AES-based hash function for graph generation in Cuckoo PoW

I’m starting this topic to summarize the ideas from our discussion with Photon and to have a public place where anyone interested in the topic can share his ideas.

1.) My main motivation is that modern CPUs already have hardware implementation of the AES encryption rounds (so called Intel AES-NI instruction set; supported by most desktop and server CPUs since ca. 2010. ARM CPUs found in mobile devices have similar instruction set). You can see it as an AES ASIC of sorts.

Compared to AES, the current hash function used to generate the cuckoo graph, Siphash-2-4 doesn’t have hardware implementation on common CPUs and its not expected that there would be one in the near future (if ever).

2.) Which AES scheme? AES-NI provides HW implementation of AES-128 encryption round. The standard prescribes 10 rounds with specific round keys generated from the 128 bit wide AES key. There are few open questions here.

  • Since we currently use 256 bit key to encrypt each vertex index in the graph (to generate each of the directed edges), how do we convert it to the 128 bit key used by AES-128?

  • How many rounds are necessary for our purposes? According to the AES design proposal, (128 bit block/key length): “Two rounds of Rijndael provide “full diffusion” in the following sense: every state bit depends on all state bits two rounds ago, or, a change in one state bit is likely to affect half of the state bits after two rounds.”

3.) What is the possible gain? My first experiments showed roughly halved time with 7 AES encryption rounds on the lean miner compared to Siphash on a CPU where we have the advantage.

Photon has been testing on GPU using highly optimized software AES code (from a Cryptonight miner) and found 70 ms time to generate edges using 4 AES rounds vs 60 ms for siphash-2-4. Increasing the number of rounds gives linear increase in the time needed – 102 ms for 6 rounds and 170 ms for 10 rounds. So AES with lower rounds count is competitive even on GPUs.

That’s all that comes to my mind so far. Your questions and ideas are welcome!

I am actually curious what attack vectors does potentially “weaker” edge generator present. Assuming the conversion from 256 bit header hash to 2^N graph is uniform and statistically indistinguishable from siphash24. There are 2^256-ish possible graphs in both cases. I remember John Tromp mentioning that zcash did not like siphash and wished to see sha or something in its place.

I saw you linking https://github.com/tevador/RandomJS yesterday. I think the parallelism in single chip asic could be significantly limited by creating a unique per-graph function that would change siphash24/AES message input from just the nonce to nonce + result of pseudo-randomly generated function. Where the function is generated from sipkeys and consist of different fp, int operations of various precision (32,64) commonly found on cpu, gpu. Does not need many instructions executed, just the instruction pool needs to be big enough.

so that instead of

edge = sipnode(i, 0),
       sipnode(i, 1)

it would become something like

edge = sipnode(i | (rndFn(i, gKey) << EDGEBITS), 0),
       sipnode(i | (rndFn(i, gKey) << EDGEBITS), 1)

If conditional operations, short loops or lookups are included, it may exponentially hurt lean asic in two ways. First, each pipeline now needs to have a CPU/FPU and second, it becomes impossible to feed an edge into the pipepile every clock cycle. The obvious weakness is that miner can brute-force header nonces to search for easy to execute program, but this could be preventable.

But I fully understand this also goes slightly against DRAM asics that are not so bad and increases the complexity while reducing validation speed. Just throwing it in.

I find the idea below with recent blocks/kernels very interesting as well, assuming the required input would grow to order of GBs quickly, somehow. I think asic life-cycle is something like: stealth “testing” for 3 months, sell, make new generation and repeat so the chip could be a throw-away, highly profitable, design from the start (and a brick after a year).

1 Like

I think a more interesting variation on computing edge endpoints would be as follows:

Currently the u-endpoint of the i’th edge is u_i = siphash(i | 0) % NEDGES.
We could instead take it to be U_i = Cyclehash(u_i’th most recent block) % NEDGES.
Similarly for the v-endpoint v_i = siphash(i | 1) % NEDGES.

This means lean mining no longer enjoys a huge memory savings, and ASICS are forced to use many GBs of memory.

Unfortunately, it will take a millennium to grow the blockchain to 2^29 blocks.
One could also take the u_i’th most recent kernel. Once our blocks fill up and we have a thousand kernels per block, we reach the required entropy in about a year.

Another option is to take U_i = siphash(u_i | (u_i % height)'th most recent header)) % NEDGES, which requires (linearly growing) storage of all headers.

1 Like

At first sight, it appears that the desire to mix in GBs worth of Grin history into the mining process conflicts with the desire for simple verification. We cannot expect light clients to maintain much data beyond a month or so worth of headers, which is perhaps 16 MBs worth of data.

There are at least two ways to bridge this gap. One is to use the kernel/outputs/rangeproofs data that is committed to in the headers, and to augment the proof of work with Merkle proofs for the data used for each of the 42 edge indices. This is rather costly though, as a single Merkle proof is easily 640 bytes, so 42 of them is over 26 KB, way beyond our header size.

Perhaps a better approach is to take inspiration from Ethereum’s ethash. Our limited header data is like a cache C which we could expand into a much larger full dataset D. The expansion must be slow enough that miners will want to store the full dataset in memory but fast enough for verification with just C.

Let H be a hash function that is dependent on the last cyclehash but independent of the nonces tried by miners (perhaps simply siphash with the last cyclehash as sipkeys). Then D[i] could be the result of K iterations of mapping x to H(C[x mod Csize] | i), with K neither too small nor too large (somewhere between 64 and 4096).

Can you write a pseudo-code how would single edge generation look like when working directly from dataset C? Essentially what would pow verifier or single-chip miner need to do each time.

u64 sipnode(siphash_keys *keys, edge_t edge, u32 uorv) {
  u64 Di = siphash24(keys, 2*edge+uorv) % DSize;
  return siphash24(keys, readD(Di)) & EDGEMASK;
}

u64 readD(u64 i) {
  u64 x = 0;
  for (int k=0; k<K; k++)
     x = siphash24(cyclehash, C[(x+i) % CSize]);
   return x;
}

The miner would only compute readD once for each of the DSize possible inputs and store the results in an array.

So there would now need to be p*K additional sipblocks where p is chip paralellism. And each block is also stalled by a random SRAM read from C (at least for few cycles). With large enough K, C and D, that could be a lean death.

return siphash24(keys, readD(Di)) & EDGEMASK;

Shouldn’t be readD(Di) combined again with edge index when DSize is smaller than total number of edges?

The random access to recomputed D would be harsh on gpu miner too. To the point where doing it only once is a must. But storing the whole edge from the start takes all the memory on 8GB cards already (for cuckoo30).

With large enough K, C and D, that could be a lean death.

Yeah, the whole point of this is to kill lean mining.

when DSize is smaller than total number of edges?

I was assuming DSize == NEDGES.
Unfortunately, this whole approach is fatally flawed. If two edges e1, and e2, share an endpoint in the current graph (have the same Di), then they will also in this new graph.
So one can just keep looking for cycles in the old graph and ignore D :frowning:

Here an attempt to fix the above flaw. I chose to skip the random access to array D,
since there’s no need to introduce more latenc, and to allow a vairable number of 64-bit arguments to siphash24. I’ll have to think more about whether some of those can be combined to achieve the same effect. Also, the D lookups are applied on one side of the graph only, to minimize the slowdown:

u64 sipnode(siphash_keys *keys, edge_t edge, u32 uorv) {
  return (uorv
     ? siphash24(keys, readD(edge), edge)
     : siphash24(keys, edge)
   ) & EDGEMASK;

}

u64 readD(u64 i) {
  u64 x = i;
  for (int k=0; k<K; k++)
     x = siphash24(cyclehash, C[x % CSize], x, i);
   return x;
}

On second thought, serial access to D is no good either, since a bunch of Cuckoo Cycle ASICs could then trivially share the use of a single copy of D. And I’m not sure about single sided lookups. So the standard mix-in approach looks like

u64 sipnode(siphash_keys *keys, edge_t edge, u32 uorv) {
  u64 diredge = 2 * edge + uorv;
  u64 node = siphash24(keys, diredge) & EDGEMASK;
  return siphash24(keys, readD(node), diredge)) & EDGEMASK;
}

But as noted by photon, the random lookups will hurt GPU performance a lot.
To soften the blow, D could be partitioned into 128 or 256 byte segments, and
every 32 consecutive edges could mix in parts of the same random segment.

I did some simple tests in CUDA and D generation time can swing from 500ms to 50sec (1GB/32KB/128 to 1GB/16MB/1024 [D/C/K]) on 1080Ti. With 1 minute block time, this should stay low even on slow devices, but remain complex enough to deter lean. Maybe keep D valid for more than one block and/or stop at (height - 4*)?

Another idea is to use K AES rounds in D array (to honor the thread title :slight_smile: ) loop instead of siphash and prepare the next D array in advance on CPU as we have (4*) unused headers in queue so we can pre-make D for future (4*) blocks.

What if reads are sequential, but with an offset?

siphash24(keys, readD(edge), edge) ->

siphash24(keys, readD( (edge+HEADER_NONCE_HASH) % NEDGES ), edge)

Edit: the above would not work if miner can brute force header nonce hashes with small offsets?

Considering the initiative on major rewrite of cuckoo PoW, now might be a good time to design a custom hash function for our needs (not necessary AES-based).

Another thought: If we can modify cuckoo to reduce the computation time relative to memory usage (by solving a simpler problem), then we can have a larger graph within the current time limits…

I think current proposals are reasonably minor. All the bounties and claims still hold because it only adds to siphash input. Also depends on final algorithm, because with some proposed additions, faster hash function would not make a large difference overall.

Thinking more about it, with large C around 16 MB+, one might as well use the entire cache line in L1 and make K smaller.

As shown here:

Edit:

Original version was flawed. To get below 5s on i7, I have to expand 4MB -> 1GB with K=64.

Continuation from gitter, just another crazy idea.

“One more I can think of is to start with highly asic resistant cuckoo, but gradually (based on height) remove extra bits (and/or instructions) that come into siphash24 until at some point zero is added and we are left with pure cuckoo30 again.”

Lean miner like that needs to read 64 bytes per edge initially (up from one bit in stock cuckoo). First 32B to set counters and then again to read counters and trim edges. Compared to GPU that reads/writes approx. 24 bytes per edge after the initial lookup. Eventually this goes down more and more until asic need to read same amount of data, but is otherwise much more effective. Asic maker needs to strategically plan optimal time to deploy lean miner.

To make it work on 8GB GPU and leverage their architecture, only one edge endpoint could do this and the other might execute short program with 16KB of chain data lookup. This simple measure prevents asic from doubling its speed for free. Number of instructions would slowly decrease to zero, turning both endpoints to pure cuckoo for easy verification once chain gets larger and more popular.

EDIT: Instead of a random program on one side of the edge, %1MB and %16KB D lookups can be used instead.

One can also choose to mix in chain data only to a fraction of edges. For example, with only 1/8 of edges affected, the odds of a 42 cycle not having an affected edge is (1-1/8)^42 = .00366723
So in order not to miss the vast majority of 42-cycles, an ASIC would still need to mix in the chain data. The fraction cannot be much reduced though; with 1/32 of edges affected, an ASIC just ignoring them will still find over a quarter of all 42-cycles.

Is this to have faster block validation or to preserve current mining speeds?

In case it is the latter, something like this appers to slow down trimming only by 20ms. The extra L2 and L1 lookups are for free on CPU/GPU, but would not be free on lean asic.

U = siphash24(nce,0)

if   (U % 8 == 0) X = D [U>>3 % 1GB]
elif (U % 4 == 0) X = D [U>>3 % 1MB]
else              X = D [U>>3 % 16KB]

V = siphash(nce,X,1)

It’s mostly to avoid a big slowdown. The verification speedup is just a bonus.
Btw, it may be better to affect each endpoint with some probability. With 84 endpoints, we could lower the probability to 1/16, since (1-1/16)^-84 > 226,
enough deterrent for ignoring affected endpoints.

So perhaps something like this:

u64 sipnode(siphash_keys *keys, edge_t edge, u32 uorv) {
  u64 diredge = 2 * edge + uorv;
  u64 node = siphash24(keys, diredge);
  if (node % 16)
     return node/16 & EDGEMASK;
  return siphash24(keys, readD(node/16), diredge)) & EDGEMASK;
}

Since the gradual version is being considered as one of few possible anti-asic actions, here is my complete version that takes ideas from all above posts. Meant as a template only.

C chain data buffer

Recalculate C=>D every 1440 blocks?
Size: 8MB

D lookup buffer

Size: 2GB
Use 128 bit AES with 512 bit C input blocks in K loops
Use 256 bits of D for each siphash input

D mixing

Reduce required D data input gradually and automatically (no forks) from block 0 to block 1 000 000.
Every D lookup fetches 256 bits (two AES outputs) from D.

Start with probabilities:

V nodes:
256/256 D % 2GB // 256 of every 256 edges do D lookup
U nodes
256/256 V&1 ? D % 16KB : D % 1MB

A year in:

V nodes:
128/256 D % 2GB // 128 of every 256 edges do D lookup (on average)
U nodes
128/256 V&1 ? D % 16KB : D % 1MB

Two or so years in - pure cuckoo:

V nodes:
0/256
U nodes
0/256

Verification

Pools and chain sync would keep first 1MB of D in CPU cache as well as 8MB of C. HW AES allows reasonably fast verification. As the D does not contribute to PoW security and does not stay unchanged for long time - like ethash - it does not need to use slow stuff like SHA, in theory.