Choice of ASIC Resistant PoW for GPU miners

As most of you know by now, Grin will feature two separate PoW in its first 2 years.

The “primary” one is Cuckatoo32+, a variant of Cuckoo Cycle designed to simplify ASIC design, as explained in https://forum.grin.mw/t/cuckoo-cycle-weakness-and-possible-fix-cuckatoo-cycle

This will start out getting only 10% of the reward at launch, linearly climbing up to 100% after 2 years. At that time there will hopefully be ASIC designs from multiple manufacturers engaged in a healthy competition. This PoW is ASIC Friendly (AF) in the sense of allowing huge efficiency improvements over a GPU by using hundreds of MB of SRAM, while remaining bottlenecked by memory IO.

Complementing this AF PoW will be an AR (ASIC Resistant) one, to be called Cuckaroo, aimed at GPU miners. That one will start out getting 90% of rewards at launch, dropping linearly over the course of 2 years down to nothing. ASIC resistance will mainly come from scheduled hard-forks every 6 months (Monero style), and to a lesser degree from also being bottlenecked by memory IO.

The mean solver for Cuckoo on 2^29 edges is quite suitable for GPUs, based on the sorting of multiple GBs of data (somewhat similar to Equihash 144,5). But allowing lean solving is a big risk as this size is small enough to be mined by a single-chip ASIC, which would avoid the memory IO bottleneck and enjoy a huge efficiency advantage. Previously we explored ways of penalizing lean mining in https://forum.grin.mw/t/aes-based-hash-function-for-graph-generation-in-cuckoo-pow

Thanks to a suggestion by Igno, we since found a much better approach to prevent lean mining. We still verify a 42-cycle on the Cuckoo graph, but we compute edge endpoints differently. Let HASH be a keyed hash function that transforms a suitably large state. We assume that the key commits to a header/nonce combination. Let K be some 2 power.

Then, to compute the endpoints of edges m*K+n, 0 <= n < K, we compute

State_0 = m
State_{i+1} = HASH(State_i) for i=0...K-1
EdgeState_n = State_{n+1} XOR ( n+1 == K ? 0 : State_K )
Edge_n = extract 2 endpoints from EdgeState_n

The idea is that a long sequential computation is required to compute the endpoints of a whole block of edges. The mean solver would do this only once and store all results, one 64-bit int per edge. Thereafter, remaining trimming rounds and cycle-finding would be exactly as in plain Cuckoo, but recovery of edge indices of any 42-cycle found would again use the blocked edge generation. Altogether this requires limited changes to our already reasonably optimized mean CUDA miner.

Meanwhile, the lean solver would be forced to repeat these computations for every phase of every trimming round, and thus end up with a large latency and energy penalty, leaving it inferior to mean mining.

Possible settings are for example K = 64, HASH = siphash24,
requiring a total of 64 * (2+4) = 384 siphash rounds per block.

Another option is HASH = 4-round-AES-128, which is a little slower for GPUs, but much faster for common CPUs. Compared to siphash24, this reduces the state size from 256 to 128 bits, which could weaken security.

So that’s our plans for the AR PoW, with some details yet to be decided.

Please let us know what you think…

5 Likes

I really like the idea, I dismissed the full sequential computation because of slow verification, but doing it in blocks solves the problem.

So an ASIC would need to unroll HASH up to K times in each pipeline to retain full speed if I understand it correctly. With well selected parameters, this could be tough to manufacture.

The only improvement I can think of is to spice up HASH function with bit of math that takes lots of transistors to implement, but still executes quickly on CPU/GPU. Inspiration could be recently released monero cryptonight v2 where they use 64bit integer division and integer square root in the AES loop to introduce computation latency to the loop (and they also do other stuff like using full cache lines).

tromp where do you get your coding dialect(for lack of a better word) from? Its rather strange to me. What do “?” and “:” mean without a context?

I’m guessing it takes a bool from the left to decide between the two statements; but thats a guess; even if I’m right you probably shouldn’t be using it for pseudo code

So an ASIC would need to unroll HASH up to K times in each pipeline to retain full speed if I understand it correctly.

I don’t think so, I would guess graph generation and edge trimming would be waterfalled anyway.

Assuming its still doesn’t hold a candle to edge trimming time, they could probably have a dinky little chip generating graphs rather slow and cheaply, for a few dozen edge trimming pipelines and just have a warm up time of a few minutes.

I don’t know what your waterfall means, but if it means that it would do first edge in each block for 500 blocks in parallel and then second etc, then it would run just as fast as normal cuckoo initially and stay at that speed until last trim.

So instead of taking approx 3 units of time total it would need to run 1*N_ROUNDS units of time. With 150 trimming rounds, it would be 50 times slower compared to normal cuckoo asic.

Not sure if additional horizontal (block level) pipelining could be a thing. That would make the chip fast again. Inserting the DIV-SQRT combo could act as a safeguard. I would not risk it for permanent AR PoW, but for hard-forking one the complexity is probably too high for anyone to bother with it.

It comes from such obscure programming languages as C / C++ / C# / Java / Perl / PHP / Ruby …
b ? x : y is equivalent to if b then x else y.

2 Likes

I have comments about ASIC resistance but is this the best place or GitHub?

IMO Lean Miner was never viable for ASIC’s even before the Cuckatoo change. You may be overestimating the SRAM available to a single chip, even at 7nm, but the real killer is power. Even in the Cuckoo formulation, the power requirements of lean miner make it a non-starter.

Right now, mean miner is the only viable published approach, and I would expect to see mean miner based ASICs, making Grin ASIC’s look most like Equihash ASICs: DRAM bound with smallish chip dies produced at 16nm not 7. Most of the chip will be GDDR IO not siphash.

IMO neither PoW is ASIC “friendly”

Also about the ASIC market:

ASIC economics involve a large upfront cost called the NRE (nonrecurring engineering) plus a very small per-unit cost. This means that ASIC production is more like a step function, where the projected revenue needs to support a certain volume of unit production before the NRE is covered and break-even is achieved.

By ramping up from 10%, you are ensuring that the large ASIC producers are first to enter the market. When the total revenue from mining is small, the volume of chips produced in the first batch must be very high in order to cover the NRE. If you have a much bigger revenue base, then the the NRE is relatively small relative to revenue, and unit volume is less of a factor. A large market has room for smaller players but a small market does not.

Basically, Bitmain can afford the NRE even when the mining is only 10%, because they have deep pockets and can write-off the NRE at a loss if necessary. They can also produce very large batches to amortize the NRE and gain a low per-unit net cost. If the market is small because of the 10% number, the NRE is a bigger factor, favoring large batches and big capital investments.

An ASIC startup like ours cannot do this. It’s basically impossible for me to pitch investors on giving us millions of dollars when the upside is very limited at Genesis and ramps up slowly. No investor wants that. They’ll prefer to wait until after launch, see what things look like, see the other ASICs on the market from big companies, then maybe take a chance on us. If the revenues from mining were 100% from the start, it would be much easier for small companies like ours to raise the necessary capital, and to produce a small batch size that that still covers the NRE.

Basically, your ramp-up has the opposite of its intended effect, unfortunately.

Signed,
A startup that can’t enter the Grin ASIC market because of the economics

1 Like

We know from existing Equihash miners they have at least 144 MB available, in a 10nm process. So 256 MB looks within the realm of possibilities for 7nm.

So are you considering entering the SHA256 ASIC market?
Or the BEAM ASIC market, for which you can be the first ASIC on the scene?

This could me coming from a stance of just learning about GRIN, but I’m not sure I’m fully grasping the concept here. I understand the thought process that at some point ASIC resistance becomes a losing game, but having a set time frame where you expect the network to move from GPU to ASIC seems a bit risky since you’re assuming that it will always be possible for both GPU and ASIC miners to be profitable and it’s possible to control that inflection point.

What if it’s not profitable for ASIC vendors to design a GRIN Miner (based on timolson’s comments it seems like that’s a possibility) and after 2 years the GPU miners are pushed out, leaving the network with low hashing power? Also, do you run the risk of there being some point where it’s not profitable for either GPU or ASIC miners to mine GRIN, what do you do then?

If you truly want the network to have ASIC resistance over a 2 year period, wouldn’t it make more sense to scrap the concept of an ASIC friendly algo and go with the ASIC resistant algo only? Change it 3 times at 6 month intervals, and then just leave it alone from 18 months on. At that point if it’s profitable for a GRIN ASIC to be developed, they’ll appear organically and slowly take over the network, but if it’s not, you maintain the GPU miners?

1 Like

For some reason I was thinking it required 512MB. One bit for liveness + one bit for a counter means 2^32 Mbits for Cuckatoo-32, which is an enormous 110mm2 in 7nm. Am I missing a factor of two?

256MB is still a stretch because 7nm is not super-reliable yet and the error rate on that size SRAM is quite high. Maybe this year it will become reasonable, which is in your timeline I guess.

The real problem is power. Lean miner takes wayyyy more power per hash than mean miner, so no one should use that approach in an ASIC, or on any platform for that matter.

My prediction is for mean miner based ASICs only, produced by the usual suspects of Bitmain, Halong, Bikal, but unfortunately not us, not at first. Like I said, we can’t compete when the money bucket is small, because the NRE is large.

1 Like

No offense but who are you - which startup do you represent ?

What is your stance on Obelisk’s launchpad approach to the NRE costs ?

I’ve been a serial entrepreneur for a long time and entered cryptocurrency in 2013. My crypto-related startups include Epic Scale which was the largest Monero miner by 2015, Hedgy for which I’ll soon receive a blockchain patent, and Zeropond which had one of the first Equihash miners and operated a crowdfunded GPU farm for Zcash.
https://linkedin.com/in/olsontim

At altASIC, we’ve open sourced a CryptoNight ASIC design that’s 5x better than the Bitmain X3.

By way of disclosure, I have personal anger at Obelisk for ruining our Decred ASIC crowdfund. They either lied about their performance numbers, or they were utterly incompetent at building an ASIC miner. I’m pretty sure it’s the latter. I think David Vorick is basically honest, but he subcontracted everything, which is a disaster.

Launchpad is a way for Obelisk to guarantee themselves an ASIC monopoly while still being incompetent at building miners. I thought we all didn’t want ASIC monopolies? Especially from manufacturers who are terrible at building ASIC’s! Why would you trust David Vorick any more than Jihan Wu? Absolute power corrupts absolutely. At least Bitmain can build a product which does what it says. If anything, Obelisk has only proven they can’t be trusted.

I would be happy to compete on merit against any ASIC company in the world. Clearly Obelisk is afraid of competition. In the Decred ASIC fight, we were bested by PR bullshit. I tried to argue we had technical talent in-house and that Obelisk was just subcontracting. Regular consumers don’t understand the problem with that, and someone actually said to me “But David Vorick was in Forbes,” as if that means anything about his (in)ability to produce an ASIC miner.

We were attracted to Grin because Cuckoo Cycle is quite complex, and that leaves plenty of room for smart miner writers to find hidden optimizations. As a startup, we need some kind of edge to get going. See my article for comments about complex PoW’s:

2 Likes

So your investors calculated that at launch 10% reward, unit ROI would be so long that it would not sell well to general public. So at what percentage of block reward your investors think that unit ROI would be so attractive it would sell like cakes and pay for NRE? Also what do they think AF hashrate and grin-USD price will be 3 month into launch, because it would seem you need those variables.

You can never predict hashrate, but it’s a really good bet that hashrate only goes up. That’s enough information to want in at the beginning and not later. If the beginning is only 10% ASIC, by the time we get to 100% is the hashrate only 10x? Zcash hashrate was 10-20 MH in the first few months, and two years later it’s 1000x that number. We may assume an ASIC launched at genesis is more profitable per-unit even at 10% than an ASIC launched 2 years later. If 10% at genesis is not attractive enough to cover the NRE, then what do you expect later on?

Raising money for a startup is already hard because of the many unknowns and risks around a coin launch. Reducing the potential early revenue by 10x just adds insult to injury. If an investor had any doubt, then simply quoting the 10% number quickly pushes them to “no” There is no risk for an investor to wait and see.

Sigh. When we’re not releasing the AR PoW fast enough, you complain because you don’t have enough time to write an optimized GPU solver. When we start talking about it, you complain nothing is ASIC-friendly enough, the share is too small, and now you want to build an ASIC. And then you complain that while Cuckoo is good for you because you think you can optimize it better, no one else can be trusted.

Obviously you don’t care about grin, you only care about what you can get out of it. All I read in your posts is bikeshedding, resentment and posturing. So why should we give you any further attention?

6 Likes

Thanks for releasing the AR PoW, and thanks for making it a Cuckoo Variant. It’s a smart choice that will make miner development much easier. I’ll probably enter a GPU miner in the race.

Please don’t be upset just because what I have to say about ASICs contradicts what you thought to be true. I’m offering insight from a position of knowledge and experience in the ASIC miner market. You can take it or leave it. I only hope the additional information helps you guide the project.

I do care about Grin and love Mimblewimble. Throughout my career I have only chosen projects I believe in: Monero, Zcash, and Decred so far. Why would I put effort into a shitcoin I just sell off? It’s absolutely fantastic that you’re putting all this effort into implementation and having a fair launch. Thank you.

FWIW, I’ve just spent a few weeks also harassing the Monero PoW team about RandomJS. They didn’t like to hear what I had to say either, but based on my feedback they did adjust their PoW to be somewhat more ASIC resistant. Revealing my tricks was directly against my self-interest, but I care about Monero.

If you have any questions about ASIC development for Cuckoo, feel free to ask me here or via email to tim altasic.com. Otherwise I’ll just shut up. Didn’t mean to fluster you.

3 Likes

P.S. I only commented on Obelisk and the 10% share because I was directly asked about those topics. I’ve tried to keep everything else concise and relevant.

1 Like

You can make do with 1/2^k as much SRAM by doing 2^k times more passes over all edges. See the PART_BITS setting in the lean miner.

So that becomes 1 liveness bit + 2^-k node adjacency counter bits.

Are you saying that 2^30 random 1-bit accesses to 64 MB of SRAM takes way more power than 2^31 mostly sequential 32-bit accesses to 7 GB of DRAM?

I was considering the “full-table” case, not Andersen’s TMTO. The TMTO increases power-per-hash by roughly the partition factor, and since lean miner is already power hungry, adding another 2x or 4x power-per-hash in order to save memory isn’t beneficial.

If by 64MB you’re referring to the TMTO, see above. It would be 2^30 accesses times the partition factor. Partitioning by 8x requires >8x siphashes plus >8x memory accesses, plus additional time, memory, and power for the chained hashtables. Andersen’s paper hides this with big-O notation, but in my own tests you need at least an 8-deep hashtable chain which is big, slow and another 8x(lookup_time) coefficient “hidden” by big-O.

If I understand correctly, you are only considering the first pass, not the full trimming:

The clearest power difference between lean and mean is due to siphashes. A full-table lean ASIC must regenerate all siphashes every pass**, so for a 20-pass trim, lean requires 20N*** siphashes vs only 1N siphashes for mean. Owch!

In terms of memory power, mean miner reduces the amount of memory accesses during each pass Kn (K0 > K1 > K2 …) which “converges” to a pretty small number, whereas lean miner requires a full K0 * N. For 20 passes, the number of memory accesses for all passes totals 20 N for lean vs what, 2-3 N for mean? I don’t have the exact mean number handy. Even if DRAM and SRAM cost the same power (they don’t) it wouldn’t be close.

Again, the TMTO only makes this worse. “Time” to a computer scientist basically translates directly to power in this mining context. You definitely don’t want to increase power-per-hash in order to save memory. If anything, use more memory to save power, which is what mean miner does.

** all siphashes must be regenerated assuming you don’t have room for an edge liveness table. That would cost an additional 1N SRAM bits on the ASIC.

*** By N I mean 2^N

An aside: when reading SRAM blocks, an ASIC will still read at least full byte then mask off the required bit. Even though you could technically wire each bit separately, you would end up with a huge routing tree for addressing, creating a tangled wiring nightmare for place & route. So in practical terms, one byte is still the smallest limit on memory word size. Unless the algorithm is suffering memory access collisions (which would be rare in lean miner), an even wider word will be more efficient.

As you know, DRAM reads entire rows. 32 bits is less than the row size, but mean miner would read more than that sequentially. Reading a whole row of DRAM is significantly more efficient per-bit than reading a word of SRAM.