Choice of ASIC Resistant PoW for GPU miners

It drops by an “absolute” 1%, right? This is a linear function not geometric: 90.00% 89.00% 88.00%…

Yes, absolute 1%, as implied by “linearly decreasing from 90% to 0% over 2 years”.
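If it helps to see that schedule as code, here is a minimal sketch of such a linear phase-out, assuming 1-minute blocks; the constants and the function name are illustrative only, not Grin’s actual consensus code:

#include <cstdint>

// Illustrative constants: 1-minute blocks, 90% initial share, 2-year phase-out.
const uint64_t BLOCKS_PER_YEAR  = 60 * 24 * 365;        // 525,600 blocks
const uint64_t PHASE_OUT_BLOCKS = 2 * BLOCKS_PER_YEAR;  // 2 years

// Percentage of blocks awarded to the ASIC-resistant (GPU) PoW at a given
// height: starts at 90 and decreases linearly, in absolute percent, to 0.
uint64_t ar_pow_share_percent(uint64_t height) {
    if (height >= PHASE_OUT_BLOCKS) return 0;
    return 90 - (90 * height) / PHASE_OUT_BLOCKS;
}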

What’s the rationale behind putting up a 3-4GB requirement?
There are many in the mining community with old GPUs, and very many in the Monero community with low-power, low-end 2GB GPUs that work just fine with Monero (in some cases better than Vegas in terms of hash/watt). It would actively hurt decentralization & egalitarianism not to include every possible piece of hardware, if feasible to implement.

The requirement for cuckaroo29 is actually 7 GB.
The rationale is that Cuckoo Cycle is a memory hard PoW,
and as such wants to exercise as much memory as possible.

The lower the memory requirements, the higher the risk of FPGAs and ASICs getting a large advantage.

There are on the order of a million GPUs available that can mine cuckaroo29,
and we hope to attract a decent fraction of them.

So by lowering the spec requirement to roughly 1GB instead of 7GB, and thereby allowing many more devices to participate (mobile GPUs in phones, notebooks & low- to mid-end GPUs), what exactly would become so much easier about implementing an ASIC?
I don’t see a big problem with FPGAs, by the way; if they can implement it with 1GB they can also do it with 7GB. I’d expect the costs to grow roughly linearly for both specialty hardware & high-end GPUs?
Please correct me if I’m wrong. I am not a hardware expert by any means.
But I think the benefit of allowing every kind of low-end device to participate, instead of only specialized high-end 8GB Vega/1080 miners, might outweigh the potential drawback (I have both a Vega farm & a 2GB low-end GPU farm).
Just imagine all the additional people this might attract & introduce to the project. I only got into Bitcoin back in 2012/2013 because I was able to mine it on my laptop’s Nvidia GPU :slight_smile:

I also don’t think ASIC manufacturers would get involved much, since you’re taking a clear anti-ASIC stance, and who would take such a risk when the reward drops by 1% every week (0 reward after 2 years)? It’s just a temporary measure, and personally I believe getting as many people as possible into the project is the best way forward for Grin. After all, cryptocurrencies can only establish themselves if the community is big & diverse.

By the way what’s the limiting factor with cuckaroo29? Mem bandwidth? Ops/s (timings/clocks)?

Existing FPGAs have limits on amount of memory supported. There will generally be many more FPGAs able to support 1 GB than ones able to support 8 GB.
ASICs, on second thought, should not be a problem as long as the DRAM required doesn’t fit on chip, which is certainly the case at >= 1GB.

There is another potentially large benefit to requiring more than 6GB. Many huge mining farms are dominated by <= 6GB GPUs. Excluding them makes it easier for hobby miners to compete.

Lowering memory requirements allows more people to mine with their current equipment, but most won’t be able to do so profitably.

That said, we could in principle support a range of acceptable cuckaroo sizes rather than the single 2^29 size, if deemed beneficial to adoption.
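For illustration only, a sketch of what accepting a size range could look like; the edge-bit bounds and the weighting formula below are hypothetical, just to show that larger graphs would need a proportionally larger difficulty weight:

#include <cstdint>

// Hypothetical bounds: accept cuckaroo proofs on graphs from 2^27 to 2^31 edges
// instead of only 2^29.
const uint32_t MIN_EDGE_BITS = 27;
const uint32_t MAX_EDGE_BITS = 31;

bool edge_bits_acceptable(uint32_t edge_bits) {
    return edge_bits >= MIN_EDGE_BITS && edge_bits <= MAX_EDGE_BITS;
}

// Weight a proof by its edge count relative to the smallest accepted graph, so
// an attempt on a larger graph, which costs proportionally more memory and work,
// also earns proportionally more. The exact scaling would need careful tuning.
uint64_t graph_weight(uint32_t edge_bits) {
    return (uint64_t)1 << (edge_bits - MIN_EDGE_BITS);
}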

I think the most popular FPGA in crypto-mining is the BCU-1525 with 16GB of RAM. And from what I’ve heard, the Mineority group hosts and/or owns over 5000 of these (correct me if I’m wrong), so if they decide to write bitstreams for it they’ll pretty much have no competition.
I think that’s another reason why it’s important to open up mining to as many people and categories of devices as possible (as you are doing).
I also believe there are as many 8GB mining farms as there are 4GB mining farms; the added expense on the manufacturer’s side of going with 8GB instead of 4GB is minimal (like 20 bucks, when I last phoned an AMD card manufacturer’s manager), and since most big farms are Ethereum farms I’m sure they were smart enough to pay the extra $20 and not be out-DAG’ed.

Personally I believe it’s smaller farms & hobbyist miners that went with smaller RAM sizes, since they don’t plan to run the hardware for years on end the way a big Icelandic farm with cheap power would.
Don’t exclude the hobbyists & small farms from participating with their existing infrastructure. The only argument for going with 6-7GB is that it makes things harder for FPGAs, but there are probably going to be FPGAs mining on cuckaroo whether you like it or not. So at least open it up a bit.

Just because you have 16GB on that BCU-1525 doesn’t mean you also have GPU-like bandwidth. I’d be surprised if that thing comes even close to a GTX 1070 when it comes to cuckaroo. It will eat less power, that is for sure. They should all be mining cuckatoo31 with those.

I believe the AR PoW aims to deter any ASIC by any means, including designs with DRAM. More memory chips, a complex memory interface, and large chunks of data to process are all part of the puzzle. Cuckaroo benefits from complex GPU designs; if you let all the ETH farms on cheap power in, your old AMD card won’t even pay for its power.

You can still mine the AR PoW on any GPU with 1GB, at a loss, if you want.

  1. So you think ETH farms around the world all run on 4GB? You will have farms with cheap power either way.
  2. My point is offering fairness and giving people a chance to participate. If somebody hears they can mine this algo profitably on their mobile GPU, do you think they won’t just try printing magic internet money for shits & giggles? That’s how Bitcoin took off, through being mined by people at their homes and spreading the word.

By enabling only 8GB cards you are excluding a large potential userbase and their networks, giving all the power to 8GB GPU miners and $4000 FPGA owners.

A normal dude mining overnight on a two-generations-old AMD 2GB GPU will likely never even hit the pool payout threshold. Large farms are businesses run for profit; if they got 5 wagons of RX 460 2GB on sale, that’s cool, but I don’t think their existence should compromise the PoW by giving it an even smaller memory footprint than ETH (<2GB).

speculation

So an RX 460 (still a modern card) mines at best 0.6 XMR per year. If you run it only overnight (30% of the time) and account for how much less efficient it would be on cuckaroo compared to other cards (20-50%), I get about one year just to reach the 0.1 XMR minimum payout. So in Grin, where the AR PoW share goes down to zero, it is not that crazy to think such a person would give up long before reaching the threshold. Forget about mobile GPUs.
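The back-of-envelope behind that estimate, using only the numbers above (0.6 XMR/year running 24/7, 30% uptime, a 20-50% efficiency penalty on cuckaroo); just a rough sketch, not a profitability model:

#include <cstdio>
#include <initializer_list>

int main() {
    const double xmr_per_year_24_7 = 0.6;   // RX 460 mining full time
    const double uptime            = 0.30;  // overnight only
    const double payout_threshold  = 0.1;   // typical pool minimum, in XMR

    // 20% and 50% less efficient on cuckaroo than on better-suited algorithms
    for (double penalty : {0.2, 0.5}) {
        double per_year = xmr_per_year_24_7 * uptime * (1.0 - penalty);
        printf("penalty %2.0f%%: %.3f XMR/year, ~%4.1f months to reach %.1f XMR\n",
               penalty * 100, per_year, 12.0 * payout_threshold / per_year,
               payout_threshold);
    }
    return 0;
}

With these inputs it comes out to roughly 8-13 months just to hit the threshold, i.e. about a year, as stated above.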


FPGAs by themselves are no threat. FPGAs generally use DDR3/4, and none can interface with GDDR5 unless there’s an additional, separate controller on the board. For PoWs with any significant memory requirements, this is crushing. There are some high-end SoCs layered on HBM2, but these units cost $10k to $25k each.

FPGAs can certainly accelerate the computation of siphash, but stuffing 7GB worth of siphashes into DDR takes a long time. Using an FPGA with HBM is prohibitively expensive. These FPGAs are generally used for things like missiles, medical equipment, and physics research, and the market (over)prices them accordingly.

Missiles and medical equipment, overpriced? What, are you saying that the healthcare and government contracting industries have easy access to money (perhaps that of the taxpayer) and might be working in concert with politicians and are therefore incentivized to create overpriced monstrosities that do nothing but plague humanity and every other unfortunate creature who happens to be stuck on this planet?

:thinking: That’s interesting, I’ll have to think about it.

tromp, I implemented cuckaroo lean based on your cuckoo lean and cuckaroo mean, and found that:

  1. cuckaroo lean is only twice as slow as cuckoo lean
  2. cuckaroo lean computes 10 times as many hashes as cuckoo lean
  3. the number of memory reads and writes in cuckaroo lean is close to cuckoo lean’s
  4. Let x1 and x2 be the hash counts of cuckoo and cuckaroo, and y1 and y2 their memory-access counts. Then (1) gives x1 + y1 = (x2 + y2) / 2, (2) gives x2 = 10*x1, and (3) gives y2 = y1; solving (see the substitution spelled out below) we get y1 = 8*x1. This shows that the time spent on hashing is very low. So, is cuckaroo really ASIC resistant?
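Spelling out that substitution, under the assumption that total running time is simply hash work plus memory work (an assumption challenged further down the thread):

x1 + y1 = (x2 + y2) / 2,  with  x2 = 10*x1  and  y2 = y1
=> 2*x1 + 2*y1 = 10*x1 + y1
=> y1 = 8*x1

i.e. in this additive model, memory accesses account for about 8/9 of cuckoo lean’s time and hashing for about 1/9.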

That’s hard to believe.
Where can I verify your lean cuckaroo implementation?

__device__ u64 block (word_t n) const
        {
                // read 64 bits of the edge bitmap (two 32-bit words starting at word n/32)
                // and return the complement, so set bits mark edges that are still alive
                u64 r = *(u64 *) & bits[n / 32];
                return ~r;
        }
// count_node_deg: counting half of a trimming round. For every still-live edge,
// compute one endpoint (u or v side, selected by uorv) and mark it in the
// nonleaf twice_set, so endpoints seen at least twice can be identified.
__global__ void count_node_deg (cuckoo_ctx * ctx, u32 uorv, u32 part)
{
        shrinkingset & alive = ctx->alive;
        twice_set & nonleaf = ctx->nonleaf;
        siphash_keys sip_keys = ctx->sip_keys;  // local copy sip context; 2.5% speed gain
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        u64 buf[64];
        for (u32 block = id * 64; block < NEDGES; block += ctx->nthreads * 64)
        {
                u64 alive64 = alive.block (block);
                u64 last = 0;
                if(alive64)
                {
                        last = dipblock (sip_keys, block, buf);
                }
                for (u32 nonce = block - 1; alive64;)
                {                                               // -1 compensates for 1-based ffs
                        u32 ffs = __ffsll (alive64);
                        nonce += ffs;
                        alive64 >>= ffs;

                        u64 edge = buf[nonce - block] ^ last;
                        u32 u = (edge >> (uorv ? 32 : 0)) & EDGEMASK;

                        if ((u & PART_MASK) == part)
                        {
                                nonleaf.set (u >> PART_BITS);
                               
                        }

                }
        }
}
// kill_leaf_edges: second half of the trimming round. Re-derive the same endpoints
// and kill (reset) every edge whose endpoint was seen fewer than twice, since such
// an edge cannot be part of a cycle.
__global__ void kill_leaf_edges (cuckoo_ctx * ctx, u32 uorv, u32 part)
{
        shrinkingset & alive = ctx->alive;
        twice_set & nonleaf = ctx->nonleaf;
        siphash_keys sip_keys = ctx->sip_keys;
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        u64 buf[64];
        for (u32 block = id * 64; block < NEDGES; block += ctx->nthreads * 64)
        {
                u64 alive64 = alive.block (block);
                u64 last = 0;
                if(alive64)
                {
                        last = dipblock (sip_keys, block, buf);
                }
                for (u32 nonce = block - 1; alive64;)
                {                                               // -1 compensates for 1-based ffs
                        u32 ffs = __ffsll (alive64);
                        nonce += ffs;
                        alive64 >>= ffs;

                        u64 edge = buf[nonce - block] ^ last;
                        u32 u = (edge >> (uorv ? 32 : 0)) & EDGEMASK;

                        if ((u & PART_MASK) == part)
                        {
                                if (!nonleaf.test (u >> PART_BITS))
                                {
                                        alive.reset (nonce);
                                }
                        }
                }
        }
}

Above are the key parts of the code.

In GPU lean, the cores sit mostly idle. You can do lots of extra work in there with no slowdown. A single-chip ASIC is the other way around: memory is (nearly) instant, but 10x more hashes means 10x more heat and 10x slower. And it gets bricked after 6 months by some non-trivial changes.

I cannot run that code easily. Do you have a repo with your implementation?


lean2.cu contains the lean cuckaroo code.

It somehow computes different cycles from cuckaroo/cuda29, but is indeed only 40% slower than cuckoo lcuda.
Thanks for taking the trouble to code this.

I had forgotten how much idle time the cores have with all the memory latency in lean cuda, as photon noted.
You would see an order of magnitude slowdown from all the dipblock calls if you used fast SRAM for memory.

This will mean about 10x power as well. Power per hash is often a more important metric than hashes per second.

This equation does not hold. GPUs can calculate hashes while memory accesses are in flight, so the lean miner is able to hide its extra hashes under the memory latency. Mean is (mostly) calculating hashes up-front and has more of the additive relation you claim here.
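A toy calculation of the difference between the two models (the figures are illustrative, not measurements): with latency hiding the time is roughly max(x, y), where x is hash work and y is memory work, while without it the cost is closer to x + y.

#include <algorithm>
#include <cstdio>

int main() {
    // Illustrative, unit-free costs: y = memory work, x = hash work.
    const double y_dram = 10.0;  // memory-bound GPU: memory dominates
    const double x_few  = 1.0;   // cuckoo-style hash load
    const double x_many = 10.0;  // cuckaroo-style: 10x more hashes

    // With latency hiding, time ~ max(x, y); the extra hashes still cost power.
    printf("GPU  (overlapped): %.0f -> %.0f   (time barely moves)\n",
           std::max(x_few, y_dram), std::max(x_many, y_dram));

    // With fast SRAM, memory is nearly free, so time ~ x and scales with hashes.
    printf("SRAM (hash-bound): %.0f -> %.0f  (10x slower and ~10x the heat)\n",
           x_few, x_many);
    return 0;
}

In other words, the GPU hides the extra hashes under memory latency, while a single-chip SRAM design pays for them in full, in both time and power.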

None of this affects the anti-ASIC claim of Cuckaroo much. As Tromp said, SRAM has a very different performance characteristic than DRAM. IMO, it’s the power requirements that kill lean ASICs for Roo.