Choice of ASIC Resistant PoW for GPU miners

A normal dude mining overnight on a 2-gen-old AMD 2GB GPU will likely never even hit the pool payout threshold. Large farms are businesses run for profit; if they got 5 wagons of RX460 2GBs on sale, that’s cool, but I think their existence should not compromise the PoW by shrinking its memory footprint even further below eth’s (<2GB).

speculation

So an RX460 (still a modern card) mines at best 0.6 XMR per year. If you run it only overnight (30% uptime) and consider how much less efficient it would be on cuckaroo compared to other cards (20-50%), then I get one year to reach the 0.1 XMR minimum payout; so in grin, where the AR PoW share goes down to zero, it is not that crazy to think such a person would give up long before the threshold. Forget about mobile GPUs.
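As a back-of-the-envelope check on my own numbers above (30% uptime, 20-50% efficiency penalty):

$$ 0.6\,\tfrac{\text{XMR}}{\text{yr}} \times 0.30 \times (0.5\text{–}0.8) \;\approx\; 0.09\text{–}0.14\,\tfrac{\text{XMR}}{\text{yr}}, $$

so hitting a 0.1 XMR payout threshold really does take on the order of a year of overnight mining.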


FPGAs by themselves are no threat. FPGAs generally use DDR3/4, and none can interface with GDDR5 unless there’s an additional, separate controller on the board. For PoWs with any significant memory requirement, this is crushing. There are some high-end SoCs layered on HBM2, but these units cost $10k to $25k each.

FPGAs can certainly accelerate the computation of siphash, but stuffing 7GB worth of siphashes into DDR takes a long time. Using an FPGA with HBM is prohibitively expensive. These FPGAs are generally used for things like missiles, medical equipment, and physics research, and the market (over)prices them accordingly.

Missiles and medical equipment, overpriced? What, are you saying that the healthcare and government contracting industries have easy access to money (perhaps that of the taxpayer) and might be working in concert with politicians and are therefore incentivized to create overpriced monstrosities that do nothing but plague humanity and every other unfortunate creature who happens to be stuck on this planet?

:thinking: That’s interesting, I’ll have to think about it.

tromp, I implemented cuckaroo lean based on your cuckoo lean and cuckaroo mean, and found that:

  1. cuckaroo lean is only twice as slow as cuckoo lean
  2. cuckaroo lean computes 10 times as many hashes as cuckoo lean
  3. the number of memory reads and writes in cuckaroo lean is close to cuckoo lean
  4. Let x1 and x2 be the hashing time of cuckoo and cuckaroo, and y1 and y2 their memory-access time. Then x1 + y1 = (x2 + y2) / 2, with x2 = 10·x1 and y2 = y1, which gives y1 = 8·x1 (spelled out below). This shows that the time spent on hashing is very low. So is cuckaroo really ASIC resistant?
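In other words, taking the 2x, 10x and ≈1x ratios above at face value:

$$ x_1 + y_1 \;=\; \frac{x_2 + y_2}{2} \;=\; \frac{10x_1 + y_1}{2} \;\Longrightarrow\; 2x_1 + 2y_1 = 10x_1 + y_1 \;\Longrightarrow\; y_1 = 8x_1. $$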

That’s hard to believe.
Where can I verify your lean cuckaroo implementation?

__device__ u64 block (word_t n) const
{
        // Read the 64-bit word of the edge bitmap covering edges n..n+63.
        // bits[] marks killed edges, so the complement is the alive mask.
        u64 r = *(u64 *) &bits[n / 32];
        return ~r;
}
__global__ void count_node_deg (cuckoo_ctx * ctx, u32 uorv, u32 part)
{
        shrinkingset & alive = ctx->alive;
        twice_set & nonleaf = ctx->nonleaf;
        siphash_keys sip_keys = ctx->sip_keys;  // local copy of sip context; 2.5% speed gain
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        u64 buf[64];
        for (u32 block = id * 64; block < NEDGES; block += ctx->nthreads * 64)
        {
                u64 alive64 = alive.block (block);      // alive mask for this 64-edge block
                u64 last = 0;
                if (alive64)
                        last = dipblock (sip_keys, block, buf);  // cuckaroo: siphash the whole block of edges
                for (u32 nonce = block - 1; alive64;)
                {                                       // -1 compensates for 1-based ffs
                        u32 ffs = __ffsll (alive64);    // next alive edge in the block
                        nonce += ffs;
                        alive64 >>= ffs;
                        u64 edge = buf[nonce - block] ^ last;           // recover this edge's endpoints
                        u32 u = (edge >> (uorv ? 32 : 0)) & EDGEMASK;   // pick the u or v endpoint
                        if ((u & PART_MASK) == part)
                                nonleaf.set (u >> PART_BITS);           // record another edge incident to this node
                }
        }
}
__global__ void kill_leaf_edges (cuckoo_ctx * ctx, u32 uorv, u32 part)
{
        shrinkingset & alive = ctx->alive;
        twice_set & nonleaf = ctx->nonleaf;
        siphash_keys sip_keys = ctx->sip_keys;  // local copy of sip context
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        u64 buf[64];
        for (u32 block = id * 64; block < NEDGES; block += ctx->nthreads * 64)
        {
                u64 alive64 = alive.block (block);
                u64 last = 0;
                if (alive64)
                        last = dipblock (sip_keys, block, buf);
                for (u32 nonce = block - 1; alive64;)
                {                                       // -1 compensates for 1-based ffs
                        u32 ffs = __ffsll (alive64);
                        nonce += ffs;
                        alive64 >>= ffs;
                        u64 edge = buf[nonce - block] ^ last;
                        u32 u = (edge >> (uorv ? 32 : 0)) & EDGEMASK;
                        if ((u & PART_MASK) == part)
                        {
                                if (!nonleaf.test (u >> PART_BITS))
                                        alive.reset (nonce);    // endpoint seen fewer than twice: kill this leaf edge
                        }
                }
        }
}

Above are the key parts of the code.

In GPU lean, the cores are sitting idle; you can do lots of extra work in there with no slowdown. A single-chip ASIC is the other way around: memory is effectively instant, but 10x more hashes means 10x more heat and makes it 10x slower. And it gets bricked after 6 months by some non-trivial changes.

I cannot run that code easily. Do you have a repo with your implementation?


lean2.cu is the lean cuckaroo code.

It somehow computes different cycles from cuckaroo/cuda29, but it is indeed only 40% slower than cuckoo lcuda.
Thanks for taking the trouble to code this.

I had forgotten how much idle time the cores have with all the memory latency in lean cuda, as photon noted.
You would see an order of magnitude slowdown from all the dipblock calls if you used fast SRAM for memory.

This will mean about 10x power as well. Power per hash is often a more important metric than hashes per second.

This equation does not hold. GPUs can calculate hashes while memory accesses are in-flight, so the lean miner is able to hide its extra hashes under the memory latency. Mean is (mostly) calculating hashes up-front and has more of the additive relationship you claim here.
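A crude way to see it (deliberately oversimplified, ignoring everything but hashing and memory): with hashing overlapped under in-flight memory accesses, per-trimming-round time behaves roughly like

$$ t_{\text{lean GPU}} \;\approx\; \max(t_{\text{hash}},\, t_{\text{mem}}), $$

rather than the additive

$$ t \;\approx\; t_{\text{hash}} + t_{\text{mem}} $$

that the y1 = 8·x1 reading assumes, so a 10x increase in hashing barely shows up as long as hashing stays under the memory latency.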

None of this affects the anti-ASIC claim of Cuckaroo much. As Tromp said, SRAM has very different performance characteristics than DRAM. IMO, it’s the power requirements that kill lean ASICs for Roo.

Hi John,

Been reading your stuff on the Aeternity blog, very cool!

I’m wondering why the chain would not dynamically read the currency unit price (from decentralized oracles, ideally) and adjust mining rewards up or down as required, in a feedback loop that buys “as much hash/graph rate as necessary but not more”. Beam doesn’t need 100k 1080Ti equivalents right now to securely process user transactions, since there aren’t any yet, so the way grin does it, with linear inflation, is better.
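To make the feedback-loop idea concrete, here is a toy sketch of what I mean (my own illustration only; the oracle input, target budget and clamping bounds are made-up parameters, not a proposal for actual consensus code):

#include <algorithm>

// Toy reward controller: scale the block reward so that the fiat value of the
// per-block miner revenue roughly tracks a fixed security budget, using an
// oracle-supplied coin price. All names and values here are illustrative only.
double next_block_reward (double oracle_price_usd,   // coin price from some decentralized oracle
                          double target_budget_usd,  // desired security spend per block
                          double min_reward,         // hard emission floor
                          double max_reward)         // hard emission ceiling
{
        if (oracle_price_usd <= 0.0)
                return max_reward;  // no usable price feed: fall back to maximum emission
        double reward = target_budget_usd / oracle_price_usd;
        return std::clamp (reward, min_reward, max_reward);
}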


It would be great to see such an experiment elsewhere. At the moment I am not aware of any robust way to run a decentralized oracle, are you?

You can require multiple oracles to validate a claim using multisig. That’s as decentralized as you want. Both parties can agree to a set of oracle arbitrators before entering into the agreement.


Why yes, that’s why I’m asking…

Basically the idea involves creating a prediction market for the hash rate. Let’s say I create an ERC20 token called “HASH” and a simple smart contract on eth that accepts votes from anyone who pays the gas plus, crucially, a vote price payable in HASH tokens, say 0.001. Every minute, the average of the votes is tallied and whoever got closest to the average gets 1 HASH token. Creating value for the token is a separate problem, but to keep this brief let’s assume I have that covered for now; it could be as simple as pre-loading the contract to pay out real ETH to winners periodically to get it started, and then having it grow into self-sustainability.
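To make the tally rule concrete, here is a rough sketch of one round in plain C++ (purely my illustration of the mechanism, not contract code; the 1 HASH reward and 0.001 vote fee are just the numbers from above):

#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

struct Vote
{
        std::string voter;  // would be an address on-chain
        double guess;       // this voter's hash-rate estimate
};

// One tally round: average all paid votes and reward whoever came closest.
// Returns the winner's identifier, or an empty string if nobody voted.
std::string tally_round (const std::vector<Vote> &votes)
{
        if (votes.empty ())
                return "";
        double sum = 0.0;
        for (const Vote &v : votes)
                sum += v.guess;
        const double average = sum / votes.size ();

        std::size_t winner = 0;
        double best = std::abs (votes[0].guess - average);
        for (std::size_t i = 1; i < votes.size (); ++i)
        {
                const double d = std::abs (votes[i].guess - average);
                if (d < best)
                {
                        best = d;
                        winner = i;
                }
        }
        // The winner is credited 1 HASH token; the 0.001 HASH vote fees
        // collected this round stay in the contract.
        return votes[winner].voter;
}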

These independent for-profit actors competing to guess the hash rate (however they do it) will enable it to serve a “decentralized oracle” function without actually structuring it as such. Does this make sense?

This idea just occurred to me the other day but I’m sure someone else has thought of it too. Are there any projects out there (within crypto or without) that I could learn from?


I wouldn’t touch a coin with … “side-effecty”(?) smart contracts until good smart contracts are a thing and are super proven, and pro tip: this is not erc20 territory, it would be more like Simplicity.

Years after it’s proven.

Smart contracts should not touch the miner state unless it’s the best idea ever; bitcoin script being sub-Turing was the correct move in my opinion, and the eth ICO market is a train wreck. Smart contracts aren’t really smart, nor contracts; it’s just a bit of engineer naming for the simple ideal of having lightly complex transactions be trustless and user-level designable.

Apparently a new prediction market called Veil is launching with Grin markets in mind… hopefully it gets some use.

built on ethereum

So for an ASIC, which algorithm should they target? C31 mean or C31 lean? Or C32?

At first I was like “really?” then

built on ethereum

I was like “…sigh… really.”

So some time has passed since November. @tromp said that lowering the memory requirement for GPUs to 1GB would not grant ASICs an efficiency advantage over 7GB, and @timolson said that FPGAs do not pose a threat to Grin.

There are some high-end SoCs layered on HBM2, but these units cost $10k to $25k each.

FPGAs can certainly accelerate the computation of siphash, but stuffing 7GB worth of siphashes into DDR takes a long time. Using an FPGA with HBM is prohibitively expensive.

With that established(?), I see no reason not to lower the barrier and allow a much broader array of devices to participate in the Grin network.

The argument that “all big farms run on 4GB” is not a good one, imo. You’ll have farms running all kinds of hardware, from low-end Sea Islands family cards to high-end RTX 2080 and 1080 Ti farms. By not allowing everything to participate you give a certain advantage to a certain kind of hardware, which might seem unfair and not decentralized.

ethOS (the OS used by many hobbyist miners around the world) gives a good picture of the hardware distribution:
http://ethosdistro.com/versions/
You’ll want to get the attention of as many of these as possible.

Has a decision been made? Obviously I would strongly stand in favour of it.

Full disclosure: I own both a large number of RX 550 2GB/4GB GPUs mining Monero and a significant number of 8GB RX Vega 56 GPUs.

Kind Regards
MoneroCrusher