While it can’t be cheap, it also gives you access to high-performance FPGAs at zero fixed cost. As there presumably are no ASICs for Cuckoo yet, is it possible that this would be performant enough in the early days? Would the performance of these high-end FPGAs make it worthwhile compared to GPUs and “home-kit” FPGAs?
If it’s profitable enough to run on AWS, it would be a money printing machine. Who’d give it away for free?
I guess I just did, no? (; Hoping some kind soul who has the project’s best interest at heart would be willing to help us test this and benchmark it. I don’t have any OpenCL programming expertise, and don’t plan to be a miner, so it doesn’t make sense for me to invest in learning this at this point. But it seems like anyone who’s thinking of mining with GPUs using OpenCL should at least look into this option, even if it’s only in order to dismiss it as non-feasible.
Apart from siphash, Cuckoo Cycle is still very memory bound. I can imagine some FPGA-specific tuning for Cuckoo in OpenCL that would help, but the FPGA would still need over 512 GB/s of bandwidth and would then be only 2-3x faster than a 1080 Ti. Sure, it would draw less power, but the rental cost might be higher than the electricity cost.
If I read the product page right, the f1.16xlarge would have 8 FPGA cards with 976 GiB of instance memory and, thanks to dedicated PCIe, the FPGAs would share a memory space and communicate with each other at up to 12 Gbps in each direction. Is that the same bandwidth that you are referring to? And what impact would that have on mining? Is it a linear increase?
The VU9P boards on Amazon have 4 DDR4 channels each (4x12 = 48 Gb/s). They also have 346 Mb of embedded RAM on the chip.
It needs one of those models with 4GB HBM memory. Embedded 32MB of SRAM is good for buffering. But I would still question if it is worth the effort given the insane price tags.
UltraScale+ boards feature integrated HBM2 memory with a bandwidth of 460 GB/s. For a single graph, a leaner mean solver needs to write (and read) 12 GB of data (each of 2^29 edges is written an expected 6 times, and takes only 4 bytes). That is about 24 GB of memory traffic per graph, so in the very best case, you could run 19 graphs per second. But in practice, you’d probably do very well to achieve half of that.
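For anyone who wants to redo that estimate with other numbers, here is a rough Python sketch of the arithmetic. The figures come from the post above; the function and constant names are made up for illustration, and the assumption that every written byte is read back exactly once is a simplification, not a description of any actual solver.

```python
# Back-of-the-envelope bandwidth bound for one Cuckoo graph.
# All names here are hypothetical; the figures are the ones quoted above.

EDGES = 2**29            # edges per graph
BYTES_PER_EDGE = 4       # storage per edge
EXPECTED_WRITES = 6      # expected number of times each edge is written
HBM2_BANDWIDTH = 460e9   # bytes per second (460 GB/s)


def bytes_written_per_graph(edges=EDGES, writes=EXPECTED_WRITES,
                            bytes_per_edge=BYTES_PER_EDGE):
    """Total bytes written while solving one graph."""
    return edges * writes * bytes_per_edge


def max_graphs_per_second(bandwidth, written_bytes):
    """Upper bound on graphs/s, assuming each written byte is also read once."""
    traffic = 2 * written_bytes  # write traffic plus read traffic
    return bandwidth / traffic


written = bytes_written_per_graph()
best_case = max_graphs_per_second(HBM2_BANDWIDTH, written)

print(f"written per graph: {written / 2**30:.0f} GiB")
print(f"best case:         {best_case:.1f} graphs/s")
print(f"half of that:      {best_case / 2:.1f} graphs/s")
```

Note that 2^29 x 6 x 4 bytes is exactly 12 GiB, which is closer to 12.9 GB in decimal units, so this sketch lands a little under 18 graphs per second rather than 19. The 19 figure corresponds to treating it as a round 12 GB; either way the ballpark is the same.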
Yeah, I miscalculated. The existing GPU OpenCL code could also be modified to transfer only approx. 13 GB by using the large SRAM buffer on UltraScale+. So the theoretical max speedup is 7x over a 1080 Ti, with the real figure being less (3-4x?).
Some have HBM2, but not the dev kits (that cost almost $5k) or the ones on AWS.