Optimizing C31.0 for Nvidia Tesla V100

Thanks @tromp for the work you’ve done on building higher-fidelity C31 solvers. I’m consistently getting 1.6-1.7 GPS, which is down from 1.9-2.0 GPS before, but of course the fidelity is way better.

I was wondering what ways there are to improve the C31 solvers on the Tesla V100. For example, the V100 has a larger L1 cache combined with shared memory, configurable up to 96 KB. Is the correct way to increase the shared memory size from 64 KB to 96 KB in the C31.0 solver to update intmaxbytes on line 531 of cuckatoo/mean.cu from intmaxbytes = 0x10000 to intmaxbytes = 0x18000? Would appreciate some help!

To all the miners using a V100: what GPS are you able to achieve, and what are your current settings?

There is no benefit to be gained from 96 KB over 64 KB in the current solvers, which are restricted to using a power-of-2 number of bits. You can avoid most of the slowdown of the latest update by compiling with -DNRB1=32 and passing -E 2 on the command line (expand = 2 in grin-miner).
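
A quick sketch of the arithmetic behind that, assuming the restriction means the usable buffer must itself be a power of two in size (the helper name is made up for illustration and does not come from the solver):

```python
def largest_pow2_at_most(n: int) -> int:
    """Return the largest power of two that does not exceed n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n.bit_length() - 1)


KB = 1024

# Compare the default 64 KB limit with the 96 KB the V100 can be configured for.
for limit_kb in (64, 96):
    usable = largest_pow2_at_most(limit_kb * KB)
    print(f"{limit_kb} KB limit -> {usable // KB} KB usable ({usable:#x} bytes)")
```

Under that assumption both limits give 0x10000 bytes, since the next power of two (128 KB) is above what the V100 offers.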

Hi tromp, and mayank,

I’ve tested with a V100, and my best configuration hashed at 1.96 GPS.
If I remember correctly, it was:

plugin_name = "cuckatoo_mean_cuda_rtx_31"
device = 0
expand = 1 # I’M SURE ABOUT THIS PART!
cpuload = 1
ntrims = 176
genablocks = 4096
genatpb = 128
genbtpb = 128
trimtpb = 512
tailtpb = 1024
recoverblocks = 1024
recovertpb = 1024

I didn’t compile it with your options, tromp. I will try as soon as I get the VPS back.
Take care.

Ah, I see. Thanks! Got a nice 7% bump in GPS by compiling with -DNRB1=32 and setting expand = 2.

Hi josevora,

I used the same settings before @tromp released the higher fidelity C31 solver, and got about the same GPS as yours. However, the fidelity was around 0.7, and hence the effective rate was about 1.35 GPS.

With the new higher-fidelity solver, the rate came down to about 1.5-1.6 GPS with much better fidelity (0.98+), but by tuning to the following settings I raised it to about 1.8 GPS:

  1. Compiling with -DNRB1=32 -DNEPS_A=150 -DNEPS_B=100
  2. Setting expand = 2 (as @tromp suggested), ntrims = 352, genablocks = 16384 and trimtpb = 1024
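
To put the before and after numbers in this thread on the same footing, here is a minimal sketch that assumes the effective rate is simply raw GPS multiplied by fidelity, which is how the 1.35 figure above appears to be derived. The function name is hypothetical, and the inputs are the approximate values quoted in the posts, not fresh measurements:

```python
def effective_gps(raw_gps: float, fidelity: float) -> float:
    """Effective graph rate, assuming it equals raw GPS scaled by fidelity."""
    if not 0.0 <= fidelity <= 1.0:
        raise ValueError("this sketch expects fidelity between 0 and 1")
    return raw_gps * fidelity


# Approximate figures quoted in the thread.
old = effective_gps(1.96, 0.70)  # older solver, low fidelity
new = effective_gps(1.80, 0.98)  # higher-fidelity solver after tuning

print(f"old solver:   {old:.2f} effective GPS")
print(f"tuned solver: {new:.2f} effective GPS")
print(f"relative gain: {100 * (new / old - 1):.0f}%")
```

On these rough inputs the tuned higher-fidelity solver comes out ahead despite the lower raw GPS.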

On a side note, @tromp, do you think it makes sense for the Tesla V100 (or any GPU with 16 GB+ of memory, for that matter) to run both C31 and C29 simultaneously, since they require 10.5 GB and 5.5 GB of memory respectively?

I don’t think it makes sense, as I expect running two instances increases memory contention so much that running one after the other should be faster. But you can try and see for yourself…

Hello again guys,
Well, I have tried 2 instances before… and I have two cases:

Same Plugin:

  • AMD (Radeon Vega FE) - I have a slight 5-15% increase in GraphRate, but fidelity drops a lot more than that so not worth it!
  • NVIDIA (Tesla V100) - I get a decrease in both GPS and fidelity…
  • CPU (ThreadRipper 1950x) - 5% increase in GPS, ~10% decrease in fidelity…

Different Plugins:

  • AMD (Radeon Vega FE) - INSTABILITY in GraphRate, like a lot… I see deltas of 70% for each plugin, and fidelity drops even more than when using the same plugin!
  • NVIDIA (Quadro P6000) - couldn’t test because, even with a 24 GB card at my disposal, the miner reported “out of memory”…
  • CPU (ThreadRipper 1950x) - more stable than the others, but still not worth it…

My conclusion:
doable, NOT WORTH IT…
You are (probably) better off doing it directly within the same plugin (increasing the memory used or something like that) than splitting the work in two, because in my view you are creating a bottleneck by adding a second caller reading and writing the memory, with no guarantee of a different outcome…