Several people have reported huge numbers of bogus 2-cycles found by the CUDA cuckaroo miner when running on a 2080 (Ti) GPU.

I’ve spent some time tracking down this issue, with ssh access to a 2080 GPU generously provided by Quentin & Blockcypher.

In the latest commit to https://github.com/tromp/cuckoo there is now a cudabug make target in the src/cuckaroo directory.

Running on a 1080 Ti, this gives:

```
$ make cudabug
nvcc -std=c++11 -o cuda19 -DSYNCBUG -DPROOFSIZE=2 -DEPS_A=4 -DEPS_B=3 -DIDXSHIFT=2 -DEDGEBITS=19 -arch sm_35 mean.cu ../crypto/blake2b-ref.c
./cuda19 -U 64 -Z 64 -z 64 -m 4
GeForce GTX 1080 Ti with 10GB @ 352 bits x 5505MHz
Looking for 2-cycle on cuckaroo19("",0) with 50% edges, 64*64 buckets,
4 trims, and 64 thread blocks.
Using 28MB of global memory.
nonce 0 k0 k1 k2 k3 a34c6a2bdaa03a14 d736650ae53eee9e 9a22f05e3bffed5e b8d55478fa3a606d
group 2f2 size 124
group 2f2 size 124 lid 22 localIdx 1
group 2f2 size 152
group 2f2 size 152 lid 18 nflush 8
group 2f2 size 152 lid 22 localIdx 0
Seeding completed in 1 + 3 ms
round 0 group ad6 edges 260
round 1 group ad6 edges 89
round 2 group ad6 edges 54
round 3 group ad6 edges 28
61340 edges after trimming
findcycles edges 61340 time 3 ms total 13 ms
Time: 14 ms
0 total solutions
```

The interesting output is right after “group 2f2 size 152”.

This is one of the 64*64 buckets being processed in the second call to SeedB. It contains 152 edges whose endpoint 0 has 0x0b as its least significant 6 bits. These edges are bucket sorted on the next group of 6 bits; those with the 12 lsb bits equal to 0x5ab will go to group 2d6.
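To make the two-level bucketing concrete, here is a sketch of how 64*64 buckets can be addressed by two 6-bit slices of an endpoint. The function names and the exact bit layout are my illustration, not necessarily what mean.cu does:

```
// Hypothetical illustration of two-level bucketing with 64*64 buckets:
// the 12 low bits of an endpoint split into two 6-bit indices, the
// first consumed by SeedA and the second by SeedB. The actual bit
// layout in mean.cu may differ.
typedef unsigned u32;

__device__ u32 seedAbucket(u32 endpoint) {
  return endpoint & 0x3f;            // least significant 6 bits
}

__device__ u32 seedBbucket(u32 endpoint) {
  return (endpoint >> 6) & 0x3f;     // next 6 bits
}
```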

There are 8 such edges, and they are “flushed” in lines 205-206 of mean.cu.

Line 210 resets the counter of currently held 0x5ab edges from 8 back to 0, leaving no more edges to be moved in the leftovers loop at lines 215-227. On the 2080, however, thread 18 makes it to the leftover loop before thread 22 reaches the flush of 8 in the main loop, and so ends up storing these 8 edges twice in group 2d6:

```
$ make cudabug
nvcc -std=c++11 -o cuda19 -DSYNCBUG -DPROOFSIZE=2 -DEPS_A=4 -DEPS_B=3 -DIDXSHIFT=2 -DEDGEBITS=19 -arch sm_35 mean.cu ../crypto/blake2b-ref.c
./cuda19 -U 64 -Z 64 -z 64 -m 4
GeForce RTX 2080 with 7951MB @ 256 bits x 7000MHz
Looking for 2-cycle on cuckaroo19("",0) with 50% edges, 64*64 buckets, 4 trims, and 64 thread blocks.
Using 28MB of global memory.
nonce 0 k0 k1 k2 k3 a34c6a2bdaa03a14 d736650ae53eee9e 9a22f05e3bffed5e b8d55478fa3a606d
group 2f2 size 124
group 2f2 size 124 lid 22 localIdx 1
group 2f2 size 152
group 2f2 size 152 lid 22 localIdx 8
group 2f2 size 152 lid 18 nflush 8
Seeding completed in 0 + 0 ms
round 0 group ad6 edges 268
round 1 group ad6 edges 89
round 2 group ad6 edges 54
round 3 group ad6 edges 28
61360 edges after trimming
2-cycle found
2-cycle found
2-cycle found
2-cycle found
2-cycle found
2-cycle found
2-cycle found
2-cycle found
findcycles edges 61360 time 3 ms total 4 ms
Time: 5 ms
Solution 64f04 64f04
FAILED due to edges not ascending
Solution 65124 65124
FAILED due to edges not ascending
Solution 64b9d 64b9d
FAILED due to edges not ascending
Solution 64d4d 64d4d
FAILED due to edges not ascending
4 total solutions
```

The two __syncthreads(); calls in the main loop are supposed to prevent this. Or am I missing something?!
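For reference, the CUDA programming guide only defines __syncthreads() when every thread of the block reaches the same barrier; a barrier inside a loop whose trip count differs across threads (as happens when size is not a multiple of blockDim.x) falls outside that contract. The following is a heavily simplified sketch of the general flush-then-leftovers shape — all names and constants are my stand-ins, NOT the actual mean.cu code:

```
// Simplified sketch of a SeedB-style flush pattern (hypothetical names
// and constants; NOT the actual mean.cu code). Each block buffers edges
// per destination group in shared memory, flushing full batches.
typedef unsigned long long u64;
#define NGROUPS 64
#define FLUSHSIZE 8

__global__ void seedBsketch(const u64 *edges, unsigned size,
                            u64 *out, unsigned *outcnt) {
  __shared__ u64 tmp[NGROUPS][FLUSHSIZE];
  __shared__ unsigned cnt[NGROUPS];
  unsigned lid = threadIdx.x;
  if (lid < NGROUPS) cnt[lid] = 0;
  __syncthreads();
  // main loop: trip counts differ per thread when size % blockDim.x != 0
  for (unsigned i = lid; i < size; i += blockDim.x) {
    unsigned g = edges[i] & (NGROUPS-1);        // destination group
    unsigned idx = atomicAdd(&cnt[g], 1);
    tmp[g][idx % FLUSHSIZE] = edges[i];
    __syncthreads();                            // barrier inside the loop
    if (idx % FLUSHSIZE == FLUSHSIZE-1) {
      // flush a full batch of FLUSHSIZE edges to global memory
      unsigned base = atomicAdd(&outcnt[g], FLUSHSIZE);
      for (unsigned j = 0; j < FLUSHSIZE; j++)
        out[g * 1024 + base + j] = tmp[g][j];
    }
    __syncthreads();
  }
  // leftover loop: store the < FLUSHSIZE remaining edges per group.
  // A thread whose main loop ended early reaches this point while
  // others are still at the in-loop barriers above -- the situation
  // the lid 18 / lid 22 trace shows.
  for (unsigned g = lid; g < NGROUPS; g += blockDim.x) {
    unsigned n = cnt[g] % FLUSHSIZE;
    unsigned base = atomicAdd(&outcnt[g], n);
    for (unsigned j = 0; j < n; j++)
      out[g * 1024 + base + j] = tmp[g][j];
  }
}
```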

ADDENDUM: https://docs.nvidia.com/cuda/turing-tuning-guide/index.html mentions the new __syncwarp() primitive for Turing:

“Applications that assume reads and writes are implicitly visible to other threads in the same warp need to insert the new __syncwarp() warp-wide barrier synchronization instruction between steps where data is exchanged between threads via global or shared memory. Assumptions that code is executed in lockstep or that reads/writes from separate threads are visible across a warp without synchronization are invalid.”
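The canonical use of __syncwarp() is to order intra-warp shared-memory exchanges that pre-Volta lockstep execution left implicit. A self-contained illustration (unrelated to mean.cu), following the read/write-separation pattern NVIDIA's guides recommend:

```
// Warp-level sum reduction with an explicit __syncwarp() between each
// shared-memory read and write. Under independent thread scheduling
// (Volta/Turing), every exchange between lanes needs such a barrier.
__global__ void warpsum(int *result) {
  __shared__ int shmem[32];
  unsigned tid = threadIdx.x & 31;
  shmem[tid] = tid;                  // each lane contributes its id
  __syncwarp();
  for (int offset = 16; offset > 0; offset >>= 1) {
    int v = 0;
    if (tid < offset)
      v = shmem[tid] + shmem[tid + offset];
    __syncwarp();                    // reads done before the write below
    if (tid < offset)
      shmem[tid] = v;
    __syncwarp();                    // write visible to next iteration
  }
  if (tid == 0)
    *result = shmem[0];              // sum of lane ids 0..31
}
```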

But that appears to be weaker than __syncthreads(), and the bug persists after changing the barrier to

```
__syncwarp(); __syncthreads();
```

Any thread syncing experts want to weigh in?