As of today, the cuckarood_cuda_29 plugin of grin-miner runs cycle finding on the GPU.
The code for doing this turns out to be amazingly simple:
https://github.com/tromp/cuckoo/blob/master/src/cuckarood/kernel.cuh#L336-L43 and https://github.com/tromp/cuckoo/blob/master/src/cuckarood/photon.cu#L244-L264
It consists of 21 modified edge trimming rounds. Initially, edges are tagged with (15 bits of) one of their endpoints. These tags are then relayed from edge to connecting edge every round. That means that in a 42 cycle, tags originating at one node travel along both arms to arrive at the opposite node in 21 rounds. Only these edges that connect into identically tagged edges in the final tag-relay round are passed back to the CPU.
They usually number less than 100, and very rarely more than 150. The CPU almost instantly finds all 2, 6, 14, and 42-cycles, as these length all divide evenly into 42.
I did some testing on a 2080Ti to determine the optimal number of trimming rounds before starting the tag relay. This led to the new default of ntrims = 31 (its parity needs to match half the cycle length).
This commit takes the graph time down from about 115ms to about 101ms.
ASIC designers may find this approach beneficial to their future design efforts.
It can be easily applied in a lean miner when the number of edges has been trimmed 64-fold or more, allowing the node-bitmap memory to be repurposed tor tag-relay on 64-bit edges.
Thanks to Wilke Trei (aka Lolliedieb, author of lolMiner) for substantial help in implementing this!