Finding a cycle in trimmed edges (help)

My output now matches yours. It also compiles on Linux, but I still need to make some changes for the Linux build.

In the image below, I often miss the 314-cycle in iteration two. That leads me to a question: when you follow a path from the currently inserted edge to its root, what happens when that path runs into itself in a loop? We just terminate the path method, and if that happened to be the shorter path, it gets reversed; but this is destructive, because there is no free root node for a clean reversal. Maybe the implementation differs here.

When you follow a path from the currently inserted edge to its root, what happens when that path runs into itself in a loop?

That’s a bug, since findcycles() maintains a directed forest, i.e. a loop-free graph.
It does need to run single-threaded to maintain this invariant though…
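
For illustration, here is a minimal sketch of cycle detection in a directed forest; this is my own reconstruction of the standard technique, not the actual findcycles() code. Every node points toward the root of its tree, and an edge whose two endpoints already share a root closes a cycle. Only the shorter of the two root paths gets reversed before grafting, and that reversal stays clean as long as the forest invariant holds:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct Forest {
  std::vector<uint32_t> parent;  // parent[u] == u means u is a root

  explicit Forest(uint32_t n) : parent(n) {
    for (uint32_t i = 0; i < n; i++) parent[i] = i;
  }

  // Walk to the root of u, counting the path length. In a true forest
  // this always terminates; looping here means the invariant was broken.
  uint32_t root(uint32_t u, uint32_t &len) const {
    for (len = 0; parent[u] != u; u = parent[u]) len++;
    return u;
  }

  // Reverse the parent pointers on the path from u to its root,
  // making u the new root of its tree.
  void make_root(uint32_t u) {
    uint32_t prev = u, cur = parent[u];
    parent[u] = u;
    while (cur != prev) {
      uint32_t next = parent[cur];
      parent[cur] = prev;
      prev = cur;
      cur = next;
    }
  }

  // Insert edge (u,v); returns true if the edge closes a cycle through
  // the point where the two root paths meet.
  bool insert(uint32_t u, uint32_t v) {
    uint32_t lu, lv;
    if (root(u, lu) == root(v, lv)) return true;  // same tree: cycle
    if (lu < lv) std::swap(u, v);  // reverse the shorter side only
    make_root(v);
    parent[v] = u;                 // graft v's tree under u
    return false;
  }
};

int main() {
  Forest f(8);
  f.insert(0, 1);
  f.insert(1, 2);
  return f.insert(2, 0) ? 0 : 1;  // closes the 0-1-2 triangle
}
```

Note that insert() mutates the forest sequentially; two threads reversing overlapping paths at once is one way the “path runs into itself” symptom could appear, which matches the single-threaded requirement above.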

Looking very good! 2.2x faster than my CUDA solver.
Is this also running on a 1080Ti?

A 1070; I don’t have any other GPU. I expect the final speedup to be 4x, then 5x with the complex variant that will come later, and 6x at the absolute maximum. I do wonder how the 1080 will take it.

For reference, how fast is my solver on your 1070?

For the same nonce as my example above. It looks faster compared to the results posted in the other thread, so not quite a 4x speedup, but close. A 1080 might go slightly over 4x. Compiled with CUDA 9.1.

So if your speedup is preserved from the 1070 to the 1080 Ti, then we’re looking at solve times of only 0.270s. Very impressive indeed. You’re very close to collecting a double bounty, just like xenoncat did.


Running on the same system in Linux now. Still losing some cycles, likely a memory-access bug hidden in the CUDA code. In 10 runs I have not lost that 42-cycle solution once, so it is not critical. It is bugging me, though.

The CUDA (later OpenCL) executable is as simple as possible: it accepts a header hash on input and spits out around 80k edges. The rest is done in dotnet process(es). The idea is to have this program sitting on a Windows/Linux/NVIDIA/AMD PC and connecting over the local network to either a grin node or a pool.
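
Roughly, the split looks like the sketch below; the I/O conventions and function names are my illustrative assumptions, not the miner’s actual interface. The GPU process stays a dumb edge producer, and everything stateful (cycle finding, networking) lives on the dotnet side:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Edge { uint32_t u, v; };

// Stub standing in for the CUDA trimming rounds; the real solver
// returns the ~80k edges that survive trimming.
static std::vector<Edge> trim_on_gpu(const std::string& /*headerHash*/) {
  return {};
}

int main() {
  std::string headerHash;
  while (std::getline(std::cin, headerHash)) {  // one header hash per request
    std::vector<Edge> survivors = trim_on_gpu(headerHash);
    // Hand the survivors to the dotnet cycle finder; see the file-based
    // variant sketched after the release post below.
    std::cout << survivors.size() << " edges\n";
  }
  return 0;
}
```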

I will check the code again to see if I spot anything fishy that would explain the missed cycles.

Hi John, the miner has been published at https://github.com/mozkomor/GrinGoldMiner

Tested for stability over 10k iterations on both Windows and Linux; passed.

It may still be slightly buggy and lose some cycles, but we need to focus on something else for at least a week, so we are publishing now. I hope you’ll manage to get it running. In the end I just made a small ramdisk in the “edges” subfolder and use normal files to get the edges out; crude, but effective until a better solution is made for Linux.
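
Filling in the hand-off step from the earlier sketch: a minimal version, assuming the drop-off is simply a flat binary file of endpoint pairs in the “edges” subfolder. The file name and record layout here are guesses for illustration, not the miner’s actual format:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct Edge { uint32_t u, v; };

// Write the trimmed edges where the dotnet process can pick them up.
// Mounting "edges/" on a ramdisk (e.g. tmpfs on Linux) keeps this
// file-based hand-off from touching the disk at all.
bool write_edges(const std::vector<Edge>& edges, uint32_t nonce) {
  char path[64];
  std::snprintf(path, sizeof(path), "edges/%u.bin", nonce);
  FILE* f = std::fopen(path, "wb");
  if (!f) return false;
  size_t written = std::fwrite(edges.data(), sizeof(Edge), edges.size(), f);
  std::fclose(f);
  return written == edges.size();
}
```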

Thanks for making your solver publicly available. I will start testing it tonight or tomorrow and let you know if I have trouble getting it to run. I just noticed it requires 8 GB. What would remain of your performance improvement if you had to target 4 GB or 6 GB cards?

Windows steals over 1 GB on every GPU, so in reality it has to work with something around 7 GB. I never bothered to target less memory, because with a 6 GB GPU and Windows stealing a portion of it, you might as well go all in on 8 GB cards only.

Compare my screenshots above, the Linux one and the Windows one, and look at the initial free memory. It’s absurd: Windows does this even on headless GPUs.

You’d have to copy a lot more data around, and with 4 GB cards having fewer compute processors as well, it wouldn’t be that great. It might still be fast, but I’m not even going to try.

6 GB cards on Linux might work without any performance impact later on, as the more advanced optimizations also save quite a lot of memory.

If your solver can fit on a 4 GB card on Windows and it uses that matrixy stuff to save maximum space, I imagine you could apply some of the individual optimizations, but you would not see anywhere near a 4x speedup. More like a little bit; I’m not sure, as I don’t actually fully understand how the mean miner works.

Anybody reading this thread who has a GTX 1070 Ti or TITAN V and is willing to test the solver in the above-mentioned repository, please come forward and share your results.

Unfortunately, I only fit two of those three criteria. I lack only the possession. :slight_smile:

GTX 1070 Ti on Ubuntu 16.04 gives an error:

$ ./Theta -r 100 -n 0
Starting CUDA solver process...
Currently available amount of device memory: 7086145536 bytes
Total amount of device memory: 8513585152 bytes
Allociating buffer 1
Allociating buffer 2
status: out of memory
cudaMalloc failed buffer B 3GB!
CUDA terminating...
CUDA launch error
Finished in 2042ms

Do I need to adjust some parameters for memory?

Some Windows software stole memory from the GPU; you can find it and kill it, reboot, or make the buffers smaller by lowering the DUCK_SIZE_A and DUCK_SIZE_B compile-time constants one by one until it fits (if you compiled from source).

Edit: Sorry, I missed the Ubuntu part… definitely some process took over 1.4 GB of your memory.
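
For anyone hitting the same wall, here is a minimal sketch of the allocation pattern behind the log above: query free memory with cudaMemGetInfo, then fail with a readable message if the big trimming buffers don’t fit. DUCK_SIZE_A/B are the solver’s real compile-time knobs, but the values and byte math below are placeholders; use the ones from the source tree:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder values; lower the real ones in the source one by one
// until both buffers fit in free device memory.
#define DUCK_SIZE_A 129ull
#define DUCK_SIZE_B 83ull

static void* alloc_or_die(size_t bytes, const char* name) {
  void* ptr = nullptr;
  cudaError_t err = cudaMalloc(&ptr, bytes);
  if (err != cudaSuccess) {
    std::printf("cudaMalloc failed %s: %s\n", name, cudaGetErrorString(err));
    std::exit(1);
  }
  return ptr;
}

int main() {
  size_t freeMem = 0, totalMem = 0;
  cudaMemGetInfo(&freeMem, &totalMem);
  std::printf("Currently available amount of device memory: %zu bytes\n", freeMem);
  std::printf("Total amount of device memory: %zu bytes\n", totalMem);

  // Illustrative sizing only; the real buffers are derived differently.
  void* bufferA = alloc_or_die(DUCK_SIZE_A * 1024ull * 4096ull, "buffer A");
  void* bufferB = alloc_or_die(DUCK_SIZE_B * 1024ull * 4096ull, "buffer B");

  // ... trimming kernels would run here ...

  cudaFree(bufferB);
  cudaFree(bufferA);
  return 0;
}
```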

Thanks, sorted it out. Using Ubuntu 16.04, GTX 1070 Ti + ramdisk.

=======================================================
$ ./Theta -r 1000 -n 0

Trimming iteration 61 of 1000
Trimming: 4178123b6607deb3 596e493a0fe04022 685fbcfc1d315fe 7cf66796fc0083c1
2-cycle found
16-cycle found
12-cycle found
6-cycle found
26-cycle found
8-cycle found
Trimmed to: 78176 edges
Trimmed in 400ms
Trimming iteration 62 of 1000
Trimming: 544f44b2b17afc97 4ba38ecebc2fa72c 21e2c32bba7f6196 4d8886ccba77435b
4-cycle found
32-cycle found
56-cycle found
302-cycle found
314-cycle found
Trimmed to: 77508 edges
Trimmed in 404ms
Trimming iteration 63 of 1000
Trimming: 7d4f06d5f68dc772 331017080ac63322 e62926ee68af70ed cf2efe7e2f4dbc16
Trimmed to: 76856 edges
Trimmed in 395ms
Trimming iteration 64 of 1000
Trimming: 89f81d7da5e674df 7586b93105a5fd13 6fbe212dd4e8c001 8800c93a8431f938
42-cycle found!
84-cycle found
320-cycle found
192-cycle found
Trimmed to: 69937 edges
Trimmed in 400ms
Trimming iteration 65 of 1000
Solution: nonce:3F k0:7D4F06D5F68DC772 k1:331017080AC63322 k2:E62926EE68AF70ED k3:CF2EFE7E2F4DBC16
16-cycle found
288-cycle found
554-cycle found
226-cycle found
23ECE, 27E0856, 2AD8C27, 2CBB0B5, 3694CDD, 477A095, 64DE6FC, 64E1C92, 68E624D, 6AA4C6F, 6B1D0C2, 76F07D2, C273122, C2E38ED, C655CDE, C97BA17, E708130, EC8890D, ECB9932, F28D66D, F577AFF, 104D8441, 116DE91F, 116E61CB, 1178EA28, 11840F8A, 11CE10B0, 12792630, 12AE2388, 140AE893, 1439B9FD, 146A3047, 1538D93C, 176CB068, 17E01C9B, 1876EE0A, 1C871774, 1D37D976, 1D6FA785, 1D9C1669, 1D9D015E, 1DB85F7E
Trimming: aad9475236944448 af6b3569368015a e032c834e65b9b87 f168dfa67498e7c8
Trimmed to: 73503 edges
Trimmed in 495ms
Trimming iteration 66 of 1000
Trimming: bb8e0d9c86092685 5cc27ab9541d15f 745df9f37b43b2ce 9ebdebee66aa3f74
20-cycle found
324-cycle found
118-cycle found
Trimmed to: 84426 edges
Trimmed in 400ms

CUDA terminating…
Finished in 405892ms

=======================================================

So 2.46 graphs/s (1000 iterations in 405.9 s) on the 1070 Ti. :fire:

So you get a 10% boost over the regular 1070, with the only difference being 4 extra multiprocessors. I guess the memory becomes fully saturated on the 1070 Ti. Thank you for testing this.

You can also have fun with overclocking, both core and memory. For my 1070, overclocking the core had a larger impact than overclocking the memory, but on the 1070 Ti a core overclock alone may not be that effective anymore.

Are there processes that would eat up that much memory without the user intending it?

Is it the window manager?

Why isn’t turning this off, whatever it is, common knowledge in gaming communities?

For me, it was Chrome. I’m not sure why it was holding GPU memory, but I did have some YouTube tabs open, so maybe it was cached video content. After closing Chrome I had almost 100% of GPU memory available and was able to run the test.