Change file_log_level to Debug in your grin-server.toml, then share your log file grin-server.log.
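For reference, the setting lives in the logging section of grin-server.toml; something like the following (the exact section layout may differ slightly between versions, so check your own file):

```toml
# grin-server.toml -- logging section
[logging]
# Write Debug-level detail to grin-server.log
file_log_level = "Debug"
```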
Go to the 'Peers and Sync' screen and check the download speeds.
I suggest the quickest way is to use the stable version to sync, then replace the grin executable with the latest beta.
I also had some issues like yours, but I'm not sure why.
I also run Ubuntu 22, with almost the same config as yours. There is no problem; it works fine.
I installed the same OS image on a laptop and everything synced up first time, so it's likely something with the VM causing the issue. Download speed was also much faster. I didn't get any inbound peers though, despite port 3414 being open (I can see inbounds on g++). Perhaps there's a config setting I need to look at?
Which phase are you having the sync issue in? Is it PIBD, after header sync finishes?
You can see the screenshots above of where it got stuck after syncing headers. I won't spend any more time on that issue as it seems environment-related. I tried using a VM for convenience, but I actually prefer dedicated hardware for a node.
Yeah, you're not alone. I built from 'PIBD - Revert earlier pibd_aborted flag in sync logic by yeastplume · Pull Request #3757 · mimblewimble/grin · GitHub', but it was no better.
I gave up and used the stable alpha version to sync the node, then copied the beta binary over the alpha one…
5.2 beta 2 has been released, time for more testing.
Starting new tests now: 2 Windows and 2 Ubuntu machines, with default settings.
Testing on Windows 10, four runs in total.
I encountered a forced shutdown at step 4/7. Also some observations on speed in the different steps, where it could possibly be improved.
Step 1) Downloading headers: 0-2 MB/s, speed fluctuates, sometimes dropping to 0 kB/s (I monitored a graph of the internet speed). Based on disk usage over time, the effective download speed is around 0.75 MB/s. Since I am connected to 24 peers and have at least 10 times the used download bandwidth, I suspect there is some unnecessary waiting before making new requests. Maybe check that new PMMRs are already requested while old ones are being validated?
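The waiting I suspect here could be avoided by pipelining: keep the next requests in flight while the current responses are validated. A minimal sketch of the idea (hypothetical names, not grin's actual sync code), using a bounded channel as the segment buffer so the downloader never idles during validation:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Stand-in for a downloaded header/PMMR segment.
struct Segment(u64);

// Download segments on one thread while another validates them.
// The bounded channel buffers up to 8 segments, so new requests
// keep flowing while earlier segments are still being checked.
fn pipelined_sync(total: u64) -> u64 {
    let (tx, rx) = sync_channel::<Segment>(8);
    let downloader = thread::spawn(move || {
        for i in 0..total {
            // In a real node this would be a network request.
            tx.send(Segment(i)).expect("validator hung up");
        }
    });
    let mut validated = 0u64;
    for seg in rx {
        // In a real node: verify the segment against the MMR root.
        assert_eq!(seg.0, validated);
        validated += 1;
    }
    downloader.join().expect("downloader panicked");
    validated
}
```

The point of the bounded buffer is exactly the suggestion above: validation of old segments and requests for new ones overlap instead of alternating.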
Step 2) p2p protocol started; around 500 kB/s according to the wallet, with 24 peers.
According to the wallet, the step 2 chain state for sync is 1314 MB, and it took from about 17:48:11 to 17:52, so only 4 minutes for more than 1 GB: around 6 MB/s based on the disk space increase, and around 50 MB/s according to the network monitor.
Now that is the speed we also want in step 1.
Step 3) Preparing for chain-state validation, 17:52-17:56, took about 4 minutes; CPU at 7.5%, so apparently a lot of calculations. Done.
Step 4) Sync step 4/7: Validating chain state - range proofs, 17:56-17:57. Grin not responding; one core still working at full power (7.5% total CPU), with high power usage on and off. Giving it some time, since it does not appear to be hanging. Looking at the logs, there is just a lot of verify_kernel_signatures. This all looks good, except that it ends in a shutdown, probably because the normal p2p connection is not maintained/prioritized. Will run another test to see if this shutdown is a one-time problem.
Update: 2nd run, everything went smoothly, around 1.5 hours for the sync.
Update: 3rd run, everything went smoothly, no crashes, sync in 1 hour and 15 minutes.
Update: 4th run, enabled 16 threads instead of the default 4. Sync time 1 hour and 37 minutes, so too many HTTP threads might even slow down syncing.
Update: 5th run, enabled 32 threads and 32 peers. Sync time was 1 hour and 18 minutes. More peers and more threads are not faster, it appears, or at least it saturates at lower numbers.
Update: 6th run, using the default 8 peers and 4 threads.
20230629 18:02:15.353 ERROR grin_p2p::peers - connected_peers: failed to get peers lock
20230629 18:02:15.353 ERROR grin_p2p::peers - connected_peers: failed to get peers lock
20230629 18:02:15.353 DEBUG grin_servers::common::adapters - send_block_request_to_peer: can’t send request to peer PeerAddr(107.174.186.153:3414), not connected
20230629 18:02:15.353 ERROR grin_p2p::peers - connected_peers: failed to get peers lock
20230629 18:02:15.353 DEBUG grin_servers::common::adapters - send_block_request_to_peer: can’t send request to peer PeerAddr(192.3.254.163:3414), not connected
20230629 18:02:15.353 ERROR grin_p2p::peers - connected_peers: failed to get peers lock
20230629 18:02:15.353 DEBUG grin_servers::common::adapters - send_block_request_to_peer: can’t send request to peer PeerAddr(45.77.150.172:3414), not connected
20230629 18:02:15.353 ERROR grin_p2p::peers - connected_peers: failed to get peers lock
20230629 18:02:15.353 DEBUG grin_servers::common::adapters - send_block_request_to_peer: can’t send request to peer PeerAddr(121.43.174.18:3414), not connected
20230629 18:02:15.353 DEBUG grin_p2p::conn - Shutting down reader connection with 121.43.174.18:3414
20230629 18:02:15.353 ERROR grin_p2p::peers - connected_peers: failed to get peers lock
20230629 18:02:15.353 DEBUG grin_servers::common::adapters - send_block_request_to_peer: can’t send request to peer PeerAddr(23.94.105.34:3414), not connected
20230629 18:02:15.353 DEBUG grin_p2p::conn - Shutting down reader connection with 23.94.105.34:3414
…More of the same, waiting then dropping peers, then a complete Shutdown!
20230629 18:02:15.356 DEBUG grin_p2p::peer - Waiting for peer PeerAddr(107.175.127.117:3414) to stop
20230629 18:02:15.356 DEBUG grin_p2p::conn - waiting for thread ThreadId(923) exit
20230629 18:02:15.356 DEBUG grin_p2p::conn - waiting for thread ThreadId(924) exit
20230629 18:02:15.356 WARN grin_servers::server - Shutdown complete
RESTARTING NODE
After the restart, we are at step 7). This takes a bit (hanging at 99%, probably downloading the last <48h of blocks). Done after 1 hour and 30 minutes of syncing; most time was consumed by getting headers, more than an hour. Not bad, but I think step 1 can be made faster by making sure there are always outstanding requests and new PMMRs are stored in some buffer.
Full log in DEBUG mode can be found here:
https://github.com/Anynomouss/grin_PIDB_testing/blob/main/grin-server-run1%2C%201.5h.log2
Thoughts on PIBD:
Step 1, syncing headers, is the slowest (0.75 MB/s write speed). After reading up, it turns out this step is not part of PIBD. Not sure why it is slow; probably because it is single-threaded, or because requests of 32 headers each are too small.
Step 2 shows the potential speedup, in my case 50 MB/s.
Step 3 is fast.
Steps 4-5 (validating range proofs and kernels) take some time since they involve a lot of calculations to validate the state; this is fine and cannot be avoided.
Step 6 is super fast.
Step 7, downloading the last 48 hours of blocks, is a bit slow; it takes around 10-15 minutes.
My suggestions:
- Check in step 1 whether there is any waiting/validating before requesting new PMMRs; eliminate the waiting and use the full download bandwidth, either with more requests or with larger ones.
- Step 7 takes a bit of time, 10-15 minutes. My question is: do we strictly have to wait to download those last 48 hours' worth of blocks? I think we can download them during steps 3-6, just keep them in RAM and add/validate them after step 6.
If the above speedups are possible, I think we can easily cut the sync time in half, since basically 1 hour out of 1.5 hours is spent on step 1. Also, waiting to download the last blocks in step 7 is not needed; it can be done in parallel.
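The second suggestion amounts to prefetching: collect the recent blocks into a RAM buffer while steps 3-6 run, then apply them once validation finishes. A rough sketch of that shape (hypothetical types and names, not grin's actual code):

```rust
use std::thread;

// Stand-in for a full block from the last ~48 hours.
struct Block(u64);

// Fetch recent blocks concurrently with chain-state validation, and
// apply the buffered blocks only after validation has finished.
fn sync_with_prefetch(recent: u64) -> u64 {
    // Fetcher fills a RAM buffer while the validation below runs.
    let fetcher = thread::spawn(move || (0..recent).map(Block).collect::<Vec<Block>>());
    // Stand-in for steps 3-6: validating the downloaded chain state.
    let state_valid = (0u64..1_000).sum::<u64>() == 499_500;
    // Apply the buffered blocks only once the state checks out.
    let buffered = fetcher.join().expect("fetch thread panicked");
    if state_valid { buffered.len() as u64 } else { 0 }
}
```

In other words, the block download of step 7 happens in the background during steps 3-6; the only thing that must stay sequential is applying those blocks after the state is validated.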
Edit: Phase 1 is just the normal requesting of headers, not even part of PIBD. This should be sped up, as it says in the RFC:
Since nodes already maintain a block header MMR, we could re-use the existing segment infrastructure to introduce block header segments. Although further investigation is required, this has the potential to speed up the header sync, which is the first step in the synchronization process.
Either the above, or we can increase the request size, which is now 32 blocks per request, and see if making it bigger could lead to a speedup.
Some tests of mine:
- Windows 10: strange that the node auto-closed at startup; I had to re-open it. The sync time is around 4 hours (compared to 2 hours in the old alpha version).
- Ubuntu: also took around 4 hours; seems more stable than Windows.
One special trick I found: if the node gets stuck during the sync or the peer count suddenly drops, try closing the node, then removing the directory chain_data/peer. Re-open the node and the sync resumes better.
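As commands, the trick looks something like this. The data directory path is an assumption (the usual mainnet default); adjust GRIN_DIR to your setup, and stop the node before running it:

```shell
# Stop the node first, then remove only the peer database.
# GRIN_DIR is assumed to be the default mainnet data dir; override
# it if your node keeps chain_data somewhere else.
GRIN_DIR="${GRIN_DIR:-$HOME/.grin/main}"
rm -rf "$GRIN_DIR/chain_data/peer"
# Restart the node afterwards; it rediscovers peers from the seeds.
```

Only the peer database is removed; the synced chain data itself stays intact.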
I also found another bug on Windows: in the TUI 'Peers and Sync' view, if there are many entries and you use mouse scroll, the grin node will crash and close. I tested on Linux and it did not happen. There is nothing special in the log:
20230702 09:00:52.106 ERROR grin_util::logger -
thread 'main' panicked at 'attempt to add with overflow': /rustc/90c541806f23a127002de5b4038be731ba1458ca\library\core\src\ops\arith.rs:111
0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
5: <unknown>
6: <unknown>
7: <unknown>
8: <unknown>
9: <unknown>
10: <unknown>
11: <unknown>
12: <unknown>
13: <unknown>
14: <unknown>
15: <unknown>
16: <unknown>
17: <unknown>
18: <unknown>
19: <unknown>
20: <unknown>
21: <unknown>
22: <unknown>
23: <unknown>
24: <unknown>
25: <unknown>
26: <unknown>
27: <unknown>
28: <unknown>
29: <unknown>
30: <unknown>
31: <unknown>
32: <unknown>
33: <unknown>
34: <unknown>
35: <unknown>
36: <unknown>
37: <unknown>
38: BaseThreadInitThunk
39: RtlUserThreadStart
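'attempt to add with overflow' is the debug-build behavior of a plain integer addition in Rust; a release build would wrap silently instead of panicking. A hypothetical scroll-offset update illustrating the failure mode and the usual fix (these names are illustrative, not the actual TUI code):

```rust
// Plain `+` panics in debug builds when the sum exceeds the type's
// range, which matches the TUI crash above.
fn scroll_down(offset: u16, step: u16) -> u16 {
    offset + step
}

// saturating_add clamps at u16::MAX instead of panicking, so mouse
// scrolling past the end of a long peer list stays safe.
fn scroll_down_safe(offset: u16, step: u16) -> u16 {
    offset.saturating_add(step)
}
```

If the scroll handler somewhere adds the wheel delta to an offset without such a guard, a long peer list plus fast scrolling would trigger exactly this panic.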
My last two comments were tested on beta 2.
Bump! Testing from time to time would be good.
@noobvie @Anynomous @waynegeorge @l33d4n @ardocrat @trab @Trinitron
My PIBD sync gets stuck at 100% on a 2-month-old node; there is no high CPU load and no disk writes/reads. I am debugging to find a reason/solution.