Identifying bottlenecks in PIBD
I will use this topic as a public notebook to log the results of experiments to fine-tune parameters used in PIBD, with the goal of reducing sync time for the grin rust node and hopefully soon also for Grin++ nodes. I will modify this post and add new results for each experiment.
My motivation for this little exploration is simple. When I tested the sync speed for grin rust this morning, I was surprised how long it took to sync all chain data. I believe there are many low-hanging fruits, such as parameters that can be fine-tuned without even needing to change any of the real messaging code for PIBD. For this post I will limit my exploration to Step 1, syncing the headers: this step is not resource intensive while being responsible for >70% of the downloaded data, making it by far the easiest and most worthwhile sync step to optimize.
First observation: Step 1, downloading block headers, accounts for 70% of all data downloaded and took around 3 hours to download 3.2 GB of stored data (ignoring overhead). That is a download speed of ~300 KB/s, which is rather underwhelming. From there on I started experimenting with some of the default PIBD parameters to see which ones cause the bottlenecks and the slow download speed.
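As a sanity check on that figure, the average throughput can be back-calculated from the ~3 hours and ~3.2 GB measured above (a rough back-of-the-envelope calculation, nothing more):

```rust
// Rough average-throughput check for Step 1: ~3.2 GB of header data
// downloaded in ~3 hours. Figures taken from the measurements above.
fn throughput_kb_per_sec(total_bytes: f64, seconds: f64) -> f64 {
    total_bytes / 1024.0 / seconds
}

fn main() {
    let total_bytes = 3.2 * 1024.0 * 1024.0 * 1024.0; // ~3.2 GB
    let seconds = 3.0 * 3600.0;                       // ~3 hours
    let kb_s = throughput_kb_per_sec(total_bytes, seconds);
    // Works out to roughly 311 KB/s, matching the ~300 KB/s figure above.
    println!("average download speed: {:.0} KB/s", kb_s);
}
```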
Warning: These are just exploratory experiments performed on particular hardware with a specific internet speed. Any findings here would require much more testing to verify they do not cause problems on different hardware and internet speeds. E.g. slow peers and mobile nodes might require their own specific set of parameters.
STEP 1: Experiment 1A - Increasing HEADER_BATCH_SIZE from 32 to 256 and HEADER_IO_TIMEOUT from 5000 ms to 20000 ms.
This led to a decrease in syncing time of approximately 3x for Step 1: Syncing headers. With 32 headers per request, it took around 3 hours (~180 minutes) to download all header data. With 256 headers per request, the download time was reduced to ~55 minutes. Note that this step is not resource intensive: CPU <1%, around 15 MB RAM usage.
A 3x reduction in sync time for Step 1 from a single parameter change is promising.
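The main effect of a larger batch size is fewer request/response round trips. A minimal sketch of the arithmetic, assuming one batch of headers per request and using the chain height (~3,175,504) that appears in a log line later in this post:

```rust
// Fewer round trips with a larger batch: number of header requests
// needed to cover the whole chain, assuming one batch per request.
// The height is taken from a log line later in this post (~3,175,504).
fn requests_needed(height: u64, batch_size: u64) -> u64 {
    (height + batch_size - 1) / batch_size // ceiling division
}

fn main() {
    let height = 3_175_504;
    // Batch 32 needs ~99,235 requests; batch 256 needs ~12,405,
    // i.e. roughly 8x fewer round trips.
    println!("batch  32: {} requests", requests_needed(height, 32));
    println!("batch 256: {} requests", requests_needed(height, 256));
}
```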
STEP 1: Experiment 1B - Same as Experiment 1A, but increasing the HEADER_IO_TIMEOUT to 60000 ms:
Results show a slight drop in download speed, so increasing the timeout further appears counterproductive. Note that I switched to much more detailed logging, which might reduce the speed a bit; also, 1 of the 8 peers was Grin++, which does not support PIBD yet, meaning speed should drop by ~12.5% (1/8 × 100).
STEP 1: Experiment 1C - Reducing the HEADER_IO_TIMEOUT again to the default of 5000 ms, increasing HEADER_BATCH_SIZE to 512
- Time from request to receiving a response is still low, e.g. 100-1000 ms.
- Some periodic warnings about headers being refused by the chain ("not found"); I should check this error. I did not notice it much before; perhaps reducing HEADER_IO_TIMEOUT is the cause. I should increase it again to 20000 ms and test.
- A small increase in resource consumption; CPU now peaking at 3% at times.
- Not faster despite the larger request size, perhaps even slower than before. Next: increase HEADER_IO_TIMEOUT and test again.
STEP 1: Experiment 1D - Increasing the HEADER_IO_TIMEOUT to 30000 ms, increasing HEADER_BATCH_SIZE to 512
- Does not appear to be faster
STEP 1: Experiment 1E - Same as 1D but increasing peers from 8 to 16 and blocking the specific peer that sends empty blocks
- Surprisingly, this appears to be slower, especially at the start. I think it is because peers need to be filtered; some, like an old 5.0 node that sends empty blocks, need to be filtered out.
Speed per peer appears to drop significantly, but it is unclear why. I see quite a few 5.1 and 5.2 nodes; my guess is that they are not syncing well for some reason.
SUMMARY of findings
- Step 1 is not resource intensive and accounts for 70% of downloaded chain data, making it the most worthwhile step to optimize.
- Best result so far is a 3x reduction in download time for this step.
- Increasing the HEADER_BATCH_SIZE leads to a significant reduction in download time, but requires increasing the HEADER_IO_TIMEOUT to avoid "header reused by chain …DB Not Found Error: BLOCK HEADER: XXXXXXXXXXXX" errors
- Increasing HEADER_BATCH_SIZE beyond 256 does not lead to a decreased sync time on this specific hardware/network.
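One hypothesis consistent with the findings above is that header sync is latency-bound rather than bandwidth-bound: total time ≈ number of requests × effective per-request time, where the per-request time grows with batch size because larger payloads take longer to serve and transfer. The per-request times below are back-solved from the measured sync times in this post, not measured directly; treat this as an illustrative model only:

```rust
// Latency-bound model of Step 1: total_time ≈ requests × per-request time.
// Per-request times are back-solved from the measured sync times in this
// post (3 h at batch 32, ~55 min at batch 256); illustrative only.
fn sync_time_secs(height: u64, batch_size: u64, per_request_ms: f64) -> f64 {
    let requests = (height + batch_size - 1) / batch_size; // ceiling division
    requests as f64 * per_request_ms / 1000.0
}

fn main() {
    let height = 3_175_504;
    // ~109 ms/request reproduces the ~180 min measured at batch 32:
    println!("batch  32: {:.0} min", sync_time_secs(height, 32, 109.0) / 60.0);
    // ~266 ms/request reproduces the ~55 min measured at batch 256; the
    // higher per-request time would explain why returns diminish past 256.
    println!("batch 256: {:.0} min", sync_time_secs(height, 256, 266.0) / 60.0);
}
```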
- The log line "successfully hydrated (empty) block: 0003f4ac2f8b at 3175504 (v3)" leads to a "block not found" error. Investigate what these "empty blocks" are. Note, it is always one single peer that causes this error; investigate:
20250204 15:44:18.815 DEBUG grin_servers::common::adapters - Received compact_block 00002059283d at 3175499 from **162.19.139.184:13414** [out/kern/kern_ids: 1/1/0] going to process.
Useful resources: