I have spent the past couple of weeks getting back up to speed after taking some time off in January.
I am currently focused on “block archival support” and have made progress on some ancillary work necessary to support this.
Tracking issue for “block archival support” -
Investigation into block archival identified a performance issue related to block sync and how we identify “missing” blocks that need requesting from peers -
This is resolved here -
Also uncovered an issue where requesting too many blocks (i.e. during archival sync) risks the requesting node being banned by its peers for “abusive” behavior (p2p msg rate limit exceeded).
As part of block archival support this has been reworked to introduce “rate limiting”, keeping p2p message rates within the limit so we do not risk getting banned.
Fix is here -
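To make the shape of this a bit more concrete, here is a minimal sketch (illustrative only, not the actual grin code) of the kind of throttle involved: cap the number of outbound block requests per time window and sleep once the budget for the current window is used up.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Caps outbound requests at `max_per_window` per `window`,
/// sleeping when the budget for the current window is exhausted.
struct RequestThrottle {
    window: Duration,
    max_per_window: usize,
    window_start: Instant,
    sent_in_window: usize,
}

impl RequestThrottle {
    fn new(window: Duration, max_per_window: usize) -> Self {
        RequestThrottle {
            window,
            max_per_window,
            window_start: Instant::now(),
            sent_in_window: 0,
        }
    }

    /// Blocks until one more request can be sent without exceeding
    /// the per-window budget.
    fn wait_for_slot(&mut self) {
        if self.window_start.elapsed() >= self.window {
            // New window: reset the budget.
            self.window_start = Instant::now();
            self.sent_in_window = 0;
        }
        if self.sent_in_window >= self.max_per_window {
            // Budget exhausted: sleep out the remainder of the window.
            thread::sleep(self.window.saturating_sub(self.window_start.elapsed()));
            self.window_start = Instant::now();
            self.sent_in_window = 0;
        }
        self.sent_in_window += 1;
    }
}

fn main() {
    // e.g. at most 50 block requests per 10-second window.
    let mut throttle = RequestThrottle::new(Duration::from_secs(10), 50);
    for _height in 1..=200u64 {
        throttle.wait_for_slot();
        // the actual block request would be sent here
    }
}
```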
Also tracked down the “block not found” error that we occasionally encounter during node restart.
This is actually related to some PIBD initialization code: we were not fully accounting for an edge case in how we rewind based on the most recent archival period (every 720 blocks). The fix is minor and I went ahead and merged it.
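For context, the arithmetic involved here is just rounding a height down to the most recent archival period boundary. A tiny illustrative sketch (the constant and function names are mine, not the actual code):

```rust
/// Illustrative constant for the archival period mentioned above
/// (every 720 blocks); not the actual name used in the codebase.
const ARCHIVAL_PERIOD: u64 = 720;

/// Round a chain height down to the most recent archival period
/// boundary, e.g. 1_000 -> 720, and anything below 720 -> 0.
fn last_archival_boundary(height: u64) -> u64 {
    height - (height % ARCHIVAL_PERIOD)
}
```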
I have some code running locally that successfully begins requesting historical blocks (currently from a known set of “preferred” archival peers).
I am hoping to get some of this work into a PR’able state over the next couple of days. Roughly speaking, it suppresses the “state sync” (txhashset.zip) when running in “archival mode”, forcing block sync to request missing blocks from height 1 onwards. Now that we can successfully rate limit the p2p requests we can simply continue block sync for all missing blocks up to the current height.
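Roughly, the sync decision looks something like the sketch below. All names here are hypothetical, but it captures the behavior described above: in archival mode we skip the snapshot entirely and block sync does all the work from height 1.

```rust
/// Hypothetical sketch of the sync decision; none of these names are
/// taken from the actual grin codebase.
enum SyncStep {
    /// Download and validate the txhashset.zip snapshot ("state sync").
    StateSync,
    /// Request full blocks starting from the given height.
    BlockSync { from_height: u64 },
}

fn next_sync_step(archive_mode: bool, local_height: u64) -> SyncStep {
    if archive_mode {
        // Archival nodes skip the snapshot and sync every missing
        // block from height 1 up to the current chain tip.
        SyncStep::BlockSync {
            from_height: local_height.max(1),
        }
    } else {
        // Non-archival nodes fast-sync via the txhashset snapshot first.
        SyncStep::StateSync
    }
}
```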
To release this fully we will need to support peer selection/filtering based on archival node support, so there is still some work to do there. I suspect we may still want to take advantage of “preferred peer” support to ensure we have good connectivity when starting up a new archival node. But I think that’s acceptable: if a node opts in to archival mode and wants to reliably sync full history, it needs to know at least a couple of archival nodes to sync from.
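As a rough illustration of the peer selection side (the types and fields here are hypothetical, not the actual p2p code): filter candidate peers down to those advertising archival support and put any configured “preferred” peers at the front of the list.

```rust
/// Hypothetical peer info; a real node would take this from the
/// capabilities a peer advertises during the handshake.
struct PeerInfo {
    addr: String,
    supports_archive: bool,
    preferred: bool,
}

/// Pick peers suitable for archive block requests: only peers that
/// advertise archival support, with configured "preferred" peers first.
fn archive_sync_candidates(peers: &[PeerInfo]) -> Vec<&PeerInfo> {
    let mut candidates: Vec<&PeerInfo> =
        peers.iter().filter(|p| p.supports_archive).collect();
    // `false` sorts before `true`, so preferred peers end up first.
    candidates.sort_by_key(|p| !p.preferred);
    candidates
}
```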
I am aiming for some kind of “beta” of this in the next week or so and hopefully we can be doing some wider testing of this functionality soon after that.
Continuing to make progress on “full archival sync”.
Tracking issue here -
I have a full archival sync running locally and it runs to completion with some caveats.
Hoping to have this out for a wider beta release later this week.
The big caveat is that the chain compaction process is slow when used alongside archival sync.
This led me down a bit of a rabbit hole exploring various options for making chain compaction more efficient (or doing it less often, limiting the volume of data being compacted, etc.).
This is still a continuing investigation.
The plan for the next few days is to park the compaction investigation and move “archival sync” forward so others can test it out, then pick the compaction investigation back up after that.
Ideally archival sync runs (albeit slowly) without a significant pause at the end for chain compaction.
Why even run chain compaction if the node is “archival”? What is there to compact?
An archival node maintains a full block history in the local db.
Compaction allows us to maintain the PMMR data structures in an efficient way (we can prune and compact historical spent outputs). Because the full block history is kept, this PMMR data can still be pruned, and it is desirable to do so as it is effectively redundant data on an archival node (and takes up significant disk space beyond the blocks db).
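As a rough sketch of the idea (hypothetical types, not the actual chain code): the leaf data worth pruning is exactly the outputs spent at or before the horizon, since the blocks db already records that history.

```rust
/// Hypothetical leaf record: the PMMR position of an output and the
/// height (if any) at which it was spent.
struct OutputLeaf {
    pos: u64,
    spent_at_height: Option<u64>,
}

/// Positions whose leaf data is safe to prune: outputs spent at or
/// before the horizon. On an archive node the full blocks in the db
/// remain the canonical record of this history.
fn prunable_positions(leaves: &[OutputLeaf], horizon_height: u64) -> Vec<u64> {
    leaves
        .iter()
        .filter(|l| matches!(l.spent_at_height, Some(h) if h <= horizon_height))
        .map(|l| l.pos)
        .collect()
}
```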
As long as it is less than Bitcoin’s full node, you will not hear me complain. Archive nodes are probably only for those interested in investigating the blockchain, who will have the space available for it. E.g. I would probably export/convert it to a graph database, which is not exactly memory saving.
Continuing to make progress on “archival sync” support.
As part of this I cleaned up and improved our peers_allow and peers_preferred configuration options to make these more robust and useful during server restart.
These configuration options have been invaluable during “archival sync” development and testing and will continue to be useful on mainnet for early adopters of “archive mode”. We have a “bootstrapping” problem here as we can only archival sync from other archive nodes and we need to both identify these and direct archive block requests to them robustly.
Allowing nodes to configure lists of “preferred” peers is very useful here as we can increase connectivity with a known population of archive nodes. Over time this will become less critical.
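A minimal sketch of how preferred peers can help at startup (function and parameter names are hypothetical): seed the outbound connection list from peers_preferred before falling back to whatever normal peer discovery has found.

```rust
/// Hypothetical sketch: on restart, try the configured preferred peers
/// first so an archive node quickly gains connectivity to known archive
/// peers, then fill the remaining slots from discovered peers.
fn initial_connect_list(
    peers_preferred: &[String],
    discovered: &[String],
    max_outbound: usize,
) -> Vec<String> {
    let mut list = Vec::new();
    for addr in peers_preferred.iter().chain(discovered.iter()) {
        if list.len() >= max_outbound {
            break;
        }
        if !list.contains(addr) {
            list.push(addr.clone());
        }
    }
    list
}
```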
It turns out legacy v2/v3 block serialization support does not play nicely with archive sync.
While an archive node maintains a full history of full blocks, it still cannot rewind back beyond the horizon, and there is no way to reconstruct a v2 full block based only on a v3 block.
We no longer need to support v2 blocks now that we are past the final HF4. PR to clean this up and remove the old support -
This makes archive nodes quietly drop legacy block requests from old nodes on the network (pre-HF4, even pre-HF3) that still hang around periodically asking for old blocks.
I think ideally we will start more aggressively banning these stale nodes but for now we can just ignore them.
The above PRs are the final changes needed to allow us to enable archival sync support.
The task of enabling this is in PR -
Proposing to discuss actual rollout of this as part of a 5.1.0 release, potentially in the next couple of weeks. This would be dependent on commitment from enough archive node operators willing to upgrade shortly after the release to help with the “bootstrapping” issue (discussed in the PR).
Some people (specifically @quentinlesceller) are seeing really slow archive sync performance.
I suspect investigating this is outside the scope of this initial PR, and it looks limited to a specific local environment (a slow disk, specifically). But we do need to investigate, and there is definitely room to improve performance here over time.
Recommendation would be to only run an archival node if you have reasonably fast hardware and a decent disk (and sufficient disk space).
Seriously late posting a status update, as I have not posted anything since before the emergency HF…
Most of my focus has been on post-HF mitigation and improvements.
A lot of it is summarized here -
Currently testing a “manual” approach to invalidating headers and “resetting chain state” here -
Which, in conjunction with the improvement to “sync_head” tracking, should get us to a far more stable place if we ever need to navigate a large fork (emergency or otherwise) again in the future -
Q1 came to a close a week ago and my funding period has now expired.
I dropped the ball on keeping this running uninterrupted and totally forgot it was end of March until we actually hit the end of March…
On my todo list is getting it together and posting a follow-on funding request for Q2 to the forum. I hope to have this posted by the end of the week, ideally in time to discuss it in the governance meeting next week.
I’m planning to write at least a minimal retrospective of my attempt at a “deliverable-based funding request”, which did not necessarily go 100% to plan (given the intrusion of the emergency HF, etc.).