Lessons learned from our first public testnet.
One testnet down, many more to come
A week ago we announced that we were launching a large public testnet. The testnet launched successfully and ran for a week, proving for the first time that a production-spec eth2 testnet is feasible.
When we launched the testnet we stated "We're going to start trying to crash this testnet and I suspect we'll be successful." The testnet did indeed crash, twice. Once on Saturday morning and again on Monday morning (Sydney time). We managed to recover the testnet the first time (after > 100 epochs without finalization!) but we decided to let it die the second time.
When we say "crash", we mean the testnet consistently failed to finalize. The reason it did not finalize is that more than 1/3 of the validators went offline. This testnet was not designed to stay up; instead, it was designed to fail fast and loud. The backbone of the testnet was four small AWS instances (4gb RAM, 32gb SSD), each functioning as a public boot node and holding 4,096 validators. In fact, we're quite impressed that it lasted as long as it did; this is a hefty load for a few small machines, and all it takes is two of those nodes going offline to prevent finality.
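To make the arithmetic concrete, here is an illustrative sketch (not Lighthouse code) of the finality threshold: Casper FFG requires attestations from more than 2/3 of the validator set, so with four equal-sized nodes, losing any two (50% of validators) stalls finality.

```rust
// Finality requires attestations from more than 2/3 of the total
// validator set; we check that with integer arithmetic to avoid
// floating-point edge cases.
fn can_finalize(online_validators: u64, total_validators: u64) -> bool {
    3 * online_validators > 2 * total_validators
}

fn main() {
    let per_node = 4_096u64;
    let total = 4 * per_node;

    // All four boot nodes up: finality proceeds.
    assert!(can_finalize(4 * per_node, total));
    // One node down (75% online): still above the 2/3 threshold.
    assert!(can_finalize(3 * per_node, total));
    // Two nodes down (50% online): below the threshold, finality stalls.
    assert!(!can_finalize(2 * per_node, total));
    println!("two of four nodes offline is enough to prevent finality");
}
```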
We've analysed these crashes and learnt a lot (details in a later section). The team is back in development mode and looking forward to launching a new testnet next week (or perhaps the following week, if holiday commitments interfere). You can follow our progress at the v0.1.1 milestone.
The primary cause of the crash
The immediate cause of the testnet crash was a loop in the networking stack which saw the same attestation published again and again. This loop occurred in two of the four primary nodes and exhausted their resources, preventing them from producing blocks and attestations. This single issue was the immediate cause of both crashes.
We've updated our gossipsub implementation so that each message is addressed by its contents, meaning that if we receive two messages with identical content, the gossipsub protocol will ignore the second instance. We've also added checks for duplicate messages in the Lighthouse code itself, which further prevents duplicate messages from being sent or received.
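The idea behind content-addressing can be sketched as follows. This is a hypothetical illustration, not Lighthouse's actual gossipsub code: the message id is derived from the message bytes, so a second copy of the same payload is recognised and dropped rather than re-propagated.

```rust
use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Tracks the hashes of payloads we've already seen. Names here are
// illustrative; a real implementation would use a cryptographic hash
// and bound the cache's size.
struct SeenCache {
    seen: HashSet<u64>,
}

impl SeenCache {
    fn new() -> Self {
        SeenCache { seen: HashSet::new() }
    }

    /// Returns true if the payload is new and should be propagated,
    /// false if an identical payload was already seen.
    fn accept(&mut self, payload: &[u8]) -> bool {
        let mut hasher = DefaultHasher::new();
        payload.hash(&mut hasher);
        // `insert` returns false when the hash was already present.
        self.seen.insert(hasher.finish())
    }
}

fn main() {
    let mut cache = SeenCache::new();
    let attestation = b"attestation: slot 42, committee 7";

    assert!(cache.accept(attestation)); // first copy: propagate
    assert!(!cache.accept(attestation)); // identical copy: ignore
    println!("duplicate attestation ignored");
}
```

With ids derived from content rather than from sender-assigned sequence numbers, a looping republisher can no longer flood peers with the same attestation.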
Secondary effects of the crash
After two of the primary nodes went down, finalization became impossible (50% of validators were out of action). However, the remaining nodes continued to publish and receive blocks. This is what they're supposed to do; however, without finality they were unable to prune and compact their databases, causing their databases to fill up at several GB per hour. Because we restricted our testnet nodes to 32gb drives (including the OS), the drives eventually filled up and the nodes stopped accepting new blocks. This resulted in the remaining two nodes going offline.
We were able to bring the testnet back up by doubling the capacity of the drives and simply restarting Lighthouse. We were quite pleased with how the nodes recovered after this crash; some nodes with large drives did not even go down throughout all this chaos.
As I write this, Michael is building a solution to this problem which will reduce the database inflation by a factor of 32. Whilst we were pleased to see the nodes revived after 100 epochs without finality, this only equates to ~10 hours of survivable network instability for nodes with drives smaller than 64gb. Resilience is very important to Lighthouse, and Michael's new changes should extend this to 13 days.
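The back-of-the-envelope arithmetic works out like this. The free-space and growth-rate figures below are illustrative assumptions consistent with the post: roughly 32gb of usable space filling at a few GB per hour gives about 10 hours, and cutting the growth rate by 32x stretches that to roughly 13 days.

```rust
// How long a node survives without finality, given free disk space and
// the rate at which the unpruned database grows.
fn survivable_hours(free_gb: f64, growth_gb_per_hour: f64) -> f64 {
    free_gb / growth_gb_per_hour
}

fn main() {
    let free_gb = 32.0; // assumed usable space on a 64gb drive
    let growth = 3.2;   // assumed "several GB per hour"

    let before = survivable_hours(free_gb, growth);
    let after = survivable_hours(free_gb, growth / 32.0);

    assert!((before - 10.0).abs() < 1e-9); // ~10 hours today
    assert!((after / 24.0 - 13.3).abs() < 0.1); // ~13 days after the fix
    println!("before: {:.0} hours, after: {:.1} days", before, after / 24.0);
}
```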
We also saw our fork choice times extend out to 8 seconds. In our eyes this is unacceptable and needs to be addressed. We understand that these times are due to excessive loading of the BeaconState from disk, and we already have a PR in the works to address this.
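The shape of the problem can be sketched with a hypothetical example (the types and names below are placeholders, not Lighthouse's actual API): if every fork choice run deserializes a BeaconState from the database, each run pays the full load cost, whereas a small in-memory cache keyed by block root pays it only once.

```rust
use std::collections::HashMap;

type BlockRoot = [u8; 32];

// Stand-in for the real (much larger) beacon state.
#[derive(Clone)]
struct BeaconState {
    slot: u64,
}

// Caches states in memory so repeated fork-choice runs avoid
// re-deserializing the same state from disk.
struct StateCache {
    cache: HashMap<BlockRoot, BeaconState>,
    disk_loads: u32, // counts the expensive database reads
}

impl StateCache {
    fn new() -> Self {
        StateCache { cache: HashMap::new(), disk_loads: 0 }
    }

    /// Returns the state for `root`, hitting "disk" only on a cache miss.
    fn get(&mut self, root: BlockRoot) -> BeaconState {
        if let Some(state) = self.cache.get(&root) {
            return state.clone();
        }
        self.disk_loads += 1; // expensive path: load from the database
        let state = BeaconState { slot: 0 };
        self.cache.insert(root, state.clone());
        state
    }
}

fn main() {
    let mut states = StateCache::new();
    let head = [0u8; 32];

    // Eight fork-choice runs over the same head touch disk only once.
    for _ in 0..8 {
        states.get(head);
    }
    assert_eq!(states.disk_loads, 1);
    println!("8 lookups, {} disk load(s)", states.disk_loads);
}
```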
Feedback from the community
It was great to see people getting involved with Lighthouse and running their own validators. We saw 400+ validators join the network! We appreciated your feedback and we took note of the following recurring suggestions:
- Faster sync times: we're working on this! Expect 1.5-2x gains in v0.1.1!
- Better docker docs: Scott is working to improve these docs and the new testnet will be deployed using docker (i.e., we'll be dog-fooding docker).
- More stable eth1 node: we provided a public eth1 node to make things easier for users, and it turned out this node was letting some people down. For the next testnet, we'll spin up a few nodes in a few different geographical regions and load-balance across them.
- More API endpoints: the beaconcha.in team reached out and asked for some more API endpoints for their block explorer. We have these in a PR that will be included in v0.1.1.