Post-Mortem Report for Network Outage | The Radix Blog | Radix DLT
On Sunday, December 3rd, 2023, the Radix Network experienced an outage starting at 22:26 UTC. Following adoption of a patched node release by the validator community, network service resumed at approximately 09:45 UTC, for a total outage time of about eleven and a half hours. No transactions, funds, or state was lost.
We will start with an abbreviated summary of what happened and what was learned from the incident, then dive into a full timeline of the events with a lightly technical description of what occurred, and finish with a completely non-technical analogy to describe it.
Executive Summary
At the conclusion of mainnet epoch 51816 all validators were performing a normal state cleanup activity before beginning the next epoch. Due to a bug in the node code which mishandled an edge case, all validator nodes mistakenly deleted some information from their execution state store. 5 minutes later, the epoch changeover caused all validators to halt operations when the corrupted state store was automatically recognized. Consensus was therefore halted, and 2 minutes later all running Gateways ceased operation as designed, at which point the outage was noticeable to all wallets, explorers, and other consumers of ledger data.
Automated alarms roused members of the development team who then followed an incident response process and convened a war room. The root cause of the incident was identified about 4 hours after the outage began, and work on a fix was undertaken, resolving the edge case bug as well as allowing all validators to restore the mistakenly deleted state using their store of transaction receipts.
Node v1.0.5 was built and released after the fix was successfully verified, and immediately recommended for uptake to the validator community.
Approximately 1 hour after the release, validators comprising over 2/3rds of stake had patched and come back online, and network liveness was restored.
Lessons Learned
While it was a tough blow for Radix to finally lose its perfect uptime record after almost two and a half years of operation, the chief takeaway was that both the team’s incident handling processes and the consensus & engine design worked quite well when faced with an emergency.
- The incident occurred at just about the worst imaginable day and time for team readiness, and service was still restored within 12 hours.
- The automated & manual escalation processes got the right people out of bed and working.
- Nodes correctly identified that the state store was inaccurate within 1 epoch.
- Consensus correctly halted.
- Gateways correctly halted.
- No transactions were lost.
- Validators did not need to accept any external state information or any kind of “rollback point” to resume operation.
- The validator community was extremely responsive and quickly adopted the updated node version.
- Liveness was automatically restored as soon as 2/3rds of stake came back online.
Having said that, there were some definite bumps in the road.
- The automated alarms had a misconfiguration and didn’t correctly get the “second wave” of respondents online. Fortunately, this was easily addressed with some manual effort.
- Not enough people had the ability to update https://status.radixdlt.com, resulting in an unnecessary delay before the incident was posted.
- The overall community communication plan wasn’t sufficiently clear (what should be posted where, and when, and by who).
- A test network running “ahead” of mainnet on epochs would have identified this problem before it occurred.
In response to this, there will be a tightening up of the incident handling process and alert system, and dry runs of both the automated and manual alert processes will be conducted. Additionally, future test networks will start with randomized epoch numbers, and long soak test networks will be initiated with a very short epoch duration in order to remove the possibility that mainnet is the first place to ever experience a particular epoch change.
Timeline
This contains some technical detail for those curious about the particulars.
2023/12/3 22:26 UTC – End of mainnet epoch 51816
The end of epoch 51816 is proposed and confirmed shortly after, bringing about the start of epoch 51817. As part of the commit of this state version, validator nodes start a cleanup process which involves deleting partition 255 of the transaction tracker entity.
Due to a bug in the handling of deletion at the exact boundary partition number of 255 (the maximum value of one byte), this triggered the undesired deletion of some of the entities and substates in the node’s current execution store. The store is used for transaction execution, and these missing entities/substates at execution time will cause crashes in the engine when executing most transactions.
Note: This bug was not previously encountered in test networks due to the fact that partition deletion only occurs as part of a cleanup process every 100 epochs, and to reach partition 255 requires 255 * 100 epochs, which is roughly 90 days. Test networks are always created starting at epoch 1, but the Babylon mainnet upgrade inherited Olympia’s end epoch of 32718, thus it had a “head start” on getting to the troublesome partition number and was the first network to encounter the problem.
At this point, the merkle tree and receipts are all still correct; but some of the execution state store is now not aligned with reality.
Epoch 51817 progresses
Through a quirk in where entities live in the database, standard round change transactions could progress unhindered, but full execution of user transactions resulted in the state store inaccuracy being detected and the validator aborting execution to prevent the possibility of further damage.
As such, the epoch is able to proceed, with running validators agreeing to empty proposals, but no user transactions are committed.
2023/12/3 22:31 UTC – Last commit near end of epoch 51817
The last committed transaction before the network ground to a halt was a RoundUpdate transaction.
After this point, the epoch had been running for 5 minutes, so the execution of any further RoundUpdate now triggers a change of epoch. The change of epoch adds a lot of additional processing during the RoundUpdate transaction, which hits a path which observes the execution store’s corruption. Therefore, every validator, after receiving the latest proposal, will attempt to “prepare” a vertex containing this RoundUpdate, notice the corruption, and abort its process to prevent further damage. The entire validator set is now stuck in a reboot loop.
Non-validating nodes, oblivious to the fact that an epoch change is due, stay up but detect that they’re out of sync.
2023/12/3 22:33 UTC – Gateway Service stops providing ledger data to queries
After 2 minutes without state updates, the Gateway activates its failsafe and returns error responses to all clients, to prevent apps relying on out-of-date information. The outage is now noticeable to all wallets and explorers.
Behind the scenes, automated alarms are firing to the 24/7 on-call rota, and an escalation call tree soon has various RDX Works team members woken up and gathered in a war room call to begin debugging.
2023/12/4 02:30 UTC – Root cause first identified; fix begins
Following various lines of investigation, a code path is discovered related to partition deletion which may cause the observed symptoms. Over the coming hour, further analysis uncovers the exact mechanism and details of the issue, and discussion turns to how to solve the issue.
Technical work is divided into three tracks: fixing the bug, fixing the state store, and preparing a means to validate the fixed store.
Fixing the bug requires adjusting how the partition delete operation was mapped to the RocksDB range delete. Fixing the store requires creating a job to run on node bootup to use the state changes from the transaction receipts to rebuild the deleted parts of the execution store. Validation of the fix requires comparing the resultant execution store database with a database obtained without the bug, as well as validation against the merkle tree.
2023/12/4 07:45 UTC – Solution verified
Code from the three tracks has been carefully reviewed and merged together and the fix tested and verified. Packaging of version 1.0.5 of the node starts.
2023/12/4 08:46 UTC – Patched node release
Node v1.0.5 is announced, and recommended as an immediate update to node runners.
2023/12/4 09:01 UTC – Initial validator response
Validators respond very quickly, and within 15 minutes of the announcement, 25% of stake is running v1.0.5.
2023/12/4 09:45 UTC – Network liveness restored
Over 66% of validator stake is running v1.0.5 and network liveness is restored. The network is now running, but with multiple unpatched validators still being called upon to lead rounds (and subsequently timing out), round times are slow and there is noticeable commit latency.
2023/12/4 10:15 UTC – Gateway Service restored
Following updates to the full nodes feeding the Foundation Gateway, the Gateway Service comes back online. The Radix Wallets and Dashboard are functional once again. Shortly after this, the incident resolution process is determined to be concluded. The war room call is closed and most participants drop off to go to sleep, with a second shift remaining for continued monitoring.
2023/12/4 15:00 UTC – Network operating at normal speed
Enough validators are back online that rounds are progressing at a good rate, and epoch length has returned to 5 minutes. The network is back to feeling snappy when transactions are submitted, and the extra monitoring staff go back to their normal activities.
Non-Technical Analogy
Imagine you work at a school, and your job is to keep track of all the after-school activities that the students are involved in on a big chalkboard. All throughout the day, people walk in and look at the chalkboard to figure out what different students are doing on a particular day.
Students can drop a slip of paper into your inbox which provides their name and an update about their participation in an activity. E.g., a series of slips might say “Alice, now playing hockey on Mondays,” “Bob, now playing football on Tuesdays,” “Alice, now playing hockey on Fridays instead.”
You have an assistant who faithfully pulls slips from that inbox one by one, numbers each one, and then hands it to you.
You stand at the chalkboard, and whenever your assistant hands you a new slip you adjust the big list of students on the left side of the chalkboard to reflect their latest signup status. So, in our example, you would start by writing two new lines, “Alice – Hockey – Mondays” and “Bob – Football – Tuesdays”. When you saw that Alice switched the day of the week she plays hockey, you would scrub out some of your line for Alice and rewrite it to say “Alice – Hockey – Fridays”. The master list doesn’t care that Alice used to play hockey on Mondays…the important thing is that she now plays on Fridays. As soon as you have updated the list, you write down the number of the slip on the right side of the chalkboard, to show that you have handled it, and give the slip back to your assistant, who safely stores it in a filing cabinet.
Students are constantly shuffling their activities around with new slips and there’s only so much space on the right side of the chalkboard, so every night before your assistant leaves they erase most of the old “already handled this slip” records.
However, one night your assistant gets confused and, while erasing the right side, they also erase a few rows of the master list. The next morning when you and your assistant get to work, you see that there are some erased rows. There shouldn’t be any erased rows! You stop accepting any new slips, and you lock the door so nobody can come in and look up the students’ activities…you first want to make sure that it accurately reflects reality.
Turns out, if you look closely, you can actually still see the names of the students whose rows were erased, so there isn’t too much work needed to get things back in order. You and your assistant go to the filing cabinet which contains the ordered set of change slips, and go through all the ones affecting kids who were erased, and replay the history until the master list is again correct. You then erect a little barrier between the left and right sides of the chalkboard, so your assistant can’t make the over-erasing mistake again, and the two of you unlock the room and announce that you’re back in business.
To relate this little tale to what happened, here’s the conceptual mapping:
- The slips of paper are transaction receipts.
- The chalkboard is the node’s execution state store.
- The assistant’s over-erasing is the implementation bug that caused the trouble.
- Locking the inbox and the room is the nodes and Gateway correctly halting after recognizing that a problem exists.
- The process of reconstructing the erased rows is how the state store was restored to the correct contents.
- Unlocking the room is the network resuming operation after sufficient validators have taken on the fix so consensus can safely proceed.
Closing Thoughts
There are additional internal post-mortem discussions still scheduled, which will doubtless lead to some further learnings and actions, but we didn’t want to delay getting this initial report out to the community.
The bug itself was an implementation detail and not an endemic architectural or design problem. Edge cases, such as the boundary condition encountered here, receive extra attention when implementing code, reviewing PRs, and creating test cases, but this one slipped through. It’s a fact of life that such things can occur in any sufficiently complex code base, and fortunately the rigorous work already done on detecting and reacting to state corruption was able to prevent any impact beyond the service outage.
Lastly, it bears repeating that the responsiveness of the Radix validator community is the reason that service was able to be restored so quickly after the underlying issue was resolved. Thank you all for exhibiting your usual professionalism, and keep up the great work!