An analysis of Ethereum’s recent chain split
December 02, 2020
A few weeks ago, the Ethereum network suffered a chain split that subsequently led to outages of various services. Afterwards, I spent a considerable amount of time trying to understand how this really happened and whether it could have been prevented. This is not only because, as a software engineer, I am naturally curious, but also because I am part of the engineering team building Corda, a distributed ledger platform that can be perceived as a competitor of some ‘enterprise’ versions of Ethereum. As a result, I am always interested in identifying areas for improvement, and I also believe different platforms in this space can learn a lot from each other. In this post, I’ll try to analyse the incident in a bit more depth and reflect on the underlying factors that contributed to it.
First of all, let’s start with how users started experiencing the issues. Some users started seeing two different chains, depending on their vantage point. For instance, Etherscan and Blockchair were showing two different chains after block 11234873. This led some exchanges, such as Binance, to temporarily disable withdrawals.
Another big provider – Infura – suffered an outage that caused a delay in price feeds of ether (ETH) and ERC-20 tokens and a general disruption in other services in the wider DeFi space.
A quick analysis of the incident
Let’s have a look at how the incident unfolded in a bit more detail. Fortunately, many of the parties that were involved have already performed very thoughtful analyses¹ that we can synthesise here. I will try and take it slow, so that people who are not extremely familiar with Ethereum can follow along.
An Ethereum network consists of a set of Ethereum nodes. Each one of those nodes is executing a piece of software, which is known as a client. There are many implementations of Ethereum clients, but they all have to follow a formal specification that defines the proper behaviour of a client. This is essential so that all these clients can agree on whether a transaction is valid or not, thus ensuring they will reach the same decision when operating on the same data. A year ago, one of these clients – Geth – released a version that contained a bug, which meant this version of the client was not fully compliant with the specification². This meant Geth and other client implementations could reach a different decision on whether a transaction is valid for a very specific category of transactions that could trigger this bug. That bug remained dormant in the codebase of Geth for almost a year until it was reported on 20th July, 2020. Soon after that, the developers of Geth fixed the bug and released a new version, v1.9.17. This aligned Geth with the other available clients that conformed to the specification, but now Geth clients from this version and onwards could potentially disagree with Geth clients of previous versions. And this is what happened.
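To make the mechanics concrete, here is a minimal Python sketch of how two versions of a client can drift apart. It is a toy model, not Geth's actual code: the names Node, execute_spec, execute_buggy and mine_block, as well as the "rare input" value 13, are all assumptions made purely for illustration. Each node re-executes the transactions of a block and accepts the block only if it arrives at the same state root as the miner.

```python
import hashlib


def execute_spec(state: int, tx: int) -> int:
    """Hypothetical spec-compliant execution: add the tx value to the state."""
    return state + tx


def execute_buggy(state: int, tx: int) -> int:
    """Hypothetical buggy execution: mishandles one rare input (tx == 13)."""
    return state + tx + (1 if tx == 13 else 0)


def state_root(state: int) -> str:
    """Stand-in for the state root that a block commits to."""
    return hashlib.sha256(str(state).encode()).hexdigest()


class Node:
    def __init__(self, name, execute):
        self.name = name
        self.execute = execute
        self.state = 0
        self.height = 0

    def accept_block(self, txs, claimed_root) -> bool:
        """Re-execute the block; reject it if the resulting root differs."""
        new_state = self.state
        for tx in txs:
            new_state = self.execute(new_state, tx)
        if state_root(new_state) != claimed_root:
            return False  # the node stays on its old head
        self.state = new_state
        self.height += 1
        return True


def mine_block(miner: Node, txs):
    """Build a block on top of the miner's current state."""
    state = miner.state
    for tx in txs:
        state = miner.execute(state, tx)
    return txs, state_root(state)


miner = Node("miner (fixed version)", execute_spec)
fixed = Node("upgraded node", execute_spec)
old = Node("non-upgraded node", execute_buggy)

# The second block contains the one transaction that triggers the bug.
for txs in ([1, 2], [13], [5]):
    block = mine_block(miner, txs)
    for node in (miner, fixed, old):
        node.accept_block(*block)

for node in (miner, fixed, old):
    print(node.name, "is at height", node.height)
```

In this sketch the non-upgraded node accepts the first block, rejects the block containing the triggering transaction and, because every later block builds on that one, rejects everything after it as well. That is the "fell out of sync" behaviour in miniature.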
For obvious reasons, the developers of Geth did not publicise the fact that the v1.9.17 release contained a fix for a critical security bug that could affect consensus. Their main argument was the following:
In this particular instance, the consensus bug was dormant in the code for over 1 year. The probability after all that time for someone to accidentally trigger it is tiny. Opposed to that, the probability of someone maliciously triggering it if highlighted as a security issue is not insignificant. The Geth team made the conscious decision not to mention it, hoping that people eventually upgrade to versions that contain the fix and the issue is gradually ejected from the network.
However, luck plays funny games! Some people building on top of Geth came across this bug and decided to do an experiment and submit a transaction that would trigger it on the Ethereum mainnet! Most of the nodes in the network that were using the Geth client had upgraded, but there were still some that hadn’t upgraded yet. As a result, these nodes that hadn’t upgraded yet fell out of sync with the rest of the network.
Infura was predominantly using the Geth client and hadn’t upgraded to v1.9.17 yet, since they didn’t know earlier versions had a serious bug and they were following their regular upgrade cadence. As a result, a consensus failure happened at the block containing this transaction, which led to a complete sync halt affecting several of their systems. Clients that were unable to use these services started retrying more and more aggressively, which overloaded other services and forced the team to temporarily disable them too. The team tried to upgrade the Geth version they were using, which turned out to be more complicated than expected because they were actually using a forked version of Geth. As soon as the upgrade to the new version of Geth was completed, the corresponding nodes were able to switch back to the right chain and their systems returned to normal operation.
A quick reflection
Incidents with such a big impact are a good opportunity for reflection. The Ethereum community has already started this process to understand how they can prevent these issues in the future. From my point of view, I can’t help but use this opportunity to understand how these risks translate in a permissioned setup and how a platform like Corda could help mitigate issues like these.
- As I explained above, the bug that caused this issue was very low-level and had to do with an incorrect implementation of specific Ethereum Virtual Machine (EVM) opcodes: RETURNDATASIZE and RETURNDATACOPY. It is not only hard to implement a full virtual machine, but it has also proved hard to use one and reason about it, judging from the many smart contracts that were built in an insecure way, with the DAO hack probably being the best known so far. For this reason, Corda makes use of the battle-tested Java Virtual Machine (JVM), which provides higher-level APIs that are easier and safer to use.
- Anyone that has experience with writing software probably knows that completely eliminating human mistakes that lead to software bugs is just wishful thinking. It is thus crucial to have a process to recover from them as smoothly as possible. Following the analysis above, it is easy to observe that the Geth developers were caught in the dilemma of whether or not they should inform users of the client software about the security fix. This is not so much a technical problem, but a problem of sociopolitical dynamics inherent to open-source software. Anyone can download and run the software without the developer necessarily knowing about it, so any communications have to be performed on a broadcast medium that can potentially reach an unintended audience. However, a commercial distribution, such as Corda Enterprise³, allows the entity developing the software to establish formal relationships with users that have commercially valuable workloads, so that disclosure can be completed in a more secure way.
- Another important contributing factor to this incident was the permissionless nature of the Ethereum network, where anyone can run a miner and validate transactions for the network. This has some natural consequences. It is difficult to establish communication channels with the miners of the network and coordinate deployments of new releases that fix critical security bugs before they can be exploited. It is also harder to get any guarantees from the miners around things like the update cadence of the client software. In the Corda world, the notary service that is responsible for providing consensus on the network and preventing things like double-spends is entrusted to specific parties that will typically be called to provide some SLA guarantees to the rest of the network. There can also be many different notaries available in a network and application developers or users can choose which ones to use, so they can hold them accountable for their actions and switch to those who do their jobs best.
- Another interesting aspect is the data model itself. The Ethereum blockchain is essentially a linear chain of blocks of transactions, where every block refers to the previous block. As a direct consequence, a single problematic transaction can cause a problem for the whole blockchain, as confirmed by Infura’s outage. In contrast, the ledger of a Corda network consists conceptually of multiple “chains”, where each chain only contains interrelated transactions. Apart from privacy and performance benefits, this also provides some nice fault-tolerance characteristics as a single problematic transaction can only cause issues for the parties involved in the associated chain, while the rest of the network can keep functioning as normal.
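The difference between the two data models in the last point can be sketched in a few lines of Python. This is a deliberately simplified model under my own assumptions, not Corda's or Ethereum's actual implementation: the functions affected_linear and affected_dag and the transaction ids are hypothetical. In the linear model, everything ordered after a problematic transaction is stuck behind it, while in the graph model only the transactions that (transitively) consume its outputs are affected.

```python
from collections import defaultdict, deque


def affected_linear(tx_order, bad):
    """Linear chain: the bad transaction and everything ordered after it."""
    return set(tx_order[tx_order.index(bad):])


def affected_dag(inputs, bad):
    """Graph of transactions: the bad transaction and its descendants only.

    `inputs` maps a transaction id to the ids of the transactions
    whose outputs it consumes.
    """
    children = defaultdict(list)
    for tx, parents in inputs.items():
        for parent in parents:
            children[parent].append(tx)

    seen = {bad}
    queue = deque([bad])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Two unrelated groups of parties, "a" and "b", transacting independently.
inputs = {"a1": [], "b1": [], "a2": ["a1"], "b2": ["b1"], "a3": ["a2"]}
order = ["a1", "b1", "a2", "b2", "a3"]

print(sorted(affected_linear(order, "a2")))
print(sorted(affected_dag(inputs, "a2")))
```

With a2 as the problematic transaction, the linear model also drags in b2, which has nothing to do with it, whereas the graph model confines the problem to a2 and a3.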
With all that said, I have to acknowledge that this incident also highlighted some positive aspects of the Ethereum ecosystem. The model of a single formal specification that allows many diverse implementations of the software provides a good degree of fault tolerance, since it reduces single points of failure. This was evident from the fact that, even though a fault in a single client had considerable impact, the majority of the network was left unaffected, which is what mattered most in this case. This is also one of the reasons why the Corda protocol has always been defined by the open-source codebase, as we share the belief that “the more eyes on the code, the better”.
[1]: See the post-mortems published by Infura and by the Geth team.
[2]: The version was v1.9.7 and the part of the specification that got broken was EIP-211, which is related to a gas charging mechanism for functions that return data of arbitrary size.
[3]: For those of you that don’t know, Corda is itself open-source and thus subject to similar dynamics. However, we also maintain Corda Enterprise, which is an interoperable commercial distribution that allows us to mitigate some of the issues described.
— Dimos Raptis is a Software Engineer at R3, an enterprise blockchain software firm working with a global ecosystem of more than 350 participants across multiple industries from both the private and public sectors to develop on Corda, its open-source blockchain platform, and Corda Enterprise, a commercial version of Corda for enterprise usage.