Today, let’s investigate how storage requirements have evolved over blockchain’s eleven-year history, discuss the scalability problem, and consider what the proposed solutions would mean for blockchain storage growth rates. We will also look at how advanced blockchain analytics are the key to making sense of ever-growing transaction histories.
Finally, we’ll discuss how the architecture of PARSIQ and its infrastructure has been carefully designed to cope with the demands of the fast-growing, scalable blockchains we expect the future to hold.
Blockchains store transactions in a linked-list data structure. With each new block, the transaction history grows, and more storage space is required to store the blockchain. Data is added at regular intervals with an average frequency defined by the blockchain’s protocol. Each full node in the network has to keep track of the transaction history and provide the storage necessary to do so. In the early days of Bitcoin, storage requirements were negligible, but they have since become significant. General-purpose blockchains, such as Ethereum, have grown much faster, and keeping up with the blockchain’s latest state while providing sufficient storage space has become a serious issue for node operators.
On the other hand, growth is conditioned by transaction throughput, and scalability is currently one of the limiting factors holding back blockchain adoption for many applications. Of course, solving the scalability problem would mean higher transaction throughput and even further increases in storage requirements.
Let’s start by looking at the factors that influence the amount of storage required for the transaction history of a blockchain.
The main factors to consider are the rate at which blocks are created and the amount of data that is stored in each block. There is an upper bound on how much data can be stored with each block. Together, the block creation frequency (or block time) and the block size limit set the maximum rate at which a blockchain can grow in terms of storage. This means that, unless at least one of these parameters is changed, a blockchain running at full capacity grows at a constant rate, so its storage requirement increases linearly over time.
Let’s take Bitcoin as an example. The protocol has been designed to keep block times at ten-minute intervals, meaning that, on average, a new block is created every ten minutes. The block size is capped at 1 MB, but as some data is stored outside of this official block limit, the blockchain grows by a little over 1 MB every ten minutes if blocks are filled completely. The Bitcoin network has been at or near full capacity for a while now, so the linear growth rate applies.
This means that the Bitcoin blockchain grows by roughly 6 MB every hour, which adds up to 52 GB per year.
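The arithmetic behind these figures can be sketched in a few lines. This is only a back-of-the-envelope estimate: it assumes every block is exactly 1 MB, ignores the data stored outside the block limit, and treats 1 GB as 1000 MB. The function names are illustrative.

```python
# Back-of-the-envelope estimate of the maximum storage growth of a blockchain.
# Assumptions: blocks are always full, and 1 GB is treated as 1000 MB.

def max_growth_mb_per_hour(block_size_mb: float, block_time_s: float) -> float:
    """Maximum growth in MB per hour for a given block size and block time."""
    blocks_per_hour = 3600 / block_time_s
    return block_size_mb * blocks_per_hour


def max_growth_gb_per_year(block_size_mb: float, block_time_s: float) -> float:
    """Maximum growth in GB per (365-day) year."""
    hours_per_year = 24 * 365
    return max_growth_mb_per_hour(block_size_mb, block_time_s) * hours_per_year / 1000


# Bitcoin: 1 MB blocks, one block every ten minutes (600 seconds).
bitcoin_hourly = max_growth_mb_per_hour(block_size_mb=1.0, block_time_s=600)
bitcoin_yearly = max_growth_gb_per_year(block_size_mb=1.0, block_time_s=600)

print(f"{bitcoin_hourly:.1f} MB/hour, {bitcoin_yearly:.1f} GB/year")
```

Changing either parameter changes the slope of the line, but growth stays linear as long as both are fixed.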
At the time of writing (March 2020) the Bitcoin transaction history came in at 266 GB of storage.
The above chart shows how Bitcoin’s growth has largely stabilized at a linear rate, due to the network operating near full capacity.
Bitcoin Cash, on the other hand, has significantly increased the block limit to 32 MB, so theoretically it could grow much faster. However, the network currently does not operate anywhere near full capacity, so it actually grows more slowly.
The above numbers still seem fairly manageable. However, if we look at Ethereum, the numbers change dramatically. The Ethereum network started in 2015 as a more general blockchain platform supporting smart contracts. To be useful for this purpose, Ethereum needs to create blocks at a much faster rate, and block times are usually just below 15 seconds on average. The growth rate is not limited by a fixed block size but by a block gas limit. Basically, this means that the number of transactions that can be included in a block depends on the total gas they consume. Nevertheless, under most circumstances, this should keep storage growth fairly linear. Miners have the power to modify the block gas limit, which they have done occasionally in order to effectively increase the block size. Currently, the average block size is between 20 and 30 KB. Although this is much smaller than Bitcoin’s blocks, Ethereum blocks are created at a much higher frequency. As a result, Ethereum has grown at a much faster rate and quickly overtook Bitcoin in terms of storage requirements.
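As a rough illustration of how a block gas limit caps the number of transactions per block, the sketch below greedily fills a block from a list of pending transactions. The 10,000,000 gas limit is an illustrative assumption (miners can adjust the real value), 21,000 is the gas cost of a plain ether transfer, and `fill_block` is a hypothetical helper, not part of any Ethereum client.

```python
from typing import List

BLOCK_GAS_LIMIT = 10_000_000  # assumed, illustrative value; miners can vote it up or down
SIMPLE_TRANSFER_GAS = 21_000  # gas consumed by a plain ether transfer


def fill_block(pending_gas: List[int], block_gas_limit: int) -> List[int]:
    """Include pending transactions in order, skipping any that no longer fit.

    Each entry in pending_gas is the gas one transaction consumes.
    """
    included = []
    gas_used = 0
    for gas in pending_gas:
        if gas_used + gas > block_gas_limit:
            continue  # this transaction does not fit; a smaller one later might
        included.append(gas)
        gas_used += gas
    return included


# A backlog consisting only of simple transfers.
pending = [SIMPLE_TRANSFER_GAS] * 1000
included = fill_block(pending, BLOCK_GAS_LIMIT)

print(len(included), "simple transfers fit into one block")
```

Transactions that call smart contracts consume far more gas than a simple transfer, so real blocks usually hold fewer transactions than this upper bound suggests.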
Ethereum’s so-called world state, consisting of account balances and smart contract storage, is not included in the actual blocks. While this state is stored separately, it may change over time, so not all nodes need to keep a full history of the old state.
There are two types of Ethereum nodes that store blockchain state. Archive nodes store the full history of the blockchain, whereas ordinary full nodes keep track of the current blockchain state (plus a little bit of history). Of course, a certain number of archive nodes are required for full security. Currently, such a node has to store around 3.5 TB of data.
The above chart shows the development of Ethereum archive node storage requirements during the last 12 months. Note that the format in which data is stored has an impact. This can be seen in the different storage requirements of the Geth and Parity node implementations and in the changes that occurred during a Geth software update that improved the storage structure significantly.
However, besides these implementation details, the numbers clearly show that running a full Ethereum node has become very resource-intensive. The storage problem is compounded by the fact that disk IO speed is also very important for keeping up with the blockchain. This means that the type of storage required is very expensive. A fast solid-state drive (SSD) is a must.
While the above numbers already highlight the problem with ever-growing blockchain storage, the problem becomes even more evident when the limitations of blockchain scalability are taken into account. The current parameters of the Bitcoin protocol limit the network to an average of between 4 and 7 transactions per second (tps). This is clearly not enough for mass adoption and seriously limits the usefulness of the platform for many applications. Ethereum does not fare much better, with an average of 15 tps.
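The Bitcoin range can be reproduced with a rough calculation. The sketch below assumes an average transaction size of between 250 and 400 bytes (an illustrative assumption, not a figure from this article) and uses the 1 MB block size and ten-minute block time discussed earlier.

```python
def max_tps(block_size_bytes: int, avg_tx_bytes: int, block_time_s: int) -> float:
    """Upper bound on transactions per second for a size-limited blockchain."""
    tx_per_block = block_size_bytes / avg_tx_bytes
    return tx_per_block / block_time_s


BLOCK_SIZE_BYTES = 1_000_000  # 1 MB block size limit
BLOCK_TIME_S = 600            # ten-minute block time

# Assumed average transaction sizes; real values vary with transaction type.
low_tps = max_tps(BLOCK_SIZE_BYTES, avg_tx_bytes=400, block_time_s=BLOCK_TIME_S)
high_tps = max_tps(BLOCK_SIZE_BYTES, avg_tx_bytes=250, block_time_s=BLOCK_TIME_S)

print(f"between {low_tps:.1f} and {high_tps:.1f} tps")
```

The same formula makes the trade-off visible: throughput only rises if blocks get bigger, transactions get smaller, or blocks arrive more often.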
To understand this problem properly, let’s have a look at what Vitalik Buterin, Ethereum’s co-inventor, termed the scalability trilemma.
Basically, the scalability trilemma means that trade-offs have to be made between scalability, security and decentralization. Let’s assume, for example, that we want to increase scalability by modifying the parameters of the protocol. If we increase the block size or reduce the block time, storage requirements will increase, meaning that fewer nodes will be able to store the full blockchain, and security decreases. The same is true for processing power and disk IO capabilities. Even if we accept that only very powerful nodes can maintain the blockchain, there is a natural limit due to physical communication overhead and latency. Some blockchains address this problem by limiting transaction verification to a small number of powerful nodes. At the extreme end, EOS has just 21 block producers, leading to a very centralized platform.
In Ethereum’s case, it can be argued that the network has been operating close to its limit for a while now and that major architecture changes are required to improve scalability without relaxing security too much. This is exactly what the Ethereum 2.0 re-design proposes.
Bitcoin, on the other hand, still has some wiggle room at the protocol level.
It has become clear, though, that in order to fully work around the scalability trilemma, solutions need to be employed that do not rely on growing the number of on-chain transactions. So-called second-layer solutions move certain types of processing off the blockchain and use the first-layer ledger for settlement. These solutions include the Lightning Network for Bitcoin, state channels and Raiden for Ethereum, zero-knowledge proofs and stateless clients. We will not go into detail about these solutions in this article. It is sufficient to note that these solutions will have to be employed to keep the size and throughput of the blockchain at a reasonable level.
Making Sense of Blockchain State – Big Data
From the above, it should be obvious that blockchains involve an ever-growing amount of data, which updates at a relatively high frequency. Even though there are natural limits to blockchain scalability, the amount of data to be processed and the transaction throughput are likely to increase.
This means that tracking transactions and detecting blockchain events in real-time is a big data problem.
Manual processing is clearly off the cards, and blockchain analytics platforms have to be carefully designed to be efficient and, above all, future-proof. It is not at all trivial to allow a large number of users to trace transactions and monitor addresses. Reacting to events in real-time, such as with Parsiq’s smart triggers, is even harder.
The architecture of PARSIQ
To solve this problem, a lot of thought has gone into the architecture of the PARSIQ blockchain analytics system. Let’s analyze the steps taken to make PARSIQ run efficiently.
The Parsiq architecture has four main features that keep the platform functioning under high transaction throughput and large volumes of data.
- Scalable microservices architecture: Parsiq components are implemented as cloud-hosted microservices. This allows the platform to scale easily at a very granular level. Without such a design, it would be very difficult to scale horizontally to the level that increasing transaction throughput and blockchain data size are expected to require soon.
- Intelligent multi-level filtering: Monitoring events that occur on a blockchain is a demanding process. Because so much is happening in each new block, it is impossible to process all events fully in real-time. Fortunately, most events are not relevant to Parsiq users, and intelligent filtering is used to extract those events that do matter. This filtering is performed in several layers, the most relevant being Bloom filters. Bloom filters are ideal for filtering events, as they are very efficient. They also favor false positives over false negatives, meaning that while the occasional non-relevant event may be let through, no important event will ever be missed.
- Efficient sharded database design: The database used for storing Parsiq data is necessarily large. This means that the actual database implementation needs to be heavily optimized. To this end, Parsiq implements a sharded database design, partitioning the data horizontally so that each shard holds only a subset of the rows, which improves performance.
- In-memory user-data: Data associated with each user has to be retrieved frequently, and timing is crucial. This is due to the cascading effect of smart triggers. A single event may trigger thousands or even millions of smart triggers. To deal with this large amount of data in real-time, in-memory data storage is used. Redis services running on separate hosts act as intermediary in-memory database caches.
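To make the multi-level filtering idea more concrete, here is a minimal Bloom filter sketch. It is not Parsiq’s implementation: the class, the filter size, the number of hash functions, and the sample addresses and events are all assumptions chosen for illustration. The first stage cheaply discards most events, and a second, exact check removes any false positives that slip through.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical watchlist of addresses that users monitor.
watchlist = {"0xaaa", "0xbbb"}
bloom = BloomFilter()
for address in watchlist:
    bloom.add(address)

# Hypothetical stream of events from a new block.
events = [
    {"to": "0xaaa", "value": 5},
    {"to": "0xccc", "value": 1},
    {"to": "0xbbb", "value": 7},
    {"to": "0xddd", "value": 3},
]

# Stage 1: cheap probabilistic pre-filter (may let a few irrelevant events through).
candidates = [e for e in events if bloom.might_contain(e["to"])]
# Stage 2: exact check on the much smaller candidate set.
matches = [e for e in candidates if e["to"] in watchlist]

print(matches)
```

Because a Bloom filter never produces false negatives, every event for a watched address survives the first stage, and the exact check in the second stage only has to run on the few candidates that remain.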
As it turns out, running a reliable, high-performance blockchain analytics service is quite a complex endeavor. Big data still requires big data techniques and a suitable architecture.
While blockchains are expected to grow and provide higher throughputs, PARSIQ has been designed to scale with the systems the platform is monitoring.