Why blockchain performance is hard to measure

by Joseph Bonneau

Performance and scalability are much-discussed challenges in the crypto space, relevant to both Layer 1 projects (independent blockchains) and Layer 2 solutions (like rollups and off-chain channels). Yet we don’t have standardized metrics or benchmarks. Numbers are often reported in inconsistent and incomplete ways, making it difficult to accurately compare projects and often obscuring what matters most in practice. 

We need a more nuanced and thorough approach to measuring and comparing performance – one that breaks performance down into multiple components, and compares trade-offs across multiple axes. In this post, I define basic terminology, outline challenges, and offer guidelines and key principles to keep in mind when evaluating blockchain performance. 

Scalability vs. performance

First, let’s define two terms, scalability and performance, which have standard computer science meanings that are often misused in blockchain contexts. Performance measures what a system is currently capable of achieving. As we’ll discuss below, performance metrics might include transactions per second or median transaction confirmation time. Scalability, on the other hand, measures the ability of a system to improve performance by adding resources.

This distinction is important: Many approaches to improving performance do not improve scalability at all, when properly defined. A simple example is using a more efficient digital signature scheme, such as BLS signatures, which are roughly half the size of Schnorr or ECDSA signatures. If Bitcoin switched from ECDSA to BLS, the number of transactions per block could go up by 20-30%, improving performance overnight. But we can only do this once — there isn’t an even more space-efficient signature scheme to switch to (BLS signatures can also be aggregated to save more space, but this is another one-off trick).

A number of other one-off tricks (such as SegWit) are possible in blockchains, but continual performance improvement requires a scalable architecture, in which adding more resources improves performance over time. This is conventional wisdom for many other computer systems as well, such as web servers. With a few common tricks, you can build one very fast server; but ultimately, you need a multi-server architecture that can meet ever-growing demand by continually adding extra servers.

Understanding the distinction also helps avoid the common category error found in statements like, “Blockchain X is highly scalable, it can handle Y transactions per second!” The second claim may be impressive, but it’s a performance metric, not a scalability metric. It doesn’t speak to the ability to improve performance by adding resources.

Scalability inherently requires exploiting parallelism. In the blockchain space, Layer 1 scaling appears to require sharding or something that looks like sharding. The basic concept of sharding (splitting state into pieces so that different validators can process them independently) closely matches the definition of scalability. There are even more options on Layer 2 that allow adding parallel processing, including off-chain channels, rollup servers, and sidechains.

Latency vs. throughput

Classically, blockchain system performance is evaluated across two dimensions, latency and throughput: Latency measures how quickly an individual transaction can be confirmed, whereas throughput measures the aggregate rate of transactions over time. These axes apply both to Layer 1 and Layer 2 systems, as well as many other types of computer systems (such as database query engines and web servers).

Unfortunately, both latency and throughput are complex to measure and compare. Furthermore, individual users don’t actually care about throughput (which is a system-wide measure). What they really care about is latency and transaction fees — more specifically, that their transactions are confirmed as quickly and as inexpensively as possible. Though many other computer systems are also evaluated on a cost/performance basis, transaction fees are a somewhat new axis of performance for blockchain systems that doesn’t really exist in traditional computer systems.

Challenges in measuring latency

Latency seems simple at first: how long does a transaction take to get confirmed? But there are always several different ways to answer this question.

First, we can measure latency between different points in time and get different results. For example, do we start measuring latency when the user hits a “submit” button locally, or when the transaction hits the mempool? And do we stop the clock when the transaction is in a proposed block, or when a block is confirmed with one follow-up block or six?

The most common approach takes the point of view of validators, measuring from the time a client first broadcasts a transaction to the time a transaction is reasonably “confirmed” (in the sense that real-world merchants would consider a payment received and release merchandise). Of course, different merchants may apply different acceptance criteria, and even a single merchant may use different standards depending on the transaction amount.

The validator-centric approach misses several things that matter in practice. It ignores latency on the peer-to-peer network (how long does it take from when the client broadcasts a transaction to when most nodes have heard it?) as well as client-side latency (how long does it take to prepare a transaction on the client’s local machine?). Client-side latency may be very small and predictable for simple transactions like signing an Ethereum payment, but can be significant for more complex cases like proving a shielded Zcash transaction is correct.

Even if we standardized the window of time we’re trying to measure with latency, the answer is almost always “it depends.” No cryptocurrency system ever built has offered fixed transaction latency. A fundamental rule of thumb to remember is:

 

Latency is a distribution, not a single number.

 

The networking research community has long understood this (see, for example, this excellent talk by Gil Tene). A particular emphasis is placed on the “long tail” of the distribution, as a highly elevated latency in even 0.1% of transactions (or web server queries) will severely impact end users.

With blockchains, confirmation latency can vary for a number of reasons:

Batching: most systems batch transactions in some way, for example into blocks on most Layer 1 systems. This leads to variable latency, because some transactions will have to wait until the batch fills up. Others might get lucky and join the batch last. These transactions are confirmed right away and don’t experience any additional latency.

Variable congestion: most systems suffer from congestion, meaning more transactions are posted (at least some of the time) than the system can immediately handle. The level of congestion can vary when transactions are broadcast at unpredictable times (often abstracted as a Poisson process), when the rate of new transactions changes throughout the day or week, or in response to external events like a popular NFT launch.

Consensus-layer variance: Confirming a transaction on Layer 1 usually requires a distributed set of nodes to reach consensus on a block, which can add variable delays regardless of congestion. Proof-of-work systems find blocks at unpredictable times (also abstractly a Poisson process). Proof-of-stake systems can also add various delays (for example, if an insufficient number of nodes are online to form a committee in a round, or if a view change is required in response to a leader crashing).
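
To make the consensus-layer variance above concrete, here is a minimal sketch that models proof-of-work block discovery as a Poisson process, so the wait until the next block is exponentially distributed. The 600-second mean mirrors Bitcoin's target block interval; the function names, sample count, and seed are illustrative assumptions, and the model deliberately ignores congestion, fees, and network delay.

```python
import random
import statistics

def simulate_first_inclusion_waits(mean_block_interval=600.0, n_tx=100_000, seed=1):
    """Sample how long a transaction waits for the next block.

    Assumes block discovery is a Poisson process, which is memoryless:
    whenever the transaction arrives, the remaining wait until the next
    block is exponential with the same mean as the block interval.
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_block_interval) for _ in range(n_tx)]

def percentile(sorted_values, q):
    """Nearest-rank style percentile for q in [0, 100] over a pre-sorted list."""
    idx = min(len(sorted_values) - 1, int(q / 100.0 * len(sorted_values)))
    return sorted_values[idx]

waits = sorted(simulate_first_inclusion_waits())
summary = {
    "mean": statistics.fmean(waits),
    "median": percentile(waits, 50),
    "p99": percentile(waits, 99),
}
print(summary)
```

Under this model the theoretical mean is 600 seconds, the median is about 416 seconds (600 · ln 2), and the 99th percentile is about 2,763 seconds (600 · ln 100), so even with no congestion at all the tail is several times longer than the typical wait.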

For these reasons, a good guideline is:

 

Claims about latency should present a distribution (or histogram) of confirmation times, rather than a single number like the mean or median.

 

While summary statistics like the mean, median, or percentiles provide a partial picture, accurately evaluating a system requires considering the entire distribution. In some applications, the mean latency can provide good insight if the latency distribution is relatively simple (for example, Gaussian). But in cryptocurrency, it is almost never this way: Typically, there is a long tail of slow confirmation times.

Payment channel networks (e.g. Lightning Network) are a good example. A classic L2 scaling solution, these networks offer very fast payment confirmations most of the time, but occasionally they require a channel reset which can increase latency by orders of magnitude.

And even if we do have good statistics on the exact latency distribution, they will likely vary over time as the system and demand on the system change. It also isn’t always clear how to compare latency distributions between competing systems. For example, consider one system which confirms transactions with uniformly distributed latency between 1 and 2 minutes (with a mean and median of 90 seconds). If a competing system confirms 95% of transactions in 1 minute exactly, and the other 5% in 11 minutes (with a mean of 90 seconds and a median of 60 seconds), which system is better? The answer is probably that some applications would prefer the former and some the latter.
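
The comparison above can be checked with a short sketch. It builds the two latency distributions from the example (the uniform one is approximated by an evenly spaced grid, which is an assumption of this illustration) and reports the mean, median, and 99th percentile of each. The helper name `summarize` is hypothetical.

```python
import statistics

# System A: latency uniform on [60, 120] seconds, approximated by an evenly
# spaced grid of 1,000 midpoints.
N = 1000
system_a = [60 + 60 * (i + 0.5) / N for i in range(N)]

# System B: 95% of transactions take exactly 60 s, 5% take 660 s (11 minutes).
system_b = [60.0] * 950 + [660.0] * 50

def summarize(latencies):
    """Return mean, median, and an approximate 99th percentile in seconds."""
    ordered = sorted(latencies)
    p99 = ordered[int(0.99 * len(ordered))]
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": p99,
    }

summary_a = summarize(system_a)
summary_b = summarize(system_b)
print(summary_a)
print(summary_b)
```

Both systems report the same 90-second mean, yet the second has a lower median and a far worse 99th percentile, which is why a single summary number cannot settle the comparison.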

Finally, it’s important to note that in most systems, not all transactions are prioritized equally. Users can pay more to get a higher priority of inclusion, so in addition to all of the above, latency varies as a function of transaction fees paid. In summary:

 

Latency is complex. The more data reported, the better. Ideally, complete latency distributions should be measured under varying congestion conditions. Breakdowns of latency into different components (local, network, batching, consensus delay) are also helpful.

 

Challenges in measuring throughput

Throughput also seems simple at first glance: how many transactions can a system process per second? Two primary difficulties arise: what exactly is a “transaction,” and are we measuring what a system does today or what it might be able to do?

While “transactions per second” (or tps) is a de facto standard for measuring blockchain performance, transactions are problematic as a unit of measurement. For systems offering general purpose programmability (“smart contracts”) or even limited features like Bitcoin’s multiplex transactions or options for multi-sig verification, the fundamental issue is:

 

Not all transactions are equal.

 

This is obviously true in Ethereum, where transactions can include arbitrary code and arbitrarily modify state. The notion of gas in Ethereum is used to quantify (and charge fees for) the overall quantity of work a transaction is doing, but this is highly specific to the EVM execution environment. There is no simple way to compare the total amount of work done by a set of EVM transactions to, say, a set of Solana transactions using the BPF environment. Comparing either to a set of Bitcoin transactions is similarly fraught.
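
As a rough illustration of why "transactions per second" depends on the workload, the sketch below computes an upper bound on tps for a fixed gas budget under two different transaction mixes. The block gas limit, block time, and the cost of the heavier contract call are assumed round numbers rather than measurements of any live network; only the 21,000 gas base cost of a plain transfer comes from the EVM's fee schedule.

```python
# Assumed, illustrative parameters (not measurements of any live network).
BLOCK_GAS_LIMIT = 30_000_000   # assumed gas budget per block
BLOCK_TIME_SECONDS = 12        # assumed time between blocks
SIMPLE_TRANSFER_GAS = 21_000   # base gas cost of a plain ETH transfer
CONTRACT_CALL_GAS = 150_000    # hypothetical cost of a heavier contract call

def max_tps(gas_per_tx, gas_limit=BLOCK_GAS_LIMIT, block_time=BLOCK_TIME_SECONDS):
    """Upper bound on transactions per second if every transaction costs gas_per_tx."""
    return (gas_limit // gas_per_tx) / block_time

transfer_tps = max_tps(SIMPLE_TRANSFER_GAS)
contract_tps = max_tps(CONTRACT_CALL_GAS)
gas_per_second = BLOCK_GAS_LIMIT / BLOCK_TIME_SECONDS

print(f"transfers only: {transfer_tps:.1f} tps")
print(f"contract calls only: {contract_tps:.1f} tps")
print(f"gas per second in both cases: {gas_per_second:,.0f}")
```

The same chain doing the same amount of work per second reports a very different tps figure depending on the mix, which is why gas per second is the more stable number within the EVM, even though it does not transfer to other execution environments.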

Blockchains that separate the transaction layer into a consensus layer and an execution layer can make this more clear. At the (pure) consensus layer, throughput can be measured in bytes added to the chain per unit of time. The execution layer will always be more complex.

Simpler execution layers, such as rollup servers which only support payment transactions, avoid the difficulty of quantifying computation. Even in this case, though, payments can vary in the number of inputs and outputs. Payment channel transactions can vary in the number of “hops” required which affects throughput. And rollup server throughput can depend on the extent to which a batch of transactions can be “netted” down to a smaller set of summary changes.
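
The idea of netting can be sketched in a few lines. This is only an illustration of the concept, with a hypothetical function name and made-up account names; it is not how any particular rollup encodes its state updates.

```python
from collections import defaultdict

def net_balance_changes(payments):
    """Collapse a batch of (sender, receiver, amount) payments into net deltas.

    Accounts whose inflows and outflows cancel out are dropped.
    """
    deltas = defaultdict(int)
    for sender, receiver, amount in payments:
        deltas[sender] -= amount
        deltas[receiver] += amount
    return {account: delta for account, delta in deltas.items() if delta != 0}

batch = [
    ("alice", "bob", 5),
    ("bob", "carol", 5),
    ("carol", "alice", 2),
    ("alice", "bob", 1),
    ("dave", "erin", 7),
    ("erin", "dave", 7),   # cancels the previous payment entirely
]
net = net_balance_changes(batch)
print(f"{len(batch)} payments -> {len(net)} balance changes: {net}")
```

Here six payments reduce to three balance changes, so a "transactions per second" figure for such a system depends heavily on how much of the workload cancels out.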

Another challenge with throughput is going beyond empirically measuring today’s performance to evaluate theoretical capacity, which raises all sorts of modeling questions. First, we must decide on a realistic transaction workload for the execution layer. Second, real systems almost never achieve theoretical capacity, especially blockchain systems. For robustness reasons, we hope node implementations are heterogeneous and diverse in practice (rather than all clients running a single software implementation). This makes accurate simulations of blockchain throughput even more difficult to conduct.

Overall:

 

Claims of throughput require careful explanation of the transaction workload and the population of validators (their quantity, implementation, and network connectivity). In the absence of any clear standard, historical workloads from a popular network like Ethereum suffice.

 

Latency-throughput tradeoffs

Latency and throughput are usually a tradeoff. As Lefteris Kokoris-Kogias outlines, this tradeoff is often not smooth, with an inflection point where latency goes up sharply as system load approaches its maximum throughput.
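
One way to see where such an inflection point comes from is the textbook M/M/1 queue, in which mean time in the system is 1 / (μ − λ) for service rate μ and arrival rate λ. This is a swapped-in simplification chosen for illustration, not a model of any specific blockchain, and the 100 tps capacity is a hypothetical figure.

```python
def mm1_mean_latency(arrival_rate, service_rate):
    """Mean time in system (queueing plus service) for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        # Offered load meets or exceeds capacity: the queue grows without bound.
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)

CAPACITY_TPS = 100.0  # hypothetical maximum throughput

for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    latency = mm1_mean_latency(utilization * CAPACITY_TPS, CAPACITY_TPS)
    print(f"load {utilization:.0%}: mean latency {latency:.2f} s")
```

In this model, mean latency only doubles between 50% and 75% load, but it grows fifty-fold between 50% and 99% load, which matches the sharp rise near maximum throughput described above.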

Zero-knowledge rollup systems present a natural example of the throughput/latency tradeoff. Large batches of transactions increase proving time which increases latency. But the on-chain footprint, both in terms of proof size and validation cost, will be amortized over more transactions with larger batch sizes, increasing throughput.
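
The batch-size tradeoff can be sketched with a toy cost model. Every parameter below is hypothetical, and proving time is assumed to grow linearly with batch size, which real proof systems may not follow.

```python
def rollup_batch_tradeoff(batch_size,
                          fixed_onchain_gas=500_000,   # hypothetical proof verification cost
                          per_tx_onchain_gas=300,      # hypothetical per-transaction data cost
                          prove_seconds_per_tx=0.05):  # hypothetical, assumes linear proving time
    """Return (amortized on-chain gas per transaction, proving time in seconds)."""
    gas_per_tx = fixed_onchain_gas / batch_size + per_tx_onchain_gas
    proving_time = prove_seconds_per_tx * batch_size
    return gas_per_tx, proving_time

for n in (100, 1_000, 10_000):
    gas, seconds = rollup_batch_tradeoff(n)
    print(f"batch {n:>6}: {gas:8.1f} gas/tx, {seconds:7.1f} s to prove")
```

Larger batches push the amortized on-chain cost toward the per-transaction floor while proving time keeps growing, so the choice of batch size is a choice of where to sit on the throughput/latency curve.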

Transaction fees

Understandably, end users care more about the tradeoff between latency and fees, not latency and throughput. Users have no direct reason to care about throughput at all, only that they can confirm transactions quickly for the lowest fees possible (with some users caring more about fees and others more about latency). At a high level, fees are affected by multiple factors:

  1. How much market demand is there to make transactions?
  2. What overall throughput is achieved by the system?
  3. How much overall revenue does the system provide to validators or miners?
  4. How much of this revenue is based on transaction fees vs. inflationary rewards?

The first two factors are roughly supply/demand curves which lead to a market-clearing price (though it has been claimed that miners act as a cartel to raise fees above this point). All else being equal, more throughput should tend to lead to lower fees, but there is a lot more going on.
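
The supply/demand intuition behind the first two factors can be sketched with a deliberately simplified model: users bid fees, and a block with fixed capacity includes the highest bids. The function name and bid values are hypothetical, and the model ignores real fee mechanisms such as base fees, as well as any strategic behavior by miners.

```python
def clear_block(fee_bids, capacity):
    """Include the highest-fee bids up to capacity; return (included, clearing_fee).

    clearing_fee is the lowest fee that still made it in, or 0 if there is
    spare capacity (no competition for block space).
    """
    ranked = sorted(fee_bids, reverse=True)
    included = ranked[:max(capacity, 0)]
    congested = len(ranked) > capacity
    clearing_fee = included[-1] if (included and congested) else 0
    return included, clearing_fee

bids = [50, 12, 30, 8, 21, 5, 17]
_, fee_small = clear_block(bids, capacity=3)
_, fee_large = clear_block(bids, capacity=6)
_, fee_uncongested = clear_block(bids, capacity=10)
print(fee_small, fee_large, fee_uncongested)
```

With demand held fixed, raising capacity lowers the fee needed for inclusion, which is the "all else being equal" effect; the remaining factors determine whether that holds in a real system.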

In particular, points 3 and 4 above are fundamental questions of blockchain system design, yet we lack good principles for either of them. We have some understanding of the advantages and disadvantages of giving miners revenue from inflationary rewards vs. transaction fees. However, despite many economic analyses of blockchain consensus protocols, we still have no widely accepted model for how much revenue needs to go to validators. Today most systems build in an educated guess about how much revenue is enough to keep validators behaving honestly without strangling practical use of the system. In simplified models, it can be shown that the cost of mounting a 51% attack scales with rewards to validators.

Raising the cost of attacks is a good thing, but we also don’t know how much security is “enough.” Imagine you’re considering going to two amusement parks. One of them claims to spend 50% less on ride maintenance than the other. Is it a good idea to go to this park? It might be that they’re more efficient and are getting equivalent safety for less money. Perhaps the other is spending more than what’s needed to keep the rides safe to no benefit. But it could also be the case that the first park is dangerous. Blockchain systems are similar. Once you factor out throughput, blockchains with lower fees have lower fees because they are rewarding (and therefore incentivizing) their validators less. We don’t have good tools today to assess if this is okay or if it leaves the system vulnerable to attack. Overall:

 

Comparing fees between systems can be misleading. Even though transaction fees are important to users, they are affected by many factors besides the system design itself. Throughput is a better metric for analyzing a system as a whole.

 

Conclusion

Evaluating performance fairly and accurately is hard. This is equally true for measuring the performance of a car. Just like with blockchains, different people will care about different things. With cars, some users will care about top speed or acceleration, others about gas mileage, and still others about towing capacity. All of these are non-trivial to evaluate. In the US, for example, the Environmental Protection Agency maintains detailed guidelines just for how gas mileage is evaluated as well as how it must be presented to users at a dealership.

The blockchain space is a long way from this level of standardization. In certain areas, we may get there in the future with standardized workloads to evaluate throughput of a system or standardized graphs for presenting latency distributions. For the time being, the best approach for evaluators and builders is to collect and publish as much data as possible, with a detailed description of the evaluation methodology, so that it can be reproduced and compared to other systems.
