Measuring SNARK performance: Frontends, backends, and the futureJustin Thaler
A SNARK (Succinct Non-interactive Arguments of Knowledge) is an important cryptographic primitive finding applications to blockchain scalability (e.g., L2 rollups) and privacy. SNARKs let someone prove to an untrusting verifier V (e.g., the Ethereum blockchain) that they know some data (e.g., a batch of valid transactions). A naive way to prove this would be to send the data to V, who can then directly check its validity. A SNARK achieves the same, but with better costs to V. In particular, a SNARK proof should be shorter than the naive one comprising the data itself. (For more detail, see the draft of my textbook, Proofs, Arguments, and Zero Knowledge. For an introduction to SNARKs, see Sarah Meicklejohn’s presentation at the a16z crypto Summer Research Series.)
The verification costs for SNARKs can vary, but are well-understood and often quite good. For example, PlonK proofs cost about 290,000 gas to verify on Ethereum, while StarkWare’s proofs cost about 5 million gas. SNARKs are potentially applicable in diverse settings even outside of blockchains — for example enabling the use of fast but untrusted servers and hardware.
But because SNARK verification is typically cheap, the main determinants of applicability are the costs of the SNARK prover P. In this post, I explain how to estimate these costs to determine when it is reasonable to use SNARKs, and how SNARKs might improve in the future. It is worth noting that this is a fast-moving space, and several of the projects discussed in this post are actively improving their performance.
But first: How SNARKs are deployed
In SNARK deployment a developer typically writes a computer program ψ that takes as input the data w that the prover claims to know (w stands for witness), and checks that w is valid. For example, in rollups, the program will check that all transactions in w are digitally signed, don’t cause any account balances to fall below zero, and so forth. The program ψ is then fed through a SNARK frontend that compiles it into a format that is more amenable to application of SNARK technology. This SNARK-friendly format is called an intermediate representation (IR).
Typically, the IR is some kind of circuit-satisfiability instance that is equivalent to ψ. This means that the circuit C takes as input the data w, as well as some extra inputs typically called “non-deterministic advice,” and runs ψ on w. The advice inputs are used to help C run ψ, while keeping C small. For example, every time ψ divides two numbers x and y, the quotient q and remainder r can be provided as advice to C, and C can simply check that x = qy + r. This check is less expensive than making C run a division algorithm to compute q and r from scratch.
Finally, a SNARK for circuit-satisfiability is applied to C. This is called the SNARK backend. For a handful of highly structured problems such as matrix multiplication, convolutions, and several graph problems, known SNARKs exist that avoid this frontend/backend paradigm and thereby achieve a vastly faster prover. But the focus of this post is on general-purpose SNARKs.
As we will see, the prover costs of the SNARK backend grow with C‘s size. Keeping C small can be challenging, because circuits are an extremely limited format in which to express a computation. They consist of gates, connected by wires. Each gate g is fed some values and applies a very simple function to those values. The result is then fed into “downstream” gates via the wires emanating from g.
SNARK scalability: How much time does it really take?
The key question is, How much more time does the SNARK prover take, relative to simply re-executing ψ on the data? The answer is the prover overhead of the SNARK, relative to direct witness checking. The latter expression refers to the fact that, in the naive proof in which P sends w to V, V checks w’s validity by executing ψ on it.
It is helpful to break the prover overhead into “frontend overhead” and “backend overhead.” If evaluating the circuit C gate-by-gate is F times more expensive than running ψ, then we say the frontend overhead is F. If applying the backend prover to C is B times more expensive than evaluating C gate-by-gate, then we say that the backend overhead is B. The total prover overhead is the product, F·B. This multiplicative overhead can be substantial even if F and B are individually modest.
In practice, F and B can both be roughly 1000 or larger. This means the total prover overhead relative to direct witness checking can be 1 million to 10 million or more. A program that runs for just one second on a laptop can easily lead to a SNARK prover requiring tens or hundreds of days of compute time (single-threaded). Fortunately, this work is typically parallelizable to varying extents (depending on the SNARK).
In summary, if you want to use a SNARK in an application today, then one of three things needs to be true:
- Direct witness checking takes much less than one second on a laptop.
- Direct witness checking is particularly amenable to representation in a circuit, so frontend overhead is small.
- You are willing to wait days for the SNARK prover to finish, and/or pay for huge parallel compute resources.
The rest of this post explains where frontend and backend overheads come from, and how I estimate them for different SNARKs. We’ll then turn to prospects for improvement.
Separating frontends and backends
Completely separating frontends from backends can be challenging because different backends support different kinds of circuits. Hence, frontends can differ depending on the backend with which they expect to interface.
SNARK backends generally support so-called arithmetic circuits, meaning that the inputs to the circuits are elements of some finite field, and that the gates of the circuit perform addition and multiplication of two field elements. These circuits roughly correspond to straight-line computer programs (no branches, conditional statements, and so on) that are algebraic in nature — that is, their primitive data type is field elements.
Most backends actually support a generalization of arithmetic circuits often called Rank-1 Constraint Satisfaction (R1CS) instances. With the notable exception of Groth16 and its predecessors, these SNARKs can be made to support other IRs as well. For example, StarkWare uses something called Algebraic Intermediate Representation (AIR), which is also similar to a notion called PlonKish arithmetization that can be supported by PlonK and other backends. The ability of some backends to support more general IRs can reduce the overhead of frontends that produce those IRs.
Backends also vary in terms of the finite fields that they natively support. I’ll discuss this further in the next section.
Various approaches to frontends
Some (very special) computer programs naturally correspond to arithmetic circuits. One example is the computer program that performs naive multiplication of matrices over some field. But most computer programs are neither straight-line nor algebraic. They often involve conditional statements, operations such as integer division or floating point arithmetic that do not naturally correspond to finite field arithmetic, and more. In these cases, frontend overhead will be substantial.
One popular frontend approach is to produce circuits that essentially execute step-by-step some simple CPU, also called a virtual machine (VM). Frontend designers specify a set of “primitive operations” analogous to assembly instructions for real computer processors. Developers who want to use the frontend will either write “witness-checking programs” directly in the assembly language or else in some higher-level language such as Solidity, and have their programs automatically compiled into assembly code.
For example, StarkWare’s Cairo is a very limited assembly language in which assembly instructions roughly permit addition and multiplication over a finite field, function calls, and reads and writes to an immutable (i.e., write-once) memory. The Cairo VM is a von Neumann architecture, meaning that the circuits produced by the frontend essentially take a Cairo program as public input and “run” the program on the witness. The Cairo language is Turing Complete — despite its limited instruction set, it can simulate more standard architectures, although doing so may be expensive. The Cairo frontend turns Cairo programs executing T primitive instructions into what is called a “degree-2 AIR with T rows and about 50 columns.” What exactly this means is not important for this post, but as far as SNARK provers are concerned, this corresponds to circuits with between 50 and 100 gates for each of the T steps of the Cairo CPU.
RISC Zero takes a similar approach to Cairo, but with the virtual machine being the so-called RISC-V architecture, an open-source architecture with a rich software ecosystem that is growing in popularity. As a very simple instruction set, designing an efficient SNARK frontend that supports it may be relatively tractable (at least relative to complicated architectures such as x86 or ARM). As of May, RISC Zero is turning programs executing T primitive RISC-V instructions into degree-5 AIRs with 3T rows and 160 columns. This corresponds to circuits with at least 500 gates per step of the RISC-V CPU. Further improvements are anticipated in the near future.
The various zkEVM projects (e.g., zkSync 2.0, Scroll, Polygon’s zkEVM) take the virtual machine to be (duh) the Ethereum Virtual Machine. The process of turning every EVM instruction into an equivalent “gadget” (i.e., an optimized circuit implementing the instruction) is substantially more involved than for the simpler Cairo and RISC-V architectures. For this and other reasons, some of the zkEVM projects are not directly implementing the EVM instruction set but rather compiling high-level Solidity programs into other assembly languages before turning those into circuits. Performance results from these projects are pending.
“CPU emulator” projects such as RISC-V and Cairo produce a single circuit that can handle all programs in the associated assembly language. Alternative approaches are “ASIC-like,” producing different circuits for different programs. This ASIC-like approach can yield smaller circuits for some programs, especially when the assembly instruction that the program executes at each timestep does not depend on the program’s input. For example, it can potentially avoid frontend overhead entirely for straight-line programs such as naive matrix multiplication. But the ASIC approach also seems highly limited; as far as I know, it’s not known how to use it to support loops without predetermined iteration bounds.
The final component of frontend overhead comes from the fact that all SNARKs use circuits that operate over finite fields. The CPU on your laptop can multiply or add two integers with a single machine instruction. If a frontend outputs a circuit over a field of large enough characteristic, it can essentially simulate that multiplication or addition via the corresponding field operation. However, implementing the field operation on a real CPU will typically require many machine instructions (although some modern processors do have native support for certain fields).
Some SNARK backends enable more flexible field choices than others. For example, if the backend makes use of a cryptographic group G, the circuit’s field has to match the number of elements in G, which can be limiting. In addition, not all fields support practical FFT algorithms.
There is only one implemented SNARK, Brakedown, that natively supports computations over arbitrary (large enough) fields. Along with its descendents, it has the fastest known concrete prover performance even over fields that other SNARKs support, but its proofs are currently too large for many blockchain applications. Recent work seeks to improve the proof size, but the prover is asymptotically slower and there appear to be barriers to practicality.
Some projects have chosen to work over fields with particularly fast arithmetic. For example, Plonky2 and others use a field of characteristic 264 – 232 + 1 because arithmetic in this field can be implemented several times faster than in less structured fields. However, using such a small characteristic may lead to challenges in efficiently representing integer arithmetic via field operations. (For instance, multiplying a 32-bit unsigned integer by any number larger than 232 might yield a result greater than the field characteristic. Hence, the field choice naturally supports arithmetic only on 32-bit numbers.)
But no matter what, for all popular SNARKs today to achieve 128 bits of security (without a significant increase in verification costs), they have to work over a field of size larger than 2128. As far as I can tell, this means that each field operation is going to require at least about ten 64-bit machine multiplications, and considerably more additions and bitwise operations. Hence, one should factor in at least an order of magnitude frontend overhead due to the need for circuits that operate over finite fields.
To summarize, existing frontends that use a virtual machine abstraction produce circuits with 100x to 1000x gates per step of the virtual machine, and possibly more for more complicated virtual machines. On top of that, finite field arithmetic is at least 10x slower than analogous instructions on modern processors (with possible exceptions if the processor has built-in support for certain fields). An “ASIC frontend approach” might reduce some of these overheads but is currently limited in the kinds of programs it can support.
What are the backend bottlenecks?
SNARKs for circuit-satisfiability are typically designed by combining an information-theoretically secure protocol called a “polynomial IOP” with a cryptographic protocol called a “polynomial commitment scheme.” In most cases, the concrete bottleneck for the prover is the polynomial commitment scheme. In particular, these SNARKs have the prover cryptographically commit to one or more polynomials whose degree is (at least) |C|, the number of gates in the circuit C.
In turn, the concrete bottleneck within the polynomial commitment scheme will depend on the scheme used and the circuit size. But it will always be one of the following three operations: computing FFTs, exponentiations in a cryptographic group, or Merkle-hashing. Merkle-hashing is typically a bottleneck only if the circuit is small, so we will not discuss it further.
Polynomial commitments based on discrete log
In polynomial commitments based on the hardness of the discrete logarithm problem in a cryptographic group G (KZG, Bulletproofs, Dory, and Hyrax-commit), the prover has to compute a Pedersen vector commitment to the coefficients of the polynomial. This involves a multi-exponentiation, of size equal to the polynomial’s degree. In SNARKs, this degree is typically the size |C| of the circuit C.
Done naively, a multi-exponentiation of size |C| requires about 1.5·|C|·log |G| ≈ 400·|C| group operations, where |G| denotes the number of elements in the group G. However, there is an approach, called Pippenger’s algorithm, which can reduce this by a factor of roughly log |C|. For large circuits (say, |C| ≥ 225), this log |C| factor can concretely be 25 or more, that is, for big circuits, we expect that the Pedersen vector commitment may be computable with a little over 10·|C| group operations. Each group operation in turn tends to be (as a very rough ballpark) about 10x slower than a finite field operation. SNARKs using these polynomial commitments are as expensive for P as about 100·|C| field operations.
Unfortunately, existing SNARKs have additional overheads on top of the 100x factor above. For example:
- Spartan’s prover, which uses the Hyrax polynomial commitment, has to do |C|½ many multi-exponentiations each of size |C|½, weakening the speedup from Pippenger’s algorithm by a factor of roughly two.
- In Groth16, P must work over a pairing-friendly group, whose operations are typically at least 2x slower than groups that aren’t pairing friendly. P must also perform 3 multi-exponentiations rather than 1. Combined, this results in at least an additional factor-6 slow down relative to the 100·|C| estimate above.
- Marlin and PlonK also require pairings, and their provers to commit to considerably more than 3 polynomials.
- For any SNARK that uses Bulletproofs (e.g., Halo2, which is roughly PlonK but with Bulletproofs rather than KZG polynomial commitments), the prover has to compute logarithmically many multi-exponentiations during the “opening” phase of the commitment scheme, and this largely erases any Pippenger speedup.
In summary, known SNARKs using Pedersen vector commitments appear to have a backend overhead of at least 200x and up to 1000x or more.
Other polynomial commitments
For SNARKs using other polynomial commitments (such as FRI and Ligero-commit), the bottleneck is that the prover needs to perform large FFTs. For example, in the degree-2 AIRs produced by Cairo (with 51 columns and T rows, one per step of the Cairo CPU), StarkWare’s deployed prover does at least 2 FFTs per column, of length between 16·T and 32·T. The constants 16 and 32 depend on internal parameters of FRI as set by StarkWare and can be reduced — but at the cost of increased verification costs.
Optimistically, an FFT of length 32·T takes about 64·T·log(32T) field multiplications. This means that even for relatively small values of T (e.g., T ≥ 220), the number of field operations per column is at least 64·25·T=1600·T. So the backend overhead appears to be at least in the thousands. Furthermore, it’s possible that large FFTs are even more bottlenecked by memory bandwidth than by field operations.
In some contexts, the backend overhead of SNARKs that perform large FFTs can be mitigated with a technique called proof aggregation. For rollups, this would mean that P (the rollup service) breaks a big batch of transactions into, say, 10 smaller batches. For each small batch i, P produces a SNARK proof πi of the batch’s validity. But P does not post these proofs to Ethereum, as this would lead to a nearly 10-fold increase in gas costs. Instead, the SNARK is applied yet again, this time to produce a proof π establishing that P knows π1 ,…,π10. That is, the witness that P claims to know is the ten proofs π1,…,π10, and direct witness checking applies the SNARK verification procedure to each of the proofs. This single proof π is posted to Ethereum.
In polynomial commitments such as FRI and Ligero-commit, there is a strong tension between P time and V costs, with internal parameters acting as a knob that can trade off one for the other. Since π1 ,…,π10 are not posted to Ethereum, the knob can be tuned so these proofs are large, and producing them is faster. Only in the final application of the SNARK, to aggregate π1 ,…,π10 into a single proof π, does the commitment scheme need to be configured to ensure a small proof.
What are the other bottlenecks to SNARK scalability?
This post has focused on prover time, but other prover costs can also be scalability bottlenecks. For example, for many SNARK backends the prover needs to store several field elements for each gate in C, and this space cost can be huge. For example, a program ψ running in one second on a laptop can perform perhaps a billion primitive operations on a modern processor. As we have seen, in general one should expect C to require well over 100 gates per such operation. This means 100 billion gates, which, depending on the SNARK, could mean tens or hundreds of terabytes of space for P.
Another example: many popular SNARKs (e.g., PlonK, Marlin, Groth16) require a complicated “trusted setup ceremony” to generate a structured “proving key,” which must be stored by the prover. The largest ceremonies have generated proving keys capable of supporting circuits with about 228≈250 million gates. The keys are several dozen gigabytes in size.
In contexts where proof aggregation is possible, these bottlenecks can be substantially mitigated.
Looking ahead: prospects for more scalable SNARKs
Both frontend and backend overheads can be three orders of magnitude or more. Can we expect these to come down substantially in the near future?
I think we will — to a point. First, the fastest backends today (e.g., Spartan and Brakedown, and other SNARKs that combine the sum-check protocol with polynomial commitments) have large proofs — typically square root in the size of the circuit — so people aren’t really using them. I expect the proof size and verifier time to be meaningfully reduced in the near future via depth-one composition with small-proof SNARKs. Similar to proof aggregation, this means a prover would first generate a SNARK proof π with the “fast-prover, large-proof” SNARK, but not send π to V. Rather, P would use a small-proof SNARK to produce a proof π’ that it knows π, and send π‘ to V. This could shave an order of magnitude off of the backend overheads of SNARKs that are popular today.
Second, hardware acceleration can help. A very rough general rule is that GPUs can buy a 10x speedup over CPUs, and ASICs a 10x speedup over GPUs. However, I have three concerns on this front. First, large FFTs may be bottlenecked by memory bandwidth rather than field operations, so SNARKs that perform such FFTs may see limited speedups from specialized hardware. Second, while this post focused on the polynomial commitment bottleneck, many SNARKs require the prover to do other operations that are only slightly less expensive. So breaking the polynomial commitment bottleneck (e.g., speeding up multi-exponentiations in discrete-log-based SNARKs) may leave a new bottleneck operation that is not vastly better than the old one. (For example, SNARKs including Groth16, Marlin, and PlonK also have the prover do FFTs, in addition to multi-exponentiations.) Finally, the fields and elliptic curves that lead to the most efficient SNARKs may continue to evolve for some time, and that may create challenges for ASIC-based prover acceleration.
On the frontend side, we may increasingly find that the “CPU emulator” approach of projects such as Cairo, RISC Zero, zkEVMs, and others actually scale quite well (in terms of performance) with the complexity of CPU instruction sets. Indeed, this is precisely the hope of various zkEVM projects. This may mean that, while the frontend overhead remains three orders of magnitude or more, the frontends manage to support VMs that increasingly match real CPU architectures. A countervailing concern is that the frontends may grow complicated and difficult to audit, as hand-coded gadgets implementing increasingly complicated instructions proliferate. Formal verification methods will likely play an important role in addressing this concern.
Finally, at least in blockchain applications, we may find that most smart contracts appearing in the wild primarily use simple, SNARK-friendly instructions. This may enable low frontend overheads in practice while retaining the generality and improved developer experience that comes with supporting high-level programming languages like Solidity and rich instruction sets including those of the EVM.
Justin Thaler is an Associate Professor at Georgetown University. Before joining Georgetown, he spent two years as a Research Scientist at Yahoo Labs in New York, before which he was a Research Fellow at the Simons Institute for the Theory of Computing at UC Berkeley.
Acknowledgments: I am grateful to Elena Burger for proposing the topic of this post, and to Arasu Arun, Joseph Bonneau, Sam Ragsdale, Daniel Lubarov, and Brendan Farmer for insightful comments. Special thanks also to my editor, Tim Sullivan.
The views expressed here are those of the individual AH Capital Management, L.L.C. (“a16z”) personnel quoted and are not the views of a16z or its affiliates. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. In addition, this content may include third-party advertisements; a16z has not reviewed such advertisements and does not endorse any advertising content contained therein.
This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments for which the issuer has not provided permission for a16z to disclose publicly as well as unannounced investments in publicly traded digital assets) is available at https://a16z.com/investments/.
Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Please see https://a16z.com/disclosures for additional important information.