FAQ on Jolt’s initial implementation

4.9.24

1. What benefits of Jolt and the sum-check protocol are not captured by the raw speed numbers from the initial implementation?
2. Why did you choose Hyrax as the polynomial commitment scheme for the initial Jolt implementation?
3. Can you give a rough breakdown of how Jolt works and what the prover’s costs are?
4. How are you estimating that switching to the Binius polynomial commitment will give at least a 5x speedup?
5. Will the Jolt prover costs (per cycle of the CPU proved) increase as you add more instructions to the primitive instruction set?
6. Can Jolt benefit from GPUs as much as other zkVMs?
7. How do Jolt, SP1, and RISC Zero relate to each other and the broader zkVM landscape?
8. What are the downsides of zkVMs targeting (ostensibly) SNARK-friendly ISAs rather than pre-existing ISAs like RISC-V?
9. In August 2023, Starkware stated that FRI is 50x faster than curve-based commitment schemes, even when committing to small values, as in Jolt. How can this be reconciled with the fact that Jolt-with-curve-based-commitments is faster than all existing zkVMs?
10. StarkWare has described “Circle STARKs” as a breakthrough. Will Circle STARKs change the comparison between Jolt and STARK-based zkVMs?
11. Will STIR change the comparison between Jolt and STARK-based zkVMs?
12. How does the space usage of the Jolt prover compare to other zkVMs?
13. Haven’t some estimates of prover overhead in existing zkVMs been as low as 100,000? How does that not contradict your estimate of 5 million and up?
14. What should I know about precompiles?
15. SP1 claims to be up to 28x faster than RISC Zero on the CPU. So how can Jolt be faster than SP1 yet only 6x faster than RISC Zero?
16. How do you plan to incorporate precompiles into Jolt?
17. What should I know about recursion vis-à-vis Jolt and SP1?
18. Didn’t you recently argue that continuations for the Binius commitment scheme with Keccak as the hash function will work out fine? Why are you worried that FRI with Blake3 won’t?
19. Doesn’t StarkWare use the Blake2s hash function in production? Doesn’t that contradict your assertion that FRI-with-Blake3 is not conducive to recursion?
20. Isn’t Blake3 dozens of times faster than Poseidon2? Why does using Blake3 in SP1 speed up the prover by only 30%?
21. Does recent work on “accumulation schemes without homomorphism” address your concerns about continuations for SP1-with-Blake3?
22. Can you explain the potential benefits of accumulation/folding schemes when Jolt is instantiated with a curve-based commitment scheme?

While you’re reading hover Bold⁽¹⁾ text for more info.

1. What benefits of Jolt and the sum-check protocol are not captured by the raw speed numbers from the initial implementation?
2. Why did you choose Hyrax as the polynomial commitment scheme for the initial Jolt implementation?
3. Can you give a rough breakdown of how Jolt works and what the prover’s costs are?
4. How are you estimating that switching to the Binius polynomial commitment will give at least a 5x speedup?
5. Will the Jolt prover costs (per cycle of the CPU proved) increase as you add more instructions to the primitive instruction set?
6. Can Jolt benefit from GPUs as much as other zkVMs?
7. How do Jolt, SP1, and RISC Zero relate to each other and the broader zkVM landscape?
8. What are the downsides of zkVMs targeting (ostensibly) SNARK-friendly ISAs rather than pre-existing ISAs like RISC-V?
9. In August 2023, Starkware stated that FRI is 50x faster than curve-based commitment schemes, even when committing to small values, as in Jolt. How can this be reconciled with the fact that Jolt-with-curve-based-commitments is faster than all existing zkVMs?
10. StarkWare has described “Circle STARKs” as a breakthrough. Will Circle STARKs change the comparison between Jolt and STARK-based zkVMs?
11. Will STIR change the comparison between Jolt and STARK-based zkVMs?
12. How does the space usage of the Jolt prover compare to other zkVMs?
13. Haven’t some estimates of prover overhead in existing zkVMs been as low as 100,000? How does that not contradict your estimate of 5 million and up?
14. What should I know about precompiles?
15. SP1 claims to be up to 28x faster than RISC Zero on the CPU. So how can Jolt be faster than SP1 yet only 6x faster than RISC Zero?
16. How do you plan to incorporate precompiles into Jolt?
17. What should I know about recursion vis-à-vis Jolt and SP1?
18. Didn’t you recently argue that continuations for the Binius commitment scheme with Keccak as the hash function will work out fine? Why are you worried that FRI with Blake3 won’t?
19. Doesn’t StarkWare use the Blake2s hash function in production? Doesn’t that contradict your assertion that FRI-with-Blake3 is not conducive to recursion?
20. Isn’t Blake3 dozens of times faster than Poseidon2? Why does using Blake3 in SP1 speed up the prover by only 30%?
21. Does recent work on “accumulation schemes without homomorphism” address your concerns about continuations for SP1-with-Blake3?
22. Can you explain the potential benefits of accumulation/folding schemes when Jolt is instantiated with a curve-based commitment scheme?

While you’re reading hover Bold⁽¹⁾ text for more info.

Today, a16z crypto research and engineering teams released an initial implementation of Jolt, a new approach to SNARK design that is already up to 2x faster than the state of the art, with more improvements still to come. See this post for a high-level overview to this body of work, as well as this post for benchmarks and code, and this post for technical specifics about Jolt’s architecture and performance.

Below, I unpack the technical specifics of Jolt, compare its architecture and performance with existing zkVMs, and discuss future steps planned for its development.

Category 1: Jolt’s performance

1. What benefits of Jolt and the sum-check protocol are not captured by the raw speed numbers from the initial implementation?

Developer benefits. Since Jolt realizes the lookup singularity, it is much simpler to add new primitive instructions to Jolt’s VM compared to alternative approaches to zkVM design.

Benefits of working over a big field. Jolt is already the fastest zkVM to date, despite “paying” in prover time to use a 256-bit field. (Most other projects use 31-bit or 64-bit fields.) In return for already paying this price, the Jolt prover obtains the following benefits:

Cheaper recursion via folding. Many zkVMs today use continuations, which means they break the execution of a computer program into chunks (sometimes called “shards”) and prove each chunk independently, before recursively aggregating the proofs into one. Without continuations, prover space costs become prohibitive except for toy-sized computer programs.

SNARKs using curve-based commitments are amenable to very efficient aggregation via folding schemes. The upshot is that, although Jolt does not yet have recursion implemented, it stands to incur less overhead from it relative to projects like SP1 (which also does not fully have recursion implemented, despite partial progress to this end). (See #22 below for details.)

Usefulness even without recursion. Jolt proofs are 3-10 MBs today but they will fall to just a couple of dozen KBs once we add support for other curve-based polynomial commitment schemes such as Zeromorph (see #2 below for details). zkVMs like SP1, which use FRI with a “blowup factor” of 2, cannot achieve proofs smaller than 500KBs without recursion. And unlike Jolt, FRI-based zkVMs involve superlinear-time procedures like FFTs that become a prover bottleneck if the zkVM is applied directly to large computations without continuations.
Less overhead when moving to 64-bit data types. Because Jolt’s current implementation already pays for 256-bit fields, it can represent 64-bit values in a single field element, and it can even multiply two 64-bit values together without overflowing the field modulus. This means that Jolt, once extended to handle 64-bit data types, will be less than 2x slower per instruction than when handling 32-bit data types. Hence, Jolt’s speedup over today’s zkVMs will grow larger when moving to 64-bit data types.

Performance benefits of small fields are still to come. As discussed in my other post, we will switch from curve-based commitments to the hashing-based commitment scheme Binius, and thereby obtain a further speedup of at least 5x for Jolt (see #4 below for details).

This switch only works for sum-check-based SNARKs, which use multilinear rather than univariate polynomials. For example, the Binius commitment scheme exploits “tensor structure” in evaluation queries for multilinear polynomials over the so-called Lagrange basis. This tensor structure is not present for univariate polynomials over the Lagrange basis. Working over the Lagrange basis is crucial to Jolt’s performance, as it ensures that the SNARK prover only commits to small values, and avoids the need for expensive change-of-basis operations.

Binius got even better last week, as Diamond and Posen brought its proof size down from square root in the size of the committed polynomial to polylogarithmic.

More performative precompiles. Today’s zkVM deployments make heavy use of precompiles. And precompiles based on the sum-check protocol can be especially efficient. For example, the Binius paper describes a sum-check-based SNARK for Keccak and other standard hash functions that I expect to have a substantially faster prover than alternative approaches. Such precompiles are most efficiently incorporated into sum-check-based SNARKs like Jolt.

Also, many SNARK applications involve proving knowledge of elliptic-curve-based digital signatures. This is vastly more efficient if the SNARK can work directly over the base field of the appropriate elliptic curve group, which Jolt-with-curve-based-commitments can do (see #14 of my previous post for additional discussion).

Future optimizations. Jolt is a new approach to zkVM design that was implemented by two engineers (Michael and Sam) and a graduate student (Arasu) over the course of five months. No one else besides me and Srinath has attempted to optimize the protocol or codebase, and we have not implemented all of the optimizations already identified. New improvements are certain to come with time and scrutiny.

2. Why did you choose Hyrax as the polynomial commitment scheme for the initial Jolt implementation?

Hyrax is simple, transparent, and reasonably fast for the prover. In particular, opening proofs in Hyrax do not involve any cryptographic work for the prover, so we knew these opening proofs would not be a prover bottleneck.

(More precisely, we’re currently using a stripped-down version of the Hyrax-commitment that is not zero-knowledge).

The downside of using Hyrax is that it leads to somewhat large verifier costs. (Jolt’s proofs when using Hyrax range from a few MBs up to roughly a dozen MBs.)

We can replace Hyrax with any other commitment scheme for multilinear polynomials. In particular, we’ll add support for Zeromorph and HyperKZG, which are adaptations of KZG commitments to multilinear polynomials. They have constant commitment size and logarithmic size evaluation proofs, and verification involves logarithmically many G₁ operations and 3 pairing evaluations (The Zeromorph verifier has already been implemented in Solidity for on-chain verification. A Rust implementation of its prover is here).

Concretely, this will yield a proof size of a couple dozen KBs, with no degradation in prover time compared to the current implementation. Achieving this proof size will also require switching a component of Lasso called a grand product argument. Jolt currently uses one that has log(n)²-sized proofs. A variant described in Section 6 of the Quarks paper has roughly logarithmic-sized proofs, with minimal increase in prover time.

Naively implemented, Zeromorph and HyperKZG opening proofs could turn out to be a bottleneck for the Jolt prover, because they require the prover to commit to random field elements. But we can combine them with the (standard) technique underlying Hyrax to reduce this cost, with a modest increase in verifier work. See Section 14.3 of my book for the (simple) details, which involve committing to a polynomial “in smaller-sized pieces” and then using homomorphic properties to provide a single opening proof for all of the pieces.

3. Can you give a rough breakdown of how Jolt works and what the prover’s costs are?

A VM does two things:

Repeatedly execute the fetch-decode-execute logic of its instruction set architecture.
Perform reads and writes to Random Access Memory (RAM).

Accordingly, Jolt has three components:

To handle the “execute” part of each fetch-decode-execute loop iteration, it invokes the Lasso lookup argument.
To handle reads/writes to RAM (and to registers) it uses a memory-checking argument from Spice, which is closely related to Lasso itself. They are both based on “offline memory checking” techniques, the main difference being that Lasso supports read-only memory while Spice supports read-write memory, making it slightly more expensive.
To handle the “fetch-decode” part of each fetch-decode-execute loop iteration, and to capture some extra constraints not directly handled by Lasso itself, there is a minimal R1CS instance (about 60 constraints per cycle of the RISC-V VM).

To prove satisfaction of the R1CS in (3), we use Spartan, optimized for the highly-structured nature of the constraint system (e.g., the R1CS constraint matrices are block-diagonal with blocks of size only about 60 x 80).

Roughly, the Jolt prover today spends 25% of its time in (1), 25% in (2), and 50% in (3).

About ⅕ of its time is spent computing commitments, and ½ of its time is spent in various invocations of the sum-check protocol. The remaining time is spread across assorted tasks like allocating memory, witness generation, serializing memory, and so forth.

These fractions will shift as we continue to optimize Jolt.

4. How are you estimating that switching to the Binius polynomial commitment will give at least a 5x speedup?

Per #3 above, the Jolt prover currently spends about 1/5 of its time computing commitments, and ½ of its time computing messages in the sum-check protocol (which mainly involves field operations: memory reads/writes during the sum-check protocol involve a single streaming pass over a handful of arrays in each round, so the prover should not be memory bound). These are the core tasks of the Jolt prover. The remaining 30% of prover time is split across a large number of “plumbing” tasks like allocating memory and witness generation (witness generation makes up less than 2% of the Jolt prover’s runtime).

Switching to Binius should drop the time to compute commitments by a factor of more than 10x.

Because Binius works over the field GF[2¹²⁸] (rather than the scalar field of an elliptic curve like BN254, which is what Jolt uses today), field operations in the sum-check protocol will get much cheaper. Exactly how much is hardware-dependent, but Ulvetanna is working hard to ensure the speedups are substantial. For example, on ASICs and FPGAs, it will likely be 30x or more (even without bringing in algorithmic optimizations that minimize the prover’s field work over small-characteristic fields). On ARM machines, the speedups should also be substantial: while Binius is targeted at the so-called tower basis of GF[2¹²⁸], a lot of the prover’s work can happen over the POLYVAL basis, and ARM chips support fast multiplication in this basis.

The biggest open question is how much we can optimize away the plumbing tasks. We haven’t yet tried hard, since they add up to only 30% of the prover’s time today. But after commitments and field operations fall in cost by 10x or more, these tasks will dominate unless optimized further. Their cost should fall by a factor of at least two simply due to switching from a 256-bit field to a 128-bit field. If it falls by much more than that, we’ll obtain an end-to-end prover speedup of over 10x.

One final complication is that working over fields of characteristic two makes RISC-V addition and multiplication less lookup-friendly. So these operations will be handled with bespoke SNARKs given as Protocols 5.3 and 5.5 in the Binius paper, rather than via lookups. The bespoke SNARK for addition is cheaper than Jolt’s current lookup-based approach. The bespoke SNARK for multiplication requires the prover to commit to more data than a pure-lookup-approach over a field of large characteristic. But this will be compensated for by the fact that the Binius commitment scheme itself is much faster per bit committed than a curve-based one.

5. Will the Jolt prover costs (per cycle of the CPU proved) increase as you add more instructions to the primitive instruction set?

Not meaningfully. As long as each primitive instruction is decomposable, the Jolt prover’s cost depends primarily on the size of the inputs to each instruction, not on the number of instructions in the ISA.

This is because at each step of the computation, correct execution of the instruction that gets executed at that step is handled via a single lookup into the table containing all evaluations of all primitive instructions. The size of this table grows linearly with the number of primitive instructions, but the prover’s cost in the lookup argument used (namely, Lasso) grows extremely slowly with the table size.

The verifier’s field work does grow with the number of primitive instructions in the ISA. But since field operations are much less expensive than cryptographic work (such as hashing or group operations, which the verifier has to do owing to use of polynomial commitment schemes), this field work will not be a verifier bottleneck for reasonable instruction set sizes.

6. Can Jolt benefit from GPUs as much as other zkVMs?

I expect so. For example, Ingonyama’s GPU implementation of the sum-check protocol is already complete and will soon be integrated into their main GPU library, ICICLE. ICICLE already has support for the other core operation performed by the Jolt prover, namely MSMs (although it is not yet optimized for Jolt’s setting, where all values committed by the MSM are small).

It’s conceivable that complications will arise when Jolt switches over to the Binius commitment scheme. Specifically, achieving very fast GF[2¹²⁸] arithmetic on a GPU may require effort (whereas, per #4 above, it is already clear that very fast operations in this field can be achieved on ARM chips, ASICs and FPGAs, etc.)

As implementations of sum-check-based SNARKs mature, there is every reason to expect that extremely performative GPU implementations will arise, just as they have for the SNARKs that are popular today.

Category 2: Jolt in the broader zkVM landscape

7. How do Jolt, SP1, and RISC Zero relate to each other and the broader zkVM landscape?

Jolt, SP1, and RISC Zero are all zkVMs targeting the RISC-V instruction set architecture (ISA). This is an ISA that was designed without SNARKs in mind, and a robust tooling ecosystem surrounds it. Other zkVMs target ostensibly “SNARK-friendly” ISAs (though I do not believe these other ISAs are actually any friendlier to SNARKs than RISC-V).

Both SP1 and RISC Zero rely on STARK back-ends applied to AIR constraint systems. By “STARK” here, I mean a two-round polynomial IOP combined with the FRI commitment scheme for univariate polynomials. Jolt is based on different proof machinery: it uses a many-round polynomial IOP based on the sum-check protocol, combined with a commitment scheme for multilinear polynomials.

More details on SP1. SP1 was released in February 2024, and borrows heavily from a number of projects. Its Rust toolchain and SDK are taken from RISC Zero, as is its architecture for incorporating pre-compiles. And it uses the STARK implementation from plonky3.

In a nutshell, SP1’s contribution is to design AIRs implementing each RISC-V instruction, so that plonky3 can be applied to these AIRs. Even these AIRs build heavily on code from a different zkVM called Valida. (Valida targets an ISA that is inspired by RISC-V, but modifies it to be ostensibly friendlier to SNARKs.)

CPUs, GPUs, and blowup factors. RISC Zero has generally focused on GPU implementation: its CPU implementation does not appear to be fully optimized. SP1 has so far focused exclusively on CPUs. The SP1 zkVM is about 5x faster than RISC Zero’s on the CPU. However, a factor of roughly 1.5x out of this 5x speedup is due to SP1 simply using a FRI blowup factor of 2, rather than the blowup factor of 4 used by RISC Zero. This makes recursion more expensive for SP1, though SP1’s current benchmarks do not “charge” the prover anything for this (since SP1 does not yet have recursion fully implemented). So the actual speedup factor of SP1 of RISC Zero on the CPU is currently closer to 3 than it is to 5.

Still, the favorable comparison of Jolt to SP1 shows that Jolt’s speedups over prior zkVMs is not due to optimizing for the CPU over the GPU, but rather is a property of the proof system itself.

8. What are the downsides of zkVMs targeting (ostensibly) SNARK-friendly ISAs rather than pre-existing ISAs like RISC-V?

First, with existing ISAs one can use existing compilers from high-level languages like Rust down to assembly code for the ISA. Those designing new “SNARK-friendly” ISAs have to build their own compilers to obtain assembly code for their custom ISA. Not only is this enormously challenging and effort-intensive (at least if one wants decent performance), it introduces a massive surface area for bugs to enter the system.

Second, many “SNARK-friendly” ISAs are vastly simpler than RISC-V (fewer registers, no bitwise operations, limitations on the number of times each memory cell can be written, etc.). So each primitive instruction does much less “useful work” than a RISC-V instruction. Any given computer program requires more primitive instructions when compiled down to the custom ISA, compared to RISC-V.

Unless each primitive instruction in the SNARK-friendly ISA is actually cheaper to prove than a RISC-V instruction, total prover time will be much higher, since more primitive instructions need to be proved, and each custom instruction is just as expensive to prove as RISC-V instructions.

Jolt is just as fast for any instruction set for which the instructions satisfy a natural “decomposability” property. So targeting weak instruction sets is deeply counterproductive: it weakens the amount of useful work done by each primitive instruction, without reducing the cost of invoking it.

A related issue is that many projects, whether or not they’ve implemented a full-fledged zkVM, have introduced new “SNARK-friendly” programming languages to expose to developers. This has three drawbacks. First, developers have to learn a new language, and are more likely to write buggy programs until they become experts. Second, all programs used to secure anything of value should ultimately be formally verified for correctness, and just as custom ISAs require new compilers, custom DSLs will require new verification tooling. Third, if an ecosystem around a custom language fails to thrive or scale, any applications built using the language become defunct.

In summary, zkVM projects insisting on the use of a custom DSL for efficient compilation to an ostensibly “SNARK-friendly” VM have chosen a path riddled with drawbacks. Each design decision is a mistake–a bad VM, an inefficient SNARK used to prove that VM, a new high-level language to ease compilation to the VM, and a new compiler to perform the compilation. Each mistake adds significant inefficiency and vulnerability to errors and bugs.

9. In August 2023, Starkware stated that FRI is 50x faster than curve-based commitment schemes, even when committing to small values, as in Jolt. How can this be reconciled with the fact that Jolt-with-curve-based-commitments is faster than all existing zkVMs?

The 50x claim was later revised downward to 10x. This adjustment came after criticism, because the original claim overstated the costs associated with Multi-Scalar Multiplications (MSMs) by a factor of 5.

But even the updated claim leads to more than an order of magnitude discrepancy with the fact that Jolt-with-curve-based commitments is already the fastest zkVM to date (with many improvements still to come). How can such a huge discrepancy be explained?

The answer is two-fold. First, the claim ignores that the Jolt prover commits to much less data than other zkVMs. (And due to Jolt’s use of the sum-check protocol, Jolt can’t directly use FRI, because FRI is tailored to univariate polynomials while the sum-check protocol uses multivariate ones.) For example, Jolt commits to about 1200 bits of data per step of the RISC-V CPU while RISC Zero commits to over 8500.

Second, the more data the prover commits to, the more work it has to do to prove the data is well-formed. The claims ignore these large “non-commitments costs” in SNARKs that avoid the sum-check protocol.

In summary, StarkWare’s claims look only at commitment time for a given amount of data, not at how much data needs to be committed, nor how much work is needed to prove the committed data is well-formed. Thus, a purported 10x slowdown (originally claimed to be 50x) turns into a speedup.

10. StarkWare has described “Circle STARKs” as a breakthrough. Will Circle STARKs change the comparison between Jolt and STARK-based zkVMs?

No. Circle STARKs enable the prover to work over the “M31 field,” which is of size 2³¹– 1 and has slightly faster arithmetic than the BabyBear field currently used by SP1/plonky3 and RISC Zero. Switching to the M31 field will at best improve prover time by a modest factor of 1.4.

StarkWare recently said that they expect a Circle-Stark-based proof system called Stwo for the Cairo virtual machine to be about 100x faster than the Stone prover. But according to our experiments, Jolt is already over 35x faster than Stone, and that’s when comparing Stone run on programs carefully written in Cairo-VM assembly to Jolt run on programs written in Rust and compiled to RISC-V.

And even Stwo will partially incorporate the techniques I am advocating for, via use of a sum-check-based lookup argument that is inspired by Lasso.

On top of all of this, the 1.4x speedup promised by Circle STARKs will only come to pass if key STARK prover operations are compute-bound rather than memory-bound. The Circle STARK paper only implemented FFTs, and the current FFT implementation is memory-bound. No performance benefit whatsoever has yet been demonstrated under multi-threading (see Table 1 in Appendix C of the paper).

11. Will STIR change the comparison between Jolt and STARK-based zkVMs?

No. STIR is an impressive result that reduces the FRI proof size over 128-bit fields by a factor of perhaps 1.5x with a modest increase in prover time. However, it does not change the fact that (contrary to widespread views) FRI and STIR, and the SNARKs that invoke them, are optimized for verification costs rather than prover time.

Polynomial commitment schemes like Ligero/Brakedown/Binius have many benefits for the prover relative to FRI and STIR (see #7 of my previous post, and #20 below, for details). These benefits are in addition to the fact that FRI and STIR apply to univariate polynomials rather than multilinear ones, and so cannot be directly combined with the sum-check protocol. These are the polynomial commitment schemes that SNARK designers should use to achieve scalable provers. Similarly, curve-based commitments for multilinear polynomials are fast when combined with the right polynomial IOPs (as demonstrated by Jolt, and Nova before it, and despite widespread misconceptions to the contrary).

SNARKs optimized for verifier cost at the expense of prover time will continue to play a role, namely when proving very simple statements, or to compress proof size and verifier costs via SNARK composition just before posting a proof on-chain.

12. How does the space usage of the Jolt prover compare to other zkVMs?

Today, the Jolt prover’s space usage is roughly 5x larger than SP1’s. However, Jolt’s space usage will fall by a factor of about 8x in the coming weeks.

The way Lasso is currently implemented in Jolt, the prover stores large vectors of 256-bit field elements, over 90% or which are 1s. We currently spend 256 bits per vector entry, even the 1s. We can instead have the prover store a “densified” representation of these vectors, that is, only store the entries that are not 1. Once we implement this, total prover space will fall by a factor of about 8x, and prover time will also decrease somewhat (by perhaps 10%).

Long-term, I expect that the space usage of Jolt will fall another factor of 10x without much impact on prover time. This space improvement will come from two places. First, switching to the Binius commitment scheme, which uses the field of size 2¹²⁸, will substantially lower the space cost of storing vectors of field elements (the maximum size of any field element will fall from 256 bits down to 128 bits, and on top of that, many field elements arising in Jolt will lie within a much smaller subfield, and hence require fewer bits to represent). Second, it has long been known that there are space-efficient implementations of the sum-check protocol prover and various polynomial commitment schemes. Over large-characteristic fields, the space-efficient variant of sum-check comes with a logarithmic-factor increase in prover time, but over small-characteristic fields (which Jolt will use after switching to Binius), the increase in prover time appears to be much smaller.

13. Haven’t some estimates of prover overhead in existing zkVMs been as low as 100,000? How does that not contradict your estimate of 5 million and up?

The right way to measure prover overhead in zkVMs depends on context. The measure that makes the most sense in the broadest array of contexts is CPU time of proving divided by CPU time of computing. By this measure, prover overhead of existing zkVMs is 5 million and up.

In the specific context of blockchain scalability, what matters is the monetary cost of proving a program execution vs. the monetary cost of executing the same program on-chain. Since proving is entirely centralized today, the prover can run on a cluster of GPUs, while blockchain nodes run on CPUs.

Owing to this, some people have estimated the prover overhead of existing zkVMs by reporting the energy cost of running the zkVM prover on GPUs, normalized by the energy cost of executing the program on CPUs. This leads to estimated overheads of as low as 100,000.

This number isn’t wrong, but it is very specific to the context of centralized proving services for blockchain scalability. In essentially any other context, 5 million is a much more accurate estimate of how much more work proving is, relative to computing.

Category 3: Precompiles

14. What should I know about precompiles?

All zkVMs introduce massive performance overhead for the prover. For example, I estimate the current version of Jolt is about 500,000 thousand times slower than simply running the RISC-V program without generating a proof of correctness (see my companion post for details).

It’s almost always better for performance to hand-design a protocol for a specific computation like SHA2 or Keccak evaluation, or verifying elliptic curve digital signatures, than it is to write a Rust program, compile it down to the assembly language of a zkVM, and prove correct execution of the zkVM.

These special-purpose protocols are called precompiles (also known as “built-ins” or “gadgets”).

So to maximize performance, every zkVM should use precompiles for operations that occur over and over again in SNARK applications. As of today, it’s the only way performance is going to be acceptable in most settings.

But there are also major downsides to relying on precompiles. For example, today’s precompiles are all hand-specified circuits or constraint systems (AIRs in the case of SP1/plonky3, RISC Zero, or Cairo). This is exactly the error-prone and time-intensive process that zkVMs are meant to obviate.

There will be a temptation to paper over the massive performance overheads of zkVMs with overreliance on precompiles. But this risks putting us right back where we are today, with security vulnerabilities everywhere and thousands of human hours spent grinding out constraint specifications.

Long-term, zkVMs will grow fast enough that precompiles can be completely avoided in many applications. Precompiles for a handful of pervasive operations will persist, and these will be formally verified for correctness. This will be analogous to how new generations of CPUs come with new primitive instructions baked into the hardware to accelerate common cryptographic operations.

These pervasive precompiles will include a handful of hash functions and digital signature schemes, and commonly-occurring operations in specific application domains, like popular non-linearities and matrix multiplication for machine learning applications.

15. SP1 claims to be up to 28x faster than RISC Zero on the CPU. So how can Jolt be faster than SP1 yet only 6x faster than RISC Zero?

SP1 reports speedups over RISC Zero ranging from about 4x to 28x depending on the benchmark. However, any speedup over about 5x is not due to improvements in the zkVM, but rather to better precompiles. Advances in precompiles are orthogonal to advances in zkVMs, and any zkVM can benefit from them. So, for example, RISC Zero can directly benefit from SP1/plonky3’s better precompiles, just as SP1 itself does.

Our benchmarks seek to assess zkVMs and hence omit the use of precompiles. (Also, extending Jolt to support precompiles is near-term future work, see #16 below).

Long-term, precompiles based on the sum-check protocol will be the fastest, and these integrate most nicely with sum-check-based zkVMs like Jolt. The quintessential examples of such a precompile is the super-efficient sum-check-based protocol for matrix multiplication from 2013 or for FFTs from zkCNN. Another is the sum-check-based SNARK for Keccak from the Binius paper, which once fully optimized, I expect to be state-of-the-art by a significant margin. (A preliminary implementation is already 2x faster than the Keccak precompile from plonky3, with major performance improvements still to be implemented.)

In addition to the above, RISC Zero has experienced roughly a 2x improvement on CPUs since the initial release of SP1 (presumably due to the optimizations summarized here).

16. How do you plan to incorporate precompiles into Jolt?

Incorporating precompiles into Jolt is essential to achieve developer adoption, since performance of all zkVMs, Jolt included, is simply too slow today without them (though Jolt without pre-compiles is faster than most existing zkVMs with precompiles).

The simplest approach is to have the prover “collect” all the inputs to the various invocations of the precompile together and apply (a data-parallel version) of the precompile to the collected inputs. Even this collection step can likely be avoided using techniques already incorporated into the current Jolt implementation.

If the precompile consists of a hand-specified constraint system (which is how the current generation of zkVMs implements precompiles), we would have developers specify the constraint system in an appropriate constraint specification language, and then apply a back-end for that constraint system to prove satisfiability. Today’s zkVMs all use AIR as the constraint system. We can reuse these precompiles by applying BabySpartan or SuperSpartan (our sum-check-based SNARKs for AIR and Plonkish constraint systems) to prove their satisfiability.

Long-term, the fastest precompiles will be special-purpose sum-check-based SNARKs that avoid constraint systems entirely. In this case, we’d probably have developers specify them by expressing the prover and verifier directly, in Rust.

Category 4: Recursion and continuations

17. What should I know about recursion vis-à-vis Jolt and SP1?

Recall that many zkVMs today use continuations, which means they break the execution of a computer program into chunks and prove each chunk independently before recursively aggregating the proofs into one.

SP1 and continuations. SP1/plonky3 does not have the recursive aggregation step implemented (despite partial progress to this end). SP1 simply outputs one proof per chunk. This means SP1 currently has huge verifier costs–the proof for each chunk is over 500KB, and there is one chunk for every 2¹⁹ cycles of the RISC-V CPU. So proofs are dozens of MBs in total for programs that run for 10 million cycles and up. The verifier has to hash most of the data in the proof.

Accordingly, SP1’s benchmarks charge the prover nothing for recursive proof aggregation (i.e., it’s simply not done), even though before it’s ever actually deployed, recursion will have to be enabled to reduce verifier costs (see #1 above for further discussion). This is not entirely unreasonable since the benchmarks use Poseidon2 as the hash function, which is “recursion-friendly.” So while implementing recursion would cost the prover something, it should not be prohibitive.

However, there may still be challenges. For example, some have turned to the slower plonky2 system to obtain a recursive variant of plonky3, perhaps due to limitations posed by plonky3’s exclusive support for AIR constraint systems.

And SP1 is configured to aggressively optimize prover time at the cost of making recursion more expensive. The default SP1 “shard size” is 2¹⁹ and the FRI “blowup factor” is 2, while RISC Zero has a default shard size of 2²⁰ and a FRI blowup factor of 4. The net effect is that recursive proving in SP1 would be about four times more expensive for RISC Zero, yet SP1’s benchmarks charge the prover nothing for this.

Succinct has proposed that further prover speedups will be had by using the Blake3 hash function instead of Poseidon2. However, this would only speed the prover up by about 30%. In addition, it’s not at all clear that Blake3 can be used without recursive proof aggregation becoming a major bottleneck. Proof aggregation for SP1 requires the prover to prove that it correctly hashed well over 500 KB of data per chunk. Current precompiles for Blake3 are nowhere near fast enough to keep this from being a major bottleneck.

Jolt and continuations. Like SP1, Jolt does not yet have continuations implemented. But since Jolt’s implementation already “pays” to use a 256-bit field and elliptic-curve-based commitments, Jolt is amenable to fast continuations via folding (see #1 above and #22 below for additional discussion). Once Jolt switches to the Binius commitment scheme, continuations will be implemented via recursive proof aggregation rather than folding. Recursion will not be a prover bottleneck, as described in my previous post (even more so now that the proof size of Binius has fallen to polylogarithmic).

Complications in comparing performance. Jolt today does not split computations into chunks. SP1 does, but does not recursively aggregate the proofs, so its proofs grow very large when there are many chunks.

Jolt is over 3x faster than SP1 today when both are run with a single chunk, but the speedup falls as the number of chunks SP1 uses increases. This is because SP1’s “intra-chunk” parallelization is not fully optimized (it will soon be improved, though: an interpolation procedure done by the prover is parallelizable but has not yet been parallelized). In contrast, inter-chunk parallelization is trivial (i.e., the prover can work on many chunks at once, independently generating the proof for each chunk).

Intra-chunk parallelization is more challenging and less perfect than inter-chunk parallelization (where the prover trivially proves multiple chunks in parallel). Today, Jolt does not take advantage of inter-chunk parallelization at all (since Jolt only runs on one chunk). Hence, Jolt may see modest performance improvements once it, too, splits big computations into many chunks.

18. Didn’t you recently argue that continuations for the Binius commitment scheme with Keccak as the hash function will work out fine? Why are you worried that FRI with Blake3 won’t?

The Binius paper gives a very fast SNARK for Keccak, which uses the sum-check protocol, combined with the Binius commitment scheme. This protocol, once fully optimized, will be much faster than non-sum-check-based SNARKs for Keccak. This very high performance will be needed for continuations to work out, at least without resorting to huge shard sizes that lead to many GBs of prover space and other performance bottlenecks.

19. Doesn’t StarkWare use the Blake2s hash function in production? Doesn’t that contradict your assertion that FRI-with-Blake3 is not conducive to recursion?

StarkWare builds Merkle trees using Blake2s as the hash function at a handful of “big” layers of the Merkle tree, and a slow, “SNARK-friendly” hash function at the other layers. This way, most hashes along any particular root-to-leaf authentication path are “SNARK-friendly” (keeping recursion cheap), but the prover time required to build the Merkle tree is dominated by the handful of big layers where the faster hash function is used.

However, most STARK-based zkVMs store lengthy vectors of field elements at each leaf of the Merkle tree. In this context, StarkWare’s technique of “Blake2s-hash only the big layers” is not very helpful, because most of the verifier’s work involves applying Blake2s to vectors stored at the leaves.

StarkWare’s production prover is dozens of times slower than Jolt, even when running it on hand-written Cairo-VM assembly. The fact that it uses Blake2s in conjunction with recursion does not mean that this would be helpful in other zkVMs.

20. Isn’t Blake3 dozens of times faster than Poseidon2? Why does using Blake3 in SP1 speed up the prover by only 30%?

Blake3 is fastest when hashing large files. The Merkle trees used in SP1 store moderately-sized vectors at the leaves. These are, effectively, small files, which each need to be hashed. In this context, Blake3 appears to only be about 3x-4x faster than Poseidon2.

This is yet another benefit for the prover of the Ligero/Brakedown/Binius commitment schemes compared to FRI: with the Ligero/Brakedown/Binius commitments, there are always large vectors at the leaves of the Merkle tree. This allows bigger reductions in hashing time when using hash functions like Blake3.

Merkle hashing is about 30% of the SP1 prover time when using Poseidon2 and about 10% when using Blake3. So a 3x-4x reduction in hashing time translates to only about a 30% reduction in total prover time.

This highlights that using Blake3 rather than Poseidon2 reduces commitment time, but does nothing to lower the time to prove well-formedness of the committed data. Sum-check-based SNARKs like Jolt are faster not only because less committed data means faster computation of the commitments, but also because proving that data is well-formed is cheaper when there’s less of it.

21. Does recent work on “accumulation schemes without homomorphism” address your concerns about continuations for SP1-with-Blake3?

No. Implementing continuations via the newly proposed accumulation scheme will be worse for the prover than using Poseidon2 and applying naive SNARK recursion (meaning, recursion in the style of plonky2, or RISC Zero continuations).

This is for at least two reasons. First, every time m shards are aggregated together, the new scheme requires the prover to compute and commit to extra data. This extra data does not need to be committed in existing accumulation schemes based on homomorphic commitments. The extra data is a random linear combination of the data committed for each shard individually, where the random coefficients come from a “big field” (at least 128 bits), even when the data being combined all comes from a small field. So simply computing this linear combination slows down the prover by a significant factor, likely more than the 30% increase that would come from using Poseidon2 to ensure naive recursion is not a bottleneck.

Second, the main reason to implement continuations is to control prover space when proving large computations, but with the new accumulation proposal, there are limitations on how small prover space can be. This is because the newly proposed accumulation scheme still needs the prover to prove that it correctly computed many hash evaluations, specifically O(dλ logn) of them where d is the depth of the accumulation and λ is the security parameter. For comparison, naive recursion with STIR involves proving O(λ log(n) loglog(n)) hashes. So the improvement of the new folding scheme relative to naive recursion is a d/loglog(n) factor, which means it is actually worse if d is larger than loglogn, and even for constant d it is better only by a small factor.

d can be kept small in one of two ways. One way is to accumulate a very large amount of committed data “at once,” but this keeps the prover’s space large. The other is to use a hybrid scheme — accumulate for a small number of levels without ever accumulating a lot of data “all at once,” and then apply brute-force recursion to aggregate the results of those accumulations. If Blake3 is impractical for naive recursion with STIR, it is likely also impractical for the hybrid accumulation scheme, as the reduction in hashes being proven relative to STIR is small. And in this hybrid scheme, the prover inherits the space bottleneck of brute-force recursion rather than standard homomorphic folding schemes: namely, recursively proving that STIR or FRI verification was done correctly.

22. Can you explain the potential benefits of accumulation/folding schemes when Jolt is instantiated with a curve-based commitment scheme?

As discussed above, continuations refers to chopping up the execution of the VM into chunks, each consisting of several million cycles, applying Jolt to each chunk individually, and then aggregating the chunks together. And there are two high-level approaches to aggregating the chunks together. One is “naive recursion”: this means producing a full Jolt proof per chunk, but rather than outputting all those proofs (which would lead to large proofs and high verifier costs, as in SP1 today), instead the prover recursively proves it knows those proofs.

The other approach is folding (sometimes called accumulation). The main benefit of folding over the naive-recursion approach is that with folding, you can omit the evaluation proof for all committed polynomials, instead using the homomorphism property of curve-based commitments to aggregate all committed polynomials across all chunks into a single committed polynomial. Omitting these evaluation proofs has two potential benefits: (1) the prover never has to compute them and (2) the Jolt verifier, which has to be expressed via an appropriate constraint system (or RISC-V program) to implement recursion, never has to check them.

Jolt makes sure these evaluation proofs are not a bottleneck for the prover, so (1) isn’t a big deal. But (2) matters: it’s exactly the reason that I do not think recursion can be implemented for SP1-with-Blake3 without significant costs. Due to (2), folding can help ensure recursion isn’t a meaningful contributor to the prover’s costs. It can also be useful for minimizing prover space, by enabling the chunks to be much smaller without recursion becoming a prover bottleneck.

***

Justin Thaler is Research Partner at a16z and an Associate Professor in the Department of Computer Science at Georgetown University. His research interests include verifiable computing, complexity theory, and algorithms for massive data sets.

***

The views expressed here are those of the individual AH Capital Management, L.L.C. (“a16z”) personnel quoted and are not the views of a16z or its affiliates. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the current or enduring accuracy of the information or its appropriateness for a given situation. In addition, this content may include third-party advertisements; a16z has not reviewed such advertisements and does not endorse any advertising content contained therein.

This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments for which the issuer has not provided permission for a16z to disclose publicly as well as unannounced investments in publicly traded digital assets) is available at https://a16z.com/investment-list/.

Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Please see https://a16z.com/disclosures/ for additional important information.