# Data availability sampling and danksharding: An overview and a proposal for improvements

Danksharding is an approach to scaling the amount of data on chain for a future version of Ethereum. The goal of this upgrade is to ensure that the data on chain was made available to archiving parties when it was first posted. It does so using a technique called Data Availability Sampling, or simply DAS.

In this post we survey how data availability in danksharding works and propose some modifications to the underlying technique. In particular, we explore a small modification that may improve data recovery: the current proposal requires 75% of the shares to recover a block, whereas the modification may reduce this bound to 25%.

## Protodanksharding

Danksharding is scheduled to follow protodanksharding (EIP-4844). Protodanksharding will allow clients to write more data to the blockchain by introducing a new transaction type called a “blob-carrying transaction.” Initially, this new transaction type will carry with it a list of up to four data blobs, where each blob is at most 128 KB (4096 field elements of 32 bytes each), adding up to 512 KB additional data per block (targeting 256 KB on average), compared to current Ethereum’s block size, which is around 100 KB on average.

These data blobs will be handled differently: (i) they will only be stored for a limited time, for example, 30-60 days, and (ii) the data will not be directly accessible to smart contracts despite being part of the transaction data. Instead, only a short commitment to the blob data, called DATAHASH, will be available to smart contracts. The additional burden on the validators seems tolerable: validators currently store less than 100 GB to maintain the blockchain’s state. After protodanksharding they will have to store an additional 50-100 GB.

Danksharding will follow. It will increase the data available to clients by 60x over protodanksharding by increasing the maximum number of blobs per block. Blocks will grow from 0.5 MB per block to 30 MB. But because validators cannot be made to store 60x more data, the data will be dispersed among them such that each validator will store only a small subset of data. Yet they can come to agreement on whether they collectively store all of the data through a data-availability sampling (DAS) protocol.

The blob-data will be priced via a mechanism similar to EIP-1559, and will target about 1 data-gas per byte. Calldata, which is the current cheapest alternative, is priced at 16 gas per byte. But since there will be two different fee markets, these costs are not directly comparable. Roll-up clients will benefit from those upgrades, because currently more than 90% of their client’s fees go to paying Ethereum data fees.

Other projects, such as Celestia and EigenLayer, are also adopting DAS techniques to increase the available data space. These designs are much simpler than fully sharding the Ethereum network.

## The goal of data availability sampling

We describe the scheme assuming a proposer-builder separation (PBS) design:

- Clients submit their blob-carrying transactions to block builders.
- A block builder forms a block
*B*by selecting*N*client data-blobs. Data-blob number*i*comes with a short commitment*C*signed by the client that sent it. Let_{i}*C*= (*C*_{1}, . . . ,*C*) be the list of all_{N}*N*signed commitments in the block*B*. - Block builders submit their proposed blocks to the current block proposer (one of the validators). The block proposer selects one of the blocks and posts it to the network as is.

The challenge is to ensure that the block *B* can be reconstructed at a later time. To do so, the builder replicates the block across a large network of *V* validators. One could ask each and every validator to store the entire block, but that is considered too expensive. Instead, the block builder

- encodes the block
*B*into a larger block*E*using erasure coding; - breaks the block
*E*into*V*overlapping fragments*P*_{1}, . . . ,*P*;_{V} - sends to validator number
*i*the pair (*P*,_{i}*C*).

Every validator checks that the fragment *P _{i}* that it received is consistent with the list

*C*of signed commitments. The block builder provides proofs for the validators to facilitate these checks.

With this setup, a Data Availability Sampling scheme has two protocols:

- A
**sampling**protocol runs between a**sampling verifier**and the validator set. The sampling verifier takes the list*C*as input and requests randomly selected elements of the block*E*from the validator set. The sampling verifier outputs*success*if it received all the requested elements and all are consistent with*C*. - A
**reconstruction**protocol runs between a**reconstruction agent**and the validator set. The reconstruction agent takes*C*as input and requests elements of block*E*from the validator set. Once it collects more than 75% of the elements, and all are valid, the reconstruction agent computes and outputs the block*B*.

(We discuss an approach below that may reduce the number of elements needed for reconstruction.)

The requirement is that if the sampling verifier outputs *success*, then the reconstruction agent will output the block *B*, provided it receives over three quarters of the elements as input. Reconstruction should succeed as long as enough elements are provided, even if the provided elements are selected adversarially.

To recap, the following parties participate in danksharding:

*Client*: sends data blobs (which are either transactions or bundles) to a builder.*Builder*: makes a block and sends fragments of this block to validators.*Block proposer*(one of the validators): posts the block to the network.*Sampling verifier*(any of the validators): runs the sampling protocol and signs the block header if the protocol outputs success.*Reconstruction agent*: reconstructs a previously posted block when needed by interacting with the entire validator set. Reconstruction succeeds if the validators respond with more than three-quarters of valid elements.

### Erasure coding and polynomial commitments

There are two building blocks to the scheme that we explain next: erasure coding and polynomial commitments.

#### Building block #1: Erasure-coding

Erasure coding dates back to the 1960s, motivated by the need to transmit information over a lossy channel. It is used in danksharding to protect against validators that lose their data fragments. The technique expands the data from *N* elements to *M* elements (*M* > *N*) such that the original data can be reconstructed from *any* intact *N* out of *M* elements of the expanded data. Imagine taking *N* elements (original data), encoding them into *M* = 2*N* elements, and giving one encoded element to each of the 2*N* validators. If a majority of validators are honest, they can always jointly reconstruct the original data. This technique protects against *crash* faults of any half of the validators. It can be extended to protect against *byzantine* behavior of a half of the validators using polynomial commitments discussed in the next section.

Here is how expansion works in more detail. To expand the data from *N* field elements *d*_{1}, *d*_{2}, . . . , *d _{N}* ∈ 𝔽

*to*

_{p}*M*encoded elements

*e*

_{1},

*e*

_{2}, . . . ,

*e*∈ 𝔽

_{M}*, and we interpolate the unique degree*

_{p}*N*– 1 polynomial

*p(x)*that satisfies

*p(i)*=

*d*for

_{i}*i*= 1, . . . ,

*N*. The polynomial is then evaluated at

*M*points:

*e*= (

_{i}*i*,

*p*(

*i*)) for

*i*= 1,. . . ,

*M*. The original polynomial,

*p*(

*x*), can be reconstructed from

*any*

*N*out of the

*M*encoded elements using Lagrange interpolation. This expansion method is called

*Reed-Solomon encoding*or

*low-degree polynomial expansion*.

#### Building block #2: Polynomial commitments

Polynomial commitment is a building lock of a DAS scheme. It allows the scheme to do two things:

- commit to a univariate polynomial
*p*in 𝔽[_{p}*X*] of bounded degree*D*using a short commitment string,*C*. - open the committed polynomial at a given point
*x*∈ 𝔽_{p.}

More precisely, given *x,y* ∈ 𝔽* _{p}*, the committer can create a short proof π that the committed polynomial

*p*satisfies

*p*(

*x*) =

*y*and that its degree is at most

*D*. The proof π is verified with respect to the commitment string

*C*and the values

*x, y*. This π is called an

*evaluation proof*.

Security guarantees that the commitment is binding to the polynomial and that an adversary cannot create a false proof.

A number of practical polynomial commitment schemes can be used here. Danksharding uses the KZG commitment scheme that requires a trusted setup ceremony to generate public parameters (called an SRS) but has constant-size commitments and constant-size evaluation proofs. KZG commitments have a number of properties that make them especially well suited for danksharding:

*the commitment is homomorphic*: if*C*_{1}is a commitment to*p*_{1}and*C*_{2}is a commitment to*p*_{2}, then*C*_{1}+*C*_{2}is a commitment to*p*_{1}+*p*_{2};*the commitment is unique*: if two people independently compute a commitment to a polynomial*p*, they obtain the same commitment;*evaluation proofs are homomorphic*: for a given*x*, if π_{1}is a proof that*p*_{1}(*x*) =*y*_{1}and π_{2}is a proof that*p*_{2}(*x*) =*y*_{2}, then π_{1 }+ π_{2}is a proof that (*p*_{1 }+*p*_{2})(*x*) =*y*_{1 }+*y*_{2}.

We can now explain how a client commits to its data-blob. First, the client interprets the data-blob as a vector of *m* field elements *d*_{1}, . . . , *d _{m}* ∈ 𝔽

_{p}, for

*m*≤ 4096. Next, it interpolates a univariate polynomial

*p*∈ 𝔽

_{p}[

*X*] of degree at most

*m*– 1 such that

*p*(

*i*) =

*d*for

_{i}*i*= 1, . . . ,

*m*. (Technically, danksharding uses 1, ω, ω

^{2}, . . . , ω

^{m-1}∈ 𝔽

_{p}as the evaluation points, where ω ∈ 𝔽

*is an*

_{p}*m*th root of unity, and a reverse-bit order; this is done for efficiency, but we will not consider it here for simplicity.) Finally, the client constructs the KZG polynomial commitment to the polynomial

*p*. It signs the commitment and sends the commitment-signature pair to the builder. This process requires public parameters (an SRS) containing 4096 group elements.

### Data Dispersal

Next, we explain how the block builder encodes a block and splits it into fragments to send to the validators. Fix some 256-bit prime *p*. The block builder does the following:

**input**: A block *B* in danksharding can contain up to 256 data blobs (64 times more blobs than in protodanksharding), where each data blob is a vector of 4096 elements in 𝔽_{p}. Thus, we can represent a block as a 256 × 4096 matrix of elements in 𝔽_{p}. Each row in this matrix corresponds to one data-blob from a client. Recall that each client sends to the builder one row of *B* along with a signed KZG polynomial commitment for that row. The builder collects 256 signed polynomial commitments *C*_{0}, . . . , *C*_{255}, one commitment per row.

**step 1**: The builder interpolates a bivariate polynomial *d*(*X*,*Y*) such that *d*(*i*,*j*) = *B*[*i*,*j*] for *i *= 0, . . . , 255 and *j *= 0, . . . , 4095. This bivariate polynomial has degree at most 255 in *X* and degree at most 4095 in *Y*.

**step 2**: The builder uses the erasure coding method described above to expand the block by a factor of two in each direction. That is, it forms a 512 × 8192 matrix *E* of field elements by setting *E*[*i*,*j*] ← *d*(*i*,*j*) for *i *= 0, . . . , 511 and *j *= 0, . . . , 8191. This is illustrated in the following figure.

**step 3**: The builder verifies that each signed *C*_{i} is a KZG commitment to the univariate polynomial *d*_{i}(*Y*) := *d*(*i*,*Y*), for all *i *= 0, . . . , 255. Observe that the polynomial *d*_{i}(*Y*) is an interpolation of row *i* of *B*, and therefore must be the same as the polynomial committed to by client *i*. The builder rejects all data-blobs for which *C*_{i} is malformed.

Now the builder uses *C* = (*C*_{0}, . . . , *C*_{255}) as a commitment to the block *B*, or more precisely, to the bivariate polynomial *d*(*X*,*Y*).

Let’s show that *C* = (*C*_{0}, . . . , *C*_{255}) is indeed a polynomial commitment to the polynomial *d*(*X*,*Y*). For some given *x*,*y*,*z* ∈ 𝔽_{p}, let us construct an evaluation proof that convinces the verifier that *d*(*x*,*y*) = *z* with respect to the commitment *C*. Since the degree of *X* in *d*(*X*,*Y*) is at most 255, there are constants λ_{0}, . . . , λ_{255} ∈ 𝔽_{p} that depend only on *x* such that

*d*(*x*,*Y*) = λ_{0} · d(0,*Y*) + . . . + λ_{255} · d(255,*Y*).

Then, by the homomorphic property of KZG commitments it follows that

*C _{x}* := λ

_{0}·

*C*

_{0}+ . . . + λ

_{255}·

*C*

_{255}

is a KZG commitment to the univariate polynomial *d*_{x}(*Y*) := *d*(*x*,*Y*). Hence, the verifier can construct *C _{x}* from

*C*on its own. Let π be the KZG evaluation proof for the polynomial

*d*

_{x}(

*Y*) that convinces the verifier that

*d*

_{x}(

*y*) =

*z*with respect to the commitment

*C*. This π is the required evaluation proof for

_{x}*d*(

*X*,

*Y*) at the point (

*x*,

*y*).

This argument shows that *C* is a polynomial commitment to *d*(*X*,*Y*). It is worth noting that while each client signs a row of *B* independently of the other clients, the collection of all client signatures functions as a signature on the polynomial commitment to *d*(*X*,*Y*).

**step 4**: The minimum unit of communication in this DAS scheme is a *sample*, which is a 16-elements row vector. The matrix *E* with 512 × 8192 elements is treated as a square matrix of 512 × 512 samples. Let *V* be the number of validators. Then the block builder breaks the matrix *E* into *V* overlapping fragments *P*_{1}, . . . , *P _{V}*, where

*P*comprises exactly two rows and two columns of samples of

_{i}*E*, chosen at random from the 512 rows and 512 columns of samples. Thus,

*P*contains 2 × 512 × 16 + 2 × 8192 = 9216 field elements in 𝔽

_{i}*. This is much smaller than the full block*

_{p}*B*, which is about a million field elements. The minimum number of validators in Ethereum is 128 (and currently around ~500,000), so enough validators exist to ensure that the whole block is sufficiently well covered.

**step 5**: The block builder sends the triple (*P _{i}*,

*C*, π

*) to validator number*

_{i}*i*, where π

*is a list of evaluation proofs for all the elements in*

_{i}*P*: one proof for each cell

_{i}*d*(

*x*,

*y*) in the two rows and two columns of samples in

*P*. The proofs in π

_{i}*enable the validator to verify that every field element in*

_{i}*P*is consistent with the commitment

_{i}*C*.

In danksharding the number of validators can greatly exceed the number of columns or rows in the block. Therefore, some columns and rows might be stored by multiple validators. As such, danksharding uses both replication and Reed-Solomon erasure coding to ensure that the data can be reconstructed.

### Full block’s reconstruction

When a reconstruction agent needs to reconstruct the entire block, it has *C*, and asks the validator set to send it their fragments. In response, an honest validator *i* sends (*P _{i}*, π

*). The reconstruction agent checks the opening proofs in π*

_{i}*, and if valid, it accepts the values in*

_{i}*P*. Hence, byzantine validators cannot send false data. However, Byzantine validators may refuse to send their data and not respond to the reconstruction agent.

_{i}When some data is missing, danksharding needs at least 75% of the matrix *E* to reconstruct the block. Since the validators only store complete rows and columns, the worst-case scenario is the following one where 75% – ε of the block’s elements are present, yet it is impossible to reconstruct the missing elements.

To see why the data cannot be reconstructed in this case, observe that there is a non-zero bivariate polynomial δ(*X*,*Y*) that evaluates to zero on the entire “present” region and has the required degree bounds on *X* and *Y*. Therefore, during reconstruction it is not possible to tell if the missing white area comes from the correct polynomial *d*(*X*,*Y*) or the polynomial *d*(*X*,*Y*) + δ(*X*,*Y*); both polynomials agree on the present data. As a result, when the missing data follows this pattern (up to a permutation of rows and columns) the missing data cannot be reconstructed. Note that if a validator drops some of its data, thereby becoming “byzantine,” it drops it all, both rows and both columns.

So less than 75% of the block is not enough in the worst case. However when 75% or more of the block is present the reconstruction is guaranteed to succeed by a simple greedy algorithm.

**The reconstruction algorithm**: Reconstruction works by iteratively finding an incomplete row (or a column) with at least 50% of available elements in it and using univariate interpolation to reconstruct the missing elements of the row (or column). To see why this procedure eventually reconstructs the whole block, let’s assume the block is still missing some elements, yet we can find no row or column that can be reconstructed. This means that each row and column either has <50% available elements or it is full (has 100% of available elements). Let’s select any incomplete row; it has >50% of unavailable elements, the columns through those elements also have to have >50% unavailable elements, which immediately implies that >25% of the block is unavailable, which contradicts the original assumption.

To render one block unreconstructable, the adversary needs to attack at least 15/16 of the validator set. The attackers in this case would pick a quadrant to erase, and attack those validators that have at least one of their two rows or one of their two columns in that quadrant.

### Local reconstruction of validator’s data

Suppose that an honest validator *i* crashes and loses its data. When it comes back online it wants to reconstruct its two rows and two columns *P _{i}* as well as the opening proofs π

*. Danksharding enables a validator to do that without reconstructing the entire block. The validator can download its data from another validator that stores the same exact set of points, or by obtaining 50% of the elements of its row (or column) from other validators, and reconstructing the rest of the row through univariate interpolation. This process does not have to go through full reconstruction.*

_{i}For example, imagine a validator stores at least 50% of column number 42, but less than 100% and it needs to reconstruct the missing elements. That is, the validator holds pairs (*E*_{(i,42)}, π_{i}) ∈ (𝔽, 𝔾) for all *i* ∈ *S*, for a set *S* where 256 ≤ |*S*| < 512. To reconstruct the missing pairs, the validator does the following:

- Using Lagrange interpolation over the field 𝔽 constructs a polynomial
*p*(*x*) of degree 255 s.t.*p*(*i*) =*E*_{(i,42)}for all*i*∈*S*. - Evaluates the polynomial to obtain the missing elements:
*E*_{(i,42)}:=*p*(*i*) for all*i*∈ {0..511}\*S*. Note that the points are guaranteed to lie on a 255-degree polynomial because the validator has checked them against the commitments*C*. - Obtain the missing proofs via a multi-exponentiation by doing polynomial interpolation “in the exponent”: for all
*i*in {0..511}\*S*compute π_{i}:= ∑_{j∈S}π· Λ_{j}_{j,S}(*i*), where Λ_{j,S}(*i*) is the Lagrange coefficient Λ_{j,S}(*i*) := ∏_{k∈S,k≠j}(*i*–*k*)/(*j*–*k*).

Step 3 uses the fact that KZG polynomial commitments have *homomorphic proofs*. To the best of our knowledge, KZG is the only polynomial commitment scheme with this property.

### Sampling for data availability in danksharding

To determine whether enough of the expanded block *E* is available for reconstructing the original block *B*, the sampling client queries for random samples of *E*. In response to each query, it gets a sample (16-elements row vector) and an evaluation proof to check the sample against the commitment *C*. If all its queries are successfully answered, the client accepts the data as available. If the client makes a total of *Q* queries, then it falsely accepts unavailable data with probability (3/4)^{Q} which drops exponentially quickly with the number of queries *Q*. The rate 3/4 corresponds to the reconstruction threshold of 75% from the previous section.

In the Ethereum network, sampling will be done by three types of parties: (i) full-nodes who can’t afford to store the blob-data in full, (ii) light clients, and (iii) the validators themselves. The validators have to certify that the data is available before casting any votes on a block and its predecessors. Each epoch has a fixed set of validators, which are split into 32 committees randomly. Each committee is assigned a 12 seconds slot of the epoch (there are 32 slots per epoch). One block is expected to be confirmed per slot. Each validator receives the fragments (2 rows and 2 columns of samples) of each of the blocks, even for blocks when the validator is not part of the attesting committee. In addition, each validator samples the current block and all the previous blocks to ensure that all the blocks are both valid and available before selecting a block for attestation according to a fork-choice rule.

The protocol guarantees that the data will be available for an epoch of 32 slots, that is, for 6 minutes. Additional approaches such as proofs of retrievability or incentive mechanisms can help ensure the data is made available for longer periods of time, e.g., 30-60 days when the blobs should be available before expiring.

## A proposal for 25% reconstruction

In this section we explain how to make reconstruction succeed if only 25% of the data is available (compared to 75% as in current danksharding explained above).

When the fraction of available points is below 75%, then the greedy column-wise and row-wise univariate interpolation can fail. Instead, we propose to do reconstruction directly through bivariate polynomial interpolation. In this case, full reconstruction is possible even if only 25% of the points of *E* are available. However, this requires a small modification to the way that fragments are assigned to validators. Instead of giving each validator two complete rows and two complete columns of samples, we propose that each validator is assigned to store

- one complete row (512 samples), one complete column (512 samples), and
- a collection of 1024 samples randomly (or pseudorandomly) spread around the matrix
*E*.

The overall storage requirement per validator is unchanged, but now reconstruction is possible even if fewer than 75% of the samples are available. Moreover, multiple validators store the same set of points, so that if any validator needs to reconstruct its fragment, it can request it from any other validator that stores the exact same fragment. If no such validator is available, it can do local or full reconstruction to obtain its fragment.

This hybrid proposal allows cheap reconstruction in the happy-case when the number of byzantine validators is small, so that more than 75% of the samples of *E* are available. In the unhappy case, when the number of byzantine validators becomes too high, the data can be reconstructed using full-bivariate interpolation from only 25% of the samples thanks to the random dispersion of samples throughout the matrix *E*.

Bivariate interpolation can naively be done by solving a linear system of equations on the coefficients of the polynomial. This simple approach requires constructing an interpolation matrix with 2^{20} × 2^{20} field elements (32TB). This is quite large but not infeasible. However, there are better methods that can be used. (See for example the survey from P.J.Olver-2016.) While the cost of bivariate interpolation is non-trivial, it is only needed when the fraction of recovered points is below 75%, and can be treated as a safety measure. To enable this safety measure, danksharding needs to be slightly modified to assign fragments to validators as outlined above.

### Danksharding with IPA commitments

The main drawback of the previous construction is the need for a trusted setup for the KZG polynomial commitment scheme. Otherwise the scheme is blazingly fast. However, the trusted setup typically needs a lot of work and coordination. (Our work on on-chain KZG setups can help simplify the ceremony for future projects). Some other polynomial commitment schemes do not need a trusted setup (e.g., Bulletproofs), although they do not have homomorphic proofs required for the efficiency of validators when reconstructing the data they need to store (as pointed out by Vitalik).

The construction can be altered, however, to avoid the need for homomorphic proofs and still have light-weight validators. The high-level idea is to make block builders compute commitments to the columns of the block-matrix *E*. With this approach, the validators won’t need to reconstruct the column-proofs; they will simply reconstruct their columns in full themselves and recompute the proofs against the column commitment from scratch. Through consensus, the validators would make sure they agree on the same set of columns-commitments.

More precisely, the **steps 1-4** are the same as in danksharding explained above. Step 5, however, is different.

**step 5**: The block builder computes polynomial-commitments to the columns of B: denote them by *T* = (*T*_{0}, *T*_{1}, . . . , *T*_{4095}), each *T _{j}* is a commitment to

*d*(

*X*,

*j*) (concretely it could be Pedersen vector commitment to the vector of coefficients of

*d*(

*X*,

*j*)). Next, it creates a proof that

*C*and

*T*commit to the same matrix as follows. The block’s builder chooses a (pseudo)random point (

*x̂*,

*ŷ*), uses the homomorphism of the commitments to interpolate and compute the column commitment

*T*

_{ŷ}to a univariate polynomial

*d*(

*X*,

*ŷ*), and a row commitment

*C*

_{x̂}to a univariate polynomial

*d*(

*x̂*,

*Y*) and create two proofs: π

_{x̂}– the polynomial evaluation proof of

*d*(

*x̂*,

*ŷ*) against

*C*

_{x̂}, and π

_{ŷ}– the polynomial evaluation proof of

*d*(

*x̂*,

*ŷ*) against

*T*

_{ŷ}. The point (

*x̂*,

*ŷ*) is universal for all of the validators, and can be obtained for example from a random beacon output or generated pseudo-randomly with a Fiat-Shamir transform. The block’s builder then sends (

*P*

_{i},

*C*,

*T*, π

_{x̂}, π

_{ŷ}, d(

*x̂*,

*ŷ*)) to validator number

*i*, where

*P*is the two rows and two columns of

_{i}*E*. The builder computes no proofs in this construction.

The validator verifies the proofs π_{x̂}, π_{ŷ} to catch a malicious block’s builder who generates the column commitments *T* incorrectly. The validator then recommits to two rows and columns in *P*_{i} and verifies that those match the corresponding elements in *C* and *T*. If the row, denote *x*, in *P*_{i} is within range of the original matrix, *B*: *x* ∈ {0..255}, the validator simply verifies that the commitment matches the corresponding element *C*_{x} from *C*; however if the row is in the expanded portion: *x* ∈ 256..511, the validator first interpolates through the commitments in *C* to obtain *C*_{x}:

*C _{x}* := λ

_{0}·

*C*

_{0}+ . . . + λ

_{255}·

*C*

_{255}

Note that interpolation is possible because the IPA commitments (not their proofs though) are homomorphic. The validator verifies the columns of *P _{i}* in a similar way against

*T*, interpolating if needed.

In this construction, the block’s builder does not have to compute the proofs for the validators. The validators can compute all the proofs themselves. It’s an interesting research problem, however, to figure out an efficient algorithm to compute a large number of proofs at once (similar to how it’s done for KZG, or using Vitalik’s ideas for IPA).

The main bottleneck of this approach, as we see it, is the loss in efficiency for sampling clients, as the proofs for IPA-based schemes are logarithmic in size (in contrast to constant-size for KZG). Moreover, the sampling clients might need to download column commitments in addition to the row commitments, incurring an overhead in communication. However, we believe this to be a promising direction for further exploration.

### Danksharding with Merkle commitments

Another plausible approach Vitalik explored recently to avoid the trusted setup is to use Merkle commitments instead of polynomial commitments. Merkle commitment is not a polynomial commitment scheme, so it’s more challenging to make it work with erasure-coding. The block’s builder will erasure code the data and commit to the expanded data with Merkle-tree roots. The main challenge is to detect incorrectly expanded data without downloading the full data. Fraud proofs can be used to get around this issue, but that relies on the existence of a client that would download the full data, check that it was erasure coded correctly and correctly committed to, and would raise an alarm if it is not by providing a fraud proof.

Alternatively, a FRI proof can be used to check that the leaves of the Merkle tree are close to a Reed-Solomon codeword (i.e., check that the data underlying the Merkle commitment was erasure-coded correctly). The sampling clients would check the FRI proof and sample a sufficient fraction of the leaves by requesting their Merkle proofs to ensure the data is available and can be reconstructed.

***

Data availability sampling, and danksharding as one of its concrete instantiations, would bring the cost of data-storage down, leading to more scalable and cheaper blockchains. The coding aspects of DAS is a potentially rich area for research with many possible directions to explore. We suggested one of the possible avenues: improving the reconstruction protocol to use fewer samples (25% instead of 75%). Another exciting direction is to explore alternative commitment schemes that do not require a trusted setup.

***

### Read more

**Protodanksharding**

- EIP-4844: the specification of ProtoDankSharding (Feb 2022)
- Proto-Danksharding FAQ by Vitalik Buterin (~ Apr 2022)
- Bankless deep-dive – youtube with Vitalik, Dankrad, Protolambda, Tim Beiko (May 2022)
- “Proto-Danksharding: a technical analysis” by grizzly-answer-991

**DAS for Ethereum and danksharding**

- Danksharding workshop at Devcon: Youtube (Oct 14 ’22),
- New sharding design with tight beacon and shard block integration (~ Dec 2021) by Dankrad Feist with concrete bandwidth and compute estimates
- An explanation of the sharding + DAS proposal by Vitalik Buterin (Dec 2020)
- 2D data availability with Kate commitments forum discussion on commitments’ expansion (Oct 2020)
- Data Availability Sampling Phase 1 Proposal by Vitalik Buterin (Aug 2021)
- The Hitchhiker’s Guide to Ethereum by Jon Charbonneau and delphidigital (May 2022)
- Dude, what’s the Danksharding situation?” Workshop with Vitalik Buterin and Dankrad Feist (Feb 2022)
- Data Availability Sampling Networking RFP from the Ethereum Foundation (May 2022)
- DAS requirements (May 2022)
- A note on data availability and erasure coding (Sep 2018)

**Research on DAS**

- Arithmetic hash based alternatives to KZG for proto-danksharding by Vitalik Buterin (Oct 6 ’22),
- What would it take to do DAS with inner product arguments (IPAs)? by Vitalik Buterin (Feb 2022)
- Danksharding without KZG a fragment of Vitalik’s talk at SBC (Sep 2022)
- Fraud and Data Availability Proofs: Maximising Light Client Security and Scaling Blockchains with Dishonest Majorities by Mustafa Al-Bassam, Alberto Sonnino, Vitalik Buterin (Sep 2018)
- Information Dispersal with Provable Retrievability for Rollups by Kamilla Nazirkhanova, Joachim Neu, David Tse (Nov 2021)
- Joachim Neu on the networking side of DAS as well as intuition behind sampling
- Yuval Domb on the specifics of the required interpolations and DFTs for DAS

**KZG commitments**

- “Constant-Size Commitments to Polynomials and Their Applications” by Aniket Kate, Gregory M. Zaverucha, and Ian Goldberg (2010) – the original KZG paper
- “KZG polynomial commitments” by Dankrad Feist (Jun 2020)
- “How to compute all Pointproofs” by Alin Tomescu (2020)
- “Fast Amortized Kate Proofs” by Dankrad Feist and Dmitry Khovratovich (2020)

***

Thanks to Ansgar Dietrichs for providing many insightful comments that informed the post.

***

Valeria Nikolaenko is a Research Partner at a16z crypto. Her research focuses on cryptography and blockchain security. She has also worked on topics such as long-range attacks in PoS consensus protocols, signature schemes, post-quantum security, and multi-party computation. She holds a PhD in Cryptography from Stanford University under advisorship of Professor Dan Boneh, and worked on the Diem blockchain as part of the core research team.

Dan Boneh leads the applied cryptography group at Stanford University and co-directs the computer security lab. Dan’s research focuses on applications of cryptography to computer security, and includes cryptosystems with novel properties, web security, security for mobile devices, and cryptanalysis. He is the author of over one hundred publications and is a recipient of the 2014 ACM prize, the 2013 Godel prize, and the 2011 Ishii award for industry education innovation.

***

*The views expressed here are those of the individual AH Capital Management, L.L.C. (“a16z”) personnel quoted and are not necessarily the views of a16z or its affiliates. a16z is an investment adviser registered with the U.S. Securities and Exchange Commission. Registration as an investment adviser does not imply any special skill or training. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. In addition, this content may include third-party information; a16z has not reviewed such material and does not endorse any advertising content contained therein.*

*This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities, digital assets, investment strategies or techniques are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments or investment strategies will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments for which the issuer has not provided permission for a16z to disclose publicly as well as unannounced investments in publicly traded digital assets) is available at **https://a16z.com/investments/**.*

*Additionally, this material is provided for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. Investing in pooled investment vehicles and/or digital assets includes many risks not fully discussed herein, including but not limited to, significant volatility, liquidity, technological, and regulatory risks. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Please see **https://a16z.com/disclosures** for additional important information.*