The cut-set bound derived in Network Coding for Distributed Storage Systems is achievable under functional repair. However, it remains an open problem whether the same bound can be achieved under systematic repair (also called exact repair).
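For reference, the cut-set bound can be stated as follows, using the standard notation of the regenerating-codes literature (the symbols are not defined in the text above, so this is supplied for context): a file of B symbols is stored across n nodes, each holding α symbols; a failed node is repaired by downloading β symbols from each of d helper nodes, and any k nodes must suffice to recover the file.

```latex
B \;\le\; \sum_{i=0}^{k-1} \min\{\alpha,\; (d-i)\beta\}
```

The minimum-storage (MSR) and minimum-bandwidth (MBR) points discussed below are the two extremes of the tradeoff curve traced out by this bound.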
1. Distributed Storage Codes with Repair-by-Transfer and Non-achievability of Interior Points on the Storage-Bandwidth Tradeoff (Shah, Rashmi, Kumar and Ramchandran) shows that the cut-set bound derived for functional-repair is not achievable for exact-repair at essentially all points on the interior of the storage-repair bandwidth tradeoff.
2. Optimal Exact-Regenerating Codes for Distributed Storage at the MSR and MBR Points via a Product-Matrix Construction (Rashmi, Shah and Kumar) shows via construction of explicit codes that the cut-set bound is tight for all [n, k, d] at the minimum bandwidth extremum, and for all [n, k, d ≥ 2k-2] at the minimum storage extremum.
3. Distributed Data Storage with Minimum Storage Regenerating Codes - Exact and Functional Repair are Asymptotically Equally Efficient (Cadambe, Jafar and Maleki) and On the Existence of Optimal Exact-Repair MDS Codes for Distributed Storage (Suh and Ramchandran) show that the minimum storage (MSR) extremum is achievable asymptotically (i.e., as the number of message symbols → ∞) for all [n, k, d].
4. Explicit Codes Minimizing Repair Bandwidth for Distributed Storage (Shah, Rashmi, Kumar and Ramchandran) shows that the minimum storage (MSR) point is not achievable with scalar (β = 1) linear codes when d < 2k - 3.
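The cut-set bound and its two extremal operating points can be computed directly. The sketch below (function names are illustrative, not from any of the cited papers) evaluates the bound for given per-node storage α and per-helper repair bandwidth β, and returns the MSR and MBR points for a file of B symbols; exact rational arithmetic is used since the extremal points are generally non-integer.

```python
from fractions import Fraction

def cutset_file_size(k, d, alpha, beta):
    """Maximum file size B supported by the cut-set bound:
    B <= sum_{i=0}^{k-1} min(alpha, (d - i) * beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))

def msr_point(B, k, d):
    """Minimum-storage (MSR) point: smallest per-node storage alpha = B/k,
    with the smallest beta that still meets the cut-set bound."""
    alpha = Fraction(B, k)
    beta = Fraction(B, k * (d - k + 1))
    return alpha, beta

def mbr_point(B, k, d):
    """Minimum-bandwidth (MBR) point: smallest total repair download
    d * beta; here the node stores exactly what it downloads (alpha = d*beta)."""
    beta = Fraction(2 * B, k * (2 * d - k + 1))
    return d * beta, beta
```

For example, with B = 12, k = 3, d = 4, the MSR point is (α, β) = (4, 2), and both extremal points meet the cut-set bound with equality.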
During repair of a failed node, the amount of data that must be communicated over the network to the replacement node is analyzed here. In general, this process may require the helper disks to compute random linear combinations of all the data they store, thereby requiring them to read all of that data. When the storage nodes are distributed over a network, the network bandwidth is typically much smaller than the internal disk I/O bandwidth. However, this is not always the case, and it is also of interest to analyze the amount of data that actually needs to be read from the disks helping in the repair.
Research in the direction of reducing the number of disk reads during repair:
- Explicit Construction of Optimal Exact Regenerating Codes for Distributed Storage (Rashmi, Shah, Kumar, Ramchandran) presents the first regenerating codes that repair a failed node without performing any computation, i.e., that perform uncoded repair (repair-by-transfer). These codes operate at the MBR point, for the case d = n - 1.
- Fractional Repetition Codes for Repair in Distributed Storage Systems (El Rouayheb, Ramchandran) generalizes the construction in the above paper to a larger set of parameters by relaxing the repair requirement to one where only a specific set of d nodes can help repair a particular failed node. The paper also provides an upper bound on the maximum amount of data that can be stored in systems using these codes.
- Enabling Node Repair in Any Erasure Code for Distributed Storage (Rashmi, Shah, Kumar) allows one to introduce repair properties into any existing erasure code, although under relaxed requirements compared to the regenerating-codes setting. The number of symbols read during repair is proportional to the sparsity of the generator matrix of the code employed; hence, employing LDGM or Fountain codes allows repair with a small number of reads.
- On Codes for Optimal Rebuilding Access (Wang, Tamo, Bruck) presents vector-linear MDS codes that can rebuild any single failed node exactly with the minimum number of disk reads. In their construction, the number of systematic nodes and the number of parity-check nodes are arbitrary.
- Functional-Repair-by-Transfer Regenerating Codes (Shum, Hu) shows that, for functional repair, there exist scalar-linear repair-by-transfer regenerating codes with minimum storage, k = 2 and d = n - 1.
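The repair-by-transfer idea at the MBR point with d = n - 1 can be illustrated with a small sketch. One symbol is associated with each edge of the complete graph on the n nodes, and each node stores the symbols of its incident edges, so every symbol lives on exactly two nodes; a failed node is repaired by each survivor forwarding the one symbol it shares with it, with no arithmetic anywhere. This is only the placement skeleton of such constructions: the outer MDS precoding that makes the stored symbols actually recoverable from any k nodes is omitted, and the function names are illustrative.

```python
from itertools import combinations

def build_storage(n):
    """One symbol per pair of nodes (an edge of the complete graph K_n).
    Node i stores the symbols of all edges incident to it, so each node
    holds n - 1 symbols and each symbol is replicated on exactly 2 nodes."""
    return {i: {frozenset(e) for e in combinations(range(n), 2) if i in e}
            for i in range(n)}

def repair(storage, failed):
    """Repair-by-transfer: each surviving node forwards the single symbol
    whose edge touches the failed node; no computation is performed."""
    replacement = set()
    for helper, symbols in storage.items():
        if helper == failed:
            continue
        shared = {s for s in symbols if failed in s}
        assert len(shared) == 1  # exactly one edge joins helper and failed
        replacement |= shared
    return replacement
```

Running `repair` for any node reproduces exactly the set of symbols that node stored, using n - 1 reads of one symbol each.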
How do we allocate a given storage budget over a set of nodes so that the probability of successful recovery is maximized? The distributed storage allocation problem remains challenging even for very simple failure (or access) models. For example, even when each node fails independently with the same probability, spreading the budget equally over a subset of the nodes can be suboptimal.
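This effect is easy to see by exhaustive computation. The sketch below assumes a common simplified model (not spelled out in the text above): a file of size 1 is MDS-coded, node i stores an amount x_i of coded data, each node fails independently with probability p, and recovery succeeds iff the surviving nodes together hold at least 1 unit. The success probability is computed exactly by enumerating all failure patterns.

```python
def success_prob(allocation, p_fail):
    """Exact probability that the surviving nodes store >= 1 unit of
    (MDS-coded) data, with i.i.d. node failures of probability p_fail."""
    n = len(allocation)
    total = 0.0
    for mask in range(1 << n):  # mask = set of surviving nodes
        stored = sum(a for i, a in enumerate(allocation) if (mask >> i) & 1)
        if stored >= 1:
            prob = 1.0
            for i in range(n):
                prob *= (1 - p_fail) if (mask >> i) & 1 else p_fail
            total += prob
    return total
```

With budget 2 over 4 nodes and p_fail = 0.5, concentrating two full copies on two nodes succeeds with probability 0.75, while spreading 0.5 per node gives only 11/16 = 0.6875; at p_fail = 0.1 the ordering reverses and spreading wins. The best allocation thus depends delicately on the failure probability, which is exactly what makes the problem hard.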