System Design – Distributed Storage Systems – Q&A

Q1. What are distributed storage systems?

Answer: A distributed storage system stores data across multiple devices and locations (nodes), usually without a single central store. Spreading the data across nodes provides redundancy, fault tolerance, and high availability, and it helps preserve data integrity when hardware, network, or software failures occur.
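To make the redundancy argument concrete, here is a minimal sketch of how keeping several copies raises availability. It assumes that each copy lives on a different node, that nodes fail independently, and that a read succeeds if at least one copy is reachable. The function name and the sample numbers are illustrative only and do not describe any particular product.

```python
def availability_with_replicas(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is reachable.

    Assumption: node failures are independent, so the chance that every
    copy is down at once is (1 - node_availability) ** replicas.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - node_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical nodes that are each reachable 99% of the time.
    for copies in (1, 2, 3):
        print(copies, f"{availability_with_replicas(0.99, copies):.6f}")
```

Under these assumptions, three copies on 99%-available nodes give roughly 99.9999% availability. Real systems see correlated failures (a shared rack, power feed, or region), so actual figures are usually lower than this simple model suggests.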

Q2. What are some of the popular distributed storage systems?

Answer: Some of the popular distributed storage systems are Hadoop Distributed File System (HDFS), Ceph, GlusterFS, Amazon S3 (Simple Storage Service), Google Cloud Storage, Microsoft Azure Blob Storage, IPFS (InterPlanetary File System) and Storj.

Q3. What other features of distributed storage systems are important besides performance and availability?

Answer: Data security and reliability are also important features of distributed storage systems, even though they are not the primary goals. Without them, performance and availability mean little in a real product, so they should be considered in any system design.

Q4. What are SLAs provided by distributed storage systems?

Answer: SLA stands for Service Level Agreement. It is a formal agreement between a service provider, such as Amazon S3, and a customer who wants to use the service. The agreement mainly covers factors such as performance, reliability, availability, data durability, and support response times, but it can include several other factors as well.

Q5. Can you explain some of the common SLA components offered by providers of distributed storage systems?

Answer: Below are some standard SLA components that distributed storage service providers offer.

1. Data Durability: Data durability is the likelihood that stored data is not lost over a specific period. Service providers often express it as a number of nines (e.g., 99.999999999% in a year), meaning that a stored object is expected to survive the year without loss with that probability.

2. Data Availability: Data availability is the percentage of time the storage system is available and accessible. Service providers often express it as a number of nines (e.g., 99.9% in a year), meaning the system is expected to be accessible for that share of the year.

3. Latency: Often called round-trip latency, this is the time the system takes to respond to a specific request (e.g., a read or write). Service providers may commit to a latency target that is meant to hold even in worst-case scenarios.

4. Throughput: Throughput is the amount of data transferred between the requester and the storage system in a given time. It is loosely referred to as bandwidth. An SLA typically promises a minimum level of throughput.

5. Data Security and Protection: SLAs on data security may not be expressed in numbers like the other SLAs. Instead, the service provider promises to adhere to certain standards for encryption and data storage, and to comply with regulations such as GDPR and HIPAA.

6. Recovery Point Objective (RPO) and Recovery Time Objective (RTO): In a disaster recovery operation, RPO is the maximum acceptable data loss, measured as a span of time, and RTO is the maximum acceptable time to restore service.

7. Support Response Time: Support response time defines how long support takes to answer when you raise a problem and to reply during the follow-up conversation. It depends on your membership tier and the severity of the ticket. For an issue affecting production, the service provider would usually have to respond within minutes, or at most within an hour.
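The percentages and time limits in the components above are easier to reason about once they are turned into concrete numbers. The following is a minimal sketch, not a provider's official formula: it converts an availability percentage into a yearly downtime budget, converts a durability percentage into an expected number of lost objects, and checks an incident against RPO and RTO targets. All function names, the 365-day year, and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta

HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumes a non-leap year


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an availability percentage."""
    return period_hours * (1.0 - availability_percent / 100.0)


def expected_objects_lost(durability_percent: float, object_count: int) -> float:
    """Average number of objects expected to be lost over the SLA period.

    Floating-point rounding makes the result approximate when the
    percentage has many nines.
    """
    return object_count * (1.0 - durability_percent / 100.0)


def meets_recovery_objectives(last_backup: datetime, failure: datetime,
                              restored: datetime,
                              rpo: timedelta, rto: timedelta) -> bool:
    """True if an incident stayed within both the RPO and the RTO."""
    data_loss_window = failure - last_backup  # work done since the last backup
    recovery_time = restored - failure        # how long the service was down
    return data_loss_window <= rpo and recovery_time <= rto


if __name__ == "__main__":
    print(f"{allowed_downtime_hours(99.9):.2f} hours of downtime per year")
    print(f"{expected_objects_lost(99.999999999, 10_000_000):.6f} objects lost per year")

    # Hypothetical incident: backup at 12:00, failure at 12:10, restored at 12:40.
    day = datetime(2024, 1, 1)
    print(meets_recovery_objectives(
        last_backup=day.replace(hour=12, minute=0),
        failure=day.replace(hour=12, minute=10),
        restored=day.replace(hour=12, minute=40),
        rpo=timedelta(minutes=15),
        rto=timedelta(hours=1),
    ))
```

Under these assumptions, 99.9% availability allows about 8.76 hours of downtime per year, and eleven nines of durability on ten million objects corresponds to roughly 0.0001 objects lost per year on average, or about one object every 10,000 years.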

Q6. What storage systems do ordinary users, and not just big corporations, use in day-to-day operations?

Answer: Most computer and device users use Dropbox, Box, Google Drive, Microsoft OneDrive, iCloud, etc., as part of their day-to-day work and device usage. These are also huge distributed storage systems with complex back-end designs, but as users we mostly do not care how they are designed or how they meet their SLAs.