System Design – Databases – Q&A
Q1. How do you scale databases?
Answer: Like compute resources, database servers can be scaled vertically and horizontally. Vertical scaling means adding more CPU, RAM, and storage to the database server. Horizontal scaling of databases involves techniques such as sharding, read replicas, indexing, caching, clustering, connection pooling, data archiving and purging, and using managed database services. We will see each of these in upcoming questions.
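Read replicas are a good example of how horizontal scaling shows up in application code. Below is a minimal, hypothetical Python sketch of a router that sends writes to the primary and spreads reads across replicas; the class name and the connection objects are illustrative placeholders, not a real library API.

```python
import random

class RoutingConnection:
    """Hypothetical router: writes go to the primary, reads are
    spread across read replicas (a common read-replica pattern)."""

    def __init__(self, primary, replicas):
        self.primary = primary      # connection to the primary server
        self.replicas = replicas    # connections to read replicas

    def execute_write(self, query, params=()):
        # All writes must hit the primary so the replicas can copy from it.
        return self.primary.execute(query, params)

    def execute_read(self, query, params=()):
        # Reads are load-balanced across replicas; fall back to the
        # primary if no replica is available.
        target = random.choice(self.replicas) if self.replicas else self.primary
        return target.execute(query, params)
```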
Q2. What is a cold standby database server?
Answer: A cold standby server is a secondary server that receives periodic backups, taken automatically or manually, from the primary server. If the primary server fails, the cold standby server is brought up and made active for failover. Note that the cold standby server will not always be as up-to-date as the primary server, and there will be downtime during failover. However, this is the cheapest possible failover facility for a database server, and the same idea applies to any server, not just a database server. To reduce data loss and downtime, use hot standby servers.
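To make the cold standby idea concrete, here is a minimal sketch assuming a PostgreSQL primary backed up with pg_dump; the hostnames, paths, and backup interval are hypothetical.

```python
import subprocess
import time

BACKUP_INTERVAL_SECONDS = 6 * 60 * 60  # hypothetical: back up every 6 hours

def take_backup():
    # pg_dump writes a logical backup of the primary database; the file
    # would then be copied to the cold standby host for safekeeping.
    stamp = time.strftime("%Y%m%d%H%M%S")
    subprocess.run(
        ["pg_dump", "--host=primary-db.example.com", "--dbname=appdb",
         f"--file=/backups/appdb-{stamp}.sql"],
        check=True,
    )

while True:
    take_backup()
    # Anything written after the last backup is at risk until the next run,
    # which is exactly why a cold standby can lose data on failover.
    time.sleep(BACKUP_INTERVAL_SECONDS)
```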
Q3. What is a warm standby server?
Answer: A warm standby server, unlike a cold standby server, is partially active and continuously takes backups from the primary server at regular intervals; this process of backing up data at regular intervals is also called replication. It is ready to take requests when the primary server goes down. Compared to a cold standby server, data loss and downtime are minimized.
Q4. What is a hot standby server?
Answer: A hot standby server has data written to it simultaneously with the primary server. It is ready to take over almost instantaneously if the primary server fails, with no data loss or downtime.
Q5. Do cold, warm, and hot standby server usages come under vertical or horizontal scaling?
Answer: Both cold and warm standby server usages come under vertical scaling. Even if the secondary server is more powerful, it is still a single server handling all requests; if a heavy workload brought the primary server down, the secondary server would go down under the same load. Strictly speaking, this is less a scaling technique than a failover and fault-tolerance strategy.
The hot standby server, on the other hand, can serve “read” requests from clients while the primary server is active, because data is written simultaneously to both servers. In this way, it comes under horizontal scaling.
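A minimal, illustrative Python sketch of this pattern is below: writes are applied to the primary and the standby together, and reads can be offloaded to the standby. Real systems would rely on the database's built-in synchronous replication; the connection objects here are placeholders.

```python
class HotStandbyPair:
    """Sketch of a hot standby setup: every write is applied to the
    primary and the standby together, so the standby can serve reads
    and take over instantly on failure."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def write(self, query, params=()):
        # Apply the write on both servers before acknowledging,
        # so the standby is never behind the primary.
        self.primary.execute(query, params)
        self.standby.execute(query, params)

    def read(self, query, params=()):
        # Reads can be served by the standby while the primary handles
        # writes -- this is the horizontal-scaling aspect.
        return self.standby.execute(query, params)

    def failover(self):
        # On primary failure, promote the standby with no data loss.
        self.primary = self.standby
```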
Q6. What is Sharding?
Answer: Sharding is a database scaling technique. Instead of using one database server for the entire database, we split the database into shards, or divisions of the database, and use a separate database server for each shard. Each shard can also have its own backup shard or scaling strategy.
Q7. What are the strategies used for sharding?
Answer: Sharding divides the database into chunks to increase the overall scalability and performance of database queries. It is achieved using range-based sharding, hash-based sharding, directory-based sharding, and geographic sharding.
Q8. Can you explain each of the sharding techniques?
Answer: Below is a practical explanation of each sharding technique.
Range-based sharding: The database is divided into ranges, for example, employees 1-1000 in shard 1, employees 1001-2000 in shard 2, and so on. A query for a specific employee is quickly routed to the shard holding that range, which improves performance. (A combined code sketch for the first three techniques follows this list.)
Hash-based sharding: Each data request is identified by a key. Applying a hash function to that key identifies the shard where the data is stored. This arrangement of shards is called hash-based sharding.
Directory-based sharding: A separate lookup or directory service determines at runtime which shard a data request should be redirected to.
Geographic sharding: The shards are divided based on geography, for example, all employees from country X in shard 1 and those from country Y in shard 2; dividing by data source can be another strategy. If we know that all requests for country X's data come from country X itself, the database server for that shard can be kept in that geographic location for improved performance.
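Here is a minimal Python sketch of how a request router might pick a shard under the first three strategies; the shard count, ranges, and directory entries are hypothetical.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

# Range-based: the upper bound of each shard's employee-ID range
# (shards indexed from 0 here, so IDs 1-1000 map to shard 0).
RANGE_BOUNDS = [1000, 2000, 3000, 4000]

def range_shard(employee_id):
    # bisect finds the first range whose upper bound covers the ID.
    return bisect.bisect_left(RANGE_BOUNDS, employee_id)

def hash_shard(key):
    # A stable hash keeps the same key on the same shard across runs
    # (Python's built-in hash() is salted per process, so avoid it here).
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Directory-based: an explicit lookup table maps keys to shards.
DIRECTORY = {"emp-42": 2, "emp-77": 0}  # hypothetical entries

def directory_shard(key):
    return DIRECTORY[key]

print(range_shard(1500))     # -> 1
print(hash_shard("emp-42"))  # deterministic shard in 0..3
```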
Q9. SQL is best for sharding databases. True/False?
Answer: False. In SQL and other relational databases, data is related across tables. If related tables sit in different shards, an update to one table requires an update to another, so processing is no longer independent across shards, and the database servers can no longer operate without worrying about dependencies. Hence, NoSQL databases are best suited for sharding.
Q10. Can you give some examples of sharding database offerings?
Answer: Some popular offerings that support sharding are MongoDB, Apache Cassandra, Google Cloud Spanner, Amazon DynamoDB, CockroachDB, MySQL with MySQL Cluster, and HBase in the Apache Hadoop ecosystem.
Q11. What is Denormalization?
Answer: Say one table holds Reward Name and Employee ID, where the same Employee ID can appear against multiple rewards, while Employee ID and Employee Name live in a second table. A READ operation that needs the Employee Name for a given reward requires a JOIN across the two tables. If these two tables live in two shards, each with its own database server, JOIN operations become very costly, and even updates may need to contact both shards.
Denormalization is a way of combining the tables that need JOIN or similar operations into one table, which increases data redundancy. In the previous example, joining Tables 1 and 2 means that wherever an Employee ID appears with a reward we queried, the respective Employee Name is added as a third column. Storage use and redundancy increase, but a single lookup query now returns all the data.
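Below is a small, self-contained sketch of this example using Python's built-in sqlite3 module; the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized form: rewards and employees live in separate tables,
# so reading "who won which reward" needs a JOIN.
conn.executescript("""
    CREATE TABLE rewards (reward_name TEXT, employee_id INTEGER);
    CREATE TABLE employees (employee_id INTEGER, employee_name TEXT);
    INSERT INTO employees VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO rewards VALUES ('Star of the Month', 1),
                               ('Innovation Award', 1),
                               ('Team Player', 2);
""")
joined = conn.execute("""
    SELECT r.reward_name, e.employee_name
    FROM rewards r JOIN employees e ON r.employee_id = e.employee_id
""").fetchall()

# Denormalized form: the employee name is copied into the rewards
# table, trading extra storage and redundancy for a single-table lookup.
conn.executescript("""
    CREATE TABLE rewards_denorm (reward_name TEXT, employee_id INTEGER,
                                 employee_name TEXT);
    INSERT INTO rewards_denorm
        SELECT r.reward_name, r.employee_id, e.employee_name
        FROM rewards r JOIN employees e ON r.employee_id = e.employee_id;
""")
lookup = conn.execute(
    "SELECT reward_name, employee_name FROM rewards_denorm"
).fetchall()

print(joined == lookup)  # same answer, no JOIN needed the second time
```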
Q12. What are Data Lakes?
Answer: Data Lakes are centralized facilities, particularly on the Cloud, for storing large amounts of unstructured and structured data. They provide storage along with automated tools and technologies for organizing, managing, analyzing, and using that data. Broadly, the popular Cloud tools for storing and processing data together constitute a data lake.
The methods employed for these processes, along with example tools, are given below.
Data Ingestion: Data ingestion is the process of acquiring and importing data into the data lake. Ex: AWS Glue.
Storage with fault tolerance and scalability at reasonable cost: These services provide the main storage layer. Ex: Hadoop HDFS, Amazon S3, and Azure Data Lake Storage.
Data Cataloging: These services organize data and metadata into data catalogs. Ex: Apache Atlas, AWS Glue Data Catalog, and Azure Data Catalog.
Cleaning, Normalization, Aggregation: Data lakes can perform these processes automatically. Ex: Apache Spark, Apache Flink, and AWS Glue ETL (see the sketch after the next paragraph).
Besides these, data integration with programming and analytical tools, security, and data lifecycle management are also taken care of by data lakes.
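As an illustration of the cleaning, normalization, and aggregation step, here is a minimal PySpark sketch; the bucket paths, column names, and app name are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-clean").getOrCreate()

# Ingest raw, semi-structured events from the lake's landing zone
# (the S3 path is hypothetical).
raw = spark.read.json("s3a://example-lake/landing/events/")

# Cleaning and normalization: drop duplicates and rows missing keys,
# and normalize the country code to upper case.
clean = (raw.dropDuplicates(["event_id"])
            .dropna(subset=["event_id", "country"])
            .withColumn("country", F.upper(F.col("country"))))

# Aggregation: events per country, written back to a curated zone.
agg = clean.groupBy("country").agg(F.count("*").alias("event_count"))
agg.write.mode("overwrite").parquet(
    "s3a://example-lake/curated/events_by_country/")
```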