System Design – Fault Tolerance

Q1. What is Fault Tolerance?

Answer: Fault tolerance, or resiliency, in system design is the ability of a system to keep operating when components fail or are disrupted. A fault-tolerant system has a backup plan ready, detects failures automatically, and either corrects them on its own or alerts the administrators. Users should experience little or no downtime, and recovery should be seamless to them.

Q2. What are the basic steps taken for fault tolerance in system design?

Answer: The basic steps for fault tolerance or resiliency in any system design are listed below.

Redundancy and Failover: Replicate all critical components, such as databases, servers, and network components, so that no single component is a point of failure. Keep backup servers ready to take over when a primary component fails. This takeover is called failover.

Load Balancing: Never let a single server handle all requests; use a load balancer to distribute requests across multiple servers. When a server fails, have a mechanism that marks it as failed so that the load balancer no longer sends requests to it.
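
The load-balancing step above can be sketched as a round-robin balancer that skips servers marked as failed. This is a minimal illustration, not a production design: the class name `RoundRobinBalancer`, its methods, and the server names are all hypothetical, and a real balancer would usually mark servers based on health checks rather than manual calls.

```python
class RoundRobinBalancer:
    """Round-robin balancer that skips servers marked as failed (illustrative)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.failed = set()
        self._next = 0  # index of the next server in the rotation

    def mark_failed(self, server):
        self.failed.add(server)

    def mark_healthy(self, server):
        self.failed.discard(server)

    def pick(self):
        # Try each server at most once, starting from the rotation pointer.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server not in self.failed:
                return server
        raise RuntimeError("no healthy servers available")


# Hypothetical server names, used only for this example.
balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
balancer.mark_failed("app-2")
picks = [balancer.pick() for _ in range(4)]
print(picks)  # ['app-1', 'app-3', 'app-1', 'app-3']
```

Once `mark_healthy("app-2")` is called, the server rejoins the rotation on the next pass.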

Backups: Take regular backups of storage, databases, and other required data so that the system can be restored to its last saved state.

Recovery: Design a recovery mechanism so that a failed component either recovers automatically or alerts the administrators for manual intervention.

Run in Degraded State: Design the system so that a failure never takes it down completely. Instead, it should keep running with a reduced set of features and make it known that it is operating in a degraded state.
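
One way to picture degraded operation is a fallback around a non-critical dependency. The sketch below assumes a hypothetical recommendation feature: the function names, the `degraded` flag, and the default item list are invented for illustration. When the personalized service fails, the caller still gets a useful response and is told that it is a reduced one.

```python
def get_recommendations(user_id, fetch_personalized, default_items):
    """Return personalized items, or a static default list if the service fails."""
    try:
        return {"items": fetch_personalized(user_id), "degraded": False}
    except Exception:
        # Primary path failed: serve a reduced but still useful response
        # and flag it so the caller can inform the user.
        return {"items": list(default_items), "degraded": True}


# Hypothetical stand-ins for a real downstream service.
def working_service(user_id):
    return [f"pick-for-{user_id}"]


def broken_service(user_id):
    raise ConnectionError("recommendation service unreachable")


DEFAULTS = ["bestseller-1", "bestseller-2"]
ok = get_recommendations(42, working_service, DEFAULTS)
fallback = get_recommendations(42, broken_service, DEFAULTS)
print(ok)        # {'items': ['pick-for-42'], 'degraded': False}
print(fallback)  # {'items': ['bestseller-1', 'bestseller-2'], 'degraded': True}
```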

Monitoring and Testing: Implement health monitoring and continuous testing to catch early warning signs before they turn into failures, such as full memory, a faulty CPU, overheating, or network problems. Early detection makes recovery and failover easier.

Q3. What are some of the techniques in fault isolation and containment?

Answer: Fault isolation and containment refers to the ability of a system to keep a failure confined to the component where it occurred and not allow it to spread throughout the system. A few techniques are described below.

Microservices Architecture: Break the application into small services called microservices, where each microservice performs one specific function. Microservices communicate with each other at runtime through interfaces such as APIs.

The services have no compile-time dependency on each other and run independently, so if one service fails, a calling service only receives a warning or error in the API response and, as long as it handles that error, does not fail itself. This makes it easier to isolate and contain the failure.

Bulkheads: The bulkhead pattern assigns separate threads, memory, and database connections to different parts of the system (partitioning resources) and runs each service in its own process, container, or virtual machine (isolating services). When one component fails, the failure stays within its partition and critical components continue to function.
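
The resource-partitioning half of the bulkhead pattern can be sketched with one semaphore per part of the system. This is an illustrative sketch only: the `Bulkhead` class, the compartment names, and the choice to reject calls immediately when a compartment is full (rather than queue them) are assumptions, not something the pattern prescribes.

```python
import threading


class Bulkhead:
    """Caps concurrent calls for one part of the system (illustrative)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args):
        # Reject immediately instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"bulkhead '{self.name}' is full")
        try:
            return func(*args)
        finally:
            self._slots.release()


# Hypothetical compartments: each has its own, separate capacity.
payments = Bulkhead("payments", max_concurrent=2)
reports = Bulkhead("reports", max_concurrent=1)


def slow_report():
    # While this report holds the only "reports" slot, a second report is
    # rejected, but a payment still runs in its own compartment.
    try:
        reports.run(lambda: "second report")
        second = "accepted"
    except RuntimeError:
        second = "rejected"
    payment = payments.run(lambda: "payment ok")
    return second, payment


result = reports.run(slow_report)
print(result)  # ('rejected', 'payment ok')
```

The saturated `reports` compartment turns away extra work without consuming any of the capacity reserved for `payments`.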

Circuit Breakers: The circuit breaker pattern monitors the results of remote calls (e.g., network calls and HTTP requests) and counts failures. Once the number of failures crosses a limit, the breaker marks the component or call as failed and blocks further calls, returning a warning or error immediately. After a predefined period, it lets a call through again; if that call succeeds, normal operation resumes.
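
A minimal circuit breaker along these lines might look like the sketch below. The class and parameter names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) are hypothetical, and counting consecutive failures is one assumption among several possible policies. The injectable `clock` exists only so that the example can simulate the passage of time.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is blocked because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call blocked")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


now = [0.0]  # fake clock value, so the example needs no real waiting
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def flaky():
    raise TimeoutError("upstream timed out")


states = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        states.append("blocked")
    except TimeoutError:
        states.append("failed")

now[0] = 11.0  # pretend the reset timeout has passed
states.append(breaker.call(lambda: "ok"))
print(states)  # ['failed', 'failed', 'blocked', 'ok']
```

The third call never reaches the failing dependency, which is the point of the pattern: callers fail fast while the dependency gets time to recover.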

Q4. What is Geo-Routing?

Answer: Geo-routing, or geolocation-based routing, is a routing mechanism in which user requests are routed to the nearest data center or compute center based on geography. However, requests may sometimes be sent to another location depending on network conditions, traffic, availability, and other factors defined by the system.

This improves performance, reduces latency, and improves the user experience. The same request can also be served differently depending on the preferences of the users. It is also possible to restrict requests from users in a certain country so that they go only to servers in that country.
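
The routing decision described above can be sketched as "pick the nearest eligible data center". The example below is a simplified illustration: the region names and coordinates are hypothetical, great-circle distance stands in for real network proximity, and the `healthy` and `allowed` parameters are assumed stand-ins for availability checks and country restrictions.

```python
import math

# Hypothetical regions with approximate (latitude, longitude) coordinates.
DATA_CENTERS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-south": (19.1, 72.9),
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def route(user_location, healthy, allowed=None):
    """Return the nearest data center that is healthy and permitted."""
    candidates = [
        name for name in DATA_CENTERS
        if name in healthy and (allowed is None or name in allowed)
    ]
    if not candidates:
        raise LookupError("no eligible data center")
    return min(candidates, key=lambda n: haversine_km(user_location, DATA_CENTERS[n]))


paris = (48.9, 2.4)
everything_up = {"us-east", "eu-west", "ap-south"}
nearest = route(paris, healthy=everything_up)
rerouted = route(paris, healthy={"us-east", "ap-south"})  # eu-west unavailable
restricted = route(paris, healthy=everything_up, allowed={"ap-south"})
print(nearest, rerouted, restricted)  # eu-west us-east ap-south
```

The second call shows a request proceeding to another location when the nearest one is unavailable, and the third shows a restriction overriding proximity.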

Q5. What are the techniques used in load balancers and routing to help in geo-routing?

Answer: DNS-based routing and Anycast routing are common techniques used in load balancers and routers to provide geo-routing.

Q6. Which services use geo-routing to improve performance and user experience for content delivery?

Answer: Content Delivery Networks (CDNs) and Cloud Service Providers are the most common users of geo-routing techniques for effective content delivery.

Q7. What is the standard technique of fault tolerance when the servers are distributed across the world, or the system is a large distributed system?

Answer: The standard technique for fault tolerance is to use load balancers and to replicate each type of component, such as a compute server or database server, at every level of the hierarchy of the distributed system. Replicas can be placed at the hardware, department, organization, city, state, and country levels. How far to replicate depends on factors such as profit versus cost and budget availability.