Thinking About Data Systems: Reliable, Scalable, and Maintainable Applications.

Chapter 1 Summary of Martin Kleppmann's 'Designing Data-Intensive Applications'

Context

Many modern applications are data-intensive rather than compute-intensive: the sheer volume of data, its complexity, and the speed at which it changes are the bigger challenges. A data-intensive application is typically built from standard building blocks: a database to store data and retrieve it later, a cache to speed up reads, search indexes that let users filter data in various ways, stream processing to send messages to other processes to be handled asynchronously, and batch processing to periodically crunch large amounts of accumulated data.

As a result, knowing which tools and approaches are best suited to the task at hand is critical when developing an application. The distinctions between many of these data storage and processing tools are becoming increasingly blurred, especially since a growing number of applications have requirements that no single tool can meet. Consequently, several tools are often stitched together with application code, each handling a different part of the application's work.

Designing a data system or service raises many questions, since many factors influence the design. For most software systems, the three most important concerns are:

  • Reliability - the system should continue to work correctly even when things go wrong.
  • Scalability - there should be reasonable ways of dealing with the system's growth as it occurs.
  • Maintainability - different people should be able to work productively on the system.

1. Reliability

Typical software expectations include:

  • Performing operations as expected by the user
  • Tolerating user errors or unexpected uses of software
  • Good enough performance for the desired use-case, under expected load and volume of data
  • Preventing any unwanted access or abuse

We can reasonably conclude that reliability means continuing to work correctly even when things go wrong. The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. A fault is not the same as a failure: a fault is one component of the system deviating from its spec, while a failure is when the system as a whole stops providing the required service. Since it is impossible to reduce the probability of a fault to zero, it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
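
To make fault tolerance concrete, here is a minimal sketch (my own illustration, not the book's) of one common mechanism: retrying an operation that can suffer transient faults, with exponential backoff. The call_with_retries helper and the flaky_fetch operation, with its 30% fault rate, are hypothetical.

    import random
    import time

    def call_with_retries(operation, max_attempts=5, base_delay=0.1):
        """Tolerate transient faults by retrying with exponential backoff."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # the fault becomes a failure only after all retries
                # Back off exponentially, with jitter to avoid retry storms.
                delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
                time.sleep(delay)

    def flaky_fetch():
        # Hypothetical operation that suffers a transient fault 30% of the time.
        if random.random() < 0.3:
            raise ConnectionError("transient network fault")
        return "payload"

    print(call_with_retries(flaky_fetch))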

Hardware faults

Hard disk crashes, faulty RAM, power grid blackouts, and even someone unplugging the wrong network cable are all examples of hardware faults. The usual first response is to add redundancy to individual hardware components in order to lower the system's failure rate. Until recently, that was sufficient. However, as data volumes and computing demands have grown, so has the rate of hardware faults. As a result, there has been a shift toward building systems that can tolerate the loss of entire machines, instead of or in addition to hardware redundancy.

Software faults

Systematic errors within the system are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults. The bugs behind these kinds of failures often lie dormant for a long time until they are triggered by an unusual set of circumstances. There is no quick solution to the problem of systematic software faults.

Human faults

Humans design and build systems, and humans keep them running. Unfortunately, humans are known to be unreliable, even when they have the best of intentions.

How, then, do we make our systems reliable in spite of unreliable humans? The best systems combine several approaches:

  • Design systems in a way that minimizes opportunities for error
  • Decouple the places where people make the most mistakes from the places where mistakes can cause failures
  • Test thoroughly at all levels, from unit tests to whole-system integration tests
  • Allow quick and easy recovery from human error, to minimize the impact when a failure does occur
  • Set up monitoring that is both detailed and clear
  • Put in place good management practices and training

2. Scalability

Even if a system works reliably today, that doesn't mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or it is processing much more data than before. Scalability is the term we use to describe a system's ability to cope with increased load.

Describing load

To discuss questions of growth (what happens if our load doubles?), we first need to describe the current load on the system succinctly. Load can be described with a few numbers called load parameters. The best choice of parameters depends on the system's architecture: it may be requests per second to a web server, the ratio of reads to writes in a database, the hit rate on a cache, or something else entirely.
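
As a toy illustration of these parameters (my own example, with made-up numbers, not from the book), here is how one might derive requests per second, the read/write ratio, and the cache hit rate from a small request log:

    from collections import Counter

    # Hypothetical request log entries: (timestamp in seconds, operation, cache hit?)
    request_log = [
        (0.0, "read", True), (0.2, "read", False), (0.5, "write", False),
        (1.1, "read", True), (1.4, "read", True), (1.9, "write", False),
    ]

    duration = request_log[-1][0] - request_log[0][0]
    ops = Counter(op for _, op, _ in request_log)
    read_hits = [hit for _, op, hit in request_log if op == "read"]

    print(f"requests/sec:   {len(request_log) / duration:.1f}")      # ~3.2
    print(f"read:write:     {ops['read']}:{ops['write']}")           # 4:2
    print(f"cache hit rate: {sum(read_hits) / len(read_hits):.0%}")  # 75%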

Describing performance

Once you have described the load on a system, you can investigate what happens when the load increases. There are two ways to look at it:

  • When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
  • When you increase a load parameter, how much do you need to increase the resources to keep performance unchanged?

Answering both questions requires performance metrics, so let's take a quick look at describing the performance of a system.

What counts as performance differs from one system to the next. In batch processing systems such as Hadoop, what usually matters is throughput: the number of records processed per second. In online systems, what matters is the service's response time: the time between a client sending a request and receiving a response. In practice, for a system handling a variety of requests, the response time can vary a lot. Some of the variation is explainable: slow requests may be intrinsically more expensive, for example because they process more data. But even when you'd expect all requests to take the same time, variation remains.

Percentiles are a good way to describe typical response times. Take the median of a sorted list of response times: half of user requests are served in less than the median response time, and the other half take longer, so the median tells you how long users typically have to wait. The median is also known as the 50th percentile, commonly abbreviated p50.

High percentiles of response times, also known as tail latencies, are important because they directly affect users' experience of the service. It is therefore critical to look at higher percentiles to see how bad your outliers are; the 95th, 99th, and 99.9th percentiles are common choices (abbreviated p95, p99, and p999).
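
As an illustration (with simulated latencies rather than real measurements), percentiles can be computed from a sorted sample of response times using the nearest-rank method:

    import math
    import random

    def percentile(sorted_times, p):
        """Nearest-rank percentile: the smallest value covering p% of the sample."""
        rank = max(0, math.ceil(p / 100 * len(sorted_times)) - 1)
        return sorted_times[rank]

    # Simulated latencies in milliseconds: mostly fast, with a long slow tail.
    latencies = sorted(random.lognormvariate(3, 0.8) for _ in range(10_000))

    for p in (50, 95, 99, 99.9):
        print(f"p{p}: {percentile(latencies, p):.1f} ms")

In practice you would compute percentiles over a rolling window of recent requests. Note also that naively averaging percentiles from several machines is mathematically meaningless; the right way to aggregate is to combine the underlying histograms.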

Approaches for coping with load

Having looked at how to describe load and how to measure performance, we can now discuss scalability properly: how do we maintain good performance even when the load parameters increase?

An architecture that is appropriate for one level of load is unlikely to cope with ten times that load, so a growing system's architecture will need to be rethought along the way. Discussions of scaling often frame it as a dichotomy between scaling up (vertical scaling: moving to a more powerful machine) and scaling out (horizontal scaling: distributing the load across more, smaller machines), but good architectures usually involve a pragmatic mixture of the two approaches.

Some systems are elastic: they automatically add computing resources when they detect an increase in load. Others must be scaled manually, with a human analyzing capacity and deciding whether to add more machines to the system. An elastic system can be useful when load is highly unpredictable, but a manually scaled system is simpler and may have fewer operational surprises.
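
As a sketch of what the automatic case might look like (a hypothetical decision rule, not a production autoscaler), an elastic system could periodically resize its machine pool so that average utilization tracks a target:

    # Hypothetical autoscaling rule; real systems usually delegate this
    # decision to a cloud platform's autoscaler rather than hand-rolling it.
    TARGET_CPU = 0.60                    # desired average CPU utilization
    MIN_MACHINES, MAX_MACHINES = 2, 64   # hard limits on the pool size

    def desired_machine_count(current_machines: int, avg_cpu: float) -> int:
        """Resize the pool so average utilization moves toward the target."""
        wanted = round(current_machines * avg_cpu / TARGET_CPU)
        return max(MIN_MACHINES, min(MAX_MACHINES, wanted))

    print(desired_machine_count(8, 0.90))  # 8 machines at 90% CPU -> scale out to 12
    print(desired_machine_count(8, 0.30))  # 8 machines at 30% CPU -> scale in to 4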

Architectures for large-scale systems are usually highly specific to the application. An architecture that scales well for a particular application is built around its load parameters, that is, assumptions about which operations will be common and which will be rare.

3. Maintainability

It is well known that the majority of software costs are incurred not during initial development but during ongoing maintenance. We should therefore design software systems in ways that minimize the pain of maintenance. Let's look at three design principles for software systems that have a big impact on maintainability.

Operability: Making Life Easy for Operations

Systems should be built in such a way that the operations team can keep them running smoothly. Good operability means making routine tasks easy, allowing the operations team to focus on higher-value activities.

Simplicity: Managing Complexity

Complex systems slow everyone down, which drives maintenance costs even higher. When complexity makes maintenance hard, budgets and schedules are often overrun, and there is a greater risk of introducing bugs when making a change. Conversely, reducing complexity greatly improves maintainability, so simplicity should be a key goal of system design.

Evolvability: Making Change Easy

Changes to system requirements are inevitable. They may come from new facts, new use cases, shifting business priorities, requests for new features, the replacement of obsolete platforms, and so on. Evolvability, the ease with which a system adapts to changing requirements, is closely linked to simplicity and good abstractions: simple, easy-to-understand systems are usually easier to modify than complex ones.

Summary

This article has covered some fundamental ways of thinking about data-intensive applications. To be useful, an application must meet various requirements: functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways) and nonfunctional requirements (general properties such as security, reliability, compliance, scalability, compatibility, and maintainability). This article focused on reliability, scalability, and maintainability.