A Practical Framework for System Design and Architecture

June 15, 2026 · 11 min read

system design

Summary

When engineers are asked to design a system architecture, it's often unclear where to start. Should you gather requirements, choose technologies, draw diagrams, or estimate load first?

This article presents a practical framework for architecture design that helps structure the decision-making process and avoid common mistakes. Using a banking notification system as a running example, we'll walk through four key stages: understanding requirements, performing back-of-the-envelope estimations, creating a high-level design, and refining the architecture in detail.

You'll learn how requirements drive architectural decisions, why rough calculations matter before choosing technologies, and how to reason about trade-offs between scalability, reliability, performance, and cost. The approach is equally useful for real-world system design and for System Design interviews.

When you're asked to design the architecture of a system, it's often unclear where to begin. Some engineers immediately start drawing diagrams. Others jump straight into choosing technologies. Some begin building a prototype.

The problem is that good architecture rarely emerges from the first solution that comes to mind. Effective designs usually appear only after you understand the problem you're solving, the constraints you must work within, and the scale the system is expected to handle.

In this article, I'd like to share an approach that helps structure the architecture design process. It can be applied both in real-world engineering projects and during the System Design portion of technical interviews.

When This Approach Makes Sense

Large Systems and High Uncertainty

The approach described below is primarily intended for relatively large systems or problems with a significant degree of uncertainty. If you're building a small service with straightforward requirements and a limited lifespan, going through every stage may be unnecessary and could simply lengthen your time to market.

Architecture is always a trade-off between the quality of design and the speed of delivery.

Using It During System Design Interviews

This approach also works well in interviews. However, it's important to remember that interviews are time-constrained. Nobody expects you to design, within a 45–60 minute session, a system that took hundreds of engineers years to build.

Your goal is to demonstrate how you think: how you identify requirements, analyze trade-offs, and make engineering decisions. Because of this, the depth of each stage should be adjusted based on the time available during the interview.

The Architecture Design Process

Architecture design is rarely a strictly linear process. As new information emerges, you'll often need to revisit earlier decisions, challenge assumptions, and refine parts of the design.

Nevertheless, in most cases the process can be broken down into four major stages:

Understand the problem and gather requirements.
Perform rough capacity estimations.
Build a high-level design.
Dive into detailed architecture.

It's important to understand that these are not four independent steps. Requirements influence estimations. Estimations influence architecture. Architecture influences implementation details. Every decision should be justified by requirements, constraints, or expected workload.

To illustrate this relationship, we'll use a single example throughout the article.

Imagine we're tasked with designing a notification system for a bank. The system must send notifications about money transfers, balance changes, transaction confirmations, and other banking events.

At first glance, the problem appears fairly simple. However, as we dig deeper, we'll discover a large number of architectural decisions and trade-offs hidden beneath the surface.

Stage 1: Understand the Problem

The first question you should ask yourself is: What problem are we actually trying to solve?

Many architectural mistakes happen not because of poor technology choices, but because the problem itself was misunderstood.

It's surprisingly common for discussions about databases, message queues, and microservices to begin long before anyone has a clear understanding of who the users are and what the system is supposed to accomplish.

At this stage, it's useful to actively engage stakeholders and ask questions such as:

Who will use the system?
What are the primary use cases?
Which scenarios are business-critical?
What constraints exist?
What does success look like?

The more context you gather, the better your decisions will be in later stages.

Brainstorming sessions with stakeholders and future users can be particularly valuable. These discussions often uncover hidden requirements or reveal perspectives that were initially overlooked.

Let's return to our example.

At first, the task sounds straightforward: we need a system that sends notifications to users.

After talking to stakeholders, however, we learn that:

notifications are related to banking operations;
both push notifications and email are supported;
losing a notification is unacceptable;
most notifications must be delivered within 30 seconds;
a user may have multiple devices;
push notifications can fail due to third-party providers;
delivery history must be retained for five years;
peak traffic may reach 20 million notifications per hour.

Based on this information, we can derive the system requirements.

Functional Requirements

Send push notifications.
Send email notifications.
View delivery status.
Retry failed notifications.
Maintain an audit history of notifications.

Non-Functional Requirements

Deliver 90% of notifications within 30 seconds.
Prevent message loss.
Maintain high availability.
Store delivery history for five years.
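A requirement such as "90% within 30 seconds" is only useful if it can be checked. Here is a minimal sketch of such a check, assuming delivery latencies are available as a list of seconds; the function names and the sample data are hypothetical and only illustrate the idea.

```python
def fraction_within(latencies_s, limit_s=30.0):
    """Return the share of deliveries that finished within limit_s seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    on_time = sum(1 for latency in latencies_s if latency <= limit_s)
    return on_time / len(latencies_s)


def meets_delivery_target(latencies_s, limit_s=30.0, target=0.90):
    """True when at least `target` of the deliveries met the latency limit."""
    return fraction_within(latencies_s, limit_s) >= target


# Hypothetical sample: nine deliveries within the limit and one slow one.
sample = [1.2, 0.8, 3.5, 12.0, 29.9, 2.1, 0.5, 7.7, 4.4, 95.0]
print(fraction_within(sample))        # 0.9
print(meets_delivery_target(sample))  # True
```

Writing the requirement down this way forces the team to agree on what is being measured, for example whether the clock starts when the banking event occurs or when the notification is queued.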

Architectural Drivers

Not all requirements have the same impact on the architecture. Once requirements have been gathered, it's useful to identify a small set of architectural drivers — requirements that will shape most of the system design decisions.

In our case, these are:

no message loss;
delivery within 30 seconds;
high availability;
long-term storage of delivery history.

These requirements will drive most of the architectural decisions we make later.

In practice, non-functional requirements often have a greater influence on architecture than functional ones. Two systems may provide exactly the same functionality, yet radically different requirements for latency, availability, scalability, or data retention can lead to completely different architectures.

Stage 2: Back-of-the-Envelope Estimation

At this point, we understand the requirements and know which ones matter most.

The next step is to estimate the scale of the system.

The goal is not to obtain perfectly accurate numbers. The goal is to understand orders of magnitude and validate whether potential solutions are realistic.

At this stage, we're trying to answer a handful of fundamental questions: how many users the system will serve, how many notifications it will generate, what peak traffic patterns to expect, how much data must be stored, and what level of throughput the infrastructure needs to support.

Suppose the system must handle up to 20 million notifications per hour.

This translates roughly into:

about 5,600 notifications per second, averaged over the peak hour;
about 16,700 notifications per second during a 3× traffic spike;
an average notification size of 4 KB;
approximately 80 GB of new data per peak hour;
roughly 1.9 TB per day if the peak rate were sustained around the clock, which gives an upper bound;
about 700 TB per year at that rate, before replication.
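The arithmetic behind these numbers fits in a few lines. The sketch below reproduces it under two assumptions of mine: decimal units (1 GB = 1,000,000 KB), and the peak hourly rate sustained around the clock, which makes the storage figures an upper bound rather than a forecast.

```python
PEAK_PER_HOUR = 20_000_000   # peak traffic from the stakeholder interviews
SPIKE_FACTOR = 3             # short bursts above the hourly rate
MESSAGE_SIZE_KB = 4          # average size of one notification
RETENTION_YEARS = 5          # delivery history retention

per_second = PEAK_PER_HOUR / 3600
spike_per_second = per_second * SPIKE_FACTOR

# Decimal units keep the arithmetic easy to check by hand.
gb_per_hour = PEAK_PER_HOUR * MESSAGE_SIZE_KB / 1_000_000
tb_per_day = gb_per_hour * 24 / 1000
tb_per_year = tb_per_day * 365
pb_retained = tb_per_year * RETENTION_YEARS / 1000

print(f"{per_second:,.0f} notifications/s over the peak hour")
print(f"{spike_per_second:,.0f} notifications/s during a spike")
print(f"{gb_per_hour:.0f} GB/hour, {tb_per_day:.2f} TB/day, {tb_per_year:.0f} TB/year")
print(f"{pb_retained:.1f} PB over {RETENTION_YEARS} years, before replication")
```

Keeping the inputs as named constants makes it easy to rerun the estimate when a requirement changes, for example if the retention period is shortened.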

These estimates already allow us to make several important observations.

First, storing the entire history in expensive high-performance databases may become prohibitively costly.

Second, notification delivery should be asynchronous because external providers may be slow or temporarily unavailable.

Third, we need to account for scaling message queues, workers, and storage systems from the beginning.
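A rough way to size the worker pool is Little's law: the number of requests in flight equals the arrival rate multiplied by the time each request takes. The sketch below applies it to the spike rate. The 200 ms provider latency and the 50 concurrent requests per worker are assumptions chosen purely for illustration, not measurements.

```python
import math


def workers_needed(rate_per_s, provider_latency_s, concurrency_per_worker):
    """Estimate worker count from Little's law (in flight = rate * latency)."""
    in_flight = rate_per_s * provider_latency_s
    return math.ceil(in_flight / concurrency_per_worker)


spike_rate = 3 * 20_000_000 / 3600   # about 16,667 notifications per second

# Assumed values: 0.2 s per provider call, 50 concurrent calls per worker.
print(workers_needed(spike_rate, 0.2, 50))  # 67
```

The exact result matters less than its sensitivity: if the provider slows down to two seconds per call, the same traffic needs roughly ten times as many workers.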

It's important to understand that these calculations are not performed for their own sake. Their purpose is to quickly eliminate solutions that cannot support the expected scale.

This stage helps avoid situations where an architecture looks elegant on paper but collapses under real production workloads.

Stage 3: Build the High-Level Design

Once we have rough estimates, we can begin making architectural decisions based on data rather than assumptions.

At this stage, the goal is to identify the major building blocks of the system:

APIs;
databases;
message queues;
caches;
external integrations.

It's important not to dive too deeply into implementation details yet. A common mistake is trying to design a sophisticated distributed system before understanding whether that complexity is actually necessary.

I generally follow the same principle I use when writing code: KISS (Keep It Simple, Stupid). If a problem can be solved with a simpler solution, that's usually the best place to start.

The requirements gathered earlier now begin to shape the architecture. Since notification loss is unacceptable, we need a reliable message broker with durable storage. The requirement to track delivery status introduces a dedicated persistence layer for notification state. Because external providers may respond slowly or become temporarily unavailable, delivery should be handled asynchronously. Finally, the expected peak throughput of well over ten thousand messages per second requires worker services that can scale horizontally.

As a result, we may end up with an architecture similar to the following:

[Figure: High-level design of the notification system]

At this stage, discussing the design with colleagues is extremely valuable. A fresh set of eyes can often reveal unnecessary complexity, challenge assumptions, or uncover scenarios that were previously overlooked.

Stage 4: Refine the Architecture

Once the high-level architecture is in place, we can begin examining individual components in more detail.

If the previous stage focused on identifying the major parts of the system and their interactions, this stage focuses on understanding how those parts actually satisfy the system's requirements.

The key requirements remain the same: no notification loss, delivery of most notifications within 30 seconds, high availability, and the ability to track delivery status. The goal of this stage is to translate those requirements into concrete implementation decisions.

At this point, architecture starts evolving from a collection of components into a concrete set of engineering solutions.

Let's look at the lifecycle of a notification within the system.

This diagram illustrates how requirements translate into implementation details.

When a notification is created, it is first published to a message broker. This decouples request handling from the delivery process and helps smooth traffic spikes.

A worker service then consumes the message, determines the appropriate delivery channel, prepares the request, and sends it to the external provider.
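The publish-and-consume flow described above can be sketched in a few lines. This is a toy model, not a production design: an in-memory `queue.Queue` stands in for a durable broker, and `fake_provider_send` is a hypothetical placeholder for a real push or email provider call.

```python
import queue
import threading


def choose_channel(notification):
    """Pick a delivery channel; push is preferred when a device is known."""
    return "push" if notification.get("device_token") else "email"


def fake_provider_send(channel, notification):
    """Hypothetical provider call; a real one would be a network request."""
    return {"channel": channel, "id": notification["id"]}


def worker(broker, results):
    """Consume messages until the None sentinel arrives."""
    while True:
        notification = broker.get()
        if notification is None:
            broker.task_done()
            break
        channel = choose_channel(notification)
        results.append(fake_provider_send(channel, notification))
        broker.task_done()


broker = queue.Queue()   # in-memory stand-in for a durable message broker
results = []
thread = threading.Thread(target=worker, args=(broker, results))
thread.start()

# The request handler only publishes and returns; delivery happens later.
broker.put({"id": "n-1", "device_token": "abc", "email": "a@example.com"})
broker.put({"id": "n-2", "device_token": None, "email": "b@example.com"})
broker.put(None)
broker.join()
thread.join()
print(results)
```

The producer never waits for a provider, so a slow provider only lengthens the queue instead of blocking the code that creates notifications.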

From here, two outcomes are possible:

Successful Delivery

If the provider confirms receipt of the notification, the system stores the delivery status in the database.

Storing delivery status enables several important capabilities. It allows operators to see the current state of a notification, provides visibility into delivery performance through metrics, and supports auditing requirements.

The status storage component may seem like a small detail, but it's directly tied to both user experience and operational visibility.

Without it, troubleshooting delivery issues becomes significantly more difficult.
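To make the status store concrete, here is a small sketch using an in-memory SQLite database. The table name, the columns, and the status values are assumptions for illustration; a real system would choose its storage based on the volume estimates from Stage 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE delivery_status ("
    " notification_id TEXT PRIMARY KEY,"
    " status TEXT NOT NULL,"
    " attempts INTEGER NOT NULL)"
)


def record_attempt(conn, notification_id, status):
    """Store the latest status and count the delivery attempt."""
    cursor = conn.execute(
        "UPDATE delivery_status SET status = ?, attempts = attempts + 1"
        " WHERE notification_id = ?",
        (status, notification_id),
    )
    if cursor.rowcount == 0:   # first attempt for this notification
        conn.execute(
            "INSERT INTO delivery_status VALUES (?, ?, 1)",
            (notification_id, status),
        )
    conn.commit()


def status_counts(conn):
    """Return how many notifications are currently in each status."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM delivery_status GROUP BY status"
    )
    return dict(rows.fetchall())


record_attempt(conn, "n-1", "failed")
record_attempt(conn, "n-1", "delivered")   # the retry succeeded
record_attempt(conn, "n-2", "delivered")
print(status_counts(conn))                 # {'delivered': 2}
```

Even this small table answers the operational questions mentioned above: what state a notification is in, and how many attempts it took to get there.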

Failed Delivery

If delivery fails, the next step is determining the type of failure. Not all failures should be handled the same way.

Temporary failures may occur for many reasons:

provider outages;
network issues;
rate limiting;
temporary service degradation.

In these cases, the notification should not be discarded. Instead, it should be moved to a retry queue and processed again after a delay. This retry mechanism improves reliability without requiring immediate human intervention.
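The delay between retries is usually not fixed. One common choice, shown here as a sketch rather than a recommendation, is capped exponential backoff with random jitter. The base delay, the cap, and the function names are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base_s=1.0, cap_s=16.0):
    """Delay before retry number `attempt` (1-based), doubling up to a cap."""
    return min(cap_s, base_s * 2 ** (attempt - 1))


def with_jitter(delay_s, rng=random.random):
    """Full jitter: pick a random delay in [0, delay_s) to spread retries out."""
    return delay_s * rng()


def retry_schedule(max_attempts):
    """Un-jittered delays for retries 1..max_attempts."""
    return [backoff_delay(n) for n in range(1, max_attempts + 1)]


print(retry_schedule(5))       # [1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(retry_schedule(5)))  # 31.0
```

With these assumed values, a notification that needs all five retries spends 31 seconds waiting, which already exceeds the 30-second delivery target. The retry policy and the latency requirement therefore have to be designed together.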

Some failures are unlikely to succeed regardless of how many retries are attempted.

Examples include:

an invalid email address;
a deleted mobile device;
an unsupported notification target.

In these situations, repeatedly attempting delivery wastes resources and generates unnecessary load. A common solution is to move such messages into a Dead Letter Queue (DLQ), where they can be analyzed separately. This prevents problematic messages from continuously cycling through the system.
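The routing decision between the retry queue and the DLQ can be expressed as a small function. The sketch below uses hypothetical failure reason strings and an assumed limit of five attempts. Real providers report errors in their own formats, so the mapping would need to be adapted.

```python
# Hypothetical failure reasons that will not succeed on retry.
PERMANENT = {"invalid_email", "device_deleted", "unsupported_target"}


def route_failure(reason, attempts, max_attempts=5):
    """Return "retry" or "dlq" for a failed delivery.

    Anything not known to be permanent (outages, network errors, rate
    limiting, unknown reasons) is retried until max_attempts is reached.
    """
    if reason in PERMANENT:
        return "dlq"
    if attempts >= max_attempts:
        return "dlq"   # give up on a transient failure that keeps recurring
    return "retry"


print(route_failure("rate_limited", attempts=1))   # retry
print(route_failure("invalid_email", attempts=1))  # dlq
print(route_failure("rate_limited", attempts=5))   # dlq
```

Capping attempts for transient failures matters as much as the permanent list: without the cap, a long provider outage would keep messages cycling indefinitely.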

Additional Reliability Mechanisms

As we continue refining the design, additional components and patterns often emerge. These are typically driven directly by the system's non-functional requirements.

Idempotent processing helps prevent duplicate notifications, retry mechanisms handle transient failures, and Dead Letter Queues isolate messages that cannot be processed successfully. At the same time, operational concerns introduce queue monitoring, alerting, persistent state tracking, and automatic worker scaling.
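Idempotent processing is easiest to see in code. The sketch below assumes each message carries an `idempotency_key` field, which is a hypothetical name, and keeps the processed keys in an in-memory set. In a real system that set would live in a shared, durable store.

```python
class IdempotentSender:
    """Skip messages whose idempotency key has already been processed."""

    def __init__(self, send):
        self._send = send
        self._seen = set()   # stand-in for a shared, durable key store

    def handle(self, message):
        """Send the message once; return False for a duplicate."""
        key = message["idempotency_key"]
        if key in self._seen:
            return False
        self._send(message)
        # The key is recorded only after a successful send, so a crash
        # between these two steps causes a resend rather than a lost message.
        self._seen.add(key)
        return True


sent = []
sender = IdempotentSender(sent.append)
message = {"idempotency_key": "transfer-42", "text": "Transfer completed"}

print(sender.handle(message))  # True
print(sender.handle(message))  # False, the broker redelivered the message
print(len(sent))               # 1
```

The order of the two steps is itself a trade-off: recording the key before sending would avoid duplicates after a crash, but it would risk losing the notification, which this system cannot accept.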

These mechanisms are rarely visible in a high-level architecture diagram, yet they often determine whether a system succeeds in production. This stage is where many of the difficult questions finally get answered.

For example:

What happens if a worker crashes while processing a message?
How do we prevent duplicate notifications?
Where should processing state be stored?
How does the system react when an external provider becomes unavailable?
How do we detect that the queue is growing faster than it can be processed?
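The last question can be answered with two numbers: how fast the backlog is growing, and how long a newly queued message would wait at the current drain rate. The sketch below assumes queue depth is sampled periodically. The alert threshold of half the 30-second budget is an arbitrary assumption for illustration.

```python
def backlog_growth_per_s(samples):
    """Average growth in queue depth per second over (timestamp_s, depth) samples."""
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (d1 - d0) / (t1 - t0)


def estimated_wait_s(depth, drain_rate_per_s):
    """Rough time a newly queued message waits before a worker picks it up."""
    return depth / drain_rate_per_s


def should_alert(samples, drain_rate_per_s, budget_s=30.0):
    """Alert when the backlog is growing and waits approach the latency budget."""
    growing = backlog_growth_per_s(samples) > 0
    wait = estimated_wait_s(samples[-1][1], drain_rate_per_s)
    return growing and wait > budget_s / 2


# Hypothetical samples: depth rose from 10,000 to 130,000 in one minute.
rising = [(0, 10_000), (60, 130_000)]
print(backlog_growth_per_s(rising))                  # 2000.0
print(should_alert(rising, drain_rate_per_s=5_000))  # True
```

The same two signals can also drive automatic worker scaling, since a growing backlog means the consumers are slower than the producers.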

The stricter the requirements for reliability, availability, and performance, the more of these questions need to be addressed.

Over time, the high-level diagram gradually evolves into a production-ready architecture.

Trade-Offs

Every architectural decision comes with a cost. Guaranteed delivery improves reliability but inevitably increases system complexity. Storing a complete audit history raises storage costs, while replication improves availability at the expense of additional infrastructure resources. Asynchronous processing makes the system more resilient to failures, but it also adds end-to-end latency and operational complexity. This is why trade-off analysis is such a central part of architecture design.

Because of this, the architect's job is not to find the perfect solution.

The real challenge is finding the right set of trade-offs for a particular problem. There is rarely a universally correct answer. A design that is ideal for one organization may be entirely inappropriate for another. Architecture is ultimately the art of balancing competing concerns.

Architecture Design Is Iterative

In practice, architecture design rarely follows a neat sequence of steps.

For example, after performing capacity estimations, you may discover that storing data for five years exceeds the project's budget.

At that point, you might need to revisit the requirements and ask:

Do we really need five years of retention?
Can older records be archived?
Do all historical records require fast access?
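These questions can be explored numerically before anyone negotiates with stakeholders. The sketch below splits the five-year history into a hot tier with fast access and a cold archive. It reuses the 1.92 TB/day upper bound (80 GB per hour for 24 hours), and the 90-day hot window is a hypothetical value chosen only to show the effect.

```python
TB_PER_DAY = 1.92            # 80 GB/hour sustained for 24 hours (upper bound)
RETENTION_DAYS = 5 * 365     # five-year retention requirement


def tier_sizes_tb(hot_days, tb_per_day=TB_PER_DAY, retention_days=RETENTION_DAYS):
    """Return (hot_tb, cold_tb) when only the last hot_days stay in fast storage."""
    hot_days = min(hot_days, retention_days)
    hot = hot_days * tb_per_day
    cold = (retention_days - hot_days) * tb_per_day
    return hot, cold


hot, cold = tier_sizes_tb(hot_days=90)   # assumed 90-day hot window
print(f"hot: {hot:.1f} TB, cold: {cold:.1f} TB")
```

Under these assumptions, about 95% of the history sits in the archive tier, which changes the cost discussion considerably compared with keeping everything in a high-performance database.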

Once the requirements change, the architecture may change as well.

This is why architecture design should be viewed as an iterative process rather than a linear one. Each iteration improves understanding of the problem and helps refine the solution.

The most effective architects are not those who get everything right on the first attempt. They are the ones who continuously challenge assumptions and adapt their designs as new information becomes available.

Real-World Constraints

Architectures are rarely designed in a vacuum.

In reality, architecture decisions are constrained by factors such as the company's existing technology stack, infrastructure, team expertise, available computing resources, and budget. These constraints often matter just as much as the technical characteristics of the solution itself.

Because of this, the technically optimal solution is not always the best business decision. In many cases, choosing a technology that the team already understands is preferable to introducing something new, even if the new technology appears superior on paper.

Of course, there are situations where adopting a new technology is justified. But those decisions should be supported by strong evidence and a clear business rationale.

Just as in earlier stages, seeking feedback from colleagues before implementation can significantly improve the final design. The cost of changing architecture is lowest before the first line of production code is written.

Why There Is No Such Thing as a Perfect Architecture

Architecture design is one of the most interesting and at the same time one of the most challenging areas of software engineering. Unlike many technical problems, architectural challenges rarely have a single correct answer. More often, there is a range of viable solutions, each with its own strengths, weaknesses, and trade-offs.

Every architectural decision is ultimately a bet on the future. When we choose a database, define service boundaries, introduce a message broker, or optimize for a specific scalability target, we are making assumptions about how the system will evolve. Sometimes those assumptions prove correct. Sometimes they don't.

This is why there is no universal blueprint that can produce the perfect architecture for every situation. Every system operates under a unique combination of requirements, constraints, business goals, team capabilities, and expected growth. The architecture that works well for one company may be completely inappropriate for another.

Over time, engineers develop architectural intuition. That intuition is built through experience: designing systems, operating them in production, making mistakes, learning from failures, and observing which decisions stand the test of time.

Architecture will always involve uncertainty. But a structured process helps reduce that uncertainty and significantly lowers the chances of making arbitrary or poorly informed decisions. The goal is not to design the perfect system. The goal is to design a system that solves the right problem, satisfies the most important requirements, and remains maintainable as the system evolves.

And in practice, that's usually what good architecture looks like.
