As you progress in your career as a software engineer, you are gradually tasked with solving larger and more complex problems. Eventually you are asked to design systems, or modifications to systems, that require multiple developers and may take weeks or months to build. These problems, generally referred to as system design problems, are not the type taught in school. While you probably spent many quarters or semesters diving into algorithms, data structures, databases, operating systems, and discrete mathematics, it’s unlikely you ever had to take a graded test on how to build a robust multi-service system that would live for years and be modified by dozens of people.
System design does not have clear right and wrong answers. It’s an art. The design choices you make could be both right and wrong. Often it’s only years after you’ve made them that anyone will realize they were wrong.
Below are the principles I try to follow both when creating and reviewing designs.
Everything can fail
Networks. Disks. RAM. Databases. Other services. Whatever you are building will definitely fail. You’ll deploy a bug to production. You’ll corrupt data. You will definitely do something that negatively impacts a customer. How you prepare to deal with these failures is what matters. Code should be resilient. Monitoring of every key failure point is a baseline requirement. For each one, ask two questions:
How will I know that something is failing?
What am I going to do when it does?
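As a rough sketch of how those two questions can show up in code, the helper below retries a failing call with exponential backoff and counts every failure. The names call_with_retry and failure_counts are hypothetical, and the in-process counter stands in for whatever monitoring system you actually use.

```python
import time
from collections import Counter

# Hypothetical in-process failure counter. A real system would export
# these counts to its monitoring stack instead.
failure_counts = Counter()


def call_with_retry(operation, name, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying on any exception with exponential backoff.

    Every failure is counted under `name`, which addresses "how will I know
    that something is failing?". Once the attempts are exhausted the last
    error is re-raised, so the caller still has to decide "what am I going
    to do when it does?".
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            failure_counts[name] += 1
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The sleep function is passed in as a parameter so that the backoff behavior can be tested without actually waiting.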
Know how to test it before you build it
I’m a big believer in knowing what done looks like before you start. To me, every engineering effort should always begin with a test. If you don’t know how to test that something does what you think it should, you will never know when you are done. The act of thinking through how to test your system is also a forcing function. It forces you to face the complexity of your system early. If it is difficult to test, it is probably difficult to understand.
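To make this concrete, here is a small sketch using a hypothetical RateLimiter component. The three test cases are the definition of done, and the class is just enough implementation to satisfy them. The component, its parameters, and the injected clock are all assumptions made for illustration.

```python
import unittest


class RateLimiter:
    """Hypothetical component: allows at most `limit` calls per `window` seconds."""

    def __init__(self, limit, window, clock):
        self.limit = limit
        self.window = window
        self.clock = clock  # injected so that tests control time
        self.calls = []

    def allow(self):
        now = self.clock()
        # Keep only the calls that are still inside the current window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.limit:
            return False
        self.calls.append(now)
        return True


class RateLimiterTest(unittest.TestCase):
    """Written first: these cases define what "done" means."""

    def setUp(self):
        self.now = 0.0
        self.limiter = RateLimiter(limit=2, window=10, clock=lambda: self.now)

    def test_allows_up_to_limit(self):
        self.assertTrue(self.limiter.allow())
        self.assertTrue(self.limiter.allow())

    def test_rejects_over_limit(self):
        self.limiter.allow()
        self.limiter.allow()
        self.assertFalse(self.limiter.allow())

    def test_allows_again_after_window(self):
        self.limiter.allow()
        self.limiter.allow()
        self.now = 10.0  # move the fake clock past the window
        self.assertTrue(self.limiter.allow())
```

Saved to a file, the cases can be run with python -m unittest. Having to inject a clock to make the third case testable is an example of the forcing function at work: the time dependency surfaces before any production code relies on it.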
Defer decisions
Your design is likely wrong in some way, and you don’t know how yet. Give yourself as much opportunity to defer decisions as you can. Algorithm details should be decoupled from interfaces so that one can change without impacting the other. Implementation details should not impact the overall design.
By deferring decisions on implementation details, you can focus first on building out a system that logically works. Concerns like data persistence and performance are secondary. That way you can solve these problems without worrying about breaking intended behavior.
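One way this could look in practice is sketched below. The business logic depends only on an interface, and an in-memory implementation is enough to prove the behavior before any database is chosen. OrderStore, InMemoryOrderStore, and place_order are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod


class OrderStore(ABC):
    """Hypothetical interface: callers depend on this, not on a database."""

    @abstractmethod
    def save(self, order_id, order):
        ...

    @abstractmethod
    def get(self, order_id):
        ...


class InMemoryOrderStore(OrderStore):
    """Simplest implementation that works; persistence is decided later."""

    def __init__(self):
        self._orders = {}

    def save(self, order_id, order):
        self._orders[order_id] = dict(order)

    def get(self, order_id):
        return self._orders.get(order_id)


def place_order(store, order_id, items):
    """Business logic written purely against the OrderStore interface."""
    if not items:
        raise ValueError("an order needs at least one item")
    store.save(order_id, {"items": list(items), "status": "placed"})
    return store.get(order_id)
```

A later implementation backed by a real database can replace InMemoryOrderStore without touching place_order, as long as it honors the same two methods.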
Plan for 10X scale
It’s tempting to want to build something that can scale infinitely. The reality is that you won’t anticipate all possible bottlenecks. You will also need to make compromises to get something built sooner rather than later. Plan on a solution that can scale only to 10X what you expect. Expect that by the time you reach that 10X scale it will be time to start making changes.
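The arithmetic behind a 10X target is simple enough to write down. The sketch below assumes load is measured in requests per second and that per-instance capacity has been measured. The capacity_plan function and all the numbers are hypothetical.

```python
def capacity_plan(expected_peak_rps, per_instance_rps, scale_factor=10):
    """Return (target_rps, instances_needed) for a scale_factor-times target."""
    target_rps = expected_peak_rps * scale_factor
    # Ceiling division: a partially used instance is still a whole instance.
    instances = -(-target_rps // per_instance_rps)
    return target_rps, instances


# Hypothetical numbers: 200 requests/second expected at peak,
# each instance measured at 500 requests/second.
print(capacity_plan(200, 500))  # (2000, 4)
```

If the design cannot plausibly reach the target figure, that is worth knowing now. If it comfortably exceeds it, some of that headroom may be complexity you can defer.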
Components that change together should live together
This principle applies at both the code level and the service level. As you break domains out into independent microservices, it can become difficult to know what should go where. Knowing which parts will need to change at the same time can guide how you group components into services.
Every service should have a clear, documented purpose
Every time you build a new service, you should write down the exact purpose of that service. The purpose should be as simple and succinct as possible. Writing it down does two things. First, it forces you to articulate why you are making the decisions you are making. Second, it informs the future engineers who will modify what you are building. A clear purpose for each service helps them decide whether to modify an existing service to meet their needs or to create a new one.