An updated and organized reading list illustrating the patterns behind scalable, reliable, and performant large-scale systems. Concepts are explained in articles by prominent engineers and in credible references. Case studies are taken from battle-tested systems that serve millions to billions of users.
If your system goes slow
First identify which problem you have: a scalability problem (fast for a single user but slow under heavy load) or a performance problem (slow even for a single user). Then review the design principles and see how tech companies solve scalability and performance problems. The intelligence section is for those who work with data and machine learning at big (data) and deep (learning) scale.
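As a rough illustration of the distinction above, the following minimal sketch classifies a system from two sets of latency samples: one measured with a single user and one measured under heavy load. The function name `diagnose`, the use of the median, and the `target_ms` threshold are all assumptions made for this example, not something taken from the linked articles.

```python
from statistics import median


def diagnose(single_user_ms, under_load_ms, target_ms):
    """Classify latency samples (hypothetical helper for illustration).

    single_user_ms: response times in ms measured with one user
    under_load_ms:  response times in ms measured under heavy load
    target_ms:      assumed acceptable response time in ms
    """
    solo = median(single_user_ms)
    loaded = median(under_load_ms)
    if solo > target_ms:
        # Slow even for a single user: a performance problem.
        return "performance"
    if loaded > target_ms:
        # Fast for a single user but slow under load: a scalability problem.
        return "scalability"
    return "ok"


if __name__ == "__main__":
    print(diagnose([50, 60, 55], [400, 500, 450], target_ms=200))
    print(diagnose([300, 320, 310], [900, 950, 1000], target_ms=200))
```

In practice a tail percentile (for example p99) is often a better signal than the median; the median is used here only to keep the sketch short.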
If your system goes down
"Even if you lose all one day, you can build all over again if you retain your calm!" - Thuan Pham, former CTO of Uber. So, keep calm and mind the availability and stability matters!
If you are having a system design interview
Look at some interview notes and real-world architectures with complete diagrams to get a comprehensive view before designing your system on a whiteboard. You can also check some talks by engineers from tech giants to learn how they build, scale, and optimize their systems. Good luck!
If you are building your dream team
The goal of scaling a team is not to grow team size but to increase team output and value. The organization section shows how tech companies reach that goal across several aspects: hiring, management, organization, culture, and communication.
Community power
Contributions are very welcome! You may want to take a look at the contribution guidelines. If you see a link here that is no longer maintained or is not a good fit, please submit a pull request!
Many long hours of hard work have gone into this project. If you find it helpful, please share it on Facebook, on Twitter, on Weibo, or in your chat groups! Knowledge is power, knowledge shared is power multiplied. Thank you!
Content
- Principle
- Scalability
- Availability
- Stability
- Performance
- Intelligence
- Architecture
- Interview
- Organization
- Talk
- Book
Principle
- Lessons from Giant-Scale Services - Eric Brewer, UC Berkeley & Google
- Designs, Lessons and Advice from Building Large Distributed Systems - Jeff Dean, Google
- How to Design a Good API & Why it Matters - Joshua Bloch, CMU & Google
- On Efficiency, Reliability, Scaling - James Hamilton, VP at AWS
- Principles of Chaos Engineering
- Finding the Order in Chaos
- The Twelve-Factor App
- Clean Architecture
- High Cohesion and Low Coupling
- Monoliths and Microservices
- CAP Theorem and Trade-offs
- CP Databases and AP Databases
- Stateless vs Stateful Scalability
- Scale Up vs Scale Out: Hidden Costs
- ACID and BASE
- Blocking/Non-Blocking and Sync/Async
- Performance and Scalability of Databases
- Database Isolation Levels and Effects on Performance and Scalability
- The Probability of Data Loss in Large Clusters
- Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence
- SQL vs NoSQL
- SQL vs NoSQL - Lesson Learned at Salesforce
- NoSQL Databases: Survey and Decision Guidance
- How Sharding Works
- Consistent Hashing
- Consistent Hashing: Algorithmic Tradeoffs
- Don’t be tricked by the Hashing Trick
- Uniform Consistent Hashing at Netflix
- Eventually Consistent - Werner Vogels, CTO at Amazon
- Cache is King
- Anti-Caching
- Understand Latency
- Latency Numbers Every Programmer Should Know
- The Calculus of Service Availability
- Architecture Issues When Scaling Web Applications: Bottlenecks, Database, CPU, IO
- Common Bottlenecks
- Life Beyond Distributed Transactions
- Relying on Software to Redirect Traffic Reliably at Various Layers
- Breaking Things on Purpose
- Avoid Over Engineering
- Scalability Worst Practices
- Use Solid Technologies - Don’t Re-invent the Wheel - Keep It Simple!
- Simplicity by Distributing Complexity
- Why Over-Reusing is Bad
- Performance is a Feature
- Make Performance Part of Your Workflow
- The Benefits of Server Side Rendering over Client Side Rendering
- Automate and Abstract: Lessons at Facebook
- AWS Do's and Don'ts
- (UI) Design Doesn’t Scale - Stanley Wood, Design Director at Spotify
- Linux Performance
- Building Fast and Resilient Web Applications - Ilya Grigorik
- Accept Partial Failures, Minimize Service Loss
- Design for Resiliency
- Design for Self-healing
- Design for Scaling Out
- Design for Evolution
- Learn from Mistakes
Scalability
- Microservices and Orchestration
- Domain-Oriented Microservice Architecture at Uber
- Service Architecture (3 parts: Domain Gateways, Value-Added Services, BFF) at SoundCloud
- Container (8 parts) at Riot Games
- Containerization at Pinterest
- Evolution of Container Usage at Netflix
- Dockerizing MySQL at Uber
- Testing of Microservices at Spotify
- Docker in Production at Treehouse
- Microservice at SoundCloud
- Operate Kubernetes Reliably at Stripe
- Cross-Cluster Traffic Mirroring with Istio at Trivago
- Agrarian-Scale Kubernetes (3 parts) at New York Times
- Nanoservices at BBC
- PowerfulSeal: Testing Tool for Kubernetes Clusters at Bloomberg
- Conductor: Microservices Orchestrator at Netflix
- Docker Containers that Power Over 100.000 Online Shops at Shopify
- Microservice Architecture at Medium
- From bare-metal to Kubernetes at Betabrand
- Kubernetes at Tinder
- Kubernetes at Quora
- Kubernetes Platform at Pinterest
- Microservices at Nubank
- Payment Transaction Management in Microservices at Mercari
- Service Mesh at Snap
- GRIT: Protocol for Distributed Transactions across Microservices at eBay
- Rubix: Kubernetes at Palantir
- CRISP: Critical Path Analysis for Microservice Architectures at Uber
- Distributed Caching
- EVCache: Distributed In-memory Caching at Netflix
- EVCache Cache Warmer Infrastructure at Netflix
- Memsniff: Robust Memcache Traffic Analyzer at Box
- Caching with Consistent Hashing and Cache Smearing at Etsy
- Analysis of Photo Caching at Facebook
- Cache Efficiency Exercise at Facebook
- tCache: Scalable Data-aware Java Caching at Trivago
- Pycache: In-process Caching at Quora
- Reduce Memcached Memory Usage by 50% at Trivago
- Caching Internal Service Calls at Yelp
- Estimating the Cache Efficiency using Big Data at Allegro
- Distributed Cache at Zalando
- Application Data Caching from RAM to SSD at Netflix
- [Tradeoffs of Replicated Cache at