How they SRE
Introduction
How They SRE is a curated knowledge repository of Site Reliability Engineering (SRE) best practices, tools, techniques, and culture adopted by leading technology and tech-savvy organizations.
Many organizations publicly share the best practices, tools, and techniques that shape their engineering culture, typically through engineering blogs, conferences, and meetups. This repository collects and organizes content from those sources.
Topics
- Site Reliability Engineering
- Hiring and Building SRE teams
- SRE Culture
- DevOps
- Monitoring & Observability
- Alerting
- Incident Response & Post-Mortem
- On-Call
- Testing in Production
- Chaos Engineering
- Automation
- Performance
- Platform Engineering
Organizations
Achievers
Blog Posts
- Enter the Abattoir - Building 'à la carte' gitops tooling
- Scaling Production Globally — The service mesh facelift (Part-1)
- Scaling Production Globally - Solving observability problems for developers (Part-2)
- Load Testing Kubernetes: Building a Framework (Part-1)
- Load Testing Kubernetes: Resolving bottlenecks and improving performance (Part-2)
Airbnb
Blog Posts
- Automated Incident Management Through Slack
- Detecting Vulnerabilities With Vulnture
- Alerting Framework at Airbnb
- When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb
- Intelligent Automation Platform: Empowering Conversational AI and Beyond at Airbnb
- Production Secret Management at Airbnb
- Automating Data Protection at Scale, Part 1
- Automating Data Protection at Scale, Part 2
- Automating Data Protection at Scale, Part 3
- Dynamic Kubernetes Cluster Scaling at Airbnb
Algolia
Blog Posts
Alibaba Cloud
Blog Posts
Asana
Blog Posts
- How Asana uses Asana: Security incident response
- How Asana ships stable web application releases
- Analysis of recent downtime & what we’re doing to prevent future incidents
- Developer environment: Achieving reliability by making it fast to reset
- Three security tactics for every IT leader to consider this fall
ASOS
Blog Posts
- Playing the blame-less game
- A day in the life of… Cat S (Head of Reliability Engineering)
- An AKS Performance Journey: Part 1 — Sizing Everything Up
- An AKS Performance Journey: Part 2 — Networking It Out
- Cyber Security @ ASOS.com
- Security Operations 24x7
- The skills we look for in Cyber Security Incident Response
Atlassian
Blog Posts
Baidu
Videos
Basecamp
Blog Posts
- Inside a CODE RED: Network Edition
- Three Basecamp outages. One week. What happened?
- Basecamp 2 and Basecamp 3 search outage report
- Reducing Incident Escalations at Basecamp
Books
Bloomberg
Videos
- Capacity Planning and Performance Enhancement with Page Reference Sampling
- Why SREs can't afford to NOT do Chaos Engineering
- Tracing Real-Time Distributed Systems
- The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation
- Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest
Booking.com
Blog Posts
- How Reliability and Product Teams Collaborate at Booking.com
- Incidents, fixes, and the day after
- Troubleshooting: A journey into the unknown
Videos
Capital One
Blog Posts
- Automate Application Monitoring with Slack
- Automate AWS Infrastructure with Boto 3: AWS Health Check
- Active-Active Shared-Nothing Database Architecture
- The 3 R’s of SREs: Resiliency, Recovery & Reliability
- 5 Steps to Getting Your App Chaos Ready
- 4 Real-World Scenarios That Read Like Chaos Engineering Experiments
- Embrace the Chaos … Engineering
- 3 Lessons Learned From Implementing Chaos Engineering at Enterprise
- A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy
- Secure Docker Containers Require Secure Applications
- 4 Steps for Pairing the Cloud and DevOps to Improve Resiliency
- Container Ready Applications with Twelve-Factor App and Microservices Architecture
- Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS
- Architecting for Resiliency
- Continuous Chaos — Introducing Chaos Engineering into DevOps Practices
- The Mon-ifesto Part 1: Metrics
Major incidents & analysis reports
Videos
DBS
Blog Posts
- Presenting at iThome’s SRE Conference: Our DBS SRE Transformation Journey Thus Far
- Debunking the seven most popular Site Reliability Engineering myths
- How To Use SRE To Cultivate A Blameless Culture In The Workplace
- Site Reliability Engineering at DBS Bank
- Automating Configuration Management at Scale
- [How DBS dispelled the myths of Chaos