Architecture & Scaling
- Is the service's architecture documented?
- Can the service tolerate machine failures while preserving its SLO?
- Are the service components easily scalable?
- Can users consume unbounded resources through your service?
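One common way to bound per-user resource consumption is a token bucket rate limiter. The sketch below is illustrative only: the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions for this example, not part of any existing service.

```python
import time


class TokenBucket:
    """Hypothetical per-user limiter: allows bursts of up to `capacity`
    requests and refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request fits the budget."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service would typically keep one bucket per user or API key and reject (or queue) requests for which `allow()` returns False.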
Code
- Where is the code hosted?
- If this is one of our company's OSS (open source software) projects, does it have a CLA (Contributor License Agreement) set up? Is there an appropriate license and governance doc?
CI & Testing
- Do you have CI? Do you use an existing "blessed" CI solution?
- Does the CI build the final production assets? Where are the built assets stored? Are they versioned? [We expect all assets run in production to be built by CI, not on a developer's laptop.]
Release process
- Is the release process documented? How regularly do you release? [You should document the process of getting a new build of your service into production; putting these docs in the infrastructure configuration repository is a good idea.]
- Is there a staging environment? A dev environment?
- Can updates to the service be rolled out without downtime? Can releases be safely rolled back?
Config management
- Is the service config version controlled? Is the config in the infrastructure configuration repo?
- Are resource requests and limits appropriately configured? [As a general rule, requests should be set to the 95th percentile of usage over the last week.]
- Are your jobs restarted when config changes?
- If you have external dependencies, are they configured with Terraform?
- If the service requires secrets, are they stored in an approved, reliable secret store?
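The guidance on resource requests above can be sketched as a small calculation. This is a minimal example that assumes you already have a week of usage samples (for example CPU millicores) exported from your monitoring system. The function names and the nearest-rank percentile definition are choices made for this illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggested_request(usage_samples, pct=95):
    """Hypothetical helper: suggest a resource request from observed usage."""
    return percentile(usage_samples, pct)
```

With very few samples the result is dominated by the largest observation, so the suggestion is only as good as the sampling window behind it.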
Security
- How is access to your new service controlled?
- Is TLS used over all untrusted networks?
- Is sensitive data encrypted at rest?
SLA/SLO
- Is there a published SLO for the service?
- How is the SLO monitored/calculated and who owns it?
- Is the SLO defensible?
Observability
- Does the service export metrics?
- Do the exported metrics allow for RED Method style analysis? [The RED method: Key metrics for microservices architecture - Rate/Error/Duration]
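As a rough illustration of what RED-style analysis needs from the exported data, the sketch below derives rate, error ratio, and a duration summary from raw request records. The `Request` record, the choice of HTTP 5xx as "error", and the use of the median are assumptions for this example. In practice these values would usually be computed by the monitoring system from counters and histograms.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code


def red_summary(requests, window_s):
    """Summarise Rate, Errors and Duration for requests seen in a window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": total / window_s,      # Rate
        "error_ratio": errors / total,       # Errors
        "median_duration_s": median(r.duration_s for r in requests),  # Duration
    }
```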
Alerts
- Do you use Prometheus alerting? Are the alerts version controlled?
- Are alerts routed to the correct place?
- Do you use SLO-based alerting?
- Have alerts been tested and validated in lower environments?
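SLO-based alerting is often expressed in terms of error-budget burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below shows that arithmetic with a two-window check. The function names and the idea of passing the threshold as a parameter are assumptions for illustration. Real alerts would normally be written as rules in the alerting system rather than in application code.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'allowed' the error budget is being spent.

    A burn rate of 1 spends exactly the whole budget over the SLO period.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold):
    """Page only if both a long and a short window exceed the threshold,
    so that alerts stop firing soon after the problem is fixed."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

For example, with a 99.9% SLO an error ratio of 1% corresponds to a burn rate of about 10. The appropriate threshold and window lengths depend on the SLO period and on how much budget you are willing to lose before someone is paged.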
Dashboards
- Are dashboards version controlled?
- Are there RED Method-style dashboards for the service?
- Are the dashboards available in our monitoring system?
Logs
- Does the service emit logs?
- Are all log error codes documented?
- Do your jobs emit credentials or secrets in their logs?
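One way to spot-check logs for leaked credentials is to scan a sample of log lines for secret-looking patterns. The patterns below are illustrative assumptions, not a complete or authoritative list, and a check like this complements rather than replaces reviewing what the code logs.

```python
import re

# Hypothetical patterns for this sketch; extend them for your own formats.
SECRET_PATTERNS = [
    # key=value or key: value pairs with a suspicious key name
    re.compile(r"\b(password|passwd|secret|token|api[_-]?key)\s*[=:]\s*\S+",
               re.IGNORECASE),
    # PEM-encoded private key headers
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def find_suspect_lines(lines):
    """Return 1-based line numbers of log lines that look like they contain secrets."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if any(pattern.search(line) for pattern in SECRET_PATTERNS)
    ]
```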
Back-up and recovery
External Services
- Are the external dependencies sufficiently monitored and is alerting set up?
On-call & Incident Response
- Do you have follow-the-sun on-call shifts set up?
- Is the on-call rotation adequately staffed?
- Are you using PagerDuty/OpsGenie/VictorOps/…?
- Have the individuals on-call received adequate training on how to handle the specific alerts?
- Does the service have an entry on the Cloud status page?
- Is there an escalation channel for customer support enquiries?
If the service has been in production for more than one month:
- Have recent outages been followed up with a post mortem?
- Did the outages have documented recovery steps/guides? If not, they should be created.
- Over the past month, has the service generated fewer than 2 pages per day on average?
State Management (if your service uses a database, e.g. GCS, RDS, etc.)
- Are there (sufficiently frequent) backups? Have you tested restore?
- Should the data in the database be exported to BigQuery/Redshift/… for BI?
Feedback
- Did you find this checklist useful?
- What do you think is missing from this checklist?