Architecture & Scaling
- Is the service's architecture documented?
 - Can the service tolerate machine failures whilst still meeting its SLO?
 - Are the service components easily scalable?
 - Can users consume unbounded resources through your service?
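
One common way to stop a single user from consuming unbounded resources is a per-user rate limit. The sketch below shows a token bucket, purely as an illustration; the class name `TokenBucket` and its parameters are hypothetical and not taken from any particular service or library.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TokenBucket:
    """Hypothetical per-user limiter: bursts up to `capacity`, refills at `rate` tokens per second."""
    rate: float
    capacity: float
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        # Start with a full bucket so a new user can burst immediately.
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Return True and spend `cost` tokens if the request fits in the budget."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In practice you would keep one bucket per user or API key and reject (or queue) requests for which `allow()` returns False.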
 
Code
- Where is the code hosted?
 - If this is one of our company's OSS (open source software) projects, does it have a CLA (Contributor License Agreement) set up, an appropriate license, and a governance doc?
 
CI & Testing
- Do you have CI? Do you use an existing "blessed" CI solution?
 - Does the CI build the final production assets? Where are the built assets stored? Are they versioned? [We expect all assets run in production to be built by CI, not on a developer's laptop.]
 
Release process
- Is the release process documented? How regularly do you release? [You should document the process of getting a new build of your service into production; putting these docs in the infrastructure configuration repository is a good idea.]
 - Is there a staging environment? A dev environment?
 - Can updates to the service be rolled out without downtime? Can releases be safely rolled back?
 
Config management
- Is the service config version controlled? Is the config in the infrastructure configuration repo?
 - Are resource requests and limits appropriately configured? [As a general rule, requests should be set to the 95th percentile of usage over the last week.]
 - Are your jobs restarted when config changes?
 - If you have external dependencies, are they configured with Terraform?
 - If the service requires secrets, are they kept in an approved, reliable secret store?
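
The guideline of sizing requests from the 95th percentile of the last week's usage can be sketched as follows. This is a minimal illustration, assuming the usage samples have already been pulled from your metrics system as a plain list of numbers; the function names and the millicore unit are hypothetical.

```python
def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it.

    `q` is an integer in 1..100 so the rank can be computed with exact integer maths.
    """
    if not samples:
        raise ValueError("no usage samples")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(samples)
    rank = (q * len(ordered) + 99) // 100  # ceil(q * n / 100)
    return ordered[rank - 1]


def suggested_request(usage_samples, q=95):
    """Suggest a resource request from historical usage (same unit as the samples)."""
    return percentile(usage_samples, q)


# Hypothetical CPU usage samples in millicores, e.g. one per scrape over the last week.
cpu_millicores = [120, 135, 110, 480, 150, 140, 130, 125, 160, 145]
request = suggested_request(cpu_millicores)
```

Using a high percentile rather than the maximum keeps a single spike from inflating the request, while the limit can still be set higher to absorb bursts.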
 
Security
- How is access to your new service controlled?
 - Is TLS used over all untrusted networks?
 - Is sensitive data encrypted at rest?
 
SLA/SLO
- Is there a published SLO for the service?
 - How is the SLO monitored/calculated and who owns it?
 - Is the SLO defensible?
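
As one possible way to answer "how is the SLO calculated", the sketch below computes an availability SLI and the remaining error budget from request counts. It assumes a simple request-based SLO (good requests divided by total requests); the function names and the example numbers are hypothetical.

```python
def availability(good: int, total: int) -> float:
    """Fraction of requests that met the SLO criteria; an idle service counts as fully available."""
    return 1.0 if total == 0 else good / total


def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed_bad = (1 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # No budget at all (100% target or no traffic): any failure exhausts it.
        return 1.0 if actual_bad == 0 else 0.0
    return 1 - actual_bad / allowed_bad


# Hypothetical window: 100,000 requests, 50 of them failed, 99.9% target.
sli = availability(99_950, 100_000)
remaining = error_budget_remaining(99_950, 100_000, slo_target=0.999)
```

An SLO is easier to defend when the budget calculation is this explicit, because the target can be compared directly against what the service has historically achieved.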
 
Observability
- Does the service export metrics?
 - Do the exported metrics allow for RED Method style analysis? [The RED method: Key metrics for microservices architecture - Rate/Error/Duration]
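
To make the RED method concrete, the sketch below derives the three signals (rate, errors, duration) from a batch of request records. It is only an illustration: the `Request` record, the choice of HTTP 5xx as "error", and the 99th percentile for duration are assumptions, and a real service would normally export counters and histograms rather than compute these in process.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests: List[Request], window_s: float) -> dict:
    """Summarise Rate, Errors and Duration for requests observed over `window_s` seconds."""
    n = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    p99: Optional[float] = None
    if n:
        rank = (99 * n + 99) // 100  # nearest-rank 99th percentile, ceil(0.99 * n)
        p99 = durations[rank - 1]
    return {
        "rate_per_s": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "p99_duration_s": p99,
    }
```

If the exported metrics let you compute these three values per endpoint, they are sufficient for RED-style analysis.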
 
Alerts
- Do you use Prometheus alerting? Are the alerts version controlled?
 - Are alerts routed to the correct place?
 - Do you use SLO-based alerting?
 - Have alerts been tested and validated in lower environments?
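
SLO-based alerting is often implemented with burn rates: how fast the error budget is being spent relative to the rate that would exactly exhaust it over the SLO period. The sketch below shows the arithmetic only. The function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from a commonly cited setup (a 1-hour window against a 30-day SLO, where a burn rate of 14.4 spends 2% of the budget in one hour); it is not something this checklist prescribes.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    budget = 1 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: the long window shows the problem is
    significant, the short window shows it is still happening."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

With a 99.9% target, a sustained 2% error ratio in both windows gives a burn rate of about 20 and would page, whereas a spike that has already recovered in the short window would not.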
 
Dashboards
- Are dashboards version controlled?
 - Are there RED Method-style dashboards for the service?
 - Are the dashboards available in our monitoring system?
 
Logs
- Does the service emit logs?
 - Are all log error codes documented?
 - Do your jobs emit credentials or secrets in their logs?
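
One defence against secrets leaking into logs is a redaction filter on the logger. The sketch below uses Python's standard `logging.Filter`; the class name `RedactSecrets` and the key names in the pattern are assumptions for illustration, and a pattern like this is a safety net rather than a substitute for not logging secrets in the first place.

```python
import logging
import re

# Hypothetical pattern: matches "password=...", "token: ...", "api_key=...", "secret=...".
SECRET_PATTERN = re.compile(
    r"(?i)\b(password|token|api[_-]?key|secret)\b(\s*[=:]\s*)(\S+)"
)


class RedactSecrets(logging.Filter):
    """Replace values that look like credentials before the record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so secrets passed as %-args are caught too.
        message = record.getMessage()
        record.msg = SECRET_PATTERN.sub(r"\1\2[REDACTED]", message)
        record.args = None
        return True  # never drop the record, only rewrite it


logger = logging.getLogger("myservice")  # hypothetical logger name
logger.addFilter(RedactSecrets())
```

A filter attached to a logger only applies to records logged through that logger, so attaching it to the handlers instead may be preferable when several loggers share one output.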
 
Back-up and recovery
External Services
- Are the external dependencies sufficiently monitored and is alerting set up?
 
On-call & Incident Response
- Do you have follow-the-sun on-call shifts set up?
 - Is the on-call rotation adequately staffed?
 - Are you using PagerDuty/OpsGenie/VictorOps/…?
 - Have the individuals on-call received adequate training on how to handle the specific alerts?
 - Does the service have an entry on the Cloud status page?
 - Is there an escalation channel for customer support enquiries?
 
If the service has been in production for more than one month:
- Have recent outages been followed up with a post mortem?
 - Did the outages have documented recovery steps/guides? If not, they should be created.
 - Over the past month, has the service generated fewer than two pages per day on average?
 
State Management (if your service uses a database or other data store, e.g. GCS, RDS, etc.)
- Are there (sufficiently frequent) backups? Have you tested restore?
 - Should the data in the database be exported to BigQuery/Redshift/… for BI?
 
Feedback
- Did you find this checklist useful?
 - What do you think is missing from this checklist?