Monday, September 19, 2022

Service Readiness Checklist (SaaS)

Architecture & Scaling

  • Is the service's architecture documented?
  • Can the service tolerate machine failures whilst preserving its SLO?
  • Are the service components easily scalable?
  • Can users consume unbounded resources through your service?
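
One common way to bound per-user consumption is a token bucket. The following is a minimal sketch, not a prescribed implementation: the class name `TokenBucket`, its parameters, and the injectable `clock` are all hypothetical and exist only for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical per-user limiter: holds up to `capacity` tokens,
    refilled continuously at `refill_per_second`."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the behaviour can be tested
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request fits the budget."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In practice a service would keep one bucket per user or API key and reject (or queue) requests for which `allow()` returns False.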

Code

  • Where is the code hosted?
  • If this is one of our company's OSS (open-source software) projects, does it have a CLA (Contributor License Agreement) set up?  Is there an appropriate license and governance doc?

CI & Testing

  • Do you have CI?  Do you use an existing "blessed" CI solution?
  • Does the CI build the final production assets?  Where are the built assets stored?  Are they versioned? [We expect all assets run in production to be built by CI, not on a developer's laptop.]

Release process

  • Is the release process documented?  How regularly do you release?  [You should document the process of getting a new build of your service into production; putting these docs in the infrastructure configuration repository is a good idea.]
  • Is there a staging environment?  A dev environment?
  • Can updates to the service be rolled out without downtime?  Can releases be safely rolled back?

Config management

  • Is the service config version controlled?  Is the config in the infrastructure configuration repo?
  • Are resource requests and limits appropriately configured? [In general, requests should be set to the 95th percentile of usage over the last week.]
  • Are your jobs restarted when config changes?
  • If you have external dependencies, are they configured with Terraform?
  • If the service requires secrets, are they in an approved, reliable secret store?
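
The rule of thumb above (requests at the 95th percentile of the last week's usage) can be sketched as follows. This is an illustration only: the function name `percentile`, the nearest-rank method, and the sample values are assumptions, and in a real setup the samples would come from your monitoring system.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    `pct` percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one usage sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical CPU usage samples for the last week, in millicores.
cpu_usage_millicores = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
suggested_request = percentile(cpu_usage_millicores, 95)
```

Other percentile definitions (for example, linear interpolation) give slightly different values on small sample sets; with a week of high-resolution samples the difference is usually negligible.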

Security

  • How is access to your new service controlled?
  • Is TLS used over all untrusted networks?
  • Is sensitive data encrypted at rest?

SLA/SLO

  • Is there a published SLO for the service?
  • How is the SLO monitored/calculated and who owns it?
  • Is the SLO defensible?

Observability

  • Does the service export metrics?
  • Do the exported metrics allow for RED Method-style analysis?  [The RED method: key metrics for a microservices architecture - Rate/Errors/Duration]
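
To make the three RED signals concrete, here is a minimal sketch that derives them from raw request samples. The names `RequestSample` and `red_summary` are hypothetical, and treating only 5xx status codes as errors is an assumption; a real service would normally export counters and histograms and let the monitoring system do this aggregation.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, Sequence


@dataclass(frozen=True)
class RequestSample:
    status_code: int
    duration_seconds: float


def red_summary(samples: Sequence[RequestSample], window_seconds: float) -> Dict[str, float]:
    """Summarise Rate, Errors and Duration for requests seen in one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(samples)
    # Assumption: a 5xx response counts as an error.
    errors = sum(1 for s in samples if s.status_code >= 500)
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "median_duration_seconds": (
            median(s.duration_seconds for s in samples) if total else 0.0
        ),
    }
```

The median is used here only to keep the sketch short; dashboards and alerts more often track high percentiles of duration.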

Alerts

  • Do you use Prometheus alerting?  Are the alerts version controlled?
  • Are alerts routed to the correct place?
  • Do you use SLO-based alerting?
  • Have alerts been tested and validated in lower environments?
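
SLO-based alerting is often expressed in terms of burn rate: how fast the error budget is being spent relative to the rate the SLO allows. The sketch below is one way to compute it, under stated assumptions: the function names are hypothetical, and the default threshold of 14.4 is illustrative (it corresponds to spending 2% of a 30-day budget in one hour, 0.02 * 720 = 14.4), not something this checklist prescribes.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target).

    A burn rate of 1.0 means the budget is spent exactly over the SLO period.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # illustrative default, tune for your SLO period
) -> bool:
    """Page only if both the long and the short window burn too fast.

    The short window stops the alert from firing long after the problem ended.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

For example, with a 99.9% SLO, a 2% error ratio over the long window and 3% over the short window would page, whereas a 2% long-window ratio with a recovered short window would not.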

Dashboards

  • Are dashboards version controlled?
  • Are there RED Method-style dashboards for the service?
  • Are the dashboards available in our monitoring system?

Logs

  • Does the service emit logs?
  • Are all log error codes documented?
  • Do your jobs emit credentials or secrets in their logs?
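
As a defence in depth against secrets leaking into logs, a logging filter can mask obvious `key=value` credentials before they are written. This is a minimal sketch using Python's standard `logging` module; the class name `RedactSecretsFilter` and the list of key names in the pattern are assumptions, and a filter like this does not replace keeping secrets out of log calls in the first place.

```python
import logging
import re

# Assumption: secrets appear as key=value pairs with one of these key names.
SECRET_PATTERN = re.compile(r"(?i)\b(password|token|api_key|secret)=(\S+)")


class RedactSecretsFilter(logging.Filter):
    """Replace the value of known secret-looking keys with [REDACTED]."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then redact the result.
        rendered = record.getMessage()
        record.msg = SECRET_PATTERN.sub(r"\1=[REDACTED]", rendered)
        record.args = ()  # arguments are already merged into msg
        return True  # never drop the record, only rewrite it


logger = logging.getLogger("example-service")  # hypothetical logger name
logger.addFilter(RedactSecretsFilter())
```

With the filter attached, `logger.warning("login user=%s password=%s", "bob", "hunter2")` would be emitted with the password value masked.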

Back-up and recovery


External Services

  • Are the external dependencies sufficiently monitored and is alerting set up?

On-call & Incident Response

  • Do you have follow-the-sun on-call shifts set up?
  • Is the on-call rotation adequately staffed?
  • Are you using PagerDuty/OpsGenie/VictorOps/…?
  • Have the individuals on-call received adequate training on how to handle the specific alerts?
  • Does the service have an entry on the Cloud status page?
  • Is there an escalation channel for customer support enquiries?

If the service has been in production for more than one month:

  • Have recent outages been followed up with a post mortem?
  • Did the outages have documented recovery steps/guides?  [If not, they should be created.]
  • Over the past month, has the service generated fewer than 2 pages per day on average?
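
The paging threshold is a simple average, sketched below. The function names and the daily counts are hypothetical; in practice the counts would come from your paging tool's export.

```python
from typing import Sequence


def average_pages_per_day(daily_page_counts: Sequence[int]) -> float:
    """Mean number of pages per day over the observed period."""
    if not daily_page_counts:
        raise ValueError("need at least one day of data")
    return sum(daily_page_counts) / len(daily_page_counts)


def within_paging_budget(daily_page_counts: Sequence[int], max_per_day: float = 2.0) -> bool:
    """True if the average is strictly below the per-day limit."""
    return average_pages_per_day(daily_page_counts) < max_per_day
```

Note that an average hides bursts: a month with one day of 40 pages and 29 quiet days still averages under 2 per day, so it is worth looking at the distribution as well.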

State Management (if your service uses a database or other data store, e.g. GCS, RDS, etc.)

  • Are there (sufficiently frequent) backups?  Have you tested restore?
  • Should the data in the database be exported to BigQuery/Redshift/… for BI?

Feedback

  • Did you find this checklist useful?
  • What do you think is missing from this checklist?