My IT Journey Since 2004!: September 2022

Monday, September 19, 2022

Architecture & Scaling

~~Code~~

~~Where is the code hosted?~~
~~If this is our company's OSS (open source software) project, does it have a CLA (Contributor License Agreement) set up, and is there an appropriate license and governance doc?~~

~~CI & Testing~~

~~Do you have CI? Do you use an existing "blessed" CI solution?~~
Does the CI build the final production assets? Where are the built assets stored? Are they versioned? [We expect all assets run in production to be built by CI, not on a developer's laptop.]

Release process

Is the release process documented? How regularly do you release? [You should document the process of getting a new build of your service into production; putting these docs in the infrastructure configuration repository is a good idea.]
Is there a staging environment? A dev environment?
Can updates to the service be rolled out without downtime? Can releases be safely rolled back?

Config management

Is the service config version controlled? Is the config in the infrastructure configuration repo?
Are resource requests and limits appropriately configured? [In general, requests should be set to the 95th percentile of usage over the last week.]
Are your jobs restarted when config changes?
If you have external dependencies, are they configured with Terraform?
If the service requires secrets, are they in an approved, reliable secret store?
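As a rough illustration of the resource request guideline above, here is a minimal sketch that derives a suggested request from a week of usage samples using a nearest-rank percentile. The sample values and the names `percentile` and `suggest_request` are hypothetical; in practice the samples would come from whatever monitoring system you use.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def suggest_request(samples, q=95):
    """Suggest a resource request from observed usage (assumed guideline)."""
    return percentile(samples, q)


# Hypothetical CPU usage samples in millicores, e.g. collected over a week.
cpu_millicores = [120, 135, 110, 180, 150, 145, 400, 130, 125, 140,
                  160, 155, 115, 170, 165, 138, 142, 128, 133, 900]

suggested = suggest_request(cpu_millicores)
print(f"suggested CPU request: {suggested}m")
```

Using a high percentile rather than the maximum means a single extreme spike (the 900m sample here) does not inflate the request; limits, not requests, are the place to account for short bursts.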

Security

SLA/SLO

Observability

Does the service export metrics?
Do the exported metrics allow for RED Method style analysis? [The RED method: Key metrics for microservices architecture - Rate/Error/Duration]
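To make the RED question more concrete, the sketch below computes rate, error ratio, and a duration summary from a list of request records. The `Request` type, the choice of HTTP 5xx as "error", and the sample data are assumptions for illustration; a real service would normally export counters and histograms and let the monitoring system do this aggregation.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration: float  # seconds spent serving the request
    status: int      # HTTP status code


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration over one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumed error rule
    durations = [r.duration for r in requests]
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "median_duration": median(durations) if durations else 0.0,
        "max_duration": max(durations) if durations else 0.0,
    }


# Hypothetical requests observed during a 10-second window.
observed = [
    Request(duration=0.1, status=200),
    Request(duration=0.2, status=200),
    Request(duration=0.3, status=200),
    Request(duration=0.4, status=404),
    Request(duration=2.0, status=500),
]

summary = red_summary(observed, window_seconds=10)
print(summary)
```

If the metrics a service exports cannot answer these three questions per endpoint, that is a sign the instrumentation needs another pass.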

Alerts

Dashboards

Logs

Back-up and recovery

External Services

On-call & Incident Response

Do you have follow-the-sun on-call shifts set up?
Is the on-call rotation adequately staffed?
Are you using PagerDuty/OpsGenie/VictorOps/…?
Have the individuals on-call received adequate training on how to handle the specific alerts?
Does the service have an entry on the Cloud status page?
Is there an escalation channel for customer support enquiries?

If the service has been in production for more than one month:

Have recent outages been followed up with a post mortem?
Did the outages have documented recovery steps/guides? If not, they should be created.
Over the past month, has the service generated fewer than 2 pages per day on average?
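The paging budget in the last question is simple arithmetic, but it is easy to get the window wrong. A minimal sketch, assuming you can export the date of each page from your paging tool (the function name and the sample data are hypothetical):

```python
from datetime import date, timedelta


def average_pages_per_day(page_dates, end, days=30):
    """Average number of pages per day over the `days` days ending on `end`."""
    start = end - timedelta(days=days - 1)
    in_window = [d for d in page_dates if start <= d <= end]
    return len(in_window) / days


# Hypothetical data: 45 pages spread over the last 30 days,
# plus one older page that falls outside the window.
end = date(2022, 9, 19)
pages = [end - timedelta(days=i % 30) for i in range(45)]
pages.append(date(2022, 7, 1))

average = average_pages_per_day(pages, end)
within_budget = average < 2
print(f"{average:.2f} pages/day, within budget: {within_budget}")
```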

State Management (if your service uses a datastore, e.g. GCS, RDS, etc.)

Feedback