Monday, September 19, 2022

Service Readiness Checklist (SaaS)

Architecture & Scaling

  • Is the service's architecture documented?
  • Can the service tolerate machine failures while still meeting its SLO?
  • Are the service components easily scalable?
  • Can users consume unbounded resources through your service?

Code

  • Where is the code hosted?
  • If this is one of our company's OSS (open-source software) projects, does it have a CLA (Contributor License Agreement) set up, and is there an appropriate license and governance doc?

CI & Testing

  • Do you have CI?  Do you use an existing "blessed" CI solution?
  • Does the CI build the final production assets?  Where are the built assets stored?  Are they versioned? [We expect all assets run in production to be built by CI, not on a developer's laptop.]

Release process

  • Is the release process documented?  How regularly do you release?  [You should document the process of getting a new build of your service into production; putting these docs in the infrastructure configuration repository is a good idea.]
  • Is there a staging environment?  A dev environment?
  • Can updates to the service be rolled out without downtime?  Can releases be safely rolled back?

Config management

  • Is the service config version controlled?  Is the config in the infrastructure configuration repo?
  • Are resource requests and limits appropriately configured? [As a general rule, requests should be set to the 95th percentile of the usage over the last week.]
  • Are your jobs restarted when config changes?
  • If you have external dependencies, are they configured with Terraform?
  • If the service requires secrets, are they in an approved, reliable secret store?
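
The 95th-percentile guideline for resource requests above can be sketched in a few lines. This is only an illustration: the `percentile` helper, the nearest-rank method, and the sample values are assumptions, not something this checklist prescribes. In practice the samples would come from your monitoring system.

```python
import math


def percentile(samples, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank is 1-based: the smallest value with at least q% of samples at or below it.
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical CPU usage samples (millicores) collected over the last week.
cpu_millicores = [120, 135, 150, 110, 180, 140, 160, 155, 130, 145]

# Candidate value for the CPU request, following the 95th-percentile rule of thumb.
suggested_request = percentile(cpu_millicores, 95)
print(f"suggested CPU request: {suggested_request}m")
```

A percentile is used rather than the maximum so that a single short spike does not inflate the request; limits can then be set separately to cap such spikes.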

Security

  • How is access to your new service controlled?
  • Is TLS used over all untrusted networks?
  • Is sensitive data encrypted at rest?

SLA/SLO

  • Is there a published SLO for the service?
  • How is the SLO monitored/calculated and who owns it?
  • Is the SLO defensible?
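
One way to judge whether an SLO is defensible is to translate the target into the downtime it actually permits. The sketch below assumes an availability-style SLO measured over a 30-day window; the function name and the window length are illustrative assumptions.

```python
def allowed_downtime_minutes(slo_target, window_days=30):
    """Return the minutes of full downtime an availability SLO permits in the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the target leaves over: 1 - target.
    return (1 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} over 30 days allows {allowed_downtime_minutes(target):.1f} minutes of downtime")
```

If the resulting budget is smaller than the time a single rollback or failover usually takes, the target is probably too ambitious to defend.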

Observability

  • Does the service export metrics?
  • Do the exported metrics allow for RED Method style analysis?  [The RED method: Key metrics for microservices architecture - Rate/Error/Duration]
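
As a rough illustration of what RED-style analysis needs from the exported metrics, the sketch below derives rate, error ratio, and duration from a list of request records. The `RequestRecord` type and `red_summary` function are hypothetical; a real service would normally export counters and histograms and let the monitoring system do this arithmetic.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_s: float  # how long the request took, in seconds
    ok: bool           # False if the request ended in an error


def red_summary(records, window_s):
    """Summarise Rate, Errors and Duration for the requests seen in one window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    total = len(records)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0,
                "mean_duration_s": 0.0, "max_duration_s": 0.0}
    errors = sum(1 for r in records if not r.ok)
    durations = [r.duration_s for r in records]
    return {
        "rate_per_s": total / window_s,            # Rate: requests per second
        "error_ratio": errors / total,             # Errors: fraction of failed requests
        "mean_duration_s": sum(durations) / total,  # Duration: average latency
        "max_duration_s": max(durations),           # Duration: worst case in the window
    }


# Hypothetical requests observed during a 2-second window.
sample = [RequestRecord(0.1, True), RequestRecord(0.3, False),
          RequestRecord(0.2, True), RequestRecord(0.4, True)]
print(red_summary(sample, window_s=2))
```

If any of the three values cannot be computed from what the service exports, the metrics are not yet sufficient for RED-style dashboards and alerts.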

Alerts

  • Do you use Prometheus alerting?  Are the alerts version controlled?
  • Are alerts routed to the correct place?
  • Do you use SLO-based alerting?
  • Have alerts been tested and validated in lower environments?
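
SLO-based alerting is often implemented as a burn-rate check: page only when the error budget is being consumed much faster than the SLO allows, over both a short and a long window. The sketch below is a minimal version of that idea. The function names are hypothetical, and the default threshold of 14.4 is an assumption taken from the commonly cited multi-window example in the Google SRE Workbook; tune it for your own service.

```python
def burn_rate(error_ratio, slo_target):
    """Return how many times faster than allowed the error budget is being spent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    budget = 1 - slo_target  # fraction of requests allowed to fail
    return error_ratio / budget


def should_page(short_window_error_ratio, long_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only if both windows burn budget faster than the threshold.

    Requiring both windows avoids paging on a brief blip (long window stays low)
    and stops paging soon after recovery (short window drops quickly).
    """
    return (burn_rate(short_window_error_ratio, slo_target) > threshold
            and burn_rate(long_window_error_ratio, slo_target) > threshold)


# Hypothetical error ratios: 2% of requests failing in both windows against a 99.9% SLO.
print(should_page(0.02, 0.02, slo_target=0.999))
```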

Dashboards

  • Are dashboards version controlled?
  • Are there RED Method-style dashboards for the service?
  • Are the dashboards available in our monitoring system?

Logs

  • Does the service emit logs?
  • Are all log error codes documented?
  • Do your jobs emit credentials or secrets in their logs?

Back-up and recovery


External Services

  • Are the external dependencies sufficiently monitored and is alerting set up?

On-call & Incident Response

  • Do you have follow-the-sun on-call shifts set up?
  • Is the on-call rotation adequately staffed?
  • Are you using PagerDuty/OpsGenie/VictorOps/…?
  • Have the individuals on-call received adequate training on how to handle the specific alerts?
  • Does the service have an entry on the Cloud status page?
  • Is there an escalation channel for customer support enquiries?

If the service has been in production for more than one month:

  • Have recent outages been followed up with a post mortem?
  • Did the outages have documented recovery steps/guides?  If not, they should be created.
  • Over the past month, has the service generated fewer than 2 pages per day on average?
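
The pages-per-day question above is simple arithmetic over your paging history. A minimal sketch, assuming you can export the date of each page from your paging tool (the function name and the sample data are hypothetical):

```python
from datetime import date


def average_pages_per_day(page_dates, start, end):
    """Return the mean number of pages per day between start and end, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1
    # Ignore pages that fall outside the window being assessed.
    in_window = [d for d in page_dates if start <= d <= end]
    return len(in_window) / days


# Hypothetical export: 45 pages during a 30-day window.
window_start, window_end = date(2022, 8, 1), date(2022, 8, 30)
pages = [date(2022, 8, 1 + (i % 30)) for i in range(45)]

average = average_pages_per_day(pages, window_start, window_end)
print(f"{average:.2f} pages/day, within target: {average < 2}")
```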

State Management (if your service uses a database or data store, e.g. GCS, RDS, etc.)

  • Are there (sufficiently frequent) backups?  Have you tested restore?
  • Should the data in the database be exported to BigQuery/Redshift/… for BI?

Feedback

  • Did you find this checklist useful?
  • What do you think is missing from this checklist?

Tuesday, June 14, 2022

I have become a Manager

I have been an Engineering Operations Manager since early 2020 (of course, with help from my Special Mind Power Technique, derived from the ancient texts of Confucianism).

I am now managing 12 Senior Engineers, including 4 Leads, for ID&F products, hosted services & cloud services. 

Since then, I have spent most of my time managing people and services, with little time left to develop anything. Hence this space has been idle for some time.

Today I learned that Panasonic organized a programming competition back in 2007. Someone posted the list of winners on the Lowyat forum, so I looked up what the winners are doing now and came up with the table below.

Rank | Name | College | Current Status | Current Company | Location | Remark
1st prize winner | Tan Aik Keong | MMU | Founder | Agmo Studio | MY |
2nd prize winner | Chooi Kah Wai | APIIT | Technical Lead | PathDAO | MY |
3rd prize winner | Simon Lim Hao Wooi | UTAR | Software Engineer | Canva | AU |
Outstanding winner | Gan Eng Chin | UM | Senior Software Engineer | Automattic | MY | One of the early engineers in Experian CheetahMail
Outstanding winner | Fong Kha Chun | UM | | | |
Outstanding winner | Kwan Toh Choong | Curtin U of Tech | Digital Platform Design Manager | Shell | MY | 11 years in Shell
Outstanding winner | Lee Chee Cheong | MMU | | | | Last job as Technical Manager at iRadar Sdn Bhd
Outstanding winner | Tee Shu Hui | UTM | Senior Firmware Development Engineer | Micron Tech | SG |
Merit Winners | Tan Kian Tat | MMU | Self-Employed | | | SAP Consultant since grad
Merit Winners | Lim Fang-Yin | UTAR | | | | Active in Quora
Merit Winners | New Chin Jian | UM | | | |
Merit Winners | Choong You Qi | MMU | Staff Android Engineer | Setel | MY |
Merit Winners | Chan Kin Meng | MMU | Founder CEO | Gameconomy | MY | Web3.0 Gaming
Merit Winners | Siew Lead Choon | UM | | | | SAP?
Merit Winners | Yeoh Yan Pore | USM | GM | IHS Group | |
Merit Winners | Ernest Eg Ket Lung | USM | | | |
Merit Winners | Chan Chen Shyang | Curtin U of Tech | IP Consultant | Nokia | MY | Nokia till today
Merit Winners | Ahmad Irshad B. Abdul Hamid | UniTen | Software Engineer | Sage | MY |
Merit Winners | Chua Fook Ching | MMU | Business & Integration Arch Specialist (SAP) | Accenture | MY |
Merit Winners | Khor Yit Kean | TARC | | | |