Enterprise Systems

Distributed Storage Reliability Tooling

Tooling and runbook automation for a multi-tenant distributed storage system serving regulated industries.

About this project

On-call engineers repeatedly handled the same storage-tier symptoms through manual remediation, which was slow, error-prone, and burning out the team.

Solution

Built a runbook automation layer with safe-by-default operations, dry-run mode, and audit logging; codified the top 30 incident classes.
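
As an illustration of what "safe-by-default" with dry-run mode and audit logging can look like, here is a minimal Python sketch. It is not the project's actual code: the names RunbookStep, AuditLog, and run_step, and the "rebalance-hot-shard" incident class, are hypothetical and chosen only for this example. The key assumption is that every step defaults to dry-run, so a caller has to opt in explicitly before anything changes, and that both planned and executed steps leave an audit entry.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditLog:
    """In-memory audit trail; a real system would write to durable storage."""
    entries: List[dict] = field(default_factory=list)

    def record(self, **event) -> None:
        event["ts"] = time.time()
        self.entries.append(event)


@dataclass
class RunbookStep:
    """One remediation step for a (hypothetical) incident class."""
    name: str
    action: Callable[[], str]  # performs the change and returns a summary
    describe: str              # human-readable plan shown in dry-run mode


def run_step(step: RunbookStep, audit: AuditLog, operator: str,
             dry_run: bool = True) -> str:
    """Run one step. dry_run defaults to True, so changes require an opt-in."""
    if dry_run:
        # Nothing is changed; only the intent is logged and reported.
        audit.record(step=step.name, operator=operator,
                     mode="dry-run", outcome="planned")
        return f"[dry-run] would: {step.describe}"
    try:
        result = step.action()
    except Exception as exc:
        # Failures are audited before the error propagates to the caller.
        audit.record(step=step.name, operator=operator,
                     mode="execute", outcome="failed", error=str(exc))
        raise
    audit.record(step=step.name, operator=operator,
                 mode="execute", outcome="succeeded", result=result)
    return result


if __name__ == "__main__":
    audit = AuditLog()
    step = RunbookStep(
        name="rebalance-hot-shard",  # hypothetical incident class
        action=lambda: "moved 1 replica",
        describe="move one replica off the overloaded node",
    )
    print(run_step(step, audit, operator="oncall"))                 # plan only
    print(run_step(step, audit, operator="oncall", dry_run=False))  # apply
    print(json.dumps([e["outcome"] for e in audit.entries]))
```

Making dry-run the default means a forgotten flag results in a printed plan rather than an unintended change, which is one common way to keep automated remediation safe during an incident.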

Technology

  • Go
  • Python
  • Prometheus
  • Ansible
  • PagerDuty API

Impact

On-call pages dropped by 58%, and first-action time on the remaining pages fell to under 90 seconds.