Responsibilities
• Own and evolve our service templates, Helm chart conventions, and Argo CD App-of-Apps patterns so that adding or migrating a service is a guided, low-risk experience.
• Build and maintain reusable GitHub Actions workflows (build / push / scan, frontend build / deploy, SonarQube scans, semantic release) and improve CI feedback loops, build times, and caching.
• Define and enforce platform standards for observability — structured logs into Loki, metrics into Prometheus / Mimir, dashboards in Grafana, and SLOs / alerts wired in by default.
• Build self-service tooling around environments, secrets, feature flags, and access — so that the right thing is easy and the wrong thing is hard to do by accident.
• Own the developer-facing aspects of identity and access (Auth0, IdP integrations, Tailscale access, IRSA / service accounts) and keep onboarding and offboarding smooth.
• Partner with DevOps on infrastructure changes, with Cloud Security on guardrails, and with backend / frontend / data / firmware teams to understand their pain points and prioritize platform investments.
• Mentor engineers across the org on platform conventions, lead design reviews for new services, and push back on patterns that don’t scale.
• Treat the platform as a product: gather feedback, define roadmaps, write docs, and measure adoption and reliability.
Required Skills
• Track record of owning and delivering platform initiatives end-to-end, from design through adoption, with limited day-to-day supervision.
• Strong working knowledge of Kubernetes (EKS or similar) and GitOps workflows with Argo CD or Flux.
• Hands-on experience with Infrastructure as Code using Terraform; comfort with Terragrunt or a similar wrapper.
• Solid experience with CI/CD systems, ideally GitHub Actions, including reusable / composable workflows and release automation.
• Working knowledge of AWS core services (EKS, EC2, RDS, S3, IAM, VPC, ECR) and how to compose them into reliable, secure platforms.
• Experience designing developer abstractions (Helm charts, service templates, internal CLIs, scaffolding tools, or Backstage-style portals) that other engineering teams can easily adopt and use.
• Strong programming skills in Python, Bash, or TypeScript for building tooling and automation.
• Experience integrating observability (Grafana, Loki, Prometheus / Mimir, OpenTelemetry, or similar) as a default rather than an afterthought.
• Strong written communication skills, with a habit of writing docs, runbooks, and wikis that engineers can actually use.
Bonus Skills
• Experience with Backstage or a comparable internal developer portal.
• Experience integrating Argo Workflows or similar Kubernetes-native job / pipeline runners into a developer platform.
• Familiarity with Databricks or MLOps pipelines and the developer experience around data / model deployment.
• Experience with Tailscale, Auth0, Entra ID, or other identity / zero-trust networking tooling.
• Familiarity with cloud architectures supporting IoT / embedded systems and distributed, low-power devices.
• Experience in high-growth startup environments where you must wear many hats.