We are seeking a Platform Reliability & Architecture Lead to take end-to-end ownership of the stability, scalability, and technical robustness of a multi-product SaaS ecosystem. This is a deeply hands-on, horizontal engineering role that operates across core systems and product lines, ensuring platforms remain performant, resilient, and secure as usage scales and traffic spikes.
This is not a people-management position. It is designed for a senior technologist who thrives on system design, debugging complex failures, and building reliable distributed systems that operate predictably under pressure.
What You’ll Be Responsible For
1. Platform Stability & Operational Excellence
Own availability, latency, and reliability targets across multiple products and shared services.
Proactively identify failure patterns, reduce system fragility, and eliminate recurring production issues.
Lead deep-dive incident investigations and drive permanent fixes rather than short-term patches.
Prepare systems for peak traffic events through load testing, capacity planning, and resilience improvements.
Act as the final technical escalation point during critical production incidents.
2. Scalable Systems & Architecture Evolution
Shape and evolve backend architecture to support increasing scale and complexity.
Design robust distributed workflows using async processing, retries, idempotency, and fault isolation patterns.
Improve API contracts, data schemas, and service boundaries to reduce coupling and failure blast radius.
Introduce reliability patterns such as circuit breakers, backpressure handling, and graceful degradation.
3. Deep Hands-On Engineering Leadership
Review and guide complex technical designs across teams and products.
Drive cross-cutting refactors to remove architectural bottlenecks and accumulated technical debt.
Unblock engineers facing hard performance, concurrency, or data consistency problems.
Set high standards for code quality, testing rigor, and production readiness.
4. Observability, Metrics & Early Detection
Define meaningful service-level indicators (SLIs) and service-level objectives (SLOs) for critical systems.
Build actionable dashboards and alerts that surface issues before customers are impacted.
Ensure logs, metrics, and traces provide end-to-end visibility into distributed workflows.
Partner with platform and infra teams to strengthen monitoring and alerting strategies.
5. Security, Data Integrity & Compliance
Enforce secure development practices across backend systems and APIs.
Ensure correct access controls, secrets handling, audit logging, and data protection mechanisms.
Support compliance initiatives (e.g., SOC 2) through system design and operational discipline.
Safeguard data correctness and permission boundaries across shared services.
6. Cross-Team Technical Partnership
Collaborate with Product teams to convert business requirements into resilient technical designs.
Work closely with Support and Customer Success to understand real-world production pain points.
Partner with Infrastructure and Core Systems teams on performance, scalability, and reliability improvements.
Align with QA on load testing, integration testing, and failure simulation strategies.
What We’re Looking For
Must-Have Experience
At least 3–4 years of backend engineering experience, preferably in Python.
Strong fundamentals in distributed systems, backend architecture, and data consistency.
Deep hands-on experience with relational databases, schema design, and query optimization.
Practical expertise in async processing, background jobs, queues, retries, and caching (e.g., Redis).
Exceptional debugging ability across application, database, and infrastructure layers.
Proven track record of improving system reliability and performance in production environments.
Ability to influence technical direction across multiple teams without formal authority.
Nice-to-Have
Experience with observability platforms such as Datadog, Sentry, Prometheus, or Elasticsearch.
Exposure to large enterprise integrations or CRM-heavy platforms.
Prior ownership of uptime, SLAs, or reliability for a multi-product SaaS platform.
Familiarity with security best practices and compliance-driven system design.
What Success Looks Like
First 90 Days
Build a deep mental model of the platform, workflows, and failure modes.
Deliver early wins in stability, performance, and operational visibility.
Become the go-to engineer for complex system-level issues.
4–6 Months
Establish clear reliability metrics and operational standards across systems.
Reduce alert noise and incident volume through architectural improvements.
Achieve a noticeable drop in production errors and customer-impacting issues.
7–12 Months
Consistently achieve 99.9%+ availability (at most about 8.8 hours of downtime per year) across core platforms.
Ensure async and background systems are predictable, scalable, and resilient.
Enable faster product development through cleaner, more stable foundations.
Core Skills
Backend Systems Engineering
Distributed Systems & Reliability
Platform Architecture
Database Design & Performance
Scalable SaaS Infrastructure