Platform Reliability & Architecture Lead at Weekday in Bengaluru

This role is for one of Weekday's clients.

Industry: Education

Seniority level: Associate level

Min Experience: 3 years

Location: Bengaluru

Job Type: Full-time

₹25,00,000 - ₹28,00,000 a year

We are seeking a Platform Reliability & Architecture Lead to take end-to-end ownership of the stability, scalability, and technical robustness of a multi-product SaaS ecosystem. This is a deeply hands-on, horizontal engineering role that operates across core systems and product lines, ensuring platforms remain performant, resilient, and secure as usage scales and traffic spikes.

This role is not a people-management position. Instead, it is designed for a senior technologist who thrives on system design, debugging complex failures, and building reliable distributed systems that operate predictably under pressure.

What You’ll Be Responsible For

1. Platform Stability & Operational Excellence

Own availability, latency, and reliability targets across multiple products and shared services.

Proactively identify failure patterns, reduce system fragility, and eliminate recurring production issues.

Lead deep-dive incident investigations and drive permanent fixes rather than short-term patches.

Prepare systems for peak traffic events through load testing, capacity planning, and resilience improvements.

Act as the final technical escalation point during critical production incidents.

2. Scalable Systems & Architecture Evolution

Shape and evolve backend architecture to support increasing scale and complexity.

Design robust distributed workflows using async processing, retries, idempotency, and fault isolation patterns.

Improve API contracts, data schemas, and service boundaries to reduce coupling and failure blast radius.

Introduce reliability patterns such as circuit breakers, backpressure handling, and graceful degradation.

3. Deep Hands-On Engineering Leadership

Review and guide complex technical designs across teams and products.

Drive cross-cutting refactors to remove architectural bottlenecks and accumulated technical debt.

Unblock engineers facing hard performance, concurrency, or data consistency problems.

Set high standards for code quality, testing rigor, and production readiness.

4. Observability, Metrics & Early Detection

Define meaningful service-level indicators (SLIs) and objectives (SLOs) for critical systems.

Build actionable dashboards and alerts that surface issues before customers are impacted.

Ensure logs, metrics, and traces provide end-to-end visibility into distributed workflows.

Partner with platform and infra teams to strengthen monitoring and alerting strategies.

5. Security, Data Integrity & Compliance

Enforce secure development practices across backend systems and APIs.

Ensure correct access controls, secrets handling, audit logging, and data protection mechanisms.

Support compliance initiatives (e.g., SOC 2) through system design and operational discipline.

Safeguard data correctness and permission boundaries across shared services.

6. Cross-Team Technical Partnership

Collaborate with Product teams to convert business requirements into resilient technical designs.

Work closely with Support and Customer Success to understand real-world production pain points.

Partner with Infrastructure and Core Systems teams on performance, scalability, and reliability improvements.

Align with QA on load testing, integration testing, and failure simulation strategies.

What We’re Looking For

Must-Have Experience

3–4+ years of backend engineering experience, preferably using Python.

Strong fundamentals in distributed systems, backend architecture, and data consistency.

Deep hands-on experience with relational databases, schema design, and query optimization.

Practical expertise in async processing, background jobs, queues, retries, and caching (Redis).

Exceptional debugging ability across application, database, and infrastructure layers.

Proven track record of improving system reliability and performance in production environments.

Ability to influence technical direction across multiple teams without formal authority.

Nice-to-Have

Experience with observability platforms such as Datadog, Sentry, Prometheus, or Elasticsearch.

Exposure to large enterprise integrations or CRM-heavy platforms.

Prior ownership of uptime, SLAs, or reliability for a multi-product SaaS platform.

Familiarity with security best practices and compliance-driven system design.

What Success Looks Like

First 90 Days

Build a deep mental model of the platform, workflows, and failure modes.

Deliver early wins in stability, performance, and operational visibility.

Become the go-to engineer for complex system-level issues.

4–6 Months

Establish clear reliability metrics and operational standards across systems.

Reduce noise from alerts and incidents through architectural improvements.

Achieve a noticeable drop in production errors and customer-impacting issues.

7–12 Months

Consistently achieve high availability (99.9%+) across core platforms.

Ensure async and background systems are predictable, scalable, and resilient.

Enable faster product development through cleaner, more stable foundations.

Core Skills

Backend Systems Engineering

Distributed Systems & Reliability

Platform Architecture

Database Design & Performance

Scalable SaaS Infrastructure

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
