Platform Reliability & Architecture Lead

Company: Weekday
Job Type: Full Time

Job Description - Platform Reliability & Architecture Lead

This role is for one of our clients
Industry: Education
Seniority level: Associate level

Min Experience: 3 years
Location: Bengaluru
Job Type: Full-time
Salary: ₹25,00,000 - ₹28,00,000 a year
We are seeking a Platform Reliability & Architecture Lead to take end-to-end ownership of the stability, scalability, and technical robustness of a multi-product SaaS ecosystem. This is a deeply hands-on, horizontal engineering role that operates across core systems and product lines, ensuring platforms remain performant, resilient, and secure as usage scales and traffic spikes.
This role is not a people-management position. Instead, it is designed for a senior technologist who thrives on system design, debugging complex failures, and building reliable distributed systems that operate predictably under pressure.

What You’ll Be Responsible For
1. Platform Stability & Operational Excellence
Own availability, latency, and reliability targets across multiple products and shared services.
Proactively identify failure patterns, reduce system fragility, and eliminate recurring production issues.
Lead deep-dive incident investigations and drive permanent fixes rather than short-term patches.
Prepare systems for peak traffic events through load testing, capacity planning, and resilience improvements.
Act as the final technical escalation point during critical production incidents.
2. Scalable Systems & Architecture Evolution
Shape and evolve backend architecture to support increasing scale and complexity.
Design robust distributed workflows using async processing, retries, idempotency, and fault isolation patterns.
Improve API contracts, data schemas, and service boundaries to reduce coupling and failure blast radius.
Introduce reliability patterns such as circuit breakers, backpressure handling, and graceful degradation.
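
To make one of the reliability patterns named above concrete, here is a minimal circuit-breaker sketch in Python. It is an illustration only, not the client's implementation: the class name, the `max_failures` and `reset_after` parameters, and the fallback approach are all assumptions chosen for the example. The idea is that after repeated failures the caller stops hitting the unhealthy dependency for a cool-down period and serves a degraded fallback instead.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal circuit breaker; names and thresholds are illustrative."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures tolerated before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable clock, eases testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency until the cool-down has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down over: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap a dependency call as `breaker.call(fetch_profile, lambda: cached_profile)`, where both callables are hypothetical. Returning a fallback rather than raising is one way to get graceful degradation; a production version would also need thread safety and metrics, which this sketch omits.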
3. Deep Hands-On Engineering Leadership
Review and guide complex technical designs across teams and products.
Drive cross-cutting refactors to remove architectural bottlenecks and accumulated technical debt.
Unblock engineers facing hard performance, concurrency, or data consistency problems.
Set high standards for code quality, testing rigor, and production readiness.
4. Observability, Metrics & Early Detection
Define meaningful service-level indicators (SLIs) and objectives (SLOs) for critical systems.
Build actionable dashboards and alerts that surface issues before customers are impacted.
Ensure logs, metrics, and traces provide end-to-end visibility into distributed workflows.
Partner with platform and infra teams to strengthen monitoring and alerting strategies.
5. Security, Data Integrity & Compliance
Enforce secure development practices across backend systems and APIs.
Ensure correct access controls, secrets handling, audit logging, and data protection mechanisms.
Support compliance initiatives (e.g., SOC 2) through system design and operational discipline.
Safeguard data correctness and permission boundaries across shared services.
6. Cross-Team Technical Partnership
Collaborate with Product teams to convert business requirements into resilient technical designs.
Work closely with Support and Customer Success to understand real-world production pain points.
Partner with Infrastructure and Core Systems teams on performance, scalability, and reliability improvements.
Align with QA on load testing, integration testing, and failure simulation strategies.

What We’re Looking For
Must-Have Experience
3–4+ years of backend engineering experience, preferably using Python.
Strong fundamentals in distributed systems, backend architecture, and data consistency.
Deep hands-on experience with relational databases, schema design, and query optimization.
Practical expertise in async processing, background jobs, queues, retries, and caching (Redis).
Exceptional debugging ability across application, database, and infrastructure layers.
Proven track record of improving system reliability and performance in production environments.
Ability to influence technical direction across multiple teams without formal authority.
Nice-to-Have
Experience with observability platforms such as Datadog, Sentry, Prometheus, or Elasticsearch.
Exposure to large enterprise integrations or CRM-heavy platforms.
Prior ownership of uptime, SLAs, or reliability for a multi-product SaaS platform.
Familiarity with security best practices and compliance-driven system design.

What Success Looks Like
First 90 Days
Build a deep mental model of the platform, workflows, and failure modes.
Deliver early wins in stability, performance, and operational visibility.
Become the go-to engineer for complex system-level issues.
4–6 Months
Establish clear reliability metrics and operational standards across systems.
Reduce noise from alerts and incidents through architectural improvements.
Achieve a noticeable drop in production errors and customer-impacting issues.
7–12 Months
Consistently achieve high availability (99.9%+) across core platforms.
Ensure async and background systems are predictable, scalable, and resilient.
Enable faster product development through cleaner, more stable foundations.
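
For a sense of scale, the 99.9% availability target above can be translated into an allowed-downtime budget. The short Python sketch below does that arithmetic; the function name and the 30-day and 365-day periods are assumptions made for illustration, and the posting does not say how availability is actually measured.

```python
def downtime_budget_minutes(availability_pct, period_days):
    """Minutes of allowed downtime for an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# Budgets for the 99.9% target over two illustrative periods.
monthly = downtime_budget_minutes(99.9, 30)
yearly = downtime_budget_minutes(99.9, 365)
print(f"30-day budget: {monthly:.1f} min, 365-day budget: {yearly / 60:.2f} h")
```

At 99.9%, that works out to roughly 43 minutes of downtime per 30 days, or just under 9 hours per year, across planned and unplanned outages combined.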

Core Skills
Backend Systems Engineering
Distributed Systems & Reliability
Platform Architecture
Database Design & Performance
Scalable SaaS Infrastructure
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
Original job Platform Reliability & Architecture Lead posted on GrabJobs ©. To flag any issues with this job please use the Report Job button on GrabJobs.

Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.