
Lead Engineer - Reliability Engineering

Overview

Connecting clients to markets – and talent to opportunity.

With 5,400+ employees and over 80,000 institutional, commercial, and payments clients, we operate from more than 80 offices spread across six continents. As a Fortune 100, Nasdaq-listed provider, we connect clients to the global markets – focusing on innovation, human connection, and providing world-class products and services to all types of investors.

Whether you want to forge a career connecting our retail clients to potential trading opportunities or immerse yourself in the world of institutional investing, StoneX Group is made up of four business segments that offer endless potential for progression and growth.

Engage in a wide variety of business-critical activities that keep our company running efficiently. From strategic marketing and financial management to human resources and operational oversight, you’ll have the opportunity to optimize processes and implement game-changing policies.

As a Lead Engineer in Reliability Engineering, you will help define and drive the next stage of reliability maturity across our platforms and services. This is a senior hands-on engineering role for someone who has already spent several years building and operating Site Reliability Engineering practices in a large organization, and who understands what good looks like in production at scale.

You will partner closely with our Platform Engineering Observability team to improve reliability standards, operational practices, service ownership models, and engineering guardrails. Over time, you will help grow reliability capabilities across the wider engineering organization by mentoring engineers, shaping ways of working, and building practical and measurable reliability practices. The team is actively expanding end-to-end observability coverage across key business applications. This role will help ensure that telemetry, service health, reliability standards, and operational practices are implemented consistently and effectively as adoption grows.

This is an individual contributor role with no direct people management responsibilities. Success in this role will come through technical leadership, hands-on engineering contribution, mentorship, and influence across teams.


Responsibilities

  • Define and drive reliability engineering standards, practices, and the enterprise reliability maturity model across platforms and services, including service tiering and adoption metrics
  • Partner with engineering, platform, infrastructure, and product teams to improve service reliability, resilience, operability, and supportability
  • Establish and mature reliability practices such as SLOs, SLIs, error budgets, alert quality, toil reduction, production readiness reviews, and service ownership expectations
  • Build and improve operational processes for change safety, release confidence, capacity planning, resilience testing, and disaster recovery across multi-cloud and hybrid environments
  • Use observability platforms such as Datadog or similar tools to improve visibility, actionable alerting, dashboards, and service health reporting
  • Drive end-to-end observability adoption for critical applications, ensuring consistent implementation of metrics, logs, traces, service maps, dashboards, and actionable alerting
  • Partner with application, platform, and infrastructure teams to improve instrumentation quality, service ownership, and operational readiness as key applications are onboarded into the observability ecosystem
  • Define and standardize observability architecture and telemetry standards, including service dependency mapping, service health indicators, alert quality, and operational response workflows
  • Drive automation of operational tasks and embed reliability guardrails into platform engineering workflows, including CI/CD pipelines and internal developer platforms
  • Apply observability standards to golden path templates, workflows, scorecards, and dashboards within the internal developer platform so operational best practices are embedded holistically throughout the SDLC
  • Identify reliability risks including architectural weaknesses, service fragility, and third-party or provider dependencies, and partner with teams to address them
  • Define and track meaningful reliability metrics and operational KPIs that help engineering teams improve service outcomes over time
  • Act as a senior hands-on engineer who guides technical direction while contributing directly to design, implementation, and operational improvement work
  • Mentor and coach engineers across the team, helping them develop stronger reliability and operational engineering skills

Qualifications

  • A track record of building, improving, or scaling reliability engineering or SRE practices in a large organization
  • 7+ years of experience in SRE, production engineering, platform engineering, infrastructure engineering, or a closely related role
  • Several years of hands-on experience supporting production systems at scale, including incident response, problem management, availability improvement, and operational excellence
  • Strong practical experience defining and implementing SLOs, SLIs, error budgets, service health models, and reliability-focused engineering practices
  • Strong experience with observability platforms such as Datadog or similar platforms, including metrics, logs, tracing, alerting, dashboards, and service-level reporting
  • Experience driving or supporting end-to-end observability adoption across application teams, including instrumentation, telemetry standards, dashboards, alerting, and service-level reporting
  • Experience driving improvements in incident management, post-incident reviews, on-call effectiveness, and operational maturity
  • Experience automating operational processes using tools such as Terraform, scripting languages, CI/CD pipelines, and cloud-native platforms
  • Experience working with Kubernetes, Linux, Git, and modern cloud or platform infrastructure
  • Strong systems thinking and the ability to balance reliability, latency, engineering velocity, risk, and cost
  • Strong communication, collaboration, and influencing skills, with the ability to work across multiple teams and levels of seniority
  • Demonstrated ability to mentor engineers and help raise the reliability maturity of a broader team
  • A practical mindset: the ability to define strong engineering practices and also contribute directly in a hands-on way

What makes you stand out:

  • You have helped build or formalize reliability engineering or SRE practices in a complex organization
  • You know what good looks like for service ownership, production readiness, alerting quality, incident response, and operational accountability
  • You have helped onboard critical applications into an end-to-end observability model, improving visibility across metrics, logs, traces, service dependencies, and operational response
  • You have successfully reduced toil, improved service reliability, and created measurable operational improvements across teams
  • You are able to influence engineering culture, not just tooling or process
  • You have coached less experienced engineers and helped teams grow into stronger operational ownership
  • You are comfortable introducing structure and standards without creating unnecessary bureaucracy
  • You can work across observability, platform engineering, and application teams to create practical, adoptable reliability practices

Education / Certification Requirements: 

  • Bachelor’s degree in computer science, engineering, or a related field, or equivalent practical experience
  • Relevant certifications are a plus, but practical experience building and operating reliable systems at scale is more important
  • Commitment to continual professional and technical development

Working environment:

  • Hybrid, four days in the office.
  • Occasional travel for team collaboration meetings and conferences.

Original job posting: Lead Engineer - Reliability Engineering, posted on GrabJobs.