As a Senior Software Engineer in SRE at eMed, you will play a key role in ensuring our platform is highly available, secure, and performant. You’ll lead reliability engineering efforts across production systems, drive operational excellence, and collaborate closely with application and infrastructure teams to design resilient services. This role suits an engineer with a software mindset and deep operational experience, who thrives on improving systems through automation and proactive engineering.
2.2 WHAT YOU WILL WORK ON
Design and implement robust monitoring, alerting, and observability systems across all services and infrastructure
Lead reliability reviews, incident response, and post-incident analysis, focusing on prevention, learning, and long-term improvements
Improve service scalability, fault tolerance, and performance through architectural input and systems optimisation
Build and maintain automation for infrastructure management using Terraform, and delivery pipelines using GitHub Actions
Partner with software engineers to improve the operational readiness and resilience of services, including capacity planning and runbooks
Lead initiatives to reduce operational toil through tooling, automation, and process improvement
Manage and optimise our production Kubernetes and AWS environments with a focus on reliability, security, and cost-effectiveness
Contribute to security hardening efforts, including network controls, secrets management, and compliance readiness
Participate in and lead in-person stand-ups, incident reviews, and cross-team planning sessions
Share knowledge and mentor engineers on best practices in observability, incident response, and operational engineering
2.3 WHAT WE’RE LOOKING FOR
Technical Skills (Essential)
Strong experience operating Kubernetes and cloud-native infrastructure (preferably EKS on AWS) in production environments
Proficiency in AWS services, including networking (e.g. VPC, ELB), compute, IAM, and logging/monitoring tools (e.g. CloudWatch)
Skilled in Terraform and Infrastructure as Code practices
Deep understanding of observability tooling (metrics, logs, tracing) and incident management workflows
Strong coding skills for building tools, scripts, and automation
Ability to troubleshoot complex infrastructure issues and lead delivery of reliable cloud solutions
Preferred
Experience implementing SLAs, SLOs, and error budgets to guide operational priorities
Background in healthcare or other regulated industries with security and compliance requirements