Site Reliability Engineer (SRE) IT Platform Engineering

Company : Omnissa

Job Type : Full Time

Bengaluru, India

Job Description:

Site Reliability Engineer (SRE) – Observability, Tracing & Platform Operations

About Omnissa — Our Company, Mission & Vision

Omnissa is an independent global leader in digital work platforms, originating from the former VMware End User Computing business and now operating under KKR ownership. Omnissa empowers organizations to deliver seamless, flexible, and secure digital work experiences to employees everywhere. Our platform supports over 26,000 customers worldwide, including 7 of the top 10 Fortune 500 enterprises.

Our purpose is clear:

Guided by decades of innovation and strengthened by significant investment in AI, open APIs, and next-generation digital workspace technologies, Omnissa is building the industry’s first autonomous workspace experience: a platform that simplifies operations for IT while unlocking higher productivity for employees.

Platform Engineering at Omnissa (IT Organization)

The Platform Engineering team within Omnissa’s internal IT organization is responsible for architecting, operating, and continuously improving our enterprise-grade infrastructure platforms. Our mission is to deliver highly resilient, scalable, and secure systems that power Omnissa’s internal operations and customer-facing services.

Our environment includes:

Core Platforms

On-premises cloud environments built on:

VMware Cloud Foundation (vCF)

Apache CloudStack

Proxmox Virtualization Stack

Kubernetes-based orchestration for containerized workloads

Open-source S3-compatible object storage systems

Observability Infrastructure

We have developed an extensive observability infrastructure that monitors all Omnissa internal services and is managed automatically using:

Prometheus

Grafana

Loki

Ansible

AI-Driven Automation & Incident Response

We have developed an internal AI-powered incident diagnosis and first-response platform, leveraging cutting-edge open technologies including:

Ollama

n8n

Various Model Context Protocol (MCP) servers

These systems help us reduce mean time to detect (MTTD) and mean time to resolve (MTTR) through automated analysis, enrichment, and intelligent triage.

Role Overview — Site Reliability Engineer (SRE)

We are seeking a highly skilled SRE with deep expertise in Observability, particularly in:

Automation

Grafana

Loki

Prometheus

Development/Scripting

This role is critical to maintaining the reliability, performance, and operational integrity of our platforms. You will support both planned and unplanned workstreams, collaborating closely with engineering, incident management, and service owners.

The role includes participation in an on-call rotation, including nights and weekends, to ensure continuous coverage for mission-critical systems.

Key Responsibilities

Observability Engineering

Design, deploy, and maintain Loki, Grafana, Prometheus, and integrated observability pipelines.

Contribute to monitoring initiatives by developing new checks for services.

Maintain and improve our automation workflows that manage the infrastructure.

Develop and refine AI workflows for incident analysis and auto-remediation.

Continuously enhance logging, metrics, and tracing coverage across services.

Reliability, Resilience & Performance

Ensure high availability, and drive capacity planning and performance optimization across platforms.

Drive reliability improvements through automation, SLIs/SLOs, and root cause analysis.

Partner with development and platform teams to embed reliability best practices.

Incident Management & On-Call

Participate in the global on-call rotation, including weekends.

Leverage our AI-driven incident diagnosis tools (Ollama, n8n, MCP) to accelerate response.

Manage unplanned work such as production incidents, outages, and high-urgency escalations, and participate in post-mortem reviews.

Coordinate post-incident reviews and continuous improvement initiatives.

Planned & Unplanned Work Management

Utilize the Atlassian toolset (Jira, Confluence, Opsgenie, etc.) for structured task, change, and incident management.

Manage planned maintenance, releases, and platform improvements.

Collaborate with cross-functional teams to prioritize backlog and operational tasks.

Platform Operations

Support and enhance internal clouds based on vCF, CloudStack, and Proxmox.

Operate Kubernetes clusters and improve the reliability of containerized workloads.

Maintain S3-compatible storage platforms used across the enterprise.

Required Skills & Experience

Familiarity with at least one scripting/programming language.

Strong hands-on expertise with:

Grafana, Loki, Tempo (or similar tracing systems), Prometheus

Experience with configuration management tools (e.g., Ansible, SaltStack).

Proficiency in operating modern Linux-based distributed systems.

Experience supporting large-scale, highly available architectures.

Familiarity with Kubernetes, CI/CD pipelines, and Infrastructure as Code.

Comfortable with on-call participation and incident leadership.

Experience with Atlassian tools (Jira, Confluence, Opsgenie).

Proficiency in Linux & Windows.

Nice to Have

Exposure to Ollama, n8n, or similar AI orchestration/automation tooling.

Experience with S3 storage internals or open-source object stores (e.g., SeaweedFS, Ceph).

Understanding of virtualization stacks such as Proxmox, vSphere/vCF, or CloudStack.

Background in an SRE-driven culture, including SLIs/SLOs and error budgeting.

Original job Site Reliability Engineer (SRE) IT Platform Engineering posted on GrabJobs.
