Senior SRE Engineer (MLOps) - AI

Company: Salla

Job Type: Full Time

Makkah, Saudi Arabia

Job Description - Senior SRE Engineer (MLOps) - AI

Description

Salla is looking for a Senior SRE Engineer (MLOps) to join our Salla AI team. This role focuses on running our AI and ML systems as real production systems, not side experiments — owning the operational layer around models, prompts, agents, inference services, and retrieval systems. You will be responsible for enabling Agentic AI and Generative AI features to operate reliably, securely, and cost-effectively at scale within the Salla ecosystem.

This role is SRE- and platform-engineering-first, with a strong emphasis on reliability, observability, safe releases, cost, and governance, while collaborating closely with engineering, data, and AI teams to give every pod a fast, safe path to production. It exists because AI systems fail differently from normal services — a prompt change can behave like a code change, an agent calling tools needs auditability, and latency, quality, and cost can move together in uncomfortable ways.

Key Responsibilities

Own reliability for ML and agentic AI services in production — SLOs, dashboards, alerts, runbooks, and incident follow-ups
Build observability across the AI stack — latency, errors, traces, tool calls, cost, and user impact
Design safe-release patterns for models, prompts, agents, tools, and configuration, including canary, rollback, feature-flag, and evaluation-gate strategies
Provide operational support for inference APIs, queues, retrieval layers, and AI workflows running on Kubernetes/EKS
Establish ownership, traceability, and guardrails around what agentic systems (e.g. Sidekick, the growth advisor) are allowed to do, including how they call internal tools
Defend agent tool-calling against prompt injection and untrusted-data risks — establish and enforce data-trust boundaries so that untrusted store/merchant content cannot manipulate agent decisions, tool calls, or actions
Drive AI cost governance — per-model and per-pod spend visibility, token-cost tracking, and anomaly alerting
Build automation and self-service paths so product teams have a known safe path to production instead of rebuilding it each time
Turn recurring operational pain into simple, reusable platform standards that other teams adopt
Participate in architecture discussions, code reviews, and technical decision-making

Requirements

4+ years in SRE, platform engineering, DevOps, or production infrastructure, operating distributed systems in production — not only in demos
Hands-on experience with Kubernetes and cloud-native systems in production
Familiarity with deploying ML projects
Strong command of CI/CD, GitOps, observability, and incident response
Solid experience with infrastructure-as-code, secrets management, and networking
Ability to write automation or platform tooling in Python or a similar language
Production judgment — knowing how to make systems measurable, debuggable, repeatable, and safe to change (you do not need to be a machine learning researcher)
Ability to work across teams, explain trade-offs clearly, and turn operational pain into standards engineers will actually use

Nice to have:

Experience with MLOps or ML platforms — model serving, registries, evaluation, feature/data dependencies, drift monitoring, or ML pipelines
Familiarity with LLM applications or agentic systems — RAG, vector databases, tool calling, workflow orchestration, memory, traces, guardrails, or evaluation pipelines
Exposure to tooling such as OpenTelemetry, Prometheus, Grafana, MLflow, KServe, Ray, LiteLLM, vLLM, LangGraph, Arize Phoenix, or LangSmith
Experience with Kafka consumers, GPU workloads, inference optimization, model routing, or AI cost governance
Experience working in cross-functional product teams involving AI, backend, and frontend engineers

Original job Senior SRE Engineer (MLOps) - AI posted on GrabJobs.
