· Operate and manage Kubernetes or OpenShift clusters for multi-node orchestration
· Deploy and manage LLMs and other AI models for inference using Triton Inference Server or custom endpoints
· Automate CI/CD pipelines for model packaging, serving, retraining, and rollback using GitLab CI or ArgoCD
· Set up model and infrastructure monitoring systems (Prometheus, Grafana, NVIDIA DCGM)
· Implement model drift detection, performance alerting, and inference logging
· Manage model checkpoints, reproducibility controls, and rollback strategies
· Track deployed model versions using MLflow or equivalent registry tools
· Implement secure access controls for model endpoints and data artifacts
· Collaborate with AI/Data Engineers to integrate and deploy fine-tuned datasets
· Ensure high availability, performance, and observability of all AI services in production
· 3+ years of experience in DevOps, MLOps, or AI/ML infrastructure roles
· 10+ years of overall experience in solution operations
· Proven experience with Kubernetes or OpenShift in production environments; certification preferred
· Familiarity with deploying and scaling PyTorch or TensorFlow models for inference
· Experience with CI/CD automation tools on OpenShift or Kubernetes
· Hands-on experience with model registry systems (e.g., MLflow, Kubeflow)
· Experience with monitoring tools (e.g., Prometheus, Grafana) and GPU workload optimization
· Strong scripting skills (Python, Bash) and Linux system administration knowledge