Job Description - Staff Architect, Data & AI Platform
We're building a unified data and AI/ML model development and training platform, similar in ambition to the platforms at other media technology companies, to accelerate our entire AI/ML lifecycle: data preparation, feature computation, experimentation, large-scale distributed training, evaluation, deployment, and governance and observability. A major focus is creating horizontal, reusable components (feature pipelines, embedding services, training/evaluation frameworks, SDKs, and model management APIs) that simplify the end-to-end AI/ML lifecycle. You'll partner closely with AI/ML researchers and engineers, data engineers, and product teams to deliver a paved-path developer experience: from a laptop run to multi-node, GPU-accelerated training; experiment tracking and lineage; model packaging with evaluation hooks; and continuous delivery to production.
Architect a cohesive platform that integrates the underlying data and AI/ML compute orchestration systems (Kubernetes, Ray, Spark) with feature, experiment, and model stores, and that enables batch workloads, distributed training, and multi-stage pipelines while ensuring consistent resource usage patterns.
Establish unified pipeline definitions for dataset preparation flows, reliable and scalable training and inference routines, and evaluation workloads to enable reproducibility, repeatability, and reuse.
Define stable, well-versioned APIs and Python SDKs that follow common architectural patterns and provide framework-native integrations (PyTorch/Ray, etc.), so teams can build declarative AI/ML workflows spanning dataset selection, preparation, training, tuning, evaluation, and model registration, deployment, and serving.
Design and evolve the experimentation, model checkpointing/registration, and other AI/ML metadata systems to support lineage, observability, and evaluation, standardizing how teams track, assess, compare, and promote models.
Define best practices for AI/ML version control, CI/CD (build, test, evaluate, tune), continuous training, and dependency management.
Collaborate with AI/ML researchers, engineers, platform engineering/MLOps, infrastructure, security, and product stakeholders to align roadmap priorities and architectural sequencing with workflow pain points.
Partner with platform engineering/MLOps and infrastructure to define observability stacks for metrics, drift indicators, performance regressions, training/inference health signals, production reliability (SLIs/SLOs), monitoring, and incident response.
Desired Background
BS in Computer Science, Mathematics, Engineering, or an equivalent technical field; Master's highly preferred.
Proven experience architecting large-scale distributed systems and integrated data and AI/ML platforms (e.g., training, serving, workflow orchestration, data pipelines).
Expert-level proficiency in Python and one of Go, Java, or C++, and in building production-grade services, APIs, and SDKs.
Deep knowledge of and extensive experience with GPU infrastructure, distributed systems, orchestration engines, AI/ML toolchains, query engines, and pipeline technologies (e.g., Ray, Spark, Metaflow/Flyte/Airflow/Argo/Kubeflow, Kubernetes).
Deep expertise with AI/ML ecosystem tooling such as PyTorch, TensorFlow, and Hugging Face; the MLOps stack of experiment tracking, feature stores, and model registries/lineage (MLflow, W&B, Feast, etc.); and AI/ML-related AWS/GCP managed services (Vertex AI, SageMaker, etc.).
Proven ability to scale AI/ML platforms with an emphasis on reproducibility and repeatability, integrating data preprocessing, feature access, and model training/evaluation routines through workflow orchestration/pipelines, developer-oriented SDKs, feature/model CI/CD, automated evaluation, safe rollouts, and monitoring.
Strong collaboration skills across research, engineering, MLOps, infrastructure, and product; ability to translate cross-functional needs into scalable platform abstractions.
Experience designing feature stores or embedding services tightly integrated with training pipelines.