Job Responsibilities:
- Design, provision, and maintain cloud resources on AWS (primary), with the ability to work in Azure and Google Cloud environments
- Manage end-to-end infrastructure for full-stack GenAI applications, including:
  - Database systems (Aurora, RDS, DynamoDB, DocumentDB, etc.)
  - Security groups and IAM policies
  - VPC architecture and network design
  - Container orchestration and serverless compute (ECS, EKS, Lambda)
  - Storage solutions (S3, EFS, etc.)
  - CDN configuration (CloudFront)
  - DNS management (Route53)
  - Load balancing and auto-scaling
- Design feature stores, vector stores, data ingestion frameworks, and lakehouse architectures
- Manage data governance, lineage, masking, and access controls around data products
- Design and implement serverless solutions using AWS Lambda, API Gateway, and EventBridge
- Optimize serverless applications for performance, cost, and scalability
- Implement event-driven architectures and asynchronous processing patterns
- Manage serverless deployment pipelines and monitoring
- Architect and implement comprehensive disaster recovery strategies
- Design multi-region failover capabilities with automated recovery procedures
- Implement RTO/RPO requirements through backup strategies and replication
- Build auto-failover mechanisms using Route53 health checks and failover routing
- Create and maintain disaster recovery runbooks and testing procedures
- Ensure data durability through cross-region replication and backup strategies
- Build and maintain a self-service platform enabling rapid experimentation and testing of GenAI applications
- Implement Infrastructure as Code (IaC) using Terraform for consistent and repeatable deployments
- Create streamlined CI/CD pipelines that support local-to-dev-to-prod workflows
- Design systems that minimize deployment time and maximize developer productivity
- Establish quick feedback loops between development and deployment
- Implement comprehensive monitoring, observability, and alerting solutions
- Set up logging aggregation and analysis tools
- Ensure high availability and disaster recovery capabilities
- Optimize cloud costs while maintaining performance
- DevOps Excellence:
  - Champion DevOps best practices across the organization
  - Automate infrastructure provisioning and application deployment
  - Implement security best practices and compliance requirements
  - Create documentation and runbooks for operational procedures
Basic Qualifications:
- Technical Skills
- 5+ years of hands-on experience with AWS services
- 2+ years of hands-on experience with Databricks
- Expert-level knowledge of AWS core services (EC2, VPC, IAM, S3, RDS, Lambda, ECS/EKS)
- Expert-level knowledge of Databricks capabilities
- Familiarity with SageMaker, Bedrock, or Anthropic/Claude API integration
- Strong proficiency with Terraform for infrastructure automation
- Demonstrated experience with containerization (Docker, Kubernetes)
- Solid understanding of networking concepts (subnets, routing, security groups, VPN)
- Experience with CI/CD tools (Jenkins, GitLab CI, GitHub Actions, AWS CodePipeline)
- Proficiency in scripting languages (Python, Bash, PowerShell)
- Extensive experience with AWS Lambda, API Gateway, ECS, Step Functions
- Knowledge of serverless frameworks (SAM, Serverless Framework)
- Experience with event-driven patterns using SNS, SQS, EventBridge
- Understanding of serverless best practices and optimization techniques
- Proven experience designing and implementing DR strategies in AWS
- Expertise in multi-region architectures and data replication
- Experience with AWS backup services and cross-region failover
- Knowledge of RTO/RPO planning and implementation
- Hands-on experience with Route53 health checks and failover routing policies
- Primary cloud platform: AWS (extensive experience required)
- Secondary cloud platforms: Azure and Google Cloud Platform (working knowledge)
- Understanding of multi-cloud architecture
- Experience with monitoring tools (CloudWatch, Datadog, Prometheus, Grafana)
- Log management systems (ELK stack, Splunk, CloudWatch Logs)
- APM tools and distributed tracing
Preferred Qualifications:
- AWS certifications (Solutions Architect, DevOps Engineer)
- Databricks Certifications
- Experience with open-source LLMs, embedding models, and RAG-based applications
- Experience with chaos engineering and resilience testing
- Knowledge of security frameworks and compliance (SOC2, HIPAA, PCI)
- Experience implementing complex build systems for monorepo microservices architectures
- Background in building developer platforms or internal tools
- Experience with Infrastructure as Code testing frameworks