Site Reliability Engineering (SRE)

Company: Koantek
Job Type: Full Time

Job Description - Site Reliability Engineering (SRE)

About the Role:

We are seeking a highly skilled and experienced SRE Databricks Platform Administrator to join our Data Operations team. In this critical role, you will be responsible for the availability, performance, reliability, and scalability of our enterprise Databricks platform. You will blend deep expertise in Databricks administration with SRE principles to automate operations, proactively identify and resolve issues, and ensure a seamless experience for our data engineering, data science, and analytics teams. You will champion best practices for platform governance, security, and cost optimization, playing a pivotal role in our data ecosystem.


Key Responsibilities:

Platform Operations & Reliability:

Design, implement, and maintain the Databricks platform infrastructure across multiple cloud environments (AWS, Azure, or GCP).

Ensure high availability, disaster recovery, and business continuity of Databricks workspaces, clusters, and associated services.

Develop and implement robust monitoring, alerting, and logging solutions for the Databricks platform using tools like Prometheus, Grafana, the ELK stack, or cloud-native monitoring services (CloudWatch, Azure Monitor, GCP Operations Suite).

Proactively identify and address performance bottlenecks, resource constraints, and potential issues within the Databricks environment.

Participate in on-call rotations to respond to and resolve critical incidents swiftly, performing root cause analysis (RCA) and implementing preventative measures.

Manage and optimize Databricks clusters, including auto-scaling, instance types, and cluster policies, for both interactive and job compute workloads to ensure cost-effectiveness and performance.

Automation & Tooling:

Develop and maintain Infrastructure as Code (IaC) using tools like Bicep, Terraform, or CloudFormation to automate the provisioning, configuration, and management of Databricks resources.

Automate repetitive operational tasks, deployments, and environment provisioning using scripting languages (Python, Bash) and CI/CD pipelines (Jenkins, Azure DevOps, GitLab CI).

Build and maintain custom tools and scripts to enhance Databricks platform capabilities, improve observability, and streamline workflows.

Security & Governance:

Implement and enforce Databricks security best practices, including identity and access management (IAM) with Unity Catalog, SSO integration (Azure AD, Okta), service principals, and granular access controls (RBAC, row-level/column-level security).

Ensure compliance with organizational security policies, data governance standards, and regulatory requirements (e.g., GDPR, HIPAA, industry-specific compliance).

Conduct security audits and vulnerability assessments of the Databricks environment.

Manage secrets using Databricks secrets or a cloud provider secret manager.

Performance Optimization & Cost Management:

Analyze Databricks usage patterns, DBU consumption, and cloud resource costs to identify opportunities for optimization and efficiency gains.

Implement strategies for cost control, including spot instance utilization, intelligent cluster resizing, and effective use of instance pools.

Work with data teams to optimize Spark jobs, notebooks, and SQL queries for performance and cost.

Collaboration & Mentorship:

Collaborate closely with data engineers, data scientists, architects, and other SREs to understand their requirements and provide expert guidance on Databricks best practices.

Provide technical leadership and mentorship to junior administrators and engineers, fostering a culture of reliability and operational excellence.

Stay up-to-date with the latest Databricks features, cloud services, and SRE methodologies, evaluating and recommending new technologies.



Original job Site Reliability Engineering (SRE) posted on GrabJobs.

Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.