Cluster Lifecycle Management: Lead the evaluation, planning, configuration, and physical/virtual deployment of multiple large-scale CPU + GPU clusters.
System Administration: Perform expert-level Linux system administration, including kernel tuning, security hardening, and OS lifecycle management (e.g., RHEL, Ubuntu, or Rocky Linux).
Workload Management: Act as the subject matter expert for SLURM, managing complex partitioning, quality of service (QoS) policies, and scheduling optimization for mixed workloads.
Infrastructure Design: Architect and build the physical and logical infrastructure for HPC, including high-speed fabric integration (InfiniBand/Ethernet) and power/cooling planning.
Software Stack & Modules: Maintain and curate the HPC application stack using software management tools like Lmod or Tcl Modules, ensuring researchers have access to optimized compilers, libraries (MPI, CUDA), and applications.
GPU Optimization: Specify and tune GPU environments (e.g., NVIDIA H100/B200), focusing on GPUDirect, NVLink topologies, and containerized runtimes like Apptainer/Singularity.
Troubleshooting & Performance: Conduct deep-dive root cause analysis for complex system failures and performance bottlenecks across compute, network, and software layers.
Cross-Functional Leadership: Own infrastructure projects end to end, coordinating with Networking (low-latency fabric) and Security (compliance, identity management) to ensure all builds meet enterprise standards.

Experience with GPU-aware MPI implementations and performance profiling tools (e.g., NVIDIA Nsight, TAU).
Knowledge of container orchestration in HPC (e.g., Kubernetes for AI/ML workloads alongside SLURM).
Certifications such as RHCE (Red Hat Certified Engineer) or relevant NVIDIA/InfiniBand technical training.

Education: BS/MS in Computer Science, Electrical Engineering, or a related field.
HPC Experience: 6+ years of hands-on experience managing production-grade HPC clusters.
Scheduler Expertise: Deep proficiency in SLURM administration, including writing custom prolog/epilog scripts and managing GRES (Generic Resources) for GPUs.
Linux Mastery: Advanced knowledge of Linux internals, shell scripting (Bash), and at least one high-level language (Python or Go).
Automation: Extensive experience with configuration management and provisioning tools (e.g., Ansible, Terraform, xCAT, or Warewulf).
Networking: Familiarity with HPC-specific networking such as InfiniBand (NDR/HDR) and RoCE v2.