Job Description - Senior HPC Cluster Engineer

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior HPC Cluster Engineer based in Spain.

This role sits at the core of a next-generation AI cloud infrastructure environment, focused on building and optimizing large-scale high-performance computing systems. You will work on complex GPU and InfiniBand cluster architectures that power AI and HPC workloads at scale. The position involves deep system-level engineering, performance tuning, and hands-on troubleshooting across distributed infrastructure. You will contribute directly to improving reliability, efficiency, and scalability of compute platforms used for advanced AI and data-intensive applications. Working in a highly technical engineering culture, you will collaborate with experts across systems, networking, and virtualization. This is a high-impact role where your work directly influences the performance of large-scale cloud and AI workloads.

Accountabilities:

Own the performance optimization and reliability of large-scale GPU clusters and InfiniBand networking environments supporting HPC workloads:

Tune and optimize GPU cluster performance and InfiniBand fabric efficiency to ensure high throughput and low-latency computing.

Diagnose, troubleshoot, and resolve complex system-level issues across GPU, network, and compute layers.

Integrate and validate new hardware components into existing HPC infrastructure, including support for GPUs and related accelerators.

Work across virtualization and orchestration layers (KVM/QEMU, Kubernetes) to ensure seamless hardware utilization and deployment.

Develop and improve automation for monitoring, fault detection, and proactive remediation in distributed compute environments.

Configure, manage, and maintain GPU devices, PCIe systems, and InfiniBand networks to ensure stability and scalability.

Requirements:

We are looking for a highly experienced systems engineer with strong expertise in HPC and low-level infrastructure:

5+ years of experience in system-level software engineering with a focus on performance, scalability, or infrastructure optimization.

3+ years of hands-on experience with Linux systems administration, debugging, and performance tuning.

Strong understanding of server and hardware architecture including PCIe, NICs, GPUs, and Linux kernel-level behavior.

Proficiency in C, C++, Go, or Python for systems or performance-oriented development.

Experience working with distributed or HPC environments and solving complex infrastructure challenges.

Strong analytical and problem-solving skills with the ability to work on deep technical issues independently.

Familiarity with GPU clusters, InfiniBand networking, and large-scale compute systems is highly desirable.

Experience with KVM/QEMU or containerized orchestration environments is a plus.

Exposure to distributed computing frameworks or libraries such as MPI or NCCL is advantageous.

Benefits:

Competitive compensation package.

Career development and continuous learning opportunities in advanced AI and HPC systems.

Flexible working arrangements and remote-friendly culture across Europe.

Opportunity to work on cutting-edge AI infrastructure and large-scale distributed systems.

Collaborative engineering environment with high technical ownership.

Exposure to international teams and world-class engineering challenges.

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials based on available information. These tools assist our recruitment team but do not replace human judgment; final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.

Original job Senior HPC Cluster Engineer posted on GrabJobs ©. To flag any issues with this job please use the Report Job button on GrabJobs.
