Job description

Role overview

KAUST’s Supercomputing Laboratory is looking for an experienced Senior HPC Systems Administrator to help run and improve a large-scale high-performance computing environment. The role centers on the daily operation of a cluster with around 600 CPU and GPU nodes, together with storage platforms, InfiniBand and Ethernet networks, and other core HPC services. You will also provide practical support to researchers and end users working in computational science, engineering, big data, and AI/ML.

Key responsibilities

Respond to user requests quickly and professionally by phone, in person, by email, and through the ticketing system while maintaining strong service standards.
Set up, administer, and optimize HPC infrastructure such as compute nodes, high-speed storage, InfiniBand, Ethernet, and configuration management tooling like Ansible or Puppet.
Operate cluster management platforms, monitoring solutions, and related support services used to keep HPC systems running smoothly.
Install and maintain the Slurm scheduler, and manage its QOS rules, accounts, accounting, and automation code written in Python and C++.
Create and improve automation scripts in Bash and Python to reduce manual administration work.
Deploy and support container platforms for HPC workloads, including Singularity/Apptainer and Docker.
Regularly benchmark CPUs, memory, networking, and storage to verify performance and identify tuning opportunities across hardware, drivers, and applications.
Apply security controls such as node hardening, kernel patching, and general compliance measures across the environment.
Administer parallel file systems such as Lustre, GPFS, Weka, or Vast, including tuning performance and planning capacity.
Support research teams and external partners by working closely with faculty, researchers, collaboration groups, and industry contacts alongside application support specialists.
Build software utilities and tools when needed to support research workloads on cluster systems and related subsystems.
Lead proof-of-concept efforts and technology assessments from start to finish, while reviewing industry practices and recommending platform improvements.
Work with vendors and third-party providers to log issues and follow them through to resolution.
Maintain internal documentation, operating procedures, and training content in the wiki.
Keep current with HPC developments through ongoing learning, conferences, and professional networking, and use benchmarking findings to guide future hardware choices.

Required qualifications and competencies

Strong experience supporting computational science, engineering, data analysis, and AI applications in HPC environments.
Deep Linux administration expertise, especially with RHEL, Rocky Linux, or CentOS in large-scale systems.
Working knowledge of HPC programming languages and models such as Fortran, C/C++, Python, MPI, OpenMP, CUDA, and OpenACC.
Proven background managing complex HPC infrastructure including parallel storage, schedulers, InfiniBand/Ethernet networks, and monitoring tools.
Hands-on experience with configuration management systems such as Ansible or Puppet.
Understanding of scientific computing, analytics, and AI/ML software commonly used in HPC settings.
Knowledge of project management methods and practices.
Ability to support research activities in a collaborative HPC environment.
Strong analytical thinking, troubleshooting ability, and sound decision-making.
Initiative to identify improvements and carry work through to completion.
Ability to manage several projects at the same time and deliver quality results on schedule.
Comfort working with researchers, application teams, and vendors across functions.
Effective communication skills in English, both spoken and written, including technical reports and presentations.

Education and experience

A bachelor’s or master’s degree in computer science, computer engineering, information systems, or an equivalent field is required. The role also requires at least five years of experience supporting large-scale computing platforms and related subsystems, along with experience in hardware troubleshooting, root-cause documentation, parallel storage administration, HPC benchmarking, Linux system administration, and workload managers such as Slurm, LSF, or PBS. Experience with Kubernetes or container orchestration is preferred.

Additional information

This position involves supporting a highly collaborative, international environment and contributing to ongoing HPC improvements, benchmarking initiatives, and procurement planning. The role is based onsite in Makkah, Saudi Arabia.

HPC Senior Systems Administrator

Where you will work