In September 2023, IT4Innovations successfully completed the migration of its supercomputers to Slurm, a new workload manager and job scheduling system. Several significant factors prompted this radical change, which brings a number of benefits to our users and their research projects.

The idea of abandoning the PBS Scheduler and switching to Slurm had come up repeatedly in the past, usually when new supercomputers were being designed and procured. Initial concerns about the difficulty and complexity of the migration itself, for users and administrators alike, were eventually outweighed by the growing problems with the existing PBS Scheduler.

The first significant factor was the unsatisfactory behaviour and poor reliability of the PBS Scheduler, which stemmed from substantial changes in the PBS source code. The resulting error rates had a negative impact on the operation of our supercomputers.

The second factor was that we used a different job scheduler from our collaborating centres, which created barriers to effective cooperation and, in turn, to joint activities such as training and presentations of research results.

The last, but by no means least important, factor was the increasing number of users who already had hands-on experience with Slurm from other supercomputing centres. The deployment of the Slurm-equipped LUMI supercomputer, to which our users have access, has also contributed to this. Over time, Slurm has become the dominant scheduler in supercomputing.

As a result of these considerations, we decided to deploy Slurm on our Complementary systems and to prepare for its deployment on our core computing systems. This year, Slurm was implemented on the Barbora (July 2023) and Karolina (September 2023) supercomputers.

User benefits of Slurm running on IT4Innovations supercomputers

Deploying Slurm on IT4Innovations supercomputers brings significant benefits to users. Thanks to this migration, our users can expect a much faster response to their requests: the waiting time to run their compute jobs has been reduced from between 5 and 10 minutes to 10 seconds. The Slurm system on the Karolina supercomputer also facilitates sharing know-how between different supercomputing centres. It dramatically simplifies the work of users, who do not have to rewrite their jobs for different technologies.

More information:
https://docs.it4i.cz/general/job-submission-and-execution/
https://docs.it4i.cz/general/karolina-slurm/