Contributions for Resource and Job Management in High Performance Computing

Abstract : High Performance Computing (HPC) is characterized by the latest technological evolutions in computing architectures and by the increasing needs of applications for computing power. A particular middleware, called the Resource and Job Management System (RJMS), is responsible for delivering computing power to applications. The RJMS plays an important role in HPC because it occupies a strategic place in the software stack, standing between the hardware and application layers. However, the latest evolutions in both of these layers have introduced new levels of complexity into this middleware. Issues such as scalability, management of topological constraints, energy efficiency and fault tolerance, among others, have to be given particular consideration in order to improve system exploitation from both the system and the user point of view. This dissertation provides a state of the art of the fundamental concepts and research issues of Resource and Job Management Systems. It provides a multi-level comparison (concepts, functionalities, performance) of several Resource and Job Management Systems in High Performance Computing. An important metric for evaluating the work of an RJMS on a platform is the observed system utilization. However, studies and logs of production platforms show that HPC systems in general suffer from significant non-utilization rates. Our study addresses these periods of cluster non-utilization by proposing methods to aggregate otherwise unused resources for the benefit of the system or the application. More particularly, this thesis explores RJMS-level mechanisms for: 1) increasing jobs' valuable computation rates in the highly volatile environments of a lightweight grid context, 2) improving system utilization with malleability techniques and 3) providing energy-efficient system management through the exploitation of idle computing machines.
Experimentation and evaluation in this type of context are considerably complex because of the interdependency of multiple parameters that have to be controlled. In this thesis we have developed a methodology based on real-scale controlled experimentation with the submission of synthetic or real workload traces.
Cited literature [168 references]

https://hal-auf.archives-ouvertes.fr/tel-01499598
Contributor : Yiannis Georgiou
Submitted on : Friday, March 31, 2017 - 4:51:11 PM
Last modification on : Friday, October 25, 2019 - 1:31:23 AM
Long-term archiving on : Saturday, July 1, 2017 - 3:37:29 PM

Licence


Public Domain

Identifiers

  • HAL Id : tel-01499598, version 1

Citation

Yiannis Georgiou. Contributions for Resource and Job Management in High Performance Computing. Distributed, Parallel, and Cluster Computing [cs.DC]. Université de Grenoble, 2010. English. ⟨tel-01499598⟩
