Spartan Update April 2019
Having resolved the problem that caused the outage in March, we now have Spartan back up and running*. Our plan for 2019, as part of the Petascale Campus Initiative (PCI), was always to reassess the underlying cluster filesystem, CephFS, which supports every major function of Spartan and has been a factor in all the unplanned outages over the last two years.
Spartan was created to be highly flexible and scalable, and its modular design allows us to upgrade and replace components without having to change the whole system.
We have already conducted a thorough architectural review and are about to implement the first recommendation: splitting critical operational storage components from the general storage. The main storage that supports Slurm operations, the shared applications in /usr/local, and user home directories will be moved to an enterprise storage service. This will improve the stability of the service and should provide faster loading of software from /usr/local.
Further, we are reviewing alternative cluster filesystems such as Lustre and BeeGFS. In the interim, measures are being taken to improve the network paths between the existing storage and compute nodes, and the staging of data between MediaFlux (our Research Data Management platform) and Spartan.
By June 2019 we undertake to:
- employ external experts to ensure our existing CephFS configuration is optimised for performance and stability
- complete an investigation into alternative cluster filesystems such as Lustre, BeeGFS and Spectrum Scale (GPFS) as potential replacements for, or complements to, CephFS
- improve network paths between CephFS and Spartan to reduce the routing complexity of the data flow pathways
- work with users to optimise their storage use and reduce the burden on CephFS
- migrate critical storage loads to the enterprise storage service (this will be carried out in the regularly scheduled July maintenance window).
We are already fully engaged with you to determine other priorities for improvements to services through PCI funding over 2019 and will provide updates as work proceeds. We look forward to working together to build the research computing environment you need for your research.
If you would like any more information, or would like to provide feedback, please feel free to contact me directly, or anyone in the Spartan support team.
Dr Bernard Meade
on behalf of Research Platform Services
* Read the full outage report here: https://dashboard.hpc.unimelb.edu.au/papers/Spartan_storage_problems-Post_report-Mar2019.pdf