FRESCO is an open data repository of site-level and machine-level failure data for large-scale computing clusters. It is an effort of the Dependable Computing Systems Laboratory (DCSL) at Purdue University, supported by NSF grants CNS-1405906 and CNS-1513197.
This initial data set was collected from one of Purdue University's production HPC clusters. Over a period of six months starting in October 2014, we collected several types of data from the cluster for a user-centric workload study: information about the jobs submitted by users, and information from all of the cluster nodes used for scheduling those jobs. Using this six-month data set, we analyzed over 489,000 jobs submitted by over 300 distinct users. In conjunction, we also analyzed roughly 3,800 user tickets (not shared here due to privacy concerns). The data set consists of the following components:
- Accounting statistics extracted from the job scheduler
- Node-level statistics extracted from the monitoring tool
This data set is available as a collection of tab-separated value (TSV) files hosted at https://www.rcac.purdue.edu/fresco/index_v1.0.html.
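As a minimal sketch of working with the TSV files, the snippet below parses tab-separated records with Python's standard `csv` module. The column names (`job_id`, `user`, `nodes_used`, etc.) are illustrative assumptions, not the repository's actual schema; consult the hosted documentation for the real field names.

```python
import csv
import io

# Hypothetical sample in the TSV layout; the actual accounting files'
# columns may differ -- check the repository documentation.
sample = (
    "job_id\tuser\tsubmit_time\tnodes_used\texit_status\n"
    "1001\tuserA\t2014-10-03T12:00:00\t4\t0\n"
    "1002\tuserB\t2014-10-03T12:05:00\t16\t1\n"
)

def load_tsv(text):
    """Parse tab-separated records into a list of dicts keyed by header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

records = load_tsv(sample)
print(len(records))        # number of data rows parsed
print(records[0]["user"])  # submitting user of the first job
```

For the real files, replace the in-memory string with `open(path, newline="")` and pass the file object to `csv.DictReader` directly.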
This dataset can be referenced as:
Saurabh Bagchi, Suhas Raveesh Javagal, Subrata Mitra, Stephen Harrell, Charles Schwarz, "FRESCO - The Open Data Repository for Workloads and Failures in Large-scale Computing Clusters: Conte dataset October 2014-March 2017," 2015, doi:10.13019/M2VC7C.