A Survey of Methods for Handling Disk Data Imbalance


Shuangshuang Yuan1, Peng Wu1, Yuehui Chen1 and Qiang Li2, 1University of Jinan, China, 2State Key Laboratory of High-end Server & Storage Technology, China


Class imbalance arises in many classification problems. Because most classifiers are designed to maximize overall accuracy, imbalance between classes makes classification difficult, and the minority classes typically carry the higher misclassification costs. The Backblaze dataset, a widely used hard disk dataset, contains a small amount of failure data and a large amount of healthy data, and therefore exhibits severe class imbalance. This paper provides a comprehensive overview of research in the field of imbalanced data classification. The discussion is organized into three main aspects: data-level methods, algorithm-level methods, and hybrid methods. For each type of method, we summarize and analyze the existing problems, algorithmic ideas, strengths, and weaknesses. We also discuss the open challenges of imbalanced data classification, along with strategies to address them. The survey is intended to help researchers choose an appropriate method according to their needs.


Classification, Disk Failure Prediction, Imbalanced Dataset, Data Processing