Performance modeling of a distributed file-system

Published in arXiv, 2013
Pre-print

Abstract: Data centers have become the center of big data processing. Most programs running in a data center process big data. The storage requirements of such programs cannot be fulfilled by a single node in the data center, and hence a distributed file system is used, in which the storage resources of more than one node are pooled together and presented to the outside world as a unified view. Optimum performance of these distributed file systems for a given workload is of paramount importance because disk is the slowest component in the framework. Owing to this fact, many big data processing frameworks implement their own file system and fine-tune it for their specific workloads to get optimal performance. However, fine-tuning a file system for a particular workload results in poor performance for workloads that do not match the profile of the target workload. Hence, these file systems cannot be used for general-purpose usage, where the workload characteristics show high variation. In this paper we model the performance of a general-purpose file system and analyse the impact of tuning the file system on its performance. The performance of these parallel file systems is not easy to model because it depends on many configuration parameters, such as the network, the disk, the underlying file system, the number of servers, the number of clients, and the parallel file-system configuration. We present a multiple linear regression model that captures the relationship between the performance metrics and the configuration parameters of the file system, the hardware configuration, and the workload configuration (collectively called features). We use this model to rank the features according to their importance in determining the performance of the file system.

In this project we aim to model the performance of a parallel file system, specifically GlusterFS, in terms of its hardware and software configuration. The performance of the file system is affected by the values of different parameters (configuration values for the hardware and software settings) and can be predicted if the values of these parameters are known. However, a latent factor here is the workload. The behavior of an application can change the performance of the file system even if it is configured in the best setting possible. Typically, a file system is configured to serve files of reasonable size, in the MB to GB range. However, these settings do not work well for small files, and the performance of the file system degrades severely.

[Figure. Source: Ceph: A Scalable, High-Performance Distributed File System]

We aim to model the performance of the file system given a workload.
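
As a rough illustration of the approach named in the abstract (a multiple linear regression over configuration features, used to rank them), the following is a minimal sketch and not the paper's actual pipeline. The feature names, the synthetic data, and the ranking criterion (magnitude of standardized coefficients) are all assumptions made for the example; no measurements from GlusterFS are used.

```python
import numpy as np


def rank_features(X, y, names):
    """Fit y ~ X by ordinary least squares on standardized data and rank
    the features by the magnitude of their standardized coefficients."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize so that coefficients are comparable across units.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    # With centered data, no intercept column is needed.
    coef, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    order = np.argsort(-np.abs(coef))
    return [(names[i], float(coef[i])) for i in order]


# Synthetic data only: hypothetical feature names and a made-up ground truth.
rng = np.random.default_rng(0)
names = ["num_servers", "num_clients", "block_size_kb", "file_size_mb"]
X = rng.uniform(1.0, 10.0, size=(200, 4))
# Assumed relationship: "throughput" driven mostly by file size, then servers.
y = 5.0 * X[:, 3] + 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.5, size=200)

ranking = rank_features(X, y, names)
for name, c in ranking:
    print(f"{name:>14s}  {c:+.3f}")
```

Standardizing before fitting is one common way to make coefficients comparable across features with different units; other importance measures (for example, the drop in fit quality when a feature is removed) would be equally reasonable choices here.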