Hong Kong Baptist University (HKBU) Research Cluster on Data Analytics and Artificial Intelligence in X

Energy-efficient Training of Multiple Deep Learning Models on GPU Clusters
Principal Investigator: Prof. Xiaowen CHU (Department of Computer Science)

In the past decade, we have witnessed a proliferation of GPUs in the deep learning community for training complex deep neural network models (or deep models for brevity). Compared with contemporary CPUs, GPUs offer significantly higher computation speed and memory bandwidth, and hence reduce training time. Many organizations and companies have built GPU clusters to speed up their deep learning jobs. Although several new hardware accelerators (such as TPUs and FPGAs) are being researched and developed, we believe that GPUs will remain an important option for training deep models due to their broad user base and complete ecosystem. This project will focus on GPU clusters.

Motivation I: Performance

For many real-world AI applications such as image classification and speech recognition, deep models need to be trained on many GPUs to reduce training time. Some recent work on distributed training algorithms has achieved promising speedup on up to hundreds of GPUs for a single training job. However, a GPU cluster is often shared by many users. When multiple training jobs run simultaneously, they compete for resources such as GPUs, disk I/O, and network bandwidth. Our preliminary study showed that this resource contention can significantly degrade training performance. In the proposed project, we will investigate how to allocate resources and design schedules for many training jobs so that overall performance improves by avoiding or reducing potential resource contention.

Motivation II: Operational Cost

Although GPUs are much more powerful and energy-efficient than CPUs (in terms of FLOPS per watt), they still consume significant power. For example, a single Nvidia DGX-1 GPU server consumes up to 3,200 watts of electricity, 75% of which is used by its 8 GPU cards. As the electricity cost of a GPU cluster dominates its overall operational cost, it is crucial to reduce the cluster's power consumption without affecting the overall performance.


A number of energy-efficient solutions have been proposed in the literature for traditional CPU-based clusters, among which dynamic voltage and frequency scaling (DVFS) and resource allocation and task scheduling (RATS) are two of the most important tools. DVFS allows a processor or memory module to use more (or less) power in order to raise (or lower) its working frequency. It has been widely used in mobile processors. Applying these tools to GPU clusters, however, raises two challenges. First, modern GPUs contain hundreds to thousands of cores and a complex memory hierarchy, and both the GPU cores and the GPU memory support DVFS, which makes it challenging to model the impact of GPU DVFS on the performance and power consumption of deep learning jobs. Second, general RATS problems for heterogeneous GPUs are usually NP-hard. For large-scale clusters with many training jobs, an efficient heuristic solution with good performance guarantees is important.
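To make the DVFS trade-off concrete, the following is a minimal sketch under stated assumptions, not the project's actual model. It assumes a hypothetical single GPU whose power is a static term plus a dynamic term proportional to V^2 * f, a core voltage that rises linearly with frequency, and a job whose compute part scales with the core clock while its memory part does not. The class ToyGpuModel, the function best_frequency, and every constant are invented for illustration, and memory-frequency scaling is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyGpuModel:
    """Hypothetical GPU model; every constant is an illustrative assumption."""
    static_power_w: float = 60.0   # power drawn regardless of core frequency
    dyn_coeff: float = 100.0       # scales the V^2 * f dynamic-power term
    f_min_ghz: float = 0.6         # lowest supported core frequency
    f_max_ghz: float = 1.5         # highest supported core frequency
    v_min: float = 0.8             # core voltage at f_min_ghz
    v_max: float = 1.1             # core voltage at f_max_ghz

    def voltage(self, f_ghz: float) -> float:
        # Assumption: voltage rises linearly with frequency.
        frac = (f_ghz - self.f_min_ghz) / (self.f_max_ghz - self.f_min_ghz)
        return self.v_min + frac * (self.v_max - self.v_min)

    def power_w(self, f_ghz: float) -> float:
        # Static power plus a dynamic term proportional to V^2 * f.
        return self.static_power_w + self.dyn_coeff * self.voltage(f_ghz) ** 2 * f_ghz

    def time_s(self, f_ghz: float, compute_gcycles: float, memory_s: float) -> float:
        # Only the compute-bound part speeds up with the core clock.
        return compute_gcycles / f_ghz + memory_s

    def energy_j(self, f_ghz: float, compute_gcycles: float, memory_s: float) -> float:
        # Energy is average power multiplied by execution time.
        return self.power_w(f_ghz) * self.time_s(f_ghz, compute_gcycles, memory_s)


def best_frequency(model: ToyGpuModel, compute_gcycles: float,
                   memory_s: float, steps: int = 18) -> float:
    """Sweep the core frequency range and return the lowest-energy setting."""
    span = model.f_max_ghz - model.f_min_ghz
    candidates = [model.f_min_ghz + span * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda f: model.energy_j(f, compute_gcycles, memory_s))


if __name__ == "__main__":
    gpu = ToyGpuModel()
    f = best_frequency(gpu, compute_gcycles=100.0, memory_s=20.0)
    print(f"lowest-energy core frequency in this toy model: {f:.2f} GHz")
```

With these made-up constants, the lowest-energy setting falls between the two frequency extremes: running slower saves power but stretches the time over which static power is paid. Real GPUs require measured models, which is the difficulty the paragraph above describes.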

Our Tasks:

This project aims to design energy-efficient RATS solutions for a GPU cluster that runs a set of deep learning training jobs. As performance and power prediction are fundamental components in solving the RATS problem, we propose to carry out the following four research tasks. First, we will build an open data set of the performance and power usage of a broad set of GPU kernels under different DVFS configurations. Second, we will develop quantitative performance and power models for training deep models on multiple GPUs that account for the effect of DVFS. Third, we will tackle the online RATS problem, in which training jobs arrive over time and each job is modeled by a directed acyclic graph (DAG) that contains a set of computing and communication tasks. Besides designing traditional solutions based on list scheduling, we will also design theoretically sound online algorithms with provable competitive ratios through integer linear programming (ILP) and primal-dual techniques, such that the worst-case performance is guaranteed to be within a certain bound of the optimal offline strategy, which is based on complete future job information. Finally, we will implement a prototype of our job management and scheduling system, with which we will evaluate our proposed solutions using real-world experiments.
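As an illustration of the traditional list-scheduling baseline mentioned above, here is a minimal sketch under simplifying assumptions, not the project's scheduler. It assumes identical GPUs, strictly positive task durations, a single job whose DAG is known in advance, and no communication delay or energy term. The function list_schedule and the example task names are hypothetical. Tasks are prioritized by the length of the longest path from the task to the end of the DAG, and each task is placed on the GPU where it can start earliest.

```python
from functools import lru_cache


def list_schedule(durations, deps, num_gpus):
    """Greedy list scheduling of one DAG job on identical GPUs.

    durations: dict mapping task name -> positive running time
    deps:      dict mapping task name -> list of prerequisite task names
    Returns a dict mapping task name -> (gpu index, start time, finish time).
    """
    children = {task: [] for task in durations}
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    @lru_cache(maxsize=None)
    def rank(task):
        # Longest path from this task to the end of the DAG (critical-path priority).
        return durations[task] + max((rank(c) for c in children[task]), default=0.0)

    # With positive durations, a parent always outranks its children,
    # so sorting by descending rank is also a valid topological order.
    order = sorted(durations, key=rank, reverse=True)

    gpu_free = [0.0] * num_gpus   # time at which each GPU becomes idle
    schedule = {}
    for task in order:
        # A task is ready once all of its prerequisites have finished.
        ready = max((schedule[p][2] for p in deps.get(task, [])), default=0.0)
        # Pick the GPU on which the task can start earliest.
        gpu = min(range(num_gpus), key=lambda g: max(gpu_free[g], ready))
        start = max(gpu_free[gpu], ready)
        finish = start + durations[task]
        gpu_free[gpu] = finish
        schedule[task] = (gpu, start, finish)
    return schedule


if __name__ == "__main__":
    # Hypothetical four-task job: load data, two compute tasks, one communication task.
    durations = {"load": 2, "fwd_a": 3, "fwd_b": 4, "allreduce": 1}
    deps = {"fwd_a": ["load"], "fwd_b": ["load"], "allreduce": ["fwd_a", "fwd_b"]}
    for task, (gpu, start, finish) in list_schedule(durations, deps, num_gpus=2).items():
        print(f"{task}: GPU {gpu}, start {start}, finish {finish}")
```

An online, energy-aware version would additionally have to handle jobs arriving over time, heterogeneous GPUs, and a DVFS setting per task, which is where the competitive-analysis techniques named above come in.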

Related Publications:

  • S. Shi, Q. Wang, and X.-W. Chu, “Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs,” IEEE DataCom 2018, Athens, Greece, August 2018. (Best Paper Award)
  • Q. Wang and X.-W. Chu, “GPGPU Power Estimation with Core and Memory Frequency Scaling,” ACM Performance Evaluation Review, Vol. 45, No. 2, pages 73-78, October 2017.
  • X. Mei, Q. Wang, and X.-W. Chu, “A Survey and Measurement Study of GPU DVFS on Energy Conservation,” Digital Communications and Networks, Vol. 3, No. 2, pages 89-100, May 2017.
  • V. Chau, X.-W. Chu, H. Liu, and Y.-W. Leung, “Energy Efficient Job Scheduling with DVFS for CPU-GPU Heterogeneous Systems,” ACM e-Energy 2017, Hong Kong, May 2017.
  • X. Mei, X.-W. Chu, Y.-W. Leung, H. Liu, and Z. Li, “Energy Efficient Real-time Task Scheduling on CPU-GPU Hybrid Clusters,” IEEE INFOCOM 2017, Atlanta, GA, USA, 1-4 May 2017.
  • X. Mei and X.-W. Chu, “Dissecting GPU Memory Hierarchy through Microbenchmarking,” IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 1, pages 72-86, January 2017.

Grant Support:

This project is supported by the Research Grants Council (RGC), Hong Kong SAR, China (Project HKBU 12200418).

For further information on this research topic, please contact Prof. Xiaowen CHU.