Running AlexNet on 64 GPUs in parallel, Fujitsu has achieved 27 times the learning speed of a single GPU, which it claims is the world's fastest processing speed.
Fujitsu Laboratories Ltd has developed software technology that uses multiple GPUs for high-speed deep learning, applying parallelization techniques from supercomputer software.
Conventionally, deep learning is accelerated by networking multiple GPU-equipped computers and running them in parallel. The problem with this arrangement is that once more than 10 computers are used simultaneously, the time required to share data between them increases, and the benefits of parallelization become progressively harder to achieve.
The lab applied its technology to Caffe, a widely used open source deep learning framework. To confirm its effectiveness, the lab evaluated the technology on AlexNet, a multi-layered neural network for image recognition. With 16 and 64 GPUs, the technology achieved learning speeds 14.7 and 27 times faster, respectively, than with a single GPU.
Fujitsu claims these are the world's fastest processing speeds, representing an improvement in learning speeds of 46% for 16 GPUs and 71% for 64 GPUs.
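To put the reported figures in perspective, the short Python sketch below converts them into parallel efficiency (speedup divided by GPU count). It also estimates the speedup before the improvement, under the assumption (not stated in the article) that the 46% and 71% gains are measured relative to the earlier multi-GPU speed on the same number of GPUs. The function names are illustrative only.

```python
# Speedups over a single GPU, as reported in the article.
reported_speedup = {16: 14.7, 64: 27.0}
# Reported improvement in learning speed (46% and 71%).
reported_gain = {16: 0.46, 64: 0.71}


def parallel_efficiency(speedup, n_gpus):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / n_gpus


def implied_baseline(speedup, gain):
    """Earlier speedup, ASSUMING the gain is relative to that earlier figure."""
    return speedup / (1.0 + gain)


for n in sorted(reported_speedup):
    s = reported_speedup[n]
    print(
        f"{n} GPUs: efficiency {parallel_efficiency(s, n):.1%}, "
        f"implied earlier speedup {implied_baseline(s, reported_gain[n]):.1f}x"
    )
```

On these numbers, 16 GPUs run at roughly 92% of ideal linear scaling and 64 GPUs at roughly 42%, which illustrates how the cost of sharing data grows with the number of GPUs.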
The company developed two technologies to speed up learning. The first is a supercomputer software technology that executes communications and operations simultaneously and in parallel. The second changes the processing method according to the size of the shared data and the sequence of deep learning processing.
These two technologies limit the increase in waiting time between processing batches even with shared data of a variety of sizes.
- Scheduling technology for data sharing: This automatically controls the priority order for data transmission so that data needed at the start of the next learning process is shared among the computers in advance for multiple continuous operations (Figure 1).
Figure 1: With existing technology (left), because the data sharing processing of the first layer, necessary to begin the next learning process, is carried out last, the data sharing processing delay is longer. By carrying out the data sharing processing for the first layer during the data sharing processing for the second layer (right), the wait time until the start of the next learning process can be shortened.
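The effect described in Figure 1 can be illustrated with a toy simulation. This is a minimal sketch and not Fujitsu's implementation: it assumes a single shared network link, integer time steps, and made-up timings. The function `simulate` and its parameters are hypothetical. Gradients become available from the last layer to the first, while the next learning pass consumes layers from the first to the last.

```python
def simulate(ready, comm, fwd, prioritize):
    """Return the time at which the next forward pass finishes.

    ready[i]: time when layer i's data is available to share (i = 0 is the first layer)
    comm[i]:  link time units needed to share layer i's data
    fwd[i]:   compute time units for layer i in the next forward pass
    prioritize: True  -> the link always serves the lowest-numbered ready layer,
                         interrupting a later layer's transfer if necessary
                False -> layers are sent in the order they became ready
    """
    n = len(ready)
    remaining = list(comm)
    done_at = list(ready)  # overwritten when a transfer completes
    t = 0
    while any(r > 0 for r in remaining):
        candidates = [i for i in range(n) if ready[i] <= t and remaining[i] > 0]
        if candidates:
            if prioritize:
                i = min(candidates)
            else:
                i = min(candidates, key=lambda k: (ready[k], -k))
            remaining[i] -= 1  # send one time unit of layer i's data
            if remaining[i] == 0:
                done_at[i] = t + 1
        t += 1

    # The next forward pass handles layers in order; each layer has to wait
    # until its own shared data has arrived.
    clock = 0
    for i in range(n):
        clock = max(clock, done_at[i]) + fwd[i]
    return clock


# Hypothetical timings: the first layer's data is produced last.
ready = [6, 4, 2]
comm = [2, 4, 4]
fwd = [2, 2, 2]

print(simulate(ready, comm, fwd, prioritize=False))  # 18
print(simulate(ready, comm, fwd, prioritize=True))   # 14
```

With these assumed timings, sending the first layer's data as soon as it is ready lets the next forward pass start at time 8 instead of time 12, while the remaining layers' data arrives just before it is needed.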
- Processing technology to optimise operations for data size: For processing in which operation results are shared with all computers, when the original data volume is small, each computer shares data and then carries out the same operation, eliminating transmission time for the results.
When the data volume is large, processing is distributed and processing results are shared with the other computers for use in the following operations. By automatically assigning the optimal operational method based on the amount of data, this technology minimises the total operation time (Figure 2).
Figure 2: Automatically assigning the optimal operational method based on the amount of data minimises the total operation time.
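One way to picture the size-based choice is with a simple latency and bandwidth cost model. The sketch below is an illustration under stated assumptions, not the method Fujitsu describes: it assumes ring-style collectives, and the parameters `alpha` (per-message latency in seconds), `beta` (transfer time per byte), and `gamma` (compute time per byte) are hypothetical values chosen for the example.

```python
def cost_share_then_compute(n_bytes, p, alpha, beta, gamma):
    """Every computer gathers all inputs, then performs the same operation locally."""
    gather = (p - 1) * alpha + (p - 1) * n_bytes * beta
    compute = (p - 1) * n_bytes * gamma
    return gather + compute


def cost_distribute_then_share(n_bytes, p, alpha, beta, gamma):
    """Each computer processes 1/p of the data, then results are shared with all."""
    share = (p - 1) / p * n_bytes
    distribute_and_compute = (p - 1) * alpha + share * (beta + gamma)
    share_results = (p - 1) * alpha + share * beta
    return distribute_and_compute + share_results


def choose_method(n_bytes, p, alpha, beta, gamma):
    """Pick whichever method the model estimates to be faster for this data size."""
    a = cost_share_then_compute(n_bytes, p, alpha, beta, gamma)
    b = cost_distribute_then_share(n_bytes, p, alpha, beta, gamma)
    return "share_then_compute" if a <= b else "distribute_then_share"


# Assumed parameters: 16 computers, 5 microseconds latency,
# 1 ns per byte transferred, 0.1 ns per byte computed.
params = dict(p=16, alpha=5e-6, beta=1e-9, gamma=1e-10)

print(choose_method(1024, **params))         # share_then_compute
print(choose_method(100_000_000, **params))  # distribute_then_share
```

In this model, small messages are dominated by per-message latency, so the method with fewer communication steps wins. Large messages are dominated by bandwidth, so splitting the work and sharing only the results is cheaper.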
This technology can shorten the time required for deep learning R&D, for example in developing unique neural network models for the autonomous control of robots and automobiles, or for healthcare and finance applications such as pathology classification and stock price forecasting.
Fujitsu Laboratories plans to commercialise the technology as part of its AI technology, Human Centric AI Zinrai, during fiscal 2016.