Machine Learning at Speed: Optimization Code Increases Performance by 5x

Training large-scale machine learning models can be made more efficient by optimizing network communication.
A KAUST-led collaboration has increased the speed of machine learning on parallelized computing systems by inserting lightweight optimization code into high-speed network devices.
The technology, called “in-network aggregation,” was developed together with researchers at Intel, Microsoft, and the University of Washington. It can deliver dramatic speed improvements using readily available programmable networking hardware.
Machine learning is an essential component of artificial intelligence (AI), allowing it to interact with and “understand” the world. It relies on significant amounts of labeled training data, which is what gives a model its fundamental advantage: the more data a model is trained on, the better it performs when exposed to new inputs.

The recent surge in AI applications has been fueled by larger models and more diverse datasets. However, performing the underlying machine-learning computations is highly demanding and increasingly requires large numbers of computers running the algorithm in parallel.
Marco Canini, a member of the KAUST research team, says that “how to train deep-learning models at a large scale” is a complex problem. AI models can have billions of parameters and run on hundreds of processors, and in such systems communication among the processors quickly becomes a performance bottleneck as they exchange incremental model updates.
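To see why, consider a minimal sketch (hypothetical numbers, using NumPy) of one step of data-parallel training: each worker computes a gradient over its own data shard, and all of those gradients must be summed before anyone can take the next step, so traffic grows with both model size and worker count.

```python
import numpy as np

NUM_WORKERS = 8          # hypothetical cluster size
MODEL_SIZE = 1_000_000   # parameters; real models reach billions

# Each worker computes a gradient over its local shard of the data.
gradients = [np.random.randn(MODEL_SIZE).astype(np.float32)
             for _ in range(NUM_WORKERS)]

# Before the next step, every worker needs the sum of all gradients.
aggregate = np.sum(gradients, axis=0)

# A naive all-to-all exchange moves every worker's gradient to every
# other worker: traffic quadratic in worker count, linear in model size.
bytes_moved = NUM_WORKERS * (NUM_WORKERS - 1) * MODEL_SIZE * 4  # float32
print(f"naive exchange moves ~{bytes_moved / 1e9:.1f} GB per step")
```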
The team found a solution in new network technology developed by Barefoot Networks, a division of Intel.
“We use Barefoot Networks’ new programmable network hardware to offload part of the work during distributed machine-learning training,” explains Amedeo Sapio, a KAUST alumnus who has since joined the Barefoot Networks team at Intel. “Using this new programmable networking hardware, rather than having the network just move data, means that we can perform computations along the network paths.”
The key innovation of the team’s SwitchML platform is that it allows the network hardware to perform data-aggregation tasks at each step of the model-update phase of the machine-learning process. This offloads part of the computational load and dramatically reduces the amount of data transmitted.
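As a rough illustration of the idea, and not the actual SwitchML implementation, the toy Python model below plays the role of a programmable switch that sums gradient chunks as they stream through it, so each worker transmits its gradient once and receives back only the aggregate.

```python
import numpy as np

class AggregatingSwitch:
    """Toy model of a programmable switch that sums packets in flight."""

    def __init__(self, num_workers: int, chunk_size: int):
        self.num_workers = num_workers
        self.buffer = np.zeros(chunk_size, dtype=np.float32)
        self.arrived = 0

    def receive(self, chunk: np.ndarray):
        # Accumulate each worker's chunk into the on-switch buffer.
        self.buffer += chunk
        self.arrived += 1
        if self.arrived < self.num_workers:
            return None
        # Once every worker has contributed, broadcast the aggregate back;
        # each worker thus sends and receives the data only once.
        result, self.buffer = self.buffer, np.zeros_like(self.buffer)
        self.arrived = 0
        return result

# Usage: four workers each send one chunk; the switch returns the sum.
switch = AggregatingSwitch(num_workers=4, chunk_size=256)
for chunk in [np.ones(256, dtype=np.float32) for _ in range(4)]:
    aggregate = switch.receive(chunk)   # None until the last chunk arrives
assert aggregate is not None and aggregate[0] == 4.0
```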
Canini says that although the programmable switch data plane can perform operations very quickly, its capabilities are limited. “So our solution had to be simple enough for the hardware, yet flexible enough to address challenges such as limited onboard memory. SwitchML solves this problem by co-designing the communication system and the distributed training algorithm, achieving an acceleration of up to 5.5 times compared with the current state-of-the-art.”
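One consequence of that co-design is that a full gradient never sits on the switch at once: the published SwitchML design streams it through in small chunks and, because switch data planes operate on integers rather than floating point, aggregates fixed-point values. The sketch below is a simplified, hypothetical rendering of that quantize, aggregate, dequantize round trip; the scale factor and chunk size are illustrative.

```python
import numpy as np

SCALE = 2 ** 16          # hypothetical fixed-point scaling factor
CHUNK = 256              # elements that fit in the switch's small buffer

def quantize(grad: np.ndarray) -> np.ndarray:
    # Switch data planes work on integers, so floats are scaled up first.
    return np.round(grad * SCALE).astype(np.int64)

def dequantize(agg: np.ndarray) -> np.ndarray:
    return agg.astype(np.float32) / SCALE

def switch_aggregate(worker_grads: list[np.ndarray]) -> np.ndarray:
    """Stream gradients through the 'switch' one small chunk at a time."""
    size = worker_grads[0].size
    out = np.empty(size, dtype=np.float32)
    for start in range(0, size, CHUNK):
        # Only one chunk per worker occupies switch memory at any moment.
        chunk_sum = sum(quantize(g[start:start + CHUNK])
                        for g in worker_grads)
        out[start:start + CHUNK] = dequantize(chunk_sum)
    return out
```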
