Microsoft Corp. today previewed a new Azure instance for training artificial intelligence models that targets the emerging class of advanced, ultra-large neural networks being pioneered by the likes of OpenAI.
The instance, called the ND A100 v4, is being touted by Microsoft as its most powerful AI-optimized virtual machine to date.
The ND A100 v4 aims to address an important new trend in AI development. Engineers usually develop a separate machine learning model for every use case they seek to automate, but recently, a shift has started toward building one big, multipurpose model and customizing it for multiple use cases. One notable example of such an AI is the OpenAI research group’s GPT-3 model, whose 175 billion learning parameters allow it to perform tasks as varied as searching the web and writing code.
Microsoft is one of OpenAI’s top corporate backers. The company has also adopted the multipurpose AI approach internally, disclosing in the instance announcement today that such large AI models are used to power features across Bing and Outlook.
The ND A100 v4 is aimed at helping other companies train their own supersized neural networks by providing eight of Nvidia Corp.’s latest A100 graphics processing units per instance. Customers can link multiple ND A100 v4 instances together to create an AI training cluster with up to “thousands” of GPUs.
Microsoft didn’t specify exactly how many GPUs are supported. But even at the low end of the possible range, assuming a cluster with a graphics card count in the low four figures, the performance is likely not far behind that of a small supercomputer. Earlier this year, Microsoft built an Azure cluster for OpenAI that qualified as one of the world’s top five supercomputers, and that cluster had 10,000 GPUs.
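The cluster arithmetic implied above can be sketched in a few lines. This is a back-of-the-envelope illustration only: the 8-GPUs-per-instance figure comes from the announcement, the "thousands of GPUs" ceiling is Microsoft's own hedged wording, and the helper function below is hypothetical, not part of any Azure API.

```python
# Rough scale estimate for an ND A100 v4 cluster.
# Assumption: 8 GPUs per instance (per Microsoft's announcement);
# exact cluster limits were not disclosed.
GPUS_PER_INSTANCE = 8

def instances_needed(target_gpus: int) -> int:
    """Instances required to reach a target GPU count (ceiling division)."""
    return -(-target_gpus // GPUS_PER_INSTANCE)

for target in (1_000, 10_000):
    print(f"{target:>6} GPUs -> {instances_needed(target):>5} instances")
# A low-four-figure GPU count thus implies a cluster of a few hundred
# instances, versus 1,250 instances for a 10,000-GPU system like the
# one Microsoft built for OpenAI.
```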
In the new ND A100 v4 instance, clustering is made possible by a dedicated 200-gigabit-per-second InfiniBand network link provisioned to each chip. These connections allow the graphics cards to communicate with one another across instances. The speed at which GPUs can share data is a big factor in how fast they can process it, and Microsoft says the ND A100 v4 VM offers 16 times more GPU-to-GPU bandwidth than any other major public cloud.
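The per-instance interconnect figure follows directly from the numbers above. A minimal sketch, assuming only what the announcement states (eight GPUs per instance, one 200 Gb/s InfiniBand link per GPU); this is dedication of a link per GPU, not a measured benchmark:

```python
# Aggregate InfiniBand bandwidth per ND A100 v4 instance.
# Assumptions: 8 A100 GPUs per instance, each with a dedicated
# 200 Gb/s InfiniBand link (figures from Microsoft's announcement).
GPUS_PER_INSTANCE = 8
LINK_GBPS = 200  # per-GPU InfiniBand link, in gigabits per second

aggregate_gbps = GPUS_PER_INSTANCE * LINK_GBPS
print(f"Aggregate interconnect bandwidth: {aggregate_gbps} Gb/s "
      f"({aggregate_gbps / 1000:.1f} Tb/s) per instance")
# -> 1600 Gb/s (1.6 Tb/s) per instance
```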
The InfiniBand connections are powered by networking gear supplied by Nvidia’s Mellanox unit. To support the eight onboard GPUs, the new instance also packs a central processing unit from Advanced Micro Devices Inc.’s second-generation Epyc series of server processors.
The end result is what the company describes as a big jump in AI training performance. “Most customers will see an immediate boost of 2x to 3x compute performance over the previous generation of systems based on Nvidia V100 GPUs with no engineering work,” Ian Finder, a senior program manager at Azure, wrote in a blog post. He added that some customers may see performance improve by as much as 20 times.
Microsoft’s decision to use Nvidia chips and Mellanox gear to power the instance shows how the chipmaker is already reaping dividends from its $6.9 billion acquisition of Mellanox, which closed this year. Microsoft’s own investments in AI and related areas have likewise helped it win customers. Today’s debut of the new AI instance follows Tuesday’s announcement that the U.S. Energy Department has partnered with the tech giant to develop AI disaster response tools on Azure.