Virtualized AI: Deep Learning Needs More than Just More Compute Power
Is the recent progress in deep learning (DL) true artificial intelligence? A widely discussed article by Google’s Francois Chollet examines the skill-acquisition approach to intelligence, the one currently in use in modern DL. He argues that, with huge data sets available for training models, AI is mastering skill acquisition but not necessarily the “scope, generalization difficulty, priors, and experience” that true AI should incorporate. Even with our progress in AI, and specifically DL, we are nowhere near the limits of what DL can achieve with bigger, better-trained, more accurate models: those that take into account not only skill but also experience and the generalization of that experience.
Understandably, this has put intense focus on computing power, particularly the hardware that enables data scientists to run complex training experiments. Nvidia increasingly sees DL as a key market for its GPUs and bought Mellanox to speed communication inside a GPU cluster. With its recent acquisition of Habana, Intel is likely betting that custom AI accelerator hardware is a better match. Other AI-first hardware includes Cerebras’s massive chip in a custom box that’s designed for the specific types of intensive, long-running workloads that training DL models requires. In the cloud, Google’s Tensor Processing Units offer another bespoke option.
For companies running their own DL workloads, more compute is generally better. Whether they use exotic AI accelerators or tried-and-tested GPUs, quicker model training means more iterations, faster innovation and reduced time-to-market. It may even mean we can achieve “strong” AI (i.e., AI that goes beyond “narrow AI,” which is the capability of doing a single, discrete task) sooner. In 2020, continuing the trend of recent years, companies will invest in ever more AI hardware in an effort to satisfy data scientists’ demands for compute to run bigger models that solve more complex business problems.
But hardware isn’t the whole picture. The conventional computing stack – from processor to firmware to virtualization, abstraction, orchestration and operating layers through to end-user software – was designed for traditional workloads, prioritizing high-availability, short-duration operations.
Training a DL model, though, is the opposite of this sort of workload. A single training experiment may need 100 percent of the computing power of one or more processors for hours or even days at a time.
Part of the challenge is that, while developing a DL algorithm, data scientists have two basic use-patterns for compute resources. The first phase of development is building a model, which includes writing new code and debugging it until the model is ready. During this phase, they tend to use a single GPU often but for short periods of time.
The second phase is training, where the model consumes all the training data and adjusts its parameters. This might consume more than a single GPU and could even take a whole cluster working for days. Sometimes data scientists want to try training a few variations of the same model in parallel to see which performs better.
In a large company, computing resources for DL are typically provided by the IT department. Perhaps each data scientist is statically allocated a fixed amount of physical resources, say a GPU or two for building and training models. Inevitably, this means that expensive processors are sitting idle. Alternatively, a data science team might share their processing power and have to squabble over who gets to tie up the Nvidia DGX AI supercomputer for three days and who has to wait their turn.
All of this also creates challenges for enterprise IT. The IT department has limited visibility into how data science teams are using their expensive compute resources. Meanwhile, the C-suite doesn’t really understand how its GPU resources are being used and whether that usage matches its business goals. Should they invest money in more hardware? Should they hire more data science teams? Or is the issue in the workflow, with resources sitting idle while data scientists who cannot reach them wait for compute time?
Every minute a GPU or AI accelerator is idle is an opportunity cost. IT departments face under-utilization of their GPUs, while data science teams see their productivity damaged because, from their point of view, the hardware is ‘in use’ and they can’t train a new model until it has finished its current job. If unused GPUs could be put to work at full capacity, it would allow faster model training, more iterations and faster time to market.
This is the challenge that companies are beginning to face. Better hardware and more of it might well be necessary, but it isn’t sufficient if the software stack isn’t set up to also make efficient and effective use of that hardware.
The fundamental question of how to share hardware efficiently isn’t new. Some of the challenges that data scientists face could be solved by looking again at how virtualization solved this problem in traditional computing.
Traditional computing uses virtualization to share a single physical resource between multiple workloads. But what if, instead of sharing a single physical resource, virtualization were used to create a pool of resources, allowing DL projects to consume as much of the shared resources as they need in an elastic, dynamic way? A virtualized AI infrastructure for DL would run a single workload on multiple shared physical resources. Ideally, these resources could be dynamically allocated to the experiments that need them the most, allowing IT administrators to manage resources efficiently, reducing idle GPU time and increasing cluster utilization.
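To make the idea of an elastic pool concrete, here is a minimal Python sketch of one possible allocation policy. It is an illustration only, not a description of any real scheduler or product: the `Job` class, the `min_gpus`/`max_gpus` fields, the `allocate` function and the job names are all hypothetical, and the policy (admit jobs in priority order, then spread leftover GPUs round-robin) is just one simple choice among many.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical description of a DL workload, for illustration only.
    name: str
    min_gpus: int  # GPUs the job needs before it can start
    max_gpus: int  # the most GPUs the job can usefully consume


def allocate(pool_size: int, jobs: list[Job]) -> dict[str, int]:
    """Share pool_size GPUs among jobs, listed in priority order."""
    grants = {job.name: 0 for job in jobs}
    free = pool_size

    # Pass 1: admit each job, in priority order, if its minimum still fits.
    admitted = []
    for job in jobs:
        if job.min_gpus <= free:
            grants[job.name] = job.min_gpus
            free -= job.min_gpus
            admitted.append(job)

    # Pass 2: hand out leftover GPUs one at a time, round-robin, so that
    # no GPU sits idle while an admitted job could still use it.
    while free > 0:
        progressed = False
        for job in admitted:
            if free > 0 and grants[job.name] < job.max_gpus:
                grants[job.name] += 1
                free -= 1
                progressed = True
        if not progressed:
            break  # every admitted job is already at its maximum
    return grants


def utilization(pool_size: int, grants: dict[str, int]) -> float:
    """Fraction of the pool currently assigned to some job."""
    return sum(grants.values()) / pool_size


# Hypothetical mix: one model-building session and two training runs.
debug = Job("debug", min_gpus=1, max_gpus=1)
train_a = Job("train_a", min_gpus=2, max_gpus=8)
train_b = Job("train_b", min_gpus=2, max_gpus=4)

busy = allocate(8, [debug, train_a, train_b])
after_b_finishes = allocate(8, [debug, train_a])
```

In this toy setup, calling `allocate` again after `train_b` finishes lets `train_a` absorb the freed GPUs instead of leaving them idle, which is the elastic behavior a static one-GPU-per-person allocation cannot provide. A real system would also have to handle preemption, checkpointing and fairness across teams, none of which this sketch attempts.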
The software stack for DL needs to evolve along with the chips, both to get the most out of individual training experiments and to better optimize running multiple experiments in parallel. Companies will need a full-stack, AI-first solution that accounts for the needs of both DL workloads and, critically, DL organizations.