Facebook’s AutoScale decides if AI inference runs on your phone or in the cloud
In a technical paper published this week on arXiv.org, researchers at Facebook and Arizona State University lifted the hood on AutoScale, which shares a name with Facebook’s energy-sensitive load balancer. AutoScale, which could in theory be adopted by any company if its code were made publicly available, leverages AI to enable energy-efficient inference on smartphones and other edge devices.
Lots of AI runs on smartphones — in Facebook’s case, the models underpinning 3D Photos and other such features — but on-device inference can degrade battery life and performance without fine-tuning. Deciding whether AI should run on-device, in the cloud, or on a private cloud is therefore important not only for end users but for the enterprises developing the AI. Datacenters are expensive and require an internet connection; having AutoScale automate deployment decisions could result in substantial cost savings.
For each inference execution, AutoScale observes the current execution state, including the architectural characteristics of the algorithm and runtime variances (like Wi-Fi, Bluetooth, and LTE signal strength; processor utilization; voltage; frequency scaling; and memory usage). It then selects, from a lookup table, the hardware (processors, graphics cards, and co-processors) expected to maximize energy efficiency while satisfying quality-of-service and inference targets. (The table contains the accumulated rewards — values that tell AutoScale’s underlying models how well previous selections met their goals.) Next, AutoScale executes inference on the target defined by the selected hardware while observing the result, including energy, latency, and inference accuracy. Based on this, the system calculates a reward indicating how much the hardware selection improved efficiency, then updates the table.
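The select-execute-reward-update loop described above can be sketched as a small lookup-table policy. This is a hypothetical illustration, not the paper’s implementation: the state keys, hardware targets, learning rate, and exploration rate here are all made-up stand-ins.

```python
import random

# Hypothetical hardware/cloud targets AutoScale might choose among.
TARGETS = ["cpu_big", "cpu_little", "gpu", "dsp", "cloud"]

class LookupTable:
    """Toy sketch of a reward lookup table for execution-target selection."""

    def __init__(self, epsilon=0.1, lr=0.5):
        self.q = {}            # (state, target) -> accumulated reward
        self.epsilon = epsilon  # chance of exploring a random target
        self.lr = lr            # how quickly rewards overwrite old values

    def select(self, state):
        # Mostly exploit the best-known target; occasionally explore.
        if random.random() < self.epsilon:
            return random.choice(TARGETS)
        return max(TARGETS, key=lambda t: self.q.get((state, t), 0.0))

    def update(self, state, target, reward):
        # Fold the observed reward into the accumulated value.
        old = self.q.get((state, target), 0.0)
        self.q[(state, target)] = old + self.lr * (reward - old)
```

After each inference, `update` is called with a reward computed from the observed energy, latency, and accuracy, so later `select` calls favor targets that proved efficient in that state.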
As the researchers explain, AutoScale taps reinforcement learning to learn a policy that selects the best action for a given state, based on accumulated rewards. Given a processor, for example, the system calculates a reward with a utilization-based model that assumes (1) processor cores consume a variable amount of power; (2) cores spend a certain amount of time in busy and idle states; and (3) energy usage varies between these states. By contrast, when inference is scaled out to a connected system like a datacenter, AutoScale might calculate a reward using a signal-strength-based model that accounts for transmission latency and the power consumed by the network.
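A utilization-based reward of the kind described above might look like the following sketch. The power draws, latency budget, and reward shape are illustrative assumptions, not values from the paper.

```python
def estimated_energy(busy_time_s, idle_time_s,
                     busy_power_w=2.0, idle_power_w=0.3):
    """Toy utilization-based energy model (joules): cores draw
    different power in busy vs. idle states (assumed constants)."""
    return busy_time_s * busy_power_w + idle_time_s * idle_power_w

def reward(energy_j, latency_s, qos_latency_s=0.05):
    """Reward lower energy, but only if the QoS latency target
    (assumed here to be 50 ms) is met."""
    if latency_s > qos_latency_s:
        return 0.0
    return 1.0 / energy_j
```

A cloud-offload variant would swap in a model that adds transmission latency and radio power as a function of signal strength.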
To validate AutoScale, the coauthors of the paper ran experiments on three smartphones, each of which was measured with a power meter: the Xiaomi Mi 8 Pro, the Samsung Galaxy S10e, and the Motorola Moto X Force. To simulate cloud inference execution, they connected the handsets to a server via Wi-Fi, and they simulated local execution with a Samsung Galaxy Tab S6 tablet connected to the phones through Wi-Fi Direct (a peer-to-peer wireless network).
After training AutoScale by executing inference 100 times (resulting in 64,000 training samples) and compiling and generating 10 executables containing popular AI models, including Google’s MobileBERT (a compact language model) and Inception (an image classifier), the team ran tests in a static setting (with consistent processor, memory usage, and signal strength) and a dynamic setting (with a web browser and music player running in the background and signal interference). Three scenarios were devised for each:
- A non-streaming computer vision test scenario where a model performed inference on a photo from the phones’ cameras.
- A streaming computer vision scenario where a model performed inference on a real-time video from the cameras.
- A translation scenario where translation was performed on a sentence typed on the keyboard.
The team reports that across all scenarios, AutoScale beat baselines while maintaining low latency (less than 50 milliseconds in the non-streaming computer vision scenario and 100 milliseconds in the translation scenario) and high performance (around 30 frames per second in the streaming computer vision scenario). Specifically, it resulted in a 1.6 to 9.8 times energy efficiency improvement while achieving 97.9% prediction accuracy and real-time performance.
Moreover, AutoScale only ever had a memory requirement of 0.4MB, translating to 0.01% of the 3GB RAM capacity of a typical mid-range smartphone. “We demonstrate that AutoScale is a viable solution and will pave the path forward by enabling future work on energy efficiency improvement for DNN edge inference in a variety of realistic execution environment,” the coauthors wrote.
One of the most popular existing techniques, neural ordinary differential equations (ODEs), has an important limitation: it can’t account for random interactions, meaning it can’t update the state of a system as random events occur. (Think trades by other people that affect a company’s share price, or a virus picked up at a hospital that changes a person’s health status.) The system has to be updated manually on some schedule to account for these, which means that the model isn’t truly mapping to reality.
Neural SDEs have no such limitation. That’s because they represent continuous changes in state as they occur.
As the coauthors of the paper explain, neural SDEs generalize ODEs by adding instantaneous noise to their dynamics. This and other algorithmic tweaks allow neural SDEs with tens of thousands of variables (parameters) to be fitted, making them suitable for modeling things like the motion of molecules in a liquid, allele frequencies in a gene pool, or prices in a market.
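The "ODE plus instantaneous noise" idea can be made concrete with a minimal Euler-Maruyama simulation. This is a generic illustration, not the paper’s method: the drift and diffusion functions below are simple hand-written stand-ins for what a neural SDE would learn as networks.

```python
import random

def euler_maruyama(x0, drift, diffusion, dt=0.01, steps=100, rng=None):
    """Simulate dx = drift(x) dt + diffusion(x) dW with the
    Euler-Maruyama scheme. Setting diffusion to zero recovers
    plain (Euler) ODE integration."""
    rng = rng or random.Random(0)
    x = x0
    for _ in range(steps):
        dw = rng.gauss(0.0, dt ** 0.5)  # Brownian increment ~ N(0, dt)
        x = x + drift(x) * dt + diffusion(x) * dw
    return x

# Mean-reverting dynamics with constant noise (Ornstein-Uhlenbeck-style):
final = euler_maruyama(1.0, drift=lambda x: -x, diffusion=lambda x: 0.2)
```

With `diffusion=lambda x: 0.0`, the trajectory is deterministic and matches the ODE; with nonzero diffusion, random shocks perturb the state at every step, which is exactly the behavior the article says plain neural ODEs cannot capture.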
In one experiment, the team trained ODE and neural SDE models on a real-world motion capture data set comprising 23 walking sequences partitioned into 15 training, 3 validation, and 4 test sequences. After 400 iterations, they observed improved predictive performance from the neural SDEs compared with the ODEs — the former had a mean squared error of 4.03% versus the ODE’s 5.98% (lower is better).
“Building on the early work of Einstein, these SDEs enable models to represent continuous changes in state as they occur and to do so at scale,” a Vector Institute spokesperson told VentureBeat via email. “Non-neural SDEs are used in finance and health today, but their scale is limited. As mentioned at the top, neural SDEs introduce the new chance to apply AI at scale to large complex financial systems without having to make the big … compromises that have typically been required.”