Source – theregister.co.uk
The Raspberry Pi is one of the most exciting developments in hobbyist computing today. Across the world, people are using it to automate beer making, open up the world of robotics and revolutionise STEM education in a world overrun by film students. These are all laudable pursuits. Meanwhile, what is Microsoft doing with it? Creating squirrel-hunting water robots.
Over at the firm’s Machine Learning and Optimization group, a researcher saw squirrels stealing flower bulbs and seeds from his bird feeder. The research team trained a computer vision model to detect squirrels, and then put it onto a Raspberry Pi 3 board. Whenever an adventurous rodent happened by, it would turn on the sprinkler system.
Microsoft’s sciurine aversions aren’t the point of that story – its shoehorning of a convolutional neural network onto an ARM CPU is. It shows how organizations are pushing hardware further to support AI algorithms. As AI continues to make the headlines, researchers are pushing its capabilities to make it increasingly competent at basic tasks such as recognizing images and speech.
As people expect more of the technology, cramming it into self-flying drones and self-driving cars, the hardware challenges are increasing. Companies are producing custom silicon and computing nodes capable of handling them.
Jeff Orr, research director at analyst firm ABI Research, divides advances in AI hardware into three broad areas: cloud services, on‑device, and hybrid. The first focuses on AI processing done online in hyperscale data centre environments like Microsoft’s, Amazon’s and Google’s.
At the other end of the spectrum, he sees more processing happening on devices in the field, where connectivity or latency prohibit sending data back to the cloud.
“It’s using maybe a voice input to allow for hands-free operation of a smartphone or a wearable product like smart glasses,” he says. “That will continue to grow. There’s just not a large number of real-world examples on‑device today.” He views augmented reality as a key driver here. Or there’s always this app, we suppose.
Finally, hybrid efforts marry both platforms to complete AI computations. This is where your phone recognizes what you’re asking it but asks cloud-based AI to answer it, for example.
The cloud: rAIning algorithms
The cloud’s importance stems from the way that AI learns. AI models are increasingly moving to deep learning, which uses complex neural networks with many layers to create more accurate AI routines.
There are two aspects to using neural networks. The first is training, where the network analyses lots of data to produce a statistical model. This is effectively the “learning” phase. The second is inference, where the neural network then interprets new data to generate accurate results. Training these networks chews up vast amounts of computing power, but the training load can be split into many tasks that run concurrently. This is why GPUs, with their floating point throughput and huge core counts, are so good at it.
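The split between the two phases can be seen in a toy sketch. This is not a neural network, just a two-parameter model fitted by gradient descent, and the names `train` and `infer`, the learning rate and the data are all assumptions made for illustration. The point is that training loops over the data many times, while inference is a single cheap pass with the finished parameters.

```python
# Toy illustration of the two phases: training fits parameters to data,
# inference applies the fitted parameters to a new input.

def train(samples, epochs=2000, lr=0.05):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Each sample's contribution to the gradient is independent of the
        # others, which is the part parallel hardware can spread across cores.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def infer(model, x):
    """Apply the trained parameters to one new input."""
    w, b = model
    return w * x + b

# Hypothetical training data drawn from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]

model = train(data)               # expensive: thousands of passes over the data
prediction = infer(model, 10.0)   # cheap: one multiply and one add
print(model, prediction)
```

Here training performs 2,000 sweeps over the data set before a single prediction is made, which hints at why the heavy lifting tends to end up in the data centre.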
Nevertheless, neural networks are getting bigger and the challenges are getting greater. Ian Buck, vice president of the Accelerated Computing Group at dominant GPU vendor Nvidia, says that they’re doubling in size each year. The company is creating more computationally intense GPU architectures to cope, but it is also changing the way it handles its maths.
“It can be done with some reduced precision,” he says. Originally, neural network training all happened in 32‑bit floating point, but Nvidia has optimized its newer Volta architecture, announced in May, for 16‑bit inputs with 32‑bit internal mathematics.
Reducing the precision of the calculation to 16 bits has two benefits, according to Buck.
“One is that you can take advantage of faster compute, because processors tend to have more throughput at lower resolution,” he says. Cutting the precision also increases the amount of available bandwidth, because you’re fetching smaller amounts of data for each computation.
“The question is, how low can you go?” asks Buck. “If you go too low, it won’t train. You’ll never achieve the accuracy you need for production, or it will become unstable.”
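A rough way to see why 16‑bit inputs are paired with 32‑bit internal maths is to emulate both in software. The sketch below is an assumption-laden illustration, not a model of how Volta works: it uses Python's `struct` module to round values to half and single precision, and the helper names and the choice of 10,000 terms are invented for the demonstration. It multiplies 16‑bit inputs and keeps the running total first in 16 bits, then in 32 bits.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(a, b, round_acc):
    """Dot product whose running sum is rounded by round_acc after each add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + x * y)
    return acc

n = 10_000
a = [to_fp16(0.1)] * n   # 16-bit inputs
b = [to_fp16(0.1)] * n

exact = sum(x * y for x, y in zip(a, b))   # 64-bit reference result
fp16_acc = dot(a, b, to_fp16)              # 16-bit running sum
fp32_acc = dot(a, b, to_fp32)              # 32-bit running sum

# Bytes fetched for one input vector at each precision.
bytes_fp16 = n * struct.calcsize("<e")
bytes_fp32 = n * struct.calcsize("<f")

print(exact, fp16_acc, fp32_acc, bytes_fp16, bytes_fp32)
```

The 16‑bit running sum stops growing once each new product is smaller than half the gap between neighbouring representable values, so it ends up far from the reference, while the 32‑bit sum stays close. The inputs still take half the bytes to fetch, which is the bandwidth gain Buck describes.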
While Nvidia refines its architecture, some cloud vendors have been creating their own chips using alternative architectures to GPUs. The first generation of Google’s Tensor Processing Unit (TPU) originally focused on 8‑bit integers for inference workloads. The newer generation, announced in May, offers floating point precision and can be used for training, too. These chips are application-specific integrated circuits (ASICs). Unlike CPUs and GPUs, they are designed for a specific purpose (you’ll often see them used for mining bitcoins these days) and cannot be reprogrammed. Their lack of extraneous logic makes them extremely high in performance and economical in their power usage – but very expensive.
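To give a feel for what 8‑bit integer inference involves, here is a minimal sketch of symmetric linear quantization, a common textbook scheme. It is not a description of Google's TPU internals, and the function names and sample values are assumptions chosen for illustration. Floats are mapped to signed 8‑bit integers with a shared scale factor, the multiply-accumulate runs entirely on integers, and the result is rescaled once at the end.

```python
def quantize(values):
    """Map floats to signed 8-bit integers using one shared scale factor.

    Assumes at least one value is non-zero.
    """
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def int8_dot(qa, scale_a, qb, scale_b):
    """Integer multiply-accumulate, then a single float rescale at the end."""
    acc = sum(x * y for x, y in zip(qa, qb))   # wide integer accumulator
    return acc * scale_a * scale_b

# Hypothetical weights and inputs for a single neuron.
weights = [0.42, -1.30, 0.07, 0.88]
inputs = [1.50, 0.25, -2.00, 0.60]

qw, sw = quantize(weights)
qx, sx = quantize(inputs)

float_result = sum(w * x for w, x in zip(weights, inputs))
int8_result = int8_dot(qw, sw, qx, sx)

print(qw, qx, float_result, int8_result)
```

The integer result lands close to the floating point one but not exactly on it. That small loss is usually tolerable when a trained network is only making predictions, and much harder to live with during training.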
Google’s scale is large enough that it can swallow the high non-recurring engineering (NRE) costs associated with designing the ASIC in the first place because of the cost savings it achieves in AI‑based data centre operations. It uses them across many operations, ranging from recognizing Street View text to performing RankBrain search queries, and every time a TPU does something instead of a GPU, Google saves power.
“It’s going to save them a lot of money,” said Karl Freund, senior analyst for high performance computing and deep learning at Moor Insights and Strategy.
He doesn’t think that’s entirely why Google did it, though. “I think they did it so they would have complete control of the hardware and software stack.” If Google is betting the farm on AI, then it makes sense to control it from endpoint applications such as self-driving cars through to software frameworks and the cloud.
FPGAs and more
When it isn’t drowning squirrels, Microsoft is rolling out field programmable gate arrays (FPGAs) in its own data centre revamp. These are similar to ASICs but reprogrammable so that their algorithms can be updated. They handle networking tasks within Azure, but Microsoft has also unleashed them on AI workloads such as machine translation.
Intel wants a part of the AI industry, wherever it happens to be running, and that includes the cloud. To date, its Xeon Phi high-performance CPUs have tackled general purpose machine learning, and the latest version, codenamed Knights Mill, ships this year.
The company also has a trio of accelerators for more specific AI tasks, though. For training deep learning neural networks, Intel is pinning its hopes on Lake Crest, which comes from its Nervana acquisition. This is a co‑processor that the firm says overcomes data transfer performance ceilings using a type of memory called HBM2, which is around 12 times faster than DDR4.
While these big players jockey for position with systems built around GPUs, FPGAs and ASICs, others are attempting to rewrite AI architectures from the ground up.
Knuedge is reportedly prepping 256-core chips designed for cloud-based operations but isn’t saying much.
UK-based Graphcore, due to release its technology in 2017, has said a little more. It wants its Intelligence Processing Unit (IPU) to use graph-based processing rather than the vectors used by GPUs or the scalar processing in CPUs. The company hopes that this will enable it to fit the training and inference workloads onto a single processor. One interesting thing about its technology is that its graph-based processing is supposed to mitigate one of the biggest problems in AI processing – getting data from memory to the processing unit. Dell has been the firm’s perennial backer.
Wave Computing is also focusing on a different kind of processing, using what it calls its data flow architecture. It has a training appliance designed for operation in the data centre that it says can hit 2.9 PetaOPs/sec.
Whereas cloud-based systems can handle neural network training and inference, client-side devices from phones to drones focus mainly on the latter. Their considerations are energy efficiency and low-latency computation.
“You can’t rely on the cloud for your car to drive itself,” says Nvidia’s Buck. A vehicle can’t wait for a crummy connection when making a split-second decision on who to avoid, and long tunnels might also be a problem. So all of the computing has to happen in the vehicle. He touts the Nvidia P4 self-driving car platform for autonomous in-car smarts.
FPGAs are also making great strides on the device side. Intel has Arria, an FPGA co‑processor designed for low-energy inference tasks, while over at startup KRTKL, CEO Ryan Cousens and his team have bolted a low-energy dual-core ARM CPU to an FPGA that handles neural networking tasks. The company is crowdfunding its platform, called Snickerdoodle, for makers and researchers who want wireless I/O and computer vision capabilities. “You could run that on the ARM core and only send to the FPGA high-intensity mathematical operations,” he says.
AI is squeezing into even smaller devices like the phone in your pocket. Some processor vendors are making general purpose improvements to their architectures that also serve AI well. For example, ARM is shipping CPUs with increasingly capable GPU areas on the die that should be able to better handle machine learning tasks.
Qualcomm’s Snapdragon processors now feature a neural processing engine that decides which piece of tailored logic each machine learning or neural inference task should run on (voice detection in a digital signal processor and image detection on a built‑in GPU, say). It supports the convolutional neural networks used in image recognition, too. Apple is reportedly planning its own neural processor, continuing its tradition of offloading phone processes onto dedicated silicon.
This all makes sense to ABI’s Orr, who says that while most of the activity has been in cloud-based AI processors of late, this will shift over the next few years as device capabilities balance them out. In addition to areas like AR, this may show up in more intelligent-seeming artificial assistants. Orr believes that they could do better at understanding what we mean.
“They can’t take action based on a really large dictionary of what possibly can be said,” he says. “Natural language processing can become more personalised and train the system rather than training the user.”
This can only happen using silicon that allows more processing at given times to infer context and intent. “By being able to unload and switch through these different dictionaries that allow for tuning and personalization for all the things that a specific individual might say.”
Research will continue in this space as teams focus on driving new efficiencies into inference architectures. Vivienne Sze, professor at MIT’s Energy-Efficient Multimedia Systems Group, says that in deep neural network inferencing, it isn’t the computing that slurps most of the power. “The dominant source of energy consumption is the act of moving the input data from the memory to the MAC [multiply and accumulate] hardware and then moving the data from the MAC hardware back to memory,” she says.
Prof Sze works on a project called Eyeriss that hopes to solve that problem. “In Eyeriss, we developed an optimized data flow (called row stationary), which reduces the amount of data movement, particularly from large memories,” she continues.
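The effect of keeping data close to the arithmetic can be sketched by counting memory reads in a one-dimensional convolution. To be clear, this is a simple reuse scheme invented for illustration and not the row stationary data flow used in Eyeriss; the function names, the fetch counters and the sample sizes are all assumptions. The naive version reads both operands from main memory for every multiply-accumulate, while the second version reads each weight and each input once and then works from local storage.

```python
def conv1d_naive(inputs, weights):
    """Fetch both operands of every MAC from main memory."""
    fetches = 0
    out = []
    for i in range(len(inputs) - len(weights) + 1):
        acc = 0.0
        for k in range(len(weights)):
            acc += inputs[i + k] * weights[k]
            fetches += 2          # one input read plus one weight read
        out.append(acc)
    return out, fetches

def conv1d_reuse(inputs, weights):
    """Keep the weights and a sliding input window in local storage."""
    k = len(weights)
    local_w = list(weights)       # each weight read from memory once
    window = list(inputs[:k])     # first window read from memory once
    fetches = 2 * k
    out = []
    for i in range(len(inputs) - k + 1):
        if i > 0:
            # Slide the window: only one new input comes from main memory.
            window = window[1:] + [inputs[i + k - 1]]
            fetches += 1
        out.append(sum(x * w for x, w in zip(window, local_w)))
    return out, fetches

# Hypothetical signal and filter (inputs must be at least as long as weights).
signal = [float(i) for i in range(64)]
taps = [1.0, 0.5, 0.25]

naive_out, naive_fetches = conv1d_naive(signal, taps)
reuse_out, reuse_fetches = conv1d_reuse(signal, taps)

print(naive_fetches, reuse_fetches)
```

Both versions do the same number of multiply-accumulates and produce the same output. Only the number of trips to main memory changes, and that is the cost Sze identifies as dominant. Output writes are left out of the count to keep the sketch short.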
There are many more research projects and startups developing processor architectures for AI. While we don’t deny that marketing types like to sprinkle a little AI dust where it isn’t always warranted, there’s clearly enough of a belief in the technology that people are piling dollars into silicon.
As cloud-based hardware continues to evolve, expect more hardware to support AI locally in drones, phones and automobiles too.