Implement Photonic Tensor Cores for Machine Learning?
Researchers from George Washington University have reported an approach for building photonic tensor cores that leverages phase change photonic memory to implement a neural network (NN). Their novel architecture, reported online in AIP's Applied Physics Reviews last week, promises both performance gains and power advantages over traditional GPUs and other tensor core devices. While several photonic neural network designs have been explored, a photonic tensor core that performs tensor operations has yet to be implemented.
Photonics[i] encompasses the broad family of light-based technologies, from fiber optics to hybrid optoelectronics. Optical interconnect technology, for example, is an important area of research aimed at improving memory-to-processor and even processor-to-processor bandwidth. High bandwidth and low power are among photonics’ attractions.
In their paper, Photonic tensor cores for machine learning, Mario Miscuglio and Volker Sorger argue that in the age of heterogeneous computing, photonic-based specialized processors have great potential to augment electronic systems and may perform exceptionally well in network-edge devices as well as 5G communications. A pre-trained, photonic tensor core neural network used for inferencing, for example, would consume very little power.
Miscuglio told HPCwire, “Besides the increased speeds and bandwidths that can come from working directly in the optical domain, leveraging on the intrinsic optical nature of signal travelling in optical fibers, the advantage of using the photonic architecture is the lower power consumption for performing inference which can be useful for intelligent optical low-power sensors.”
Broadly speaking, neural networks make heavy use of matrix-vector multiplications. No surprise the latest GPUs and TPUs are much better than CPUs at this kind of calculation. The researchers summarize the challenge nicely in the paper:
“For a general-purpose processor offering high computational flexibility, these matrix operations take place serially (i.e., one-at-a-time) while requiring continuous access to the cache memory, thus generating the so-called “von Neumann bottleneck.” Specialized architectures for NNs, such as Graphic Process Units (GPUs) and Tensor Process Units (TPUs), have been engineered to reduce the effect of the von Neumann bottleneck, enabling cutting-edge machine learning models. The paradigm of these architectures is to offer domain-specificity, such as optimization for convolutions or Matrix-Vector Multiplications (MVM) performing operations, unlike CPUs, in parallel and thus deployment of a systolic algorithm.
“GPUs have thousands of processing cores optimized for matrix math operations, providing tens to hundreds of TFLOPS (Tera FLoating Point OPerations) of performance, which makes GPUs the obvious computing platform for deep NN-based AI and ML applications. GPUs and TPUs are particularly beneficial with respect to CPUs, but when used to implement deep NN performing inference on large 2-dimensional datasets such as images, they are rather power-hungry and require longer computation runtime (>tens of ms). Moreover, smaller matrix multiplication for less complex inference tasks [e.g., classification of handwritten digits of the Modified National Institute of Standards and Technology database (MNIST)] are still challenged by a non-negligible latency, predominantly due to the access overhead of the various memory hierarchies and the latency in executing each instruction in the GPU.”
They propose a tensor core unit implemented in photonics that relies on photonic multiplexed (WDM, wavelength division multiplexing) signals, “weighted, after filtering, using engineered multi-state photonic memories based on Ge2Sb2Se5 wires patterned on the waveguide. The photonic memories are reprogrammed by selectively changing the phase (amorphous/crystalline) of the wires, using electrothermal switching through Joule heating induced by tungsten electrodes. The photonic memory programming can be realized in parallel (few microseconds), if needed, or alternatively, this photonic tensor core can operate as a passive system with a pre-SET kernel matrix.”
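To make the weighting scheme concrete, here is a minimal numerical sketch of the idea, not the authors' implementation. It assumes each wavelength channel carries one input value as optical power, each memory cell acts as an attenuator with a transmission between 0 and 1, and a photodetector sums the channels. The function names `photonic_dot` and `photonic_mvm` and the sample values are hypothetical, and the model ignores noise, insertion loss and any scheme for negative weights.

```python
def photonic_dot(inputs, transmissions):
    """Model one row of the core: every WDM channel carries one input value,
    a memory cell attenuates it, and a photodetector sums all channels."""
    if len(inputs) != len(transmissions):
        raise ValueError("one memory cell per wavelength channel is assumed")
    for t in transmissions:
        if not 0.0 <= t <= 1.0:
            # Assumption: a passive cell can only attenuate, never amplify.
            raise ValueError("transmission must lie in [0, 1]")
    return sum(x * t for x, t in zip(inputs, transmissions))


def photonic_mvm(kernel, inputs):
    """Apply every kernel row to the same multiplexed input vector."""
    return [photonic_dot(inputs, row) for row in kernel]


# Hypothetical pre-set 4x4 kernel (cell transmissions) and input signal.
kernel = [
    [1.0, 0.5, 0.0, 0.25],
    [0.0, 1.0, 0.5, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.25, 0.0, 0.0, 1.0],
]
signal = [2.0, 4.0, 6.0, 8.0]

print(photonic_mvm(kernel, signal))  # prints [6.0, 7.0, 10.0, 8.5]
```

In this toy model the kernel never changes between calls, which mirrors the passive, pre-set mode described in the quote: once the memories are written, inference only requires passing light through them.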
See two figures from the paper below depicting 1) the core and 2) the memory.
The phase change memory technology is a critical advance, said Miscuglio. “Each neuron in our brain stores and processes data at the same time. Similarly, in our architecture we use memory cells that can be written electronically and can store multi-bit weights and can be read optically by simply letting light interact with the material. Our photonic memories rely on broadband transparent phase change materials, which unlike other implementations based on more established GST (germanium-antimony-tellurium), are characterized by negligible losses in the amorphous state at the telecom wavelength.”
“This is important because it enables for deeper architectures which can potentially solve more complex tasks without using additional laser sources or amplifiers. We also propose a multi-state photonic memory (4-bit) architecture, which can be easily erased and written on chip, using electrothermal heaters. All the memories have dedicated circuitry and can be written in parallel, unlike other implementations which rely on cumbersome optical writing/erasing either on-chip or off-chip,” he said.
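A 4-bit memory cell can hold one of 16 states, so a trained weight has to be snapped to the nearest available state before it is written. The sketch below illustrates that step under simple assumptions that are not taken from the paper: weights are normalized to the range 0 to 1 and the 16 states are evenly spaced. The helper names `to_state` and `to_weight` are hypothetical.

```python
def to_state(weight, bits=4):
    """Return the index (0 .. 2**bits - 1) of the memory state nearest to
    a weight in [0, 1], assuming evenly spaced states."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return round(weight * (2 ** bits - 1))


def to_weight(state, bits=4):
    """Return the weight value that a stored state index represents."""
    return state / (2 ** bits - 1)


# Hypothetical trained weights, already normalized to [0, 1].
weights = [0.0, 0.12, 0.52, 0.88, 1.0]
states = [to_state(w) for w in weights]
print(states)  # prints [0, 2, 8, 13, 15]

# With evenly spaced states, the rounding error is at most half a step.
half_step = 0.5 / (2 ** 4 - 1)
worst = max(abs(to_weight(s) - w) for s, w in zip(states, weights))
assert worst <= half_step + 1e-12
```

With 16 evenly spaced levels the worst-case rounding error per weight is 1/30, or about 0.033, which gives a rough sense of the precision such a memory offers.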
Miscuglio said the architecture does not map a specific network architecture but is a more general accelerator for neural networks. Exploiting its modular architecture, one could “straightforwardly use the photonic TPU for a series of operations including but not limited to matrix-matrix multiplication, such as vector matrix multiplication, convolutions. These algebraic operations are key operations of many complex scientific and societal problems.”
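One common way a matrix engine handles convolutions is to unroll every image patch into a row and multiply by the flattened kernel, a technique usually called im2col. The sketch below shows that general technique in plain Python. It is an illustration of how a convolution reduces to dot products, not a description of how the proposed photonic hardware schedules the work. The function names and sample data are hypothetical. As in most machine learning frameworks, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2-D image into one flat row."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            patches.append(
                [image[r + i][c + j] for i in range(k) for j in range(k)]
            )
    return patches


def conv2d_as_matmul(image, kernel):
    """Compute a 'valid' 2-D convolution (no kernel flip) as a series of
    dot products between unrolled patches and the flattened kernel."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_cols = len(image[0]) - k + 1
    flat = [
        sum(p * w for p, w in zip(patch, flat_kernel))
        for patch in im2col(image, k)
    ]
    # Fold the flat result back into a 2-D output map.
    return [flat[i:i + out_cols] for i in range(0, len(flat), out_cols)]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, 1],
]
print(conv2d_as_matmul(image, kernel))  # prints [[6, 8], [12, 14]]
```

A 2×2 kernel flattens to four values, so in this example each output pixel is a four-element dot product.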
“We think that in the long-term data centers would greatly benefit from this architecture since much of the information that they are handling is already in the optical domain. We don’t think it will replace supercomputers but will be useful as a preprocessing unit to work synergistically with supercomputers on data closer to the edge of the network to sort and correlate the signals looking for specific chunks of data or patterns and consequently reducing data traffic.”
At the time of the paper they had tested the multi-state, low-loss photonic memory devices, “showing performances, which are in excellent agreement with the simulations.” Miscuglio said, “We developed the architecture of the single photonic core which performs 4×4 matrix multiplication and are currently working on the development of the first generation of the photonic tensor core. Regarding a timeline, we plan to have an experimental demonstration of the single core within six months to one year, and a fully functioning multicore tensor processor within the next couple of years.”