Oracle open-sources Java machine learning library
Tribuo provides standard machine learning functionality including algorithms for classification, clustering, anomaly detection, and regression. Tribuo also includes pipelines for loading and transforming data and provides a suite of evaluations for supported prediction tasks. Because Tribuo collects statistics on its inputs, it can report details such as the range of each input. It also names features, managing feature IDs and output IDs under the hood to avoid ID conflicts and confusion when chaining models, loading data, and featurizing inputs.
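The bookkeeping described above can be pictured with a small sketch. This is not Tribuo's API: the class FeatureRegistry and all of its methods are hypothetical names invented for illustration, and a real library would likely track more than a count and a range. The idea is only that features are addressed by name, IDs are assigned internally, and simple statistics accumulate as data is loaded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tribuo's API: names features, assigns IDs, tracks ranges.
final class FeatureRegistry {

    /** Running statistics for one named feature. */
    static final class Stats {
        final int id;
        long count = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        Stats(int id) { this.id = id; }
    }

    // Insertion-ordered so IDs follow the order in which names were first seen.
    private final Map<String, Stats> byName = new LinkedHashMap<>();

    /** Records one observed value; returns true if the feature name was new. */
    boolean observe(String name, double value) {
        Stats s = byName.get(name);
        boolean firstSighting = (s == null);
        if (firstSighting) {
            s = new Stats(byName.size()); // next free ID, never chosen by the caller
            byName.put(name, s);
        }
        s.count++;
        s.min = Math.min(s.min, value);
        s.max = Math.max(s.max, value);
        return firstSighting;
    }

    boolean isKnown(String name) { return byName.containsKey(name); }

    /** Returns the internal ID, or -1 if the name has never been observed. */
    int idOf(String name) {
        Stats s = byName.get(name);
        return s == null ? -1 : s.id;
    }

    /** Returns the statistics for a name, or null if it has never been observed. */
    Stats statsOf(String name) { return byName.get(name); }

    String describe(String name) {
        Stats s = byName.get(name);
        if (s == null) return name + ": never observed";
        return name + " (id " + s.id + "): count=" + s.count
                + ", range=[" + s.min + ", " + s.max + "]";
    }

    public static void main(String[] args) {
        FeatureRegistry registry = new FeatureRegistry();
        registry.observe("length", 12.0);
        registry.observe("length", 3.0);
        registry.observe("word=java", 1.0);
        System.out.println(registry.describe("length"));
        System.out.println(registry.describe("word=python"));
    }
}
```

Because callers only ever use names, two datasets or chained models in this sketch cannot disagree about which integer refers to which feature.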
A Tribuo model knows when it sees a feature for the first time, which is particularly useful when working with natural language processing. Models also know what their outputs are, because outputs are strongly typed. Developers do not need to wonder if a float is a probability, a regressed value, or a cluster ID. With Tribuo, each of these is a separate type, and the model can describe the types and ranges it knows about. Use of strongly typed inputs and outputs means Tribuo can track the model construction process, from the point data is loaded through train/test splits or dataset transformations to model training and evaluation. This tracking data is baked into all models and evaluations.
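A minimal sketch of what strongly typed outputs buy a developer, using plain Java rather than Tribuo itself. Every name here (TaskOutput, LabelOutput, RegressedValue, ClusterAssignment, TypedModel, ThresholdClassifier) is hypothetical and chosen for illustration, and the toy classifier is an assumption, not an algorithm the library is known to ship.

```java
// Hypothetical sketch, not Tribuo's API: one output type per prediction task.
interface TaskOutput {
    String describe();
}

/** A class label with the probability assigned to that label. */
final class LabelOutput implements TaskOutput {
    final String label;
    final double probability;

    LabelOutput(String label, double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.label = label;
        this.probability = probability;
    }

    public String describe() { return "label=" + label + ", probability=" + probability; }
}

/** A real-valued regression prediction. */
final class RegressedValue implements TaskOutput {
    final double value;
    RegressedValue(double value) { this.value = value; }
    public String describe() { return "regressed value=" + value; }
}

/** The index of the cluster an example was assigned to. */
final class ClusterAssignment implements TaskOutput {
    final int clusterId;
    ClusterAssignment(int clusterId) { this.clusterId = clusterId; }
    public String describe() { return "cluster id=" + clusterId; }
}

/** A model is parameterized by the kind of output it produces. */
interface TypedModel<T extends TaskOutput> {
    T predict(double[] features);
}

/** Toy classifier (an assumption for this sketch): logistic score on the first feature. */
final class ThresholdClassifier implements TypedModel<LabelOutput> {
    private final double threshold;

    ThresholdClassifier(double threshold) { this.threshold = threshold; }

    public LabelOutput predict(double[] features) {
        double pPositive = 1.0 / (1.0 + Math.exp(-(features[0] - threshold)));
        return pPositive >= 0.5
                ? new LabelOutput("positive", pPositive)
                : new LabelOutput("negative", 1.0 - pPositive);
    }
}

final class TypedOutputDemo {
    public static void main(String[] args) {
        TypedModel<LabelOutput> model = new ThresholdClassifier(0.5);
        LabelOutput out = model.predict(new double[] {0.9});
        System.out.println(out.describe());
        // RegressedValue r = model.predict(...);  // would not compile: wrong output type
    }
}
```

The point of the design is that a mix-up between a probability and a cluster ID becomes a compile-time error instead of a silent bug in downstream code.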
The Tribuo provenance system can generate a configuration that rebuilds the training pipeline to reproduce the model or evaluation. The same configuration can be adjusted to build a tweaked model with new data or different hyperparameters. Thus users always know what a Tribuo model is, where it came from, and how to re-create it.
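The idea of provenance that doubles as a rebuildable configuration can be sketched in a few lines. This is an illustration only, not Tribuo's provenance format: the class PipelineProvenance, its methods, the key names, and the file path data/train.csv are all hypothetical, and the flat key=value text is a simplifying assumption that does not support values containing line breaks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tribuo's provenance system: an ordered log of pipeline
// settings that can be written out as a config and read back to rebuild the run.
final class PipelineProvenance {
    private final Map<String, String> settings = new LinkedHashMap<>();

    /** Logs one pipeline setting, in the order the pipeline applied it. */
    PipelineProvenance log(String key, String value) {
        settings.put(key, value);
        return this;
    }

    String get(String key) { return settings.get(key); }

    /** Serializes the settings as one key=value pair per line. */
    String toConfig() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    /** Rebuilds a provenance object from text produced by toConfig(). */
    static PipelineProvenance fromConfig(String config) {
        PipelineProvenance p = new PipelineProvenance();
        for (String line : config.split("\n")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                p.log(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return p;
    }

    /** Returns a copy with one setting changed; the original is left untouched. */
    PipelineProvenance withOverride(String key, String value) {
        return fromConfig(toConfig()).log(key, value);
    }

    public static void main(String[] args) {
        PipelineProvenance original = new PipelineProvenance()
                .log("data.path", "data/train.csv")   // hypothetical path
                .log("split.trainFraction", "0.7")
                .log("split.seed", "1")
                .log("trainer.learningRate", "0.1");

        // Same pipeline, one tweaked hyperparameter.
        PipelineProvenance tweaked = original.withOverride("trainer.learningRate", "0.01");
        System.out.print(tweaked.toConfig());
    }
}
```

In this sketch, reproducing a model means replaying the saved configuration unchanged, and building a variant means replaying it with one override applied.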
Oracle sees Tribuo filling a gap in the marketplace for machine learning in enterprise applications. Whereas the Google-built TensorFlow library provides core algorithms for deep learning, Tribuo provides several machine learning algorithms, some of which are in TensorFlow and some of which are not, while also providing an interface to TensorFlow, said Oracle’s Adam Pocock, principal member of the Oracle Labs technical staff. And whereas the Apache Spark analytics engine is for large, distributed systems, Tribuo is for smaller computations that can fit on a single machine, Pocock said.
In addition to TensorFlow, Tribuo provides interfaces to XGBoost and the ONNX runtime, allowing models stored in the ONNX format or trained in TensorFlow and XGBoost to be deployed alongside native Tribuo models. Support for the ONNX model format allows deployment in Java of models trained using popular Python libraries such as PyTorch.