As part of our work on Yuma and Hivemind, we are always looking for ways to make text search faster. Naturally, we’re big fans of vector search and depend on it throughout our entire context-generation pipeline; we’re also perpetually frustrated with how slow embedding can be. This has led us to explore all kinds of strategies for speeding the process up, from small transformer-based models on GPU machines to static models running on CPU.
You might remember static models from ten-or-so years ago (e.g. Word2Vec). Their promise is simple: they let you embed text inputs with zero active parameters, which makes deployments much smaller and inference multiple orders of magnitude faster. Unfortunately, this has historically come with a major hit to retrieval accuracy, which rather defeats the purpose of the exercise. In the past couple of years, however, there’s been renewed interest in overcoming that accuracy loss: as vector search becomes more ubiquitous, so does the impetus for a multiple-order-of-magnitude increase in embedding throughput. Out of this renewed focus, new ways of training these models have emerged from folks like MinishLab with their potion models, easing some of the loss in retrieval accuracy.
One such new training method comes from work by Tom Aarsen on the SentenceTransformers team. Aarsen used more modern contrastive training methods to produce a highly performant static model with much higher accuracy than similar models; in essence, the technique takes a larger transformer-based model and uses it to train a smaller static model. As far as the retrieval accuracy of static embeddings goes, this represented a large jump in the state of the art. To us, it was a clear signal that there is a significant opportunity to rethink how static models are created and put into service.
Using the model, static-retrieval-mrl-en-v1, from Rust was kind of a pain, though. After all, the “model” is basically just a lookup table from token IDs to output weights, so we decided the runtime needed to be recomposed to match the model’s relative simplicity. We took the weights and tokenizer from the model Aarsen trained and materialized them directly as static globals at compile time (using a relatively complicated build.rs), skipping the cost of loading the model at runtime inside an ML library, and significantly simplified the input tokenization pipeline to match. This lets the compiler vectorize the entire “inference” pipeline, which works better than we could have ever hoped.
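To give a sense of why this is so amenable to compiler vectorization, here is a minimal sketch of what static-model “inference” boils down to: look up one row per token ID in a compile-time table, average the rows, and L2-normalize. The table below is a tiny placeholder, not the real weights (which a build.rs would emit as static globals), and `EMBED_DIM`, `EMBEDDINGS`, and `embed` are names we’ve made up for illustration.

```rust
// Hypothetical sketch of lookup-table inference. In the real crate, the
// table is generated at compile time from the trained weights; here it is
// a toy three-token vocabulary with a four-dimensional embedding.
const EMBED_DIM: usize = 4;

static EMBEDDINGS: [[f32; EMBED_DIM]; 3] = [
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.3, 0.2, 0.1],
    [0.5, 0.5, 0.5, 0.5],
];

/// Embed a tokenized input: mean-pool the rows for each token ID,
/// then L2-normalize the result. Just tight loops over static data,
/// which the compiler can auto-vectorize.
fn embed(token_ids: &[usize]) -> [f32; EMBED_DIM] {
    let mut out = [0.0f32; EMBED_DIM];
    for &id in token_ids {
        for (o, w) in out.iter_mut().zip(&EMBEDDINGS[id]) {
            *o += w;
        }
    }
    let n = token_ids.len().max(1) as f32;
    for o in out.iter_mut() {
        *o /= n;
    }
    let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for o in out.iter_mut() {
            *o /= norm;
        }
    }
    out
}

fn main() {
    let v = embed(&[0, 1]);
    println!("{:?}", v);
}
```

There is no matrix multiplication, attention, or even an activation function anywhere in the hot path, which is why dropping the general-purpose ML runtime pays off so dramatically.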
This results in an extreme boost in performance—two orders of magnitude faster than the original model (which itself is about 400x faster than a transformers-based model on CPU). Our model achieves this with no degradation in accuracy—it scores similarly on the NanoBEIR benchmark because it’s the same model with the same weights, simply materialized in a much thinner, purpose-built runtime.
Below, we compare our model’s performance with MinishLab’s smallest model, Aarsen’s StaticMRL, and the ubiquitous MiniLM-L6. We ran these benchmarks on an M4 Mac Mini and a Raspberry Pi 4 Model B.
If you can believe it, our model running on the Raspberry Pi was faster than any other model running on the Mac Mini.
There’s obviously a lot of territory left to explore in building models like these, which can be deployed anywhere with no special hardware requirements, and we look forward to open-sourcing our model deployment pipeline in the coming weeks.
In the meantime, reach out if you’d like early access to the crate.
P.S. thanks to Cyril for lending us Raspberry Pi hardware!