Speeding up corpus ingestion | Flower Computer Co.

“Most users don’t come to a knowledge base with the right query. They ask things like ‘catch me up’ or ‘what should I know before this call’ and expect the agent to understand what matters. Flower Co’s static embedding model made Enzyme’s compile step fast enough that we can build that understanding ahead of time, so agents start from a map of the corpus rather than guessing their way through it at query time.”

Most conversations with LLMs start from a blank slate: an empty chat window. Some users, however, ground their conversations in a corpus of material, called a knowledge base, that should inform how those conversations take shape. Although working with a knowledge base is structurally similar to an agent working with a codebase, knowledge bases are typically messier and lack the navigational affordances of code.

This is why Enzyme exists: it turns an existing knowledge base into a highly performant memory system, allowing an agent to orient itself well enough to respond to even vague queries (e.g. “catch me up”, “what do I need to know”).

Read more about Enzyme’s approach from Josh here or in the docs.

Using an existing corpus to influence how an LLM behaves requires turning that corpus into agent-friendly material, typically by creating indices, graph representations, and other relational mappings of the content. These searchable surfaces allow useful context to be compiled from the source material, which can make an LLM’s responses more helpful or insightful. Enzyme does this by structuring the relations in the corpus through ‘catalysts’: questions formulated about the material during ingestion.

Ingestion is typically a bottleneck when building memory systems: documents are chunked, embedded, and then saved to a database that catalogs the relationships between the chunks. After Enzyme switched to our model, text embedding became a small enough share of ingestion time that it is no longer the bottleneck, allowing Enzyme to process huge knowledge bases quickly. With this higher level of performance, Josh could expand the number of chunks informing Enzyme’s context graph, making the overall experience feel more intelligent.
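To make the chunk, embed, and save steps concrete, here is a minimal Python sketch of an ingestion loop. It is not Enzyme’s implementation: the function names, the word-count chunking, the table layout, and the hashing-based `toy_embed` stand-in are all assumptions made for illustration, and a real pipeline would call an actual embedding model at that step.

```python
import hashlib
import json
import sqlite3


def chunk_text(text, max_words=50):
    """Split a document into chunks of at most max_words words (hypothetical policy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(chunk, dim=8):
    """Stand-in for a real embedding model: count words hashed into dim buckets."""
    vec = [0.0] * dim
    for word in chunk.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def ingest(docs, db):
    """Chunk each document, embed each chunk, and store one row per chunk."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS chunks "
        "(doc_id TEXT, position INTEGER, text TEXT, embedding TEXT)"
    )
    count = 0
    for doc_id, text in docs.items():
        for position, chunk in enumerate(chunk_text(text)):
            db.execute(
                "INSERT INTO chunks VALUES (?, ?, ?, ?)",
                (doc_id, position, chunk, json.dumps(toy_embed(chunk))),
            )
            count += 1
    db.commit()
    return count


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    print(ingest({"note-1": "catch me up on the project " * 30}, db), "chunks stored")
```

The embedding call sits inside the innermost loop, so it runs once per chunk; that is why its cost tends to dominate ingestion when the model is slow.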

Enzyme was originally using a local static embedding model released by the Minish team called Potion-2M. Our model, explained in more detail here, improved Enzyme’s ingestion times by 6x.

After integrating our model, Josh tested out the new version with a friend — Enzyme ingested their entire email history (~20k emails) in seconds, before the first query against it could even be written.

Our model also simplified the development and deployment process of Enzyme. Typically when using local models, a developer makes calls to a model that is separate from their application binary. Our model is compiled directly into the parent application, which makes shipping software much easier and produces smaller binaries, since a separate model runtime is no longer needed.

While our model and other static models are much faster than transformer-based embedding models, they are much less accurate, because they do not use the context surrounding a token to determine its embedding. However, we’ve found that in many situations, this trade-off is reasonable, as raw embeddings are rarely used alone to surface results.
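The trade-off is easier to see in code. The sketch below is a toy static embedder in Python, assuming the common design of one fixed vector per token followed by mean pooling. It is not our model or Potion-2M: the random vectors, whitespace tokenization, and the names `token_vector` and `static_embed` are placeholders for illustration, whereas real static models use learned vectors and subword tokenizers.

```python
import random

DIM = 4
_rng = random.Random(0)
_table = {}  # token -> fixed vector; random here, learned in a real model


def token_vector(token):
    """Return the single fixed vector for a token, creating it on first use."""
    if token not in _table:
        _table[token] = [_rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    return _table[token]


def static_embed(text):
    """Embed text as the mean of its token vectors (lookup plus average only)."""
    vectors = [token_vector(t) for t in text.lower().split()]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    a = static_embed("dog bites man")
    b = static_embed("man bites dog")
    # Word order is invisible to mean pooling, so the two sentences collide.
    print(all(abs(x - y) < 1e-9 for x, y in zip(a, b)))
```

Because embedding is only a table lookup and an average, it is cheap on a CPU. The cost is that a token such as "bank" gets the same vector whether it appears next to "river" or "loan", and reordering the words in a sentence does not change the result.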

In Enzyme’s case, our model unlocked the ability to compute far more ‘catalysts’, which in turn increased the relevance of context for queries. We’ve found that being able to embed much more cheaply opens up more possibilities in structuring the relationships between text chunks, which more than makes up for the decrease in embedding accuracy.

Running embedding models without a GPU? Get in touch.