So I Wrote a Vector Database

Why and how I decided to do it.

There is usually a point when you're starting a new project where you need to solve a very specific problem, and you have to go looking at what's already out there to see if anything fits the bill. For me the project was trying to run retrieval augmented generation (RAG) on a set of medical documents, and the specific problem was that none of the existing solutions I saw[1] for LangChain fit my requirements: run from disk, be free, don't require me to run a service, and have a LangChain VectorDatabase adapter. In essence I wanted a package with minimal dependencies that would not eat RAM faster than my browser does. Some offerings were free, but you had to self-host a service that wanted a lot of memory; others used disk but came with a large dependency tree. They all serve a market of people doing this in some production (or production-adjacent) environment, where you're losing money for every millisecond it's trying to find the right vector to send back. All I wanted was something I could use to run some medical questions through the model-du-jour on Hugging Face and get some answers without nuking my video/system memory using FAISS.[2]
Another phenomenon I see somewhat more frequently among the technically inclined is the ominous phrase "I could make that". This belief, while likely responsible for many impactful innovations, is also one of the more damaging beliefs a middling engineer like myself can have. Unfortunately, on this occasion I did speak that incantation, and a weekend vanished before my very eyes.

The Implementation

While it sounds somewhat advanced, a vector database, at least the way I've chosen to implement one, is rather simple; it's really just one big array. You tell it you want to add a vector, and it gives you back the index where it was added. It doesn't really care what the vector means, nor does it care if you keep track of the index. You are responsible for keeping track of which vector corresponds to which piece of data. The complexity comes from how you choose to implement the search over those vectors. The current iteration of TinyVec is a Minimum Viable (non)Product of what I needed in a vector database, but it's built so that optimizations can be added in the future. I do hope to delve into some of the more advanced optimizations that new vector databases use to get their billion-dollar valuations, but for now you're getting the basics.
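To make that concrete, here's a minimal sketch of the idea. This is my own illustration, not TinyVec's actual API (the class and method names are made up): the database is one growable array, adding a vector hands back an index, and the index-to-document bookkeeping is entirely on you.

```python
import numpy as np

class ToyVectorDB:
    """Toy illustration: the whole database is one big array of vectors."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, vector: np.ndarray) -> int:
        """Append a vector and return the index it landed at."""
        self.vectors = np.vstack([self.vectors, vector.astype(np.float32)])
        return len(self.vectors) - 1

# The caller keeps the index -> data mapping themselves:
db = ToyVectorDB(dim=3)
idx = db.add(np.array([0.1, 0.2, 0.3]))
id_to_text = {idx: "some chunk of a medical document"}
```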
Vectors can be complex pieces of data. The original purpose of this implementation was to store and retrieve vectors somewhere in the neighborhood of 1000-2000 features, a pretty average size when it comes to language modeling. A few thousand arrays of 1000 items doesn't sound like a lot, but in reality the overhead can be pretty killer. Python in particular can be pretty bad, because everything is an object, complete with a reference counter and all the other bells and whistles that let it be so beginner friendly.[3] To avoid incurring this overhead on the ~BILLIONS~ of floats I wanted to store, the obvious choice was numpy: there are very few cases in which you'd want a vector database and not already have numpy installed, and many of the matrix math functions that are essential to a vector database have already been implemented by people with a much stronger background in math than I have.
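To see the overhead for yourself, a quick back-of-the-envelope comparison looks something like this (exact sizes are approximate and depend on your CPython build and platform):

```python
import sys
import numpy as np

# One million floats, two ways.
as_list = [float(i) for i in range(1_000_000)]
as_array = np.arange(1_000_000, dtype=np.float32)

# The list stores a million 8-byte pointers (~8 MB right there), each
# pointing at a full Python float object with its own refcount and type
# header. getsizeof only counts the pointer array, not the floats.
print(sys.getsizeof(as_list))  # roughly 8 MB, and that's just the pointers

# The numpy array packs raw 4-byte floats into one contiguous buffer.
print(as_array.nbytes)         # exactly 4,000,000 bytes
```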
Continuing this trend of numpy efficiency, we can use the memmap()[4] function to keep the big vector array on disk and satisfy our low-memory requirement. All this does is keep the numbers in contiguous disk space and let you use the file like any other numpy ndarray; indexing is done using file byte offsets. The major challenge here is that there is no clear way to resize an existing np.memmap, since it is contiguous disk space, so you have to do some special reallocation logic when you exceed the current size of your vector database. This could certainly be improved in the current TinyVec implementation, because it temporarily doubles the disk space required for your dataset, but the best course of action is to pass in whatever size you'll need when you initially create the database so that the disk space is only allocated once.
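The reallocation dance looks roughly like this. This is a sketch of the idea with made-up helper names (`open_store`, `grow`), not TinyVec's actual code:

```python
import numpy as np

DIM = 1024

def open_store(path: str, capacity: int) -> np.memmap:
    """Create a file-backed float32 array of shape (capacity, DIM).

    mode="w+" creates (or truncates) the file; reopening an existing
    store would use mode="r+" instead.
    """
    return np.memmap(path, dtype=np.float32, mode="w+", shape=(capacity, DIM))

def grow(old: np.memmap, path: str, new_capacity: int) -> np.memmap:
    """Reallocate: a memmap can't be resized in place, so copy into a
    bigger file. Note this briefly needs disk space for both files at
    once, which is the doubling cost mentioned above.
    """
    new = np.memmap(path + ".tmp", dtype=np.float32, mode="w+",
                    shape=(new_capacity, DIM))
    new[: old.shape[0]] = old[:]   # bulk copy of the existing vectors
    new.flush()
    return new
```

In practice you'd then swap the new file over the old one; allocating the final capacity up front skips all of this.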

Results

Again, pretty much all I have right now is a brute-force search over all of the existing vectors, so the results are unsurprisingly not all that impressive. I tested this with SIFT, because that seemed like as good a test as any for a vector database. It takes around 40ms to find the top k vectors in a database with a vector size of 128. It's the same for all k, because we're checking the distance for every vector every time anyway. We also have perfect recall, because we're checking every vector with the same metric that the evaluation set was created with: squared Euclidean distance.
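For reference, the whole search boils down to something like the following sketch (my own illustration, not TinyVec's exact code), using squared Euclidean distance just like the evaluation set:

```python
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Brute-force top-k by squared Euclidean distance.

    Checks every vector, which is why recall is perfect and why the
    query time barely changes with k.
    """
    diffs = vectors - query                       # broadcast over all rows
    dists = np.einsum("ij,ij->i", diffs, diffs)   # squared L2, no sqrt needed
    nearest = np.argpartition(dists, k)[:k]       # unordered k smallest
    return nearest[np.argsort(dists[nearest])]    # sort just those k

# e.g. SIFT-sized vectors:
db = np.random.rand(10_000, 128).astype(np.float32)
q = np.random.rand(128).astype(np.float32)
print(top_k(q, db, k=5))
```

Using np.argpartition means the per-query cost stays linear in the database size regardless of k, which lines up with the flat timings in the tables below.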
| Search Method | K | Recall | Time/Query (ms) | DB Size |
|---------------|---|--------|-----------------|---------|
| Brute Force   | 1 | 1.0    | 38              | 10000   |
| Brute Force   | 2 | 1.0    | 38              | 10000   |
| Brute Force   | 3 | 1.0    | 38              | 10000   |
| Brute Force   | 4 | 1.0    | 38              | 10000   |
| Brute Force   | 5 | 1.0    | 38              | 10000   |
With only 10,000 items it's a passable experience in terms of query time; most projects using this for basic RAG on personal datasets won't notice much lag when you compare 40ms to the minutes it could take to summarize a query with a large context. On the other hand, when you look at the more likely scenario of using this in production, it starts to fall apart very quickly...
| Search Method | K | Recall | Time/Query (ms) | DB Size |
|---------------|---|--------|-----------------|---------|
| Brute Force   | 1 | 1.0    | 7780            | 1000000 |
| Brute Force   | 2 | 1.0    | 7623            | 1000000 |
| Brute Force   | 3 | 1.0    | 7231            | 1000000 |
| Brute Force   | 4 | 1.0    | 7912            | 1000000 |
| Brute Force   | 5 | 1.0    | 7123            | 1000000 |
As you can see, it's essentially useless right now. If any search or storefront company takes 7 seconds per query, they're losing whatever customer was using their site almost immediately. Looking at the profiler for this call, we clearly spend most of our time comparing vectors.
So that's where I'll be looking to optimize next: not just the speed at which we do the comparisons (because I think numpy has a pretty good handle on that already), but mostly how many of them we have to do.