So I Wrote a Vector Database
Why and how I decided to do it.
There is usually a point when you're starting a new project where you need to solve a very specific problem, and you go looking at what's already out there to see if anything fits the bill. For me the project was running retrieval augmented generation (RAG) on a set of medical documents, and the specific problem was that none of the existing solutions I saw for LangChain fit my requirements: run from disk, be free, don't require me to run a service, and have a LangChain vector store adapter.[1] In essence I wanted a package with minimal dependencies that would not eat RAM faster than my browser does. There were some things that were free, but you had to self-host a service that wanted a lot of memory; other offerings used disk but had a large dependency tree.[2] They all serve a market of people doing this in some production (or production-adjacent) environment, where you're losing money for every millisecond it's trying to find the right vector to send back. All I wanted was something I could use to run some medical questions through the model-du-jour on Hugging Face and get some answers without nuking my video/system memory using FAISS.

1. Now that I've looked harder at the listings, I see a couple that could possibly have fulfilled what I wanted, but I'm happy with the exercise.
2. There are a few C++ libraries that could be used for this as well, but I think it would have taken me longer to build them and call them from Python than it did to write this project.
Another phenomenon I see somewhat more frequently among the technically inclined is the ominous phrase "I could make that". This belief, while likely responsible for many impactful innovations, is also one of the more damaging beliefs a middling engineer like myself can hold. Unfortunately, on this occasion I did speak that incantation, and a weekend vanished before my very eyes.
The Implementation
While it sounds somewhat advanced, a vector database, at least the way I've chosen to implement one, is rather simple; it's really just one big array. You tell it you want to add a vector, and it gives you back the index where it was added. It doesn't care what the vector means, nor does it care if you keep track of the index: you are responsible for keeping track of which vector corresponds to which piece of data. The complexity comes from how you choose to implement the search over those vectors. The current iteration of TinyVec is a Minimum Viable (non)Product of what I needed in a vector database, but it's built so that optimizations can be added in the future. I do hope to delve into some of the more advanced optimizations that new vector databases use to get their billion-dollar valuations, but for now you're getting the basics.
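To make that concrete, here is a minimal sketch of that interface. The names here (VectorStore, add) are hypothetical stand-ins, not TinyVec's actual API:

```python
import numpy as np

class VectorStore:
    """Illustrative sketch only -- not TinyVec's real interface."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, vector: np.ndarray) -> int:
        # Append one row and hand back its index; the caller owns the
        # mapping from index to whatever data the vector represents.
        self.vectors = np.vstack([self.vectors, vector.reshape(1, self.dim)])
        return len(self.vectors) - 1
```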
Vectors can be complex pieces of data. The original purpose of this implementation was to store and retrieve vectors somewhere in the neighborhood of 1000-2000 features, a pretty average size when it comes to language modeling. A few thousand arrays of 1000 items doesn't sound like a lot, but in reality the overhead can be pretty killer. Python in particular can be pretty bad, because everything is an object,[3] complete with a reference counter and all the other bells and whistles that let it be so beginner friendly.

3. Python has no primitive types. Go call dir() on an integer in Python and see what it says; I got 72 items for a single int variable.
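You can check this yourself in a REPL (exact numbers vary by Python version and build, so take these as ballpark figures):

```python
>>> len(dir(1))   # attributes hanging off a "primitive" int
72
>>> import sys
>>> sys.getsizeof(1)   # bytes for one small int on 64-bit CPython
28
```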
To avoid incurring this overhead on the ~BILLIONS~ of floats I wanted to store, the obvious choice was numpy: there are very few cases in which you'd want to use a vector database and not already have numpy installed, and many of the matrix math functions that are essential to a vector database have already been implemented by people who have a much stronger background in math than I do.
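A rough back-of-the-envelope comparison (sizes are approximate and CPython-specific, but the ratio is the point):

```python
import sys
import numpy as np

n = 1536  # a typical embedding size

# A plain Python list: n pointers, each to a full float object that
# carries a refcount and type pointer of its own.
as_list = [float(i) for i in range(n)]
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(f) for f in as_list)
print(list_bytes)  # roughly 48 KB on 64-bit CPython

# The same values as contiguous float32s in numpy.
as_array = np.arange(n, dtype=np.float32)
print(as_array.nbytes)  # 6144 bytes: 4 bytes per feature
```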
Continuing this trend of numpy efficiency, we can use the memmap() function[4] to keep the big vector array on disk and satisfy our low-memory requirement. All this does is keep the numbers in contiguous disk space and let you use them like any other numpy ndarray; indexing is done using file byte offsets. The major challenge here is that there is no clear way to resize an existing np.memmap, since it is contiguous disk space, so you have to do some special reallocation logic when you exceed the current size of your vector database. This could certainly be improved in the current TinyVec implementation, because it doubles the disk space requirement for your dataset; the best course of action is to pass in whatever size you'll need when you initially create the database, so that the disk space is only allocated once.

4. The bin files are so efficient, they don't even store their shape on disk. This means you need to know at least one of the dimensions ahead of time to figure out how to load the file.
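Here's roughly what that looks like; the function names and the growth policy are my own illustration of the approach, not TinyVec's exact code:

```python
import numpy as np

DIM = 1536  # must be known up front: the .bin file stores no shape metadata

def create_store(path: str, capacity: int) -> np.memmap:
    # "w+" creates (or truncates) a file of capacity * DIM float32s on disk.
    return np.memmap(path, dtype=np.float32, mode="w+", shape=(capacity, DIM))

def load_store(path: str) -> np.ndarray:
    # Loading back requires knowing DIM -- the flat buffer can't tell us.
    flat = np.memmap(path, dtype=np.float32, mode="r")
    return flat.reshape(-1, DIM)

def grow(old: np.memmap, new_path: str, new_capacity: int) -> np.memmap:
    # No in-place resize for contiguous disk space: write a bigger file
    # and copy the old rows over, briefly doubling the disk footprint.
    bigger = np.memmap(new_path, dtype=np.float32, mode="w+",
                       shape=(new_capacity, DIM))
    bigger[: old.shape[0]] = old[:]
    bigger.flush()
    return bigger
```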
Results
Again, pretty much all I have right now is a brute-force search over all of the existing vectors, so the results are unsurprisingly not all that impressive. I tested this with SIFT, because that seemed like as good a test as any for a vector database. It takes us around 40ms to find the top k vectors in a database with a vector size of 128. It's the same for all k because we're checking the distance for every vector every time anyway. We also have perfect recall, because we're checking every vector with the same metric the evaluation set was created with: squared euclidean distance.
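For reference, a brute-force scan like this can be written in a few lines of numpy; this is a sketch of the idea rather than TinyVec's actual search routine:

```python
import numpy as np

def search(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    # Squared euclidean distance from the query to every stored vector;
    # skipping the sqrt doesn't change the ranking.
    diff = vectors - query                     # broadcasts over all rows
    dists = np.einsum("ij,ij->i", diff, diff)
    # Partial selection of the k smallest is O(n); then sort just those k.
    top = np.argpartition(dists, k)[:k]
    return top[np.argsort(dists[top])]
```

Since every query touches every row, the cost is linear in database size and flat in k, which is exactly what the numbers below show.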
Method | K | Recall | Time/Query (ms) | DB Size |
--- | --- | --- | --- | --- |
Brute Force | 1 | 1.0 | 38 | 10000 |
Brute Force | 2 | 1.0 | 38 | 10000 |
Brute Force | 3 | 1.0 | 38 | 10000 |
Brute Force | 4 | 1.0 | 38 | 10000 |
Brute Force | 5 | 1.0 | 38 | 10000 |
With only 10,000 items it's a passable experience in terms of query time; most projects using this for basic RAG on personal datasets won't notice much lag when you compare 40ms to the minutes it could take to summarize a query with a large context. On the other hand, when you look at the more likely scenario of using this in production, it starts to fall apart very quickly...
Method | K | Recall | Time/Query (ms) | DB Size |
--- | --- | --- | --- | --- |
Brute Force | 1 | 1.0 | 7780 | 1000000 |
Brute Force | 2 | 1.0 | 7623 | 1000000 |
Brute Force | 3 | 1.0 | 7231 | 1000000 |
Brute Force | 4 | 1.0 | 7912 | 1000000 |
Brute Force | 5 | 1.0 | 7123 | 1000000 |
As you can see, it's essentially useless right now. If any search or storefront company takes 7 seconds for a query, they're losing whatever customer was using their site almost immediately. Looking at the profiler for this call, we clearly spend most of our time comparing vectors:
[Profiler output: nearly all of the query time is spent in the vector comparison routine.]
So that's where I'll be looking to optimize next: not just the speed at which we do the comparisons (I think numpy has a pretty good handle on that already), but mostly how many of them we have to do.