Vector Store with Chroma in Anvil

I thought I’d quickly show something which may come in handy for the community…

You CAN implement ChromaDB as a vector store directly in your Anvil app.

First, switch your Python version and add the following packages (you'll need both for this to work; pysqlite3-binary provides the newer sqlite3 that ChromaDB requires):

chromadb
pysqlite3-binary

Once that’s built, in a server module you’ll need the following import statements:

# ChromaDB needs a newer sqlite3 than the one bundled with the runtime,
# so swap in pysqlite3 before chromadb is imported
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

import chromadb
from chromadb.utils import embedding_functions
from chromadb.config import Settings

# an in-memory (ephemeral) client; allow_reset lets us wipe it later
chroma_client = chromadb.Client(Settings(allow_reset=True))

You can then easily create and manage your collections in your sessions, even using authenticated calls (with the require_user=True parameter on your server callables). In this example I'm using the user ID as the collection name for the duration of the session, to give me fine-grained control over the data going into a RAG model. It's fairly self-explanatory, noting that you have to get the formatting of the collection name right or you'll hit errors:
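For clarity, @authenticated_call below isn't an Anvil built-in; a minimal sketch, assuming it's just shorthand for a server callable that requires a logged-in user:

import anvil.server
import anvil.users

# assumed definition of the decorator used below: any call through it
# requires a logged-in user
authenticated_call = anvil.server.callable(require_user=True)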

@authenticated_call
def create_collection():
  raw_id = anvil.users.get_user().get_id()
  # Chroma is fussy about collection names, so strip the characters
  # an Anvil row ID contains that it won't accept
  sanitized_id = raw_id.replace("[", "").replace("]", "").replace(",", "_")
  collection_name = f"user_{sanitized_id}_session"
  existing_collection_names = [col.name for col in chroma_client.list_collections()]
  if collection_name not in existing_collection_names:
    chroma_client.create_collection(name=collection_name)

@authenticated_call
def delete_collection():
  raw_id = anvil.users.get_user().get_id()
  sanitized_id = raw_id.replace("[", "").replace("]", "").replace(",", "_")
  collection_name = f"user_{sanitized_id}_session"
  existing_collection_names = [col.name for col in chroma_client.list_collections()]
  if collection_name in existing_collection_names:
    chroma_client.delete_collection(name=collection_name)
    # note: reset() wipes the entire client store, not just this collection;
    # drop it if other users' collections should survive
    chroma_client.reset()
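To actually get data in and out of a collection, something along these lines should work; a minimal sketch, assuming Chroma's default embedding function and the session collection created above (add_documents and query_documents are illustrative names):

import uuid

@authenticated_call
def add_documents(texts):
  raw_id = anvil.users.get_user().get_id()
  sanitized_id = raw_id.replace("[", "").replace("]", "").replace(",", "_")
  collection = chroma_client.get_collection(name=f"user_{sanitized_id}_session")
  # Chroma embeds the documents with its default embedding function
  # unless you passed one explicitly when creating the collection
  collection.add(documents=texts,
                 ids=[str(uuid.uuid4()) for _ in texts])

@authenticated_call
def query_documents(question, n_results=3):
  raw_id = anvil.users.get_user().get_id()
  sanitized_id = raw_id.replace("[", "").replace("]", "").replace(",", "_")
  collection = chroma_client.get_collection(name=f"user_{sanitized_id}_session")
  # returns the n most similar stored documents for the question
  results = collection.query(query_texts=[question], n_results=n_results)
  return results["documents"][0]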

And that’s it really.

The Chroma docs are decent, but this guide is pretty damn nifty.

I’m using it with Llama-index and Ollama, and it’s ludicrously easy to get to grips with.
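For anyone curious what that wiring looks like, here's a rough sketch, assuming llama-index 0.9.x (the import paths have moved between versions), a local Ollama model, sentence-transformers installed for local embeddings, and docs already loaded as llama-index documents:

from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.llms import Ollama
from llama_index.vector_stores import ChromaVectorStore

collection = chroma_client.get_or_create_collection("user_docs")  # illustrative name
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# embed_model="local" keeps embeddings local; Ollama serves the LLM
service_context = ServiceContext.from_defaults(llm=Ollama(model="llama2"),
                                               embed_model="local")
index = VectorStoreIndex.from_documents(docs,
                                        storage_context=storage_context,
                                        service_context=service_context)
answer = index.as_query_engine().query("What do my documents say about X?")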

Hopefully knowing that we can do this without the need for Pinecone or other black-box dependencies will also come in useful to someone else.


Where is the data stored?

In the config I've got up and running, it's non-persistent, and that suits me. It only exists for the session, in memory (as the Chroma docs say, "By default data stored in Chroma is ephemeral"). You can set it to a path and persist it, but that's not my use case. I'm guessing you could set a file path, but I haven't tried it.

Awesome! I’m assuming you’re storing the Chroma data as a media object or using the files service.

I got chromadb working last summer but then an update with sqlite broke it - looks like you figured it out.

I also got FAISS working, also storing as a media object.

I still went with Pinecone for speed.


I’m fully ephemeral right now, but I’m doing something pretty wildly experimental.

I was reading through your awesome chat bot and probably have some questions about FAISS and how you set the file path…

If you take a look at the Chroma docs and work out a way we can pass a file path, the community will be able to persist it. It's beyond me right now as I'm tired :sleeping:

Edit: it's something along the lines of adding a directory to Files and then passing that as the location to persist to, but I'm Zzzz.

I abandoned Chroma a while ago, but FAISS is simpler because the index is a single file, so it was just a pickled LangChain document (list) object:

    import pickle

    # save: pickle the index and store it as a media object on a data table row
    media_obj = anvil.BlobMedia('text/plain', pickle.dumps(faiss_index))
    some_row_reference['faiss'] = media_obj

    # to load it back:
    faiss_index = pickle.loads(some_row_reference['faiss'].get_bytes())

I remember Chroma being a little trickier as it stores data in a directory with multiple files, so I used the files service at the time. I didn't do anything with editing the chromadb from within Anvil, though.

Come to think of it, the same pickling strategy might work with Chroma…
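A rough, untested sketch of that idea: you can't pickle the client itself, but you can pull a collection's contents out with get() and re-add them later (the row reference and collection name are placeholders):

    import pickle

    # save: dump everything in the collection, including the embeddings,
    # so nothing has to be re-embedded on restore
    dump = collection.get(include=["documents", "metadatas", "embeddings"])
    media_obj = anvil.BlobMedia('application/octet-stream', pickle.dumps(dump))
    some_row_reference['chroma'] = media_obj

    # load: recreate the collection and re-add the saved contents
    dump = pickle.loads(some_row_reference['chroma'].get_bytes())
    restored = chroma_client.get_or_create_collection("restored_session")
    restored.add(ids=dump["ids"],
                 documents=dump["documents"],
                 metadatas=dump["metadatas"],
                 embeddings=dump["embeddings"])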


I’m sure this bit from the Anvil docs is the answer:


You can use directories you’ve uploaded to the Data Files service in Python code.

For example, you can list all the files in a directory by passing the scandir() function the path of your directory. The following example displays all files in a directory path that don’t start with ‘.’.

import os
import anvil.server
from anvil.files import data_files

@anvil.server.callable
def list_files_in_directory():
    # Get the path of my Data Files directory
    my_directory_path = data_files['my_directory']

    with os.scandir(my_directory_path) as directory:
        for file in directory:
            if not file.name.startswith('.') and file.is_file():
                print(file.name)

And then probably just pass data_files['my_directory'] into the Chroma settings:


client = chromadb.Client(Settings(
    persist_directory=data_files['my_directory']
))

*Totally untested

I’ll give this a more thorough test, as I’m not convinced…yet…

But if you add the Files service, and add an empty file under the path 'vectorstore', this seems to be all you need to do:

chroma_client = chromadb.Client(Settings(allow_reset=True,
                                         persist_directory=data_files['vectorstore']))

I’ve no doubt this will go wrong!

The difference between using the file service vs using an actual file is that an actual file can be shared by two concurrent requests, while the file service can’t.

The first line of code in your post mentions sqlite. An SQLite database is stored in a single file and can be accessed by multiple concurrent connections.

This would work with an actual file, but it would be impossible using the file service.

If your use case only uses the database file in read-only mode, or if there is only one database file per client and you are sure the same user doesn't execute a second request before the first one has finished its job, then the file service will work.

Having a persistent server could help, because the server could use a file stored in the file system and use the file service as a backup (see the sketch after this list):

  1. If the file exists in the file system, do your job, when you are done, store it in the file service
  2. If the file does not exist in the file system, first copy it there from the file service, then go to point 1

… but it is possible that I don’t know the full power of the file service, and the limitations described above don’t really exist.
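A very loose sketch of that idea; write_to_data_files is a hypothetical helper (as far as I know, writing back into Data Files isn't a documented API), and the paths are made up:

import os
import shutil
from anvil.files import data_files

LOCAL_PATH = "/tmp/chroma.sqlite3"  # made-up local working path

def ensure_local_copy():
    # point 2: if the file isn't in the local file system yet, seed it from Data Files
    if not os.path.exists(LOCAL_PATH):
        shutil.copy(data_files['chroma.sqlite3'], LOCAL_PATH)

def backup_to_data_files():
    # point 1: after doing your job against LOCAL_PATH, store it back as the backup
    write_to_data_files('chroma.sqlite3', LOCAL_PATH)  # hypothetical helper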

Someone else was using langchain and had a problem with chromadb a few months ago:

…might be useful, but I’m not entirely sure. I wrote some throwaway concepts about using ephemeral random tempfile names so that two users would not be using the same DB if you did not want them to.

And yes, it looks like you already figured out that the directories in Data Files do not exist unless there is a path with a file in it. It uses a column of paths in a table to make the directories, so if no file exists, no directory exists either.

The clone in this post:

has some server module code that lets you write new files directly to the Data Files service, if they still haven't built that into the Anvil server (it was marked TODO as of a year ago).


Great stuff. That’s mega helpful. As I’ve said, ephemeral suits me fine for the use case - and is working a treat - but having the options out there may help others with different use cases.

Thankfully, persistence is not the goal in my use case and it's working a charm in memory. Hopefully the hive mind will find a solution to persistence at some point in the future.

Well, off the top of my head, the temp directory is shared between server modules, I think? I'm not sure, but AFAIK it's "shared" but ephemeral at the app level.

So, without any testing, one would think you could have a background task launch, make a copy of a .chroma file from the Data Files service into a folder in the /tmp directory, and... well... wait... for like an hour. After an hour of the background task running, copy the temp file from /tmp back to the Data Files service* and have the task exit. Then throw the task in a scheduler to run every hour.

*(IRL I would probably do a hash check on the files to see if there are even any changes before doing the read/write op, but that's just me. I would also throw the timer function into a try/finally block to make sure it always writes out even if there was an exception, and I would probably run the background task more frequently, having it abort if it was already running.)

This would probably keep your "not permanent file" at least pseudo-persistent.
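A loose sketch of that scheme; @anvil.server.background_task is real Anvil API, but write_to_data_files is the same hypothetical write-back helper as above, and the paths and interval are illustrative:

import hashlib
import shutil
import time
import anvil.server
from anvil.files import data_files

LOCAL_PATH = "/tmp/chroma.sqlite3"  # made-up working-copy location

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

@anvil.server.background_task
def sync_vectorstore():
    # seed the working copy from Data Files, then let the app use it
    shutil.copy(data_files['chroma.sqlite3'], LOCAL_PATH)
    start_hash = file_hash(LOCAL_PATH)
    try:
        time.sleep(3600)  # ...well... wait... for like an hour
    finally:
        # try/finally so it always writes out, plus a hash check so it
        # only writes back if the file actually changed
        if file_hash(LOCAL_PATH) != start_hash:
            write_to_data_files('chroma.sqlite3', LOCAL_PATH)  # hypothetical helper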


A background task that sleeps for an hour wouldn't be too heavy; there would be no need to abort it if one is already running.

This would be similar to using a timer in the client to keep the session alive past the 30 minute timeout.

I don't think you could rely on two concurrent requests working on the same file, even if the file already exists, for example because the load balancer may decide to redirect your next call to another server. (I don't know anything about the load balancer logic; I'm just speculating here.)

… but your trick could work well enough if you have a dedicated server (which I have), because you know that all your requests are handled on the same machine.

Interesting…
:thinking:
