Frequently Asked Questions and Commonly Encountered Issues¶
This section provides answers to frequently asked questions and information on commonly encountered problem when working with Chroma. These information below is based on interactions with the Chroma community.
404 Answer Not Found
If you have a question that is not answered here, please reach out to us on our Discord @taz or GitHub Issues
Frequently Asked Questions¶
Distances and Similarity¶
Chroma uses distance metrics to measure how dissimilar a result is from a query. A distance of 0 indicates that the two items are identical, while larger distances indicate greater dissimilarity. This approach starts at 0 and increases upward, aligning with the intuitive notion of distance.
In contrast, similarity metrics measure how similar two items are, often on a scale where higher values represent greater similarity. For example:
- Cosine Similarity ranges from -1 to 1, where:
- 1 indicates identical orientation (maximum similarity),
- 0 indicates orthogonality (no similarity),
- -1 indicates opposite orientation (maximum dissimilarity).
- Dot Product can range from negative to positive infinity, depending on the vectors' magnitudes and directions. When vectors are normalized and non-negative, the dot product ranges from 0 to 1.
Why Does Chroma Use Distance Metrics?¶
Chroma uses distance metrics because they provide a straightforward way to quantify dissimilarity between vectors. Consistency is another key aspect - irrespecitve of the distance metric used, distance results follow the same conceptual framework. Finally, distance is is an intuitive metric which makes Chroma more accessible to a wider audience.
What does Chroma use to index embedding vectors?¶
Chroma uses its own fork of HNSW lib for indexing and searching embeddings. In addition to HNSW, Chroma also uses a Brute Force index, which acts as a buffer (prior to updating the HNSW graph) and performs exhaustive search using the same distance metric as the HNSW index.
Alternative Questions:
- What library does Chroma use for vector index and search?
- What algorithm does Chroma use for vector search?
How to set dimensionality of my collections?¶
When creating a collection, its dimensionality is determined by the dimensionality of the first embedding added to it. Once the dimensionality is set, it cannot be changed. Therefore, it is important to consistently use embeddings of the same dimensionality when adding or querying a collection.
Example:
import chromadb
client = chromadb.Client()
collection = client.create_collection("name") # dimensionality is not set yet
# add an embedding to the collection
collection.add(ids=["id1"], embeddings=[[1, 2, 3]]) # dimensionality is set to 3
Alternative Questions:
- Can I change the dimensionality of a collection?
Can I use transformers
models with Chroma?¶
Generally, yes you can use transformers
models with Chroma. Although Chroma does not provide a wrapper for this, you
can use SentenceTransformerEmbeddingFunction
to achieve the same result. The sentence-transformer library will
implicitly do mean-pooling on the last hidden layer, and you'll get a warning about
it - No sentence-transformers model found with name [model name]. Creating a new one with MEAN pooling.
Example:
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
ef = SentenceTransformerEmbeddingFunction(model_name="FacebookAI/xlm-roberta-large-finetuned-conll03-english")
print(ef(["test"]))
Warning
Not all models will work with the above method. Also mean pooling may not be the best strategy for the model.
Read the model card and try to understand what if any pooling the creators recommend. You may also want to normalize
the embeddings before adding them to Chroma (pass normalize_embeddings=True
to the SentenceTransformerEmbeddingFunction
EF constructor).
Should I store my documents in Chroma?¶
Note: This applies to Chroma single-node and local embedded clients. (Chroma version ca. 0.5.x)
Chroma allows users to store both embeddings and documents, alongside metadata, in collections. Documents and metadata are both optional and depending on your use case you may choose to store them in Chroma or externally, or not at all.
Here are some pros/cons to help you decide whether to store your documents in Chroma:
Pros:
- Keeps all the data in the same place. You don't have to manage a separate DB for the documents
- Allows you to do keyword searches on the documents
Cons:
- The database can grow substantially in size because documents are effectively duplicated - once for storing them as metadata for queries and another for the FTS5 index.
- Queries performance hit
"Dude, where's my data?"¶
If you are new to Chroma, you might be asking yourself: "Where is my data been stored?". As, per usual, the answer is: "It depends".
Generally Chroma uses PERSIST_DIRECTORY
to store the data, but when running in CLI mode, this is overridden by the CLI itself.
- Running in CLI mode (
--path
is not specified) data is stored in the./chroma_data
directory. - Running in Jupyter notebook, Colab or directly using
PersistentClient
(unlesspath
is specified or env varPERSIST_DIRECTORY
is set), data is stored in the./chroma
directory. - Running with docker compose (from source repo), the data is stored in docker volume named
chroma-data
(unless an explicit volume binding is specified) - Running with
docker run
(no volume binding with-v
) the data is stored in the container and is lost ☠️ when the container is removed.
In all other cases where env var, parameter or binding is specified, the data is stored in your specified directory.
Commonly Encountered Problems¶
Collection Dimensionality Mismatch¶
Symptoms:
This error usually exhibits in the following error message:
chromadb.errors.InvalidDimensionException: Embedding dimension XXX does not match collection dimensionality YYY
Context:
When adding/upserting or querying Chroma collection. This error is more visible/pronounced when using the Python APIs, but will also show up in also surface in other clients.
Cause:
You are trying to add or query a collection with vectors of a different dimensionality than the collection was created with.
Explanation/Solution:
When you first create a collection client.create_collection("name")
, the collection will not have knowledge of its
dimensionality so that allows you to add vectors of any dimensionality to it. However, once your first batch of
embeddings is added to the collection, the collection will be locked to that dimensionality. Any subsequent query or
add operation must use embeddings of the same dimensionality. The dimensionality of the embeddings is a characteristic
of the embedding model (EmbeddingFunction) used to generate the embeddings, therefore it is important to consistently
use the same EmbeddingFunction when adding or querying a collection.
Tip
If you do not specify an embedding_function
when creating (client.create_collection
) or getting
(client.get_or_create_collection
) a collection, Chroma wil use its default embedding function.
Large Distances in Search Results¶
Symptoms:
When querying a collection, you get results that are in the 10s or 100s.
Context:
Frequently when using you own embedding function.
Cause:
The embeddings are not normalized.
Explanation/Solution:
L2
(Euclidean distance) and IP
(inner product) distance metrics are sensitive to the magnitude of the vectors.
Chroma uses L2
by
default. Therefore, it is recommended to normalize the embeddings before adding them to Chroma.
Here is an example how to normalize embeddings using L2 norm:
import numpy as np
def normalize_L2(vector):
"""Normalizes a vector to unit length using L2 norm."""
norm = np.linalg.norm(vector)
if norm == 0:
return vector
return vector / norm
OperationalError: no such column: collections.topic
¶
Symptoms:
The error OperationalError: no such column: collections.topic
is raised when trying to access Chroma locally or
remotely.
Context:
After upgrading to Chroma 0.5.0
or accessing your Chroma persistent data with Chroma client version 0.5.0
.
Cause:
In version 0.5.x
Chroma has made some SQLite3 schema changes that are not backwards compatible with the previous
versions. Once you access your persistent data on the server or locally with the new Chroma version it will
automatically migrate to the new schema. This operation is not reversible.
Explanation/Solution:
To resolve this issue you will need to upgrade all your clients accessing the Chroma data to version 0.5.x
.
Here's a link to the migration performed by Chroma - https://github.com/chroma-core/chroma/blob/main/chromadb/migrations/sysdb/00005-remove-topic.sqlite.sql
sqlite3.OperationalError: database or disk is full
¶
Symptoms:
The error sqlite3.OperationalError: database or disk is full
is raised when trying to access Chroma locally or
remotely. The error can occur in any of the Chroma API calls.
Context:
There are two contexts in which this error can occur:
- When the persistent disk space is full or the disk quota is reached - This is where your
PERSIST_DIRECTORY
points to. - When there is not enough space in the temporary director - frequently
/tmp
on your system or container.
Cause:
When inserting new data and your Chroma persistent disk space is full or the disk quota is reached, the database will not be able to write metadata to SQLite3 db thus raising the error.
When performing large queries or multiple concurrent queries, the temporary disk space may be exhausted.
Explanation/Solution:
To work around the first issue, you can increase the disk space or clean up the disk space. To work around the second
issue, you can increase the temporary disk space (works fine for containers but might be a problem for VMs) or point
SQLite3 to a different temporary directory by using SQLITE_TMPDIR
environment variable.
SQLite Temp File
More information on how sqlite3 uses temp files can be found here.
RuntimeError: Chroma is running in http-only client mode, and can only be run with 'chromadb.api.fastapi.FastAPI'
¶
Symptoms and Context:
The following error is raised when trying to create a new PersistentClient
, EphemeralClient
, or Client
:
RuntimeError: Chroma is running in http-only client mode, and can only be run with 'chromadb.api.fastapi.FastAPI'
as the chroma_api_impl. see https://docs.trychroma.com/usage-guide?lang=py#using-the-python-http-only-client for more information.
Cause:
There are two possible causes for this error:
chromadb-client
is installed and you are trying to work with a local client.- Dependency conflict with
chromadb-client
andchromadb
packages.
Explanation/Solution:
Chroma (python) comes in two packages - chromadb
and chromadb-client
. The chromadb-client
package is used to
interact with
a remote Chroma server. If you are trying to work with a local client, you should use the chromadb
package. If you are
planning to interact with remote server only it is recommended to use the chromadb-client
package.
If you intend to work locally with Chroma (e.g. embed in your app) then we suggest that you uninstall
the chromadb-client
package and install the chromadb
package.
To check which package you have installed:
To uninstall the chromadb-client
package:
Working with virtual environments
It is recommended to work with virtual environments to avoid dependency conflicts. To create a virtual environment you can use the following snippet:
pip install virtualenv
python -m venv myenv
source myenv/bin/activate
pip install chromadb # and other packages you need
conda
or poetry
to manage your environments.
Default Embedding Function
Default embedding function - chromadb.utils.embedding_functions.DefaultEmbeddingFunction
- can only be used with
chromadb
package.
ValueError: You must provide an embedding function to compute embeddings
¶
Symptoms and Context:
The
error ValueError: You must provide an embedding function to compute embeddings.https://docs.trychroma.com/embeddings"
is frequently raised when trying to add embeddings to a collection using Python thin client (chromadb-client
package).
Cause:
To reduce the size of the chromadb-client
package the default embedding function which requires onnxruntime
package
is not included and is instead aliased to None
.
Explanation/Solution:
To resolve this issue you must always provide an embedding function when you call get_collection
or get_or_create_collection
methods to provide the Http client with the necessary information to compute embeddings.
Adding documents is slow¶
Symptoms:
Adding documents to Chroma appears slow.
Context:
You've tried adding documents to a collection using the add()
or upsert()
methods.
Cause:
There are several reasons why the addition may be slow:
- Very large batches
- Slow embeddings
- Slow network
Let's break down each of the factors.
Very large batches
If you are trying to add 1000s or even 10,000s of documents at once and depending on how much data is already in your collection Chroma (specifically the HNSW graph updates) can become a bottleneck.
To debug if this is the case you can reduce the size of the batch and see if the operation is faster. You can also check how many records are in the collection with count()
method.
Slow embeddings
This is the most common reason for slow addition. Some embedding functions are slower than others. To debug this you can try the following example by adjusting the embeding function to your own. What the code tests is how much it takes to compute the embedings and then to add them to the collection in separate steps such that each can be measured independenty.
from chromadb.utils import embedding_functions
import chromadb
import uuid
list_of_sentences = ["Hello world!", "How are you?"] # this should be your list of documents to add
# change the below EF definition to match your embedding function
default_ef = embedding_functions.DefaultEmbeddingFunction()
start_time = time.perf_counter()
embeddings=default_ef(list_of_sentences)
end_time = time.perf_counter()
print(f"Embedding time: {end_time - start_time}")
client = chromadb.PersistentClient(path="my_chroma_data")
collection = client.get_or_create_collection("my_collection")
start_time = time.perf_counter()
# this will add your documents and the generated embeddings without Chroma doing the embedding for you internally
collection.add(ids=[f"{uuid.uuid4()}" for _ in range(len(list_of_sentences))],documents=list_of_sentences, embeddings=embeddings)
end_time = time.perf_counter()
print(f"Chroma add time: {end_time - start_time}")
Slow network
If you are adding documents to a remote Chroma the network speed can become a bottleneck. To debug this you can with a local PersistentClient
and see if the operation is faster.