I’m trying to build a web app that calculates embedding vectors from a document. Suppose my Python program produces a dataframe with 100 rows and 1,536 columns. The values are floating-point numbers, e.g. 0.002441. How can I store this in an Anvil data table? Adding the column names to a data table manually, one by one, would be very tedious. Is there any way to create these columns automatically and store the data? Thank you!
Use one simple object column and store the whole dataframe in one row, or in 100 rows.
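A minimal sketch of the "100 rows" variant, assuming each embedding is kept as a plain list of floats in a single simple object column. The table name `embeddings` and the column names `doc_id`, `chunk_index` and `embedding` are made up for illustration, and the Anvil call is left in comments so the snippet runs outside Anvil. With a real dataframe, `df.values.tolist()` should give the same list-of-lists shape as `rows` here.

```python
import json

def vectors_to_records(rows, doc_id):
    """Build one JSON-friendly record per embedding vector."""
    records = []
    for index, vector in enumerate(rows):
        records.append({
            "doc_id": doc_id,            # hypothetical text column
            "chunk_index": index,        # hypothetical number column
            # A simple object column holds JSON-style data, so use plain floats.
            "embedding": [float(x) for x in vector],
        })
    return records

# Stand-in for the real 100 x 1,536 data (2 x 3 here to keep it readable).
rows = [[0.002441, -0.013, 0.5], [0.1, 0.2, 0.3]]
records = vectors_to_records(rows, doc_id="doc-1")

# Sanity check: everything must survive a JSON round trip.
round_tripped = json.loads(json.dumps(records))

# In an Anvil server module this would become something like the following
# (assumes a table called `embeddings` with the three columns named above):
#
#   from anvil.tables import app_tables
#   for record in records:
#       app_tables.embeddings.add_row(**record)
```

For the "one row" variant, the whole `rows` list could go into a single simple object cell instead.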
Some related info: on the left-hand side of the IDE there is a configuration button that lets you set Data Tables to auto-create missing columns.
I agree, you presented it as related info, not as a solution.
If the question were about a smaller number of columns and the columns had sensible names, then automatic creation would make sense.
But we are talking about 1,536 columns, and I don’t think they even have names.
It would be an interesting experiment to enable that checkbox, create a table with 1,536 columns and 100 rows, then open the table editor in the IDE.
For perspective, SQL Server allows 2^10 (1,024) columns per table. 1,536 is 50% higher than that.
With this number of columns, I suspect that the column number itself is a datum.
I don’t know the details of how tables are managed, but I wouldn’t be surprised to find out that what Anvil sees as a table with 50 columns is, for Postgres, a table with a handful of columns.
Just speculating.
Maybe one day I will look at the open source code and figure it out.
Maybe.
As long as we’re speculating…
If most of the matrix values are zero (e.g., a sparse correlation matrix), and the column number is really a datum, then perhaps it’s best to store only the non-zero values, keyed by their position (row, column) in the matrix.
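To make the sparse idea concrete, here is a rough sketch in plain Python. It only pays off if most values really are zero. The helper names `to_sparse` and `to_dense` are made up for this example.

```python
def to_sparse(matrix, tol=0.0):
    """Keep only entries whose absolute value exceeds tol, as (row, col, value)."""
    entries = []
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if abs(value) > tol:
                entries.append((r, c, value))
    return entries

def to_dense(entries, n_rows, n_cols):
    """Rebuild the full matrix; positions not listed are zero."""
    matrix = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, value in entries:
        matrix[r][c] = value
    return matrix

dense = [
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
    [1.25, 0.0, -2.0],
]
entries = to_sparse(dense)          # three triples instead of nine cells
restored = to_dense(entries, 3, 3)  # same values as `dense`
```

Each triple could then be one data table row with three ordinary columns (row, column, value), so no wide table is needed.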
I would recommend a vector store instead. Pinecone is beginner friendly and fully managed. Chroma is beginner friendly too, and can probably be stored in your app as a directory using the files service. I haven’t tried FAISS, but it seems somewhat popular. I heard MongoDB just came out with something for vector search, which is also fully managed.
Relational databases are not optimized for vector embedding storage and vector search. I’d still use the data tables to store some info about your documents, just not the embeddings.
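For anyone wondering what "vector search" means in practice, here is a brute-force illustration in plain Python: score every stored embedding against a query by cosine similarity and keep the best matches. This is only a sketch of the concept, not how Pinecone, Chroma or FAISS work internally; dedicated stores use indexes built for this job. The function names are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=3):
    """Return (index, score) pairs for the k most similar stored vectors."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]

# Tiny 2-dimensional stand-ins for real 1,536-dimensional embeddings.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.1], vectors, k=2)
best_indices = [i for i, _ in results]
```

The indices returned would then be used to look up the document info kept in the data tables.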