I’ve been implementing UUID4s as primary keys and started wondering: what is the best way to store them?
So I created a table with three variants of the UUID:
UUID.int, 39 digits
column type: number
example: 112318701195841020876145123532031626108
UUID.hex, 32 hex characters
column type: text
example: 547fcb84c71041f6a85b66336efb8b7c
shortuuid, a 22-character string using an expanded alphanumeric alphabet
column type: text
example: H3x7Cxt7DvJkYjUoUAKiMw
I created 2,500 rows of UUIDs; each row carried the same UUID encoded in the three different variants.
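For reference, the three encodings can be derived from a single UUID4 roughly like this (a minimal sketch using the stdlib uuid module and the shortuuid package; the actual table insert is omitted because it depends on the data-tables API):

```python
import uuid
import shortuuid

# One UUID4, encoded the three ways described above
u = uuid.uuid4()

as_int = u.int                  # 128-bit integer, up to 39 decimal digits
as_hex = u.hex                  # 32 lowercase hex characters, no dashes
as_short = shortuuid.encode(u)  # 22-character string in shortuuid's alphabet

print(as_int)
print(as_hex)
print(as_short)
```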
I then tested the speed of .get() on random samples of the UUID variants. I ran the tests across multiple background tasks and periodically refreshed the table with new UUIDs.
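A rough sketch of that kind of timing loop is below; `table.get(**{column: value})` is a stand-in for whatever single-row lookup the data layer provides, not a specific library call:

```python
import random
import statistics
import time

def time_gets(table, column, values, n=1000):
    """Time n random single-row lookups on `column`; return (mean, stdev) in seconds.

    `table.get(**{column: value})` is a placeholder for the real lookup call.
    """
    samples = []
    for value in random.choices(values, k=n):
        start = time.perf_counter()
        table.get(**{column: value})
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)
```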
Here are the results, as mean +/- standard deviation of .get() time in seconds:
get by int: 5.07e-02 +/- 1.49e-02, n=30000
get by str: 4.96e-02 +/- 1.16e-02, n=30000
short_uuid: 4.69e-02 +/- 1.30e-02, n=30000
I was surprised that int wasn’t the clear winner on retrieval speed.
Caution: Number column values are stored as Python floats (C/C++ doubles).
This gives them at most 53 bits of precision; anything beyond that is rounded away in the low-order bits.
That is, integers larger than 9,007,199,254,740,991 (2^53 - 1) generally cannot be stored exactly. A 15-digit integer will surely be preserved; with more digits, you may accidentally find the wrong value. E.g., if you stored a key of
112318701195841020876145123532031626108
but went looking for
112318701195841020876145123532031626107
then you’d find it, even though it’s not supposed to be there.
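You can see the collision in plain Python: both 39-digit keys collapse to the same 64-bit float, so a lookup on either value would match the stored row.

```python
a = 112318701195841020876145123532031626108
b = 112318701195841020876145123532031626107

# A double has only 53 bits of mantissa, so both integers round to the
# same float and become indistinguishable once stored in a number column.
print(float(a) == float(b))  # True
print(a > 2**53 - 1)         # True: well past the exact-integer range
```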
I wouldn’t expect to see any difference. When you have an index, the type or format of the key doesn’t make much difference.
The time you are measuring is mostly Python overhead, Postgres setting up the query, disk reads, and data transfer back and forth; perhaps 0.1% of it is spent comparing values in the index.
Perhaps the length of the string is what makes the biggest difference with millions of rows and many searches, because all the caches involved start filling up (but if you have an index, you will never read millions of values). More likely, longer strings will slow down the whole query because you are transferring more data (but if your query returns just a few rows, it makes no difference).
I would just make sure the index is doing its job. Anything more than that is overoptimization.
@p.colbert, I guess +1 for strings. FWIW, I did test round trips between the different storage values in the table and, at least for my sample, they all translated back to the same UUID object.
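A round-trip check of that kind can be sketched like this (stdlib uuid constructors plus shortuuid.decode; reading the stored values back out of the table is left out):

```python
import uuid
import shortuuid

original = uuid.uuid4()

# Rebuild the UUID from each of the three stored representations
from_int = uuid.UUID(int=original.int)
from_hex = uuid.UUID(hex=original.hex)
from_short = shortuuid.decode(shortuuid.encode(original))

assert original == from_int == from_hex == from_short
```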
@stefano.menci I didn’t expect to see much difference either; I was just curious. But you are right, I should be getting back to more pressing matters.