
I have a 64-bit integer timestamp and a String username that need to be combined into one string and eventually stored in a database column. Leaving aside why I can't store them in separate columns with appropriate types, my question is how to combine them to get better performance from the underlying database. That would be SQLite, PostgreSQL or MySQL; not sure yet.

I imagine they would use B-trees as indexes, and that it would be bad to concatenate as (timestamp-username) because the timestamp generally always increases, so the tree would need rebalancing often. username-timestamp should be much better, but each user's set of records would still grow with every new entry. I was also thinking of storing the timestamp with its bits in reverse order.

Is there anything else I can do? Some clever XOR or whatever? What would be a reasonably good scheme? The data will only ever be accessed by requesting the exact generated string; no ranges and such.

The only requirement is relatively fast conversion between the generated string and the source data, in both directions.
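The username-plus-bit-reversed-timestamp idea from the question could be sketched like this (a minimal illustration; the function names and the `username-hex` key layout are my own assumptions, not an established format):

```python
# Sketch: combine a username and a 64-bit timestamp into one string key,
# reversing the timestamp's bit order so new keys don't all land at the
# right edge of the index. Bit reversal is self-inverse, so decoding is cheap.

def reverse_bits64(n: int) -> int:
    """Reverse the bit order of a 64-bit unsigned integer."""
    result = 0
    for _ in range(64):
        result = (result << 1) | (n & 1)
        n >>= 1
    return result

def encode(username: str, timestamp: int) -> str:
    # Fixed-width hex keeps the key unambiguous and easy to split again.
    return f"{username}-{reverse_bits64(timestamp):016x}"

def decode(key: str) -> tuple[str, int]:
    # rpartition splits on the LAST hyphen, so usernames containing '-' survive.
    username, _, hex_part = key.rpartition("-")
    return username, reverse_bits64(int(hex_part, 16))
```

A round trip (`decode(encode(u, t))`) returns the original pair, which satisfies the "fast conversion in both directions" requirement.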

UPDATE: Please, guys, I'm looking for information about what kind of string would be better to store as a primary key in a database (one of SQLite, MySQL and PostgreSQL). Perhaps the answer is that it doesn't matter, or that it depends on the DB engine. I don't have a particular problem with the schema I'm using or with the caching solution. I'm just asking if there is any room for improvement, and how. I'd appreciate some on-topic answers.

UPDATE2: Great answers, but still not definitive for me: does an ever-incrementing column make the B-tree index on that column unbalanced? https://stackoverflow.com/a/2362693/520567

akostadinov
  • you said nevermind - but can you 'also' split the values to use as keys along side this concatenated version? – Randy Mar 27 '13 at 21:24
  • "The only requirements are to have relatively fast conversion between the generated string and source data in both ways." --- store it as 2 columns. In this case the conversion would take 0ms. Does it count as a solution? – zerkms Mar 27 '13 at 21:26
  • Store them in separate columns with the appropriate types. Really. – wildplasser Mar 27 '13 at 21:27
  • @Randy and others, I guess I have to say why I can't store them separately - it is a cache that uses a DB to offload memory, and it stores only strings. So I have no control over the DB schema or anything else. I only have control over the string I pass to the caching solution. – akostadinov Mar 27 '13 at 21:45
  • you can add a trigger that splits them when they are inserted... – Randy Mar 28 '13 at 12:09

2 Answers


There is a contradiction in your question: you specify you can't split them and store them in separate columns, but then you talk about indexing both parts separately - you can't do that without splitting them.

I can see you really have two choices:

  1. Storing them in separate columns
  2. Hashing the output to lower the index memory footprint

Ideally you should store them in two columns and create a composite index if you will always search for them together in the same order. It's hard to give accurate advice without more information; however, generally username, timestamp would make logical sense if you query per user, or the reverse if you query by timestamp. You could also create an index on each column if you need to search on one or the other.
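The two-column option could be sketched with Python's sqlite3 module (table and column names here are made up for illustration):

```python
# Sketch of option 1: separate columns plus a composite index, using sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (username TEXT, ts INTEGER, payload TEXT)")
# Composite index: useful when lookups always supply username
# (and optionally ts); the leftmost column must appear in the WHERE clause.
conn.execute("CREATE INDEX idx_user_ts ON events (username, ts)")

conn.execute("INSERT INTO events VALUES (?, ?, ?)", ("alice", 1364420664, "x"))
row = conn.execute(
    "SELECT payload FROM events WHERE username = ? AND ts = ?",
    ("alice", 1364420664),
).fetchone()
print(row[0])  # x
```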

Hashing your generated string

INSERT INTO table (crc_hash_column, value_column_name)
values (CRC32(@generated_value), @generated_value)

would reduce the size to a 32-bit integer (only 4 bytes of index per row), much smaller than the equivalent VARCHAR or CHAR index space required.

If you take this approach you should take measures to handle collisions; due to the Birthday Paradox they will happen, and become more likely as your dataset grows. Even with collisions, the extra filtering will still yield greater performance than the alternatives, given the smaller size of the index.

SELECT * FROM table
WHERE crc_hash_column = CRC32(@search_value) 
AND value_column_name = @searchvalue

Using the hash costs a few more CPU cycles, but a CRC32 hash is very quick, so even though you have to rehash on each search, this extra work is tiny compared with the benefit of indexing far less data.
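The hash-column approach above can be sketched with `zlib.crc32` and sqlite3 (the schema names are illustrative assumptions):

```python
# Sketch: index a 4-byte CRC32 of the generated string instead of the
# string itself, then re-check the full value to rule out collisions.
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (crc INTEGER, value TEXT)")
conn.execute("CREATE INDEX idx_crc ON t (crc)")

def insert(value: str) -> None:
    conn.execute("INSERT INTO t VALUES (?, ?)",
                 (zlib.crc32(value.encode()), value))

def lookup(value: str):
    # Filter on the small integer index first, then compare the full
    # value so a CRC32 collision can never return the wrong row.
    return conn.execute(
        "SELECT value FROM t WHERE crc = ? AND value = ?",
        (zlib.crc32(value.encode()), value),
    ).fetchone()

insert("alice-00000000513498b8")
print(lookup("alice-00000000513498b8"))
```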

Generally I would prefer the first option, but it's almost impossible to recommend one without knowing your use case.

You should profile both options and see if they fit your requirements.

Steve
  • See my first comment to the question, I have only control over the string I pass to the database. Nothing else I can change. If I can make it such that DB works better, then good. If not, I guess I'm stuck. – akostadinov Mar 27 '13 at 21:48
  • without any control over the database there is very little you can do to improve performance. Even if you were to optimise your input for a BTREE index, without actually being able to alter how much of the input is indexed it won't make any difference. Also, a HASH index would be a better choice for raw performance since you've said you're not doing any range queries (>, <, etc.) - however, again, having no control over the database doesn't give you any tools to really improve anything. – Steve Mar 27 '13 at 21:58
  • You 'may' be able to make very minor optimisations by doing something funky with your input, but realistically these are going to be so small it's probably not worth it - and often they would result in more data being returned from the DB: what you save in raw query performance you lose in network traffic / compression / application-layer evaluation etc. – Steve Mar 27 '13 at 22:01

That you say you can't keep them in separate columns (you can't even set up a new table with a 1:1 relationship, mirror the data to a materialized view with triggers, or replace the existing table with a view on a corrected table structure?!) means that any solution will be an ugly hack.

Yes, how much the data changes and how it's structured will affect the efficiency of updates. However, the purpose of an index is to speed up retrievals, and you've given us no information about how the data is accessed or how it's likely to change.

I was thinking to also put timestamp with reverse order of bits

Why? This is more likely to increase index fragmentation than decrease it.

MariaDB supports virtual columns - and indexes on virtual columns - so you can do silly things like throw the rules of normalization out of the window. But if you can't fix a trivial problem in the schema, then replacing the DBMS probably won't be a very practical solution either.

Frankly, if developing a bad solution to the problem already costs as much as a proper solution, and will likely incur future costs on top, then choosing the bad solution is a waste of both time and money.

symcbean
  • "this is more likely to speed up index fragmentation than decrease it" - please explain why, and which particular DB engines you are talking about. I'm just asking if this can be optimized and how. I don't know why you think there is some problem I'm trying to avoid fixing. Any generic solution, like a caching solution, has some limitations in particular usage scenarios, but one does not build custom solutions for every feature. – akostadinov Mar 28 '13 at 06:31