-1

I know that it is generally best practice to use an INTEGER field for your primary key, but unfortunately, due to an API that I am working with, I can only have a primary key with the format: CHAR_INT (example: ABC_12345).

I will have a lot of data (+1 billion records) and querying and insertion speeds are a priority, will using a CHAR_INT primary key have a large impact on speed? Or is it relatively negligible?

Also, would creating a numerical ID for the CHAR part of the string be more efficient? So using the previous example: ABC_12345 would become something like 1_12345. I know they'd both be strings, just wondering if there is any efficiency with using only numbers.

I'm using SQLite.

Thank you!

Zach
  • 1,243
  • 5
  • 19
  • 28

3 Answers3

5

There is no built-in data type "CHAR_INT" type as far as I know.

However, SQLite is quite flexible on typing, and allows any string to be the name of a type. SQLite is not strongly typed, so the value seems to be stored as a string.

Indexes on numbers are more efficient. One important reason is that numbers are fixed length. Strings are variable length, which adds overhead when storing key values in the index. Another reason is that hardware does a better job of supporting numeric comparisons. String comparisons become much more complicated when taking character sets and collations into account.

That said, the overhead on searching and maintaining the index is actually quite small compared to the benefits of using the index. So, I would not worry that the index only has strings. I would worry more about a tool that imposes such limitations however. You should be able to choose the keys you want in your tables.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

Sqlite has two types of tables.

The default ROWID table. Rowid tables are B*-Trees, with a signed 64-bit integer as their primary key (The rowid). If the column has a single INTEGER PRIMARY KEY column, this column is used an alias for the rowid. Any other PRIMARY KEY type, or composite primary keys of two or more columns, is simply a unique index.

So your CHAR_INT column (Sqlite is very forgiving about what it takes for column types; it's just a hint about how to try to store and compare values stored in that column, not an actual type), by Sqlite rules, has an integer affinity, but since things like ABC_123 can't be converted losslessly to integers, they get stored as strings. Inserting a row means updating both the main table and the primary key index (And any other indexes, of course). Looking up a row by key involves first looking it the corresponding rowid in the index, and then looking up that row of the main table. On the plus side, both lookups use O(log N) binary searches.

The other table type is WITHOUT ROWID. These tables use the same plain B-Tree data structure used by indexes, and use the table's primary key, no matter its type or how many columns, as the true primary key. Inserts only require updating one table (Plus additional indexes, of course), lookups only need to search one table, so it can be faster and take less disk space when your primary key is not an INTEGER.

Which one ends up being better overall depends on a bunch of factors, like how many other indexes are used with the table, how much data is stored in a row, the queries run on the tables, and many other things. The documentation suggests, among other advice, building databases with and without WITHOUT ROWID tables and benchmarking to see what is more suitable for a particular use.

Shawn
  • 47,241
  • 3
  • 26
  • 60
1

The type (type affinity) with one exception makes no real difference.

The exception being specifically the_column_name INTEGER PRIMARY KEY (with or without AUTOINCREMENT) which defines the column as an alias of the rowid column. INT PRIMARY KEY does not.

so the_column_name CHAR_INT PRIMARY KEY, the_column_name INT CHAR PRIMARY KEY or even the_column_name INT PRIMARY KEY are actually the same, the_column_name RUMPLESTILTSKIN PRIMARY KEY could even be used (although the latter would have a different type affinity).

It is the rules that determine the type affinity. There are 5. The rule which has the highest precedence is if the type has INT then the type affinity is INTEGER. RUMPLESTILTSKIN as a type drops through all rules except the last that is if none of the previous rules apply then the type affinity is NUMERIC.

3.1. Determination Of Column Affinity

The affinity of a column is determined by the declared type of the column, according to the following rules in the order shown:

If the declared type contains the string "INT" then it is assigned INTEGER affinity.

If the declared type of the column contains any of the strings "CHAR", "CLOB", or "TEXT" then that column has TEXT affinity. Notice that the type VARCHAR contains the string "CHAR" and is thus assigned TEXT affinity.

If the declared type for a column contains the string "BLOB" or if no type is specified then the column has affinity BLOB.

If the declared type for a column contains any of the strings "REAL", "FLOA", or "DOUB" then the column has REAL affinity.

Otherwise, the affinity is NUMERIC.

Note that the order of the rules for determining column affinity is important. A column whose declared type is "CHARINT" will match both rules 1 and 2 but the first rule takes precedence and so the column affinity will be INTEGER.

Datatypes In SQLite

Saying that the type affinity does not determine how the data is stored. Each and every column is stored according to the storage class that is dependant upon the data being stored.

Null is stored as a null, a string of numerics (enclosed or not as a string) as an integer. In short the data will be stored as SQLite determines and SQlite will try to store the data as efficiently as it can and in as little space as it can down to a byte being the smallest unit that is stored.

Consider the following :-

DROP TABLE IF EXISTS mytable1;
DROP TABLE IF EXISTS mytable2;
DROP TABLE IF EXISTS mytable3;
CREATE TABLE IF NOT EXISTS mytable1 (c1 CHAR_INT PRIMARY KEY); 
CREATE TABLE IF NOT EXISTS mytable2 (c1 INT PRIMARY KEY); 
CREATE TABLE IF NOT EXISTS mytable3 (c1 RUMPLEstiltSkin PRIMARY KEY);
-- INSERT INTO mytable1 VALUES(12345),('12345'),('a_12345'),('1_12345'),(x'0102030405'); -- fails due to unique constraint 12345 and '12345' are the same 
-- INSERT INTO mytable2 VALUES(12345),('12345'),('a_12345'),('1_12345'),(x'0102030405'); -- fails due to unique constraint 12345 and '12345' are the same 
-- INSERT INTO mytable3 VALUES(12345),('12345'),('a_12345'),('1_12345'),(x'0102030405'); -- fails due to unique constraint 12345 and '12345' are the same 
INSERT INTO mytable1 VALUES(12345),('54321'),('a_12345'),('1_12345'),(x'0102030405');
INSERT INTO mytable2 VALUES(12345),('54321'),('a_12345'),('1_12345'),(x'0102030405');
INSERT INTO mytable3 VALUES(12345),('54321'),('a_12345'),('1_12345'),(x'0102030405');
SELECT c1, typeof(c1) FROM mytable1;
SELECT c1, typeof(c1) FROM mytable2;
SELECT c1, typeof(c1) FROM mytable3;
  • The commented out INSERTS (if uncommented and run) fail with a UNIQUE conflict as SQLite considers 12345 the same as '12345'.

the typeof function returns the type of the column (storage type NOT column affinity)

The results come up as :-

enter image description here enter image description here enter image description here

Using anything but INTEGER PRIMARY KEY (there are some derivations) and thus an alias of the rowid is

  • about half as fast
  • has two indexes, the rowid (unless the TABLE is defined as WITHOUT ROWID) and the PRIMARY KEY.

    • Searching for a record with a specific rowid, or for all records with rowids within a specified range is around twice as fast as a similar search made by specifying any other PRIMARY KEY or indexed value. ROWIDs and the INTEGER PRIMARY KEY

  • handling numbers as opposed to strings will consume more space and therefore will reduce what data can be held in a buffer thus there would be some impact.

Searching an index, is relatively fast, there is, in comparison to the data itself, relatively little data and the data itself is only read.

Perhaps consider the following :-

DROP TABLE IF EXISTS mytable1;
DROP TABLE IF EXISTS mytable2;
DROP TABLE IF EXISTS mytable3;
CREATE TABLE IF NOT EXISTS mytable1 (pk INT PRIMARY KEY, name TEXT); 
CREATE TABLE IF NOT EXISTS mytable2 (pk CHAR_INT PRIMARY KEY, name TEXT); 
CREATE TABLE IF NOT EXISTS mytable3 (pk INT PRIMARY KEY, name TEXT) WITHOUT ROWID; 

INSERT INTO mytable1
    WITH RECURSIVE cte1(a,b) AS (
            SELECT 'ABC_'||CAST(abs(random()) AS TEXT),'some data' UNION ALL 
            SELECT DISTINCT (substr(a,1,4))||CAST(abs(random()) AS TEXT),'some data' FROM cte1 LIMIT 1000000
        )
    SELECT * FROM cte1
;

INSERT INTO mytable2
    WITH RECURSIVE cte1(a,b) AS (
            SELECT '1_'||CAST(abs(random()) AS TEXT),'some data' UNION ALL 
            SELECT DISTINCT (abs(random()) % 100)||'_'||CAST(abs(random()) AS TEXT),'some data' FROM cte1 LIMIT 1000000
        )
    SELECT * FROM cte1
;
INSERT INTO mytable3 SELECT * FROM mytable1;

SELECT * FROM mytable1 WHERE name LIKE('%me data%');
SELECT * FROM mytable2 WHERE name LIKE('%me data%');
SELECT * FROM mytable3 WHERE name LIKE('%me data%');

SELECT * FROM mytable3 WHERE name LIKE('%me data%');
SELECT * FROM mytable1 WHERE name LIKE('%me data%');
SELECT * FROM mytable2 WHERE name LIKE('%me data%');

SELECT * FROM mytable2 WHERE name LIKE('%me data%');
SELECT * FROM mytable3 WHERE name LIKE('%me data%');
SELECT * FROM mytable1 WHERE name LIKE('%me data%');

This creates 3 permutations all with 1,000,000 rows

  • mytable1 has the primary key as ABC_???? e.g. :-

    • enter image description here
  • mytable2 has the primary key as ??_???? e.g. :-

    • enter image description here
  • mytable3 is a copy of mytable1 BUT the table has been defined using WITHOUT ROWID

Timings

The SELECTS for all 3 are pretty close (multiple selects done and in different order to even out caching). The messages which include the timings are :-

SELECT * FROM mytable1 WHERE name LIKE('%me data%')
> OK
> Time: 0.672s


SELECT * FROM mytable2 WHERE name LIKE('%me data%')
> OK
> Time: 0.667s


SELECT * FROM mytable3 WHERE name LIKE('%me data%')
> OK
> Time: 0.702s


SELECT * FROM mytable3 WHERE name LIKE('%me data%')
> OK
> Time: 0.7s


SELECT * FROM mytable1 WHERE name LIKE('%me data%')
> OK
> Time: 0.675s


SELECT * FROM mytable2 WHERE name LIKE('%me data%')
> OK
> Time: 0.673s


SELECT * FROM mytable2 WHERE name LIKE('%me data%')
> OK
> Time: 0.676s


SELECT * FROM mytable3 WHERE name LIKE('%me data%')
> OK
> Time: 0.709s


SELECT * FROM mytable1 WHERE name LIKE('%me data%')
> OK
> Time: 0.676s
  • I believe mytable3, takes a bit of a hit because the scan (in this case) is on the PRIMARY KEY, rather than the rowid which is suitable/preferable for the other two.

In addition to the previous links, You may also wish to look at :-

MikeT
  • 51,415
  • 16
  • 49
  • 68