0

I have 5 text-fields which I want to import into a MySQL/MariaDB database. But there are two problems:

(1) The files are quite large: 0.5 GB to 10 GB
(2) All relevant keys have 40 characters

Point (1) I have to accept as it is and I can't change it. Point 2 is my concern. There are a lot of suggestions in the internet. For example, to use enum for varchar or to use numeric surrogates. There is no problem to add a surrogate key to a table. But the same surrogate key has to be added to other tables. And this is the point where I stuck.

Here the specific information about the files/tables:

  • table invoice has 3 columns and 20 Mio rows:

    • invoice_id (primary key) with distinct values = number of rows
    • praxis_id with 4,000 distinct values
    • patient_id with 4 Mio distinct values all columns are CHAR(40) and have a fixed length of 40.
  • table diagnose has 3 columns and 25 Mio rows:

    • invoice_id CHAR(40) 1.4 Mio distinct id
    • diagnose_type
    • diagnose_code
  • table patient has 5 columns with 5 Mio rows:

    • patient_id CHAR(40) not unique (4 Mio distinct pat_id)
    • praxis_id CHAR(40)
    • year of birth, sex etc.

For example, I want to join invoice with diagnose and patient. It makes sense to index the keys. One way would be to define invoice.invoice_id as primary key and for all other keys in table invoice I would add an index. The same with table diagnose (invoice_id with INDEX) and patient (patient_id as primary key).
The problem is that it took a long time to define invoice.invoice_id as primary key using:

ALTER TABLE invoice_id ADD PRIMARY KEY(invoice_id);

After one hour I killed the process. I think that one problem of performance arises from the kind of datatype of invoice_id in table invoice. One idea could be to add an autoincrementing surrogate key invoice_id_surr when loading the text file. But nevertheless the problem remains if I want to join with table diagnose since I have to join with invoice_id of table diagnose which has not the surrogate key invoice_id_surr as foreign key. I could add an index on diagnose.invoice_id but then I loose the advantage of having a surrogate key on table invoice.

I would be interested in a strategy how to cope with this problem: Several already existing tables which can be joint together but the keys are CHAR(40) and have no index.

Thanks for help.


UPDATE 1: Table specification
- keys have 40 characters [0-9][A-Z]
- These are tables which won't change anymore (no inserts)

-- invoice_id is primary key (unique)
-- patient_id and praxis id for foreign key and not unique in this table
CREATE TABLE invoice (
  invoice_id             CHAR(40) DEFAULT NULL
, praxis_id              CHAR(40) DEFAULT NULL
, patient_id             CHAR(40) DEFAULT NULL
, PRIMARY KEY (invoice_id2)
) ENGINE = InnoDB
;  

LOAD DATA LOCAL INFILE 'C:/data/invoice.txt'
INTO TABLE invoice
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
;

-- invoice_id is not unique in this table
CREATE TABLE diagnose (
  invoice_id             CHAR(40)    DEFAULT NULL
, diagnose_katalog       VARCHAR(20) DEFAULT NULL
, diagnose_code          VARCHAR(20) DEFAULT NULL
) ENGINE = InnoDB
;
-- patient_id is not unique in this table since since patient may change praxis
CREATE TABLE patient (
  patient_id             CHAR(40)    DEFAULT NULL
, praxis_id              CHAR(40)    DEFAULT NULL
, sex                    CHAR(1)     DEFAULT NULL
, birth_year             SMALLINT UNSIGNED DEFAULT NULL
, zip_code               VARCHAR(20) DEFAULT NULL
) ENGINE = InnoDB
;
giordano
  • 2,954
  • 7
  • 35
  • 57

3 Answers3

1

You should avoid using natural keys as primary keys in your database for a variety of reasons, which can be found quite easy by googling.

Also, what does it mean "tables are given"? You should not insist on having a one-to-one correspondence between your text files and database tables. Instead, you should design your tables in the way that most suits your needs. Strive for normalization.

For example, you say that patient_id in patient "table" (you mean file, right?) is not unique. Clearly you need a table where patients are unique; so create a table with distinct patient_id and their attributes. patient_id should be a unique key on that table, but generate a numeric surrogate (an auto_increment field, for example) to serve as a primary key in that table. After that, for example, create a praxis table, with praxis_id as a unique key and a numeric surrogate for primary key. Then you can connect patient and praxis with a third table, as appropriate for many-to-many relationships. That way you normalize your database: a patient with its attributes is always a single row in a single patient table; whereas now you have the same patient entered multiple times into you current patient file/table, which will cause you trouble sooner or later.

zgguy
  • 226
  • 1
  • 5
  • @zgguv Thanks for answer. I didn't make myself clear. With "given" or "existing" tables I mean that the tables file has been exported from an external database as text-files and I can't change them. I don't know why they used strings with 40 Characters with no sense as key instead of numeric. Maybe, I was wrong to call this key natural (I called it so since it was delivered). Table **patient** has not unique _patient_id_ since it has historized information of the patient. To make it simple, forget **patient** table and look only on **invoice** and **diagnose** (1:n relation). (cont.) – giordano Jun 09 '15 at 16:53
  • @zgguv (cont) There is no problem to add an numerical surrogate key _invoice_ind2_ to **invoice**. But this surrogate key should be also added to **diagnose** as foreign key to be able to join **invoice**. _invoice_id2_ = **diagnose**. _invoice_id2_. So I have first to add a column **diagnose**. _invoice_id2_ and `update diagnose, invoice SET diagnose.invoice_id2 = invoice.invoice_id2 WHERE diagnose.invoice_id = invoice.invoice_id`.I can do that but than I should add index to **diagnose**. _invoice_id_ and **diagnose**. _invoice_id_. So, why not use directly the indexed invoice_id for joins? – giordano Jun 09 '15 at 17:07
  • It's a Baron of Munchhausen problem: to get out from the sea he pulls on his own hair. To avoid string as key for joins I have to perform joins with this strings. – giordano Jun 09 '15 at 17:15
  • I suppose your data is not static, but new data comes into the system regularly. Surrogate keys are good because you have total control over them. What if someone decides to change existing invoice ids, starts sending you non-unique invoice ids etc? Maybe it's unlikely, but things like that do happen in the real world. Second, you said yourself that your keys are 40 chars long. But you have only several million records, and an int key 4 bytes long suffices for over 4 billion records. It is much faster to join on 4 byte columns than 40 byte columns. You only do the join on char(40) column once. – zgguy Jun 09 '15 at 19:46
  • Thanks for reply. Your supposition is unfortunately false. If it would be correct there would be no discussion about this issue and I would agree with all your points. The data are static ("given"). 10 hours ago I started to load the data again adding an autoincremental key (`CREATE TABLE invoice invoice_id2 INTEGER NOT NULL AUTO_INCREMENT, invoice_id CHAR(40), praxis_id CHAR(40), patient_id CHAR(40) , PRIMARY KEY (invoice_id2) ENGINE = INNODB`) and it is still running. I didn't expect that it takes so long to add an autoincremental key. – giordano Jun 10 '15 at 05:57
  • 1
    How are you loading the data? It really should not take that long for a few million records. – zgguy Jun 10 '15 at 06:08
  • You're right. I use HeidiSQL and it was blocked. If I only load the data without the autoincremental part it takes 8 minutes. I try again. My loading code is `LOAD DATA LOCAL INFILE 'C:/01_data//invoice.txt' INTO TABLE invoice FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n' IGNORE 1 LINES (invoice_id,praxis_id,patient_id) ;` – giordano Jun 10 '15 at 06:20
  • @zgguv You'right! It takes only 8 minutes. Thanks for the hint. I will go further and make a summary about my experience. – giordano Jun 10 '15 at 06:47
  • Great! HeidiSQL is just i client, the server does the actual loading of data so it shouldn't matter which client issues the statement. LOAD DATA INFILE is the "correct" way of loading data from text files. – zgguy Jun 10 '15 at 06:54
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/80136/discussion-between-giordano-and-zgguy). – giordano Jun 10 '15 at 07:34
1

Do you really have CHAR(40), not VARCHAR(40)? Are the values always 40 characters? Is the table CHARACTER SET utf8?

CHAR(40) utf8 takes 120 bytes always. If you store 'Z' in such a field, it still takes 120 bytes, not just 1. Even if you are importing 40 characters, declaring it VARCHAR will trim the trailing blanks as it is loaded.

At a minimum, I would do

ALTER TABLE foo
    MODIFY col1 VARCHAR(40) ...,  -- the "..." is other options for the col
    MODIFY col2 VARCHAR(40) ...,
    ...;

This would probably help "Point (1)" a lot. And make everything faster. (Caveat: It would take a long time to finish that ALTER.)

"Natural" PRIMARY KEYs are not evil. But use them where appropriate. In your case, an invoice_id is required to be unique for Business reasons, correct? How long is it? It is probably good as a PK.

What key(s) did you have on the table before you tried to do the ALTER? When initially building a table, you should have at least the PRIMARY KEY in place. (But too late now, I guess.)

Please provide SHOW CREATE TABLE for each table -- I am having to make too many guesses.

Adding a surrogate (AUTO_INCREMENT) key adds a level of indirection for all lookups; this may slow down some SELECTs (in addition to requiring changes to many SELECTs).

When adding secondary indexes to a table, base it on what would be of use to the SELECT statements you have. I discuss that in my index cookbook. Provide SELECTs here for further discussion. Do not blindly add an index for each column.

It sounds like invoice_id should be the PRIMARY KEY for invoice and an INDEX in diagnose.

If it is practical to reload the data, declare the fields VARCHAR, not CHAR, and have PRIMARY KEY(invoice_id) in invoice.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thank for you insightful inputs and for your link. The link will be a reference for my further working with DB. Later I will add the table specification. For the moment: yes, all keys I mentioned have exact 40 alphanunerical characters. That's why I used CHAR(40) otherwise I would use VARHCHAR(40). It's simpler to reload the data with new table definition. I use utf8mb4 to avoid any encoding problems. I have German umlaute French accents (but keys have no such signs). Yes: _invoice_id_ is unique in **invoice** (primary key) but not in **diagnose** (foreign key). – giordano Jun 11 '15 at 20:38
  • Let's see the `LOAD DATA` statement; I don't see why `VARCHAR` won't work. For French and German, most characters are 1 byte in utf8 an ut8mb4; the accented characters are 2 bytes. So, exactly 40 characters in utf8mb4 takes 160 bytes for `CHAR(40)` and usually under 50 bytes for `VARCHAR(40)`. – Rick James Jun 11 '15 at 21:02
  • Even if the `LOAD` must go into `CHAR(40)`, you could promptly `ALTER` the columns to change to `VARCHAR(40)` and shrink the table by a factor of 3. – Rick James Jun 11 '15 at 21:04
  • LOAD has not to go into CHAR(40). I will change this to VARCHAR(40). Now, I understand: Due to utf8mb4 the advantage of CHAR(40) over VARCHAR(40) is lost. Initially I wanted to replace all keys (invoice_id,praxis_id, patient_id) with numeric keys. Replace means: after adding the numeric key drop the VARCHAR(40) keys. But maybe, it is not necessary. The alternative would be to load invoice with invoice_id as primary key and put index on patient_id and praxis_id in tables invoice and patient. – giordano Jun 12 '15 at 07:05
  • See if switching to `VARHAR` will improve things enough. If not, start another thread on how to turn them into `INT` ids (if you need help). – Rick James Jun 12 '15 at 19:08
0

I respond myself to my question. The reason for this posting was that I submitted queries, for example defining a primary key with ALTER TABLE ... and the process didn't stopped after several hours. As @zgguv mentioned the duration seems not plausible. I stopped the queries and restarted (sometimes after the third time) and the process finished after some minutes (about 10 minutes). I don't know why sometimes the queries hang-ups. This never happened to me untill now but the tables I used were much smaller. Lesson learnt is:

  • The long string keys should be replaced by numeric keys to make selections faster.

  • Replacing the string keys of tables of size 10 GB (text-table), respectively, 20 Mio row numbers are feasible (You have to pull on your own hair only once). A join between to indexed string keys took about 10 minutes.

  • If the duration of the query takes more than 30' (hang-up) then stop it and try again. It would be nice to know why this happened (InnoDB, MyISAM, HeidiSQL, ...) but this is another issue.

@zgguv Thanks for support and patience.

giordano
  • 2,954
  • 7
  • 35
  • 57