I have 5 text-fields which I want to import into a MySQL/MariaDB database. But there are two problems:
(1) The files are quite large: 0.5 GB to 10 GB
(2) All relevant keys have 40 characters
Point (1) I have to accept as it is and I can't change it. Point 2 is my concern. There are a lot of suggestions in the internet. For example, to use enum for varchar or to use numeric surrogates. There is no problem to add a surrogate key to a table. But the same surrogate key has to be added to other tables. And this is the point where I stuck.
Here the specific information about the files/tables:
table invoice has 3 columns and 20 Mio rows:
- invoice_id (primary key) with distinct values = number of rows
- praxis_id with 4,000 distinct values
- patient_id with 4 Mio distinct values all columns are CHAR(40) and have a fixed length of 40.
table diagnose has 3 columns and 25 Mio rows:
- invoice_id CHAR(40) 1.4 Mio distinct id
- diagnose_type
- diagnose_code
table patient has 5 columns with 5 Mio rows:
- patient_id CHAR(40) not unique (4 Mio distinct pat_id)
- praxis_id CHAR(40)
- year of birth, sex etc.
For example, I want to join invoice with diagnose and patient. It makes sense to index the keys. One way would be to define invoice.invoice_id as primary key and for all other keys in table invoice I would add an index. The same with table diagnose (invoice_id with INDEX) and patient (patient_id as primary key).
The problem is that it took a long time to define invoice.invoice_id as primary key using:
ALTER TABLE invoice_id ADD PRIMARY KEY(invoice_id);
After one hour I killed the process. I think that one problem of performance arises from the kind of datatype of invoice_id in table invoice. One idea could be to add an autoincrementing surrogate key invoice_id_surr when loading the text file. But nevertheless the problem remains if I want to join with table diagnose since I have to join with invoice_id of table diagnose which has not the surrogate key invoice_id_surr as foreign key. I could add an index on diagnose.invoice_id but then I loose the advantage of having a surrogate key on table invoice.
I would be interested in a strategy how to cope with this problem: Several already existing tables which can be joint together but the keys are CHAR(40) and have no index.
Thanks for help.
UPDATE 1: Table specification
- keys have 40 characters [0-9][A-Z]
- These are tables which won't change anymore (no inserts)
-- invoice_id is primary key (unique)
-- patient_id and praxis id for foreign key and not unique in this table
CREATE TABLE invoice (
invoice_id CHAR(40) DEFAULT NULL
, praxis_id CHAR(40) DEFAULT NULL
, patient_id CHAR(40) DEFAULT NULL
, PRIMARY KEY (invoice_id2)
) ENGINE = InnoDB
;
LOAD DATA LOCAL INFILE 'C:/data/invoice.txt'
INTO TABLE invoice
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
;
-- invoice_id is not unique in this table
CREATE TABLE diagnose (
invoice_id CHAR(40) DEFAULT NULL
, diagnose_katalog VARCHAR(20) DEFAULT NULL
, diagnose_code VARCHAR(20) DEFAULT NULL
) ENGINE = InnoDB
;
-- patient_id is not unique in this table since since patient may change praxis
CREATE TABLE patient (
patient_id CHAR(40) DEFAULT NULL
, praxis_id CHAR(40) DEFAULT NULL
, sex CHAR(1) DEFAULT NULL
, birth_year SMALLINT UNSIGNED DEFAULT NULL
, zip_code VARCHAR(20) DEFAULT NULL
) ENGINE = InnoDB
;