
I have a CSV file with about 16,916 records. When I load it into MySQL, only 15,945 records are detected. This is what MySQL says:

Records: 15945  Deleted: 0  Skipped: 0  Warnings: 0

Can someone tell me why MySQL ignores some records and how I can fix this?

I load the file with a LOAD DATA statement like this:

LOAD DATA LOCAL INFILE 'germany-filtered.csv'
INTO TABLE point_of_interest
FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(osm_id,lat,lng,access,addr_housename,addr_housenumber,addr_interpolation,admin_level,aerialway,aeroway,amenity,area,barrier,bicycle,brand,bridge,boundary,building,capital,construction,covered,culvert,cutting,denomination,disused,ele,embankment,foot,generator_source,harbour,highway,historic,horse,intermittent,junction,landuse,layer,leisure,ship_lock,man_made,military,motorcar,name,osm_natural,office,oneway,operator,place,poi,population,power,power_source,public_transport,railway,ref,religion,route,service,shop,sport,surface,toll,tourism,tower_type,tunnel,water,waterway,wetland,width,wood);
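Before hunting for individual records, it can help to confirm the size of the gap from the shell. This is a sketch using a tiny stand-in file (substitute `germany-filtered.csv` for the real check): the number of data lines, minus the header skipped by `IGNORE 1 LINES`, should match the `Records:` figure, and stray carriage returns are a common reason for `LINES TERMINATED BY '\n'` to misparse rows.

```shell
# Tiny stand-in file; substitute germany-filtered.csv for the real check.
printf 'osm_id,lat,lng\n1,10.0,50.0\n2,11.0,51.0\n' > sample.csv

# Data lines = total lines minus the header skipped by IGNORE 1 LINES.
data_lines=$(( $(wc -l < sample.csv) - 1 ))
echo "$data_lines"

# Lines containing a carriage return (0 here; a nonzero count would
# suggest the file needs LINES TERMINATED BY '\r\n' instead).
crlf_lines=$(grep -c "$(printf '\r')" sample.csv || true)
echo "$crlf_lines"
```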

This is the database schema I use:

CREATE TABLE point_of_interest (
    `poi_id` int(10) unsigned NOT NULL auto_increment,
    `lat` DECIMAL(10, 8) default NULL,
    `lng` DECIMAL(11, 8) default NULL,
    PRIMARY KEY  (`poi_id`),
    KEY `lat` (`lat`),
    KEY `lng` (`lng`),
    osm_id BIGINT,
    access TEXT,
    addr_housename TEXT,
    addr_housenumber TEXT,
    addr_interpolation TEXT,
    admin_level TEXT,
    aerialway TEXT,
    aeroway TEXT,
    amenity TEXT,
    area TEXT,
    barrier TEXT,
    bicycle TEXT,
    brand TEXT,
    bridge TEXT,
    boundary TEXT,
    building TEXT,
    capital TEXT,
    construction TEXT,
    covered TEXT,
    culvert TEXT,
    cutting TEXT,
    denomination TEXT,
    disused TEXT,
    ele TEXT,
    embankment TEXT,
    foot TEXT,
    generator_source TEXT,
    harbour TEXT,
    highway TEXT,
    historic TEXT,
    horse TEXT,
    intermittent TEXT,
    junction TEXT,
    landuse TEXT,
    layer TEXT,
    leisure TEXT,
    ship_lock TEXT,
    man_made TEXT,
    military TEXT,
    motorcar TEXT,
    name TEXT,
    osm_natural TEXT,
    office TEXT,
    oneway TEXT,
    operator TEXT,
    place TEXT,
    poi TEXT,
    population TEXT,
    power TEXT,
    power_source TEXT,
    public_transport TEXT,
    railway TEXT,
    ref TEXT,
    religion TEXT,
    route TEXT,
    service TEXT,
    shop TEXT,
    sport TEXT,
    surface TEXT,
    toll TEXT,
    tourism TEXT,
    tower_type TEXT,
    tunnel TEXT,
    water TEXT,
    waterway TEXT,
    wetland TEXT,
    width TEXT,
    wood TEXT
) ENGINE=InnoDB;

Update:

I already checked the first and the last record; both exist. Records with a lot of empty values, like this one, also exist:

1503898236,10.5271308,52.7468051,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Update 2:

These are the records I found that are missing from the database:

4228380062,9.9386752,53.6135468,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Dammwild,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278589,9.9391503,53.5960304,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Kaninchen,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278483,9.9396935,53.5960729,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Onager,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4226772791,8.8394263,54.1354887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Familienlagune Perlebucht,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,

It seems that nearly all of the records with an osm_id starting with a 4 are missing. That's strange.
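The ids in question are all above 2147483647, the signed 32-bit INT maximum. The schema above already uses BIGINT for osm_id, so truncation there shouldn't occur, but it's a quick hypothesis to check from the shell, for example:

```shell
# Which of these osm_ids exceed the signed 32-bit INT maximum?
printf '4228380062\n1503898236\n4226772791\n' > ids.txt
awk '$1 > 2147483647' ids.txt   # prints 4228380062 and 4226772791
```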

Peter
  • Probably not what you want to hear, but it would be very interesting to know what lines were ignored. – fvu Jun 11 '16 at 18:10
  • I'd like to know as well. I already checked the first and last record but both exist. I don't want to check every record. – Peter Jun 11 '16 at 18:12
  • I updated the question with some of the missing records. Maybe that helps finding the cause. – Peter Jun 11 '16 at 18:23
  • The five records (Update 2) are loaded into the table in a small test. – wchiquito Jun 11 '16 at 19:35
  • That's right. Also, when I duplicate all records in my CSV file, MySQL recognizes 31890 (= 2 × 15945) of the 33832 records. So it seems MySQL doesn't discard them because it thinks they are duplicates. – Peter Jun 11 '16 at 19:51
  • There are records with quotes like this: `"Plastiblock" (ehemals Block 683)` to be replaced by `\"Plasteblock\" (ehemals Block 683)`. – wchiquito Jun 13 '16 at 12:22
  • Even those records get parsed in a minimal example. To test it I split the CSV into files of 1 line each (16,916 files) and loaded every single one of them using a self-written bash script. After this, every record was in the database. So I guess the records are all valid and the problem is somewhere else. – Peter Jun 14 '16 at 06:52

2 Answers


Try this to see whether you have duplicate ids in the file:

show the file

# cat mycsv.csv
6991,10.4232704,49.4970160,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Bauernhaus aus Seubersdorf,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228380062,9.9386752,53.6135468,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Dammwild,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278589,9.9391503,53.5960304,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Kaninchen,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278483,9.9396935,53.5960729,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Onager,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4226772791,8.8394263,54.1354887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Familienlagune Perlebucht,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278589,9.9391503,53.5960304,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Kaninchen,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,

count the lines

# wc -l mycsv.csv
6 mycsv.csv

remove duplicate ids and count again

# cut -d',' -f1 mycsv.csv | sort | uniq | wc -l
5
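To see which ids collide rather than just comparing counts, `uniq -d` prints only the repeated lines; sketched here with the first fields of the sample file inlined:

```shell
# uniq -d keeps only lines that occur more than once in sorted input
printf '6991\n4228380062\n4228278589\n4228278483\n4226772791\n4228278589\n' | sort | uniq -d
```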
Bernd Buffen
  • Thanks for your answer. It seems `cut` doesn't find any duplicated lines. I get 16920 lines before and after executing this command. Have you tried this solution with the CSV file I linked in my question? – Peter Jun 11 '16 at 18:52
  • @Peter - found the error :-). You have duplicate keys. The reason is that the id is too big for an INT field, so it was truncated and you got duplicates. Change `poi_id` to BIGINT and everything is fine. – Bernd Buffen Jun 11 '16 at 19:10
  • `poi_id` is `AUTO_INCREMENT`, not loaded from the .CSV. – wchiquito Jun 11 '16 at 19:20
  • @Peter - sorry, my mistake. I get the same error - Query OK, 15945 rows affected (0.95 sec) Records: 15945 Deleted: 0 Skipped: 0 Warnings: 0 – Bernd Buffen Jun 11 '16 at 19:30

I didn't find the reason why MySQL ignores some of the records, so I looked for workarounds. There are two solutions that work for me:

Split the CSV file into multiple parts

split -l 10 file.csv

I found that if I split the CSV into multiple parts and load them into MySQL one by one, every record is recognized. However, this only worked for me when the files were quite small (~10 records per file), so this solution wasn't viable.
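For reference, this is roughly what that looked like, sketched on a stand-in file; each resulting part then gets its own LOAD DATA LOCAL INFILE statement, e.g. from a bash loop:

```shell
# split a 25-line stand-in file into parts of 10 lines each
seq 25 > file.csv
split -l 10 file.csv part_
ls part_* | wc -l   # 10 + 10 + 5 lines -> 3 parts
```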

Convert the CSV into MySQL Insert Statements

This part of a bash script converts the CSV file into an SQL file containing INSERT INTO statements:

cp file.csv inserts.sql
# replace empty CSV values (leading, trailing, interior) with NULL
sed -r 's;^,;NULL,;
s;,$;,NULL;
:l
s;,,;,NULL,;g
t l' -i inserts.sql

# replace " with '
sed "s/\"/'/g" -i inserts.sql

# enquote every value
sed 's/[^,][^,]*/"&"/g' -i inserts.sql

# replace ,, with ,NULL,NULL,
sed 's/,,/,NULL,NULL,/g' -i inserts.sql

# replace ,, with ,
sed 's/,,/,/g' -i inserts.sql

# add INSERT INTO table_name VALUES (NULL, before each line
# Note: the first value is NULL because it's the auto-increment primary key
sed 's/^/INSERT INTO table_name VALUES (NULL,/' -i inserts.sql

# add ); at the end of each line
sed 's/$/);/' -i inserts.sql

# replace ,); with );
sed 's/,);/);/g' -i inserts.sql

Note: I do not guarantee that this solution works with all CSV files, so check the generated SQL file before using it.
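As a quick check of what the script produces, here are its core steps (NULL substitution, quoting, INSERT wrapper) applied to a single sample line; `table_name` is a placeholder as in the script. Note that empty fields come out as the quoted string "NULL" rather than an SQL NULL, because the quoting step runs after the NULL substitution:

```shell
# run the core steps of the pipeline on one sample line
printf '1503898236,10.5,52.7,,foo\n' > inserts.sql
sed -r ':l
s;,,;,NULL,;g
t l' -i inserts.sql                                    # fill empty fields
sed 's/[^,][^,]*/"&"/g' -i inserts.sql                 # quote every value
sed 's/^/INSERT INTO table_name VALUES (NULL,/' -i inserts.sql
sed 's/$/);/' -i inserts.sql
cat inserts.sql
```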

Peter