
I'm trying to create an SQLite database by importing a CSV file of URLs. The file has about 6 million strings. Here are the commands I've used:

create table urltable (url text primary key);
.import csvfile urltable

After about 3 million URLs the speed slows down a lot and my hard disk keeps spinning continuously. I've tried splitting the CSV file into four chunks, but I run into the same problem.

I read similar posts on Stack Overflow and tried using BEGIN...COMMIT blocks and PRAGMA synchronous=OFF, but none of them helped. The only way I was able to create the database was by removing the primary key constraint from url. But then, when I run a SELECT to find a particular url, it takes 2-3 seconds, which won't work for my application. With the primary key set on url, the SELECT is instantaneous. Please advise me on what I'm doing wrong.
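
For reference, the batched import I tried looked roughly like this (the chunk file names are just placeholders for my split CSV files):

PRAGMA synchronous = OFF;
BEGIN;
.import csvchunk1 urltable
COMMIT;
BEGIN;
.import csvchunk2 urltable
COMMIT;

and so on for the remaining chunks.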

[Edit] Summary of suggestions that helped:

  • Reduce the number of transactions
  • Increase page size & cache size
  • Add the index later
  • Remove redundancy from url

Still, with a primary index, the database size is more than double that of the original CSV file I was trying to import. Is there any way to reduce that?

user
  • Try creating another column with a hash of the url and make that unique. – JoshRoss Jun 04 '11 at 22:11
  • If the data doesn't have to be unique, why not add a regular index to the url column? – JoshRoss Jun 05 '11 at 12:57
  • 1
    @JohnRoss : About your first comment, If I make the hash column unique, how will I handle hash collisions? About the second comment, the url data is unique. – user Jun 05 '11 at 18:46
  • 1
    If you know the input data is unique, why make the database verify that on insert? If you pick a hashing algorithm with good entropy, like SHA1, you probably won't get a collision. I was thinking of what had to be checked on insert, when all your inputs start with `http://www.` that's 11 characters that could otherwise be ignored when checking the validity of each insert. – JoshRoss Jun 06 '11 at 02:25
  • @JoshRoss Nice suggestion; for my case I could split the urls and ignore the common portion (rough sketch below). – user Jun 06 '11 at 09:31
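
A rough sketch of that prefix idea (the column name and values are just illustrative; the common 'http://www.' prefix is stripped before both inserts and lookups):

create table urltable (urltail text primary key);
insert into urltable values ('example.com/page1');
select count(*) from urltable where urltail = 'example.com/page1';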

2 Answers


Increase your cache size to something large enough to contain all of the data in memory. The default values for page size and cache size are relatively small and if this is a desktop application then you can easily increase the cache size many times.

PRAGMA page_size = 4096;
PRAGMA cache_size = 72500;

That will give you a cache size of just under 300 MB (72,500 pages × 4 KB). Remember that the page size must be set before the database is created. The default page size is 1024 bytes and the default cache size is 2000 pages.

Alternatively (or almost equivalently, really), you can build the database entirely in an in-memory database and then use the backup API to move it to an on-disk database.
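
A minimal sketch of that route with the sqlite3 shell, where urls.db is just a placeholder name for the on-disk file (the first line starts the shell on an in-memory database; the rest run at its prompt):

sqlite3 :memory:
create table urltable (url text primary key);
.import csvfile urltable
.backup main urls.db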

Samuel Neff

A PRIMARY KEY or UNIQUE constraint will automatically generate an index. An index will dramatically speed up SELECTs, at the expense of slowing down INSERTs.

Try importing your data into a non-indexed table, and then explicitly CREATE UNIQUE INDEX index_name ON urltable(url). It may be faster to build the index all at once than one row at a time.
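
In shell terms, the whole flow would look something like this (the index name is arbitrary):

create table urltable (url text);
.import csvfile urltable
create unique index url_index on urltable(url);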

dan04
  • Isn't that what's happening when I use a BEGIN...COMMIT block, i.e. indexing at the end? – user Jun 05 '11 at 06:43
  • 2
    @buffer, it's not exactly the same, dan04 is right, creating the index after the inserts are done can significantly speed up the overall process. However, with that much data and a small cache size, you can still get a lot of disk thrashing so the better solution is to increase the cache size (or better yet--do both). – Samuel Neff Jun 05 '11 at 18:08
  • Thank you @dan04 & @Samuel Neff. Your approach does help in creating the database. Keeping url as the primary key more than doubles the size of the db. How can I cut down on that? A 600 MB text file of URLs translates to a 1.5 GB database with a primary index on url. – user Jun 06 '11 at 10:22
  • 1
    @buffer, you can't cut down on the size and still use sqlite. SQLite always internally stores data in pages indexed by a numeric sequence and additional indexes are stored in separate pages. If you just want to store that data once and all you have is that list of URLs, then you can use a data structure and alternate storage mechanism that is more efficient and specific. I've even heard that some people used the tree code within SQLite separately from SQLite. – Samuel Neff Jun 06 '11 at 18:28