
THE GIST

Before performing a massive 40+ GB import of the English Wikipedia, I had to temporarily remove the indexes and auto-increment fields from three tables ('page', 'revision', and 'text') to handle the load. I have now successfully imported the English Wikipedia onto my local machine and created a local mirror (MediaWiki API). Yay!

However, I now need to re-create the indexes and auto-increment fields in less than a decade. Luckily, (1) I took plenty of screenshots of the relevant tables in phpMyAdmin before I removed the indexes and fields; (2) I can explain with extreme precision the steps I took before the import; and (3) this shouldn't be too difficult for anyone fluent in MySQL. Unfortunately, I have no expertise in MySQL whatsoever, so "baby steps" explanations would be extremely helpful.

PRECISELY WHAT I DID (PREPARING FOR THE IMPORT):

Steps 1, 2, 3: This image depicts the table page before I modified the field page_id by clicking 'Change' and un-checking 'Auto-Increment' (in preparation for the import). I made exactly the same modifications to the field rev_id in table revision and to old_id in table text, but omitted those screenshots to avoid redundancy.

table 'page' before modification of 'page_id'

Step 4: This image depicts the indexes for the table page before I dropped all of them.

indexes for table 'page' before I dropped them

Step 5: This image depicts the indexes for the table revision before I dropped all of them.

indexes for table 'revision' before I dropped them

Step 6: This image depicts the indexes for the table text before I dropped all of them.

indexes for table 'text' before I dropped them

WHAT I NEED NOW (RESTORING AFTER THE IMPORT):

I just need to restore the original indexes and auto-increment fields without waiting a hundred years.

Set-up details: PHP 5.3.8 (apache2handler), MySQL 5.5.16 (InnoDB), Apache 2.2.21, Ubuntu 12.04 LTS, MediaWiki 1.19.0 (private wiki)

Brian Schmitz
  • +1 for a well-written question, but I'm afraid that with data sets of this size, creating indexes is going to take a while no matter how you do it. If this were MyISAM, you could have avoided dropping the indexes before importing: in MyISAM tables you can simply disable them and re-enable them after you finish importing your data, in which case MySQL will automatically repair the index by sorting. But even then, sorting 40 GB of data is going to take a while even on a fast system. And you're using InnoDB, where this is not a possibility AFAIK. – Daan Jun 06 '12 at 21:23
  • I don't think disabling and enabling is faster than just building a new index. You start from zero, sort the data and write the index. – usr Jun 06 '12 at 21:34
  • Remember I am a complete MySQL novice. I know what I did, but have no idea how to sort, re-index, etc. so it would be very helpful if you explained with code. If it turns out disabling and enabling using MyISAM made re-indexing drastically faster, I could always rebuild everything from scratch -- I've done it so many times, I could easily do it again in about 5 hours. The key is that I know almost nothing about MySQL, so please be specific and provide code examples if possible. – Brian Schmitz Jun 06 '12 at 21:54
  • Maybe you can turn off durability and transaction logging for MySQL during the index build. AFAIK this is possible. In any case, you *have* to create the indexes and there's no magic switch for doing that fast. – usr Jun 06 '12 at 22:37
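
For the curious, the two techniques mentioned in the comments look roughly like this. This is a sketch only: the DISABLE/ENABLE KEYS trick applies to MyISAM tables, not InnoDB, and the relaxed durability settings should be restored to their defaults once the build finishes.

```sql
-- MyISAM only: skip non-unique index maintenance during a bulk load,
-- then rebuild the indexes by sorting when keys are re-enabled.
ALTER TABLE page DISABLE KEYS;
-- ... bulk import here ...
ALTER TABLE page ENABLE KEYS;

-- InnoDB: relax durability while the indexes are built
-- (set globally, then restore the defaults afterwards).
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET GLOBAL sync_binlog = 0;
```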

1 Answer


I really like Wikipedia so I'll try to help.

You need to use a lot of

ALTER TABLE

Add primary keys

ALTER TABLE page ADD PRIMARY KEY (page_id);
ALTER TABLE revision ADD PRIMARY KEY (rev_id);
ALTER TABLE text ADD PRIMARY KEY (old_id);

Add auto increments back

ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;

I need the table descriptions for all tables before continuing. If rev_id and old_id have the same definition as page_id, then:

ALTER TABLE revision MODIFY COLUMN rev_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE text MODIFY COLUMN old_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;

Add unique keys

ALTER TABLE page ADD UNIQUE name_title(page_namespace, page_title);
ALTER TABLE revision ADD UNIQUE rev_page_id(rev_page, rev_id);

Other indexes

ALTER TABLE page ADD INDEX page_random(page_random);
ALTER TABLE page ADD INDEX page_len(page_len);
ALTER TABLE page ADD INDEX page_redirect_namespace(page_is_redirect, page_namespace, page_len);
ALTER TABLE revision ADD INDEX rev_timestamp(rev_timestamp);
ALTER TABLE revision ADD INDEX page_timestamp(rev_page, rev_timestamp);
ALTER TABLE revision ADD INDEX user_timestamp(rev_user, rev_timestamp);
ALTER TABLE revision ADD INDEX user_text_timestamp(rev_user_text, rev_timestamp);
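
A small aside: on MySQL 5.5, each ALTER TABLE on an InnoDB table rebuilds the whole table, so the index additions above can also be folded into a single ALTER TABLE per table so the table is copied only once (same result, fewer passes). For example, for page:

```sql
ALTER TABLE page
  ADD INDEX page_random (page_random),
  ADD INDEX page_len (page_len),
  ADD INDEX page_redirect_namespace (page_is_redirect, page_namespace, page_len);
```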

Again, there may be column definitions that change this stuff. You need to provide the CREATE TABLE info.
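
The exact definitions can be dumped straight from MySQL, which is much easier to share than screenshots:

```sql
-- prints the full CREATE TABLE statement for each table,
-- including column types, keys, and AUTO_INCREMENT settings
SHOW CREATE TABLE page;
SHOW CREATE TABLE revision;
SHOW CREATE TABLE text;
```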

saccharine
  • Is this what you mean? This is the revision table http://imageshack.us/photo/my-images/38/revid.png/ and the text table is visible at the top of http://imageshack.us/photo/my-images/59/textindex.png/ If not, where should I look? I've got a ton of other screen-shots. – Brian Schmitz Jun 06 '12 at 22:54
  • @BrianSchmitz Yup, made a few changes and now everything should work. In future, use "SHOW CREATE TABLE table_name" instead of screenshots of phpMyAdmin. – saccharine Jun 06 '12 at 23:15
  • Excellent. I expect re-indexing should take some time, but would you be able to make a very, very rough ballpark guesstimate of the order of magnitude of processing time we're talking about here? Hours? Days? (I'm doing everything on a fairly beefy laptop.) If not, I understand since I know I haven't provided much info. – Brian Schmitz Jun 06 '12 at 23:47
  • @BrianSchmitz Make sure you run each command in the order I listed, and leave the indexes for last. I have no clue how long it'll take since I don't know how many rows you have. This might give you more of an idea: http://stackoverflow.com/questions/2483205/mysql-how-long-to-create-an-index – saccharine Jun 07 '12 at 00:15
  • Are you sure the indexes should come last? The validation of the foreign keys can make use of them to be faster. – usr Jun 07 '12 at 10:24
  • @usr Didn't see any fks. Don't really have much exp with this though so you may be right. – saccharine Jun 07 '12 at 18:03
  • I saw [this blog](http://blog.aptivate.org/2008/12) on the subject from 2008 (slightly different table structure), The blogger appears to have used a slightly different approach, although I would be interested to know what both of you think about the approach and how it compares/differs from saccharine's approach. The blogger also appears to use a slightly different approach for removing indexes. – Brian Schmitz Jun 07 '12 at 19:56
  • @BrianSchmitz It does the same thing as my code except that it combines the commands together. It should be the same order of magnitude in terms of speed. The dropping of indexes is the same. Have you not finished re-indexing yet? – saccharine Jun 07 '12 at 22:21
  • @saccharine I'm on the step right before actually adding the indexes. When I ran `ALTER TABLE page ADD UNIQUE name_title(page_namespace, page_title);` I got this error message: `ERROR 1062 (23000): Duplicate entry '0-Main_Page' for key 'name_title'`. The other query `ALTER TABLE revision ADD UNIQUE rev_page_id(rev_page, rev_id);` worked just fine, however. – Brian Schmitz Jun 07 '12 at 23:54
  • Sorry I'm resorting to a phpmyadmin screenshot again. I searched for Main_Page in the page table and saw this http://postimage.org/image/uf1jf2t5p/. I have a hunch that the issue lies in the fact that a Main_Page with index 0 existed in the database before the import because the MediaWiki API probably included its own version of 0-Main_Page by default, which probably differs slightly from the Wikipedia 0-Main_Page. Once again on a hunch I'm guessing I would drop one of the 0-Main_Page entries and try running the query again, but let me know if you think I'm in the wrong ballpark. – Brian Schmitz Jun 08 '12 at 00:37
  • Oops, I meant to say "a Main_Page with page_namespace 0". – Brian Schmitz Jun 08 '12 at 00:50
  • @BrianSchmitz You're absolutely right. You're trying to define a unique key as the combination of page_namespace and page_title, but you have two rows with page_namespace 0 and page_title Main_Page. Drop one and you should be fine. – saccharine Jun 08 '12 at 01:02
  • Everything works beautifully. For anyone who is curious, it took on the order of 12 hours to completely re-index Wikipedia on a Dell XPS 15 laptop with an i7 processor and 2 GB of RAM allocated to a Virtual Box machine (Ubuntu 12.04 LTS) hosted on Windows 7. Of those 12 hours, 9 were spent on the two queries involving the text table. After re-indexing, Wikipedia roughly doubled in size, taking up a whopping 100 GB. – Brian Schmitz Jun 08 '12 at 04:15
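
For anyone who hits the same ERROR 1062 on name_title, a sketch of locating and removing the pre-import stub row. The page_id in the DELETE is a placeholder: run the SELECT first and decide which of the two rows you actually want to keep.

```sql
-- find the colliding rows
SELECT page_id, page_namespace, page_title
FROM page
WHERE page_namespace = 0 AND page_title = 'Main_Page';

-- drop the stub that existed before the import
-- (replace 1 with the page_id of the row you want to remove)
DELETE FROM page WHERE page_id = 1;
```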