parsing/scanning through a 17gb xml file

Question

I am trying to parse the stackoverflow dump file (Posts.xml- 17gb) .It is of the form:

<posts>
<row Id="15228715" PostTypeId="1" />
.
<row Id="15228716" PostTypeId="2" ParentId="1600647" LastActivityDate="2013-03-05T16:13:24.897"/>
</posts>

I have to 'group' each question with their answers. Basically find a question (posttypeid=1) find its answers using parentId of another row and store it in db .

I tried doing this using querypath (DOM), but it kept exiting(139) . My guess is because of the large size of the file, my PC couldn't handle it, even with huge swap.

I considered xmlreader, but as I see it using xmlreader, the program would be reading through the file a whole lot of times(find question, look for answers, repeat a lot of times) and hence is not viable. Am I wrong ?

Is there any other method/way ?

Help!

It is a one time parsing.

@Baba Not really. See http://lxr.php.net/xref/PHP_TRUNK/ext/simplexml/simplexml.c#2249. It'll load the whole file into memory and then iterate over it. — bwoebi, Jun 02 '13 at 10:17
XMLReader will iterate through each node without loading the whole file into memory, but it will only do so once unless you state otherwise. — Anthony Sterling, Jun 02 '13 at 10:25
It's funny to have to put 15 of the 16 GB into swap and then read from it. it'll take hours. — bwoebi, Jun 02 '13 at 10:27
I'd consider to fread through the file, from one tag to the next, manually. I don't see any other way here to not have to swap? — bwoebi, Jun 02 '13 at 10:27
@bwoebi Wow ... Interesting .... but even `fseek` would not work since its `XML` — Baba, Jun 02 '13 at 10:36

score 5 · Accepted Answer · edited Jun 03 '13 at 09:20

I considered xmlreader, but as I see it using xmlreader, the program would be reading through the file a whole lot of times(find question, look for answers, repeat a lot of times) and hence is not viable. Am I wrong ?

Yes you are wrong. With XMLReader you specify your own how often your want to traverse the file (you normally do it once). For your case I see no reason why you should not be able to even insert this 1:1 on each <row> element. You can decide per the attribute which database (table?) you would like to insert into.

I normally suggest a set of Iterators that make traversing with XMLReader easier. It's called XMLReaderIterator and allows to foreach over the XMLReader so that the code is often easier to read and write:

$reader = new XMLReader();
$reader->open($xmlFile);

/* @var $users XMLReaderNode[] - iterate over all <post><row> elements */
$posts = new XMLElementIterator($reader, 'row');
foreach ($posts as $post)
{
    $isAnswerInsteadOfQuestion = (bool)$post->getAttribute('ParentId')

    $importer = $isAnswerInsteadOfQuestion 
                ? $importerAnswers 
                : $importerQuestions;

    $importer->importRowNode($post);
}

If you are concerned about the order (e.g. you might fear that some answers parent's aren't available while the answers are), I would take care inside the importer layer, not inside the traversal.

Depending if that happens often, very often, never or quite never I would use a different strategy. E.g. for never I would insert directly into database tables with foreign key constraints activated. If often, I would create an insert transaction for the whole import in which the key constraints are lifted and re-activated at the end.

Thanks for your time, you were right that I was wrong. I was successfully able to use xmlreader to solve my predicament. 2 things I noticed: the answers were always after the question (obviosuly) and the answers were present within the next few thousand rows (few exceptions are ok), so wrote the script to cache a few thousand questions, look for their answers , array_splice half the array and repeat! Thanks again — gyaani_guy, Jun 03 '13 at 17:27

score 2 · Answer 2 · answered Jun 02 '13 at 13:22

2

Because the way you are processing this large file isn't sequential but requires direct access, I think the only viable option is to load the data into an XML database.

answered Jun 02 '13 at 13:22

Michael Kay

156,231
11
92
164

agree. Sometimes I think people overlook the most simple of solutions. – Woot4Moo Jun 02 '13 at 13:26

score 1 · Answer 3 · edited Mar 14 '14 at 01:55

Using PHP xmlreader seems to be the right thing to do.

Reason: Because of your statement:

I have to 'group' each question with their answers. Basically find a question (posttypeid=1) find its answers using parentId of another row and store it in db.

What I understand is you like to build a database with questions an answers. Therefore, there is no reason to do the "grouping" on the XML level. Put all relevant information in the database and do the grouping on the DB level - with db commands (sql ...).

What you have to is use something like "Using the target parser method" E.g [High-performance XML parsing in Python with xml (Even if it is for Python, it's a good start). This should be possible with XMLReader.

parsing/scanning through a 17gb xml file

3 Answers3

Linked

Related