How to split a huge XML file into smaller files after the nth occurrence of a certain tag?

Question

I have a 30 GB XML file, and I would like to split it into smaller files.

The data in the file is like this:

<film>.....</film>
.
.
.
.
.
.
<film>.....</film>

I could use "split -l" but the problem is that some film-elements contain text-data with line breaks. So one film-element may take more than one line.

What I would like to do is to split it so that each new smaller file would contain for example 3000 film-elements. So it should split it after every 3000th film-tag...

I am using Mac OS X and I would like to have an awk solution.

I tried to use this: "split file on Nth occurrence of delimiter", but didn't succeed... It didn't split the files after the closing film tags...

gawk **IS** an awk, just like every tool named awk is some variant (awk, nawk, tawk, mawk, gawk, /usr/xpg4/bin/awk, etc.), so saying "it's awk not gawk" doesn't really make sense. Can you install gawk? You're missing a ton of extremely useful functionality without it. [edit] your question to contain concise, testable sample input and expected output so we can start to help you. Replace all those `...`s with truly representative sample values, and if you have to deal with line breaks in some locations, include those in your samples. — Ed Morton, Mar 23 '17 at 21:30

score 3 · Answer 1 · answered Mar 23 '17 at 21:43

A job for streaming XSLT 3.0:

<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="true" on-no-match="shallow-copy"/>    
  <xsl:template match="/*">
    <xsl:for-each-group select="*" group-adjacent="(position()-1) idiv 3000">
      <xsl:result-document href="chunk{position()}.xml">
        <xsl:copy>
          <xsl:copy-of select="."/>
        </xsl:copy>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>
</xsl:transform>

This is going to be much more robust than an awk solution because it actually parses the XML, so it guarantees well-formed input and well-formed output. When you're processing 30 GB, you can't check the output by hand, so there's a grave danger of undetected garbage if you fail to anticipate everything that can arise in the input (e.g. a film with "film" in its title). So working properly on the structure of the markup is much safer.

The other thing is that if your input is well-formed XML, it has a wrapper element around the <film> elements, and if the output is to be processed as XML, it will need a similar wrapper element. The XSLT solution handles this for free.

As you may have noticed, this stylesheet can split ANY xml file into chunks, and of course the chunk size could easily be supplied as a parameter.

answered Mar 23 '17 at 21:43

Michael Kay


Is what you posted a script and, if so, what's the name of the tool that would execute that script and how would the OP execute it on OSX against his input? If it's not a script - what is it? I did google "streaming XSLT 3.0" but didn't want to wade through the first hit, http://www.saxonica.com/html/documentation/sourcedocs/streaming/xslt-streaming.html – Ed Morton Mar 23 '17 at 21:48
@EdMorton I agree that the answer would benefit from a working example. I found this: http://www.saxonica.com/documentation9.5/using-xsl/commandline.html but haven't tried it. A little custom Java program could also be an option. For any professional, repeating use case I would definitely recommend using XSLT. – hek2mgl Mar 23 '17 at 22:11
If the answer to "What command do I use to run it?" is "Create a custom Java program" then I don't care how robust it is... :-). Isn't there some tool that just uses it? – Ed Morton Mar 23 '17 at 22:34
I'm a little surprised that anyone should be handling XML without at least some awareness of XSLT - it's a bit like finding an electrician without a screwdriver. As with any tool, however, there's a little bit of a learning curve even when you're running a program that's been written for you. At present there are two streaming XSLT 3.0 processors available, Saxon-EE (Java) and Exselt (.NET) and you will need to install an evaluation copy of one or the other. For Saxon it's then simply `java -jar saxon9ee.jar -s:source.xml -xsl:style.xsl -o:outdir` on the command line. – Michael Kay Mar 23 '17 at 22:57
@MichaelKay The fact that there is no open source implementation so far (as it sounds) limits the usefulness significantly. I was not aware of that. – hek2mgl Mar 23 '17 at 23:19
Yes, but xsltproc won't handle 30Gb. Unfortunately your requirements are getting a bit beyond what the open source tools are capable of. – Michael Kay Mar 24 '17 at 00:37
@dawg You most definitely cannot run streaming XSLT 3.0 with xsltproc. It hasn't even been updated to XSLT 2.0. (It was a great piece of open source software written by an enthusiast who then stopped work on it because he needed to earn a living...) – Michael Kay Mar 24 '17 at 09:53
@MichaelKay screwdrivers are readily available in any hardware store and have many uses beyond electrical work. What you're describing is more like a general contractor/handyman being asked to replace a fuse but not being aware of a pair of VR goggles designed to display electrical currents that are only available by special order from google after going through a screening process. Sure, it's great technology but unless anyone can pick it up at their local hardware store on a whim you shouldn't be surprised when people (esp. non-electricians - we're not creating the XML!) haven't heard of it. – Ed Morton Mar 24 '17 at 12:33

Ed Morton · Accepted Answer · 2017-03-23T21:55:24.190

2

Chances are something like this is what you're looking for:

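# Stage 1 escapes "@" and "}" in the data and turns every </film> into a single "}";
# stage 2 then reads "}"-terminated records, so each record is one film element.
awk '{ gsub(/@/,"@A"); gsub(/}/,"@B"); gsub(/<\/film>\n?/,"}") } 1' file |
awk -v RS='}' -v ORS='</film>' '
    # start a new output file at the first record of every group of 3000
    (NR%3000)==1 { close(out); out="out" (++cnt) }
    # undo the escaping and write the record to the current output file
    { gsub(/@B/,"}"); gsub(/@A/,"@"); print > out }
'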

but without sample input/output it's a guess and, of course, untested.
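
A quick way to try the approach on a small made-up sample before running it on the real file. This is only a sketch: the sample data, the file name sample.xml and the chunk size n=2 are assumptions, and the `/<film/` guard is an addition that skips whatever follows the last </film> so that it is not written out with a closing tag appended.

```shell
# Made-up sample: 5 film elements, some spanning several lines,
# one containing a literal "}" and one containing a literal "@".
printf '%s\n' \
  '<film>one</film>' \
  '<film>two' 'has a } brace</film>' \
  '<film>three @ home</film>' \
  '<film>four' 'more' 'lines</film>' \
  '<film>five</film>' > sample.xml

# Same two-stage idea, with the chunk size passed in as n.
awk '{ gsub(/@/,"@A"); gsub(/}/,"@B"); gsub(/<\/film>/,"}") } 1' sample.xml |
awk -v RS='}' -v ORS='</film>' -v n=2 '
    # skip the leftover text after the last closing film tag
    !/<film/ { next }
    # first film of a new group: close the old file, pick the next name
    (films++ % n) == 0 { close(out); out = "out" (++cnt) }
    # undo the escaping and write the film to the current file
    { gsub(/@B/,"}"); gsub(/@A/,"@"); print > out }
'
```

With this sample, out1 and out2 should each hold two films and out3 the remaining one. Every piece after the first starts with the newline that followed the previous </film>.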

edited Mar 23 '17 at 21:55

answered Mar 23 '17 at 21:40

Ed Morton


@ed morton, your script is working fine. I need to prepend some text to the file after split. any idea how to do with above script? – Kumar V Mar 23 '18 at 13:19
Ask a new question with its own concise, testable sample input and expected output. – Ed Morton Mar 23 '18 at 14:35

score 1 · Answer 3 · edited May 23 '17 at 11:46

When Ed Morton posts an awk solution, it is usually a small tutorial for low-level users like me...

But since I have been working on this exercise for the last hour and a half, I thought I would take the risk and post this solution anyway; it is an adaptation of the answer in the link you already found:

$ awk '$0 ~/<film.*>/{++delim} {file = sprintf("chunk%s", int(delim/7)); print >file; }' file4

Testing:
I used a small bash loop to create a small film file with 49 records and split those films into groups of 7 for testing:

$ for ((i=1;i<50;i++));do echo -e "<film$i>..............</film$i>" >>file4;done
$ head file4
<film1>..............</film1>
<film2>..............</film2>
<film3>..............</film3>
<film4>..............</film4>
<film5>..............</film5>
<film6>..............</film6>
<film7>..............</film7>
<film8>..............</film8>
<film9>..............</film9>
<film10>..............</film10>

$ awk '$0 ~/<film.*>/{++delim} {file = sprintf("chunk%s", int(delim/7)); print >file; }' file4 
$ cat chunk0
<film1>..............</film1>
<film2>..............</film2>
<film3>..............</film3>
<film4>..............</film4>
<film5>..............</film5>
<film6>..............</film6>

Another test in which each film has some newlines:

$ for ((i=1;i<50;i++));do echo -e "<film$i>...\n...\n...\n.....</film$i>" >>file4;done
$ head -n20 file4
<film1>...
...
...
.....</film1>
<film2>...
...
...
.....</film2>
<film3>...
...
...
.....</film3>
<film4>...
...
...
.....</film4>
<film5>...
...
...
.....</film5>


$ awk '$0 ~/<film.*>/{++delim} {file = sprintf("chunk%s", int(delim/7)); print >file; }' file4 

$ ls chunk*
chunk0  chunk1  chunk2  chunk3  chunk4  chunk5  chunk6  chunk7

$ cat chunk1
<film7>...
...
...
.....</film7>
<film8>...
...
...
.....</film8>
<film9>...
...
...
.....</film9>
<film10>...
...
...
.....</film10>
<film11>...
...
...
.....</film11>
<film12>...
...
...
.....</film12>
<film13>...
...
...
.....</film13>

Well, in both cases it seems to work OK. Note that in this configuration the input file is split per 7 films, not per 7 lines (chunk0 gets only 6 because the counter is already 1 when the first film is written). You can change this number to whatever you need.
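
The group size can also be passed in as a variable. Below is a minimal sketch of that variation, not the code tested above. The names n, count, idx and the part prefix are made up. It assumes at most one opening <film> tag per line, counts from zero so the first file also gets a full group, skips any lines that come before the first film, and closes each finished file so awk does not keep a large number of files open on a big input.

```shell
# Made-up sample: 5 films, two of them spanning several lines.
printf '<film>a\nb</film>\n<film>c</film>\n<film>d\ne\nf</film>\n<film>g</film>\n<film>h</film>\n' > films.txt

awk -v n=2 '
    /<film[ >]/ {                        # a line that opens a film element
        if (count++ % n == 0) {          # first film of a new group
            if (file != "") close(file)  # release the finished chunk
            file = "part" (++idx)
        }
    }
    file != "" { print > file }          # every line goes to the current chunk
' films.txt
```

With n=2 this should produce part1 and part2 with two films each and part3 with the last one; for the case in the question, n would be 3000.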

Unfortunately, this doesn't seem to work for me... I think it has the same problem as the original code: it doesn't find the correct tags... It just makes one file which is identical to the original. But thanks anyway :) — jubakala, Mar 24 '17 at 11:11
@jubakala Thank you for your feedback. Just to understand something: does the way I simulated your file and the test I ran on my Debian match your real case, or have I got it wrong? — George Vasiliou, Mar 24 '17 at 12:38
In the real case, the film-elements are not numbered like <film1>, etc... They're just <film>...</film>. But I don't think that is the problem, because I changed it when I tested it. — jubakala, Mar 24 '17 at 20:40
@jubakala I don't know why this method fails. I also tested this code on a non-GNU awk (FreeBSD) and it seems to work fine. — George Vasiliou, Mar 24 '17 at 20:54
@jubakala By the way, I did a second test with a file containing just <film> (no numbers) using the same regex (`awk '$0 ~/` and it worked fine — George Vasiliou, Mar 24 '17 at 22:07
