I have one file containing several elements of the form <elem>...</elem>. I need to split it into n files, each holding a given number of elements (the number is passed as an argument to the awk command I am using). For example, if the original file has 40 elements, I might want to split it into 3 files (10, 13 and 17 elements).

The problem is that the original file has elements with different structures.

EDITED AFTER fedorqui comment:
I run one awk command per output file. That means that if I need 3 files
with m1, m2 and m3 elements, I will run awk 3 times with different parameters.

Example of input (file.txt) (5 elements)

<elem>aaaaaaaa1</elem>
<elem>aaaaaaaa2</elem>
<elem>bbbbbbbb
bbbbbbbbb
bbbbbbbbb</elem>
<elem>bbbbbbbb2</elem>
<elem>ccccc

cccc</elem>

As you can see, the 1st, 2nd and 4th elements are each on one line, the 3rd element spans 3 lines with no blank line, and the 5th element spans 3 lines, one of which is blank.

Blank lines between elements are not a problem, but blank lines inside an element make the command fail.

Example of desired output:

file_1.txt (2 elements)

<elem>aaaaaaaa1</elem>
<elem>aaaaaaaa2</elem>

file_2.txt (2 elements)

<elem>bbbbbbbb
bbbbbbbbb
bbbbbbbbb</elem>
<elem>bbbbbbbb2</elem>

file_3.txt (1 element)

<elem>ccccc

cccc</elem>

AWK command

(suffixFile is the suffix number of the file. For example fileAux_1.txt, fileAux_2.txt...)

Attempt 1

awk -v numElems=$1 -v suffixFile=$2 '{
    for(i=1;i<=numElems;i++) {
        printf "<doc>"$i > "fileAux_" suffixFile".txt"
    }
}' RS='' FS='<doc>' file.txt

This works except when there are blank lines inside an element. I understand why it fails: RS='' puts awk in paragraph mode, so records are split at blank lines.
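A minimal way to see the effect, as a sketch: the block below recreates the 5-element sample in a file called sample.txt (a hypothetical name used only for this illustration) and counts the records awk sees in paragraph mode. The blank line inside the last element starts a new record, so the element is cut in two.

```shell
# Recreate the 5-element sample (sample.txt is a hypothetical file name)
printf '%s\n' \
  '<elem>aaaaaaaa1</elem>' \
  '<elem>aaaaaaaa2</elem>' \
  '<elem>bbbbbbbb' \
  'bbbbbbbbb' \
  'bbbbbbbbb</elem>' \
  '<elem>bbbbbbbb2</elem>' \
  '<elem>ccccc' \
  '' \
  'cccc</elem>' > sample.txt

# RS="" is paragraph mode: records are separated by blank lines
awk 'BEGIN { RS = "" } END { print NR " records" }' sample.txt
# prints "2 records": the second record is just "cccc</elem>"
```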

Attempt 2

awk -v numElems=$1 -v suffixFile=$2 '{
    for(i=1;i<=numElems;i++) {
        printf $i > "fileAux_" suffixFile".txt"
    }
}' RS='<doc>' FS='<doc>' file.txt

Another approach, but it also fails.

Can anyone help me?

Thanks in advance!

javi
  • How do you determine what should go to `file_1` and what to `file_2`? Is it based on the first letter in the content of the `<elem>` tag? – fedorqui Jan 09 '15 at 12:44
  • It is not important which goes to which file (I pass two arguments to the awk command to handle this). I mean that I use (for example) "awk -v a=3 -v b=1 .... file.txt" to send 3 elements to file_1.txt and "awk -v a=2 -v b=2 ... file.txt" to send 2 elements to file_2.txt. – javi Jan 09 '15 at 12:46
  • I have edited the post with this clarification (thanks) – javi Jan 09 '15 at 12:52
  • You are detailing the question very well, but the whole problem is still a bit unclear to me. Could you provide a sample input that works, together with the output? Then, show a sample input that does not (I assume it is the one you are showing now) and indicate exactly what part of the output is wrong? – fedorqui Jan 09 '15 at 12:55
  • Ok fedorqui. I will answer you in a reply because the edit textarea is worse than the reply textarea – javi Jan 09 '15 at 13:03
  • No, no, please edit your question, answers are for answering :) You can make the edit section bigger by dragging it down. Just post the command you use that works (`awk -v numElems=1`, etc) and the one that does not. – fedorqui Jan 09 '15 at 13:10
  • Ouch, I did it right now. Sorry for the inconvenience :( – javi Jan 09 '15 at 13:13
  • Give a read to [How do I ask a good question?](http://stackoverflow.com/help/how-to-ask), especially the section "Include just enough code to allow others to reproduce the problem. For help with this, read How to create a Minimal, Complete, and Verifiable example." – fedorqui Jan 09 '15 at 13:15
  • Hi fedorqui. I think it is clarified now (I have changed the title and I have given you an example in the reply) – javi Jan 09 '15 at 13:38
  • I've gone through your code, done some research and could not find a proper solution. As it is an XML file, I would suggest using an XML parser, in Python for example. – fedorqui Jan 10 '15 at 14:02

1 Answer

Assuming I understood your challenge correctly, here is my attempt:

$ cat script.sh 
#!/bin/bash

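awk -v numElems="$1" -v suffixFile="$2" '
        /<elem>/   {var++}              # an element opens on this line
        /<\/elem>/ {var--; count++}     # an element closes: one more complete element
        {
                # Write to the file until numElems elements have been closed
                if (count < numElems || (count == numElems && var == 0)) {
                        # >> appends, so delete old file_N.txt before re-running
                        print $0 >> ("file_" suffixFile ".txt")
                } else {
                        # The rest goes to stdout for the next stage of the pipe
                        print $0
                }
        }' ${3:+"$3"}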

The script keeps track of opening <elem> and closing </elem> tags with var, and counts completed elements with count. An if statement then decides whether to write the current line to the file or not. Once numElems elements have been written, the rest of the input is printed to standard output, so you can repeat the process using pipes.

Here is an example of how to run it with the final output:

$ ./script.sh 2 1 file.txt | ./script.sh 2 2 | ./script.sh 1 3
$ tail -n +1 file_*
==> file_1.txt <==
<elem>aaaaaaaa1</elem>
<elem>aaaaaaaa2</elem>

==> file_2.txt <==
<elem>bbbbbbbb
bbbbbbbbb
bbbbbbbbb</elem>
<elem>bbbbbbbb2</elem>

==> file_3.txt <==
<elem>ccccc

cccc</elem>
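
If running one awk per output file is not a requirement, the same counting idea can be applied in a single pass. The following is only a sketch, not the script above: split_elems and the sizes variable are hypothetical names, and it assumes, like the script above, that at most one </elem> appears per line and that every requested size is at least 1.

```shell
# Sketch: split in one pass. $1 = sizes such as "2 2 1", $2 = input file.
split_elems() {
    awk -v sizes="$1" '
        BEGIN { n = split(sizes, want, " "); f = 1 }
        {
            # Lines after the last requested element are dropped
            if (f > n) next
            out = "file_" f ".txt"
            print > out
        }
        /<\/elem>/ {
            # One more element finished; switch files when this one is full
            if (++count == want[f]) { close(out); f++; count = 0 }
        }' "$2"
}
```

Calling split_elems "2 2 1" file.txt on the sample input should produce the same three files in one run. Unlike the piped version it overwrites existing file_N.txt files instead of appending to them.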
Emer
  • wow I just realised this was asked more than 2 years ago! sorry for the late reply :P – Emer May 31 '17 at 22:40