
I have a single long text file that contains a list of 3D coordinates. The file begins with a header like this:

10112
2455
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1

After that, the list of coordinates starts. Each line consists of 3 to 7 numbers. For example:

0.001686 0.812066 -1.686245 0.074434
0.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
...

The total length of the list equals the product of the first two numbers of the header (10112*2455). These are PTX files, which contain 3D points from laser scanning in text format.

The problem is that the file is a concatenation of such header-plus-coordinates sections, and I want to split it at each header. Ideally the split would happen at the two consecutive lines that each contain a single integer. I was looking for a generic solution using, for example, csplit, but csplit matches one line at a time, so it cannot detect two consecutive lines.

As a last resort I will write a piece of software myself, but I would prefer a solution based on CLI tools (Awk?), if one is available.

Any ideas?

Thank you

Edit: examples

Let's say I have a file with the following content:

2
3
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
6.001686 0.812066 -1.686245 0.074434
3.001695 0.816359 -1.692300 0.087190
6.001699 0.818673 -1.694508 0.097398
2.001686 0.812066 -1.686245 0.074434
1.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
3                                         <--- cut before this line
1
421.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
421.417670 172.321300 1.704072 1
1.001686 0.812066 -1.686245 0.074434
2.001695 0.816359 -1.692300 0.087190
3.001699 0.818673 -1.694508 0.097398

In this case I should end up with two files, cut just before the first of the two lines that each consist of a single integer.

Alternatively, since the two single-number lines give the number of points in the section, the first output file consists of the first 2*3+10=16 lines (10 lines of header and 6 of data), and the second file consists of the following 3*1+10=13 lines (again 10 lines of header, this time with 3 of data).

  • Do you mean that there are many headers within the same file and you want to break based on this? – fedorqui Sep 17 '14 at 12:21
  • Exactly. I just need to split the original file before each occurrence of something that looks like a header, that is, something that starts with two lines each consisting of a single integer. – Federico Russo Sep 17 '14 at 12:35

1 Answer


So you want to split a file into different ones, printing the header in all of them.

This should do it. Pass the number of lines per output file with -v lines=XX and the number of header lines to repeat in each file with -v head=YY:

awk -v lines=5 -v head=2 '
      # keep the first "head" lines so they can be repeated in every output file
      NR<=head {header[NR]=$1; next}
      # every "lines" lines after the header, start a new file and write the header to it
      !((NR-head-1)%lines) {file="output_"++count; for (i=1;i<=head;i++) print header[i] > file}
      {print > file}
     ' file

One-liner:

awk -v lines=5 -v head=2 'NR<=head{header[NR]=$1; next} !((NR-head-1)%lines) {file="output_"++count; for (i=1;i<=head;i++) print header[i] > file} {print > file}' file

For your specific sample input, giving head=2 and lines=5, it returns two files:

$ cat output_1
10112
2455
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
$ cat output_2
10112
2455
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1

If what you want is to split the file for every header you find, this should do:

awk '(!flag && NF==1) {header[1]=$1; flag=1; next} (flag && NF==1) {header[2]=$1; flag=0; file="output_"++count; printf "%d\n%d\n", header[1], header[2] > file; next} {print > file}' file

Explanation

  • (!flag && NF==1) {header[1]=$1; flag=1; next} if no flag is set and the line has a single field, assume it is the first line of a header and store it.
  • (flag && NF==1) {header[2]=$1; flag=0; file="output_"++count; printf "%d\n%d\n", header[1], header[2] > file; next} if the flag is set, the first header line has already been captured and this is the second one. Unset the flag, build the file name as output_ + number, and write the two stored header lines to it.
  • {print > file} in all other cases, print the current line into the current file.
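The same idea can be spelled out over several lines. The following is only a sketch under a few assumptions: the file names sample.ptx and section_N.ptx are made up, the sample is an abbreviated two-section file rather than real PTX data, a header is recognised strictly as two consecutive lines that each hold one unsigned integer, and each output file is closed before the next one is opened so that a file with many sections does not exhaust open file handles.

```shell
# Abbreviated two-section sample (hypothetical file name sample.ptx).
printf '%s\n' \
  2 3 '121.417670 172.321300 1.704072' '6.001686 0.812066 -1.686245 0.074434' \
  3 1 '421.417670 172.321300 1.704072' '1.001686 0.812066 -1.686245 0.074434' \
  > sample.ptx

awk '
  # A line holding one unsigned integer may be the first or second header line.
  NF == 1 && $1 ~ /^[0-9]+$/ {
      if (pending != "") {               # second consecutive integer: a new section starts
          if (file != "") close(file)    # keep the number of open files small
          file = sprintf("section_%d.ptx", ++count)
          print pending > file
          print > file
          pending = ""
      } else {
          pending = $0                   # hold the first integer until the next line is seen
      }
      next
  }
  # Any other line is data. A held lone integer was data as well, so flush it first.
  # Lines that appear before the first header are dropped in this sketch.
  {
      if (file != "") {
          if (pending != "") print pending > file
          print > file
      }
      pending = ""
  }
  END { if (pending != "" && file != "") print pending > file }
' sample.ptx
```

With the abbreviated sample, section_1.ptx and section_2.ptx each receive their own two header lines followed by their own data lines.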

Given your sample file, the one-liner returns output_1 and output_2:

$ cat output_1
2
3
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
6.001686 0.812066 -1.686245 0.074434
3.001695 0.816359 -1.692300 0.087190
6.001699 0.818673 -1.694508 0.097398
2.001686 0.812066 -1.686245 0.074434
1.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
$ cat output_2
3
1
421.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
421.417670 172.321300 1.704072 1
1.001686 0.812066 -1.686245 0.074434
2.001695 0.816359 -1.692300 0.087190
3.001699 0.818673 -1.694508 0.097398
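The question also points out that the two integers give the number of points in a section, so another option is to count lines instead of looking for single-field lines. This is a rough sketch of that alternative, not a tested PTX tool: the names scans.ptx and part_N.ptx and the variable hdr are made up for illustration, and hdr=10 assumes every header is exactly 10 lines long including the two count lines, as stated in the question.

```shell
# The example file from the question (hypothetical file name scans.ptx).
cat > scans.ptx <<'EOF'
2
3
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
6.001686 0.812066 -1.686245 0.074434
3.001695 0.816359 -1.692300 0.087190
6.001699 0.818673 -1.694508 0.097398
2.001686 0.812066 -1.686245 0.074434
1.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
3
1
421.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
421.417670 172.321300 1.704072 1
1.001686 0.812066 -1.686245 0.074434
2.001695 0.816359 -1.692300 0.087190
3.001699 0.818673 -1.694508 0.097398
EOF

awk -v hdr=10 '
  # "left" counts the lines still owed to the current section; 0 means a new one starts here.
  left == 0 {
      if (file != "") close(file)
      file = sprintf("part_%d.ptx", ++count)
      first = $1          # first count line of the header
      n = 0
      left = hdr          # provisional, corrected once the second count is read
  }
  { print > file; n++; left-- }
  # After the second count line the section length is known: first*second + hdr.
  n == 2 { left = first * $1 + hdr - 2 }
' scans.ptx

wc -l part_1.ptx part_2.ptx
```

For this input, part_1.ptx should hold 16 lines and part_2.ptx 13 lines, matching the 2*3+10 and 3*1+10 arithmetic from the question. Unlike the NF==1 test, this version depends on the counts in each header being correct.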
Tom Fenech
fedorqui
  • Thank you. But I need to preserve all the contents of the original file, just splitting it into different files. The original file needs to be cut just before each header. – Federico Russo Sep 17 '14 at 12:19
  • @FedericoRusso then update your question with some sample representative input and desired output. Now it is not very clear what you mean. – fedorqui Sep 17 '14 at 12:21