I have a large text file that has a repeating set of data with the header -XXXX- and the footer $$$$ for each entry. There are around 20k entries and I would like to separate it out into files of 500 entries each.

I've been toying around with awk and am using the command below, which is close. Each file starts with -XXXX- but every file after the first has a partial entry at the end.

awk "/-XXXX-/ { delim++ } { file = sprintf(\"file%s.sdf\", int(delim / 500)); print > file; }" < big.sdf

For example:

-XXXX-
Beginning
Middle
End
$$$$
-XXXX-
Beginning

I instead want each file to end right after the $$$$.

I am using awk on Windows.

Jonathan Leffler
macaday
  • When you say 'every file after the first has a partial entry at the end', are you describing the input data file(s) or the output you are currently getting? – Jonathan Leffler Aug 30 '16 at 20:52
  • The output I'm currently getting. The first file correctly cuts off right below the $$$$. Subsequent files include a partial entry after -XXXX-. NOTE: I've found if I run the above code in a Cygwin shell on my Windows box I get the correct behavior but if I run it through the command prompt in Windows it messes up as described above. – macaday Sep 06 '16 at 22:13
  • It is going to be very hard for me to work out what's going on. I don't have access to any Windows machines any more — for the first time in a couple of decades — so I can't try replicating the problem very easily. What you say sounds peculiar. Are the files terminated with a newline (CRLF on Windows)? If not, that might account for some of what you're seeing. – Jonathan Leffler Sep 07 '16 at 04:45
  • Which files do you mean? Input or output? – macaday Sep 08 '16 at 18:55
  • Input files, primarily, but I'm trying to make sure I understand the questions, since presumably there are ongoing problems (you wouldn't be asking questions still if there weren't). You could revise the question to make it clear what you're talking about at each point; it would be good if you could show an input file with, say, 6 entries in it, and what you get when you split that into 3 files of 2 entries, and/or 2 files of 3 entries each. If worst comes to worst, send me email (see my profile) with a sample data file and your script and sample output files (what you get and what you want). – Jonathan Leffler Sep 08 '16 at 19:21

1 Answer

So if each set of data between -XXXX- and $$$$ is a record, you want to write 500 records at a time to separate files? It seems like you need two counters - one for the output filename that just goes up, and another for the number of records in the current "batch", which goes up to 500, but then gets reset to zero for the next batch. Something like:

BEGIN {fctr=1 ; rctr=0 ; file=("file" fctr ".sdf")}
/^\$\$\$\$$/ {print > file ; rctr+=1}
rctr==500 {fctr+=1 ; file=("file" fctr ".sdf") ; rctr=0}
!/^\$\$\$\$$/ {print > file}
  • Line 1 sets the initial values and starts off with file1.sdf

  • Line 2 matches the footer of each record, and we increment the record counter every time we see one (as well as writing out the current footer)

  • Line 3 is for when we reach 500 records. First move to the next filename, then reset the record count back to zero

  • Line 4 is for all the regular lines. Just send them to whatever is the current filename
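A quick way to check the two-counter idea on a small input, along the lines of the 6-entry test suggested in the comments. This is only a sketch under a few assumptions: the sample file name sample.sdf and the batch-size variable n are made up for the test, the two print rules are merged into one unconditional print (same output), and a close(file) call is added so finished files are not kept open. It also assumes a POSIX-style shell such as Cygwin, where the single quotes protect the $ characters.

```shell
# Build a 6-record sample file (sample.sdf is a hypothetical name).
for i in 1 2 3 4 5 6; do
  printf '%s\n' '-XXXX-' "Entry $i" '$$$$'
done > sample.sdf

# Same two-counter logic; n (assumed name) is the batch size, 2 here instead of 500.
awk -v n=2 '
BEGIN { fctr = 1; rctr = 0; file = "file" fctr ".sdf" }
{ print > file }                  # every line goes to the current file
/^\$\$\$\$$/ {                    # footer line: one more record is complete
    rctr++
    if (rctr == n) {              # batch is full: close it and move to the next file
        close(file)
        fctr++
        file = "file" fctr ".sdf"
        rctr = 0
    }
}
' sample.sdf
```

With n=2 this should leave three files of two complete entries each, and each file should end on a $$$$ line. For the real data, n would be 500 and the input would be big.sdf.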

Ian McGowan