break text file into multiple text files

Question

I have tried my best to understand a very similar StackOverflow question, but I cannot for the life of me make either the proposed gawk or split solutions to work in my case.

I have a large text file consisting of 288 proposals, each of which is 300 to 500 words long and in a varying number of paragraphs (so no consistent line count). Each proposal is headed, however, by an identifier of the following nature: --###-- or --####--. There is no closing marker -- though I suppose I could insert one by doing some regex search and replace on the original file before splitting it into multiple files. What I want is a collection of 288 individual text files, each of which is named by the number between the two pairs of dashes. If it makes things any easier, I can easily split the file between those proposals headed by three-digit numbers and those headed by four-digit numbers.

In a nutshell, I want to do this:

#! /bin/env bash or python

Split all_proposals.txt into 121.txt, 122.txt, etc.

Where all_proposals.txt consists of:

  --121--

  One Line Title of Proposal

  Followed by several paragraphs each on a line of variable length.

  Another paragraph for effect.

  --122--

  More lines indeterminate in number.

It seems like your question is, "Will you please write this code for me," which isn't what Stack Overflow is for. Have you tried something that didn't work? — Ned Batchelder, Jun 18 '12 at 20:12
`/bin/env` is not the standard location. You want `/usr/bin/env`. — William Pursell, Jun 18 '12 at 20:56
@NedBatchelder: Previous attempts include, but are not limited to: `csplit abstracts.txt '/--[0-9][0-9][0-9]--/' '{186}'`, `csplit -f abs abstracts.txt '/--[0-9][0-9][0-9]--/' '{186}'`, `awk '/--\d/ {f=1;c++} {print > "session."i}' abstracts.txt`, and `gawk -vRS='\n--\[0-9]{3}--\n' -vprefix="file" '{print > prefix "ab-"NR".tx t"}' abstracts_no_id.txt` ... so, yes, I tried a variety of things that didn't work. — John Laudun, Jun 19 '12 at 01:02
And that doesn't include the four Python scripts I tried and the bash script I tried. Nor does it include my reaching out to local folks for help. I admit upfront that my scripting and command line fu is quite weak. — John Laudun, Jun 19 '12 at 01:04
@WilliamPursell ... yes, thanks. I was just faking a hash-bang there, so I wasn't, I confess, paying attention. — John Laudun, Jun 19 '12 at 01:05

score 1 · Answer 1 · answered Jun 18 '12 at 20:19

Just set the name of the output file each time you see a line with the header:

awk '/--[0-9]+--/ { if (output) close(output); split($0, a, "--"); output = a[2] ".txt" }
    output { print > output }' all_proposals.txt

Note that this prints the header line into the file. If you don't want that, add a next command in the action sequence for the headers.
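Since the question also allows Python, here is a minimal sketch of the same line-by-line idea, not a definitive implementation. The function name `split_by_header`, its parameters, and the assumption that a header sits alone on its line (optionally indented) are mine, not part of the original answer.

```python
import re
from pathlib import Path

# Assumption: a header is a line holding only --digits--, optionally indented.
HEADER = re.compile(r"^\s*--(\d+)--\s*$")


def split_by_header(source, out_dir=".", keep_header=True):
    """Stream `source` line by line, switching output file at each header.

    Returns the names of the files written, in order of appearance.
    """
    out_dir = Path(out_dir)
    out = None
    written = []
    try:
        with open(source, encoding="utf-8") as src:
            for line in src:
                match = HEADER.match(line)
                if match:
                    # A new proposal starts: close the previous file, open the next.
                    if out is not None:
                        out.close()
                    path = out_dir / f"{match.group(1)}.txt"
                    out = open(path, "w", encoding="utf-8")
                    written.append(path.name)
                    if not keep_header:
                        continue  # same effect as `next` in the awk version
                if out is not None:  # skip anything before the first header
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling `split_by_header("all_proposals.txt")` would write 121.txt, 122.txt, and so on into the current directory. Only one output file is open at a time, so the number of proposals does not matter.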

William Pursell

That is the most amazing bit of `awk` I have ever seen. Like my own previous tries at `awk` the output is a duplicate of the original file but is simply re-named with the first header. I feel like there is some vital bit of information that I have missed that would make the problem obvious to someone. – John Laudun Jun 20 '12 at 03:19

score 0 · Answer 2 · answered Jun 18 '12 at 20:25

You can solve this in Python using regular expressions in only a few lines. Have a look at the docs for the re module.

The idea is to search for your identifier, which in this case could be matched with an expression like

r'(--[0-9]*--)'

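In particular, have a look at re.split, which returns the pieces of text between matches. Because the pattern above contains a capturing group, the matched identifiers are included in the result as well.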
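As a rough sketch of how that could look, assuming the whole file fits in memory and each header sits alone on its line: the function names `split_proposals` and `write_proposals` are hypothetical, and the capturing group here wraps only the digits so that each number can serve as the file name.

```python
import re
from pathlib import Path


def split_proposals(text):
    """Return {number: body} for each --NNN-- header found in `text`."""
    # With a capturing group, re.split keeps the captured numbers:
    # [preamble, number1, body1, number2, body2, ...]
    parts = re.split(r"^[ \t]*--(\d+)--[ \t]*$", text, flags=re.MULTILINE)
    numbers, bodies = parts[1::2], parts[2::2]
    # Trim surrounding blank lines and end each body with one newline.
    return {num: body.strip() + "\n" for num, body in zip(numbers, bodies)}


def write_proposals(source, out_dir="."):
    """Write each proposal in `source` to <number>.txt inside `out_dir`."""
    text = Path(source).read_text(encoding="utf-8")
    for number, body in split_proposals(text).items():
        (Path(out_dir) / f"{number}.txt").write_text(body, encoding="utf-8")
```

With this in place, `write_proposals("all_proposals.txt")` would produce 121.txt, 122.txt, and so on. The header line itself is not written to the output files, and text before the first header is ignored.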

arpd

I have an albeit tentative grasp on the regex module, and I had encountered `re.split` before, and so I can see that I could read the file in as a big string and then split it using a regex pattern. What I am not yet any good at is understanding how to walk a script through writing each of the new, small strings to separate files. – John Laudun Jun 19 '12 at 01:19

score 0 · Accepted Answer · answered Jun 18 '12 at 20:26

You can use perl:

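#!/usr/bin/perl
# Read the input file into $_ (at most 10,000,000 bytes).
open(FI,"file.txt") or die "Cannot open file.txt: $!";
read(FI,$_,10000000);
close(FI);
# Split $_ on the header marker; '--###--' stands in for the real header pattern.
@arr = split('--###--');
$cnt=0;
for $c (@arr)
{
    # Write each chunk to a sequentially numbered file: 0.txt, 1.txt, ...
    open(FO,">$cnt.txt") or die "Cannot write $cnt.txt: $!";
    print FO $c;
    close(FO);
    $cnt++;
}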

amaksr

Does perl's `split` consider the "#" characters a regex replacement for the numbers that will be in the file? – jdi Jun 18 '12 at 20:35
I replaced `--###--` with the way too plodding `--[0-9][0-9][0-9]--` and it worked: I have a directory full of smaller texts. Two things to add to this: first, the files aren't named after their headers, which is not a deal breaker, and, second, Perl remains beyond my ken. – John Laudun Jun 19 '12 at 01:25
@user14664130 -- I hope the check mark for getting me closest to an answer still counts for you despite the question being closed. (I'm sorry so few people found it useful.) – John Laudun Jun 19 '12 at 20:29
