-2

I have tried my best to understand a very similar StackOverflow question, but I cannot for the life of me get either the proposed gawk or split solutions to work in my case.

I have a large text file consisting of 288 proposals, each of which is 300 to 500 words long and in a varying number of paragraphs (so no consistent line count). Each proposal is headed, however, by an identifier of the following nature: --###-- or --####--. There is no closing marker -- though I suppose I could insert one by doing some regex search and replace on the original file before splitting it into multiple files. What I want is a collection of 288 individual text files, each of which is named by the number between the two dashes. If it makes things any easier, I can easily split the file between those proposals headed by three numbers and those by four numbers.

In a nutshell, I want to do this:

#! /bin/env bash or python

Split all_proposals.txt into 121.txt, 122.txt, etc.

Where all_proposals.txt consists of:

  --121--

  One Line Title of Proposal

  Followed by several paragraphs each on a line of variable length.

  Another paragraph for effect.

  --122--

  More lines indeterminate in number.
John Laudun
  • It seems like your question is, "Will you please write this code for me," which isn't what Stack Overflow is for. Have you tried something that didn't work? – Ned Batchelder Jun 18 '12 at 20:12
  • `/bin/env` is not the standard location. You want `/usr/bin/env`. – William Pursell Jun 18 '12 at 20:56
  • @NedBatchelder: Previous attempts include, but are not limited to: `csplit abstracts.txt '/--[0-9][0-9][0-9]--/' '{186}'`, `csplit -f abs abstracts.txt '/--[0-9][0-9][0-9]--/' '{186}'`, `awk '/--\d/ {f=1;c++} {print > "session."i}' abstracts.txt`, and `gawk -vRS='\n--\[0-9]{3}--\n' -vprefix="file" '{print > prefix "ab-"NR".txt"}' abstracts_no_id.txt` ... so, yes, I tried a variety of things that didn't work. – John Laudun Jun 19 '12 at 01:02
  • And that doesn't include the four Python scripts I tried and the bash script I tried. Nor does it include my reaching out to local folks for help. I admit upfront that my scripting and command line fu is quite weak. – John Laudun Jun 19 '12 at 01:04
  • @WilliamPursell ... yes, thanks. I was just faking a hash-bang there, so I wasn't, I confess, paying attention. – John Laudun Jun 19 '12 at 01:05

3 Answers

1

Just set the name of the output file each time you see a line with the header:

awk '/--[0-9]+--/ { if (output) close(output); split($0, a, "--"); output = a[2] ".txt" }
     output { print > output }' all_proposals.txt

Note that this prints the header line into the file. If you don't want that, add a `next` statement at the end of the action for the header pattern.
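For comparison, the same idea (switch the output file each time a header line is seen) could be sketched in Python, which the question also allows. This is only an illustration, not part of the original answer: the names `HEADER_LINE`, `route_lines`, and `split_file` are made up here, the header is assumed to sit alone on its line (optionally indented), and the header line itself is dropped, as it would be with `next` in awk.

```python
import re
from pathlib import Path

# Assumed header shape: optional indentation, two dashes, digits, two dashes.
HEADER_LINE = re.compile(r"^\s*--(\d+)--\s*$")


def route_lines(lines):
    """Yield (identifier, line) pairs; lines before the first header are skipped."""
    current = None
    for line in lines:
        match = HEADER_LINE.match(line)
        if match:
            current = match.group(1)
            continue  # drop the header line itself
        if current is not None:
            yield current, line


def split_file(source="all_proposals.txt", out_dir="."):
    """Write each proposal to <identifier>.txt, keeping one output file open at a time."""
    out = None
    open_ident = None
    try:
        with open(source, encoding="utf-8") as src:
            for ident, line in route_lines(src):
                if ident != open_ident:
                    if out is not None:
                        out.close()
                    # "w" truncates, so identifiers are assumed to be unique.
                    out = open(Path(out_dir) / f"{ident}.txt", "w", encoding="utf-8")
                    open_ident = ident
                out.write(line)
    finally:
        if out is not None:
            out.close()
```

Calling `split_file("all_proposals.txt")` would then write `121.txt`, `122.txt`, and so on into the current directory. Python's text mode treats `\n`, `\r\n`, and a lone `\r` all as line endings, which may matter if the input uses classic Mac line endings; that is only one possible, unconfirmed reason for a line-based tool to see the whole file as a single line.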

William Pursell
  • That is the most amazing bit of `awk` I have ever seen. Like my own previous tries at `awk` the output is a duplicate of the original file but is simply re-named with the first header. I feel like there is some vital bit of information that I have missed that would make the problem obvious to someone. – John Laudun Jun 20 '12 at 03:19
0

You can solve this in Python using regular expressions in only a few lines. Have a look at the docs for the re module.

The idea is to search for your identifier, which in this case could be matched with an expression like

r'(--[0-9]*--)'

In particular, have a look at re.split.
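As a minimal sketch of how `re.split` could be combined with writing the pieces to separate files: when the pattern contains one capturing group, `re.split` returns the captured identifiers interleaved with the text between them, so the two can be paired up. The names `HEADER_RE`, `split_proposals`, and `write_proposals` are hypothetical, and the pattern assumes each header is alone on its line with three or four digits, as described in the question.

```python
import re
from pathlib import Path

# Capture only the digits so re.split returns them between the bodies.
HEADER_RE = re.compile(r"^[ \t]*--(\d{3,4})--[ \t]*$", re.MULTILINE)


def split_proposals(text):
    """Return {identifier: body} for every header-delimited proposal in text."""
    parts = HEADER_RE.split(text)
    # parts[0] is whatever precedes the first header;
    # after that the list alternates identifier, body, identifier, body, ...
    idents = parts[1::2]
    bodies = parts[2::2]
    return {ident: body.strip() + "\n" for ident, body in zip(idents, bodies)}


def write_proposals(text, out_dir="."):
    """Write each body to <identifier>.txt inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for ident, body in split_proposals(text).items():
        (out / f"{ident}.txt").write_text(body, encoding="utf-8")
```

With the whole file read into a string, for example via `Path("all_proposals.txt").read_text(encoding="utf-8")`, passing that string to `write_proposals` would produce one file per identifier. Note that `strip()` also removes the indentation of each proposal's first line; drop it if the original whitespace should be preserved.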

arpd
  • I have an albeit tentative grasp on the regex module, and I had encountered `re.split` before, and so I can see that I could read the file in as a big string and then split it using a regex pattern. What I am not yet any good at is understanding how to walk a script through writing each of the new, small strings to separate files. – John Laudun Jun 19 '12 at 01:19
0

You can use perl:

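#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole input file into one string.
open(my $in, '<', 'file.txt') or die "Cannot open file.txt: $!";
my $text = do { local $/; <$in> };
close($in);

# '--###--' is a placeholder and is matched literally; substitute a real
# pattern for the headers, e.g. split(/--[0-9]+--/, $text).
my @arr = split('--###--', $text);
my $cnt = 0;
for my $c (@arr)
{
    # Pieces are numbered sequentially, not by their header identifier.
    open(my $out, '>', "$cnt.txt") or die "Cannot write $cnt.txt: $!";
    print $out $c;
    close($out);
    $cnt++;
}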
amaksr
  • Does perl's `split` consider the "#" characters a regex replacement for the numbers that will be in the file? – jdi Jun 18 '12 at 20:35
  • I replaced `--###--` with the way too plodding `--[0-9][0-9][0-9]--` and it worked: I have a directory full of smaller texts. Two things to add to this: first, they don't have their header name, which is not a deal breaker, and, second, Perl remains beyond my ken. – John Laudun Jun 19 '12 at 01:25
  • @user14664130 -- I hope the check mark for getting me closest to an answer still counts for you despite the question being closed. (I'm sorry so few people found it useful.) – John Laudun Jun 19 '12 at 20:29