I am trying to use csplit in Bash to split a file, using years in the 1500s and 1600s as delimiters.

When I do the command

csplit Shakespeare.txt '/1[56]../' '{36}'

it almost works, except for at least two issues:

  1. This outputs 38 files, not 36, numbered xx00 through xx37. (Also, xx00 is completely empty.) I don't understand how this is possible.
  2. One of the files doesn't begin with 15XX or 16XX: it begins with "ACT 4 SCENE 15\n" (where \n denotes a newline). This extra file seems to be why csplit returns 37 non-empty files instead of the 36 non-empty files I expected. I don't understand how csplit can match a newline with a number.
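
For reference, the file count follows from how csplit documents its operands: '{n}' repeats the preceding pattern n more times, so the pattern is applied n+1 times in total, and the text before the first match always gets a file of its own (xx00). Below is a minimal sketch of that arithmetic on a made-up sample file; the file name sample.txt, the prefix part_, and the sample contents are assumptions for illustration, not the Shakespeare data.

```shell
#!/bin/sh
# Hypothetical sample: one preamble line followed by three year headings.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'preamble' '1590' 'text a' '1601' 'text b' '1623' 'text c' > sample.txt

# Count the heading lines with grep, then ask csplit for one repetition fewer,
# because the pattern operand itself already accounts for the first match.
matches=$(grep -c '^1[56][0-9][0-9]' sample.txt)
repeats=$((matches - 1))
csplit -s -f part_ sample.txt '/^1[56][0-9][0-9]/' "{$repeats}"

# Expect one file for the text before the first heading, plus one per heading.
file_count=$(ls part_* | wc -l | tr -d ' ')
first_piece=$(cat part_00)
echo "matches=$matches repeats=$repeats files=$file_count"
```

By the same arithmetic, '{36}' asks for 37 matches, which would yield 38 files when every match is found.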

When I do the command (which is what I want)

csplit Shakespeare.txt '/1[56][0-9][0-9]/' '{36}'

the terminal returns the error csplit: 1[56][0-9][0-9]: no match, in addition to printing the same list of numbers that the first command prints.

This especially doesn't make sense to me, since grep says otherwise:

grep -c "1[56][0-9][0-9]" Shakespeare.txt
36

grep -c "1[56].." Shakespeare.txt
36

Note: man csplit indicates that I have the BSD version from January 26, 2005. man grep indicates that I have the BSD version from July 28, 2010.

Chill2Macht
  • I don't understand the GNU comment; that's the bash on OS X: `$ bash -version` shows `GNU bash version 3.2.57(1)-release` – Dave Newton Sep 25 '17 at 21:06
  • What makes you think csplit is treating a newline as a number? `/../` is asking for any two characters, not just numbers. – jwodder Sep 25 '17 at 21:12
  • @DaveNewton Running that command on my system, I get: `GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin15) Copyright (C) 2007 Free Software Foundation, Inc.`. The point is just that on many help threads, solutions are presented which don't work for this version of BASH. Also the issue is more likely with command line functions, which BASH interacts with, rather than the shell itself. For example, when I do `man grep`, it says at the top "BSD general commands manual", not GNU or Linux. – Chill2Macht Sep 25 '17 at 21:23
  • @jwodder Valid point. I get the same result `36` too when I do `grep -c "1[56].." Shakespeare.txt`. I will edit the post. – Chill2Macht Sep 25 '17 at 21:27
  • Post a [mcve] so we can try to help you. By coming up with a minimal example you'll probably solve the problem yourself. – Ed Morton Sep 25 '17 at 23:34
  • @EdMorton There's literally no way I will be able to come up with a text file which I can post on here which accurately has all of the relevant features of the situation in terms of line breaks and spaces (it's the complete works of Shakespeare, after some preprocessing). I have found a solution (see the answer below), I just don't know why the solution works. – Chill2Macht Sep 25 '17 at 23:37
  • Of course you can. Take the file you have. Cut it in half and test one half. Does it reproduce the problem? If yes, keep it; if not, use the other half. Repeat until you have the smallest possible file that reproduces the problem. To be honest, everything you're describing sounds like it makes perfect sense for a Windows-produced file that starts with a matching line, but I'd have to actually see a [mcve] to be sure. – Ed Morton Sep 25 '17 at 23:39
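
The Windows-line-ending possibility raised in the last comment can be checked directly. The sketch below is only an illustration on made-up files (crlf_sample.txt and lf_sample.txt are hypothetical names); it counts lines that end in a carriage return and shows that stripping them with tr brings the count to zero. Whether carriage returns are what actually produces the extra match in the BSD csplit run is an assumption that this check alone does not prove.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample with Windows (CRLF) line endings.
printf 'ACT 4 SCENE 15\r\n1599\r\nsome text\r\n' > crlf_sample.txt

# Build a literal carriage return portably, then count lines ending in one.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr\$" crlf_sample.txt)

# Strip every carriage return and count again.
tr -d '\r' < crlf_sample.txt > lf_sample.txt
lf_lines=$(grep -c "$cr\$" lf_sample.txt)

echo "lines ending in CR: before=$crlf_lines after=$lf_lines"
```

A non-zero count on the real file would suggest converting it with tr (as above) before splitting.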

1 Answer

Based on the answer given here by user 'DRL' on 06-20-2008, I decided to try adding the -k option to csplit.

csplit -k Shakespeare.txt '/^1[56][0-9][0-9]/' '{36}'

This returned an error: csplit: ^1[56][0-9][0-9]: no match

However, it still gave (more or less) the desired output: files xx00.txt through xx36.txt (not xx37.txt), and each of the non-empty files, xx01.txt through xx36.txt, had the expected content. (In particular, no file began with "ACT 4 SCENE 15".)

The man page for csplit says the following about the -k flag:

-k Do not remove output files if an error occurs or a HUP, INT or TERM signal is received.
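
To make the man page wording concrete, here is a small sketch of what the flag changes when a repetition cannot be satisfied. The sample file and the prefixes plain_ and kept_ are made up for illustration; the repeat count '{5}' is deliberately too large for a file with only two headings, so csplit reports an error in both runs.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample with only two year headings.
printf '%s\n' 'preamble' '1590' 'text a' '1601' 'text b' > sample.txt

# Without -k: the error makes csplit delete the pieces it had already written.
csplit -s -f plain_ sample.txt '/^1[56][0-9][0-9]/' '{5}' 2>/dev/null
plain_status=$?
plain_count=$(ls plain_* 2>/dev/null | wc -l | tr -d ' ')

# With -k: the same error is reported, but the pieces are left on disk.
csplit -s -k -f kept_ sample.txt '/^1[56][0-9][0-9]/' '{5}' 2>/dev/null
kept_status=$?
kept_count=$(ls kept_* 2>/dev/null | wc -l | tr -d ' ')

echo "without -k: status=$plain_status files=$plain_count"
echo "with -k:    status=$kept_status files=$kept_count"
```

In other words, -k does not prevent the "no match" error; it only keeps whatever was split before the error occurred.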

Honestly I don't quite understand what this means, but I still have the following conjecture about why this solution worked/works:

Conjecture: csplit expects the beginning of the file to match the regex. Thus, since the beginning line of the file did not match ^1[56][0-9][0-9], it threw a tantrum and quit without the -k flag.

Nevertheless, I still don't understand why 1[56][0-9][0-9] did not work; maybe it is for the same reason. And I definitely don't understand why 1[56].. did not work (i.e. why csplit produced a 37th file not beginning with the pattern).

Chill2Macht