Separating sections of a text file with a bash script

Question

I have a list:

    ### To Read:
    One Hundred Years of Solitude | Gabriel García Márquez
    Moby-Dick | Herman Melville
    Frankenstein | Mary Shelley
    On the Road | Jack Kerouac
    Eyeless in Gaza | Aldous Huxley
    ### Read:
    The Name of the Wind (The Kingkiller Chronicles: Day One) | Patrick Rothfuss | 6-27-2013
    The Wise Man’s Fear (The Kingkiller Chronicles: Day Two) | Patrick Rothfuss | 8-4-2013
    Vampires in the Lemon Grove | Karen Russell | 12-25-2013
    Brave New World | Aldous Huxley | 2-2014

I'd like to use something like python's string.split(' | ') to separate the various fields into separate strings, but since the two sections have different numbers of fields, I think I need to treat them differently. How do I go about selecting the lines in between '### To Read:' and '### Read:' and after '### Read:' and splitting them? Should I use awk or sed?

In Python, I know I can use this code: `x="kj,ui,rt,we,sd,ggh,hk,yu"; x.split(',')`, and it will return `['kj', 'ui', 'rt', 'we', 'sd', 'ggh', 'hk', 'yu']`. I'd like to do something similar, splitting the lines at the ' | ' pieces, to give me something like `['One Hundred Years of Solitude', 'Gabriel García Márquez']` from the first line. — a--clam, Jul 21 '14 at 05:47
Read a line at a time, looking for the section separators. Effectively, you are creating a simple state machine (states could be labelled "separator", "toread", and "read"); handle the current input line differently depending on the state you are in. — tripleee, Jul 21 '14 at 06:03
However, rethinking your input format would probably be a better approach, if that's feasible. Would JSON, XML, or (bletch) some variant of .ini file format be acceptable? Then there will be ready libraries you can use for reading and parsing. — tripleee, Jul 21 '14 at 06:06
What have you tried? For field separation, a simple `IFS="|"` in a bash script is enough, depending on your needs (sed or awk would also work, but they are a bit heavy for simple field separation like this). — NeronLeVelu, Jul 21 '14 at 06:08
@tripleee how would you recommend I go about doing that? I would really prefer not to change the format. — a--clam, Jul 21 '14 at 06:10
@NeronLeVelu I did try that, but then I realized that I wasn't sure how to separate the titles from the authors since the titles have different numbers of words. — a--clam, Jul 21 '14 at 06:12
So again the real question is what you want the end result to look like. Two different files? A list of lists in JSON format? Something else? — tripleee, Jul 21 '14 at 06:24

score 0 · Answer 1 · answered Jul 21 '14 at 06:25

You have not specified the desired output, so I am interpreting the question this way: you want to read certain lines from a file, split each of them on `|` and, analogous to Python lists, put the results in bash arrays. The lines in question are all lines after `### To Read:` except for the `### Read:` header itself. The script below does this and then, to demonstrate that it works, displays each array with `declare -p`:

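    active=
    # IFS= and -r keep each line intact (no trimming, no backslash processing)
    while IFS= read -r line
    do
        if [ "$line" = '### To Read:' ] || [ "$line" = '### Read:' ]
        then
            # A section header: start processing the lines that follow
            active=1
        elif [ "$active" ]
        then
            # Split on '|' into an array; IFS is changed only for this read
            IFS='|' read -r -a my_array <<< "$line"
            declare -p my_array
        fi
    done <mylist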

The output from your sample input is:

    declare -a my_array='([0]="One Hundred Years of Solitude " [1]=" Gabriel García Márquez")'
    declare -a my_array='([0]="Moby-Dick " [1]=" Herman Melville")'
    declare -a my_array='([0]="Frankenstein " [1]=" Mary Shelley")'
    declare -a my_array='([0]="On the Road " [1]=" Jack Kerouac")'
    declare -a my_array='([0]="Eyeless in Gaza " [1]=" Aldous Huxley")'
    declare -a my_array='([0]="The Name of the Wind (The Kingkiller Chronicles: Day One) " [1]=" Patrick Rothfuss " [2]=" 6-27-2013")'
    declare -a my_array='([0]="The Wise Man’s Fear (The Kingkiller Chronicles: Day Two) " [1]=" Patrick Rothfuss " [2]=" 8-4-2013")'
    declare -a my_array='([0]="Vampires in the Lemon Grove " [1]=" Karen Russell " [2]=" 12-25-2013")'
    declare -a my_array='([0]="Brave New World " [1]=" Aldous Huxley " [2]=" 2-2014")'

Note that this approach easily handles the input even though the lines have different numbers of fields.
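
If the padding around each field is unwanted, or the two sections need different handling, the same loop can be extended. The following is only a sketch under assumptions: `trim` and `parse_books` are hypothetical helper names, the tab-separated output format is made up for illustration, and it relies on bash for `read -a` and here-strings.

```shell
#!/usr/bin/env bash
# Sketch only: trim and parse_books are hypothetical names, not from the answer above.

# Remove leading and trailing whitespace from a single string.
trim() {
    local s=$1
    s="${s#"${s%%[![:space:]]*}"}"   # strip leading whitespace
    s="${s%"${s##*[![:space:]]}"}"   # strip trailing whitespace
    printf '%s' "$s"
}

# Read the list on stdin; print one tab-separated record per book,
# prefixed with the section it came from (toread or read).
parse_books() {
    local section= line i
    local -a fields
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '### To Read:') section=toread; continue ;;
            '### Read:')    section=read;   continue ;;
            '')             continue ;;   # skip blank lines
        esac
        [ -n "$section" ] || continue     # ignore anything before the first header
        IFS='|' read -r -a fields <<< "$line"
        for i in "${!fields[@]}"; do
            fields[i]=$(trim "${fields[i]}")
        done
        if [ "$section" = toread ]; then
            printf 'toread\t%s\t%s\n' "${fields[0]}" "${fields[1]}"
        else
            printf 'read\t%s\t%s\t%s\n' "${fields[0]}" "${fields[1]}" "${fields[2]}"
        fi
    done
}
```

Called as `parse_books < mylist`, this should print title and author for the first section, and title, author and date for the second, with the surrounding spaces removed.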

score 0 · Answer 2 · answered Jul 21 '14 at 06:32

You are not telling us how to deliver the final output, but here is a skeleton for an Awk solution.

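    # [|] matches a literal pipe; a bare | in a multi-character FS is regex alternation
    awk -F ' [|] ' '
        /^### To Read:/ { s=1; next }    # unread section: two fields
        /^### Read:/    { s=2; next }    # read section: three fields
        s==1 { print $1 "," $2 ",\"\"" }
        s==2 { print $1 "," $2 "," $3 }' file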

This simply prints an empty third field (`""`) for lines from the first subsection. You can adapt the actions to do anything you like, or rewrite this in Python if you are more familiar with that.
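
As one example of adapting the actions: titles can themselves contain commas, so it may be safer to quote every field. A rough sketch, assuming a shell function named `books_to_csv` (a made-up name) and the common CSV convention of doubling embedded double quotes; `[|]` is used in the field separator so that the pipe is matched literally:

```shell
#!/usr/bin/env bash
# Sketch only: books_to_csv is a hypothetical wrapper name.
# Reads the named files (or stdin) and prints three quoted CSV fields per book.
books_to_csv() {
    awk -F ' [|] ' '
        # Wrap a field in double quotes, doubling any embedded double quotes.
        function q(str) { gsub(/"/, "\"\"", str); return "\"" str "\"" }
        NF == 0          { next }                  # skip blank lines
        /^### To Read:/  { section = 1; next }
        /^### Read:/     { section = 2; next }
        section == 1     { print q($1) "," q($2) "," q("") }
        section == 2     { print q($1) "," q($2) "," q($3) }
    ' "$@"
}
```

Usage would be `books_to_csv file`, or with the list piped in on standard input.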
