Bash script to split field into array using regex for multi-character delimiters

Question

Hi – I don’t have a ton of shell scripting experience, and I need to create a bash script to split a single large note field into an array of individual notes, using a regex (or multiple regexes) as delimiters. My input looks like this:

This is the first note (AA 01/23 10:00A)This is the second note(AB 01/24 11:00P) This is the third note (C101/25/201512:15A)This is the fourth (and final) note(D2 03/10 03:15P)

My array needs to look like this:

This is the first note               AA  01/23       10:00A
This is the second note              AB  01/24       11:00P
This is the third note               C1  01/25/2015  12:15A
This is the fourth (and final) note  D2  03/10       03:15P

Details:

- The notes can contain parentheses, hence my thought that I will need to use a regex instead of just splitting after each “)”.
- The dates in note “tags” (the part contained within the parentheses) can have two distinct formats: some have spaces before and after the date with just a mm/dd date, and others show the date as mm/dd/yyyy with no spaces before and after.
- The note tags always begin with “(AA”, where AA can be any combination of uppercase alpha and numeric characters.
- The note tags always end with “HH:MMA)”, where HH is valid hours, MM is valid minutes, and the final character before the ) is either A or P.

I’ve defined two regexes to identify the beginning and end of the note tag, but I’m at a loss as to how to actually get the data into an array. My regexes are:

starttag= "\([A-Z0-9]{2}"
endtag= "\d+:\d+[A|P]\)"
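A side note on the two patterns as written: bash does not allow a space after `=` in an assignment, `\d` is Perl syntax rather than POSIX extended regex (the dialect that `[[ ... =~ ... ]]` uses), and inside a bracket expression `|` is a literal character, so `[A|P]` also accepts a pipe. The following is only a sketch of how the same two patterns might look in POSIX-only syntax; the variable `tag` and its value are illustrative, taken from the sample input.

```bash
#!/usr/bin/env bash
# The two patterns rewritten with POSIX ERE syntax only (no \d, no "|" in brackets).
starttag='\([A-Z0-9]{2}'
endtag='[0-9]{2}:[0-9]{2}[AP]\)'

# Illustrative tag taken from the sample input.
tag='(AA 01/23 10:00A)'

# Keep the pattern in a variable and leave it unquoted so bash treats it as a regex.
[[ $tag =~ $starttag ]] && start_match=${BASH_REMATCH[0]}
[[ $tag =~ $endtag ]] && end_match=${BASH_REMATCH[0]}
printf '%s\n' "$start_match" "$end_match"

# Inside brackets "|" is literal, so the original [A|P] would also match a pipe.
pipe_re='^[A|P]$'
[[ '|' =~ $pipe_re ]] && echo 'pipe matched'
```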

I’ve tried to create an array using IFS, but it appears that an IFS cannot contain multiple characters – correct? My results appear to be splitting the input on every character in my regex, instead of evaluating the entire regex as a single delimiter.
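For what it is worth, IFS can hold several characters, but the shell treats each one as a separate single-character delimiter and never as one multi-character string, which is consistent with the behaviour described above. A small sketch, using a made-up sample string and the two-character "delimiter" `ab`:

```bash
#!/usr/bin/env bash
# Hypothetical sample: two words separated by the two-character string "ab".
sample='oneabtwo'

# IFS is a set of single characters, so this splits on "a" and on "b" separately.
IFS='ab' read -r -a parts <<< "$sample"

# Three fields come back, because an empty field sits between the "a" and the "b".
echo "${#parts[@]}"
printf '[%s]\n' "${parts[@]}"
```

This is why a regex-based approach (or an external tool such as sed or awk) is needed when the delimiter is longer than one character.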

Any help would be greatly appreciated.

Do you want a 2D array out of this? Bash mostly just supports 1D arrays. — that other guy, Mar 17 '15 at 21:33
No, just a 1D array. Ultimate goal is to create database load records for each individual note associated with an order. So an incoming record of the form order#/notenotenotenotenote will be written out to a file as separate records: order#/note, order#/note, order#/note. — LisePr, Mar 17 '15 at 21:43
I would recommend looking into `grep`, `awk` and `sed` for something like this. You can use grep to search for a regex and return what it finds. `egrep`, `grep -E` or `grep -P` should be most useful for your goal. — Luke Mat, Mar 18 '15 at 00:46
Should 'This is the first note' be the index of the array element and 'AA 01/23 10:00A' the value? — ShellFish, Mar 18 '15 at 01:49

score 0 · Answer 1 · answered Mar 18 '15 at 22:38

My sed is not the best, and this looks kind of silly and comes with no warranty:

    eval $(sed 's/\([^()]*\)(\([A-Z0-9]\{2\}\)\([^AP]*[AP]\)) */\1 \2 \3" "/g ; s/\([^ ]\)\([0-9]\{2\}:[0-9]\{2\}[AP]\)/\1 \2/g ; s/ "$//g ; s/^.*/array=("&)/' file)

Change "array" to the name you want for the array, and "file" to the name of the input file. With your test input, the sed line expands to this:

array=("This is the first note  AA  01/23 10:00A" "This is the second note AB  01/24 11:00P" "This is the third note  C1 01/25/2015 12:15A" "This is the fourth (and final) note D2  03/10 03:15P")

The eval picks that up and expands it into the current running shell.
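One caveat with `eval`: it executes whatever the sed output contains, so a note that includes a double quote or `$(...)` would be run as shell code. A possible alternative, shown here only as a sketch, is to stay in bash and peel off one tag at a time with `[[ ... =~ ... ]]` and `BASH_REMATCH`. The variable names (`jobnotes`, `tag_re`, `notes` and so on) and the tab-separated record layout are assumptions for illustration, and the regex encodes the tag rules as stated in the question.

```bash
#!/usr/bin/env bash
# Sample input from the question; in real use this would be the notes field.
jobnotes='This is the first note (AA 01/23 10:00A)This is the second note(AB 01/24 11:00P) This is the third note (C101/25/201512:15A)This is the fourth (and final) note(D2 03/10 03:15P)'

# One tag: "(" + 2 initials + optional space + mm/dd or mm/dd/yyyy
# + optional space + HH:MM + A or P + ")".
# Groups: 1 = initials, 2 = date, 3 = optional "/yyyy", 4 = time.
tag_re='\(([A-Z0-9]{2}) ?([0-9]{2}/[0-9]{2}(/[0-9]{4})?) ?([0-9]{2}:[0-9]{2}[AP])\)'

notes=()
rest=$jobnotes
while [[ $rest =~ $tag_re ]]; do
    tag=${BASH_REMATCH[0]}          # the leftmost tag in the remaining text
    initials=${BASH_REMATCH[1]}
    note_date=${BASH_REMATCH[2]}
    note_time=${BASH_REMATCH[4]}

    text=${rest%%"$tag"*}           # everything before that tag
    rest=${rest#*"$tag"}            # everything after it

    # Trim leading and trailing whitespace from the note text.
    text=${text#"${text%%[![:space:]]*}"}
    text=${text%"${text##*[![:space:]]}"}

    # Build one tab-separated record per note.
    printf -v record '%s\t%s\t%s\t%s' "$text" "$initials" "$note_date" "$note_time"
    notes+=("$record")
done

printf '%s\n' "${notes[@]}"
```

Because the parenthesised "(and final)" does not fit the tag pattern, it stays part of the fourth note's text. Each array element can then be written out with the order number prepended.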

shooper

Thanks for the suggestion. I tried this, but the incoming data I need to evaluate is a single field within a larger record, and it appears that sed doesn't like this. Here's what I tried: `eval $(sed 's/\([^()]*\)(\([A-Z0-9]\{2\}\)\([^AP]*[AP]\)) */\1 \2 \3" "/g ; s/\([^ ]\)\([0-9]\{2\}:[0-9]\{2\}[AP]\)/\1 \2/g ; s/ "$//g ; s/^.*/notes=("&)/' << "$jobnotes")` – LisePr Mar 19 '15 at 17:54
I know that it seemed to work with your sample data. I am sorry that it doesn't work for your real data. – shooper Mar 19 '15 at 18:12
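One likely problem in the attempt from the comment above is the `<<` redirection: it starts a here-document that ends at a delimiter word, so sed never sees the contents of the variable. A here-string (`<<<`) passes the variable's value on standard input instead. The sketch below reuses the sed script from the answer on a shortened, made-up field value and prints the generated assignment rather than running it through `eval`; whether the real records contain other characters that trip up the script is not something this example can show.

```bash
#!/usr/bin/env bash
# Hypothetical two-note field standing in for the real notes field.
jobnotes='first note (AA 01/23 10:00A)second note(AB 01/24 11:00P)'

# "<<<" is a here-string: the expanded variable becomes sed's standard input.
assignment=$(sed 's/\([^()]*\)(\([A-Z0-9]\{2\}\)\([^AP]*[AP]\)) */\1 \2 \3" "/g ; s/\([^ ]\)\([0-9]\{2\}:[0-9]\{2\}[AP]\)/\1 \2/g ; s/ "$//g ; s/^.*/notes=("&)/' <<< "$jobnotes")

# Inspect the generated assignment before deciding whether to eval it.
printf '%s\n' "$assignment"
```

Piping the value in with `printf '%s\n' "$jobnotes" | sed ...` would be an equivalent way to feed the field to sed.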
