I would like to use bash on a file to extract text that lies between two strings. There are already some answers to this, e.g.:

Print text between two strings on the same line

But I would like to do this for multiple occurrences, sometimes on the same line and sometimes on separate lines. For example, starting with a file like this:

\section{The rock outcrop pools experimental system} \label{intro:rockpools}
contain pools at their summit \parencite{brendonck_pools_2010} that have weathered into the rock over time \parencite{bayly_aquatic_2011} through chemical weathering after water collecting at the rock surface \parencite{lister_microgeomorphology_1973}.
Classification depends on dimensions \parencite{twidale_gnammas_1963}.

I would like to retrieve:

brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

I imagine sed should be able to do this but I'm not sure where to start.

Shearn
  • It's always better to show enough context to give some perspective on the complexity of the problem. What [anubhava](https://stackoverflow.com/users/548225/anubhava) [showed](http://stackoverflow.com/a/34771029) when I commented was for a simpler input. I would probably use a marginally modified version of his (PCRE-enabled) `grep` command that puts the `\parencite` before the open brace, and then filter the output with `sed` to remove the unwanted material. – Jonathan Leffler Jan 13 '16 at 16:01

3 Answers

Using `grep -oP` (the `-P` option needs GNU grep built with PCRE support):

grep -oP '\\parencite\{\K[^}]+' file
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

Or using GNU awk (`FPAT` is a gawk extension):

awk -v FPAT='\\\\parencite{[^}]+' '{for (i=1; i<=NF; i++) {
    sub(/\\parencite{/, "", $i); print $i}}' file
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963
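
For anyone wondering what `\K` does in the `grep` pattern: it discards everything matched so far from the reported match, so only the key after `\parencite{` is printed. Here is a minimal sketch comparing it with the lookaround form glenn jackman gives in the comments. The two-line sample file is made up for illustration, and GNU grep with `-P` support is assumed.

```shell
# Hypothetical sample input, written to a temporary file.
sample=$(mktemp)
printf '%s\n' \
    '\section{Pools} \label{intro:rockpools}' \
    'summit \parencite{brendonck_pools_2010} over time \parencite{bayly_aquatic_2011}.' > "$sample"

# \K drops the already-matched "\parencite{" from the reported match.
with_k=$(grep -oP '\\parencite\{\K[^}]+' "$sample")

# Lookbehind/lookahead form: the braces are asserted but never consumed.
with_lookaround=$(grep -oP '(?<=\\parencite\{).+?(?=\})' "$sample")

printf '%s\n' "$with_k"
```

On this sample both variables should hold the same two keys, and the `\section{...}` and `\label{...}` contents are skipped because the pattern requires the literal `\parencite` prefix.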
anubhava
  • Thank you, this gets me some of the way. I updated the example because there are other things in the file with the {} string that I do not wish to print. Could you explain how grep is being told to use the "{" and "}"? When I use `grep -oP 'parencite{\K[^}]+' file` it isn't working... – Shearn Jan 13 '16 at 15:59
  • @Shearn: which system are you on? Do you have GNU `grep` or another PCRE-enabled `grep`? You might need to escape the `{`, for example. You need to read the manual (depressingly) carefully. When you say "it isn't working", what are the symptoms? Complaints about the regex? Simply not returning anything? When you report 'not working', you need to be explicit; what you see may not be what others see. – Jonathan Leffler Jan 13 '16 at 16:04
  • Totally agreed with @JonathanLeffler: `isn't working` doesn't really tell us what is not working. Also, regarding your edited question, why are 2 values inside `{...}` not in the output? – anubhava Jan 13 '16 at 16:18
  • @Shearn `{` and `}` are special and need to be escaped: `\\parencite\{\K[^}]+` or `(?<=\\parencite\{).+?(?=\})` work with `grep -oP` – glenn jackman Jan 13 '16 at 16:34
  • Sorry @JonathanLeffler and @anubhava for being vague. I am running Ubuntu 15.04. By 'not working' I meant that there was no output. @anubhava, the two values to which you refer are not intended to be in the output, only those between `parencite{` and `}`. @glenn jackman, thank you for the explanation, `grep -oP '\\parencite\{\K[^}]+' file` works perfectly – Shearn Jan 13 '16 at 17:28
  • That's good, I have updated answer with working versions of both `grep` and `awk`. – anubhava Jan 13 '16 at 17:34

This two-stage extraction might be easier to understand, and it does not need Perl regexes.

$ grep -o "parencite{[^}]*}" cite | sed 's/parencite{//;s/}//'
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

Or, as always, awk to the rescue!

$ awk -F'[{}]' -v RS=" " '/parencite/{print $2}' cite
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963
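
If you want the two-stage pipeline to be a bit stricter, one option is to include the leading backslash in the pattern and anchor the `sed` substitutions. This is only a sketch: the `cite_keys` function name and the temporary sample file are invented for illustration, and it assumes a `grep` that supports `-o` (GNU and BSD grep do).

```shell
# Hypothetical helper: print every \parencite{...} key, one per line.
cite_keys() {
    # Stage 1: isolate each \parencite{...} occurrence on its own line.
    # Stage 2: strip the leading "\parencite{" and the trailing "}".
    grep -o '\\parencite{[^}]*}' "$1" | sed 's/^\\parencite{//; s/}$//'
}

# Sample input taken from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
\section{The rock outcrop pools experimental system} \label{intro:rockpools}
contain pools at their summit \parencite{brendonck_pools_2010} that have weathered into the rock over time \parencite{bayly_aquatic_2011} through chemical weathering after water collecting at the rock surface \parencite{lister_microgeomorphology_1973}.
Classification depends on dimensions \parencite{twidale_gnammas_1963}.
EOF

cite_keys "$sample"
```

Run against the question's sample, this should print the same four keys as above while ignoring the `\section{...}` and `\label{...}` arguments.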
karakfa

This might work for you (GNU sed):

sed '/\\parencite{\([^}]*\)}/!d;s//\n\1\n/;s/^[^\n]*\n//;P;D' file

Delete any lines that don't contain the required string. Surround the first occurrence with newlines and remove everything up to and including the first newline. Print up to and including the following newline, then delete what was printed and repeat on the remainder of the line.
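
To watch the loop handle two citations on one line, here is a small sketch. The `extract_with_sed` wrapper and the sample lines are made up for illustration, and GNU sed is assumed (the `\n` inside the bracket expression is a GNU extension).

```shell
# Hypothetical wrapper around the sed command above.
extract_with_sed() {
    sed '/\\parencite{\([^}]*\)}/!d;s//\n\1\n/;s/^[^\n]*\n//;P;D' "$1"
}

# Made-up sample: one line with two citations, one line with none.
sample=$(mktemp)
printf '%s\n' \
    'pools \parencite{a_2010} weathered \parencite{b_2011}.' \
    'a line with \label{not_wanted} only' > "$sample"

# First pass over line 1:  "a_2010\n weathered \parencite{b_2011}."
#   P prints "a_2010"; D drops it and restarts on the remainder.
# Second pass:             "b_2011\n."
#   P prints "b_2011"; D leaves ".", which has no match and is deleted.
extract_with_sed "$sample"
```

The second sample line never matches the address, so `!d` removes it and only `a_2010` and `b_2011` are printed.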

potong