
And how do I make it do that?

Right now it stops at line breaks (like right after "Chicago,"). Alternatively, if I use DOTALL, it matches "Abbott A (1988)" and then the rest of the string to the very end. I would like it to stop at the next occurrence of (([\w\s]+)\(([1|2]\d{3})\)), that is, at "Albu OB and Flyverbom M (2016)", and so on for each following reference.

Any pointers welcome.

pattern = r"(([\w\s]+)\(([1|2]\d{3})\))(.*)"

sample string

"Abbott A (1988) The System of Professions: An Essay on the Division of Expert Labor. Chicago,
IL: University of Chicago Press.
Albu OB and Flyverbom M (2016) Organizational transparency: conceptualizations, con-
ditions, and consequences. Business & Society. Epub ahead of print 13 July. DOI:
10.1177/0007650316659851.
Ananny M (2016) Toward an ethics of algorithms: convening, observation, probability, and timeli-
ness. Science, Technology & Human Values 41(1): 93–117. DOI: 10.1177/0162243915606523."

sandbox here

Wiktor Stribiżew
treakec

1 Answer


You may use

(?sm)^([^()\n\r]+)\(([12]\d{3})\)(.*?)(?=^[^()\n\r]+\([12]\d{3}\)|\Z)

See the regex demo

Details

  • (?sm) - re.DOTALL and re.MULTILINE enabled
  • ^ - start of a line
  • ([^()\n\r]+) - Group 1: one or more chars other than (, ), CR and LF
  • \( - a (
  • ([12]\d{3}) - Group 2: 1 or 2 and then any 3 digits
  • \) - a ) char
  • (.*?) - Group 3: any zero or more chars, including line breaks, as few as possible, up to (but not including) the first position where the following lookahead matches
  • (?=^[^()\r\n]+\([12]\d{3}\)|\Z) - (a positive lookahead that requires the presence of its pattern immediately to the right of the current location):
    • ^[^()\r\n]+\([12]\d{3}\) - same as the start of the pattern but with no groups
    • | - or
    • \Z - end of the whole text.
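
As a rough check of the breakdown above, here is a minimal Python sketch that runs the pattern over the sample string from the question. The names PATTERN, text and references are only illustrative, and stripping the groups with .strip() is an assumption about how the pieces would be used, not part of the regex itself.

```python
import re

# Pattern from the answer; (?sm) enables re.DOTALL and re.MULTILINE inline.
PATTERN = re.compile(
    r"(?sm)^([^()\n\r]+)\(([12]\d{3})\)(.*?)(?=^[^()\n\r]+\([12]\d{3}\)|\Z)"
)

# Sample string from the question, with its original line breaks.
text = (
    "Abbott A (1988) The System of Professions: An Essay on the Division of Expert Labor. Chicago,\n"
    "IL: University of Chicago Press.\n"
    "Albu OB and Flyverbom M (2016) Organizational transparency: conceptualizations, con-\n"
    "ditions, and consequences. Business & Society. Epub ahead of print 13 July. DOI:\n"
    "10.1177/0007650316659851.\n"
    "Ananny M (2016) Toward an ethics of algorithms: convening, observation, probability, and timeli-\n"
    "ness. Science, Technology & Human Values 41(1): 93–117. DOI: 10.1177/0162243915606523."
)

# One (authors, year, rest) tuple per reference; strip() trims the
# surrounding spaces and line breaks that the groups capture.
references = [
    (m.group(1).strip(), m.group(2), m.group(3).strip())
    for m in PATTERN.finditer(text)
]

for authors, year, rest in references:
    print(authors, year, rest[:40], sep=" | ")
```

Note that "41(1)" in the last entry does not start a new match: the lookahead needs four digits inside the parentheses, and the first digit must be 1 or 2.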
Wiktor Stribiżew
  • i have added a comma, like this (?sm)^([\w\s,]+)\(([12]\d{3})\)(.*?)(?=^[\w\s,]+\([12]\d{3}\)|\Z) to include those references with multiple authors – treakec Jun 08 '18 at 07:41
  • @treakec I have suggested fixing that with `[^()\n\r]+` (there are more than commas, dots are there, too). It matches any char but `(`, `)` or a common line break char. – Wiktor Stribiżew Jun 08 '18 at 07:45
  • yes, great idea. it will also come in handy later, i think. thanks. – treakec Jun 08 '18 at 08:32
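
To illustrate the point made in the comments, the sketch below compares the two author character classes on a single reference line. The line itself is invented for the example (it is not from the question), and the names TAIL, comma_class and negated_class are hypothetical.

```python
import re

# Shared tail: a four-digit year in parentheses, as in the answer.
TAIL = r"\(([12]\d{3})\)"

# Variant from the comment: word chars, whitespace and commas only.
comma_class = re.compile(r"(?sm)^([\w\s,]+)" + TAIL)
# Variant from the answer: anything except parentheses and line breaks.
negated_class = re.compile(r"(?sm)^([^()\n\r]+)" + TAIL)

# Hypothetical reference line with dots after the initials.
line = "Smith, J. and Doe, A. (2001) An invented title."

# The dot after "J" is not in [\w\s,], so the first variant finds nothing.
print(comma_class.search(line))
# The negated class accepts the dots and captures the full author part.
print(negated_class.search(line).group(1))
```

The negated class does not need updating each time a new punctuation character shows up in an author list, which is the advantage the comment describes.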