
I'm writing a Python script to parse Wikipedia articles, and part of that process is parsing links. I'm trying to write a regular expression that extracts the link target, like this:

  • [[:Category:Anarchism by country|Anarchism by country]] -> :Category:Anarchism by country
  • [[Squatting|squat]] -> Squatting
  • [[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right) -> John Zerzan
  • * {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich |title=[[Anarchist Voices: An Oral History of Anarchism in America]] |year=1996 |publisher=[[Princeton University Press]] |isbn=978-0-691-04494-1 -> Unmatched, begins with * {{ (citation)

I've got as far as \[\[([^|\]]+)(?:\|[^|\]]+)?\]\] which works for the first three examples, but in the citation it matches both the title and the publisher. I think I need a negative lookahead to prevent any matches in the last example. I'm not very good with regex, however, so any suggestions would be greatly appreciated.
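One possible way around the lookahead, as a rough sketch rather than a definitive answer: Python's built-in re module only allows fixed-width lookbehind, so "not somewhere after * {{" is awkward to express inside a single pattern. It may be simpler to test each line for the citation prefix first and only run the link pattern on the lines that pass. The names LINK_RE, CITATION_RE and extract_links below are made up for illustration, and the sketch assumes that a citation always sits on its own line and starts with * {{.

```python
import re

# The pattern from the question: capture the link target and
# ignore an optional "|label" part.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|[^|\]]+)?\]\]")

# Assumption: a citation line is optional whitespace, "*", optional
# whitespace, then "{{". Adjust if the articles use other forms.
CITATION_RE = re.compile(r"\s*\*\s*\{\{")


def extract_links(text):
    """Return link targets from every line that is not a citation line."""
    targets = []
    for line in text.splitlines():
        if CITATION_RE.match(line):
            continue  # skip the whole citation line
        targets.extend(LINK_RE.findall(line))
    return targets


sample = "\n".join([
    "[[:Category:Anarchism by country|Anarchism by country]]",
    "[[Squatting|squat]]",
    "[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right)",
    "* {{cite book |title=[[Anarchist Voices]] |publisher=[[Princeton University Press]]",
])
print(extract_links(sample))
```

With the four examples from the question (the citation shortened here), this yields the three expected targets and nothing from the citation line. It would not help if a citation template is spread over several lines or appears mid-line; that case probably needs the templates stripped out before matching.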
