Basically, I'm reading information from the Wikipedia API, which returns JSON containing the source of a page in wiki markup. I used a JSON API to filter out the part of the page I want, and now I want to format the text so that all the links and similar markup are removed.

The markup writes links like this: [[wiki page|display text]]
But a link can also look like this: [[wiki page]]

So, what I'm trying to do is extract the display text if the pipe character exists, but if not I just want the wiki page text.

This is my code for that right now, which should detect if there's a pipe character and handle those strings properly but doesn't:

private static String format(String s) {
    return s.replaceAll("\\[\\[.+?(\\]\\]|\\|)", "").replace("[[", "").replace("]]", "").trim();
}

When I run this, it sometimes removes the text of links written simply as [[wiki page]], although it works when the pipe character is present. How can I get this working correctly?

Eli

2 Answers

You can use:

private static String format(String s) {
    return s.replaceAll("\\[\\[(?:[^|\\]]*\\|)?(.+?)\\]\\]", "$1");
}

RegEx Demo
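
For anyone who wants to try this outside the demo, here is a minimal self-contained sketch of the same replace-with-capture-group idea, run on the input `[[wiki page]] [[wiki page|display text]]`. The class name `WikiLinkStripper`, the method name `stripLinks`, and the tighter `[^\]]+` capture (in place of `.+?`) are assumptions made for illustration, not part of the answer above.

```java
import java.util.regex.Pattern;

class WikiLinkStripper {
    // Matches [[target]] or [[target|label]].
    // The optional non-capturing group consumes "target|" when a pipe is present;
    // group 1 then holds the text to keep (the label, or the target if there is no pipe).
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(?:[^|\\]]*\\|)?([^\\]]+)\\]\\]");

    // Hypothetical helper: replaces every link with its visible text.
    static String stripLinks(String s) {
        return LINK.matcher(s).replaceAll("$1").trim();
    }

    public static void main(String[] args) {
        // prints "wiki page display text"
        System.out.println(stripLinks("[[wiki page]] [[wiki page|display text]]"));
    }
}
```

Links with more than one pipe (image links, for example) are not treated specially in this sketch: everything after the first pipe is kept.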

anubhava
  • This works great! However, when passed in `[[wiki page]] [[wiki page|display text]]`, only `display text` is given back. Is there a way around this? – Eli Aug 12 '15 at 18:19
((?<=\\[\\[)[^|]*|(?<=\\|).*?)(?=\\]\\])

You can use this and grab group 1 ($1). See the demo:

https://regex101.com/r/rO0yD8/2
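
Since this pattern is meant for extracting rather than replacing, a rough sketch of how it might be driven from Java with `Matcher.find()` is shown here. The class name `WikiLinkExtractor` and method name `linkTexts` are hypothetical, and the character classes are tightened to `[^|\]]*` as an assumption, so that a match cannot run across the `]]` of one link into the next link.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinkExtractor {
    // Group 1 is either the text right after "[[" (when no pipe follows it)
    // or the text right after "|"; in both cases it must be followed by "]]".
    private static final Pattern LINK_TEXT =
            Pattern.compile("((?<=\\[\\[)[^|\\]]*|(?<=\\|)[^|\\]]*)(?=\\]\\])");

    // Hypothetical helper: collects the visible text of every link, in order.
    static List<String> linkTexts(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [wiki page, display text]
        System.out.println(linkTexts("[[wiki page]] [[wiki page|display text]]"));
    }
}
```

This only collects the link texts; the surrounding prose is dropped, so it suits extraction better than reformatting a whole page.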

vks