I am reading a Wikipedia XML file, in which I have to strip the list markup from any line that is a list item. For example, given the following string:
String text = ": definition list\n"
        + "** some list item\n"
        + "# another list item\n"
        + "[[Category:1918 births]]\n"
        + "[[Category:2005 deaths]]\n"
        + "[[Category:Scottish female singers]]\n"
        + "[[Category:Billy Cotton Band Show]]\n"
        + "[[Category:Deaths from Alzheimer's disease]]\n"
        + "[[Category:People from Glasgow]]";
Here, I want to delete the leading *, # and : markers, but not the : in the lines that say Category. The output should look like:
String outtext = "definition list\n"
        + "some list item\n"
        + "another list item\n"
        + "[[Category:1918 births]]\n"
        + "[[Category:2005 deaths]]\n"
        + "[[Category:Scottish female singers]]\n"
        + "[[Category:Billy Cotton Band Show]]\n"
        + "[[Category:Deaths from Alzheimer's disease]]\n"
        + "[[Category:People from Glasgow]]";
I am using the following code:
Pattern pattern = Pattern.compile("(^\\*+|#+|;|:)(.+)$");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String outtext = matcher.group(0);
    outtext = outtext.replaceAll("(^\\*+|#+|;|:)\\s", "");
    return(outtext);
}
This is not working. Can you please show me how I should do it?
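For reference, here is a minimal sketch of one direction that might work, assuming list markers only ever appear at the very start of a line. Three things in the code above look suspicious: without the MULTILINE flag, ^ and $ match only at the start and end of the whole input; the ^ applies only to the first alternative, so #, ; and : can match anywhere, including inside [[Category:...]]; and the return inside the loop exits after the first match. The class name ListMarkerStripper and the method name stripListMarkers are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Pattern;

final class ListMarkerStripper {
    // (?m) turns on MULTILINE, so ^ matches at the start of every line.
    // [*#;:]+ matches a run of list-marker characters at that position,
    // and [ \t]* also removes the spaces or tabs that follow the markers.
    private static final Pattern LIST_MARKER =
            Pattern.compile("(?m)^[*#;:]+[ \\t]*");

    // Returns the text with leading list markers removed from each line.
    // Lines that do not start with a marker (e.g. [[Category:...]]) are untouched.
    static String stripListMarkers(String text) {
        return LIST_MARKER.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = ": definition list\n"
                + "** some list item\n"
                + "# another list item\n"
                + "[[Category:1918 births]]\n"
                + "[[Category:People from Glasgow]]";
        System.out.println(stripListMarkers(text));
    }
}
```

Because the character class is anchored to the line start, the colon in [[Category:1918 births]] is never considered, and a single replaceAll over the whole string avoids the find loop altogether.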