I have to read Wikipedia dumps and extract the headings, bold words, italics, etc. The formatting is written in wiki markup (wikicode). I am using pugixml to parse the XML document, but I have no idea how to read the wiki markup inside it and extract the text. How can I do this?
-
Can you show some code for what you've tried that didn't work? – Paweł Łukasik Jan 01 '17 at 11:28
-
I don't have any code for this yet. I want to do it with a regex but haven't been able to. Can you guide me a bit on this? – Rmcf Jan 02 '17 at 18:19
-
I have made the following regex: – Rmcf Jan 02 '17 at 18:47
-
\'.*([a-z]|[A-Z])+.*\' But it only returns the entire string, whereas I want to extract each word of the string. How can I do that? – Rmcf Jan 02 '17 at 18:47
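A likely reason the regex above returns the entire string is that `.*` is greedy: it runs from the first apostrophe to the last one on the line. A minimal sketch of one alternative, assuming standard wiki markup (`''italic''`, `'''bold'''`, `'''''bold italic'''''`) and `std::regex` from the C++ standard library. The names `Emphasis`, `extract_emphasis` and `split_words` are made up for illustration, and nested or unbalanced quotes are not handled:

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the spans found in one piece of wikitext.
struct Emphasis {
    std::vector<std::string> bold;
    std::vector<std::string> italic;
};

// Collects bold and italic spans. The alternatives are ordered longest
// delimiter first so that '''bold''' is not mistaken for ''italic''.
// The lazy quantifier .+? stops at the first closing delimiter.
Emphasis extract_emphasis(const std::string& wikitext) {
    static const std::regex pattern(R"('''''(.+?)'''''|'''(.+?)'''|''(.+?)'')");
    Emphasis out;
    for (std::sregex_iterator it(wikitext.begin(), wikitext.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        if (m[1].matched) {            // five quotes: bold and italic
            out.bold.push_back(m[1].str());
            out.italic.push_back(m[1].str());
        } else if (m[2].matched) {     // three quotes: bold
            out.bold.push_back(m[2].str());
        } else if (m[3].matched) {     // two quotes: italic
            out.italic.push_back(m[3].str());
        }
    }
    return out;
}

// Splits a span into whitespace-separated words.
std::vector<std::string> split_words(const std::string& s) {
    std::istringstream in(s);
    std::vector<std::string> words;
    for (std::string w; in >> w;) {
        words.push_back(w);
    }
    return words;
}
```

Calling `extract_emphasis` on each article's text and then `split_words` on each span gives the individual words. Iterating with `std::sregex_iterator` is what yields every match rather than just one.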
-
You need to show a representative example of the text you want to split with the regex; otherwise it will be hard to help. – Paweł Łukasik Jan 02 '17 at 18:50
-
My text is the XML file of Simple Wikipedia. – Rmcf Jan 04 '17 at 18:24
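In a MediaWiki XML dump the wiki markup of each article is the text content of the `<text>` element under `<page>`/`<revision>`, so once pugixml has handed that over as a string, the headings can be picked out line by line. A rough sketch, assuming headings follow the usual `== Title ==` form with matching runs of `=` on both sides. `Heading` and `extract_headings` are hypothetical names, and the code takes a plain `std::string` so it does not depend on how the XML was read:

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one heading found in the wikitext.
struct Heading {
    int level;          // number of '=' signs on each side
    std::string title;  // heading text with surrounding spaces trimmed
};

// Scans the wikitext line by line and keeps lines of the form
// "== Title ==". The backreference \1 requires the closing run of '='
// to match the opening run.
std::vector<Heading> extract_headings(const std::string& wikitext) {
    static const std::regex pattern(R"((={1,6})\s*(.+?)\s*\1\s*)");
    std::vector<Heading> headings;
    std::istringstream lines(wikitext);
    std::smatch m;
    for (std::string line; std::getline(lines, line);) {
        // regex_match only succeeds if the whole line is a heading.
        if (std::regex_match(line, m, pattern)) {
            headings.push_back({static_cast<int>(m[1].length()), m[2].str()});
        }
    }
    return headings;
}
```

Matching per line with `std::regex_match` avoids relying on `^` and `$` anchoring at line boundaries, which `std::regex` does not do by default.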