The overall task is to do NLP on Wikipedia pages.

The first step is to access the downloaded Wikipedia database dump (a 40 GB XML file) from GATE. What is a good way to do that? I actually only need the pages in the medical category.

Are there any libraries for this?

Any hints are appreciated!

BW

Matt

1 Answer

As far as I can tell from searching, it seems I have to:

1. Install MediaWiki locally
2. Use mwdumper to import the XML dump into the MediaWiki database (MySQL)
3. Access the MySQL database through the JDBC connector
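
For the last step, here is a minimal sketch of the kind of query I have in mind. It is written in Python rather than JDBC, and it uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server. It assumes the classic MediaWiki schema, where `page` holds the titles and `categorylinks.cl_to` holds the category name; it also assumes `categorylinks` has actually been populated, which I believe the XML import alone does not do. The helper names `pages_in_category` and `make_demo_db` and the sample rows are made up for illustration.

```python
import sqlite3

# Join MediaWiki's `page` and `categorylinks` tables (classic schema assumed:
# cl_from is the member page id, cl_to is the category title with underscores).
PAGES_IN_CATEGORY_SQL = """
SELECT p.page_id, p.page_title
FROM page AS p
JOIN categorylinks AS cl ON cl.cl_from = p.page_id
WHERE cl.cl_to = ? AND p.page_namespace = 0
ORDER BY p.page_title
"""


def pages_in_category(conn, category):
    """Return (page_id, page_title) for articles directly in `category`."""
    # MediaWiki stores titles with underscores instead of spaces.
    key = category.replace(" ", "_")
    return conn.execute(PAGES_IN_CATEGORY_SQL, (key,)).fetchall()


def make_demo_db():
    """Build a tiny in-memory stand-in for the imported database (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT
        );
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [
            (1, 0, "Aspirin"),
            (2, 0, "Paris"),
            (3, 0, "Ibuprofen"),
            (4, 14, "Analgesics"),  # namespace 14 = a category page, not an article
        ],
    )
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [(1, "Drugs"), (3, "Drugs"), (2, "Capitals"), (4, "Drugs")],
    )
    return conn


if __name__ == "__main__":
    for page_id, title in pages_in_category(make_demo_db(), "Drugs"):
        print(page_id, title)
```

The same SQL should carry over to MySQL through JDBC with little change, since a JDBC `PreparedStatement` also uses `?` placeholders. Note that this only returns pages placed directly in the category; covering a whole topic such as medicine would probably mean following subcategories (the namespace 14 rows) as well.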

I don't know whether I am taking a detour.

Matt