So our company has a large number of internal wiki sites for different departments and I'm looking for a way to unify them. We keep trying to get everybody to use the same wiki but it never works; they keep wanting to create new ones. What I want to do as an alternative is to scrape each wiki and create a new wiki whose articles combine the information from each source.
In terms of implementation, I've looked at Nutch (http://nutch.apache.org/) and Scrapy (http://scrapy.org/) for the web crawling, with MediaWiki as the frontend. Basically I'd use the crawler to scrape each wiki, write some code in the middle (I'm thinking Python or Perl) to make sense of the content and merge it into new articles, then write those articles to MediaWiki using its API.
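For the last step, here's a rough sketch of what I had in mind for pushing merged articles into MediaWiki through its action API, using Python and the requests library. The wiki URL, bot account name/password, and page content below are just placeholders, not anything real:

    import requests

    API_URL = "https://wiki.example.com/w/api.php"  # placeholder target wiki

    session = requests.Session()

    # 1. Fetch a login token, then log in with a bot account.
    r = session.get(API_URL, params={
        "action": "query", "meta": "tokens", "type": "login", "format": "json"})
    login_token = r.json()["query"]["tokens"]["logintoken"]

    session.post(API_URL, data={
        "action": "login", "lgname": "MergeBot", "lgpassword": "secret",
        "lgtoken": login_token, "format": "json"})

    # 2. Fetch a CSRF token, which is required for edits.
    r = session.get(API_URL, params={
        "action": "query", "meta": "tokens", "format": "json"})
    csrf_token = r.json()["query"]["tokens"]["csrftoken"]

    # 3. Create (or overwrite) an article with the merged wikitext.
    def write_article(title, wikitext, summary="Imported from departmental wikis"):
        resp = session.post(API_URL, data={
            "action": "edit", "title": title, "text": wikitext,
            "summary": summary, "token": csrf_token, "format": "json"})
        resp.raise_for_status()
        return resp.json()

    # Example: push one merged page built by the middle layer.
    write_article("Network Setup", "Combined content scraped from the HR and IT wikis...")

The "make sense of it" middle layer would sit in front of write_article(), taking the scraped pages for a topic and producing a single block of wikitext.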
I wasn't sure if anybody has had similar experience or knows a better way to do this; I'm trying to do some R&D before I get too deep into the project.