Our company has a large number of internal wiki sites for different departments, and I'm looking for a way to unify them. We keep trying to get everybody to use the same wiki, but it never works; people keep creating new ones. As an alternative, I want to scrape each wiki and create a new wiki whose articles combine the information from each source.

In terms of implementation, I've looked at Nutch (http://nutch.apache.org/) and Scrapy (http://scrapy.org/) for the web crawling, with MediaWiki as the frontend. Basically, I'd use the crawler to scrape each wiki, write some code in the middle (I'm thinking of Python or Perl) to make sense of the content and create new articles, and write those articles to MediaWiki using its API.
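A rough sketch of the middle and last steps, under stated assumptions: the scraped text for a topic has already been collected per source wiki, and the combined wiki exposes the standard MediaWiki `api.php` endpoint. `API_URL`, `merge_articles`, `build_edit_request`, and `post_edit` are hypothetical names for illustration; the `action=edit` fields (`title`, `text`, `summary`, `token`) are the ones the MediaWiki API documents, and the CSRF token would come from a prior `action=query&meta=tokens` call in a logged-in session.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint of the combined wiki; replace with the real one.
API_URL = "http://wiki.example.internal/w/api.php"


def merge_articles(sources):
    """Combine scraped text for one topic into a single wikitext page.

    `sources` maps a source wiki name to the text scraped from it.
    Each source gets its own section so the origin stays visible.
    """
    parts = []
    for wiki_name in sorted(sources):
        body = sources[wiki_name].strip()
        parts.append("== From %s ==\n%s" % (wiki_name, body))
    return "\n\n".join(parts)


def build_edit_request(title, text, csrf_token):
    """Build the POST fields for MediaWiki's action=edit."""
    return {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": text,
        "summary": "Merged from departmental wikis",
        "token": csrf_token,
    }


def post_edit(fields, opener):
    """Send the edit. `opener` is a urllib opener that already holds
    the login cookies for the combined wiki (assumption)."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(API_URL, data) as response:
        return json.load(response)
```

Keeping the merge and the request-building as plain functions means they can be checked without touching the wiki; only `post_edit` needs a live session.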

Has anybody had similar experience, or does anyone know a better way to do this? I'm trying to do some R&D before I get too deep into the project.

Nemo
Sam P
  • Mixing wikis on different topics sounds like a terrible idea. Why not just add links between them where there is overlap? – stark Dec 13 '12 at 00:16
  • That's the problem, there is way too much overlap and finding relevant information is extremely difficult as things are right now. – Sam P Dec 13 '12 at 00:23

2 Answers

I did something very similar a little while back. I wrote a small Python script that scrapes a page hierarchy in our Confluence wiki, saves the resulting HTML pages locally, and converts them into DITA XML topics for processing by our documentation team.

Python was a good choice: I used mechanize for my browsing/scraping needs and the lxml module for making sense of the XHTML (it has quite a nice range of XML traversal/selection methods). Worked out nicely!
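A minimal sketch of the parsing step, not the original script. To keep it runnable without extra packages it swaps in the standard library's `xml.etree.ElementTree` in place of lxml (whose `lxml.etree` module offers a largely compatible API), and it assumes the saved page is well-formed XHTML with no XML namespace. `SAMPLE_PAGE` and `extract_topic` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for a page saved from the wiki.
SAMPLE_PAGE = """<html>
  <body>
    <h1>Build server</h1>
    <p>Runs on <b>host-a</b>.</p>
    <p>Restart it with the admin script.</p>
  </body>
</html>"""


def extract_topic(xhtml):
    """Return (title, paragraphs) from a well-formed XHTML page."""
    root = ET.fromstring(xhtml)
    # First <h1> anywhere in the document is taken as the topic title.
    heading = root.find(".//h1")
    title = "".join(heading.itertext()).strip() if heading is not None else ""
    # itertext() flattens inline markup such as <b> into plain text.
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
    return title, paragraphs


title, paragraphs = extract_topic(SAMPLE_PAGE)
```

From here the title and paragraphs can be mapped onto whatever target format is needed (DITA topics in my case).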

David Heijl

Please don't do screen scraping; you make me cry.

If you just want to regularly merge all wikis into one and have them under a "single wiki", export each wiki to XML and import the XML of each wiki into its own namespace of the combined wiki.
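One possible way to prepare an export for this, sketched under assumptions: each wiki's dump uses the usual MediaWiki export layout (`<mediawiki>` root, `<page>` elements with a `<title>` child), and the target namespace has already been defined on the combined wiki (for example via `$wgExtraNamespaces`). `prefix_titles`, `SAMPLE_DUMP`, and the `Helpdesk` namespace are hypothetical; the sketch only rewrites titles and does not perform the import itself.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed stand-in for one wiki's XML export.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>VPN setup</title>
    <revision><text>Use client X.</text></revision>
  </page>
</mediawiki>"""


def prefix_titles(dump_xml, namespace_name):
    """Prefix every page title with `namespace_name:`."""
    root = ET.fromstring(dump_xml)
    # The schema version in the XML namespace differs between MediaWiki
    # releases, so read it from the root tag instead of hard-coding it.
    xmlns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""
    if xmlns:
        # Keep the output free of ns0: prefixes.
        ET.register_namespace("", xmlns)
    page_tag = "{%s}page" % xmlns if xmlns else "page"
    title_tag = "{%s}title" % xmlns if xmlns else "title"
    for page in root.iter(page_tag):
        title = page.find(title_tag)
        if title is not None and title.text:
            title.text = "%s:%s" % (namespace_name, title.text)
    return ET.tostring(root, encoding="unicode")


renamed = prefix_titles(SAMPLE_DUMP, "Helpdesk")
```

Pages that already carry a namespace prefix on the source wiki (such as `Help:` pages) would end up with a doubled prefix here and probably need separate handling.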

If you want to integrate the wikis more tightly and on a live basis, you need crosswiki transclusion on the combined wiki to load the HTML from a remote wiki and show it as a local page. You can build on existing solutions:

Nemo