
I'm crawling some HTML files with crawler4j and I want to replace all links in those pages with custom links. Currently I can get the source HTML and a list of all outgoing links with this code:

        // inside the crawler's visit(Page page) callback
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String html = htmlParseData.getHtml();
        List<WebURL> links = htmlParseData.getOutgoingUrls();

However, a simple foreach loop with search & replace won't get me what I want. The problem is that `WebURL.getURL()` returns the absolute URL, but in the page source the links are sometimes relative and sometimes absolute.

I want to handle all links (images, URLs, JavaScript files, etc.). For instance, I want to replace `images/img.gif` with `view.php?url=http://www.domain.com/images/img.gif`.
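
For the resolution part alone, `java.net.URI` can absolutize a relative link against the page URL before rewriting it. Below is a minimal sketch, assuming the `view.php?url=` scheme from the example above; the helper name and the page URL are hypothetical:

    import java.net.URI;
    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    public class LinkResolver {

        // Hypothetical helper: absolutize a (possibly relative) link found on
        // pageUrl and wrap it in the view.php?url= scheme described above.
        static String toViewLink(String pageUrl, String link) {
            // resolve() handles "images/img.gif", "/images/img.gif" and
            // already-absolute URLs alike.
            String absolute = URI.create(pageUrl).resolve(link).toString();
            return "view.php?url=" + URLEncoder.encode(absolute, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            // Prints view.php?url=http%3A%2F%2Fwww.domain.com%2Fimages%2Fimg.gif
            System.out.println(toViewLink("http://www.domain.com/page.html", "images/img.gif"));
        }
    }

The target URL is percent-encoded here so that its own query string cannot interfere with the `url` parameter.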

The only solution that comes to mind is using a somewhat complicated regex, but I'm afraid I'm going to miss some rare cases. Has this been done already? Is there a library or some tool to achieve this?

Alireza Noori
  • There seems to be no such tool or library; however, regex is a powerful tool and sooner or later you will have to learn how to use it. I suggest you just try to use it right away. You might need to write some unit tests for that as well. – Gavin Xiong Jan 03 '13 at 13:11
  • Have you tried my answer? I have faced a problem like yours and I used this regex. – ibrahimyilmaz Jan 03 '13 at 14:44
  • @GavinXiong Actually, I'm very familiar with regex. I've written tools which can modify C++ source code just with the help of powerful regexes. However, as I mentioned in the comment below, there might be some cases, such as malformed HTML, which can cause problems. – Alireza Noori Jan 03 '13 at 15:44
  • @AlirezaNoori I don't see what you can do in cases of malformed HTML... once the parser has done its job, about all you can do is deal with the results. So do you really need to modify **all** of the links? There might be links to javascript, iframe sources, embedded object sources, etc. Where do you draw the line? – Kiril Jan 04 '13 at 21:19
  • @Lirik Not all, but most of the links. For instance, I don't want to replace email links, etc. As for the first part, I'm looking for a parser rather than a regex (see the sketch after these comments). And since crawler4j has one already, I may have to modify its code. But I'd rather use a better solution if one exists. – Alireza Noori Jan 04 '13 at 22:35
  • @AlirezaNoori I understand, but what I was trying to get was a clarification on exactly which links do you want to modify. Just `a href` and `img src`? – Kiril Jan 05 '13 at 02:42
  • @Lirik Basically all of them. I'm trying to create a copy of a website and show it on my server. There's a script called `view.php` which takes a `url` parameter and shows its content. `js`, `css`, `images`, etc. So I want `href`, `src`, ..., all of them. – Alireza Noori Jan 05 '13 at 06:39
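
Since the comments above converge on rewriting `href`, `src`, and the rest with a proper parser rather than a regex, a minimal sketch with an HTML parser such as Jsoup could look like this (the `view.php` scheme is from the question; the selectors and class name are assumptions):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ProxyRewriter {

        // Rewrite every href/src attribute in the document to go through view.php.
        static String rewrite(String html, String pageUrl) {
            // Parsing with a base URL lets absUrl() resolve relative links.
            Document doc = Jsoup.parse(html, pageUrl);
            for (Element e : doc.select("[href]")) {
                e.attr("href", "view.php?url=" + e.absUrl("href"));
            }
            for (Element e : doc.select("[src]")) {
                e.attr("src", "view.php?url=" + e.absUrl("src"));
            }
            return doc.outerHtml();
        }
    }

Attributes such as `srcset` or `url(...)` references inside CSS would still need separate handling, and so would skipping `mailto:` links.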

2 Answers


I think you can make use of a regular expression for this. For example:

    ...
    String regex = "\\/[^.]*\\/[^.]*\\.";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);

    while (matcher.find()) {
        String imageLink = matcher.group();
        // Note: replace() rewrites every occurrence of imageLink in the text,
        // so a path that appears more than once gets prefixed repeatedly.
        text = text.replace(imageLink, prefix + imageLink);
    }
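
If you do go the regex route, `Matcher.appendReplacement` is a safer way to build the result than `String.replace`, because it rewrites each match exactly once even when the same path occurs several times. A sketch under the same assumptions as the snippet above (same crude pattern, `prefix` supplied by the caller):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PrefixLinks {

        static String prefixLinks(String text, String prefix) {
            // Same crude pattern as above: "/dir/name." style paths.
            Matcher matcher = Pattern.compile("\\/[^.]*\\/[^.]*\\.").matcher(text);
            StringBuffer sb = new StringBuffer();
            while (matcher.find()) {
                // Each match is replaced once, in document order.
                matcher.appendReplacement(sb, Matcher.quoteReplacement(prefix + matcher.group()));
            }
            matcher.appendTail(sb);
            return sb.toString();
        }
    }
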
ibrahimyilmaz
  • This regex is too simple for that use. It will replace the URLs in the text too; I don't want that. Also, in my question, I mentioned I have regex in mind, but I think there should be some sort of parser for this to achieve better results. For instance, there might be some malformed HTML which would not get caught with a regex. – Alireza Noori Jan 03 '13 at 15:41
  • This regex helps you find the relative URLs. After you find the relative URLs, you can turn them into whatever form you want. – ibrahimyilmaz Jan 03 '13 at 16:01

Does it have to be a Java solution? PhantomJS in combination with pjscrape can scrape a page to find all URLs.

You just have to create a JavaScript configuration file.

getlinks.js:

    pjs.addSuite({
        url: 'http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html',
        noConflict: true,
        scraper: function() {
            var links = _pjs.$('a').map(function() {
                // convert relative URLs to absolute
                return _pjs.toFullUrl($(this).attr('href'));
            });
            return links.toArray();
        }
    });
    pjs.config({
        // options: 'stdout' or 'file'
        log: 'stdout',
        // options: 'json' or 'csv'
        format: 'json',
        // options: 'stdout' or 'file' (set in config.outFile)
        writer: 'file',
        outFile: 'scrape_output.json'
    });

Then run the command `phantomjs pjscrape.js getlinks.js`. In this example the output is stored in a file (it can also be logged to the console).

Here is the (partial) output:

* Suite 0 starting
* Opening http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Scraping http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Suite 0 complete
* Writing 145 items
["http://stackoverflow.com/users/login?returnurl=%2fquestions%2f14138297%2freplace-all-urls-in-a-html","http://careers.stackoverflow.com","http://chat.stackoverflow.com","http://meta.stackoverflow.com","http://stackoverflow.com/about","http://stackoverflow.com/faq","http://stackoverflow.com/","http://stackoverflow.com/questions","http://stackoverflow.com/tags","http://stackoverflow.com/users","http://stackoverflow.com/badges","http://stackoverflow.com/unanswered","http://stackoverflow.com/questions/ask", ...
"http://creativecommons.org/licenses/by-sa/3.0/","http://creativecommons.org/licenses/by-sa/3.0/","http://blog.stackoverflow.com/2009/06/attribution-required/"]
* Saved 145 items
asgoth
  • My code is in Java. If I can't find anything else, I could read this code and port it to Java. However, if possible I'm looking to do as little as I can ;) – Alireza Noori Jan 03 '13 at 19:36