In short: I can’t get the title of this URL: http://www.namlihipermarketleri.com.tr/default.asp?git=9&urun=10277 (the link is broken as of 18-11-2015).

In my WebCrawler implementation:

     @Override
     public void visit(Page page) {          
         System.out.println(page.getWebURL().getURL()); // this prints the URL as expected
         if (page.getParseData() instanceof HtmlParseData) {
             HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
             System.out.println(htmlParseData.getTitle()); // This line prints an empty line!
         }
     }

Note: the title itself contains some commas (“,”). Can you suggest a solution? Is this a bug?

Thanks in advance.

Daniel Werner
Ismail Yavuz

1 Answer

The problem was probably that the HTML document contained 4 title tags.

I worked around it by parsing the raw HTML with Jsoup (http://jsoup.org/) and taking the first title tag:

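     // Needs: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.select.Elements
     HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
     String html = htmlParseData.getHtml();
     Document htmlDocument = Jsoup.parse(html);
     // The page may contain several <title> elements; take the first one.
     Elements titles = htmlDocument.getElementsByTag("title");
     // Guard against pages with no <title> at all, where get(0) would throw.
     String title = titles.isEmpty() ? "" : titles.get(0).text();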
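
If adding Jsoup is not an option, the same idea can be sketched with only the standard library. This is a rough illustration, not how crawler4j or Jsoup work internally: `TitleExtractor` and `firstNonEmptyTitle` are hypothetical names, it uses a regular expression rather than a real HTML parser, and it assumes you want the first title tag that actually has text (a slight variation on taking tag 0). It does not decode HTML entities.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of crawler4j or Jsoup.
final class TitleExtractor {
    // Matches <title ...>...</title>, case-insensitively, across line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private TitleExtractor() {}

    /** Returns the first non-blank title text, or "" if none is found. */
    static String firstNonEmptyTitle(String html) {
        if (html == null) {
            return "";
        }
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            // Collapse whitespace runs so a multi-line title becomes one line.
            String text = m.group(1).replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                return text;
            }
        }
        return "";
    }
}
```

Inside `visit(Page page)` this could be called as `TitleExtractor.firstNonEmptyTitle(htmlParseData.getHtml())`. For anything beyond a quick check, a real parser such as Jsoup is the safer choice, since regular expressions are easily confused by comments or scripts that contain the text `<title>`.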
Ismail Yavuz