6

I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for English text, but for text with special characters, for example words with accent marks (história), these special characters are not extracted correctly. I think it is an encoding problem.

The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper.

My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any way to work around this and get the text correctly?

How I'm using the library (first attempt, based on the URL):

URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);

(second attempt, on the HTML source code):

String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
pedro_silva

6 Answers

2

You don't have to modify inner Boilerpipe classes.

Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:

URL url = new URL("http://some-page-with-utf8-encodeing.tld");

InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);

Regards!

cnr..
    First, sorry for taking so long to comment on your answer, and thank you for giving it. Unfortunately it is not working for me. I just tried it, and all the letters with accent marks become '?' when I print the extracted article. I will stay with the previous solution for now. – pedro_silva Jul 05 '12 at 13:37
1

Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in an average phrase. In any language that is more or less verbose than English (i.e. every other language) these algorithms will be less accurate.

Additionally, the library uses some English phrases to try and find the end of the article (comments, post a comment, have your say, etc) which will clearly not work in other languages.

This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.

Luke
1

Java:

import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Boilerpipe {

    public static void main(String[] args) {
        try{
            URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");

            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());

            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        }catch(Exception e){
            e.printStackTrace();
        }
    }

}

Eclipse: Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.
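Outside Eclipse, the same idea (making the console output use UTF-8) can be sketched with a PrintStream that is given an explicit encoding. This is only a minimal illustration: the class name Utf8Console and the helper utf8Stream are made up for this example, and whether the accented characters actually display correctly still depends on the terminal being set to UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {

    // Hypothetical helper: wraps a stream in a PrintStream that always encodes
    // text as UTF-8, regardless of the platform default charset.
    static PrintStream utf8Stream(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // "história", written with a Unicode escape so the source file encoding does not matter
        String text = "hist\u00f3ria";

        PrintStream out = utf8Stream(System.out);
        out.println(text);
    }
}
```

If the extracted text prints as '?' with plain System.out.println but prints correctly through a stream like this, the extraction itself was probably fine and only the output encoding was wrong.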


Chris
1

Well, from what I see, when you use it like that, the library will automatically choose which encoding to use. From the HTMLFetcher source:

public static HTMLDocument fetch(final URL url) throws IOException {
    final URLConnection conn = url.openConnection();
    final String ct = conn.getContentType();

    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        Matcher m = PAT_CHARSET.matcher(ct);
        if(m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException e) {
                // keep default
            }
        }
    }
    // ... (rest of the method omitted)
}

Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
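To see in isolation what that detection logic does, here is a small standalone sketch of the same idea: read the charset from a Content-Type header value and fall back to a default when it is missing or unsupported. The class name, the method fromContentType and the regular expression are assumptions made for illustration; boilerpipe's real PAT_CHARSET is not shown in the snippet above and may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {

    // Hypothetical pattern, not boilerpipe's own PAT_CHARSET.
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the Content-Type value, or the fallback
    // if the header is absent, has no charset, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PATTERN.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // keep default
        }
    }

    public static void main(String[] args) {
        // The boilerpipe snippet defaults to Cp1252; ISO-8859-1 is used here
        // only because every Java platform is guaranteed to support it.
        Charset fallback = StandardCharsets.ISO_8859_1;

        System.out.println(fromContentType("text/html; charset=UTF-8", fallback));
        System.out.println(fromContentType("text/html", fallback));
    }
}
```

The point to notice is the second call: when the server sends no charset in the header, the default wins, even if the page itself is really UTF-8.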

Shivan Dragon
  • Thank you for your answer. I'm sorry for only getting to it now, but I have been stuck in another project. I tried printing the encoding that was set on the variable cs after this chunk of code, and the result was always ISO-8859-1. I also tried to force the encoding to be UTF-8, but got no better results. The problem must be in one of the conversions: to HTMLDocument, to TextDocument, etc. But I'm having some trouble printing their text content. Any ideas? Thanks again. – pedro_silva Feb 24 '12 at 20:06
  • Andrei, you were right. I was overcomplicating things, but in the end it was a very simple solution. Thanks again, I'm sorry I can't upvote you yet. – pedro_silva Mar 06 '12 at 15:33
1

OK, got a solution. As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax. What I did was convert all the fetched text to UTF-8. At the end of the fetch function, I had to add two lines and change the last one:

final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs).getBytes(Charset.forName("UTF-8")); // new line: re-encode the fetched bytes as UTF-8
cs = Charset.forName("UTF-8"); // new line: set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
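The conversion step on its own, outside boilerpipe, can be sketched as follows. The class Transcode and the method toUtf8 are names invented for this illustration; the only real API used is String and the standard charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {

    // Decodes the bytes using the charset they were written in,
    // then re-encodes the resulting text as UTF-8.
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "história" as a server would send it in ISO-8859-1 (one byte per character)
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);

        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);

        System.out.println(latin1.length + " bytes in ISO-8859-1, "
                + utf8.length + " bytes in UTF-8");
    }
}
```

This only gives correct text if the source charset passed in is the one the bytes were really written in; re-encoding bytes that were decoded with the wrong charset just preserves the damage.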
pedro_silva
0

I had the same problem; cnr's solution works great. Just change the UTF-8 encoding to ISO-8859-1. Thanks.

URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
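Why the forced encoding has to match the page's real encoding can be shown with a small sketch that does not involve boilerpipe at all. The class name CharsetMismatch and the helper decode are made up for this example; it simply decodes the same UTF-8 bytes once with the right charset and once with the wrong one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {

    // Turns raw bytes into text using the given charset.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        String original = "hist\u00f3ria"; // "história"
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        // Matching charset: the text survives intact.
        System.out.println("UTF-8 decode matches original: " + right.equals(original));
        // Mismatched charset: the two UTF-8 bytes of the accented letter
        // are read as two separate characters, so the text is garbled.
        System.out.println("ISO-8859-1 decode matches original: " + wrong.equals(original));
    }
}
```

So whether UTF-8 or ISO-8859-1 is the right value for setEncoding depends on the site being fetched, which would explain why one works for some people and the other for others.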
crowler