
This is my code:

// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);
// Convert an InputStream to an InputSource
org.xml.sax.InputSource fileSource = new org.xml.sax.InputSource(fileStream);
// Extract text via the Boilerpipe DefaultExtractor
String text = DefaultExtractor.INSTANCE.getText(fileSource);

// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);

I can't figure out why only the first extractor works: Boilerpipe extracts the text, but Apache Tika (the second extractor) is not able to extract anything.

I tried copying fileStream (via InputStream fileStream2 = fileStream;) and passing fileStream to one reader and fileStream2 to the other, but that didn't work either.

I also tried passing to Boilerpipe the HTML extracted from fileStream, and fileStream to Tika, but the result was the same.

I suspect that the problem is that the same InputStream cannot be read twice.
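That suspicion can be checked without Tika or Boilerpipe. The following is a minimal sketch using only java.io; the class name SingleUseStreamDemo, the helper drain() and the sample HTML are made up for illustration, and a ByteArrayInputStream stands in for whatever getFile() returns. It also shows why the fileStream2 = fileStream attempt cannot help: the assignment copies the reference, so both variables point at the same, already consumed stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class SingleUseStreamDemo {

    // Hypothetical helper: reads whatever is still left in the stream as UTF-8 text.
    static String drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getFile(source); the sample content is an assumption.
        InputStream fileStream = new ByteArrayInputStream(
                "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8));

        // Copies the reference only: both names refer to the same stream object.
        InputStream fileStream2 = fileStream;

        String first = drain(fileStream);   // consumes the stream up to its end
        String second = drain(fileStream2); // nothing left to read

        System.out.println("same object:        " + (fileStream == fileStream2));
        System.out.println("first read empty?   " + first.isEmpty());
        System.out.println("second read empty?  " + second.isEmpty());
    }
}
```

The first call returns the whole document and the second returns an empty string, which matches the symptom of the second extractor getting nothing.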

Could you please help me pass the content of one InputStream to two readers?

EDIT: I found the solution and posted it below.

Salvatore

2 Answers

1

If you have a Maven project, you have to include these dependencies in your pom.xml for Boilerpipe to work:

<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>x.y.z</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>x.y.z</version>
</dependency>
Nicomedes E.
0

I found out that an InputStream can't be read twice, which is what Tika and Boilerpipe were doing in my old code. So I now read fileStream into a String, pass that String to Boilerpipe, then convert the String to a ByteArrayInputStream and pass that to Tika. This is my new code:

// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);

// Read the value of the InputStream and pass it to the
// Boilerpipe DefaultExtractor in order to extract the text
String html = readFromStream(fileStream);
String text = DefaultExtractor.INSTANCE.getText(html);

// Convert the value read from fileStream to a new ByteArrayInputStream
fileStream = new ByteArrayInputStream(html.getBytes("UTF-8"));

// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);
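
A variation on the same idea, offered only as a sketch: buffer the raw bytes once instead of going through a String, then open a fresh ByteArrayInputStream for each extractor. This avoids decoding and re-encoding the content, which may matter if the file is not UTF-8 or is not text at all. The class name BufferedCopies, the helper toByteArray() and the sample content are hypothetical; the extractor calls are left as comments because they are the same ones used above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BufferedCopies {

    // Hypothetical helper: reads the whole stream into memory exactly once.
    static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getFile(source); the sample content is an assumption.
        InputStream fileStream = new ByteArrayInputStream(
                "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8));

        byte[] content = toByteArray(fileStream);

        // Each consumer gets its own stream with its own read position.
        InputStream forBoilerpipe = new ByteArrayInputStream(content);
        InputStream forTika = new ByteArrayInputStream(content);

        // forBoilerpipe would be wrapped in an org.xml.sax.InputSource and
        // passed to DefaultExtractor.INSTANCE.getText(...);
        // forTika would be passed to parser.parse(forTika, handler, metadata, context).
        System.out.println("bytes for Boilerpipe: " + forBoilerpipe.available());
        System.out.println("bytes for Tika:       " + forTika.available());
    }
}
```

Both approaches hold the entire file in memory, so they are only reasonable for documents of modest size.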
Salvatore