I need to load a PDF document in my GATE Embedded app. I tried parsing the PDF to a string with Apache Tika, but GATE's ANNIE tool can't find any annotations in the string. I've heard about TikaFormat, but I can't find any examples of how to use it.

Does anyone have an example of using TikaFormat, or of loading PDF documents successfully some other way?

Ross
Respino
  • Can you clarify why neither the plain text output nor the HTML output from Apache Tika is working for you? – Gagravarr Mar 15 '14 at 06:22
  • I tried the plain text output in the AnnieStandAlone example, but the API doesn't produce any annotations. With web pages, however, the example runs perfectly. – Respino Mar 16 '14 at 15:07
  • What if you get Tika to output HTML rather than plain text, do the annotations come through then? – Gagravarr Mar 16 '14 at 18:17

1 Answer

I think I'm too late to answer this question, but if anyone has the same question in the future, here is the answer.

First, use Tika to extract the content of the file (this works for any file type Tika supports):

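   import java.io.File;
   import java.io.FileInputStream;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.parser.ParseContext;
   import org.apache.tika.parser.Parser;
   import org.apache.tika.sax.BodyContentHandler;

   File file = new File("file path");
   // parse method parameters
   Parser parser = new AutoDetectParser();
   BodyContentHandler handler = new BodyContentHandler();
   Metadata metadata = new Metadata();
   ParseContext context = new ParseContext();
   // parse the file; try-with-resources closes the stream afterwards
   try (FileInputStream inputstream = new FileInputStream(file)) {
       parser.parse(inputstream, handler, metadata, context);
   }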

Then, after initializing GATE with Gate.init(), wrap the extracted text in a GATE document and run the pipeline:

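   import gate.Corpus;
   import gate.Document;
   import gate.Factory;

   Corpus corpus = Factory.newCorpus("SegmenterCorpus");
   // handler is the Tika BodyContentHandler that holds the extracted text
   Document document = Factory.newDocument(handler.toString());
   corpus.add(document);
   pipeline.setCorpus(corpus); // pipeline: an already-loaded controller, e.g. the ANNIE application
   pipeline.execute();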
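
If ANNIE still finds nothing, it may be worth checking what Tika actually returned before handing the string to GATE. A PDF that contains only scanned images, for example, yields an empty string unless OCR is set up. Below is a minimal sketch of such a check using only the Java standard library. The class ExtractedText and its methods hasContent and tidy are hypothetical names made up for this illustration; they are not part of Tika or GATE, and collapsing blank lines is only an assumption about what might make PDF text easier to inspect.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for inspecting extracted text; not part of Tika or GATE. */
final class ExtractedText {
    // Matches three or more consecutive line breaks, each optionally
    // preceded by spaces or tabs (i.e. two or more blank lines in a row).
    private static final Pattern BLANK_RUNS =
            Pattern.compile("(?:[ \\t]*\\r?\\n){3,}");

    private ExtractedText() {}

    /** Returns true when the text has at least one non-whitespace character. */
    static boolean hasContent(String text) {
        return text != null && !text.trim().isEmpty();
    }

    /** Trims the text and collapses runs of blank lines into a single blank line. */
    static String tidy(String text) {
        if (text == null) {
            return "";
        }
        return BLANK_RUNS.matcher(text.trim()).replaceAll("\n\n");
    }
}
```

With this in place, one could call ExtractedText.hasContent(handler.toString()) before creating the document, and pass ExtractedText.tidy(handler.toString()) to Factory.newDocument instead of the raw string.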

For more information about using Tika, see the TIKA Tutorial; it is very useful and teaches you how to use Tika step by step.

Abeer zaroor