
I am new to GATE NLP. I have a document that contains bullets. When I load it into GATE, the bullets are detected as an unknown symbol that is displayed as an unrecognized glyph. I also tried setting the encoding to UTF-8. When I load the document programmatically, the bullets show up as ? instead.

Can anyone explain this?

Example:

 Promoted to Senior Member Technical in 2.5 years of experience.

The symbol at the start of that line is what appears in the GATE Developer UI; the ? symbol appears when I load the document programmatically.

dedek
ganesh
  • You have to provide more details, otherwise your question cannot be answered. For example: what kind of file (txt, pdf, doc, docx) are you loading? What do you mean by "loading programmatically"? Can you post the relevant part of your source code? – dedek Aug 08 '16 at 15:22
  • For `pdf` this may be related: _In WinAnsiEncoding, any unused code greater than 040 maps to the bullet character_ https://issues.apache.org/jira/browse/PDFBOX-1713 – dedek Aug 08 '16 at 15:33
  • It's for pdf, doc, and docx. "Programmatically" means I am using GATE Embedded to load the document and run a pipeline over it. When I execute it, the ? characters appear. – ganesh Aug 09 '16 at 05:05

1 Answer


In my experience, doc and docx files usually do not produce such characters. Bullets are either missing (text formatted as a bullet list) or printed as normal bullets (text with raw bullet characters).

See also this related question: Parsing either font style or block of paragraph in GATE

PDF files often produce such unknown "bullet characters" in a GATE document. This may be related to PDF or Apache PDFBox issues, see e.g. this one.

These characters also have a Unicode value. In XML, they are encoded as character references. In this case, my advice is to trace such characters (they may have different Unicode values depending on the original bullet character) and replace them with something printable.
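The trace-and-replace advice could be sketched in plain Java roughly as follows. This is only an illustration under an assumption: that the unknown glyphs are Private Use Area code points (for example U+F0B7, which symbol fonts often use for bullets). The class name `BulletCleaner` and its methods are hypothetical helpers, not part of GATE, and your documents may contain other code points that need different handling.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class BulletCleaner {
    // Assumption: the unknown glyphs are Private Use Area code points.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    // Trace step: list the distinct suspicious code points, in order of appearance.
    static Set<String> findPrivateUse(String text) {
        Set<String> found = new LinkedHashSet<>();
        text.codePoints()
            .filter(BulletCleaner::isPrivateUse)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    // Replace step: swap every Private Use Area code point for a printable string.
    static String replacePrivateUse(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isPrivateUse(cp)) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: U+F0B7 stands in for the unknown bullet glyph.
        String sample = "\uF0B7 Promoted to Senior Member Technical";
        System.out.println(findPrivateUse(sample));
        System.out.println(replacePrivateUse(sample, "\u2022"));
    }
}
```

Running the trace step first shows which code points actually occur before you decide what to replace them with. The replacement here is U+2022 (the standard bullet), but a plain `*` or `-` would work the same way.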

Concerning the ? characters: it is probably caused by your Java environment, which doesn't support these characters. See e.g.: Why Some Unicode Characters appears to be question mark in the console?
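To see the ? substitution in isolation, here is a small sketch, assuming the cause is an output encoding that cannot represent the character. The class `Utf8Console` and its `roundTrip` helper are made up for illustration. Encoding a bullet with a charset that lacks it yields `?`, while UTF-8 preserves it.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Console {
    // Encode and decode with the given charset; unmappable characters
    // are replaced by the charset's replacement byte ('?' for US-ASCII).
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String bullet = "\u2022 item";

        // A charset without the bullet character turns it into '?'.
        System.out.println(roundTrip(bullet, StandardCharsets.US_ASCII));

        // Wrapping System.out in a UTF-8 PrintStream keeps the character intact,
        // provided the terminal itself is configured to display UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(bullet);
    }
}
```

If the second line still shows ? in your console, the terminal or IDE console encoding is the more likely culprit rather than GATE.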

dedek
  • Yes, the problem is with the PDF documents. I am now converting doc to HTML and processing the HTML document instead, so it's working for me. Thanks @dedek – ganesh Aug 09 '16 at 07:38