Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language involved. Approaches range from regular expressions to classifiers to more complex or custom models.
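As a minimal sketch of the regular-expression end of that range, the Python snippet below pulls structured records out of free text. The sample text, the field names, and the pattern are all hypothetical and chosen only for illustration; real documents usually need more tolerant patterns.

```python
import re

# Hypothetical sample text; the layout and field names are illustrative only.
TEXT = (
    "Invoice 1042 issued 2021-03-15, total 38,50 EUR. "
    "Invoice 1043 issued 2021-04-02, total 120,00 EUR."
)

# Named groups turn each match into a small structured record.
INVOICE_RE = re.compile(
    r"Invoice (?P<number>\d+) issued (?P<date>\d{4}-\d{2}-\d{2}), "
    r"total (?P<amount>\d+,\d{2}) (?P<currency>[A-Z]{3})"
)


def extract_invoices(text):
    """Return one dict of named fields per invoice-like match in text."""
    return [m.groupdict() for m in INVOICE_RE.finditer(text)]


records = extract_invoices(TEXT)
for record in records:
    print(record)
```

A pattern like this only works while the input keeps a fixed wording, which is one reason classifiers or custom models are used for less regular documents.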

1282 questions
7
votes
1 answer

Extract Text with its Font Details (Style, Size, Color, Italic, etc.) from a PDF in Python

I am looking to extract text along with its font details (style, size, color, italic, etc.) from a PDF in Python. I need the text and its metadata for translation purposes. Can anyone suggest libraries for this?
Udaya Kiran
  • 71
  • 1
  • 3
7
votes
3 answers

Not able to understand the coordinates in a document extracted using the OCR engine Tesseract

I have extracted text from an image document with Tesseract, and the extraction succeeded. But I am not able to understand the coordinates in the extracted document. Problem description: it shows coordinates, but I would like to know whether these coordinates…
S.P Singh
  • 1,267
  • 3
  • 17
  • 23
7
votes
3 answers

Is there a way to use readability and python to extract just text, not HTML?

I need to extract pure text from a random web page at runtime, on the server side. I use Google App Engine and a Python port of Readability. There are a number of those: an early version by gfxmonk, based on BeautifulSoup; a version by minvolai, based on…
7
votes
2 answers

Rule-based PDF text extraction for various bills and invoices

I have to extract text from PDF files of invoices and bills. The file layouts can get complex, though they are mostly filled with tables. I've already read a few dozen articles about the PDF format, how easy it is for our brain to grasp it and how hard it…
Guy Gavriely
  • 11,228
  • 6
  • 27
  • 42
6
votes
6 answers

Using boilerpipe to extract non-english articles

I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for texts in English, but for text with special characters, for example words with accent marks (história), these special characters are not…
pedro_silva
  • 143
  • 2
  • 6
6
votes
4 answers

Is there a boilerpipe port for .net?

Does anybody know of a .NET port of the boilerpipe library?
aogan
  • 2,241
  • 1
  • 15
  • 24
6
votes
2 answers

How to extract text from table in image?

I have data in a structured table image. The data is like below: I tried to extract the text from this image using this code: import pytesseract from PIL import Image value=Image.open("data/pic_table3.png") text =…
Afianh
  • 118
  • 1
  • 6
6
votes
0 answers

How to skip the character causing UnicodeDecodeError: using textract like errors="replace"?

I am trying to convert all readable text in a PDF file into a string using textract. It works for most files, but for some it raises a UnicodeDecodeError. I want to skip the problematic characters. I have tried to find a way to solve it with…
6
votes
3 answers

Extract pdf text within bounding box directly into python

I'm trying to extract the text of a pdf within a given bounding rectangle. I understand there are tools for pdf scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all 3, and so far I've only gotten code for pdftotext to extract…
Evan Mata
  • 500
  • 1
  • 6
  • 19
6
votes
1 answer

Keyword/keyphrase extraction from text

I am working on a project where I need to extract "technology related keywords/keyphrases" from text. For example, my text is: "ABC Inc has been working on a project related to machine learning which makes use of the existing libraries for finding…
6
votes
5 answers

How to install textract in python3

I tried the following commands: sudo python3 -m pip install textract; sudo apt-get install textract; pip install textract; sudo apt-get install swig. I want to install textract in Python 3, but it does not install properly and gives the following error: x86_64-linux-gnu-gcc -pthread…
Jay Pratap Pandey
  • 352
  • 2
  • 9
  • 19
6
votes
2 answers

Apache PDFBox Remove Spaces between characters

We are using PDFBox to extract text from PDFs. The text of some PDFs can't be extracted correctly. The following image shows part of the PDF as an image: After text extraction we get the following text: 3, 8 5 EU R 1 Netto 38,50 EUR…
TobiasH
  • 83
  • 1
  • 8
6
votes
2 answers

Python pdftotext ShellError Using textract

When I run the below Python script on a directory that contains a PDF file, I keep getting this error: ShellError: The command pdftotext "path/to/pdf/title.pdf" - failed with exit code 1 ------------- stdout ------------- ------------- stderr…
Gohawks
  • 1,044
  • 3
  • 12
  • 26
6
votes
4 answers

Extract part of string between two different patterns

I am trying to use the stringr package to extract the part of a string that lies between two particular patterns. For example, I have: my.string <- "nanaqwertybaba" left.border <- "nana" right.border <- "baba" and by the use of str_extract(string, pattern)…
Marta Karas
  • 4,967
  • 10
  • 47
  • 77
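For comparison, here is a rough Python sketch of the same task, not the stringr solution the question asks about. It reuses the sample string and borders from the excerpt; the helper name `extract_between` is hypothetical.

```python
import re


def extract_between(text, left, right):
    """Return the text between the first `left` and the next `right`, or None."""
    # re.escape treats the borders as literal strings, not regex syntax.
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


result = extract_between("nanaqwertybaba", "nana", "baba")
print(result)
```

The lazy `.*?` stops at the first right border after the left one; a greedy `.*` would run to the last occurrence instead.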
5
votes
2 answers

select HTML text element with regex?

I want to look for © in an HTML document, and basically get the entity the copyright is attributed to. The copyright line shows up a couple of different ways:

© 2011 The New York Times Company

or
tarayani
  • 193
  • 1
  • 9