Highest Voted 'text-extraction' Questions

7 votes, 1 answer

Extract Text with its Font Details (Style, Size, Color, Italic, etc.) from a PDF in Python

I am looking to extract text with its font details (style, size, color, italic, etc.) from a PDF in Python. I need to extract the text and its metadata for translation purposes. Can anyone suggest any libraries for this?

asked Feb 21 '14 at 06:20

Udaya Kiran

71
1
3

7 votes, 3 answers

Not able to understand coordinates in a document extracted using the Tesseract OCR engine

I have extracted an image document with Tesseract, and the extraction was successful, but I am not able to understand the coordinates in the extracted document. Problem description: it shows coordinates, but let me know whether these coordinates…

ocr tesseract text-extraction hocr

asked Aug 31 '13 at 16:38

S.P Singh

1,267
3
17
23
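For context on the coordinates this question asks about: in hOCR output, each element carries a `title` attribute containing `bbox x0 y0 x1 y1`, which are pixel coordinates measured from the top-left corner of the page image. A minimal sketch of reading them with the Python standard library might look like the following. The sample markup and the `parse_bboxes` helper are illustrative assumptions, not taken from the question.

```python
import re

# Matches the bbox property inside an hOCR title attribute,
# e.g. title='bbox 36 92 96 116; x_wconf 95'.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bboxes(hocr: str) -> list:
    """Return an (x0, y0, x1, y1) tuple for every bbox found in the hOCR text."""
    return [tuple(int(v) for v in m.groups()) for m in BBOX_RE.finditer(hocr)]


# Hypothetical hOCR fragment, made up for illustration.
sample = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 95'>Hello</span>"
)

for x0, y0, x1, y1 in parse_bboxes(sample):
    # x0/y0 is the top-left corner, x1/y1 the bottom-right corner.
    print(f"left={x0} top={y0} width={x1 - x0} height={y1 - y0}")
```

A regular expression is enough here because only the `bbox` property is needed; a real pipeline would more likely walk the document with an HTML parser.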

7 votes, 3 answers

Is there a way to use readability and python to extract just text, not HTML?

I need to extract pure text from a random web page at runtime, on the server side. I use Google App Engine and a Python port of Readability. There are a number of those: an early version by gfxmonk, based on BeautifulSoup; a version by minvolai based on…

python readability text-extraction html-content-extraction

asked Jun 22 '12 at 06:15

Michael Kariv

1,421
13
20

7 votes, 2 answers

Rule-based PDF text extraction for various bills and invoices

I have to extract text from PDF files of invoices and bills. The file layouts can get complex, though they are mostly filled with tables. I've already read a few dozen articles about the PDF format, how easy it is for our brains to grasp it and how hard it…

pdf text-extraction

asked Apr 17 '12 at 10:05

Guy Gavriely

11,228
6
27
42

6 votes, 6 answers

Using boilerpipe to extract non-English articles

I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for texts in English, but for text with special characters, for example, words with accent marks (história), these special characters are not…

java html text-extraction

asked Feb 13 '12 at 11:51

pedro_silva

143
2
6

6 votes, 4 answers

Is there a boilerpipe port for .NET?

Does anybody know of a .NET port of the boilerpipe library?

c# .net text-extraction html-content-extraction boilerpipe

asked Jan 02 '12 at 20:42

aogan

2,241
1
15
24

6 votes, 2 answers

How to extract text from table in image?

I have data in the form of a structured table image. The data is like below: I tried to extract the text from this image using this code: import pytesseract from PIL import Image value=Image.open("data/pic_table3.png") text =…

python ocr tesseract text-extraction python-tesseract

asked Dec 17 '19 at 08:55

Afianh

118
1
6

6 votes, 0 answers

How to skip the characters causing UnicodeDecodeError when using textract, like errors="replace"?

I am trying to convert all readable text in a PDF file into a string using textract. It works for most of the files, but for some it raises UnicodeDecodeError. I want to skip the problematic characters. I have tried to find a way to solve it with…

python pdf text-extraction

asked Oct 25 '19 at 11:51

Aaron Clifton

105
9
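The `errors="replace"` in this question's title refers to Python's codec error handlers. As a rough illustration of what those handlers do, here is a minimal sketch using only the standard library. The sample bytes are made up, and whether textract exposes a comparable option is not something this listing establishes.

```python
# Bytes that are not valid UTF-8: 0xff can never appear in UTF-8 text.
raw = b"price: 38,50 \xff EUR"

# errors="replace" substitutes U+FFFD (the replacement character)
# for each undecodable byte instead of raising UnicodeDecodeError.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops undecodable bytes.
ignored = raw.decode("utf-8", errors="ignore")

print(replaced)
print(ignored)
```

Of the two handlers, `"replace"` keeps a visible marker where data was lost, which tends to make later debugging easier than `"ignore"`.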

6 votes, 3 answers

Extract pdf text within bounding box directly into python

I'm trying to extract the text of a pdf within a given bounding rectangle. I understand there are tools for pdf scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all 3, and so far I've only gotten code for pdftotext to extract…

python pdf text-extraction pypdf pdfminer

asked Apr 09 '19 at 00:26

Evan Mata

500
1
6
19
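Since the question mentions getting pdftotext to work, one possible approach is to call the `pdftotext` command-line tool with its crop options (`-x`, `-y`, `-W`, `-H`) and read the result from stdout. The sketch below is an assumption about how that could be wired up, not the asker's code. The helper names, the file name, and the rectangle values are hypothetical, and the crop values are interpreted by pdftotext according to its resolution setting.

```python
import subprocess


def pdftotext_region_cmd(pdf_path, x, y, width, height, page=1):
    """Build a pdftotext command that crops to a rectangle on one page."""
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),      # first and last page to convert
        "-x", str(x), "-y", str(y),            # top-left corner of the crop area
        "-W", str(width), "-H", str(height),   # size of the crop area
        pdf_path,
        "-",                                   # write the text to stdout
    ]


def extract_region(pdf_path, x, y, width, height, page=1):
    """Run pdftotext (must be installed) and return the cropped text."""
    cmd = pdftotext_region_cmd(pdf_path, x, y, width, height, page)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Only the command is printed here, so the sketch runs without pdftotext.
print(pdftotext_region_cmd("invoice.pdf", 50, 100, 200, 80))
```

Calling `extract_region("invoice.pdf", 50, 100, 200, 80)` would return the text as a Python string without writing an intermediate file, provided the tool is on the PATH.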

6 votes, 1 answer

Keyword/keyphrase extraction from text

I am working on a project where I need to extract "technology related keywords/keyphrases" from text. For example, my text is: "ABC Inc has been working on a project related to machine learning which makes use of the existing libraries for finding…

machine-learning nlp text-mining jnlp text-extraction

asked Mar 13 '18 at 18:28

Surbhi Singh

101
1
4

6 votes, 5 answers

How to install textract in python3

sudo python3 -m pip install textract sudo apt-get install textract pip install textract sudo apt-get install swig I want to install textract for Python 3, but it does not install properly; it gives the following error: x86_64-linux-gnu-gcc -pthread…

python-3.5 text-extraction

asked Nov 25 '17 at 06:30

Jay Pratap Pandey

352
2
9
19

6 votes, 2 answers

Apache PDFBox Remove Spaces between characters

We are using PDFBox to extract text from PDFs. The text of some PDFs can't be extracted correctly. The following image shows part of the PDF: After text extraction we get the following text: 3, 8 5 EU R 1 Netto 38,50 EUR…

pdfbox text-extraction pdf-parsing

asked Apr 10 '15 at 06:01

TobiasH

83
1
8

6 votes, 2 answers

Python pdftotext ShellError Using textract

When I run the below Python script on a directory that contains a PDF file, I keep getting this error: ShellError: The command pdftotext "path/to/pdf/title.pdf" - failed with exit code 1 ------------- stdout ------------- ------------- stderr…

python pdf text-extraction

asked Apr 08 '15 at 17:01

Gohawks

1,044
3
12
26

6 votes, 4 answers

Extract part of string between two different patterns

I am trying to use the stringr package to extract the part of a string that lies between two particular patterns. For example, I have: my.string <- "nanaqwertybaba" left.border <- "nana" right.border <- "baba" and by the use of str_extract(string, pattern)…

regex r text-extraction stringr

asked Apr 07 '14 at 22:21

Marta Karas

4,967
10
47
77
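The question is about R's stringr, but the underlying technique (a lookbehind for the left border and a lookahead for the right one) carries over to other regex engines. Here is a minimal sketch of the same idea in Python's `re` module, reusing the example values from the excerpt. The variable names are adapted to Python style, and this is not the accepted R answer.

```python
import re

my_string = "nanaqwertybaba"
left_border = "nana"
right_border = "baba"

# The lookbehind and lookahead assert the borders without consuming them,
# so only the text in between is matched. re.escape guards against
# regex metacharacters appearing in the borders.
pattern = f"(?<={re.escape(left_border)}).*?(?={re.escape(right_border)})"

match = re.search(pattern, my_string)
middle = match.group(0) if match else None
print(middle)  # qwerty
```

Python's lookbehind requires a fixed-width pattern, which holds here because the left border is a literal string.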

5 votes, 2 answers

Select HTML text element with regex?

I want to look for © in an HTML document and get the entity the copyright is attributed to. The copyright line shows up in a couple of different ways:

or

javascript jquery regex html-parsing text-extraction

asked Oct 30 '11 at 18:38

tarayani

193
1
9

