Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language involved. Approaches range from regular expressions through classifiers to more complex or custom models.
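As a rough illustration of the regular-expression end of that range, here is a minimal Python sketch that pulls standalone 4-digit years out of free text. The helper name `extract_years`, the pattern, and the sample sentence are assumptions made for this example. They are not taken from any question on this page or from any library.

```python
import re
from typing import List

# Hypothetical pattern: a standalone 4-digit number starting with 1 or 2.
# The word boundaries (\b) keep it from matching inside longer digit runs.
YEAR_RE = re.compile(r"\b[12]\d{3}\b")


def extract_years(text: str) -> List[int]:
    """Return every standalone 4-digit year found in text, in order."""
    return [int(match) for match in YEAR_RE.findall(text)]


# Illustrative input only; "123456" is skipped because it is a longer digit run.
print(extract_years("Founded in 1988, renovated in 2004; catalog no. 123456."))
```

A pattern like this is cheap to write but brittle. It treats any such number as a year, and that limitation is usually what pushes a task toward classifiers or custom models.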

More Info

Information extraction on Wikipedia

1282 questions

3 answers

How to get the number of results found for a keyword in google

I need to supply a keyword like "blue metal kettle" (with/without quotes) and get only the number of results found for this search. If I search without quotes right now, I get: Results 1 - 10 of about 1,040,000 for blue metal kettle. (0.19…

asked Nov 27 '09 at 17:52

Ali

3 answers

what is fastest way to convert pdf to jpg image?

I am trying to convert multiple pdfs (10k +) to jpg images and extract text from them. I am currently using the pdf2image python library but it is rather slow, is there any faster/fastest library than this? from pdf2image import…

python imagemagick ghostscript text-extraction pdf2image

asked Aug 25 '22 at 05:07

Sahil Lohiya

1 answer

Efficiently extract the highlighted portion from PDFs using PyMuPDF python?

I have a use case where I have to highlight a table in a PDF document and then extract the highlighted part using Python. Once it is highlighted, I have to transform the extracted part into a dataframe such that the dataframe should look like this: name…

python pandas text-extraction pymupdf

asked Dec 07 '21 at 07:01

technophile_3

2 answers

How extract text from this compressed PDF/A?

For machine learning purposes (scikit-learn), I need to extract the raw text from lots of PDF files. First off, I was using xpdf pdftotext to do this task: exe = r'"'+os.path.join(xpdf_path,"pdftotext.exe")+'"' cmd = exe+" "+"\""+pdf+"\""+"…

python pdf compression text-extraction pdfa

asked May 16 '20 at 16:16

celsowm

4 answers

Extract only numbers and only string from pandas dataframe

I am trying to extract only numbers and only strings in two different dataframes. I am using regular expression to extract numbers and string. import pandas as pd df_num = pd.DataFrame({ 'Colors': ['lila1.5', 'rosa2.5', 'gelb3.5', 'grün4',…

python-3.x pandas dataframe data-science text-extraction

asked Feb 19 '20 at 08:14

BC Smith

8 answers

Extract 4-digit year value from a string

I have a year listed in my string $s = "Acquired by the University in 1988"; In practice, that could be anywhere in this single line string. How do I extract it using regex? I tried \d and that didn't work, it just came up with an error. I'm using…

php regex substring text-extraction

asked Apr 15 '11 at 01:56

Jason

0 answers

How to remove header and footer while extracting multiple page PDF to Text using PDFminer?

I've successfully extracted text from multiple-page PDFs, using PDFminer.six in Python, and converted it into a single string, but I would like to remove the header and footer of each page while extracting the PDF to text. So far similar questions…

python header footer text-extraction pdfminer

asked Feb 21 '19 at 15:04

Peter

2 answers

How to extract text between certain patterns using regular expression (RegEx)?

My text: 27/07/18, 12:02 PM - user_a: https://www.youtube.com/ Watch this 27/07/18, 12:15 PM - user_b: 27/07/18, 12:52 PM - user_b: Read this fully some text some text . some text 27/07/18, 12:56 PM - user_c: text .. Here I want to…

regex python-3.x text-extraction regex-greedy

asked Aug 31 '18 at 14:15

Kalsi

0 answers

Why can't I read pdf files using python textract?

I'm new to Python. I'm using PyCharm 2018.2 and the latest version of Anaconda. My operating system is Windows 10. After solving all problems with installing textract on Windows 10, I got a positive installation result from the Anaconda prompt.…

python anaconda text-extraction

asked Aug 18 '18 at 09:42

Richard Zelzner

2 answers

Is there any python package for extracting text nicely from PDFs in RTL-languages?

I've worked with famous python packages for PDF files, like PDFminer, PyMuPDF, PyPDF2 and more. But none of them can extract text correctly from PDF files which are written in right-to-left languages (Persian, Arabic). For example: import fitz doc =…

python pdf text-extraction text-alignment persian

asked Jul 25 '18 at 05:26

armiro

1 answer

Why is it so hard to convert PDF to plain text?

I needed to convert some PDFs back to text. I tried many software and online tools, and the result was always mediocre. Why is it so difficult, technically speaking?

pdf text-extraction

asked Jun 26 '18 at 05:04

Demeter Purjon

1 answer

How to read PDF files which are in asian languages (Chinese, Japanese, Thai, etc.) and store in a string in python

I am using PyPDF2 to read PDF files in Python. While it works well for English and European languages (with alphabets in English), the library fails to read Asian languages like Japanese and Chinese. I tried encode('utf-8'),…

python unicode nlp text-extraction pdf-reader

asked Jun 22 '18 at 10:08

Nikunj Agarwal

1 answer

Is there a CPAN module to extract the current level of content from an email

I'm looking for a module to make a best-effort attempt to extract the immediate level of content (i.e. discarding any quoted content and the signature block) from the plain text component of an email. We've already got some code that has a shot at it,…

perl cpan text-extraction

asked Feb 14 '11 at 08:29

Cebjyre

4 answers

part text inside tags python

I have a semi structured .txt file. The file looks like this: blabla I want this blabla And this bla and this …

python beautifulsoup text-extraction

asked Mar 31 '18 at 08:15

gd13

5 answers

What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)?

Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to. Something that works with C# or classic ASP (VBScript) would be ideal and I also need to be able to separate the pages from the PDF. This question had…

pdf text-extraction pdf-scraping

asked Sep 05 '08 at 20:55

Mark Biek
