Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language involved. Approaches range from regular expressions to classifiers to more complex or custom models.
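As a rough illustration of the regular-expression end of that range, here is a minimal Python sketch that pulls four-digit years out of free text. The pattern, the function name `extract_years`, and the assumed year range (1000 to 2999) are illustrative choices, not part of the tag description.

```python
import re
from typing import List

# Assumption: a "year" is any standalone four-digit number from 1000 to 2999.
# The lookarounds keep the pattern from matching inside longer digit runs.
YEAR_PATTERN = re.compile(r"(?<!\d)[12]\d{3}(?!\d)")


def extract_years(text: str) -> List[int]:
    """Return every four-digit year found in text, in order of appearance."""
    return [int(match.group()) for match in YEAR_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(extract_years("Acquired by the University in 1988"))
```

A pattern like this is easy to write and fast to run, but it knows nothing about context; once the target cannot be described by its surface form alone, a classifier or a custom model is usually the better fit.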

1282 questions
5
votes
3 answers

How to get the number of results found for a keyword in Google

I need to supply a keyword like "blue metal kettle" (with/without quotes) and get only the number of results found for this search. If I search without quotes right now, I get: Results 1 - 10 of about 1,040,000 for blue metal kettle. (0.19…
Ali
  • 261,656
  • 265
  • 575
  • 769
4
votes
3 answers

What is the fastest way to convert PDF to JPG image?

I am trying to convert multiple pdfs (10k +) to jpg images and extract text from them. I am currently using the pdf2image python library but it is rather slow, is there any faster/fastest library than this? from pdf2image import…
4
votes
1 answer

Efficiently extract the highlighted portion from PDFs using PyMuPDF python?

I have a use case where I have to highlight table from PDF document and then extract the highlighted part using python. Once it is highlighted, I have to transform the extracted part to a dataframe such that the dataframe should look like this: name…
technophile_3
  • 531
  • 6
  • 21
4
votes
2 answers

How to extract text from this compressed PDF/A?

For machine learning purposes (scikit-learn), I need to extract the raw text from lots of PDF files. First off, I was using xpdf pdftotext to do this task: exe = r'"'+os.path.join(xpdf_path,"pdftotext.exe")+'"' cmd = exe+" "+"\""+pdf+"\""+"…
celsowm
  • 846
  • 9
  • 34
  • 59
4
votes
4 answers

Extract only numbers and only string from pandas dataframe

I am trying to extract only numbers and only strings in two different dataframes. I am using regular expression to extract numbers and string. import pandas as pd df_num = pd.DataFrame({ 'Colors': ['lila1.5', 'rosa2.5', 'gelb3.5', 'grün4',…
BC Smith
  • 727
  • 1
  • 7
  • 19
4
votes
8 answers

Extract 4-digit year value from a string

I have a year listed in my string $s = "Acquired by the University in 1988"; In practice, that could be anywhere in this single line string. How do I extract it using regex? I tried \d and that didn't work, it just came up with an error. I'm using…
Jason
  • 15,064
  • 15
  • 65
  • 105
4
votes
0 answers

How to remove header and footer while extracting multiple page PDF to Text using PDFminer?

I've successfully extracted text from multiple page PDFs, using PDFminer.six in Python, and converted it into a single string, but I would like to remove the header and footer of each page while extracting the PDF to text. So far similar questions…
Peter
  • 41
  • 1
  • 4
4
votes
2 answers

How to extract text between certain patterns using regular expression (RegEx)?

My text: 27/07/18, 12:02 PM - user_a: https://www.youtube.com/ Watch this 27/07/18, 12:15 PM - user_b: 27/07/18, 12:52 PM - user_b: Read this fully some text some text . some text 27/07/18, 12:56 PM - user_c: text .. Here I want to…
Kalsi
  • 579
  • 5
  • 13
4
votes
0 answers

Why can't I read pdf files using python textract?

I'm new to python. I'm using Pycharm 2018.2 and the latest version of Anaconda. My operating system is windows 10. After solving all problems with installing textract on windows 10, I got a positive installation result from the anaconda prompt.…
4
votes
2 answers

Is there any python package for extracting text nicely from PDFs in RTL-languages?

I've worked with famous python packages for PDF files, like PDFminer, PyMuPDF, PyPDF2 and more. But none of them can extract text correctly from PDF files which are written in right-to-left languages (Persian, Arabic). For example: import fitz doc =…
armiro
  • 93
  • 1
  • 3
  • 14
4
votes
1 answer

Why is it so hard to convert PDF to plain text?

I needed to convert some PDFs back to text. I tried many software and online tools, and the result was always mediocre. Why is it so difficult, technically speaking?
Demeter Purjon
  • 373
  • 1
  • 12
4
votes
1 answer

How to read PDF files which are in Asian languages (Chinese, Japanese, Thai, etc.) and store in a string in Python

I am using PyPDF2 to read PDF files in Python. While it works well for English and European languages (with alphabets as in English), the library fails to read Asian languages like Japanese and Chinese. I tried encode('utf-8'),…
4
votes
1 answer

Is there a CPAN module to extract the current level of content from an email

I'm looking for a module to do a best-effort attempt to extract the immediate level of content (ie discarding any quoted content and the signature block) from the plain text component of an email. We've already got some code that has a shot at it,…
Cebjyre
  • 6,552
  • 3
  • 32
  • 57
4
votes
4 answers

part text inside tags python

I have a semi structured .txt file. The file looks like this: blabla I want this blabla And this bla and this …
gd13
  • 55
  • 1
  • 7
4
votes
5 answers

What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)?

Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to. Something that works with C# or classic ASP (VBScript) would be ideal and I also need to be able to separate the pages from the PDF. This question had…
Mark Biek
  • 146,731
  • 54
  • 156
  • 201