Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language involved. Approaches range from regular expressions to classifiers to more complex, custom models.

1282 questions
12
votes
1 answer

How Does a PDF Store Text

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), how is text stored by the PDF? I would hope that the…
Kenneth K.
  • 2,987
  • 1
  • 23
  • 30
11
votes
6 answers

Body Text extraction from websites e.g. extract only article heading and text not all text in site

I am looking for algorithms that allow text extraction from websites. I do not mean "strip html", or any of the hundreds of libraries that allow this. So for example for a news article I would like to identify the heading and all the text, but not…
Scoox
  • 181
  • 2
  • 11
11
votes
3 answers

Is there a way to get all text from the rendered page with JS?

Is there an (unobtrusive, to the user) way to get all the text in a page with JavaScript? I could get the HTML, parse it, remove all tags, etc., but I'm wondering if there's a way to get the text from the already rendered page. To clarify, I don't…
Stavros Korokithakis
  • 4,680
  • 10
  • 35
  • 42
10
votes
4 answers

How to detect Text Area from image?

I want to detect the text area in an image as a preprocessing step for the Tesseract OCR engine. The engine works well when the input is text only, but when the input image contains non-text content it fails, so I want to detect only the text content in the image, any…
chostDevil
  • 1,041
  • 5
  • 17
  • 24
10
votes
1 answer

How to extract text with iTextSharp 4.1.6?

iTextSharp 4.1.6 is the last version licensed under the LGPL and is free to use for commercial purposes without paying license fees. It might be interesting for some, and for me, to know how to extract text with this version. Does anyone have an idea?
der_chirurg
  • 1,475
  • 2
  • 16
  • 26
9
votes
6 answers

Extract floating point numbers from a delimited string in PHP

I would like to convert a string of delimited dimension values into floating numbers. For example 152.15 x 12.34 x 11mm into 152.15, 12.34 and 11 and store in an array such that: $dim[0] = 152.15; $dim[1] = 12.34; $dim[2] = 11; I would also need…
Tian Bo
  • 551
  • 3
  • 7
  • 12
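
The question above asks about PHP; as a rough illustration of the same idea, here is a minimal Python sketch that pulls the numbers out of the sample string with a regular expression. The function name `extract_dimensions` and the pattern are assumptions made for this example, not part of the original question, and the pattern deliberately ignores signs and exponents.

```python
import re
from typing import List

# Matches an unsigned integer or decimal number such as "11" or "152.15".
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_dimensions(text: str) -> List[float]:
    """Return every number found in text, in order, as floats."""
    return [float(match) for match in NUMBER_PATTERN.findall(text)]


dim = extract_dimensions("152.15 x 12.34 x 11mm")
print(dim)  # [152.15, 12.34, 11.0]
```

Because the pattern only looks for digit runs, the delimiter (" x ") and the trailing unit ("mm") never need to be handled explicitly.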
9
votes
4 answers

Extracting whole words

I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there's plenty of regex ninjas around here, so hopefully…
orlade
  • 2,060
  • 4
  • 24
  • 35
9
votes
6 answers

Extract columns of text from a pdf file using iText

I need to extract text from pdf files using iText. The problem is: some pdf files contain 2 columns and when I extract text I get a text file where columns are merged as the result (i.e. text from both columns in the same line) this is the…
Rim
  • 185
  • 3
  • 3
  • 11
8
votes
5 answers

How to extract only person A's statements in a conversation between two persons A and B

I have a record of conversations between two arbitrary persons A and B. c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla" c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks…
Rami Al-Fahham
  • 617
  • 1
  • 6
  • 10
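
The question above uses R; the sketch below shows one possible approach in Python instead, using a lazy match with a lookahead that stops at the next speaker label. It assumes every turn starts with a label of the form "Person X:", and the helper name `statements_by` is hypothetical.

```python
import re
from typing import List


def statements_by(conversation: str, speaker: str = "Person A") -> List[str]:
    """Return the text of every turn spoken by `speaker`.

    Assumes each turn begins with a label such as "Person A:".
    """
    pattern = re.compile(
        # Capture lazily up to the next speaker label or the end of the string.
        re.escape(speaker) + r":\s*(.*?)(?=\s*Person \w+:|$)",
        re.DOTALL,
    )
    return [statement.strip() for statement in pattern.findall(conversation)]


c1 = ("Person A: blabla...something Person B: blabla something else "
      "Person A: OK blabla")
print(statements_by(c1))  # ['blabla...something', 'OK blabla']
```

Passing `speaker="Person B"` returns the other side of the conversation with the same pattern.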
8
votes
4 answers

Jsoup - extracting text

I need to extract text from a node like this:
Some text with tags might go here.

Also there are paragraphs

More text can go without paragraphs
And I need to build: Some text with tags might go…
Eugene Retunsky
  • 13,009
  • 4
  • 52
  • 55
7
votes
3 answers

Regular expression to match object dimensions

I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . . Imagine some sentences along the following lines: Hello blah blah. It's around 11…
Edwardr
  • 2,906
  • 3
  • 27
  • 30
7
votes
4 answers

How to extract text matching a regex using Vim?

I would like to extract some data from a piece of text with Vim. The input looks like so: 72" title="(168,72)" onmouseover="posizione('(168,72)');" onmouseout="posizione('(-,-)');">> 72" title="(180,72)" onmouseover="posizione('(180,72)');"…
nick2k3
  • 1,399
  • 8
  • 18
  • 25
7
votes
6 answers

How to use the Amazon Textract with PDF files

I can already use Textract, but only with JPEG files. I would like to use it with PDF files. I have the code below: import boto3 # Document documentName = "Path to document in JPEG" # Read document content with open(documentName, 'rb') as…
ArthurS
  • 137
  • 1
  • 2
  • 5
7
votes
4 answers

Check if two strings contain the same set of words in Python

I am trying to compare two sentences and see if they contain the same set of words. Eg: comparing "today is a good day" and "is today a good day" should return true I am using the Counter function from collections module right now from collections…
TheLastCoder
  • 610
  • 2
  • 6
  • 15
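
As a minimal sketch of the comparison described above: `collections.Counter` treats each sentence as a multiset of words, so repeated words must match in count, while `set` ignores repeats. The function name `same_words` and the `count_repeats` flag are assumptions for this example, and words are split on whitespace only, with no case folding or punctuation handling.

```python
from collections import Counter


def same_words(first: str, second: str, count_repeats: bool = True) -> bool:
    """Check whether two sentences contain the same words, in any order."""
    a, b = first.split(), second.split()
    if count_repeats:
        # Multiset comparison: each word must occur the same number of times.
        return Counter(a) == Counter(b)
    # Plain set comparison: only the distinct words matter.
    return set(a) == set(b)


print(same_words("today is a good day", "is today a good day"))  # True
print(same_words("good good day", "good day"))                   # False
print(same_words("good good day", "good day", count_repeats=False))  # True
```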
7
votes
1 answer

How to extract the contents of a table in pdf file?

I want to extract the contents of a table in a PDF like this: I wrote this Java program using the iText Java PDF library, which can read the contents of a PDF file line by line, but I do not know how to get the contents of the table import…
Bertrand
  • 341
  • 1
  • 2
  • 12