Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction techniques vary with the context and the language of the source text. Approaches range from regular expressions to classifiers to more complex or custom models.
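
As a rough illustration of the simplest approach mentioned above, here is a minimal regular-expression sketch in Python. The sample sentence, the date pattern, and the field names are all assumptions made up for this example; real documents usually need a more tolerant pattern.

```python
import re

# Hypothetical sample text; not taken from any question on this page.
text = "Invoice 1042 was issued on 2021-03-15 and paid on 2021-04-02."

# Named groups turn each match into a small structured record.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# One dictionary per date found in the unstructured text.
records = [m.groupdict() for m in date_pattern.finditer(text)]
print(records)
```

Each dictionary in `records` holds the year, month, and day of one date as strings. A classifier or custom model would replace the fixed pattern when the target information has no regular surface form.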

1282 questions
5
votes
1 answer

How to extract text from a two-column PDF using PDFPlumber

I am working on topic modeling tasks using Python and I would like to extract text from annual/sustainability reports. However, my problem is that when I try to extract a report, the extracted lines are broken across the two columns of a page…
5
votes
2 answers

Using Textract, how do you extract tables from a pdf file and output it into a csv file via .py script?

I want to use textract (via aws cli) to extract tables from a pdf file (located in an s3 location) and export it into a csv file. I have tried writing a .py script but am struggling to read from the file. Any suggestions for writing the .py script…
5
votes
5 answers

Extraction of text page by page from MS word docx file using python

I have a MS docx file and I need to extract text from it page-wise. I have tried with python-docx but it could extract the whole text but not pagewise. I have also converted my docx to pdf and then tried text extraction. The problem is, after…
AlfiyaFaisy
  • 314
  • 1
  • 3
  • 15
5
votes
1 answer

What do the keys in the dictionary output of the following Tesseract code signify?

I am using the following code in Python: I am getting the following keys in the dictionary: 'block_num', 'conf', 'level', 'line_num', 'page_num', 'par_num', 'text', 'top', 'width', 'word_num', 'height', 'left'. What do these keys signify I…
5
votes
0 answers

pdftotext get font information (font-family, style, size)

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: foo Is there a way to get font information for every word…
5
votes
0 answers

Extract spelled out numbers from string in R

I am trying to extract spelled-out numbers from strings, along with the word that comes after each number. I have managed to do this the laborious way, by writing my own code that includes the spelled-out numbers to search for (here an example from…
NelnewR
  • 131
  • 7
5
votes
3 answers

How to split sentences into correlated words (term extraction)?

Is there any NLP Python library that splits sentences or joins words into related pairs of words? For example: That is not bad example -> "That" "is" "not bad" "example" "Not bad" means the same as good so it would be useless to process it as "not"…
Ala Głowacka
  • 323
  • 2
  • 10
5
votes
1 answer

Can std::cin fail to pass a user input in the command line to a variable with a type of char?

I have tried passing different inputs with the below code, but have failed to get the message printed: "Oops, you did not enter an ASCII char, let alone one that is y or n!" I have entered various Unicode characters which are not of char type…
James Ray
  • 424
  • 3
  • 15
5
votes
2 answers

Extracting data from an email message (or several thousand emails) [Exchange based]

My marketing department, bless them, has decided to make a sweepstakes where people enter over a webpage. That is great but the information isn't stored to a DB of any sort but is sent to an exchange mail box as an email. Great. My challenge is to…
Craig
  • 11,614
  • 13
  • 44
  • 62
5
votes
2 answers

Extract folder name and filename from FilePath using scala

I have streams of files being read from a directory and the filetree is of the…
Taiwotman
  • 885
  • 14
  • 27
5
votes
1 answer

iText: Extracted text from pdf file using LocationTextExtractionStrategy is in wrong order

I am using iText to extract some text from a pdf file at a specific location. In order to do that I am using the LocationTextExtractionStrategy: public static void main(String[] args) throws Exception { PdfReader pdfReader = new…
Olivier Masseau
  • 778
  • 7
  • 23
5
votes
1 answer

extract characters in sequence matlab

I want to extract characters in a sequence. For example, given this image: Here's the code I wrote: [L Ne]=bwlabel(BinaryImage); stats=regionprops(L,'BoundingBox'); cc=vertcat(stats(:).BoundingBox); aa=cc(:,3); bb=cc(:,4); hold on figure for…
Nomi
  • 67
  • 5
5
votes
1 answer

imacros extraction from a range of data

Hi, here is what my page looks like
Beamer
Michal K
  • 245
  • 2
  • 9
  • 17
5
votes
4 answers

Extracting numbers from sentences

I need to extract some numbers from a text. Text is x <- "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295.…
user3973290
5
votes
8 answers

How to extract Heading tags in PHP from a string?

From a string that contains a lot of HTML, how can I extract all the text from the heading tags (h1, h2, etc.) into a new variable? I would like to capture all of the text from these elements and store them in a new variable as comma-delimited values. Is it…

bluedaniel
  • 2,079
  • 5
  • 31
  • 49