Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language involved. Approaches range from regular expressions to classifiers to more complex or custom models.
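As a minimal sketch of the simplest approach mentioned above, a regular expression can pull structured values out of free text. The sample string, the pattern, and the function name `extract_bracketed_numbers` are illustrative assumptions, not part of any particular library or question.

```python
import re

def extract_bracketed_numbers(s):
    """Return the integers that appear inside [...] or (...) in s.

    Illustrative helper (hypothetical name): the pattern has two
    alternatives, so re.findall yields 2-tuples in which exactly
    one element is non-empty.
    """
    pairs = re.findall(r"\[(\d+)\]|\((\d+)\)", s)
    return [int(a or b) for a, b in pairs]

# Assumed sample input for demonstration only.
text = "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325)."
print(extract_bracketed_numbers(text))
```

Regular expressions work well when the target has a fixed surface form; once the pattern depends on meaning rather than shape, a classifier or a custom model is usually the better fit.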

More Info

Information extraction on Wikipedia

1282 questions

1 answer

How to extract text from a two-column PDF using PDFPlumber

I am working on topic modeling tasks using python and I would like to extract texts from annual/sustainability reports. However my problem is, when I tried to extract the report, the extracted lines are broken between two different columns in a page…

asked Aug 25 '21 at 08:04

Ramachandran Ravishankar

2 answers

Using Textract, how do you extract tables from a pdf file and output it into a csv file via .py script?

I want to use textract (via aws cli) to extract tables from a pdf file (located in an s3 location) and export it into a csv file. I have tried writing a .py script but am struggling to read from the file. Any suggestions for writing the .py script…

python amazon-web-services text-extraction amazon-textract

asked Oct 13 '20 at 17:18

Chris You

5 answers

Extraction of text page by page from MS word docx file using python

I have a MS docx file and I need to extract text from it page-wise. I have tried with python-docx but it could extract the whole text but not pagewise. I have also converted my docx to pdf and then tried text extraction. The problem is, after…

python python-3.x document extract text-extraction

asked Dec 18 '19 at 04:53

AlfiyaFaisy

1 answer

What does the key values of the dictionary output of the following code in tesseract signify?

I am using the following code in python: I am getting the following key values in the dictionary: 'block_num', 'conf', 'level', 'line_num', 'page_num', 'par_num', 'text', 'top', 'width', 'word_num', 'height', 'left'. What do these key values signify I…

python-3.x tesseract text-extraction python-tesseract

asked Jun 21 '19 at 07:38

Mayank Kumar

0 answers

pdftotext get font information (font-family, style, size)

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: foo Is there a way to get font information for every word…

text-extraction pdftotext poppler pdf-scraping xpdf

asked May 06 '18 at 11:23

James Kroning

0 answers

Extract spelled out numbers from string in R

I am trying to extract spelled-out numbers from strings, plus extracting the word that comes after the number. I have managed to do this by a laboursome way of writing my own code including the spelled-out numbers to search for (here an example from…

r numbers text-extraction

asked Mar 13 '18 at 08:27

NelnewR

3 answers

How to split sentences into correlated words (term extraction)?

Is there any NLP python library that split sentence or joins words into related pairs of words? For example: That is not bad example -> "That" "is" "not bad" "example" "Not bad" means the same as good so it would be useless to process it as "not"…

python nlp nltk sentiment-analysis text-extraction

asked Feb 21 '18 at 18:51

Ala Głowacka

1 answer

Can std::cin fail to pass a user input in the command line to a variable with a type of char?

I have tried passing different inputs with the below code, but have failed to get the message printed: "Oops, you did not enter an ASCII char, let alone one that is y or n!" I have entered various Unicode characters which are not of char type…

c++ unicode cin non-ascii-characters text-extraction

asked Apr 09 '17 at 12:18

James Ray

2 answers

Extracting data from an email message (or several thousand emails) [Exchange based]

My marketing department, bless them, has decided to make a sweepstakes where people enter over a webpage. That is great but the information isn't stored to a DB of any sort but is sent to an exchange mail box as an email. Great. My challenge is to…

exchange-server text-extraction

asked Dec 30 '08 at 00:05

Craig

2 answers

Extract folder name and filename from FilePath using scala

I have streams of files being read from a directory and the filetree is of the…

scala feature-extraction text-extraction

asked Apr 07 '16 at 12:48

Taiwotman

1 answer

iText: Extracted text from pdf file using LocationTextExtractionStrategy is in wrong order

I am using iText to extract some text from a pdf file at a specific location. In order to do that I am using the LocationTextExtractionStrategy: public static void main(String[] args) throws Exception { PdfReader pdfReader = new…

pdf itext text-extraction

asked Feb 11 '16 at 16:37

Olivier Masseau

1 answer

extract characters in sequence matlab

I want to extract characters in a sequence. For example, given this image: Here's the code I wrote: [L Ne]=bwlabel(BinaryImage); stats=regionprops(L,'BoundingBox'); cc=vertcat(stats(:).BoundingBox); aa=cc(:,3); bb=cc(:,4); hold on figure for…

image-processing extract text-extraction

asked Aug 25 '15 at 15:46

Nomi

1 answer

imacros extraction from a range of data

Hi here is how my page looks like: Beamer menu1 menu2 menu3 menu4

imacros extract text-extraction data-extraction

asked Jul 03 '15 at 04:27

Michal K

4 answers

Extracting numbers from sentences

I need to extract some numbers from a text. Text is x <- "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295.…

regex r string text-extraction

asked Aug 24 '14 at 18:20

user3973290

8 answers

How to extract Heading tags in PHP from a string?

From a string that contains a lot of HTML, how can I extract all the text from the heading tags into a new variable? I would like to capture all of the text from these elements and store them in a new variable as comma-delimited values. Is it…

php text-extraction domparser

asked Jan 14 '10 at 14:31

bluedaniel
