Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language of the text. Approaches range from regular expressions and classifiers to more complex or custom models.
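
As a rough illustration of the regular-expression end of that range, here is a minimal Python sketch that pulls an e-mail address and standalone numbers out of a free-form string. The patterns and the `extract_fields` helper are hypothetical and deliberately simple; they are assumptions made for this example, not a robust e-mail or phone-number parser.

```python
import re

# Hypothetical, deliberately simple patterns used only for illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\b\d+\b")


def extract_fields(text):
    """Return the e-mail addresses and standalone numbers found in text."""
    emails = EMAIL_RE.findall(text)
    # Blank out e-mails first so digits inside an address are not
    # reported as separate numbers.
    remainder = EMAIL_RE.sub(" ", text)
    numbers = NUMBER_RE.findall(remainder)
    return {"emails": emails, "numbers": numbers}


record = extract_fields("joe bloggs joeblog@live.com 12345")
print(record)
```

Regular expressions like these suit inputs with a predictable surface form. For noisier text, the classifier or custom-model approaches mentioned above are usually a better fit.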

1282 questions
4
votes
1 answer

Is it possible to read tweet-text of a tweet URL without twitter API?

I am using Goose to read the title/text-body of an article from a URL. However, this does not work with a Twitter URL, I guess due to the different HTML tag structure. Is there a way to read the tweet text from such a link? One such example of a…
utengr
  • 3,225
  • 3
  • 29
  • 68
4
votes
4 answers

Extract one or more qualifying substrings from each string in an array

I am trying to extract qualifying substrings from an array of strings. Some strings in the array have just one qualifying substring, but others may have more. I need to build a flat array of all of these wanted values. The following is my current…
YVS1102
  • 2,658
  • 5
  • 34
  • 63
4
votes
6 answers

HTML downloading and text extraction

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the download file names and threading would be a bonus. The platform is Linux.
Cammel
  • 2,653
  • 2
  • 16
  • 5
4
votes
1 answer

Text extraction from PDF using PDFBox 2.0.2 missing class PDFTextStripper()

I've implemented a simple text extraction method using PDFBox 1.8.10 in Java. For various reasons I have to upgrade the library to PDFBox 2.0.2. PDFTextStripper() has probably been removed or moved to another package in the new version. Is there any way to…
bcakmak
  • 41
  • 2
4
votes
2 answers

Isolate leading portion of string before first hyphen and omit any trailing spaces from match

I have my working code which extracts the title from a string, but right now it still isn't very flexible. Current code: $post_title = "THIS IS A TEST - 10-01-2010 - HELLO WORLD (OKAY)!!"; $post_title = substr($post_title, 0, strpos($post_title,…
Arthor
  • 666
  • 2
  • 13
  • 40
4
votes
1 answer

Java - Text Extraction from PDF using OCR

I have a PDF file (part of it is given below) and want to extract text from it. I have used PDFTextStream, but it doesn't work with this file. (It did work with another file that contains simple text.) What other OCR libraries are capable of…
Dax Amin
  • 497
  • 2
  • 5
  • 13
4
votes
0 answers

Extract data from pdf

Please don't mark as duplicate. I have already been through many Stack Overflow links but they didn't solve my problem. What I'm trying to do: I have to extract data from around 1,50,000 PDF files. A sample PDF: All these PDFs are identical in…
Akshay Soam
  • 1,580
  • 3
  • 21
  • 39
4
votes
4 answers

garbage character at end of string?

Hi there, I'm reading a string, breaking it into words, and sorting them into name, email and phone number, using the string joe bloggs joeblog@live.com 12345. But once I break everything down, the individual separated variables which hold the name, email…
silent
  • 2,836
  • 10
  • 47
  • 73
4
votes
2 answers

itext: how to tweak text extraction?

I'm using iText 5.5.8 for Java. Following the default, straightforward text extraction procedures, i.e. PdfTextExtractor.getTextFromPage(reader, pageNumber) I was surprised to find several mistakes in the output, specifically all letter ds come…
4
votes
2 answers

Not able to read the exact text highlighted across the lines

I am working on reading the highlighted text from a PDF document using PDFBox. I was able to read highlighted text on a single line, for both single and multiple words. However, I could not read highlighted text that spans multiple lines. Please find the following…
user5342176
  • 101
  • 1
  • 9
4
votes
4 answers

Extracting readable text from HTML using Python?

I know about utils like html2text, BeautifulSoup etc., but the issue is that they also extract JavaScript and add it to the text, making it tough to separate them. htmlDom = BeautifulSoup(webPage) htmlDom.findAll(text=True) Alternately, from…
demos
  • 2,630
  • 11
  • 35
  • 51
4
votes
5 answers

Get the last value in a comma-separated string

I have a string with numbers, stored in $numbers: 3,6,86,34,43,52 What's the easiest way to get the last value after the last comma? In this case the number 52 would be the last value, which I would like to store in a variable. The number can vary…
Filip Blaauw
  • 731
  • 2
  • 16
  • 29
4
votes
1 answer

Finding word combinations on domain names

I am a PHP novice and need some help finishing my script. I have a PHP script that can take all of the words from a domain name. I need the script to be able to find the most likely words that are the domain name's keywords. Here is my…
Daniel
  • 293
  • 4
  • 10
4
votes
5 answers

Scraping largest block of text from HTML document

I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text. For example, it would pick the div "content" in the following…
Max
  • 6,901
  • 7
  • 46
  • 61
4
votes
4 answers

Access Adobe InDesign files

I need some directions for the following problem: I have a lot of InDesign files and I have to set up a process that will track whether a certain paragraph or text block has changed between different versions of the file. If the text block has changed I…
PeterMmm
  • 24,152
  • 13
  • 73
  • 111