Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language involved. Approaches range from regular expressions to classifiers to more complex or custom models.

1282 questions
40
votes
13 answers

Getting URL parameter in java and extract a specific text from that URL

I have a URL and I need to get the value of v from this URL. Here is my URL: http://www.youtube.com/watch?v=_RCIP6OrQrE How can I do that?
codezoner
  • 1,054
  • 2
  • 13
  • 32
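
The question asks about Java, but the underlying idea is language-independent: parse the URL, then parse its query string, rather than slicing the raw string. A minimal Python sketch of that idea, using only the standard library, could look like this. The helper name `get_query_param` is hypothetical and not taken from the question.

```python
from urllib.parse import urlparse, parse_qs


def get_query_param(url, name):
    """Return the first value of query parameter `name`, or None if it is absent."""
    query = urlparse(url).query          # e.g. "v=_RCIP6OrQrE"
    values = parse_qs(query).get(name)   # parse_qs maps each name to a list of values
    return values[0] if values else None


print(get_query_param("http://www.youtube.com/watch?v=_RCIP6OrQrE", "v"))
# prints: _RCIP6OrQrE
```

Parsing the query string this way also handles URLs where `v` is not the first parameter.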
22
votes
5 answers

How to extract all regex matches in a file using Vim?

Consider the following example: case Foo: ... break; case Bar: ... break; case More: case Complex: ... break: ... Say, we would like to retrieve all matches of the regex case \([^:]*\): (the whole matching text or, even…
Wernight
  • 36,122
  • 25
  • 118
  • 131
22
votes
11 answers

regular expression to extract text from HTML

I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove: any HTML tags; any JavaScript; any CSS styles. Is there a regular expression (one or more) that will achieve that?
Ron Harlev
  • 16,227
  • 24
  • 89
  • 132
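
The question asks for a regular expression; the sketch below deliberately swaps in a different technique, an HTML parser, because it is usually more robust against nested or malformed markup. It is a rough Python illustration using the standard-library `html.parser` module. The names `TextExtractor` and `html_to_text` are hypothetical, and joining the text fragments with single spaces is an assumption about the desired output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a script or style element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = ("<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><p>Hello <b>world</b></p></body></html>")
print(html_to_text(page))
# prints: Hello world
```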
21
votes
10 answers

How to extract text from reasonably sane HTML?

My question is sort of like this question, but I have more constraints: I know the documents are reasonably sane; they are very regular (they all came from the same source); I want about 99% of the visible text; about 99% of what is viable at all is…
BCS
  • 75,627
  • 68
  • 187
  • 294
19
votes
8 answers

Text Extraction from HTML Java

I'm working on a program that downloads HTML pages, then selects some of the information and writes it to another file. I want to extract the information that is in between the paragraph tags, but I can only get one line of the paragraph. My code…
MajorMajor
18
votes
8 answers

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of…
Charles Stewart
  • 11,661
  • 4
  • 46
  • 85
17
votes
2 answers

tag generation from a small text content (such as tweets)

I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user tweets to generate tags (keywords). And it seems like the accepted suggestion (point-wise mutual information…
Hellnar
  • 62,315
  • 79
  • 204
  • 279
16
votes
7 answers

PDFminer: extract text with its font information

I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information. I want to use PDFminer as a library, and I found this question, but they are…
aristotll
  • 8,694
  • 6
  • 33
  • 53
14
votes
3 answers

How to extract text under specific headings from a pdf?

I want to extract text under specific headings from a PDF using Python. For example, I have a PDF with the headings Introduction, Summary, Contents. I need to extract only the text under the heading 'Summary'. How can I do this?
AlfiyaFaisy
  • 314
  • 1
  • 3
  • 15
14
votes
2 answers

List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example 'and' 123 times, 'to' 100 times, 'for' 90 times,…
14
votes
7 answers

Get last whole number in a string

I need to isolate the latest occurring integer in a string containing multiple integers. How can I get 23 instead of 1 for $lastnum1? $text = "1 out of 23"; $lastnum1 = $this->getEval(eregi_replace("[^* out of]", '', $text));
yan
  • 480
  • 1
  • 11
  • 21
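
The question uses PHP; the same approach, collecting every run of digits and keeping the last one, can be sketched in Python as follows. The function name `last_integer` is hypothetical, and returning `None` when the string contains no digits is an assumption.

```python
import re


def last_integer(text):
    """Return the last run of digits in `text` as an int, or None if there is none."""
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None


print(last_integer("1 out of 23"))
# prints: 23
```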
13
votes
6 answers

Extract filename with extension from filepath string

I am looking to get the filename from the end of a filepath string, say $text = "bob/hello/myfile.zip"; I want to be able to obtain the file name, which I guess would involve getting everything after the last slash as a substring. Can anyone help…
Dori
  • 18,283
  • 17
  • 74
  • 116
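
The question uses PHP; the idea it describes, taking everything after the last slash, can be sketched in Python like this. The function name `filename_from_path` is hypothetical, and the sketch assumes `/` is the only path separator.

```python
def filename_from_path(path):
    """Return the part of `path` after the last '/', or the whole string if there is no slash."""
    return path.rsplit("/", 1)[-1]


print(filename_from_path("bob/hello/myfile.zip"))
# prints: myfile.zip
```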
13
votes
6 answers

How do I extract lines from a file using their line number on unix?

Using sed or similar how would you extract lines from a file? If I wanted lines 1, 5, 1010, 20503 from a file, how would I get these 4 lines? What if I have a fairly large number of lines I need to extract? If I had a file with 100 lines, each…
monkeyking
  • 6,670
  • 24
  • 61
  • 81
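
The question asks about sed or similar Unix tools. As an illustration of the same task in a different tool, here is a small Python sketch that makes a single pass over the input and keeps only the requested 1-based line numbers. The function name `extract_lines` is hypothetical.

```python
def extract_lines(lines, wanted):
    """Return the lines whose 1-based line number is in `wanted`, in file order."""
    wanted = set(wanted)  # set membership keeps each lookup cheap for large requests
    return [line for number, line in enumerate(lines, start=1) if number in wanted]


sample = ["alpha", "beta", "gamma", "delta", "epsilon"]
print(extract_lines(sample, [1, 5]))
# prints: ['alpha', 'epsilon']
```

Because `lines` can be any iterable, an open file object can be passed directly, so the file does not need to be loaded into memory first.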
13
votes
6 answers

Extracting webpage information based on a template in Java

Right now I use Jsoup to extract certain information (not all the text) from some third-party webpages, and I do it periodically. This works fine until the HTML of a certain webpage changes; this change leads to a change in the existing Java code, this is…
vikasing
  • 11,562
  • 3
  • 25
  • 25
12
votes
4 answers

Extract All Unique Lines

I have text files with repeated exact lines of text, but I only want one of each. Imagine this text file: AAAAA AAAAA AAAAA BB BBBBB BBBBB CCC CCC CCC. I would only need the following four lines from it: AAAAA BB BBBBB CCC. I'm using a text editor…
Agos FS
  • 127
  • 1
  • 8
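
The question is about doing this in a text editor. As an illustration outside an editor, here is a minimal Python sketch that keeps the first occurrence of each line and preserves the original order. The function name `unique_lines` is hypothetical, and the sample assumes the text file in the question has one token per line.

```python
def unique_lines(lines):
    """Return `lines` with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


text = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC"
print(unique_lines(text.splitlines()))
# prints: ['AAAAA', 'BB', 'BBBBB', 'CCC']
```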