Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language in use. Approaches range from regular expressions and classifiers to more complex or custom models.
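
As a minimal illustration of the regular-expression end of that spectrum, the sketch below pulls dates and e-mail addresses out of free text. The sample sentence, both patterns, and the `extract_fields` name are assumptions made up for this example; real documents usually need more forgiving patterns.

```python
import re

# Deliberately simple patterns; both are illustrative assumptions, not
# complete grammars for dates or e-mail addresses.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_fields(text):
    """Return a dict of structured fields found in unstructured text."""
    return {
        "dates": DATE_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
    }


sample = "Invoice sent on 2012-07-31 to billing@example.com, due 2012-08-30."
print(extract_fields(sample))
```

A classifier or custom model would replace the two patterns, but the shape of the result (named fields pulled out of raw text) stays the same.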

More Info

Information extraction on Wikipedia

1282 questions

13 answers

Getting URL parameter in java and extract a specific text from that URL

I have a URL and I need to get the value of v from this URL. Here is my URL: http://www.youtube.com/watch?v=_RCIP6OrQrE How can I do that?

java url text-extraction

asked Jul 31 '12 at 05:14

codezoner
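
One way to approach this question, sketched with Python's standard library rather than the Java the asker is using, is to parse the URL and read the query string instead of slicing the text by hand. The `get_query_param` helper is a hypothetical name introduced here.

```python
from urllib.parse import urlparse, parse_qs


def get_query_param(url, name):
    """Return the first value of query parameter `name`, or None if absent."""
    query = urlparse(url).query
    values = parse_qs(query).get(name)
    return values[0] if values else None


url = "http://www.youtube.com/watch?v=_RCIP6OrQrE"
print(get_query_param(url, "v"))
```

Parsing the query string keeps the lookup correct when the URL carries additional parameters or lists them in a different order.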

5 answers

How to extract all regex matches in a file using Vim?

Consider the following example: case Foo: ... break; case Bar: ... break; case More: case Complex: ... break: ... Say, we would like to retrieve all matches of the regex case \([^:]*\): (the whole matching text or, even…

regex vim match text-extraction

asked Jan 31 '12 at 12:33

Wernight

11 answers

regular expression to extract text from HTML

I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove any HTML tags, any JavaScript, and any CSS styles. Is there a regular expression (one or more) that will achieve that?

html regex html-content-extraction text-extraction

asked Oct 08 '08 at 01:43

Ron Harlev
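
Regular expressions tend to be fragile on general HTML, so the sketch below swaps in a different technique: Python's standard `html.parser` module, collecting text nodes while skipping the contents of `script` and `style` elements. The `TextExtractor` class and `html_to_text` function are hypothetical names, and the sketch assumes `script` and `style` tags are properly closed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Whitespace handling is crude: each text node is trimmed and the
        # nodes are later joined with single spaces.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Title</h1><p>Some text.</p></body></html>"
)
print(html_to_text(page))
```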

10 answers

How to extract text from reasonably sane HTML?

My question is sort of like this question, but I have more constraints: I know the documents are reasonably sane; they are very regular (they all came from the same source); I want about 99% of the visible text; about 99% of what is viable at all is…

c# html d text-extraction

asked Jan 21 '10 at 23:03

BCS

8 answers

Text Extraction from HTML Java

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that is in between the paragraph tags, but I can only get one line of the paragraph. My code…

java html screen-scraping html-content-extraction text-extraction

asked Sep 06 '09 at 16:52

MajorMajor

8 answers

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of…

html html-content-extraction text-extraction

asked Dec 26 '09 at 01:22

Charles Stewart

2 answers

tag generation from a small text content (such as tweets)

I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user tweets to generate tags (keywords). And it seems like the accepted suggestion (point-wise mutual information…

twitter nlp text-extraction nltk text-analysis

asked May 04 '10 at 09:20

Hellnar

7 answers

PDFminer: extract text with its font information

I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information. I want to use PDFminer as a library, and I found this question, but they are…

python text-extraction pdfminer

asked Jan 05 '16 at 07:33

aristotll

3 answers

How to extract text under specific headings from a pdf?

I want to extract text under specific headings from a PDF using Python. For example, I have a PDF with the headings Introduction, Summary, and Contents. I need to extract only the text under the heading 'Summary'. How can I do this?

python-2.7 pdf document text-extraction pdf-extraction

asked Jan 05 '18 at 05:19

AlfiyaFaisy

2 answers

List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example 'and' 123 times, 'to' 100 times, 'for' 90 times,…

python machine-learning scikit-learn text-extraction countvectorizer

asked Apr 18 '13 at 08:27

user1506145

7 answers

Get last whole number in a string

I need to isolate the latest occurring integer in a string containing multiple integers. How can I get 23 instead of 1 for $lastnum1? $text = "1 out of 23"; $lastnum1 = $this->getEval(eregi_replace("[^* out of]", '', $text));

php regex string integer text-extraction

asked Sep 25 '12 at 19:08

yan
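
The underlying idea, shown here in Python rather than PHP, is to match every run of digits and keep the last one instead of deleting the characters that are not wanted. `last_integer` is a hypothetical helper name.

```python
import re


def last_integer(text):
    """Return the last run of digits in `text` as an int, or None if none."""
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None


print(last_integer("1 out of 23"))
```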

6 answers

Extract filename with extension from filepath string

I am looking to get the filename from the end of a filepath string, say $text = "bob/hello/myfile.zip"; I want to be able to obtain the file name, which I guess would involve getting everything after the last slash as a substring. Can anyone help…

php substring filenames filepath text-extraction

asked Jun 30 '10 at 14:02

Dori

6 answers

How do I extract lines from a file using their line number on unix?

Using sed or similar how would you extract lines from a file? If I wanted lines 1, 5, 1010, 20503 from a file, how would I get these 4 lines? What if I have a fairly large number of lines I need to extract? If I had a file with 100 lines, each…

unix sed awk line-numbers text-extraction

asked Jan 06 '10 at 23:06

monkeyking
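
For comparison with a sed or awk one-liner, here is a rough Python sketch that makes a single pass over the input and keeps only the requested 1-based line numbers, so the whole file is never held in memory. The `extract_lines` name and the in-memory sample are assumptions for illustration.

```python
import io


def extract_lines(lines, wanted):
    """Yield the lines whose 1-based number is in `wanted`, in file order."""
    wanted = set(wanted)
    if not wanted:
        return
    last = max(wanted)
    for number, line in enumerate(lines, start=1):
        if number in wanted:
            yield line.rstrip("\n")
        if number >= last:
            break  # nothing left to find, so stop reading early


# An in-memory stand-in for a real file opened with open(path).
sample = io.StringIO("".join(f"line {i}\n" for i in range(1, 101)))
print(list(extract_lines(sample, [1, 5, 42])))
```

Holding the wanted numbers in a set keeps each membership check cheap even when the list of line numbers is large.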

6 answers

Extracting webpage information based on a template in Java

Right now I use Jsoup to extract certain information (not all the text) from some third-party webpages; I do it periodically. This works fine until the HTML of a certain webpage changes; this change leads to a change in the existing Java code, this is…

java text-extraction named-entity-extraction

asked Mar 04 '13 at 12:45

vikasing

4 answers

Extract All Unique Lines

I have text files with repeated exact lines of text, but I only want one of each. Imagine this text file: AAAAA AAAAA AAAAA BB BBBBB BBBBB CCC CCC CCC. I would only need the following four lines from it: AAAAA BB BBBBB CCC. I'm using a text editor…

regex text-extraction

asked Jul 14 '14 at 10:46

Agos FS
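
Outside a text editor, the same result can be sketched in a few lines of Python by keeping the first occurrence of each line. The sample assumes the file holds one item per line, as the flattened excerpt suggests, and `unique_lines` is a hypothetical name.

```python
def unique_lines(text):
    """Return the distinct lines of `text`, keeping first-seen order."""
    # dict preserves insertion order (Python 3.7+), so its keys act as an
    # ordered set of the lines.
    return list(dict.fromkeys(text.splitlines()))


sample = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC\n"
print("\n".join(unique_lines(sample)))
```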
