Questions tagged [text-mining]

Text Mining is a process of deriving high-quality information from unstructured (textual) information.

Possible applications of text mining include:

Comments in survey responses
Customer messages, emails, complaints, etc.
Investigating competitors by crawling their websites

Recognize PDF table using R

I'm trying to extract data from tables inside some PDF reports. I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to extract the tables. Is there a way to use R to recognize…

r text-mining pdf-scraping

asked May 23 '17 at 17:15

RCS

11 answers

How to determine the (natural) language of a document?

I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on that, the program has to decide which of the two languages the document is…

.net nlp text-mining

asked Sep 05 '09 at 14:50

Robert Petermeier

4 answers

Choose or generate canonical variant from multiple sentences

I'm working with an API that maps my GTIN/EAN queries to product data. Since the data returned originates from merchant product feeds, the following is almost universally the case:
Multiple results per GTIN
Products' titles are pretty much…

php text-mining information-extraction nlp

asked Jun 01 '12 at 20:25

vzwick

2 answers

Extract text after a symbol in R

sample1 = read.csv("pirate.csv")
sample1[,7]
[1] >>xyz>>hello>>mate 1
[2] >>xyz>>hello>>mate 2
[3] >>xyz>>mate 3
[4] >>xyz>>mate 4
[5] >>xyz>>hello>>mate 5
[6] >>xyz>>hello>>mate 6
I have to extract and create an array which contains all the words…

regex r text-mining extract

asked May 05 '16 at 12:59

Looper

2 answers

Use R to convert PDF files to text files for text mining

I have nearly one thousand PDF journal articles in a folder. I need to text-mine the abstracts of all the articles in the folder. Now I am doing the following:
dest <- "~/A1.pdf"
# set path to pdftotxt.exe and convert pdf to text
exe <-…

r text-mining tm pdftotext

asked Jan 30 '14 at 00:33

S Das

6 answers

list of word frequencies using R

I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same:
library(tm)
library(RWeka)
txt <- read.csv("HW.csv",header=T)
df <- do.call("rbind",…

r text-mining word-frequency term-document-matrix

asked Aug 07 '13 at 10:30

ProcRJ

3 answers

Better text documents clustering than tf/idf and cosine similarity?

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster that talks about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity but I found that the results are…
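
For readers unfamiliar with the baseline this question mentions, here is a minimal sketch of tf/idf weighting combined with cosine similarity, written in plain Python. It is not the asker's code: the whitespace tokenizer, the smoothed idf formula, and the names `tokenize`, `tfidf_vectors` and `cosine` are all assumptions made for illustration, and a real tweet pipeline would likely need better tokenization.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; real tweets need more care.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf (an assumed variant) so no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (cnt / len(toks)) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


tweets = [
    "new phone battery lasts long",
    "phone battery dies fast",
    "great goal in the football match",
]
vecs = tfidf_vectors(tweets)
# The two phone tweets share terms; the football tweet shares none with the first.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

An online clusterer of the kind described would typically compare each incoming tweet's vector against cluster centroids using `cosine` and assign it to the closest one above some threshold; that part is left out here.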

machine-learning data-mining cluster-analysis text-mining

asked Jul 08 '13 at 23:40

Jack Twain

4 answers

Counting syllables

I'm looking to assign some different readability scores to text in R, such as the Flesch-Kincaid. Does anyone know of a way to segment words into syllables using R? I don't necessarily need the syllable segments themselves but a count. So for…

r text-mining

asked Dec 17 '11 at 23:36

Tyler Rinker

4 answers

Best clustering algorithm? (simply explained)

Imagine the following problem:
You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do…

algorithm text cluster-analysis data-mining text-mining

asked May 12 '09 at 14:38

caw

3 answers

Emoticons in Twitter Sentiment Analysis in r

How do I handle/get rid of emoticons so that I can sort tweets for sentiment analysis? Getting: Error in sort.list(y) : invalid input. Thanks, and this is how the emoticons come out looking from Twitter and into…

r text-mining iconv sentiment-analysis

asked Apr 01 '13 at 17:25

Rhodo

1 answer

How to create a good NER training model in OpenNLP?

I have just started with OpenNLP. I need to create a simple training model to recognize named entities. Reading the doc here https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind I see this simple text to train the…

java nlp text-mining opennlp named-entity-recognition

asked Aug 14 '15 at 13:43

Dail

5 answers

Can stop-words be found automatically?

In NLP, stop-word removal is a typical pre-processing step, and it is typically done in an empirical way based on what we think stop-words should be. But in my opinion, we should generalize the concept of stop-words. And the stop-words could vary…

machine-learning nlp data-mining text-mining

asked Mar 13 '14 at 05:52

smwikipedia

3 answers

Row sum for large term-document matrix / simple_triplet_matrix ?? {tm package}

So I have a very large term-document matrix:
> class(ph.DTM)
[1] "TermDocumentMatrix" "simple_triplet_matrix"
> ph.DTM
A term-document matrix (109996 terms, 262811 documents)
Non-/sparse entries: 3705693/28904453063
Sparsity :…

r text-mining

asked Feb 20 '14 at 22:50

Ray

6 answers

Adding custom stopwords in R tm

I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords: tm_map(abs, removeWords, stopwords("english")). Is there a way to add my own custom stop words to this list?

r text-mining stop-words corpus tm

asked Aug 26 '13 at 14:22

Brian

2 answers

how can one increase size of plotted area wordclouds in R

Trying to replicate the example here: http://onertipaday.blogspot.com/2011/07/word-cloud-in-r.html. Need help figuring out how to increase the plotted area of the word cloud. Changing the height and width parameters in png("wordcloud_packages.png",…

r text-mining tag-cloud word-cloud

asked Feb 12 '12 at 00:59

sgt pepper
