Questions tagged [text-mining]

Text Mining is a process of deriving high-quality information from unstructured (textual) information.

Possible applications of text mining include:

  • Comments in survey responses
  • Customer messages, emails, complaints, etc.
  • Investigating competitors by crawling their web sites

2607 questions
24
votes
4 answers

Recognize PDF table using R

I'm trying to extract data from tables inside some PDF reports. I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to extract the tables. Is there a way to use R to recognize…
RCS
  • 263
  • 1
  • 2
  • 9
23
votes
11 answers

How to determine the (natural) language of a document?

I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on that, the program has to decide which of the two languages the document is…
Robert Petermeier
  • 4,122
  • 4
  • 29
  • 37
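
One common way to approach the English-versus-German decision described above is to count hits against a small stop-word list per language. The following is only a minimal sketch in base R: the function name `guess_language` and both word lists are hypothetical and deliberately tiny, and a real classifier would want longer lists or character n-gram statistics.

```r
# Hypothetical, deliberately tiny stop-word lists (assumption: real lists
# would be much longer).
english_stopwords <- c("the", "and", "is", "of", "to", "in", "that", "it")
german_stopwords  <- c("der", "die", "das", "und", "ist", "nicht", "ein", "zu")

guess_language <- function(text) {
  # Lower-case the text and split on anything that is not a letter.
  words <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  words <- words[nzchar(words)]

  # Count how many tokens appear in each stop-word list.
  english_hits <- sum(words %in% english_stopwords)
  german_hits  <- sum(words %in% german_stopwords)

  if (english_hits > german_hits) {
    "english"
  } else if (german_hits > english_hits) {
    "german"
  } else {
    "undecided"
  }
}

guess_language("The cat is in the house and it is asleep.")
guess_language("Der Hund ist nicht in dem Haus und das ist gut.")
```

Returning "undecided" on a tie keeps very short or mixed documents from being forced into one class.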
23
votes
4 answers

Choose or generate canonical variant from multiple sentences

I'm working with an API that maps my GTIN/EAN queries to product data. Since the data returned originates from merchant product feeds, the following is almost universally the case:

  • Multiple results per GTIN
  • Products' titles are pretty much…
vzwick
  • 11,008
  • 5
  • 43
  • 63
22
votes
2 answers

Extract text after a symbol in R

sample1 = read.csv("pirate.csv")
sample1[,7]
[1] >>xyz>>hello>>mate 1
[2] >>xyz>>hello>>mate 2
[3] >>xyz>>mate 3
[4] >>xyz>>mate 4
[5] >>xyz>>hello>>mate 5
[6] >>xyz>>hello>>mate 6

I have to extract and create an array which contains all the words…
Looper
  • 295
  • 2
  • 3
  • 10
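
Since the excerpt is cut off, the exact target is not known. The sketch below assumes the wanted piece is the text after the final `>>`, and also shows how to get every piece. The vector `x` holds two values copied from the printed output in the question; the variable names are hypothetical.

```r
# Two sample values copied from the question's printed output.
x <- c(">>xyz>>hello>>mate 1", ">>xyz>>mate 3")

# Assumption: the wanted text is whatever follows the final ">>".
# The greedy ".*" consumes everything up to and including the last ">>".
after_last <- sub(".*>>", "", x)

# Alternative: split on the literal separator to get every piece.
# The first element of each result is "" because each string starts with ">>".
pieces <- strsplit(x, ">>", fixed = TRUE)

after_last
pieces
```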
21
votes
2 answers

Use R to convert PDF files to text files for text mining

I have nearly one thousand PDF journal articles in a folder. I need to text mine all the articles' abstracts from the whole folder. Now I am doing the following:

dest <- "~/A1.pdf"
# set path to pdftotxt.exe and convert pdf to text
exe <-…
S Das
  • 3,291
  • 6
  • 26
  • 41
21
votes
6 answers

list of word frequencies using R

I have been using the tm package to run some text analysis. My problem is with creating a list of words and their associated frequencies.

library(tm)
library(RWeka)
txt <- read.csv("HW.csv",header=T)
df <- do.call("rbind",…
ProcRJ
  • 211
  • 1
  • 2
  • 3
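
For comparison, a word-frequency list can be built with base R alone. This is a minimal sketch and not the tm-based approach from the question; the two sample sentences and all variable names are hypothetical.

```r
# Hypothetical sample text standing in for the question's CSV column.
txt <- c("the cat sat on the mat", "the dog sat")

# Lower-case, split on non-letters, and drop empty tokens.
words <- unlist(strsplit(tolower(txt), "[^[:alpha:]]+"))
words <- words[nzchar(words)]

# table() counts each distinct word; sort() puts the most frequent first.
freq <- sort(table(words), decreasing = TRUE)

# The same information as a two-column data frame.
freq_df <- data.frame(word = names(freq),
                      freq = as.integer(freq),
                      stringsAsFactors = FALSE)
freq_df
```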
21
votes
3 answers

Better text documents clustering than tf/idf and cosine similarity?

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are…
Jack Twain
  • 6,273
  • 15
  • 67
  • 107
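
For reference, the tf/idf plus cosine similarity baseline mentioned in the question can be sketched in base R as follows. This only illustrates the baseline, not an improvement on it; the three sample documents and all variable names are hypothetical, and the idf variant used here is the plain log(N / document frequency).

```r
# Hypothetical sample documents standing in for tweets.
docs <- c("cats chase mice", "dogs chase cats", "stocks fell sharply")

tokens <- strsplit(tolower(docs), "[^[:alpha:]]+")
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per term.
tf <- t(vapply(tokens,
               function(tk) as.numeric(table(factor(tk, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: log(N / number of documents containing the term).
idf   <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between every pair of documents.
norms  <- sqrt(rowSums(tfidf^2))
cosine <- (tfidf %*% t(tfidf)) / (norms %o% norms)
cosine
```

Documents that share no terms get a similarity of exactly zero, which is one reason short texts such as tweets tend to cluster poorly under this representation.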
19
votes
4 answers

Counting syllables

I'm looking to assign some different readability scores to text in R, such as the Flesch-Kincaid. Does anyone know of a way to segment words into syllables using R? I don't necessarily need the syllable segments themselves but a count. so for…
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519
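
A rough way to get a syllable count without segmenting is to count runs of vowels. The sketch below is a heuristic under that assumption, with a hypothetical function name `count_syllables`; it over-counts words with a silent final "e" (for example "make") and is not a substitute for a pronunciation dictionary.

```r
count_syllables <- function(words) {
  words <- tolower(words)
  # Each run of consecutive vowels (including "y") is treated as one syllable.
  matches <- gregexpr("[aeiouy]+", words)
  # gregexpr() returns -1 when there is no match, so count positive positions only.
  counts <- vapply(matches, function(m) sum(m > 0), integer(1))
  # Every word is given at least one syllable.
  pmax(counts, 1L)
}

count_syllables(c("cat", "window", "banana", "rhythm"))
```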
19
votes
4 answers

Best clustering algorithm? (simply explained)

Imagine the following problem:

  • You have a database containing about 20,000 texts in a table called "articles"
  • You want to connect the related ones using a clustering algorithm in order to display related articles together
  • The algorithm should do…
caw
  • 30,999
  • 61
  • 181
  • 291
19
votes
3 answers

Emoticons in Twitter Sentiment Analysis in R

How do I handle/get rid of emoticons so that I can sort tweets for sentiment analysis? Getting: Error in sort.list(y) : invalid input. Thanks, and this is how the emoticons come out looking from Twitter and into…
Rhodo
  • 1,234
  • 4
  • 19
  • 35
17
votes
1 answer

How to create a good NER training model in OpenNLP?

I have just started with OpenNLP. I need to create a simple training model to recognize named entities. Reading the doc here https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind I see this simple text to train the…
Dail
  • 4,622
  • 16
  • 74
  • 109
17
votes
5 answers

Can stop-words be found automatically?

In NLP, stop-words removal is a typical pre-processing step. And it is typically done in an empirical way based on what we think stop-words should be. But in my opinion, we should generalize the concept of stop-words. And the stop-words could vary…
smwikipedia
  • 61,609
  • 92
  • 309
  • 482
17
votes
3 answers

Row sum for large term-document matrix / simple_triplet_matrix ?? {tm package}

So I have a very large term-document matrix:

> class(ph.DTM)
[1] "TermDocumentMatrix" "simple_triplet_matrix"
> ph.DTM
A term-document matrix (109996 terms, 262811 documents)
Non-/sparse entries: 3705693/28904453063
Sparsity :…
Ray
  • 3,137
  • 8
  • 32
  • 59
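
Because the object above has class "simple_triplet_matrix", the slam package's `row_sums()` can total each row without converting to a dense matrix. The following is a minimal sketch on a tiny hypothetical matrix `m`, assuming slam is installed; it does not use the question's `ph.DTM`.

```r
library(slam)

# Tiny hypothetical sparse matrix: 2 terms (rows) by 3 documents (columns).
# Only the non-zero cells are stored, as (row, column, value) triplets.
m <- simple_triplet_matrix(i = c(1L, 1L, 2L),
                           j = c(1L, 3L, 2L),
                           v = c(2, 1, 4),
                           nrow = 2, ncol = 3)

# row_sums() works on the sparse representation directly.
term_totals <- row_sums(m)
term_totals
```

The dense equivalent, `rowSums(as.matrix(m))`, would need to allocate every cell, which is what becomes impractical at the dimensions quoted in the question.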
17
votes
6 answers

Adding custom stopwords in R tm

I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords:

tm_map(abs, removeWords, stopwords("english"))

Is there a way to add my own custom stop words to this list?
Brian
  • 7,098
  • 15
  • 56
  • 73
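
Since `stopwords("english")` returns a plain character vector, one straightforward option is to append custom words with `c()` and pass the combined vector to `removeWords`. This is a minimal sketch assuming the tm package is installed; the sample corpus `docs` and the extra words are hypothetical.

```r
library(tm)

# Hypothetical two-document corpus standing in for the question's corpus.
docs <- VCorpus(VectorSource(c("this is a sample abstract about pirates",
                               "another abstract about ships")))

# Built-in English stop words plus custom additions (hypothetical choices).
my_stopwords <- c(stopwords("english"), "abstract", "sample")

# Same call shape as in the question, with the extended list.
docs <- tm_map(docs, removeWords, my_stopwords)

content(docs[[1]])
```

Note that `removeWords` leaves the surrounding whitespace in place, so a follow-up `tm_map(docs, stripWhitespace)` is often applied afterwards.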
16
votes
2 answers

How can one increase the size of the plotted area of word clouds in R

Trying to replicate the example here: http://onertipaday.blogspot.com/2011/07/word-cloud-in-r.html Need help figuring out how to increase the plotted area of the word cloud. Changing the height and width parameters in png("wordcloud_packages.png",…
sgt pepper
  • 267
  • 2
  • 4
  • 9