Questions tagged [tm]

The `tm` package (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

source: http://tm.r-forge.r-project.org/

tm - Text Mining Package

The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the use of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to ease the handling of large, metadata-enriched document sets.

The package provides native support for reading in several classic file formats (e.g. plain text, PDFs, or XML files). There is also a plug-in mechanism to handle additional file formats.

The data structures and algorithms can be extended to fit custom demands, since the package is designed in a modular way to enable easy integration of new file formats, readers, transformations and filter operations.

tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, or stopword deletion. Furthermore, a generic filter architecture is available to filter documents by certain criteria or to perform full-text search. The package supports exporting document collections to term-document matrices.
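As a rough illustration of the preprocessing and export steps described above, here is a minimal sketch. The three toy documents are made up for this example, and the choice and order of transformations is just one reasonable assumption, not a recommended pipeline. Stemming is left out because `stemDocument()` additionally needs the SnowballC package.

```r
library(tm)

# Toy documents, invented purely for illustration
docs <- c("The quick brown fox   jumps over the lazy dog.",
          "Text mining in R is quick and easy!",
          "The dog sleeps; the fox runs.")

# Build an in-memory corpus from the character vector
corpus <- VCorpus(VectorSource(docs))

# Base R functions such as tolower must be wrapped in content_transformer()
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Export the cleaned corpus to a term-document matrix (terms in rows)
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

With the default control settings, `TermDocumentMatrix()` drops terms shorter than three characters, so a one-letter token such as "r" will not appear among the terms.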

tm is freely available under the GNU General Public License (GPL).

1083 questions
16 votes, 2 answers

R: Calculate cosine distance from a term-document matrix with tm and proxy

I want to calculate the cosine distance among authors of a corpus. Let's take a corpus of 20 documents: `require(tm)`, `data("crude")`, `length(crude) # [1] 20`. I want to find out the cosine distance (similarity) among these 20 documents. I create a…
CptNemo
16 votes, 3 answers

R tm package used for predictive analytics. How one classifies a new document?

This is a general question about the procedures concerning text mining. Suppose one has a Corpus of documents classified as Spam/No_Spam. As standard procedure, one pre-processes the data, removing punctuation, stop words, etc. After converting it into…
Dr VComas
16 votes, 2 answers

R: add title to wordcloud graphics / png

I have some working R code that generates a tag cloud from a term-document matrix. Now I want to create a whole bunch of tag clouds from many documents, and to inspect them visually at a later time. To know which document(s)/corpus the tag-cloud…
knb
15 votes, 7 answers

R break corpus into sentences

I have a number of PDF documents, which I have read into a corpus with library tm. How can one break the corpus into sentences? It can be done by reading the file with readLines followed by sentSplit from package qdap [*]. That function requires a…
Henk
15 votes, 1 answer

Make dataframe of top N frequent terms for multiple corpora using tm package in R

I have several TermDocumentMatrix objects created with the tm package in R. I want to find the 10 most frequent terms in each set of documents to ultimately end up with an output table like: corpus1 corpus2 "beach" "city" "sand" "sidewalk" ... …
elfs
15 votes, 1 answer

Trying to get tf-idf weighting working in R

I am trying to do some very basic text analysis with the tm package and get some tf-idf scores; I'm running OS X (though I've tried this on Debian Squeeze with the same result); I've got a directory (which is my working directory) with a couple text…
cforster
14 votes, 2 answers

R tm In mclapply(content(x), FUN, ...) : all scheduled cores encountered errors in user code

When I run the following code up to the penultimate line, I get: Warning message: In mclapply(content(x), FUN, ...) : all scheduled cores encountered errors in user code. When I run the final line, I get "Error in UseMethod(\"words\") : \n no…
Weijia
13 votes, 1 answer

R, tm-error of transformation drops documents

I want to create a network based on the weight of keywords from text. Then I got an error when running the code related to tm_map: library(tm) library(NLP) library(openNLP) text = c('.......') corp <- Corpus(VectorSource(text)) corp <-…
Julie
13 votes, 3 answers

How to make R tm corpus of 100 million tweets?

I want to make a text corpus of 100 million tweets using R’s distributed computing tm package (called tm.plugin.dc). The tweets are stored in a large MySQL table on my laptop. My laptop is old, so I am using a Hadoop cluster that I set up on Amazon…
user554481
12 votes, 1 answer

R RKEA - Not enough training instances with class labels (required: 1, provided: 0)!

I'm trying to get RKEA to work in R Studio. Here's my current code: #Imports packages library(RKEA) library(tm) #Creates a corpus of training sentences data <- c("This is a sentence", "I am in an office", "I'm working on a…
peter337
12 votes, 3 answers

TermDocumentMatrix errors in R

I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create…
Brian P
12 votes, 3 answers

Stemming with R Text Analysis

I am doing a lot of analysis with the tm package. One of my biggest problems is related to stemming and stemming-like transformations. Let's say I have several accounting-related terms (I am aware of the spelling issues). After stemming we…
RUser
12 votes, 4 answers

Finding ngrams in R and comparing ngrams across corpora

I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple…
Markus D
12 votes, 2 answers

Removing non-English text from Corpus in R using tm()

I am using tm() and wordcloud() for some basic data mining in R, but am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables). Let's say…
roody
12 votes, 2 answers

How to recreate same DocumentTermMatrix with new (test) data

Suppose I have text-based training data and testing data. To be more specific, I have two data sets, training and testing, and both of them have one column which contains text and is of interest for the job at hand. I used the tm package in R to…
Godel