Questions tagged [tm]

The `tm` package (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

source: http://tm.r-forge.r-project.org/

tm - Text Mining Package

The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the use of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to make it easier to work with large, metadata-enriched document sets.

The package provides native support for reading in several classic file formats (e.g. plain text, PDFs, or XML files). There is also a plug-in mechanism to handle additional file formats.
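
As a rough illustration of the reader mechanism described above, the sketch below lists the readers and sources that tm registers and builds a small corpus from an in-memory character vector. The two sample strings are made up for this example; reading PDFs or XML would use `readPDF` or `readXML` with a suitable source instead and may need extra system tools or packages.

```r
library(tm)

# Readers and sources known to tm (names of functions such as readPlain)
available_readers <- getReaders()
available_sources <- getSources()

# Hypothetical sample documents, held in memory
docs <- c("First plain-text document.", "Second plain-text document.")

# Build a volatile (in-memory) corpus, naming the reader explicitly
corpus <- VCorpus(VectorSource(docs),
                  readerControl = list(reader = readPlain, language = "en"))

length(corpus)
```

For files on disk, `DirSource()` pointing at a directory would typically take the place of `VectorSource()`.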

The data structures and algorithms can be extended to fit custom demands, since the package is designed in a modular way to enable easy integration of new file formats, readers, transformations and filter operations.
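
One way this extensibility might look in practice is a custom transformation wrapped with `content_transformer()`. The sketch below is only an assumption about a typical use case: `removeURLs` is a hypothetical helper name, and the regular expression is a simple illustration rather than a complete URL matcher.

```r
library(tm)

# Hypothetical custom transformation: replace http/https URLs with a space
removeURLs <- content_transformer(function(x) {
  gsub("https?://\\S+", " ", x, perl = TRUE)
})

# Made-up sample document containing a URL
corpus <- VCorpus(VectorSource("See http://tm.r-forge.r-project.org/ for details."))

# Apply the custom transformation like any built-in one
corpus <- tm_map(corpus, removeURLs)

cleaned <- content(corpus[[1]])
```

Wrapping the function in `content_transformer()` keeps the document metadata intact while only the text content is modified.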

tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, and stopword deletion. In addition, a generic filter architecture is available for filtering documents by given criteria or performing full-text search. The package also supports exporting document collections to term-document matrices.
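
A minimal sketch of the preprocessing, filtering, and export steps mentioned above, assuming three made-up sample sentences. Stemming via `stemDocument` is left out here because it additionally requires the SnowballC package.

```r
library(tm)

# Made-up sample documents
docs <- c("The  quick brown fox   jumps over the lazy dog.",
          "A lazy dog sleeps all day.",
          "Foxes are quick and clever animals.")
corpus <- VCorpus(VectorSource(docs))

# Preprocessing: lower-case, drop punctuation and English stopwords,
# then collapse the whitespace left behind
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Filtering: keep only documents whose text mentions "dog"
dog_docs <- tm_filter(corpus, FUN = function(doc) {
  any(grepl("dog", content(doc), fixed = TRUE))
})

# Export: terms in rows, documents in columns
tdm <- TermDocumentMatrix(corpus)
dim(tdm)
```

`DocumentTermMatrix(corpus)` would give the transposed layout, with documents in rows, which is the shape many modelling functions expect.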

tm is freely available under the GNU General Public License (GPL).

Resources:

CRAN summary page
R-Forge project page
FAQ
Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008.

1083 questions

2 answers

R: Calculate cosine distance from a term-document matrix with tm and proxy

I want to calculate the cosine distance among authors of a corpus. Let's take a corpus of 20 documents. require(tm) data("crude") length(crude) # [1] 20 I want to find out the cosine distance (similarity) among these 20 documents. I create a…

asked Apr 20 '15 at 14:22

CptNemo

3 answers

R tm package used for predictive analytics. How one classifies a new document?

This is a general question about the procedures concerning text mining. Suppose one has a Corpus of documents classified as Spam/No_Spam. As standard procedure one pre-processes the data, removing punctuation, stop words, etc. After converting it into…

r tm

asked Apr 01 '13 at 20:22

Dr VComas

2 answers

R: add title to wordcloud graphics / png

I have some working R code that generates a tag cloud from a term-document matrix. Now I want to create a whole bunch of tag clouds from many documents, and to inspect them visually at a later time. To know which document(s)/corpus the tag-cloud…

r graphics tm word-cloud

asked Mar 05 '13 at 13:22

knb

7 answers

R break corpus into sentences

I have a number of PDF documents, which I have read into a corpus with library tm. How can one break the corpus into sentences? It can be done by reading the file with readLines followed by sentSplit from package qdap [*]. That function requires a…

r split tm sentence qdap

asked Sep 10 '13 at 07:24

Henk

1 answer

Make dataframe of top N frequent terms for multiple corpora using tm package in R

I have several TermDocumentMatrixs created with the tm package in R. I want to find the 10 most frequent terms in each set of documents to ultimately end up with an output table like: corpus1 corpus2 "beach" "city" "sand" "sidewalk" ... …

r text-mining corpus tm

asked Mar 19 '13 at 17:12

elfs

1 answer

Trying to get tf-idf weighting working in R

I am trying to do some very basic text analysis with the tm package and get some tf-idf scores; I'm running OS X (though I've tried this on Debian Squeeze with the same result); I've got a directory (which is my working directory) with a couple text…

r tm tf-idf text-analysis

asked Feb 11 '13 at 20:49

cforster

2 answers

R tm In mclapply(content(x), FUN, ...) : all scheduled cores encountered errors in user code

When I run the following code up to the penultimate line, I get Warning message: In mclapply(content(x), FUN, ...) : all scheduled cores encountered errors in user code When I run the final line, I get "Error in UseMethod(\"words\") : \n no…

r twitter rstudio tm mclapply

asked Jul 31 '14 at 22:11

Weijia

1 answer

R, tm-error of transformation drops documents

I want to create a network based on the weight of keywords from text. Then I got an error when running the codes related to tm_map: library (tm) library(NLP) lirary (openNLP) text = c('.......') corp <- Corpus(VectorSource(text)) corp <-…

r keyword tm extract

asked Aug 21 '18 at 06:25

Julie

3 answers

How to make R tm corpus of 100 million tweets?

I want to make a text corpus of 100 million tweets using R’s distributed computing tm package (called tm.plugin.dc). The tweets are stored in a large MySQL table on my laptop. My laptop is old, so I am using a Hadoop cluster that I set up on Amazon…

r hadoop amazon-ec2 hive tm

asked May 05 '13 at 19:53

user554481

1 answer

R RKEA - Not enough training instances with class labels (required: 1, provided: 0)!

I'm trying to get RKEA to work in R Studio. Here's my current code: #Imports packages library(RKEA) library(tm) #Creates a corpus of training sentences data <- c("This is a sentence", "I am in an office", "I'm working on a…

r keyword extract tm corpus

asked Oct 17 '17 at 14:01

peter337

3 answers

TermDocumentMatrix errors in R

I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create…

r text-mining tm corpus term-document-matrix

asked Aug 28 '14 at 14:36

Brian P

3 answers

Stemming with R Text Analysis

I am doing a lot of analysis with the TM package. One of my biggest problems is related to stemming and stemming-like transformations. Let's say I have several accounting related terms (I am aware of the spelling issues). After stemming we…

r text tm stemming

asked Jun 27 '14 at 03:16

RUser

4 answers

Finding ngrams in R and comparing ngrams across corpora

I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple…

r text-mining n-gram tm

asked Oct 27 '13 at 06:08

Markus D

2 answers

Removing non-English text from Corpus in R using tm()

I am using tm() and wordcloud() for some basic data-mining in R, but am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables). Let's say…

r tm

asked Aug 09 '13 at 18:41

roody

2 answers

How to recreate same DocumentTermMatrix with new (test) data

Suppose I have text based training data and testing data. To be more specific, I have two data sets - training and testing - and both of them have one column which contains text and is of interest for the job at hand. I used tm package in R to…

r machine-learning nlp text-mining tm

asked May 19 '13 at 01:30

Godel
