Questions tagged [tm]

The `tm` package (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

source: http://tm.r-forge.r-project.org/

tm - Text Mining Package

The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the use of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to make it easier to work with large, metadata-enriched document sets.
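
As a rough illustration of the document and metadata management described above, here is a minimal sketch. The example texts, the author name, and the `topic` tag are made up for illustration; only `VCorpus`, `VectorSource`, and `meta` come from tm. It uses the in-memory `VCorpus`; the database-backed variant is not shown.

```r
library(tm)

# Hypothetical example texts; any character vector should work here.
texts <- c("Text mining with R.", "The tm package manages corpora.")

# VCorpus keeps all documents in memory.
corpus <- VCorpus(VectorSource(texts))

# Document-level metadata: set a tag on one document and read it back.
meta(corpus[[1]], "author") <- "Jane Doe"   # made-up author
author <- meta(corpus[[1]], "author")

# Collection-level (indexed) metadata: one value per document.
meta(corpus, "topic") <- c("intro", "package")  # made-up tag and values
topics <- meta(corpus, "topic")
```

Indexed metadata is returned as a data frame with one row per document, which can be convenient for selecting subsets of a corpus.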

The package provides native support for reading in several classic file formats (e.g. plain text, PDFs, or XML files). There is also a plug-in mechanism to handle additional file formats.

The data structures and algorithms can be extended to fit custom demands, since the package is designed in a modular way to enable easy integration of new file formats, readers, transformations and filter operations.
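
One common way to plug in a custom transformation is to wrap an ordinary function with `content_transformer` and pass it to `tm_map`. The sketch below assumes a hypothetical helper, `removeURLs`, and made-up sample texts; it is one possible approach, not the only extension mechanism the package offers.

```r
library(tm)

# Made-up sample documents, one of which contains a URL.
corpus <- VCorpus(VectorSource(c("Visit http://example.com today",
                                 "No links here")))

# Hypothetical custom transformation: delete anything that starts with "http"
# up to the next whitespace character.
removeURLs <- content_transformer(function(x) gsub("http[^[:space:]]*", "", x))

corpus <- tm_map(corpus, removeURLs)
# Collapse the double space left behind by the removal.
corpus <- tm_map(corpus, stripWhitespace)

first_doc <- content(corpus[[1]])
second_doc <- content(corpus[[2]])
```

Wrapping the function with `content_transformer` keeps the result a proper text document, so later steps such as building a term-document matrix still work.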

tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, or stopword deletion. Furthermore, a generic filter architecture is available to filter documents by given criteria or to perform full-text search. The package supports exporting document collections to term-document matrices.
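
The following sketch strings these pieces together: preprocessing, filtering, and export to a document-term matrix. The three sample sentences and the "cat" filter are invented for illustration. Stemming via `stemDocument` is left out because it needs the additional SnowballC package.

```r
library(tm)

# Made-up sample documents.
docs <- c("The cat sat on the mat.",
          "The dog chased the cat!",
          "Birds are pets too.")
corpus <- VCorpus(VectorSource(docs))

# Typical preprocessing chain.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Filter: keep only documents whose text mentions "cat" (illustrative criterion).
has_cat <- tm_filter(corpus, FUN = function(doc) any(grepl("cat", content(doc))))

# Export the whole collection to a document-term matrix.
dtm <- DocumentTermMatrix(corpus)

# Terms that occur at least twice across the collection.
frequent <- findFreqTerms(dtm, lowfreq = 2)
```

`TermDocumentMatrix(corpus)` would give the transposed layout, with terms as rows and documents as columns.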

tm is freely available under the GNU General Public License (GPL).

1083 questions
-1
votes
1 answer

R: topicmodels, 2 similar documents, code works with one, doesn't with the other

I have a quite strange error occurring when I run my topic model code. Basically I have a .csv file with user comments. I want to create a dtm with each comment being one document. I took a sample of 8k comments and used the following code on it: …
Andres
-1
votes
1 answer

Cleaning accent in text twitter

I am working on text mining with Spanish tweets. My problem is that I have the same words written in different ways (with and without accents), for example: accion, acción. I tried to use encoding: unicode "UTF-8", but it doesn't work. my…
Rodrigo_BC
-1
votes
1 answer

Removal of Phrase using wildcards

I'm looking for how to use wildcard characters as part of the removal criteria for a section of a corpus. I was unable to find anything on SO or Google related to this issue. Purpose: Analyzing large dataset of standardized notes where employee…
-1
votes
1 answer

R tm TermDocumentMatrix based on a sparse matrix

I have a collection of books in txt format and want to apply some procedures of the tm R library to them. However, I prefer to clean the texts in bash rather than in R because it is much faster. Suppose I am able to get from bash a data.frame such…
Felipe Gerard
-1
votes
1 answer

Error in using "termFreq" function in R

I built a corpus in R using the tm package. I want to change the frequency bounds and keep only the words that are repeated at least 4 times in the entire document. After that, I need to build a document-term matrix based on these…
user36729
-1
votes
1 answer

How to read .doc files into R

So for a bit of weekend fun, I decided I was going to try and read a Microsoft Word .doc file into R. Specifically I have a .doc file version of the PDF below: http://www.queensu.ca/rarc/services/ASDAssessmentTemplate/AAA/AQ_Scoring_Key.pdf What I…
googleplex101
-1
votes
2 answers

Error using "in", "if", as a column name in R

Just ran into this problem. I was using a data frame with several thousand columns created out of words and word splits. One of my columns ended up with the name "in", another with "if". When one tries to do something like data$in, there is an…
Dr VComas
-1
votes
1 answer

Test dtm1 on the basis of dtm, so that one can predict the categories of dtm1

library functions library(tm) library(e1071) library(plyr) Inserting the journal names to be categorized sample = c( "An Inductive Inference Machine", "Computing Machinery and Intelligence", "On the translation of…
-2
votes
1 answer

(R) "Text Mining" how to see the detail information in <>?

I have just started learning about text mining. Following the book, I used tm::inspect() to see the first information in the data "crude", but unlike the example in that book, R showed me the following instead of the detailed information like the book…
Till
-2
votes
1 answer

How to get the word frequency and corresponding words in R

I am working on a text mining project and I have created a sparse matrix in R using the tm package. The data is in the format below: Sample Data format I want it in the format below: Resultant Data Format Need help with data wrangling.
-2
votes
1 answer

Loop for a string

This code will be used to count the number of links in my tweet collection. The collection is gathered from 10 accounts. The question is, how could I loop through the ten accounts in one piece of code and put the output in a table or graph? "Unames" is…
-3
votes
1 answer

Why isn't this valid Java (tm)?

I have a problem when I try to install CPN Tools. It shows the warning message "the installer could not find a valid java(tm) on this machine. supported versions: Vendor: Any min.1.6 max.any". What does this mean? How can I solve this problem?
engalma
-3
votes
1 answer

Arrange the words of the Document Term Matrix by frequency in R

I'm sorry for the new question, but I'm a newbie in text mining and need advice from the pros. Now, after long torment with content_transformer, I have a clean corpus. The next question: 1. How do I select from `dtm` the words with small frequencies, so that the…
fenton
-3
votes
1 answer

Remove characters from alphanumeric column in R?

I am looking for code to remove the characters from an alphanumeric vector of a data frame. This is my data column below: F9667968CU 67968PX11 3666SP 6SPF10 2323DL1 23DVL10 2016PP07 And this is the code I have used: for(i in 1:…
Monsta
-3
votes
3 answers

R: remove specific words in a text, like: the, this

    txt <- readLines("this.txt")
    library(tm)
    corpus <- Corpus(VectorSource(txt))
    corpus <- tm_map(corpus, removePunctuation)
    tdm <- TermDocumentMatrix(corpus)
    m <- as.matrix(tdm)
    d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))
Asma Souzii