Questions tagged [tm]

The `tm` package (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

source: http://tm.r-forge.r-project.org/

tm - Text Mining Package

The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the use of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. Advanced metadata management is implemented for collections of text documents, which makes it easier to work with large, metadata-enriched document sets.

The package provides native support for reading in several classic file formats (e.g. plain text, PDFs, or XML files). There is also a plug-in mechanism to handle additional file formats.
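As a rough illustration of the reader mechanism described above, the sketch below builds a small corpus from an in-memory character vector. The two example texts are made up for illustration; a real project would more likely point `DirSource()` at a folder of files and pick a matching reader.

```r
library(tm)

# Two made-up documents held in memory (assumption: illustration only).
texts <- c("The quick brown fox.", "Jumps over the lazy dog.")

# VectorSource() treats each vector element as one document;
# readPlain is the reader for plain text.
corpus <- VCorpus(
  VectorSource(texts),
  readerControl = list(reader = readPlain, language = "en")
)

length(corpus)        # number of documents in the corpus
content(corpus[[1]])  # text of the first document

# List the readers available in the installed version of tm.
getReaders()
```

Calling `getSources()` in the same way lists the available source types.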

The data structures and algorithms can be extended to fit custom demands, since the package is designed in a modular way to enable easy integration of new file formats, readers, transformations and filter operations.
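One way to extend the package with a custom transformation is to wrap an ordinary string function in `content_transformer()` and apply it with `tm_map()`. The following is a minimal sketch; the helper name `toSpace` and the sample strings are assumptions chosen for illustration.

```r
library(tm)

# Made-up sample documents (assumption: illustration only).
corpus <- VCorpus(VectorSource(c("path/to/file", "a/b/c")))

# Hypothetical helper: replace every literal occurrence of `pattern`
# with a space. content_transformer() makes the function operate on the
# document content while keeping the document object and its metadata.
toSpace <- content_transformer(function(x, pattern) {
  gsub(pattern, " ", x, fixed = TRUE)
})

# Extra arguments to tm_map() are passed on to the transformation.
corpus <- tm_map(corpus, toSpace, "/")

content(corpus[[1]])  # "path to file"
```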

tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, or stopword deletion. Further, a generic filter architecture is available to filter documents by given criteria or to perform full-text search. The package supports exporting document collections to term-document matrices.
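The preprocessing, filtering, and export steps mentioned above might look like the following sketch. The sample texts are invented, and stemming is only applied if the SnowballC package happens to be installed, since `stemDocument()` relies on it.

```r
library(tm)

# Made-up sample documents (assumption: illustration only).
corpus <- VCorpus(VectorSource(c(
  "Text mining   is fun.",
  "Mining the text of many documents!"
)))

# Common preprocessing steps, applied to every document in turn.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Stemming needs SnowballC; skip the step if it is not available.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  corpus <- tm_map(corpus, stemDocument)
}

# Export the collection to a document-term matrix and look at it.
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)

# Filter: keep only documents whose text contains the string "fun".
hits <- tm_filter(corpus, FUN = function(doc) {
  any(grepl("fun", content(doc), fixed = TRUE))
})
length(hits)
```

`TermDocumentMatrix()` produces the transposed layout, with terms as rows and documents as columns.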

tm is freely available under the GNU General Public License (GPL).

1083 questions
12 votes, 4 answers

R stemming a string/document/corpus

I'm trying to do some stemming in R but it only seems to work on individual documents. My end goal is a term document matrix that shows the frequency of each term in the document. Here's an…
screechOwl

11 votes, 8 answers

How to show corpus text in R tm package?

I'm completely new to R and the tm package, so please excuse my stupid question ;-) How can I show the text of a plain text corpus in the R tm package? I've loaded 323 plain text files into a corpus: src <-…
Azrael

11 votes, 5 answers

tm: read in data frame, keep text id's, construct DTM and join to other dataset

I'm using the tm package. Say I have a data frame of 2 columns and 500 rows. The first column is an ID, which is randomly generated and contains both letters and numbers: "txF87uyK". The second column is the actual text: "Today's weather is good. John went…
GorillaInR

11 votes, 2 answers

Text-mining with the tm-package - word stemming

I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Obviously, there are some words that have the same stem, but it is important…
majom

10 votes, 2 answers

Keep document ID with R corpus

I have searched Stack Overflow and the web and can only find partial solutions, or ones that no longer work due to changes in tm or qdap. Problem below: I have a data frame with ID and Text columns (a simple document id/name and then some text). I have two issues: Part…
RUser

10 votes, 2 answers

R text mining documents from CSV file (one row per doc)

I am trying to work with the tm package in R, and have a CSV file of customer feedback with each line being a different instance of feedback. I want to import all the content of this feedback into a corpus but I want each line to be a different…
user2407054

9 votes, 1 answer

transformation drops documents error in R

Whenever I run this code, the tm_map line gives me this warning: Warning message: In tm_map.SimpleCorpus(docs, toSpace, "/") : transformation drops documents texts <- read.csv("./Data/fast food/Domino's/Domino's veg pizza.csv",stringsAsFactors =…
NRR

9 votes, 3 answers

Efficient jaccard similarity DocumentTermMatrix

I want a way to efficiently calculate Jaccard similarity between documents of a tm::DocumentTermMatrix. I can do something similar for cosine similarity via the slam package as shown in this answer. I came across another question and response on…
Tyler Rinker

9 votes, 4 answers

Unable to convert a Corpus to Data Frame in R

I've looked at the other similar questions that have been posted here (like this), but the problem persists. I have a dataframe of textual data, which I need to stem. So I'm converting it into a corpus, stemming it, then completing the words from…
wrahool

9 votes, 4 answers

Treat words separated by space in the same manner

I am trying to find the words occurring in multiple documents at the same time. Let us take an example. doc1: "this is a document about milkyway" doc2: "milky way is huge" As you can see in the above 2 documents, the word "milkyway" is occurring in both…
user3664020

9 votes, 2 answers

R tm removeWords function not removing words

I am trying to remove some words from a corpus I have built but it doesn't seem to be working. I first run through everything and create a dataframe that lists my words in order of their frequency. I use this list to identify words I am not…
Adam

9 votes, 1 answer

How to convert vector of characters to corpus input for the DocumentTermMatrix function from tm package in R?

I am new to the tm package. I'd like to use the DocumentTermMatrix function to create a DT-Matrix for further text-mining analysis, but I am unable to create proper input for that function. I have my data input so far in the format of a character vector like…
Marcin

9 votes, 1 answer

Text Categorization in R

My objective is to automatically route each feedback email to the respective division. My fields are FNUMBER, CATEGORY, SUBCATEGORY, Description. I have the last 6 months of data in the above format, where the entire email is stored in Description along with…
Prasanna Nandakumar

9 votes, 1 answer

Search for misspellings of a word in a character vector with R - "inverse" spell checker

I am text mining a large database to create indicator variables which indicate the occurrence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent. However,…
Nick Evans

8 votes, 2 answers

Error faced while using TM package's VCorpus in R

I am facing the below error while working on the TM package with R. library("tm") Loading required package: NLP Warning messages: 1: package ‘tm’ was built under R version 3.4.2 2: package ‘NLP’ was built under R version 3.4.1 corpus <-…
Saharsh Gandhi