Questions tagged [text-mining]

Text Mining is the process of deriving high-quality information from unstructured (textual) data. Possible applications of text mining include:

  • Comments from survey responses
  • Customer messages, emails, complaints, etc.
  • Investigating competitors by crawling their websites


2607 questions
16 votes, 6 answers

R text file and text mining...how to load data

I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words. I don't understand from the documentation how to load a text file and create the necessary objects to start using features such…
user959129
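The question is about R's tm package; as a rough sketch of the same idea (each text file loaded as one bag-of-words document), here is a Python equivalent using scikit-learn. The corpus/*.txt path and the fallback toy documents are assumptions for illustration.

    import glob
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical layout: one plain-text file per document in ./corpus/
    paths = sorted(glob.glob("corpus/*.txt"))
    docs = [open(p, encoding="utf-8").read() for p in paths]
    if not docs:                                  # fall back to toy data so the sketch runs
        docs = ["first toy document about text mining", "second toy document about R"]

    vectorizer = CountVectorizer()                # each document becomes a bag of words
    dtm = vectorizer.fit_transform(docs)          # document-term matrix (sparse)
    print(dtm.shape)
    print(vectorizer.get_feature_names_out()[:10])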
16 votes, 2 answers

Really fast word ngram vectorization in R

Edit: the new package text2vec is excellent and solves this problem (and many others) really well; see text2vec on CRAN, text2vec on GitHub, and the vignette that illustrates ngram tokenization. I have a pretty large text dataset in R, which I've imported as a…
Zach
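The question targets R's text2vec, but the underlying trick (hash the ngrams instead of building a vocabulary) can be sketched in Python with scikit-learn's HashingVectorizer; the toy documents are made up.

    from sklearn.feature_extraction.text import HashingVectorizer

    docs = ["the quick brown fox", "the quick brown fox jumps"]   # toy data
    vec = HashingVectorizer(ngram_range=(1, 2),    # unigrams and bigrams
                            n_features=2**18,      # fixed hash space, no vocabulary to build
                            alternate_sign=False)
    X = vec.transform(docs)                        # stateless, so no fit step is needed
    print(X.shape)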
16 votes, 2 answers

bigrams instead of single words in termdocument matrix using R and Rweka

I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted on Stack Overflow here: findAssocs for multiple terms in R. The idea goes something like…
ds10
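The question uses RWeka's NGramTokenizer with tm; a comparable sketch in Python, with CountVectorizer restricted to bigrams, looks like this (toy documents only):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["text mining is fun", "text mining finds patterns"]
    vec = CountVectorizer(ngram_range=(2, 2))      # bigrams only
    X = vec.fit_transform(docs)                    # rows = documents, columns = bigrams
    print(vec.get_feature_names_out())             # bigram vocabulary
    print(X.toarray().T)                           # transpose for term-document orientation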
16 votes, 5 answers

Obtaining data from PubMed using python

I have a list of PubMed entries along with their PubMed IDs. I would like to create a Python script that accepts a PubMed ID as input and then fetches the abstract from the PubMed website. So far I have come across NCBI…
Ruchik Yajnik
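A minimal sketch using Biopython's Entrez wrapper around NCBI's E-utilities; the email address and the example PMID are placeholders, and the call needs network access.

    from Bio import Entrez

    Entrez.email = "you@example.com"               # NCBI asks for a contact address
    handle = Entrez.efetch(db="pubmed", id="19304878",   # placeholder PMID
                           rettype="abstract", retmode="text")
    print(handle.read())                           # plain-text abstract
    handle.close()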
15 votes, 3 answers

String Distance Matrix in Python

How to calculate a Levenshtein distance matrix of strings in Python?

           str1  str2  str3  str4  ...  strn
    str1   0.8   0.4   0.6   0.1   ...  0.2
    str2   0.4   0.7   0.5   0.1   ...  0.1
…
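A self-contained sketch: a plain dynamic-programming Levenshtein function plus a normalised similarity matrix (the sample values in the question look normalised rather than raw edit counts); the strings are toy data.

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    strings = ["kitten", "sitting", "kitchen"]              # toy data
    matrix = [[1 - levenshtein(a, b) / max(len(a), len(b), 1) for b in strings]
              for a in strings]
    for row in matrix:
        print(["%.2f" % v for v in row])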
15 votes, 7 answers

Text classification/categorization algorithm

My objective is to [semi]automatically assign texts to different categories. There's a set of user defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then…
Max
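A hedged sketch of the usual supervised approach: vectorise the labelled example texts and train a classifier that can then assign new texts to the user-defined categories. The categories and training texts below are invented.

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ["invoice payment overdue", "refund my payment",
             "server is down again", "cannot connect to server"]
    labels = ["billing", "billing", "outage", "outage"]

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(texts, labels)                                  # learn from labelled examples
    print(clf.predict(["the server keeps crashing"]))       # expected: ['outage']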
15 votes, 2 answers

How do I remove verbs, prepositions, conjunctions etc from my text?

Basically, in my text I just want to keep nouns and remove other parts of speech. I do not think there is any automated way to do this; if there is, please suggest one. If there is no automated way, I can also do it manually, but for that I would require…
user3710832
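There is an automated way via part-of-speech tagging; a sketch with NLTK that keeps only tokens tagged as nouns (the NLTK data package names can vary between versions):

    import nltk
    nltk.download("punkt", quiet=True)                       # tokenizer model
    nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

    text = "The quick brown fox jumps over the lazy dog near the river bank."
    tokens = nltk.word_tokenize(text)
    nouns = [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
    print(nouns)    # tokens tagged as nouns, e.g. 'fox', 'dog', 'bank'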
15 votes, 1 answer

Make dataframe of top N frequent terms for multiple corpora using tm package in R

I have several TermDocumentMatrix objects created with the tm package in R. I want to find the 10 most frequent terms in each set of documents, to ultimately end up with an output table like:

    corpus1   corpus2
    "beach"   "city"
    "sand"    "sidewalk"
    ...
…
elfs
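The question is about R's tm; the same end result (a table with one column of top terms per corpus) can be sketched in Python with a Counter per corpus. The corpus names and documents below are made up.

    import pandas as pd
    from collections import Counter

    corpora = {
        "corpus1": ["sun sand beach beach sand", "beach waves sand"],
        "corpus2": ["city sidewalk city traffic", "sidewalk city lights"],
    }
    top_n = 3
    table = {name: [w for w, _ in Counter(" ".join(docs).split()).most_common(top_n)]
             for name, docs in corpora.items()}
    print(pd.DataFrame(table))      # one column of top terms per corpus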
14 votes, 1 answer

Issues in getting trigrams using Gensim

I want to get bigrams and trigrams from the example sentences I have mentioned. My code works fine for bigrams. However, it does not capture trigrams in the data (e.g., human computer interaction, which is mentioned in 5 places in my…
user8566323
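A common cause is training the trigram Phrases model on the raw sentences instead of on the bigram-transformed ones; a sketch with toy sentences and deliberately low thresholds:

    from gensim.models.phrases import Phrases

    sentences = [["human", "computer", "interaction", "is", "studied"],
                 ["human", "computer", "interaction", "matters"],
                 ["human", "computer", "interaction", "research"],
                 ["graph", "of", "trees"],
                 ["human", "computer", "interaction", "survey"],
                 ["human", "computer", "interaction", "review"]]

    bigram = Phrases(sentences, min_count=1, threshold=1)           # learns the "human computer" pair
    trigram = Phrases(bigram[sentences], min_count=1, threshold=1)  # trained on bigram-joined tokens
    print(trigram[bigram[sentences[0]]])                            # joined "human computer interaction" token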
14 votes, 4 answers

AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'

I am trying to apply this code:

    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100],
                  "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
    grid =…
Cox Tox
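cv_results_ is only set after calling fit(), and it only exists in scikit-learn 0.18 or later (earlier releases exposed grid_scores_ instead). A runnable sketch completing the quoted pipeline on invented toy data (min_df lowered so the tiny corpus survives):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    texts = ["good movie", "great film", "bad movie", "awful film"] * 5   # toy data
    labels = [1, 1, 0, 0] * 5

    pipe = make_pipeline(TfidfVectorizer(min_df=1), LogisticRegression())
    param_grid = {"logisticregression__C": [0.01, 1, 100],
                  "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)]}
    grid = GridSearchCV(pipe, param_grid, cv=2)
    grid.fit(texts, labels)                       # cv_results_ only exists after this call
    print(sorted(grid.cv_results_.keys())[:5])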
14 votes, 1 answer

How to find ngram frequency of a column in a pandas dataframe?

Below is the input pandas dataframe I have. I want to find the frequency of unigrams and bigrams. A sample of what I am expecting is shown below. How to do this using nltk or scikit-learn? I wrote the code below, which takes a string as input. How to…
GeorgeOfTheRF
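A sketch of one way to do this with scikit-learn's CountVectorizer over the DataFrame column; the column name 'text' and the sample rows are assumptions.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.DataFrame({"text": ["great service and great food",
                                "great food but slow service"]})

    vec = CountVectorizer(ngram_range=(1, 2))          # unigrams and bigrams
    counts = vec.fit_transform(df["text"])
    freq = pd.Series(counts.sum(axis=0).A1,            # column sums -> total frequency
                     index=vec.get_feature_names_out()).sort_values(ascending=False)
    print(freq.head(10))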
14 votes, 5 answers

How do I clean twitter data in R?

I extracted tweets from Twitter using the twitteR package and saved them into a text file. I have carried out the following on the corpus:

    xx<-tm_map(xx,removeNumbers, lazy=TRUE, 'mc.cores=1')
    xx<-tm_map(xx,stripWhitespace, lazy=TRUE,…
kRazzy R
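The question concerns R's tm_map; as a generic illustration of the same cleanup steps (URLs, mentions, hashtags, numbers, punctuation, whitespace), here is a regex sketch in Python with an invented tweet.

    import re

    def clean_tweet(text):
        text = re.sub(r"http\S+", " ", text)        # URLs
        text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
        text = re.sub(r"\d+", " ", text)            # numbers
        text = re.sub(r"[^\w\s]", " ", text)        # punctuation
        return re.sub(r"\s+", " ", text).strip().lower()

    print(clean_tweet("RT @user: Loving #rstats!! 100% https://example.com"))
    # -> 'rt loving'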
13 votes, 3 answers

Latent Semantic Analysis concepts

I've read about using Singular Value Decomposition (SVD) to do Latent Semantic Analysis (LSA) on a corpus of texts. I understand how to do that, and I also understand the mathematical concepts of SVD. But I don't understand why it works when applied to…
stemm
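In short, the truncated SVD gives the best low-rank approximation of the term-document matrix, so terms that tend to co-occur are folded onto the same latent dimensions and documents using related vocabulary end up close together. A toy sketch with scikit-learn's TruncatedSVD (the documents are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["cats and dogs are pets", "dogs chase cats",
            "stocks and bonds are investments", "bonds yield interest"]

    X = TfidfVectorizer().fit_transform(docs)
    lsa = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent "concepts"
    Z = lsa.fit_transform(X)                             # documents in concept space
    print(Z.round(2))   # the pet documents should separate from the finance documents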
13 votes, 1 answer

How to break conversation data into pairs of (Context , Response)

I'm using the Gensim Doc2Vec model, trying to cluster portions of customer support conversations. My goal is to give the support team auto-response suggestions. Figure 1 shows a sample conversation where the user's question is answered in the next…
Shlomi Schwartz
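One generic way to build (context, response) pairs is to pair each agent reply with everything said before it; a sketch on an invented conversation, not the asker's data:

    turns = [("user", "My printer won't connect"),
             ("agent", "Have you tried restarting it?"),
             ("user", "Yes, still nothing"),
             ("agent", "Let's reinstall the driver")]

    pairs = []
    history = []
    for speaker, text in turns:
        if speaker == "agent" and history:
            pairs.append((" ".join(history), text))   # context = all previous turns
        history.append(text)

    for context, response in pairs:
        print(repr(context), "->", repr(response))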
13 votes, 4 answers

Alternatives for wget giving 'ERROR 403: Forbidden'

I'm trying to get text from multiple PubMed papers using wget, but it seems the NCBI website doesn't allow this. Any alternatives?

    Bernardos-MacBook-Pro:pangenome_papers_pubmed_result bernardo$ wget -i ./url.txt
    --2016-05-04 10:49:34-- …
biotech
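Two common workarounds: send a browser-like User-Agent header (some servers refuse wget's default client), or, for PubMed specifically, go through NCBI's E-utilities (as in the Biopython sketch earlier) instead of scraping pages. A hedged requests sketch with a placeholder URL:

    import requests

    url = "https://example.com/"                      # placeholder URL, not a PubMed page
    headers = {"User-Agent": "Mozilla/5.0 (compatible; research-crawler/0.1)"}
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()                           # raises on 403/404/etc.
    print(resp.text[:200])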