Questions tagged [text-mining]

Text Mining is the process of deriving high-quality information from unstructured (textual) data. Possible applications of text mining include:

  • Comments from survey responses
  • Customer messages, emails, complaints, etc.
  • Investigating competitors by crawling their websites


2607 questions
16 votes, 6 answers

R text file and text mining...how to load data

I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words. I don't understand from the documentation how to load a text file and create the necessary objects to start using features such…
user959129
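The question is about R's tm package; as a rough sketch of the same idea (each text file loaded as one bag-of-words document), here is a Python equivalent using scikit-learn. The corpus/*.txt path and the fallback toy documents are assumptions for illustration.

    import glob
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical layout: one plain-text file per document in ./corpus/
    paths = sorted(glob.glob("corpus/*.txt"))
    docs = [open(p, encoding="utf-8").read() for p in paths]
    if not docs:                                  # fall back to toy data so the sketch runs
        docs = ["first toy document about text mining", "second toy document about R"]

    vectorizer = CountVectorizer()                # each document becomes a bag of words
    dtm = vectorizer.fit_transform(docs)          # document-term matrix (sparse)
    print(dtm.shape)
    print(vectorizer.get_feature_names_out()[:10])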
16 votes, 2 answers

Really fast word ngram vectorization in R

Edit: the new package text2vec is excellent and solves this problem (and many others) really well; see text2vec on CRAN, text2vec on GitHub, and the vignette that illustrates ngram tokenization. I have a pretty large text dataset in R, which I've imported as a…
Zach
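The question targets R's text2vec, but the underlying trick (hash the ngrams instead of building a vocabulary) can be sketched in Python with scikit-learn's HashingVectorizer; the toy documents are made up.

    from sklearn.feature_extraction.text import HashingVectorizer

    docs = ["the quick brown fox", "the quick brown fox jumps"]   # toy data
    vec = HashingVectorizer(ngram_range=(1, 2),    # unigrams and bigrams
                            n_features=2**18,      # fixed hash space, no vocabulary to build
                            alternate_sign=False)
    X = vec.transform(docs)                        # stateless, so no fit step is needed
    print(X.shape)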
16 votes, 2 answers

bigrams instead of single words in termdocument matrix using R and Rweka

I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted on Stack Overflow here: findAssocs for multiple terms in R. The idea goes something like…
ds10
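The question uses RWeka's NGramTokenizer with tm; a comparable sketch in Python, with CountVectorizer restricted to bigrams, looks like this (toy documents only):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["text mining is fun", "text mining finds patterns"]
    vec = CountVectorizer(ngram_range=(2, 2))      # bigrams only
    X = vec.fit_transform(docs)                    # rows = documents, columns = bigrams
    print(vec.get_feature_names_out())             # bigram vocabulary
    print(X.toarray().T)                           # transpose for term-document orientation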
16 votes, 5 answers

Obtaining data from PubMed using python

I have a list of PubMed entries along with their PubMed IDs. I would like to create a Python script that accepts a PubMed ID as input and then fetches the abstract from the PubMed website. So far I have come across NCBI…
Ruchik Yajnik
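A minimal sketch using Biopython's Entrez wrapper around NCBI's E-utilities; the email address and the example PMID are placeholders, and the call needs network access.

    from Bio import Entrez

    Entrez.email = "you@example.com"               # NCBI asks for a contact address
    handle = Entrez.efetch(db="pubmed", id="19304878",   # placeholder PMID
                           rettype="abstract", retmode="text")
    print(handle.read())                           # plain-text abstract
    handle.close()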
15 votes, 3 answers

String Distance Matrix in Python

How to calculate a Levenshtein distance matrix of strings in Python?

           str1  str2  str3  str4  ...  strn
    str1   0.8   0.4   0.6   0.1   ...  0.2
    str2   0.4   0.7   0.5   0.1   ...  0.1
…
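A self-contained sketch: a plain dynamic-programming Levenshtein function plus a normalised similarity matrix (the sample values in the question look normalised rather than raw edit counts); the strings are toy data.

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    strings = ["kitten", "sitting", "kitchen"]              # toy data
    matrix = [[1 - levenshtein(a, b) / max(len(a), len(b), 1) for b in strings]
              for a in strings]
    for row in matrix:
        print(["%.2f" % v for v in row])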
15 votes, 7 answers

Text classification/categorization algorithm

My objective is to [semi]automatically assign texts to different categories. There's a set of user defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then…
Max
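A hedged sketch of the usual supervised approach: vectorise the labelled example texts and train a classifier that can then assign new texts to the user-defined categories. The categories and training texts below are invented.

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ["invoice payment overdue", "refund my payment",
             "server is down again", "cannot connect to server"]
    labels = ["billing", "billing", "outage", "outage"]

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(texts, labels)                                  # learn from labelled examples
    print(clf.predict(["the server keeps crashing"]))       # expected: ['outage']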
15 votes, 2 answers

How do I remove verbs, prepositions, conjunctions etc from my text?

Basically, in my text I just want to keep nouns and remove other parts of speech. I do not think there is any automated way to do this; if there is, please suggest one. If there is no automated way, I can also do it manually, but for that I would require…
user3710832
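There is an automated way via part-of-speech tagging; a sketch with NLTK that keeps only tokens tagged as nouns (the NLTK data package names can vary between versions):

    import nltk
    nltk.download("punkt", quiet=True)                       # tokenizer model
    nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

    text = "The quick brown fox jumps over the lazy dog near the river bank."
    tokens = nltk.word_tokenize(text)
    nouns = [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
    print(nouns)    # tokens tagged as nouns, e.g. 'fox', 'dog', 'bank'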
15 votes, 1 answer

Make dataframe of top N frequent terms for multiple corpora using tm package in R

I have several TermDocumentMatrix objects created with the tm package in R. I want to find the 10 most frequent terms in each set of documents, to ultimately end up with an output table like:

    corpus1   corpus2
    "beach"   "city"
    "sand"    "sidewalk"
    ...
…
elfs
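The question is about R's tm; the same end result (a table with one column of top terms per corpus) can be sketched in Python with a Counter per corpus. The corpus names and documents below are made up.

    import pandas as pd
    from collections import Counter

    corpora = {
        "corpus1": ["sun sand beach beach sand", "beach waves sand"],
        "corpus2": ["city sidewalk city traffic", "sidewalk city lights"],
    }
    top_n = 3
    table = {name: [w for w, _ in Counter(" ".join(docs).split()).most_common(top_n)]
             for name, docs in corpora.items()}
    print(pd.DataFrame(table))      # one column of top terms per corpus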
14 votes, 1 answer

Issues in getting trigrams using Gensim

I want to get bigrams and trigrams from the example sentences I have mentioned. My code works fine for bigrams. However, it does not capture trigrams in the data (e.g., human computer interaction, which is mentioned in 5 places in my…
user8566323
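A common cause is training the trigram Phrases model on the raw sentences instead of on the bigram-transformed ones; a sketch with toy sentences and deliberately low thresholds:

    from gensim.models.phrases import Phrases

    sentences = [["human", "computer", "interaction", "is", "studied"],
                 ["human", "computer", "interaction", "matters"],
                 ["human", "computer", "interaction", "research"],
                 ["graph", "of", "trees"],
                 ["human", "computer", "interaction", "survey"],
                 ["human", "computer", "interaction", "review"]]

    bigram = Phrases(sentences, min_count=1, threshold=1)           # learns the "human computer" pair
    trigram = Phrases(bigram[sentences], min_count=1, threshold=1)  # trained on bigram-joined tokens
    print(trigram[bigram[sentences[0]]])                            # joined "human computer interaction" token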
14 votes, 4 answers

AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'

I am trying to apply this code:

    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100],
                  "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
    grid =…
Cox Tox
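cv_results_ is only set after calling fit(), and it only exists in scikit-learn 0.18 or later (earlier releases exposed grid_scores_ instead). A runnable sketch completing the quoted pipeline on invented toy data (min_df lowered so the tiny corpus survives):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    texts = ["good movie", "great film", "bad movie", "awful film"] * 5   # toy data
    labels = [1, 1, 0, 0] * 5

    pipe = make_pipeline(TfidfVectorizer(min_df=1), LogisticRegression())
    param_grid = {"logisticregression__C": [0.01, 1, 100],
                  "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)]}
    grid = GridSearchCV(pipe, param_grid, cv=2)
    grid.fit(texts, labels)                       # cv_results_ only exists after this call
    print(sorted(grid.cv_results_.keys())[:5])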
14 votes, 1 answer

How to find ngram frequency of a column in a pandas dataframe?

Below is the input pandas dataframe I have. I want to find the frequency of unigrams and bigrams. A sample of what I am expecting is shown below. How to do this using nltk or scikit-learn? I wrote the code below, which takes a string as input. How to…
GeorgeOfTheRF
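A sketch of one way to do this with scikit-learn's CountVectorizer over the DataFrame column; the column name 'text' and the sample rows are assumptions.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.DataFrame({"text": ["great service and great food",
                                "great food but slow service"]})

    vec = CountVectorizer(ngram_range=(1, 2))          # unigrams and bigrams
    counts = vec.fit_transform(df["text"])
    freq = pd.Series(counts.sum(axis=0).A1,            # column sums -> total frequency
                     index=vec.get_feature_names_out()).sort_values(ascending=False)
    print(freq.head(10))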
14 votes, 5 answers

How do I clean twitter data in R?

I extracted tweets from Twitter using the twitteR package and saved them into a text file. I have carried out the following on the corpus:

    xx<-tm_map(xx,removeNumbers, lazy=TRUE, 'mc.cores=1')
    xx<-tm_map(xx,stripWhitespace, lazy=TRUE,…
kRazzy R
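The question concerns R's tm_map; as a generic illustration of the same cleanup steps (URLs, mentions, hashtags, numbers, punctuation, whitespace), here is a regex sketch in Python with an invented tweet.

    import re

    def clean_tweet(text):
        text = re.sub(r"http\S+", " ", text)        # URLs
        text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
        text = re.sub(r"\d+", " ", text)            # numbers
        text = re.sub(r"[^\w\s]", " ", text)        # punctuation
        return re.sub(r"\s+", " ", text).strip().lower()

    print(clean_tweet("RT @user: Loving #rstats!! 100% https://example.com"))
    # -> 'rt loving'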
13 votes, 3 answers

Latent Semantic Analysis concepts

I've read about using Singular Value Decomposition (SVD) to do Latent Semantic Analysis (LSA) on a corpus of texts. I understand how to do that, and I also understand the mathematical concepts of SVD. But I don't understand why it works when applied to…
stemm
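In short, the truncated SVD gives the best low-rank approximation of the term-document matrix, so terms that tend to co-occur are folded onto the same latent dimensions and documents using related vocabulary end up close together. A toy sketch with scikit-learn's TruncatedSVD (the documents are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["cats and dogs are pets", "dogs chase cats",
            "stocks and bonds are investments", "bonds yield interest"]

    X = TfidfVectorizer().fit_transform(docs)
    lsa = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent "concepts"
    Z = lsa.fit_transform(X)                             # documents in concept space
    print(Z.round(2))   # the pet documents should separate from the finance documents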
13 votes, 1 answer

How to break conversation data into pairs of (Context , Response)

I'm using the Gensim Doc2Vec model, trying to cluster portions of customer support conversations. My goal is to give the support team auto-response suggestions. Figure 1 shows a sample conversation where the user's question is answered in the next…
Shlomi Schwartz
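One generic way to build (context, response) pairs is to pair each agent reply with everything said before it; a sketch on an invented conversation, not the asker's data:

    turns = [("user", "My printer won't connect"),
             ("agent", "Have you tried restarting it?"),
             ("user", "Yes, still nothing"),
             ("agent", "Let's reinstall the driver")]

    pairs = []
    history = []
    for speaker, text in turns:
        if speaker == "agent" and history:
            pairs.append((" ".join(history), text))   # context = all previous turns
        history.append(text)

    for context, response in pairs:
        print(repr(context), "->", repr(response))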
13 votes, 4 answers

Alternatives for wget giving 'ERROR 403: Forbidden'

I'm trying to get text from multiple PubMed papers using wget, but it seems the NCBI website doesn't allow this. Any alternatives?

    Bernardos-MacBook-Pro:pangenome_papers_pubmed_result bernardo$ wget -i ./url.txt
    --2016-05-04 10:49:34-- …
biotech
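Two common workarounds: send a browser-like User-Agent header (some servers refuse wget's default client), or, for PubMed specifically, go through NCBI's E-utilities (as in the Biopython sketch earlier) instead of scraping pages. A hedged requests sketch with a placeholder URL:

    import requests

    url = "https://example.com/"                      # placeholder URL, not a PubMed page
    headers = {"User-Agent": "Mozilla/5.0 (compatible; research-crawler/0.1)"}
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()                           # raises on 403/404/etc.
    print(resp.text[:200])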