Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaning from them. Data mining tools, such as SQL Server Analysis Services, predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that traditionally were too time-consuming to resolve, scouring databases for hidden patterns and finding predictive information that experts may miss because it lies outside their expectations. The inputs to learning and mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
48
votes
9 answers

scikit-learn: Predicting new points with DBSCAN

I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7):

    from sklearn.cluster import DBSCAN
    dbscan = DBSCAN(random_state=0)
    dbscan.fit(X)

However, I found that there was no built-in function (aside from "fit_predict") that could…
slaw
  • 6,591
  • 16
  • 56
  • 109
47
votes
3 answers

How to calculate the regularization parameter in linear regression

When we have a high degree linear polynomial that is used to fit a set of points in a linear regression setup, to prevent overfitting, we use regularization, and we include a lambda parameter in the cost function. This lambda is then used to update…
London guy
  • 27,522
  • 44
  • 121
  • 179
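
For the regularization question above, here is a minimal sketch of how a lambda term enters the cost function and how one might choose it. It assumes a one-feature ridge (L2) regression without an intercept, and it picks lambda by held-out validation, which is one common approach rather than the only one. The function names `ridge_weight` and `validation_error`, the candidate lambdas, and all data values are made up for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Minimise sum((y - w*x)**2) + lam * w**2 over a single weight w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam),
    so a larger lam shrinks the weight towards zero.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def validation_error(w, xs, ys):
    """Mean squared error of the fitted weight on held-out points."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Illustrative data only (not from the question).
train_x, train_y = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
valid_x, valid_y = [4.0, 5.0], [8.1, 9.8]

# Try a few candidate lambdas and keep the one with the lowest
# validation error.
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(
    candidates,
    key=lambda lam: validation_error(
        ridge_weight(train_x, train_y, lam), valid_x, valid_y
    ),
)
```

In practice the same idea is usually applied with k-fold cross-validation over a finer grid of lambda values.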
46
votes
3 answers

R Random Forests Variable Importance

I am trying to use the random forests package for classification in R. The Variable Importance Measures listed are: mean raw importance score of variable x for class 0; mean raw importance score of variable x for class…
thirsty93
  • 2,602
  • 6
  • 26
  • 26
42
votes
7 answers

Kmeans without knowing the number of clusters?

I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters. I remember reading somewhere that the way an algorithm…
Legend
  • 113,822
  • 119
  • 272
  • 400
42
votes
3 answers

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The…
40
votes
6 answers

Can anyone give a real life example of supervised learning and unsupervised learning?

I recently studied supervised learning and unsupervised learning. From theory, I know that supervised means getting the information from labeled datasets and unsupervised means clustering the data without any labels given. But the problem is…
38
votes
5 answers

scikit-learn DBSCAN memory usage

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than scikit-learn's. It can be run from the command line…
JamesT
  • 417
  • 2
  • 6
  • 8
38
votes
6 answers

Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as…
Belinda Chiera
  • 417
  • 1
  • 5
  • 7
36
votes
4 answers

How does clustering (especially String clustering) work?

I heard about clustering to group similar data. I want to know how it works in the specific case of strings. I have a table with more than 100,000 different words. I want to identify the same word with some differences (e.g. house, house!!,…
Renato Dinhani
  • 35,057
  • 55
  • 139
  • 199
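
For the string clustering question above, here is a minimal sketch of the simplest case: grouping variants such as "house" and "house!!" under one key. It assumes the variants differ only in letter case and punctuation, and that the words are ASCII. Misspellings would need a similarity measure such as edit distance instead. The names `normalise` and `group_variants` and the word list are hypothetical.

```python
import re
from collections import defaultdict


def normalise(word):
    # Lower-case the word and drop everything that is not an ASCII
    # letter or digit (assumption: the input is ASCII text).
    return re.sub(r"[^a-z0-9]", "", word.lower())


def group_variants(words):
    # Map each normalised key to the list of original spellings.
    groups = defaultdict(list)
    for word in words:
        groups[normalise(word)].append(word)
    return dict(groups)


# Illustrative words only.
words = ["house", "house!!", "House", "hose", "tree."]
groups = group_variants(words)
```

Here "hose" stays in its own group because exact-key grouping does not treat near matches as equal.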
36
votes
3 answers

What makes the distance measure in k-medoid "better" than k-means?

I am reading about the difference between k-means clustering and k-medoid clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum of squared Euclidean…
tumultous_rooster
  • 12,150
  • 32
  • 92
  • 149
36
votes
6 answers

How do I extract keywords used in text?

How do I data mine a pile of text to get keywords by usage (e.g. "Jacob Smith" or "fence")? Is there software that does this already, even semi-automatically? If it can filter out simple words like "the", "and", "or", then I could get to the topics…
Robin Rodricks
  • 110,798
  • 141
  • 398
  • 607
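
For the keyword question above, here is a minimal sketch of frequency-based keyword extraction with a stop-word filter, using only the Python standard library. The stop-word set is a tiny hand-picked assumption (real lists are much longer), and the function name `top_keywords` and the sample sentence are made up for illustration. It counts single words only, so a name like "Jacob Smith" is split into two tokens.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is", "on"}


def top_keywords(text, n=3):
    # Split the lower-cased text into runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Count every token that is not a stop word.
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


sample = (
    "The fence and the gate: Jacob Smith painted the fence, "
    "and Jacob fixed the gate."
)
keywords = top_keywords(sample, 3)
```

In this sample the three most frequent non-stop words are "fence", "gate", and "jacob", each appearing twice.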
35
votes
8 answers

Comparing R to Matlab for Data Mining

Instead of starting to code in Matlab, I recently started learning R, mainly because it is open-source. I am currently working in the data mining and machine learning field. I found many machine learning algorithms implemented in R, and I am still…
iinception
  • 1,945
  • 2
  • 21
  • 19
32
votes
7 answers

Python Implementation of OPTICS (Clustering) Algorithm

I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs). I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster…
31
votes
2 answers

Scikit-learn: How to run KMeans on a one-dimensional array?

I have an array of 13,876 values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it seems KMeans works with a multidimensional array…
Irene
  • 579
  • 2
  • 10
  • 19
31
votes
14 answers

How can I find the center of a cluster of data points?

Let's say I plotted the position of a helicopter every day for the past year and came up with the following map: Any human looking at this would be able to tell me that this helicopter is based out of Chicago. How can I find the same result in…
Ryan
  • 14,682
  • 32
  • 106
  • 179