Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data to extract meaning from them. Data mining tools such as SQL Server Analysis Services predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve, scouring databases for hidden patterns and finding predictive information that experts may miss because it lies outside their expectations. The inputs to data mining algorithms are variously called cases, samples, examples, instances, events, or observations.

machine-learning, artificial-intelligence and statistics provide many of the techniques used in data mining, in combination with database technologies for efficiency. Please use the appropriate tag (e.g. machine-learning) for questions about the underlying methods.
Cluster analysis (dataclustering) and outlier detection (outliers) are two of the main challenges in data mining.
Wiki Links
Data Mining Introduction
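
As a rough illustration of the two challenges named in the description above, cluster analysis and outlier detection, here is a minimal sketch using scikit-learn's DBSCAN, which assigns the label -1 to points it treats as noise. The synthetic data and the `eps` and `min_samples` values are assumptions chosen for this example only, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (an assumption for illustration): two tight groups
# of 20 points each, plus one point far away from both.
rng = np.random.RandomState(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.05, size=(20, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.05, size=(20, 2))
stray = np.array([[20.0, 20.0]])
X = np.vstack([group_a, group_b, stray])

# eps and min_samples are illustrative values picked for this data.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1; every other label is a cluster id.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
outlier_indices = np.where(labels == -1)[0]

print("clusters found:", n_clusters)
print("indices flagged as noise:", outlier_indices.tolist())
```

On real data, both parameters usually need tuning, and a dedicated outlier detection method may be a better fit than reading outliers off a clustering result.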

3094 questions

votes

9 answers

scikit-learn: Predicting new points with DBSCAN

I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7): from sklearn.cluster import DBSCAN dbscan = DBSCAN(random_state=0) dbscan.fit(X) However, I found that there was no built-in function (aside from "fit_predict") that could…

machine-learning scikit-learn cluster-analysis data-mining dbscan

asked Jan 07 '15 at 15:27

slaw

6,591
16
56
109

votes

3 answers

How to calculate the regularization parameter in linear regression

When we have a high degree linear polynomial that is used to fit a set of points in a linear regression setup, to prevent overfitting, we use regularization, and we include a lambda parameter in the cost function. This lambda is then used to update…

machine-learning data-mining regression

asked Aug 29 '12 at 16:04

London guy

27,522
44
121
179

votes

3 answers

R Random Forests Variable Importance

I am trying to use the random forests package for classification in R. The Variable Importance Measures listed are: mean raw importance score of variable x for class 0 mean raw importance score of variable x for class…

r statistics data-mining random-forest

asked Apr 10 '09 at 02:18

thirsty93

2,602
6
26
26

votes

7 answers

Kmeans without knowing the number of clusters?

I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters. I remember reading somewhere that the way an algorithm…

python machine-learning data-mining k-means

asked Jul 07 '11 at 18:58

Legend

113,822
119
272
400

votes

3 answers

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The…

machine-learning scikit-learn cluster-analysis data-mining kernel-density

asked Jan 29 '16 at 21:35

Alex Kinman

2,437
8
32
51

votes

6 answers

Can anyone give a real life example of supervised learning and unsupervised learning?

I recently studied about supervised learning and unsupervised learning. From theory, I know that supervised means getting the information from labeled datasets and unsupervised means clustering the data without any labels given. But, the problem is…

machine-learning deep-learning data-mining supervised-learning unsupervised-learning

asked Oct 03 '14 at 16:29

krupal

votes

5 answers

scikit-learn DBSCAN memory usage

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than scikit-learn's. It can be run from the command line…

python scikit-learn cluster-analysis data-mining dbscan

asked May 05 '13 at 05:04

JamesT

votes

6 answers

Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as…

r data-mining cluster-analysis dbscan

asked Oct 15 '12 at 10:12

Belinda Chiera

votes

4 answers

How does clustering (especially String clustering) work?

I heard about clustering to group similar data. I want to know how it works in the specific case of strings. I have a table with more than 100,000 different words. I want to identify the same word with some differences (e.g. house, house!!,…

string cluster-analysis data-mining

asked Nov 19 '11 at 18:48

Renato Dinhani

35,057
55
139
199

votes

3 answers

What makes the distance measure in k-medoid "better" than k-means?

I am reading about the difference between k-means clustering and k-medoid clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum of squared Euclidean…

machine-learning cluster-analysis data-mining k-means

asked Feb 07 '14 at 05:08

tumultous_rooster

12,150
32
92
149

votes

6 answers

How do I extract keywords used in text?

How do I data mine a pile of text to get keywords by usage ("Jacob Smith" or "fence")? And is there software that already does this, even semi-automatically? If it can filter out simple words like "the", "and", "or", then I could get to the topics…

text indexing keyword data-mining

asked Oct 15 '09 at 21:37

Robin Rodricks

110,798
141
398
607

votes

8 answers

Comparing R to Matlab for Data Mining

Instead of starting to code in Matlab, I recently started learning R, mainly because it is open-source. I am currently working in the data mining and machine learning field. I found many machine learning algorithms implemented in R, and I am still…

r matlab machine-learning data-mining language-comparisons

asked Jan 27 '11 at 01:04

iinception

1,945
2
21
19

votes

7 answers

Python Implementation of OPTICS (Clustering) Algorithm

I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs). I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster…

python machine-learning cluster-analysis data-mining optics-algorithm

asked Apr 01 '11 at 15:43

Murat Derya Özen

2,154
8
31
44

votes

2 answers

Scikit-learn: How to run KMeans on a one-dimensional array?

I have an array of 13,876 values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it seems KMeans works with a multidimensional array…

python scikit-learn data-mining k-means

asked Feb 09 '15 at 18:08

Irene

votes

14 answers

How can I find the center of a cluster of data points?

Let's say I plotted the position of a helicopter every day for the past year and came up with the following map: Any human looking at this would be able to tell me that this helicopter is based out of Chicago. How can I find the same result in…

algorithm geocoding cluster-analysis data-mining markerclusterer

asked Jun 14 '13 at 16:03

Ryan

14,682
32
106
179
