Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaning from them. Data mining tools, such as SQL Server Analysis Services, predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. The inputs to data mining algorithms are variously called cases, samples, examples, instances, events, or observations.

machine-learning, artificial-intelligence and statistics provide many techniques used in data mining, in combination with database technologies for efficiency. Please use the appropriate tag (e.g. machine-learning) to refer to the raw methods.
Cluster analysis (dataclustering) and outlier detection (outliers) are two of the main challenges in data mining.

3094 questions

4 answers

WEKA K-Means Clustering

Can anybody explain what the output of the K-Means clustering in WEKA actually means? For example: kMeans Number of iterations: 9 Within cluster sum of squared errors: 9434.911100488926 Missing values globally replaced with mean/mode Cluster…

cluster-analysis data-mining weka k-means

asked Apr 26 '11 at 14:09

Chris Taylor

4 answers

Stop word removal in JavaScript

Hi, I am looking for a library that will remove stop words from text in JavaScript. My end goal is to calculate tf-idf and then convert the given document into vector space, and all of this is in JavaScript. Can anyone point me to a library that'll help…

analytics data-mining javascript stemming

asked Apr 12 '11 at 06:51

dhaval2025

4 answers

Are there any classification algorithms which target data with a one to many (1:n) relationship?

Has there been any research in the field of data mining on classifying data which has a one-to-many relationship? As an example of such a problem, say I am trying to predict which students are going to drop out of university based on…

algorithm machine-learning data-mining classification database-relations

asked Jan 21 '11 at 22:06

Nixuz

4 answers

Is there any reason to prefer functional programming for data mining projects?

I am researching the possibility of starting a data mining project which will include intensive calculations and transformation on data, and should be relatively easy to scale. In your experience, is the choice of programming language critical for…

java programming-languages functional-programming clojure data-mining

asked Nov 08 '10 at 21:13

Yuval Adam

3 answers

What is the difference between "Sequential Pattern Mining" and "Sequential Rule Mining"?

The documentation for the very powerful open source data mining tool SPMF lists them separately: http://www.philippe-fournier-viger.com/spmf/index.php?link=algorithms.php Does anyone know why?

data-mining data-science pattern-mining

asked Jan 03 '16 at 01:38

R Claven

2 answers

Clustering categorical data using Jaccard similarity

I am trying to build a clustering algorithm for categorical data. I have read about different algorithms like k-modes, ROCK, and LIMBO; however, I would like to build one of my own and compare its accuracy and cost to the others. I have (m) training set and…
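For readers unfamiliar with the measure named in this question, here is a minimal sketch of Jaccard similarity applied to categorical records. It assumes each record is a dict of attribute to value and treats it as a set of (attribute, value) pairs; the function name `jaccard_similarity`, the sample records, and the empty-set convention are illustrative assumptions, not taken from the question.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumed convention: two empty sets are considered identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical categorical records; each becomes a set of (attribute, value) pairs.
r1 = {"color": "red", "shape": "round", "size": "small"}
r2 = {"color": "red", "shape": "square", "size": "small"}

similarity = jaccard_similarity(r1.items(), r2.items())
distance = 1.0 - similarity  # Jaccard distance, usable as a clustering dissimilarity
print(similarity, distance)
```

The two records share two of four distinct (attribute, value) pairs, so the similarity is 0.5.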

python-2.7 machine-learning cluster-analysis data-mining k-means

asked May 09 '15 at 12:47

Sam

2 answers

MATLAB's glmfit vs fitglm

I'm trying to perform logistic regression to do classification using MATLAB. There seem to be two different methods in MATLAB's statistics toolbox to build a generalized linear model: 'glmfit' and 'fitglm'. I can't figure out what the difference is…

matlab machine-learning data-mining glm logistic-regression

asked Mar 12 '15 at 15:42

Physman

3 answers

Indexing and Searching Over Word Level Annotation Layers in Lucene

I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like The man went to…

java lucene nlp data-mining text-mining

asked May 21 '10 at 14:37

dmcer

2 answers

Can I use hdf5 for large amounts of text data?

Suppose I am going to programmatically get a hundred thousand open access books as text strings from the internet. My intention is to do some analysis on them (using pandas). I am already using mongodb in some parts of my application but I don't think…

file data-mining hdf5

asked Nov 18 '14 at 13:58

yayu

1 answer

How to get a fixed size SIFT feature vector?

I am trying to obtain feature vectors for N =~ 1300 images in my data set. One of the features I have to implement is shape, so I plan to use SIFT descriptors. However, each image returns a different number of keypoints, so I run [F,D] =…

matlab image-processing data-mining sift vlfeat

asked Apr 18 '14 at 18:19

jeff

3 answers

Bytes vs Characters vs Words - which granularity for n-grams?

At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read…
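To make the three granularities in this question concrete, here is a minimal sketch that extracts bigrams at each level from the same string. The helper name `ngrams`, the sample text, and the whitespace tokenization are illustrative assumptions; real pipelines typically use a proper tokenizer.

```python
def ngrams(seq, n):
    """Return every contiguous window of length n from a sequence (str, bytes, or list)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


# Sample text with one non-ASCII character, so byte and character counts differ.
text = "naïve data mining"

byte_bigrams = ngrams(text.encode("utf-8"), 2)  # windows over UTF-8 bytes
char_bigrams = ngrams(text, 2)                  # windows over Unicode characters
word_bigrams = [" ".join(w) for w in ngrams(text.split(), 2)]  # windows over whitespace tokens

print(len(byte_bigrams), len(char_bigrams), word_bigrams)
```

Because "ï" occupies two bytes in UTF-8, the byte-level view yields one more bigram than the character-level view, which is one practical difference between the two granularities.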

nlp data-mining text-mining n-gram

asked Feb 09 '14 at 08:18

usual me

1 answer

Cascade Classifiers for Multiclass Problems in scikit-learn

Say I have a classification problem that is multiclass and characteristically hierarchical, e.g. 'edible', 'nutritious' and '~nutritious' - so it can be represented like so:

├── edible
│   ├── nutritious
│   └── ~nutritious
└── ~edible

While one can…

python machine-learning data-mining scikit-learn

asked Jan 16 '14 at 00:44

tiao

6 answers

Analyzing noisy data

I recently launched a rocket with a barometric altimeter that is accurate to roughly 10 ft (calculated via data acquired during flight). The recorded data is in time increments of 0.05 sec per sample and a graph of altitude vs. time looks pretty…

data-mining numerical-analysis

asked Dec 24 '09 at 05:28

Nick Larsen

1 answer

How to scale input for DBSCAN in scikit-learn

Should the input to sklearn.cluster.DBSCAN be pre-processed? In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and…

scikit-learn cluster-analysis data-mining dbscan

asked Jul 03 '13 at 21:55

Alex

2 answers

RapidMiner: how to add a 'label' attribute to a dataset?

I want to apply a decision tree learning algorithm to a dataset I have imported from a CSV. The problem is that the "tra" input of the Decision Tree block is still red, stating "Input example set must have special attribute 'label'.". How do I add…

machine-learning data-mining decision-tree rapidminer

asked Apr 08 '13 at 12:44

fstab

