Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaning from that data. Data mining tools, such as SQL Server Analysis Services, predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. These tools can answer business questions that were traditionally too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. The inputs to data mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
8
votes
4 answers

WEKA K-Means Clustering

Can anybody explain what the output of the K-Means clustering in WEKA actually means? For example: kMeans Number of iterations: 9 Within cluster sum of squared errors: 9434.911100488926 Missing values globally replaced with mean/mode Cluster…
Chris Taylor
  • 107
  • 1
  • 1
  • 3
8
votes
4 answers

Stop word removal in Javascript

Hi, I am looking for a library that will remove stop words from text in JavaScript. My end goal is to calculate tf-idf and then convert the given document into a vector space, all in JavaScript. Can anyone point me to a library that'll help…
dhaval2025
  • 317
  • 2
  • 5
  • 12
8
votes
4 answers

Are there any classification algorithms which target data with a one to many (1:n) relationship?

Has there been any research in the field of data mining on classifying data that has a one-to-many relationship? As an example of a problem like this, say I am trying to predict which students are going to drop out of university based on…
8
votes
4 answers

Is there any reason to prefer functional programming for data mining projects?

I am researching the possibility of starting a data mining project which will include intensive calculations and transformation on data, and should be relatively easy to scale. In your experience, is the choice of programming language critical for…
Yuval Adam
  • 161,610
  • 92
  • 305
  • 395
8
votes
3 answers

What is the difference between "Sequential Pattern Mining" and "Sequential Rule Mining"

The documentation for the very powerful open source data mining tool SPMF lists them separately: http://www.philippe-fournier-viger.com/spmf/index.php?link=algorithms.php Does anyone know why?
R Claven
  • 1,160
  • 2
  • 13
  • 27
8
votes
2 answers

Clustering Categorical data using jaccard similarity

I am trying to build a clustering algorithm for categorical data. I have read about different algorithms such as k-modes, ROCK, and LIMBO; however, I would like to build my own and compare its accuracy and cost to the others. I have (m) training set and…
Sam
  • 2,545
  • 8
  • 38
  • 59
8
votes
2 answers

MATLAB's glmfit vs fitglm

I'm trying to perform logistic regression to do classification using MATLAB. There seem to be two different functions in MATLAB's statistics toolbox for building a generalized linear model: 'glmfit' and 'fitglm'. I can't figure out what the difference is…
8
votes
3 answers

Indexing and Searching Over Word Level Annotation Layers in Lucene

I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like The man went to…
dmcer
  • 8,116
  • 1
  • 35
  • 41
8
votes
2 answers

Can I use hdf5 for large amounts of text data?

Suppose I am going to programmatically get a hundred thousand open-access books as text strings from the internet. My intention is to do some analysis on them (using pandas). I am already using mongodb in some parts of my application but I don't think…
yayu
  • 7,758
  • 17
  • 54
  • 86
8
votes
1 answer

How to get a fixed size SIFT feature vector?

I am trying to obtain feature vectors for N ≈ 1300 images in my data set. One of the features I have to implement is shape, so I plan to use SIFT descriptors. However, each image returns a different number of keypoints, so I run [F,D] =…
jeff
  • 13,055
  • 29
  • 78
  • 136
8
votes
3 answers

Bytes vs Characters vs Words - which granularity for n-grams?

At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read…
usual me
  • 8,338
  • 10
  • 52
  • 95
8
votes
1 answer

Cascade Classifiers for Multiclass Problems in scikit-learn

Say I have a classification problem that is multiclass and characteristically hierarchical, e.g. 'edible', 'nutritious' and '~nutritious', so it can be represented like so:
├── edible
│   ├── nutritious
│   └── ~nutritious
└── ~edible
While one can…
tiao
  • 805
  • 1
  • 8
  • 20
8
votes
6 answers

Analyzing noisy data

I recently launched a rocket with a barometric altimeter that is accurate to roughly 10 ft (calculated via data acquired during flight). The recorded data is in time increments of 0.05 sec per sample and a graph of altitude vs. time looks pretty…
Nick Larsen
  • 18,631
  • 6
  • 67
  • 96
8
votes
1 answer

How to scale input DBSCAN in scikit-learn

Should the input to sklearn.cluster.DBSCAN be pre-processed? In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and…
Alex
  • 267
  • 1
  • 2
  • 7
8
votes
2 answers

rapid miner: how to add a 'label' attribute to a dataset?

I want to apply a decision tree learning algorithm to a dataset I have imported from a CSV. The problem is that the "tra" input of the Decision Tree block is still red, stating "Input example set must have special attribute 'label'." How do I add…
fstab
  • 4,801
  • 8
  • 34
  • 66