Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaning from them. Data mining tools, such as SQL Server Analysis Services, predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. The inputs to data mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
14
votes
3 answers

An understandable clusterization

I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal. There is some natural structure in this dataset. Commonly, experts clusterize datasets such as mine using…
14
votes
3 answers

How to test if a kernel is a valid kernel

If I define my own method of determining the similarity between two input entities of my Support Vector Machine classifier, and thus define it as my kernel, how do I verify if it is indeed a valid kernel that I can use? For example, if my inputs are…
London guy
  • 27,522
  • 44
  • 121
  • 179
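A rough way to probe the kernel question above: a valid (Mercer) kernel must produce a symmetric positive semi-definite Gram matrix on every finite sample, so a single sample whose Gram matrix fails that test is enough to rule a kernel out. The sketch below is only an empirical check under that assumption; passing it does not prove validity. The helper names (`gram_matrix`, `looks_psd`), the tolerance `eps`, and the two example kernels are illustrative and not taken from the question.

```python
import math

def gram_matrix(kernel, points):
    """Gram matrix K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(a, b) for b in points] for a in points]

def looks_psd(K, eps=1e-9):
    """Return False if K is clearly not symmetric positive semi-definite.

    Attempts a Cholesky factorisation of K + eps*I. If K is PSD, the
    shifted matrix is positive definite and the factorisation succeeds.
    eps is an absolute tolerance and assumes kernel values of order 1.
    """
    n = len(K)
    for i in range(n):
        for j in range(i):
            if abs(K[i][j] - K[j][i]) > eps:
                return False  # not symmetric
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                s += eps
                if s <= 0:
                    return False  # negative pivot: not PSD
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

def rbf(x, y):
    """Gaussian (RBF) kernel, a known valid kernel."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))

def neg_sq_dist(x, y):
    """Negative squared distance: a similarity that is NOT a valid kernel."""
    return -sum((a - b) ** 2 for a, b in zip(x, y))

sample = [(0.0,), (1.0,), (2.5,)]
rbf_ok = looks_psd(gram_matrix(rbf, sample))
neg_ok = looks_psd(gram_matrix(neg_sq_dist, sample))
```

Running the check on several random samples of real inputs makes a false pass less likely, but only a proof (for example, building the kernel from known valid kernels) settles the matter.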
14
votes
6 answers

How to find the minimum support in Apriori algorithm

When the percentage values of support and confidence are given, how can I find the minimum support in the Apriori algorithm? For example, when support and confidence are given as 60% and 60% respectively, what is the minimum support?
Chanikag
  • 1,419
  • 2
  • 18
  • 31
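One common reading of the question above, offered as an assumption: the 60% figure is already the minimum support expressed as a fraction of all transactions, and the absolute minimum support count is that fraction of the number of transactions, rounded up. The function name and the transaction counts below are hypothetical.

```python
def min_support_count(min_support_pct, n_transactions):
    """Smallest number of transactions an itemset must appear in
    to meet a minimum support given as a whole-number percentage.

    Integer arithmetic computes ceil(pct * n / 100) without
    floating-point rounding surprises.
    """
    return (min_support_pct * n_transactions + 99) // 100

count_for_5 = min_support_count(60, 5)   # 60% of 5 transactions
count_for_9 = min_support_count(60, 9)   # 60% of 9 is 5.4, rounded up
```

With 5 transactions an itemset must occur in at least 3 of them; with 9 transactions it must occur in at least 6.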
13
votes
3 answers

Latent Semantic Analysis concepts

I've read about using Singular Value Decomposition (SVD) to do Latent Semantic Analysis (LSA) on a corpus of texts. I understand how to do that, and I also understand the mathematical concepts of SVD. But I don't understand why it works when applied to…
stemm
  • 5,960
  • 2
  • 34
  • 64
13
votes
2 answers

Weka simple K-means clustering assignments

I have what feels like a simple problem, but I can't seem to find an answer. I'm pretty new to Weka, but I feel like I've done a bit of research on this (at least read through the first couple of pages of Google results) and come up dry. I am using…
machine yearning
  • 9,889
  • 5
  • 38
  • 51
13
votes
1 answer

Naive Bayesian for Topic detection using "Bag of Words" approach

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a Naive Bayesian approach that I might be able to look up for this? Also, I am trying to improve my dictionary as I go along.…
AlgoMan
  • 2,785
  • 6
  • 34
  • 40
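For the bag-of-words question above, a minimal sketch of a multinomial Naive Bayes topic classifier with add-one (Laplace) smoothing may help make the idea concrete. Everything here is an illustrative assumption: the function names, the whitespace tokenisation, the choice to ignore words unseen in training, and the tiny training set.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (topic, text) pairs. Returns per-topic word counts,
    per-topic document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for topic, text in docs:
        words = text.lower().split()
        word_counts[topic].update(words)
        doc_counts[topic] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the topic with the highest log posterior for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        # log prior: share of training documents with this topic
        score = math.log(doc_counts[topic] / total_docs)
        total_words = sum(word_counts[topic].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are skipped
            # add-one smoothing keeps unseen (topic, word) pairs non-zero
            score += math.log((word_counts[topic][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

model = train([
    ("sports", "the team won the match"),
    ("sports", "great goal in the final match"),
    ("tech", "new phone with fast processor"),
    ("tech", "the laptop has a fast chip"),
])
topic_a = classify("fast new laptop", model)
topic_b = classify("goal match", model)
```

Because the model is just counts, newly labelled documents can be folded in by updating the counters and vocabulary, which is one way to grow the dictionary over time.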
13
votes
3 answers

Cosine distance as vector distance function for k-means

I have a graph of N vertices where each vertex represents a place. Also I have vectors, one per user, each one of N coefficients where the coefficient's value is the duration in seconds spent at the corresponding place or 0 if that place was not…
Thalis K.
  • 7,363
  • 6
  • 39
  • 54
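As a small illustration for the cosine-distance question above: cosine distance compares the direction of two duration vectors and ignores their length. For unit-length vectors, squared Euclidean distance equals twice the cosine distance, so one workaround that is sometimes used is to L2-normalise each user vector and then run ordinary k-means; this only approximates clustering by cosine distance, because the centroids are not re-normalised. The function names and sample vectors are hypothetical.

```python
import math

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in u))

def cosine_distance(u, v):
    """1 - cosine similarity. Undefined for all-zero vectors."""
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (nu * nv)

def normalize(u):
    """Scale a non-zero vector to unit length."""
    n = norm(u)
    return [a / n for a in u]

def sq_euclidean(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Seconds spent at three places by two hypothetical users
user_a = [3600, 0, 120]
user_b = [0, 1800, 60]

d_cos = cosine_distance(user_a, user_b)
d_euc = sq_euclidean(normalize(user_a), normalize(user_b))
```

Here `d_euc` matches `2 * d_cos` up to floating-point error. A user who visited no place at all has an all-zero vector and needs separate handling.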
13
votes
2 answers

What method do you use for selecting the optimum number of clusters in k-means and EM?

Many algorithms for clustering are available. A popular one is K-means, where, based on a given number of clusters, the algorithm iterates to find the best clusters for the objects. What method do you use to determine the number of clusters in…
gd047
  • 29,749
  • 18
  • 107
  • 146
13
votes
3 answers

WEKA Tutorials / Examples for a Newbie

In a follow-up to this answer, I want to ask if any of you know any good (and, more importantly, easy to understand) tutorials and/or examples of data mining with the Weka toolkit. I've been very interested in data mining ever since I first heard…
Alix Axel
  • 151,645
  • 95
  • 393
  • 500
13
votes
6 answers

Monitor brands with common words

Let's say you should monitor the brand "ONE" online. What algorithms can be used to separate pages about the brand ONE from pages containing the common word ONE? I'm thinking maybe Bayes could work, but are there other ways to do this?
Christian Davén
  • 16,713
  • 12
  • 64
  • 77
13
votes
4 answers

Hierarchical Clustering: Determine optimal number of clusters and statistically describe clusters

I could use some advice on methods in R to determine the optimal number of clusters and later on describe the clusters with different statistical criteria. I’m new to R with basic knowledge about the statistical foundations of cluster analysis.…
Joschi
  • 2,941
  • 9
  • 28
  • 36
12
votes
2 answers

What free/paid search APIs allow for programmatic querying and caching/storage of the resulting data?

If you've done any serious research into search APIs, you know that most of them have a huge slew of TOS/TOU restrictions that make them nearly impossible to use in anything but the most inane applications. Bing's 2.0 API, Yahoo Search BOSS, Google…
rinogo
  • 8,491
  • 12
  • 61
  • 102
12
votes
2 answers

Python, Scipy: Building triplets using large adjacency matrix

I am using an adjacency matrix to represent a network of friends which can be visually interpreted as
Mary   0 1 1 1
Joe    1 0 1 1
Bob    1 1 0 1
Susan  1 1 1 0
…
will
  • 225
  • 1
  • 7
12
votes
4 answers

Outlier detection in data mining

I have a few questions regarding outlier detection: Can we find outliers using k-means, and is this a good approach? Is there any clustering algorithm which does not accept any input from the user? Can we use a support vector machine or any…
Navin
  • 411
  • 3
  • 9
  • 17
12
votes
5 answers

Randomness in Artificial Intelligence & Machine Learning

This question came to my mind while working on two projects in AI and ML. What if I'm building a model (e.g. a classification neural network, k-NN, etc.) and this model uses some function that includes randomness? If I don't fix the seed, then I'm…
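To make the seed question above concrete, here is a minimal sketch using only Python's standard `random` module. The `noisy_scores` function is a hypothetical stand-in for any model step that draws random numbers; it is not from the question.

```python
import random

def noisy_scores(seed=None, n=3):
    """Stand-in for a model step that relies on randomness.

    A private Random instance keeps the seed local to this call
    instead of changing the global random state.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

run_1 = noisy_scores(seed=42)
run_2 = noisy_scores(seed=42)   # same seed: identical draws
run_3 = noisy_scores(seed=7)    # different seed: different draws
```

With a fixed seed the two runs are identical, which makes results reproducible. Reporting results over several different seeds shows how much of a model's score depends on the random draws.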