Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools such as SQL Server Analysis Services predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that were traditionally too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Inputs to learning and mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
29
votes
5 answers

how to determine the number of topics for LDA?

I am new to LDA and want to use it in my work, but I have run into some problems. To get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can calculate logP(w|z)…
Chelsea Wang
  • 599
  • 2
  • 5
  • 19
29
votes
5 answers

Algorithm to find the most common substrings in a string

Is there any algorithm that can be used to find the most common phrases (or substrings) in a string? For example, the following string would have "hello world" as its most common two-word phrase: "hello world this is hello world. hello world repeats…
Anderson Green
  • 30,230
  • 67
  • 195
  • 328
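A minimal sketch of one way to approach the phrase-counting question above, assuming "phrase" means a run of n consecutive words and that punctuation can be ignored. The function name `most_common_phrases` and its parameters are illustrative, not taken from the question or its answers, and only the visible part of the question's example string is used.

```python
import re
from collections import Counter

def most_common_phrases(text, n=2, top=3):
    """Return the `top` most frequent n-word phrases in `text` (hypothetical helper)."""
    # Lowercase and keep only word characters so punctuation does not split counts.
    words = re.findall(r"\w+", text.lower())
    # Slide a window of n words over the token list and join each window into a phrase.
    ngrams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(ngrams).most_common(top)

# Only the visible portion of the example string from the question.
sample = "hello world this is hello world. hello world repeats"
result = most_common_phrases(sample)
print(result[0])  # ('hello world', 3)
```

Because punctuation is dropped, this sketch also counts phrases that span a sentence boundary (for example "world hello"); splitting on sentences first would avoid that. For very long inputs, a suffix array or suffix tree may be a better fit than enumerating every window.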
28
votes
5 answers

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form: When I execute k <- kmeans(norm,center=3) I receive the following error: Error in do_one(nmeth) : NA/NaN/Inf in…
Jonathan Rhein
  • 1,616
  • 3
  • 23
  • 47
27
votes
20 answers

Data Mining open source tools

I'm due to take up a data mining project. Before I jump in, I wanted to look around for different data mining tools (preferably open source) that allow web-based reporting. In my scenario the data would be provided to me, so I'm not…
Arnkrishn
  • 29,828
  • 40
  • 114
  • 128
27
votes
5 answers

random unit vector in multi-dimensional space

I'm working on a data mining algorithm where I want to pick a random direction from a particular point in the feature space. If I pick a random number for each of the n dimensions from [-1,1] and then normalize the vector to a length of 1, will I…
Matt
  • 1,513
  • 3
  • 16
  • 32
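A hedged sketch related to the question above. Sampling each coordinate uniformly from [-1,1] and normalizing tends to favor directions toward the corners of the cube, so the result is not uniform over the sphere. A commonly used alternative is to draw each coordinate from a standard normal distribution and then normalize, which works because the multivariate normal is rotationally symmetric. The function name `random_unit_vector` is illustrative.

```python
import math
import random

def random_unit_vector(n, rng=random):
    """Draw a direction uniformly from the unit sphere in n dimensions (sketch)."""
    while True:
        # Independent standard normal coordinates give a rotationally symmetric vector.
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        # Retry in the (practically impossible) case of a zero vector.
        if norm > 0.0:
            return [x / norm for x in v]

direction = random_unit_vector(5)
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.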
27
votes
3 answers

Difference between Closed and open Sequential Pattern Mining Algorithms

I want to use some algorithms to mine my log data. I found a pattern mining framework at: http://www.philippe-fournier-viger.com/spmf/index.php?link=algorithms.php I have tried several algorithms, and the BIDE+ algorithm performs the best. The BIDE+…
leon
  • 10,085
  • 19
  • 60
  • 77
26
votes
6 answers

Fast (< n^2) clustering algorithm

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a specified radius). That means that there probably has…
26
votes
3 answers

Clustering values by their proximity in python (machine learning?)

I have an algorithm that is running on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set. The sorted output is something like…
PCoelho
  • 7,850
  • 11
  • 31
  • 36
26
votes
3 answers

Javascript and Scientific Processing?

MATLAB, R, and Python are powerful but either costly or slow for some data mining work I'd like to do. I'm considering using JavaScript for its speed, its good visualization libraries, and the ability to use the browser as an interface. The first…
MikeB
  • 788
  • 1
  • 9
  • 27
25
votes
2 answers

Hierarchical clustering of 1 million objects

Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange. hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but…
25
votes
6 answers

What is the difference between Big Data and Data Mining?

As Wikipedia states, "The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use." How is this related to Big Data? Is it correct if I say that Hadoop…
DesirePRG
  • 6,122
  • 15
  • 69
  • 114
24
votes
7 answers

Finding 2 & 3 word Phrases Using R TM Package

I am trying to find code that actually works for finding the most frequently used two- and three-word phrases with the R text mining package (maybe there is another package for it that I do not know of). I have been trying to use the tokenizer, but seem to have…
appletree
  • 353
  • 2
  • 5
  • 10
24
votes
3 answers

Using frequent itemset mining to build association rules?

I am new to this area and its terminology, so please feel free to correct me if I go wrong somewhere. I have two datasets like this: Dataset 1: A B C 0 E A 0 C 0 0 A 0 C D E A 0 C 0 E The way I interpret this is at some point in time, (A,B,C,E)…
Legend
  • 113,822
  • 119
  • 272
  • 400
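A rough sketch of the counting step behind frequent itemset mining, applied to Dataset 1 from the question above. It assumes the flattened values form four rows of five columns and that "0" marks an absent item; both are readings of the excerpt, not something the question confirms. The brute-force enumeration and the name `frequent_itemsets` are illustrative only; algorithms such as Apriori or FP-Growth prune this search on real data.

```python
from collections import Counter
from itertools import combinations

# Dataset 1, read as four rows of five columns (assumption), with "0" meaning "absent".
rows = [
    ["A", "B", "C", "0", "E"],
    ["A", "0", "C", "0", "0"],
    ["A", "0", "C", "D", "E"],
    ["A", "0", "C", "0", "E"],
]
transactions = [frozenset(x for x in row if x != "0") for row in rows]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those seen at least min_support times."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            # Sorting gives each itemset one canonical tuple key.
            for combo in combinations(sorted(t), k):
                counts[combo] += 1
    return {items: c for items, c in counts.items() if c >= min_support}

frequent = frequent_itemsets(transactions, min_support=3)

# Confidence of the rule E -> A is support(A, E) / support(E).
confidence_e_to_a = frequent[("A", "E")] / frequent[("E",)]
```

With a minimum support of 3, the itemsets containing B or D drop out, and the rule E -> A has confidence 1.0 because every row containing E also contains A.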
24
votes
4 answers

Information retrieval (IR) vs data mining vs Machine Learning (ML)

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them. From people with experience in these fields, what exactly draws the line between these?
Boris Yeltz
  • 2,341
  • 5
  • 21
  • 20
23
votes
2 answers

What is the difference between a Confusion Matrix and Contingency Table?

I'm writing a piece of code to evaluate my clustering algorithm, and I find that every kind of evaluation method needs the basic data from an m*n matrix like A = {aij} where aij is the number of data points that are members of class ci and elements…
MangMang
  • 427
  • 1
  • 5
  • 17
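A small sketch of how such an m*n count matrix might be built. The excerpt above is cut off, so this assumes the second index refers to the cluster assigned by the algorithm, which is the usual setup when evaluating a clustering against known classes. The function name `contingency_table` and the toy labels are hypothetical.

```python
from collections import Counter

def contingency_table(classes, clusters):
    """Build a table whose entry [i][j] counts points in class i and cluster j."""
    class_ids = sorted(set(classes))
    cluster_ids = sorted(set(clusters))
    # Count each (class, cluster) pair once per data point.
    counts = Counter(zip(classes, clusters))
    table = [[counts[(c, k)] for k in cluster_ids] for c in class_ids]
    return class_ids, cluster_ids, table

# Toy labels, invented for illustration only.
true_classes = ["x", "x", "y", "y", "y"]
found_clusters = [0, 0, 0, 1, 1]
class_ids, cluster_ids, table = contingency_table(true_classes, found_clusters)
print(table)  # [[2, 0], [1, 2]]
```

A confusion matrix is usually described as the special case where rows and columns share the same label set (true versus predicted class), whereas cluster ids carry no inherent correspondence to class labels.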