Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools such as SQL Server Analysis Services predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve: they scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Inputs to learning and mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
9
votes
4 answers

Which datamining tool to use?

Can somebody explain to me the main pros and cons of the best-known open-source data mining tools? Everywhere I read that RapidMiner, Weka, Orange, and KNIME are the best ones. Look at this blog post. Can somebody do a quick technical comparison in a small…
user2670818
  • 719
  • 5
  • 12
  • 28
9
votes
2 answers

Efficient algorithm to group points in clusters by distance between every two points

I am looking for an efficient algorithm for the following problem: Given a set of points in 2D space, where each point is defined by its X and Y coordinates, I need to split this set of points into a set of clusters so that if the distance between two…
ovk
  • 2,318
  • 1
  • 23
  • 30
9
votes
5 answers

'Similarity' in Data Mining

In the field of Data Mining, is there a specific sub-discipline called 'Similarity'? If so, what does it deal with? Any examples, links, or references would be helpful. Also, being new to the field, I would like the community's opinion on how closely…
Shailesh Tainwala
  • 6,299
  • 12
  • 58
  • 69
9
votes
1 answer

What FFT descriptors should be used as feature to implement classification or clustering algorithm?

I have some sampled geographical trajectories to analyze, and I calculated the histogram of the data in the spatial and temporal dimensions, which yielded a time-domain feature for each spatial element. I want to perform a discrete FFT to transform the…
9
votes
3 answers

How to plot/visualize a C50 decision tree in R?

I am using the C50 decision tree algorithm. I am able to build the tree and get the summaries, but cannot figure out how to plot or visualize the tree. My C50 model is called credit_model. In other decision tree packages, I usually use something like…
mpg
  • 3,679
  • 8
  • 36
  • 45
9
votes
3 answers

Historical weather data from NOAA

I am working on a data mining project and I would like to gather historical weather data. I am able to get historical data through the web interface that they provide at http://www.ncdc.noaa.gov/cdo-web/search. But I would like to access this data…
azrosen92
  • 8,357
  • 4
  • 26
  • 45
9
votes
2 answers

TFIDF calculating confusion

I found the following code on the internet for calculating TFIDF: https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py I added "1+" in the function def idf(word, documentList) so I won't get a division-by-zero error: return…
badc0re
  • 3,333
  • 6
  • 30
  • 46
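A minimal Python sketch of the smoothing described in the question above, assuming the "1+" is added to the document-frequency denominator inside the logarithm. The linked script is not reproduced here; the function names `tf`, `idf`, and `tf_idf`, the token-list representation of documents, and the toy corpus are all illustrative assumptions.

```python
import math


def tf(word, document):
    """Term frequency: share of tokens in `document` equal to `word`."""
    if not document:
        return 0.0
    return document.count(word) / len(document)


def idf(word, documents):
    """Smoothed inverse document frequency: log(N / (1 + df)).

    The "1 +" keeps the denominator non-zero when no document
    contains `word` (assumed placement of the smoothing term).
    """
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + containing))


def tf_idf(word, document, documents):
    """Product of term frequency and smoothed IDF."""
    return tf(word, document) * idf(word, documents)


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["data", "mining", "tools"],
    ["data", "cleaning"],
    ["text", "mining"],
]

for word in ("data", "text", "missing"):
    print(word, idf(word, docs))
```

With this placement of the smoothing term, a word found in every document gets a slightly negative IDF, log(N / (N + 1)), and a word found in all but one document gets exactly zero, which may account for surprising scores on small corpora.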
9
votes
5 answers

Similarity distance measures

I have vectors like these: v1 = {0 0 0 1 1 0 0 1 0 1 1}, v2 = {0 1 1 1 1 1 0 1 0 1 0}, v3 = {0 0 0 0 0 0 0 0 0 0 1}. I need to calculate the similarity between them. The Hamming distance between v1 and v2 is 4, and between v1 and v3 it is also 4. But because I am…
user1306283
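The Hamming distances quoted in the question above can be checked with a short Python sketch. The Jaccard similarity is included only as one possible alternative that counts shared 1s and ignores shared 0s; the question text is truncated, so whether that is what the asker needs is an assumption. The function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def jaccard_similarity(a, b):
    """Shared 1s divided by positions where at least one vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero vectors are treated as identical (a convention, not a rule).
    return both / either if either else 1.0


v1 = [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
v2 = [0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0]
v3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(hamming(v1, v2), hamming(v1, v3))                        # 4 4
print(jaccard_similarity(v1, v2), jaccard_similarity(v1, v3))  # 0.5 0.2
```

Hamming distance treats a shared 0 and a shared 1 as equally informative, which is why both pairs score 4; a measure based on shared 1s separates them.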
9
votes
4 answers

Splitting data into training/testing datasets in MATLAB?

After some research I found two functions in MATLAB to do the task: the cvpartition function in the Statistics Toolbox and the crossvalind function in the Bioinformatics Toolbox. Now I've used cvpartition to create n-fold cross validation subsets before,…
Amro
  • 123,847
  • 25
  • 243
  • 454
9
votes
1 answer

How to perform collaborative filtering in R

I have matrix data containing some null values. To fill the null values, I'd like to perform collaborative filtering. As I am learning R, I'd prefer to use R. Does anyone know how to perform collaborative filtering in R?
Chappy 003
  • 444
  • 1
  • 5
  • 15
9
votes
5 answers

When are n-grams (n>3) important as opposed to just bigrams or trigrams?

I am just wondering what the use of n-grams (n>3) (and their occurrence frequency) is, considering the computational overhead of computing them. Are there any applications where bigrams or trigrams are simply not enough? If so, what is the…
Legend
  • 113,822
  • 119
  • 272
  • 400
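For readers unfamiliar with the terms in the question above, here is a minimal Python sketch of extracting n-grams and counting their occurrence frequency. The whitespace tokenization, the function names, and the sample sentence are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous runs of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    """Occurrence frequency of each distinct n-gram."""
    return Counter(ngrams(tokens, n))


tokens = "to be or not to be".split()

for n in (2, 3, 4):
    counts = ngram_counts(tokens, n)
    print(n, len(counts), counts.most_common(1))
```

Extraction itself is linear in the text length for any n. The overhead the question refers to comes mostly from storage and sparsity: with a vocabulary of V words there are up to V^n possible n-grams, so counts for larger n become sparse quickly.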
9
votes
1 answer

OpenNLP Name Finder

I am using the NameFinder API example doc of OpenNLP. After initializing the Name Finder the documentation uses the following code for the input text: for (String document[][] : documents) { for (String[] sentence : document) { Span…
Chris
  • 18,075
  • 15
  • 59
  • 77
8
votes
2 answers

Combining different similarities to build one final similarity

I'm pretty new to data mining and recommendation systems, and I'm now trying to build some kind of recommender system for users who have these parameters: city, education, interest. To calculate the similarity between them I'm going to apply cosine similarity and…
Leg0
  • 510
  • 9
  • 21
8
votes
3 answers

What are the differences between Dynamic Time Warping and Needleman-Wunsch algorithm?

I am looking for the differences between the Dynamic Time Warping and Needleman-Wunsch algorithms. Basically, they both find an alignment score. I need to calculate an alignment (similarity) score between short sequences of strings (<20 characters) and…
iinception
  • 1,945
  • 2
  • 21
  • 19
8
votes
5 answers

Machine learning library for .net analog of Apache Mahout

Are there libraries for .NET like Mahout? What can you recommend for machine learning?
John
  • 864
  • 1
  • 11
  • 26