Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaning from them. Data mining tools, such as SQL Server Analysis Services, predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. The inputs to data mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
12 votes, 4 answers

Maximal vs. Closed Patterns in Association Rule Mining

In frequent itemset generation for association rule mining, what is the fundamental difference between maximal and closed itemsets? Can someone point me to a resource about them?
Michael
  • 121
  • 1
  • 1
  • 4
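
To make the distinction in this question concrete, here is a minimal brute-force sketch. It assumes the usual textbook definitions: a frequent itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent at all. The function names and the toy transactions are made up for illustration, and the enumeration is exponential, so this is only suitable for tiny examples.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support (brute force)."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_support:
                result[candidate] = count
    return result


def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        s for s, count in frequent.items()
        if not any(s < other and count == other_count
                   for other, other_count in frequent.items())
    }


def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


# Hypothetical toy data: four transactions over items a, b, c.
transactions = [frozenset(t) for t in (("a", "b", "c"), ("a", "b"), ("a", "c"), ("a",))]
frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
maximal = maximal_itemsets(frequent)
```

With this data, {a} is closed but not maximal (its supersets {a, b} and {a, c} are frequent but have lower support), while {a, b} and {a, c} are both closed and maximal. Every maximal itemset is closed, but not the other way around.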
12 votes, 2 answers

normalization methods for stream data

I am using the CluStream algorithm and have figured out that I need to normalize my data. I decided to use min-max normalization, but I think that this way the values of newly arriving data objects will be calculated differently as the values of min…
T.Sh
  • 390
  • 2
  • 16
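
The concern raised in this question can be illustrated with a small sketch. It assumes a naive scheme in which the minimum and maximum are simply updated as each point arrives; the class name `RunningMinMaxScaler` is hypothetical and this is not presented as the asker's code or as a recommended fix.

```python
class RunningMinMaxScaler:
    """Min-max scaling whose bounds are updated as new points arrive."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def update(self, x):
        # Widen the observed range if x falls outside it.
        self.lo = x if self.lo is None else min(self.lo, x)
        self.hi = x if self.hi is None else max(self.hi, x)

    def scale(self, x):
        # With fewer than two distinct values the range is empty; return 0.0.
        if self.lo is None or self.hi == self.lo:
            return 0.0
        return (x - self.lo) / (self.hi - self.lo)


scaler = RunningMinMaxScaler()
scaled = []
for value in [10.0, 20.0, 15.0, 40.0]:
    scaler.update(value)
    scaled.append(scaler.scale(value))

# The same raw value is scaled differently once the range has grown.
rescaled_20 = scaler.scale(20.0)
```

Here the value 20.0 is mapped to 1.0 when it arrives but to about 0.33 after 40.0 has been seen, which is exactly the inconsistency the question describes.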
12 votes, 6 answers

What is the difference between association rule mining and frequent itemset mining?

I am new to data mining and confused about association rules and frequent itemset mining. To me they seem the same, but I would like views from experts on this forum. My question is: what is the difference between association rule mining and frequent itemset…
Zia
  • 345
  • 2
  • 5
  • 12
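
As a rough illustration of how the two tasks relate, the sketch below assumes the standard framing in which frequent itemset mining finds itemsets with enough support, and association rule mining then splits such an itemset into antecedent and consequent and keeps the rules whose confidence is high enough. The helper names and the toy transactions are invented for this example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_from_itemset(itemset, transactions, min_confidence):
    """Derive rules antecedent -> consequent from one frequent itemset."""
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            # confidence(A -> C) = support(A and C together) / support(A)
            confidence = support(itemset, transactions) / support(antecedent, transactions)
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical toy data.
transactions = [frozenset(t) for t in (
    ("bread", "milk"), ("bread", "milk"), ("bread",), ("bread", "butter"))]
rules = rules_from_itemset(frozenset({"bread", "milk"}), transactions, min_confidence=0.8)
```

The itemset {bread, milk} has a single support value, but it yields two candidate rules with different confidences: milk -> bread holds in every transaction containing milk, while bread -> milk holds in only half of those containing bread, so only the former survives the threshold.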
12 votes, 3 answers

Clustering a very large dataset in R

I have a dataset consisting of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers; however, if I try the classical clustering approach, then I would have to establish a 70,000X70,000…
DOSMarter
  • 1,485
  • 5
  • 21
  • 29
12 votes, 4 answers

Creating a comparable and flexible fingerprint of an object

My situation Say I have thousands of objects, which in this example could be movies. I parse these movies in a lot of different ways, collecting parameters, keywords and statistics about each of them. Let's call them keys. I also assign a weight to…
Magnus Engdal
  • 5,446
  • 3
  • 31
  • 50
12 votes, 4 answers

Using adaboost within R's caret package

I've been using the ada R package for a while, and more recently, caret. According to the documentation, caret's train() function should have an option that uses ada. But, caret is puking at me when I use the same syntax that sits within my ada()…
Bryan
  • 5,999
  • 9
  • 29
  • 50
12 votes, 2 answers

How to use Weka for predicting results

I'm new to Weka and I'm confused with the tool. I have a data set about fruit prices and related attributes. I'm trying to predict the specific fruit price using the data set. Since I'm new to Weka, I couldn't figure out how to do this task. Please…
12 votes, 1 answer

In scikit-learn, how do I deal with data that mixes numerical and nominal values?

I know that the computation in scikit-learn is based on NumPy so everything is a matrix or array. How does this package handle mixed data (numerical and nominal values)? For example, a product could have the attribute 'color' and 'price', where…
12 votes, 7 answers

What is data mining from a developer's perspective?

I can find the technical explanation of what data mining is in a book or on Wikipedia, but I'm wondering what sort of development exactly it involves. Is it more about using tools or more about writing tools? Is it really much different from…
aberrant80
  • 12,815
  • 8
  • 45
  • 68
11 votes, 4 answers

Algorithm to handle data aggregation from multiple error-prone sources

I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain…
Matt Green
  • 2,032
  • 2
  • 22
  • 36
11 votes, 2 answers

What is stratified bootstrap?

I have learned bootstrap and stratification. But what is stratified bootstrap? And how does it work? Let's say we have a dataset of n instances (observations), and m is the number of classes. How should I divide the dataset, and what's the…
Kevin217
  • 724
  • 1
  • 10
  • 20
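
One common reading of the term in this question is that the bootstrap resampling is done separately inside each class (stratum), so that every class keeps its original size in the resample. The sketch below works under that assumption; the function name `stratified_bootstrap` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_bootstrap(labels, rng):
    """Return indices resampled with replacement separately inside each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    sample = []
    for indices in by_class.values():
        # Draw as many indices as the class has members, with replacement.
        sample.extend(rng.choices(indices, k=len(indices)))
    return sample


# Hypothetical imbalanced dataset: 2 "spam" and 8 "ham" instances.
labels = ["spam"] * 2 + ["ham"] * 8
sample = stratified_bootstrap(labels, random.Random(0))
sampled_labels = [labels[i] for i in sample]
```

Unlike a plain bootstrap over all n instances, this can never produce a resample in which the minority class is missing: the class counts in `sampled_labels` always match those in `labels`.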
11 votes, 2 answers

Removing "almost duplicate" strings in subquadratic time

I'm trying to do machine learning on a real-life dataset (hotel reviews). Unfortunately, it's plagued by spam, which comes in the form of almost identical reviews, complicating matters for me greatly. I would like to remove "almost duplicates" from…
Alexei Averchenko
  • 1,706
  • 1
  • 16
  • 29
11 votes, 2 answers

Estimating/Choosing optimal Hyperparameters for DBSCAN

I need to find naturally occurring classes of nouns based on their distribution with different prepositions (agentive, instrumental, time, place, etc.). I tried k-means clustering, but it was of little help; it didn't work well, and there was a lot of…
Riyaz
  • 1,430
  • 2
  • 17
  • 27
11 votes, 1 answer

Number of hidden layers, units in hidden layers, and epochs until a neural network starts behaving acceptably on training data

I am trying to solve this Kaggle problem using neural networks. I am using the PyBrain Python library. It's a classical supervised learning problem. In the following code, the 'data' variable is a NumPy array (892*8). 7 fields are my features and 1 field is my…
11 votes, 2 answers

How can I cluster documents using k-means (FLANN with Python)?

I want to cluster documents based on similarity. I have tried ssdeep (similarity hashing), which is very fast, but I was told that k-means is faster and that FLANN is the fastest of all implementations and more accurate, so I am trying FLANN with its Python bindings but…
Phyo Arkar Lwin
  • 6,673
  • 12
  • 41
  • 55