Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools, such as SQL Server Analysis Services, predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Inputs to learning and mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
1
vote
0 answers

predictive attributes in WEKA

I am trying to select the best attributes for my training data set, which contains numeric values/attributes. Which attribute evaluator/method would yield the best results for about 10 or so attributes? The training dataset is about 1400 lines of…
1
vote
1 answer

R-convert transaction format dataset to basket format for sequence mining

ORIGINAL TABLE
CELL NUMBER    ACTIVITY    TIME
001            call a      12.23
002            call b      01.00
002            call…
steven
1
vote
3 answers

Performance of Frequent Itemset mining

I have implemented the Apriori algorithm for mining frequent itemsets. It works fine for sample data, but when I tried to run it on the retail dataset available at http://fimi.ua.ac.be/data/retail.dat, which is around 3 MB of data with 88k transaction…
1
vote
1 answer

scikit-learn interpretation of integer variables

I've just started to use scikit-learn after years of data mining with SAS/SPSS products. I'm amazed by the capabilities of scikit-learn and pandas; however, there is one thing I can't figure out by myself. Let us assume that my training data is built up…
dealah
1
vote
2 answers

why training and testing file same in svmlight

I downloaded SVM-Light for Linux and ran the commands. It produced two executables, svm_learn and svm_classify. Using these, I tried to execute an example (it contains train.dat and test.dat files) with the following code: ./svm_learn example1/train.dat…
user39133
1
vote
2 answers

Clustering huge data matrix in python?

I want to cluster 1.5 million chemical compounds. This means having a 1.5 million x 1.5 million distance matrix... I think I can generate such a big table using PyTables, but once I have such a table, how will I cluster it? I guess I can't just pass…
mnowotka
1
vote
4 answers

Implementation of k-means clustering algorithm

In my program, I'm taking k=2 for the k-means algorithm, i.e. I want only 2 clusters. I have implemented it in a very simple and straightforward way, but I'm still unable to understand why my program gets into an infinite loop. Can anyone please guide me where…
chinu
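For reference, a k-means loop normally terminates because it stops once the cluster assignments no longer change. The sketch below is only an illustration of that stopping rule, not the asker's code: the function name `kmeans`, the `max_iter` safety cap, and seeding the centroids with the first k points are all assumptions made for the example.

```python
import math


def kmeans(points, k=2, max_iter=100):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):  # hard cap so the loop can never run forever
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stopping rule: no assignment changed, so the centroids are stable.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels
```

Comparing assignments rather than floating-point centroid values avoids one common cause of non-termination, where centroids keep differing by rounding error and an exact equality test never succeeds.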
1
vote
1 answer

How to compute a knee in k-distance plot?

I want to implement an improvement of the DBSCAN algorithm where the user does not need to enter the input parameters (minPts and Eps). My idea is to use the k-distance plot, but what is the best method to compute the 'knee' of this plot? How to count…
user3146344
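One common heuristic for locating a knee, offered here only as a sketch and not as the best or the accepted method, is to take the point on the sorted k-distance curve that lies farthest from the straight line joining its first and last points. The helper names `k_distances` and `knee_index` are hypothetical, and the brute-force neighbour search is only suitable for small inputs.

```python
import math


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted descending."""
    dists = []
    for i, p in enumerate(points):
        # Brute-force O(n^2) search, fine for a small illustration.
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dists.append(others[k - 1])
    return sorted(dists, reverse=True)


def knee_index(values):
    """Index of the value farthest from the chord joining the first and last values."""
    n = len(values)  # assumes at least two values
    x0, y0, x1, y1 = 0, values[0], n - 1, values[-1]
    norm = math.hypot(x1 - x0, y1 - y0)

    def gap(i):
        # Perpendicular distance from (i, values[i]) to the chord.
        return abs((y1 - y0) * (i - x0) - (x1 - x0) * (values[i] - y0)) / norm

    return max(range(n), key=gap)
```

The k-distance at the returned index could then serve as a candidate Eps, with k playing the role of minPts; whether that choice suits a given dataset still needs checking.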
1
vote
1 answer

Finding data patterns in sequential Postgresql rows

I'd like to ask Postgres how often two occurrences of an event, one occurrence per row, are seen. For example, if I have user events like:
User 1: Clicked button 1, redirected to page 2
User 1: Clicked button 2, redirected to page 3
User 1: Clicked…
Carson
1
vote
1 answer

Aggregating overlapping "all-previous-events" features from time series data - in Python

My problem is pretty general and can probably be solved in many ways. But what is a smart way considering time and memory? I have time series data of user interactions of the following form:
cookie_id  interaction
---------  -----------
1234 …
elgehelge
1
vote
2 answers

Pattern mining for item sets of length 2

I am looking for an association mining algorithm that mines frequent itemsets of length 2 only. Is it better to use a database query to compute the frequent items when stopping at 2-itemsets?
user1239080
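When only 2-itemsets are needed, the full level-wise machinery is not required: two counting passes are enough. The following is a minimal sketch under stated assumptions, not a recommendation over a database query. The function name `frequent_pairs` is hypothetical, transactions are assumed to be in-memory lists of hashable, sortable items, and `min_support` is an absolute count.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {(item_a, item_b): count} for pairs seen in >= min_support transactions."""
    # Pass 1: count single items, once per transaction.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count pairs built only from frequent items (Apriori-style pruning).
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)  # sorted so each pair has one canonical order
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= min_support}
```

For example, `frequent_pairs([["bread", "milk"], ["bread", "milk", "eggs"], ["milk", "eggs"]], 2)` keeps the pairs that co-occur in at least two of the three baskets.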
1
vote
1 answer

SQL To Find Word Pairs/Clusters Between Columns

I have a SQL Server 2012 database with a table that contains questions and answers. Simplified structure is like this:
question_id int
question varchar(500)
answer varchar(50)
I'd like to find word pairs or clusters between the question and…
1
vote
4 answers

Intelligent Database - Capable of identifying out of the ordinary values

I am looking for a tool or system to take a look at the database and identify values that are out of the ordinary. I don't need anything to do real time checks, just a system which does processing overnight or at scheduled points. I am looking for a…
Pasta
1
vote
2 answers

DBSCAN in hadoop

I don't know what the key and value for map() should be, or what the input and output formats should be. If map() reads one point at a time, how can the neighbors be computed from a single point, given that the remaining points are not…
1
vote
1 answer

Clustering Data in a 3D matrix with another matrix

I have two data cubes represented as 3D matrices, both with the same dimensions. We have to do rule-based ordering. Our condition is that if any sub-cube of both of them (the sub-cube must match exactly in location and orientation) matches…
infinitum