Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaning from them. Data mining tools, such as SQL Server Analysis Services, predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. The inputs to data mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
1
vote
2 answers

Understanding Partition density of partitioned network

I am implementing the Link Communities community detection algorithm. I have trouble understanding the explanation of partition density described in the paper. Here is only the part defining partition density: I cannot find the connection between…
hendrix
  • 3,364
  • 8
  • 31
  • 46
1
vote
0 answers

Declarative Data Mining: Frequent Itemset Tiling

For a course in my Computer Science studies, I have to come up with a set of constraints and a score-definition to find a tiling for frequent itemset mining. The matrix with the data consists of ones and zeroes. My task is to come up with a set of…
1
vote
1 answer

Do we need to normalize input segment of training set only?

I want to know whether the required data normalization must be applied to the whole training set, both input and output, or whether normalizing the input segment is enough.
1
vote
2 answers

Data mining: Apriori issue. Min-support

I wrote an Apriori data mining algorithm. It works well on small test data, but I am having issues running it on bigger data sets. I am trying to generate rules for items that were frequently bought together. My small test data is 5 transactions and 10…
John Latham
  • 255
  • 1
  • 2
  • 9
1
vote
3 answers

alternative similarity measure in DBSCAN?

I am testing my image set with the DBSCAN algorithm in the scikit-learn Python module. There are alternatives for similarity computing: # Compute similarities D = distance.squareform(distance.pdist(X)) S = 1 - (D / np.max(D)) A weighted measure or something like…
postgres
  • 2,242
  • 5
  • 34
  • 50
1
vote
0 answers

What's a good way of storing R models for future scoring

Let's say I run random forest or kmeans. I get an R object. Now I want to save that model for future use. I thought PMML was a good format but then realized that R can't read PMML and turn it back into an object that can be used for scoring. It can…
user1827975
  • 427
  • 3
  • 10
1
vote
1 answer

How much mxRealloc can affect a C-Mex matlab code?

For the past few days I have been working on C-Mex code in order to improve the speed of my DBSCAN MATLAB code. In fact, I have now finished a DBSCAN implementation in C-Mex. But instead, it takes more time (14.64 seconds in MATLAB, 53.39 seconds in C-Mex) with my test data…
mrDataos
  • 21
  • 3
1
vote
3 answers

Writing a large number of queries to a text file

I have a list of about 200,000 entities, and I need to query a specific RESTful API for each of those entities, and end up with all the 200,000 entities saved in JSON format in txt files. The naive way of doing it is going through the list of the…
leonsas
  • 4,718
  • 6
  • 43
  • 70
1
vote
1 answer

Decision Tree - Sparse dataset

I have a very sparse dataset with a huge number of attributes (~12K features and 700K records). I cannot fit it in memory (attribute values are binomial, i.e. True/False). As it is sparse, I keep the dataset in (ID, Feature) format, so for example I…
Arian
  • 7,397
  • 21
  • 89
  • 177
1
vote
1 answer

How to make weka treat empty strings as 0

I'm using Weka for clustering binary data. Note that I use Weka directly through the API or the source code. My data input is a huge .csv file, for example: attrib1, attrib2, atrib3 0,1,0 1,0,1 0,0,1 But in order to reduce the .csv size, the data…
Flo
  • 1,367
  • 1
  • 13
  • 27
1
vote
1 answer

Changing feature value type in RapidMiner

I have a dataset with many attributes (2k), of which a few (about 10) are not binary and the rest are binary (0, 1). I want to change the value types of these binary attributes from integer to binomial. As the names of the features are not fixed, I…
Arian
  • 7,397
  • 21
  • 89
  • 177
1
vote
1 answer

How can I use the rule-based learning algorithms for this example

I have the following data, which I want to use for predictive learning about which features people find attractive in a model when purchasing clothes online: COLORofCLOTHING MODELHAIR_COLOR MODEL_BUILD SELLER_CATEGORY Red …
ExceptionHandler
  • 213
  • 1
  • 8
  • 24
1
vote
1 answer

Can SAS Enterprise Miner - Cluster node - take coordinate matrix as input?

I am using SAS proc distance to create a distance matrix. I wanted to know whether the SAS EM cluster node can use this matrix to perform k-means clustering.
1
vote
1 answer

Create Edge List From Ragged Data Frame in R (for network analysis)

I have a ragged data frame with each row as an occurrence in time of one or more entities, like so: (time1) entitya entityf entityz (time2) entityg entityh (time3) entityo entityp entityk entityL (time4) entityM I want to create an edge list for…
Olga Mu
  • 908
  • 2
  • 12
  • 23
1
vote
2 answers

Fast and scalable similarity detection

I have a large PostgreSQL database containing documents. Every document is represented as a row in the table. When a new document is added to the database, I need to check for duplicates. But I can't just use select to find an exact match. Two documents can vary…
Evgeny Lazin
  • 9,193
  • 6
  • 47
  • 83