Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing very large data sets to extract meaning from them. Data mining tools, such as SQL Server Analysis Services, predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve, scouring databases for hidden patterns and finding predictive information that experts may miss because it lies outside their expectations. The input records fed to a mining algorithm are variously called cases, samples, examples, instances, events, or observations.

3094 questions
11
votes
4 answers

What is Java Data Mining, JDM?

I am looking at JDM. Is this simply an API to interact with other tools that do the actual data mining? Or is this a set of packages that contain the actual data mining algorithms?
Anthony D
  • 10,877
  • 11
  • 46
  • 67
11
votes
1 answer

Effects of Stemming on the term frequency?

How are the term frequencies (TF), and inverse document frequency (IDF), affected by stop-word removal and stemming? Thanks!
Ataman
  • 2,530
  • 3
  • 22
  • 34
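A minimal sketch of the effect asked about above, assuming the common definitions TF = count / document length and IDF = log(N / document frequency). The stop-word list, the `crude_stem` helper, and the three-sentence corpus are hypothetical and only for illustration; `crude_stem` is a toy suffix stripper, not a real stemmer such as Porter.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "are", "of", "and"}  # tiny illustrative list

def crude_stem(word):
    """Toy suffix stripper (hypothetical helper, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text, remove_stop_words=False, stem=False):
    """Lowercase and split on whitespace, then optionally filter and stem."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens

def term_frequency(term, tokens):
    """Occurrences of the term divided by the document length."""
    return Counter(tokens)[term] / len(tokens)

def inverse_document_frequency(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / containing)

corpus = ["the cat chases the mice", "cats are chasing a mouse", "the dog sleeps"]
raw = [preprocess(d) for d in corpus]
cleaned = [preprocess(d, remove_stop_words=True, stem=True) for d in corpus]

# Removing stop words shortens the document, so the TF of surviving terms rises.
print(term_frequency("cat", raw[0]), term_frequency("cat", cleaned[0]))

# Stemming merges "cat" and "cats", so more documents contain the term and IDF falls.
print(inverse_document_frequency("cat", raw), inverse_document_frequency("cat", cleaned))
```

In this toy corpus "mice" and "mouse" stay distinct, which shows that the size of the effect depends on how aggressive the stemmer is.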
10
votes
8 answers

How to store many years' worth of 100 x 25 Hz time-series - SQL Server or time-series database

I am trying to identify possible methods for storing 100 channels of 25 Hz floating-point data. This will result in 78,840,000,000 data points per year. Ideally all this data would be efficiently available for websites and tools such as SQL Server…
Duncan
  • 101
  • 1
  • 3
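The yearly figure quoted above can be checked with a few lines of arithmetic. The 4-byte sample size is an assumption (the question only says "floating point"), so the storage estimate is a rough lower bound rather than a sizing recommendation.

```python
# Back-of-the-envelope check of the yearly sample count quoted in the question.
CHANNELS = 100
SAMPLE_RATE_HZ = 25
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year

samples_per_year = CHANNELS * SAMPLE_RATE_HZ * SECONDS_PER_YEAR

# Assumption (not stated in the question): each value is a 4-byte float.
BYTES_PER_SAMPLE = 4
raw_bytes_per_year = samples_per_year * BYTES_PER_SAMPLE

print(f"{samples_per_year:,} samples per year")
print(f"{raw_bytes_per_year / 1e9:.1f} GB per year before timestamps, indexes, or compression")
```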
10
votes
2 answers

CSV Autodetection in Java

What would be a reliable way of autodetecting that a file is actually CSV, if CSV was redefined to mean "Character-Separated Values", i.e. data using any single character (but typically any non-alphanumeric symbol) as the delimiter and not only…
PNS
  • 19,295
  • 32
  • 96
  • 143
10
votes
4 answers

DBMS_DATA_MINING.CREATE_MODEL causes "ORA-40103: invalid case-id column: TID" on 11.2.0.1.0 64b, but on 10g OK

I have a problem with DBMS_DATA_MINING.CREATE_MODEL on version 11.2. On 10g this code below works OK, and I'm quite sure that on 11.1 it works too. CREATE OR REPLACE VIEW "SH"."ITEMS" AS SELECT PROD_ID AS item FROM SALES GROUP BY PROD_ID; CREATE OR…
zacheusz
  • 8,750
  • 3
  • 36
  • 60
10
votes
4 answers

Predicting Values with k-Means Clustering Algorithm

I'm messing around with machine learning, and I've written a k-means algorithm implementation in Python. It takes two-dimensional data and organises it into clusters. Each data point also has a class value of either 0 or 1. What confuses me…
DizzyDoo
  • 1,489
  • 6
  • 21
  • 32
10
votes
1 answer

Java Support for PMML

I am new to PMML (Predictive Model Markup Language, www.dmg.org) and I was wondering if there is some kind of Java support (open source or professional) for creating/parsing PMML files. Initially I only have in mind the possibility of…
Oscar
  • 101
  • 1
  • 4
10
votes
2 answers

Mixed variables (categorical and numerical) distance function

I want to fuzzy cluster a set of jobs. Job attributes are: Categorical: position, diploma, skills. Numerical: salary, years of experience. My question is: how to calculate the distance between different jobs? e.g…
Mariya
  • 847
  • 1
  • 9
  • 25
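One common option for mixed attributes is a Gower-style distance: a 0/1 mismatch for each categorical attribute, a range-normalised absolute difference for each numerical attribute, and the average over all attributes. The sketch below assumes that approach; the sample jobs, the `gower_distance` name, and the treatment of "skills" as a single label are illustrative assumptions, not taken from the question.

```python
def gower_distance(a, b, categorical, numerical, ranges):
    """Average per-attribute dissimilarity, each scaled to [0, 1].

    Assumes every numerical attribute has a non-zero range.
    """
    parts = []
    for key in categorical:
        parts.append(0.0 if a[key] == b[key] else 1.0)  # simple matching
    for key in numerical:
        parts.append(abs(a[key] - b[key]) / ranges[key])  # range-normalised
    return sum(parts) / len(parts)

# Hypothetical sample data for illustration only.
jobs = [
    {"position": "developer", "diploma": "BSc", "skills": "java", "salary": 50000, "experience": 2},
    {"position": "developer", "diploma": "MSc", "skills": "java", "salary": 70000, "experience": 6},
    {"position": "analyst", "diploma": "MSc", "skills": "sql", "salary": 60000, "experience": 10},
]
CATEGORICAL = ["position", "diploma", "skills"]
NUMERICAL = ["salary", "experience"]

# Observed range of each numerical attribute across the data set.
ranges = {k: max(j[k] for j in jobs) - min(j[k] for j in jobs) for k in NUMERICAL}

for i in range(len(jobs)):
    for j in range(i + 1, len(jobs)):
        d = gower_distance(jobs[i], jobs[j], CATEGORICAL, NUMERICAL, ranges)
        print(f"distance(job {i}, job {j}) = {d:.2f}")
```

Because every term lies between 0 and 1, no single attribute such as salary dominates the result, and the resulting distance matrix can be handed to any clustering method that accepts precomputed distances.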
10
votes
1 answer

In TeamCity, is there a way of seeing a report of tests ordered by failed-most-often across the whole history?

We have some unreliable tests - unreliable because of environmental reasons. We'd like to see a history of which tests have failed the most often, so we can drill into why and fix the environment issue that causes that particular failure or class of…
Peter Mounce
  • 4,105
  • 3
  • 34
  • 65
10
votes
4 answers

Decision Tree Learning and Impurity

There are three ways to measure impurity: What are the differences and appropriate use cases for each method?
Jony
  • 6,694
  • 20
  • 61
  • 71
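The excerpt above does not show which three measures the question lists. The sketch below assumes the three most commonly taught ones: Gini index, entropy, and misclassification error. Each is computed from the class proportions at a node.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Proportion of each class among the labels at a node."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def gini(labels):
    """Gini index: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i)); every p_i here is > 0."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def misclassification_error(labels):
    """1 - max(p_i): error rate when predicting the majority class."""
    return 1.0 - max(class_probabilities(labels))

# Hypothetical nodes: an evenly mixed one and a mostly pure one.
mixed = [0, 0, 1, 1]
skewed = [0, 0, 0, 1]
for name, node in (("mixed", mixed), ("skewed", skewed)):
    print(name, gini(node), entropy(node), misclassification_error(node))
```

All three are zero for a pure node and largest for an even split; they differ in how sharply they reward splits that move a node toward purity.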
10
votes
2 answers

Comparing/Clustering Trajectories (GPS data of (x,y) points) and Mining the data

I've got 2 questions on analyzing a GPS dataset. 1) Extracting trajectories I have a huge database of recorded GPS coordinates of the form (latitude, longitude, date-time). According to date-time values of consecutive records, I'm trying to extract…
Murat Derya Özen
  • 2,154
  • 8
  • 31
  • 44
10
votes
2 answers

Computing Jaccard Similarity in Python

I have 20,000 documents that I want to compute the true Jaccard similarity for, so that I can later check how accurately MinWise hashing approximates it. Each document is represented as a column in a numpy matrix, where each row is a word that…
Magic8ball
  • 145
  • 1
  • 2
  • 8
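For the layout described above (rows are words, columns are documents), exact pairwise Jaccard similarity can be sketched with one matrix product. The function name `pairwise_jaccard` and the small example matrix are assumptions for illustration, and treating two empty documents as having similarity 0 is a convention chosen here, not something stated in the question.

```python
import numpy as np

def pairwise_jaccard(term_doc):
    """Exact Jaccard similarity between all pairs of document columns.

    term_doc: 2-D array-like, rows = words, columns = documents;
    any nonzero entry means the word occurs in the document.
    """
    presence = (np.asarray(term_doc) != 0).astype(np.int64)
    intersection = presence.T @ presence           # shared-word counts, docs x docs
    sizes = presence.sum(axis=0)                   # distinct words per document
    union = sizes[:, None] + sizes[None, :] - intersection
    safe_union = np.maximum(union, 1)              # avoid division by zero
    return np.where(union > 0, intersection / safe_union, 0.0)

# Hypothetical 4-word, 3-document example.
term_doc = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])
similarity = pairwise_jaccard(term_doc)
print(similarity)
```

With 20,000 documents the full result is a 20,000 x 20,000 matrix of 8-byte floats, roughly 3.2 GB, so in practice it may be necessary to process the columns in blocks.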
10
votes
3 answers

What is bootstrapped data in data mining?

Recently I came across this term, but I really have no idea what it refers to. I've searched online, but with little gain. Thanks.
Kevin
  • 6,711
  • 16
  • 60
  • 107
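In the usual statistical sense, a bootstrapped data set is one drawn from the original data by sampling with replacement, typically with the same size as the original. A minimal sketch under that reading follows; the function names and the sample values are hypothetical.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bootstrap_means(data, n_resamples, seed=0):
    """Mean of each of n_resamples bootstrap samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = bootstrap_sample(data, rng)
        means.append(sum(sample) / len(sample))
    return means

# Hypothetical measurements for illustration only.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]
means = bootstrap_means(data, n_resamples=200)

# The spread of the resampled means estimates the uncertainty of the sample mean.
print(min(means), max(means))
```

Some original items appear several times in a given resample and others not at all, which is what lets the resamples stand in for repeated draws from the underlying population.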
10
votes
3 answers

In data mining, what is a class label? Please give an example

I don't understand what it means. In a database, does a tuple mean a field value and an attribute mean a table field? Am I correct? And what is a class label in data mining?
Akhil T Mohan
  • 193
  • 1
  • 1
  • 14
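A small illustration of the terms in the question, using the usual conventions: a tuple is one row (record), an attribute is one column, and the class label is the value of the attribute a classifier is trained to predict. The weather-style table and all names below are made up for this example.

```python
# Attributes are the columns used as inputs; the class attribute is the target.
ATTRIBUTES = ("outlook", "temperature", "windy")
CLASS_ATTRIBUTE = "play"

# Each dict is one tuple (row) of the table.
rows = [
    {"outlook": "sunny", "temperature": 30, "windy": False, "play": "no"},
    {"outlook": "overcast", "temperature": 22, "windy": True, "play": "yes"},
    {"outlook": "rainy", "temperature": 18, "windy": False, "play": "yes"},
]

# Split every tuple into its input attribute values and its class label.
features = [tuple(row[a] for a in ATTRIBUTES) for row in rows]
labels = [row[CLASS_ATTRIBUTE] for row in rows]

for x, y in zip(features, labels):
    print(f"attributes={x} -> class label={y!r}")
```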
10
votes
3 answers

Probabilistic Generation of Semantic Networks

I've studied some simple semantic network implementations and basic techniques for parsing natural language. However, I haven't seen many projects that try and bridge the gap between the two. For example, consider the dialog: "the man has a hat" "he…
Cerin
  • 60,957
  • 96
  • 316
  • 522