Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaning from them. Data mining tools such as SQL Server Analysis Services predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve, scouring databases for hidden patterns and finding predictive information that experts may miss because it lies outside their expectations. The inputs to data mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
23
votes
4 answers

Guided mining of common substructures in a large set of graphs

I have a large (>1000) set of directed acyclic graphs, each with a large (>1000) set of vertices. The vertices are labeled, and the number of distinct labels is small (< 30). I want to identify (mine) substructures that appear frequently over the whole set of…
user2722968
  • 13,636
  • 2
  • 46
  • 67
23
votes
3 answers

How to find out if a sentence is a question (interrogative)?

Is there an open-source Java library/algorithm for determining whether a particular piece of text is a question? I am working on a question answering system that needs to analyze whether the text input by the user is a question. I think the problem can…
nabeelmukhtar
  • 1,371
  • 15
  • 24
23
votes
6 answers

Text mining with PHP

I'm doing a project for a college class I'm taking. I'm using PHP to build a simple web app that classifies tweets as "positive" (or happy) and "negative" (or sad) based on a set of dictionaries. The algorithm I'm thinking of right now is Naive Bayes…
garyc40
  • 343
  • 1
  • 3
  • 8
23
votes
1 answer

How to find common phrases in a large body of text

I'm working on a project at the moment where I need to pick out the most common phrases in a huge body of text. For example, say we have three sentences like the following: The dog jumped over the woman. The dog jumped into the car. The dog jumped…
benmcredmond
  • 1,702
  • 2
  • 15
  • 22
22
votes
2 answers

pandas pivot table rename columns

How to rename columns with multiple levels after pandas pivot operation? Here's some code to generate test data: import pandas as pd df = pd.DataFrame({ 'c0': ['A','A','B','C'], 'c01': ['A','A1','B','C'], 'c02': ['b','b','d','c'], …
muon
  • 12,821
  • 11
  • 69
  • 88
22
votes
2 answers

scikit-learn: clustering text documents using DBSCAN

I'm trying to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as the clustering algorithm.…
22
votes
7 answers

Can k-means clustering do classification?

I want to know whether the k-means clustering algorithm can do classification. Suppose I have done a simple k-means clustering: I have a lot of data, I run k-means on it, and I get 2 clusters, A and B, and the centroid calculating method is…
Sirius Wang
  • 339
  • 1
  • 5
  • 15
22
votes
6 answers

Is it ok to define your own cost function for logistic regression?

In least-squares models, the cost function is defined as the square of the difference between the predicted value and the actual value as a function of the input. When we do logistic regression, we change the cost function to be a logarithmic…
London guy
  • 27,522
  • 44
  • 121
  • 179
21
votes
5 answers

How can I perform K-means clustering on time series data?

How can I do K-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster a time series of size 1×M, where M is the data length. In particular, I'm not sure how to update…
Jaz
  • 581
  • 2
  • 6
  • 10
21
votes
3 answers

Better text documents clustering than tf/idf and cosine similarity?

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are…
Jack Twain
  • 6,273
  • 15
  • 67
  • 107
20
votes
2 answers

Can an author's unique "literary style" be used to identify him/her as the author of a text?

Let's imagine I have two English-language texts written by the same person. Is it possible to apply some Markov chain algorithm to analyse each, create some kind of fingerprint based on statistical data, and compare the fingerprints obtained from…
user313885
20
votes
1 answer

Clustering cosine similarity matrix

A few questions on Stack Overflow mention this problem, but I haven't found a concrete solution. I have a square matrix which consists of cosine similarities (values between 0 and 1), for example:
  | A   | B   | C   | D
A | 1.0 | 0.1 | 0.6 | 0.4
B…
Stefan D
  • 1,229
  • 2
  • 15
  • 29
20
votes
5 answers

Machine learning challenge: diagnosing program in java/groovy (datamining, machine learning)

I'm planning to develop a program in Java that will provide a diagnosis. The data set is divided into two parts, one for training and the other for testing. My program should learn to classify from the training data (BTW, which contains the answer for 30…
19
votes
4 answers

Best clustering algorithm? (simply explained)

Imagine the following problem: You have a database containing about 20,000 texts in a table called "articles". You want to connect the related ones using a clustering algorithm in order to display related articles together. The algorithm should do…
caw
  • 30,999
  • 61
  • 181
  • 291
19
votes
5 answers

What kind of artificial intelligence jobs are out there?

Throughout my academic years in computer science I fell in love with many aspects of artificial intelligence, from expert systems and neural networks to data mining (classification). I wonder, if I were to transform this academic passion…
wsb3383
  • 3,841
  • 12
  • 44
  • 59