Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaning from that data. Data mining tools such as SQL Server Analysis Services predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. These tools can answer business questions that were traditionally too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. The inputs to data mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
7
votes
5 answers

Simplest feature selection algorithm

I am trying to create my own simple feature selection algorithm. The data set that I am going to work with is here (a very famous data set). Can someone give me a pointer on how to do so? I am planning to write a feature rank algorithm for a text…
aherlambang
  • 14,290
  • 50
  • 150
  • 253
7
votes
1 answer

Is DLIB a good open source library for developing my own machine learning algorithms in C++?

Is DLIB a good open source library for developing my own machine learning algorithms in C++? How about other ones, such as libSVM, SHOGUN?
user297850
  • 7,705
  • 17
  • 54
  • 76
7
votes
3 answers

How to get the topic associated with each document using pyspark (2.1.0) LDA?

I am using the LDAModel of pyspark to get topics from a corpus. My goal is to find the topics associated with each document. For that purpose I tried to set topicDistributionCol as per the docs. Since I am new to this, I am not sure what the purpose of this…
7
votes
1 answer

How to draw a small graph with community structure in networkx

The graph has around 100 nodes, and the number of communities ranges from 5 to 20. Is there any way to draw the graph such that the nodes of the same community are close to each other? I've tried to assign different communities different colors,…
user3813057
  • 891
  • 3
  • 13
  • 31
7
votes
4 answers

How to choose initial centroids for k-means clustering

I am working on implementing k-means clustering in Python. What is a good way to choose initial centroids for a data set? For instance, I have the following data set: A,1,1 B,2,1 C,4,4 D,4,5 I need to create two different clusters. How do I start…
Clint Whaley
  • 459
  • 2
  • 7
  • 18
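For the four points quoted in the question above, one widely used way to pick starting centroids is k-means++ seeding: choose the first centroid uniformly at random, then choose each further centroid with probability proportional to its squared distance from the nearest centroid already chosen. The sketch below is only an illustration of that idea and is not taken from the thread; the function names `squared_distance` and `kmeans_pp_init` and the `seed` parameter are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centroids from `points` using k-means++ style seeding.

    Hypothetical helper for illustration; `seed` only makes runs repeatable.
    """
    rng = random.Random(seed)
    # First centroid: uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid,
        # so far-away points are more likely to become the next centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform choice.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# The data set from the question: label -> (x, y).
data = {"A": (1, 1), "B": (2, 1), "C": (4, 4), "D": (4, 5)}
initial = kmeans_pp_init(list(data.values()), k=2, seed=0)
print(initial)
```

Because an already chosen point has weight zero, the two starting centroids here are always distinct data points, and points far from the first centroid are favored for the second. The usual k-means assignment and update steps would then start from `initial`.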
7
votes
2 answers

DBSCAN for clustering data by location and density

I'm using the method dbscan::dbscan in order to cluster my data by location and density. My data looks like this: str(data) 'data.frame': 4872 obs. of 3 variables: $ price : num ... $ lat : num ... $ lng : num ... Now I'm using…
Paul
  • 1,325
  • 2
  • 19
  • 41
7
votes
2 answers

Anything better than ruby alchemy for extracting keywords?

I've written an algorithm in Ruby based on the arc90 readability code to extract an article from a web page. Now that I have the article, I want to extract keywords and specific information from it (names, author, etc.). I heard Alchemy was…
dpigera
  • 3,339
  • 5
  • 39
  • 60
7
votes
10 answers

Hadoop beginners

I'm trying to practice some data mining algorithms using Hadoop. Can I do this with HDFS alone, or do I need to use sub-projects like Hive/HBase/Pig?
realnumber
  • 2,124
  • 5
  • 25
  • 33
7
votes
3 answers

Is there a stop word list for Twitter?

I want to do some mining on tweets. Is there a more specific stop word list for tweets, for example one that removes "lol" and Twitter smileys?
陈家泽
  • 115
  • 1
  • 4
7
votes
1 answer

Implementing Naïve Bayes algorithm in Java - Need some guidance

As a school assignment I'm required to implement the Naïve Bayes algorithm, which I intend to do in Java. In trying to understand how it's done, I've read the book "Data Mining - Practical Machine Learning Tools and Techniques", which has a section…
ke3pup
  • 1,835
  • 4
  • 36
  • 66
7
votes
1 answer

R: unclear behaviour of tuneRF function (randomForest package)

I am unsure about the meaning of the stepFactor parameter of the tuneRF function, which is used for tuning the mtry parameter that is later passed to the randomForest function. The documentation of tuneRF says that stepFactor is a magnitude by which…
7
votes
2 answers

Speed-efficient classification in Matlab

I have an RGB image stored as uint8(576,720,3), and I want to classify each pixel into one of a set of colors. I have transformed it from RGB to LAB space using rgb2lab and then removed the L layer, so it is now a double(576,720,2) consisting of AB. Now, I…
7
votes
4 answers

Sentiment analysis Java library

I have some unlabeled microblogging posts and I want to create a sentiment analysis module. To do this I have tried the Stanford library and the Alchemy API web service, but the results are not very good. For now I don't want to train my own classifier. So I…
7
votes
5 answers

Which data mining algorithm would you suggest for this particular scenario?

This is not directly a programming question, but it's about selecting the right data mining algorithm. I want to infer the age of people from their first names, the region they live in, and whether or not they have an internet product. The…
ercan
  • 1,639
  • 1
  • 20
  • 34
7
votes
7 answers

How to tell whether a string is randomly generated or plausibly an English word?

I have a corpus of text which contains some strings. Some of these strings are English words and some are random, such as VmsVKmGMY6eQE4eMI; there is no limit on the number of characters in each string. Is there any way to test whether or not one…
ikel
  • 1,790
  • 6
  • 31
  • 61