Questions tagged [data-mining]

Data mining is the process of analyzing large amounts of data in order to find patterns and commonalities.

Data mining, also known as knowledge discovery, is the process of digging through and analyzing enormous sets of data to extract their meaning. Data mining tools, such as SQL Server Analysis Services, predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that were traditionally too time-consuming to resolve, scouring databases for hidden patterns and finding predictive information that experts may miss because it lies outside their expectations. The inputs to data mining algorithms are variously called cases, samples, examples, instances, events, or observations.

3094 questions
19
votes
3 answers

Twitter: How to extract tweets containing symbols (!,%,$)?

For a project, I want to be able to create a dataset of tweets containing some particular string of symbols. Since I would also like to go as far back in time as possible, I tried using the GetOldTweets script (…
Melsauce
  • 2,535
  • 2
  • 19
  • 39
19
votes
5 answers

How exactly does sharkscope or PTR data mine all those hands?

I'm very curious to know how this process works. These sites (http://www.sharkscope.com and http://www.pokertableratings.com) data mine thousands of hands per day from secure poker networks, such as PokerStars and Full Tilt. Do they have a farm of…
Fred Fickleberry III
  • 2,439
  • 4
  • 34
  • 50
19
votes
2 answers

Matlab - PCA analysis and reconstruction of multi dimensional data

I have a large dataset of multidimensional data (132 dimensions). I am a beginner at data mining and I want to apply Principal Component Analysis using Matlab. However, I have seen that there are a lot of functions explained on the web…
Simon
  • 4,999
  • 21
  • 69
  • 97
18
votes
2 answers

What are some good ways of estimating 'approximate' semantic similarity between sentences?

I have been looking at the nlp tag on SO for the past couple of hours and am confident I did not miss anything, but if I did, please do point me to the question. In the meantime, though, I will describe what I am trying to do. A common notion that I…
Legend
  • 113,822
  • 119
  • 272
  • 400
18
votes
3 answers

GBM R function: get variable importance separately for each class

I am using the gbm function in R (gbm package) to fit stochastic gradient boosting models for multiclass classification. I am simply trying to obtain the importance of each predictor separately for each class, like in this picture from the Hastie…
Antoine
  • 1,649
  • 4
  • 23
  • 50
18
votes
3 answers

Download link for Ta Feng Grocery dataset

I have been desperately trying to download the Ta-Feng grocery dataset for a few days, but it appears that all links are broken. I need it for data mining / machine learning research for my MSc thesis. I also have the Microsoft grocery database, the Belgian store…
Dragan
  • 500
  • 3
  • 11
17
votes
5 answers

How would you group/cluster these three areas in arrays in python?

So you have an array: 1 2 3 60 70 80 100 220 230 250. For a better understanding: how would you group/cluster the three areas in arrays in Python (v2.6), so you get three arrays, in this case containing [1 2 3] [60 70 80 100] [220 230…
Zurechtweiser
  • 1,165
  • 2
  • 16
  • 29
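For the grouping question above, one possible approach (not necessarily the one the answers settled on) is to sort the values and cut at the widest gaps between neighbours. The sketch below is written for Python 3 rather than the 2.6 mentioned in the question, and the function name `split_on_largest_gaps` and its parameter `k` (the number of groups wanted) are hypothetical, introduced only for illustration.

```python
def split_on_largest_gaps(values, k):
    """Split a sorted copy of `values` into k groups at the k-1 widest gaps."""
    ordered = sorted(values)
    if k < 1 or k > len(ordered):
        raise ValueError("k must be between 1 and len(values)")
    # Gap i is the distance between ordered[i] and ordered[i + 1].
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    # Keep the k-1 widest gaps, then put their positions back in left-to-right order.
    cut_after = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    groups, start = [], 0
    for i in cut_after:
        groups.append(ordered[start:i + 1])
        start = i + 1
    groups.append(ordered[start:])
    return groups


data = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
groups = split_on_largest_gaps(data, 3)
print(groups)  # [[1, 2, 3], [60, 70, 80, 100], [220, 230, 250]]
```

This assumes the number of groups is known in advance. When it is not, a gap threshold or a one-dimensional clustering method would be needed instead.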
17
votes
5 answers

Retrieving population density data

I need to figure out whether or not a given location is considered urban or rural. I take it that the best way to do this is by looking at the population density of the city/state or province/country combination. The kicker is that we're using this for…
17
votes
6 answers

Ways to calculate similarity

I am building a community website that requires me to calculate the similarity between any two users. Each user is described with the following attributes: age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV…
17
votes
2 answers

Cosine similarity when one of vectors is all zeros

How do you express the cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity) when one of the vectors is all zeros? v1 = [1, 1, 1, 1, 1], v2 = [0, 0, 0, 0, 0]. When we calculate according to the classic formula, we get division by zero: Let…
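To make the division by zero concrete: cosine similarity is the dot product divided by the product of the two vector norms, so it is mathematically undefined when either norm is zero. The sketch below shows one way to handle that case explicitly. Returning a caller-chosen placeholder is only a convention assumed here, not a standard answer, and the `zero_value` parameter is hypothetical.

```python
import math


def cosine_similarity(v1, v2, zero_value=None):
    """Cosine similarity of two equal-length vectors.

    Returns `zero_value` when either vector has zero norm, because the
    quantity is undefined there; the caller decides what that should mean.
    """
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return zero_value  # avoid dividing by zero
    return dot / (norm1 * norm2)


v1 = [1, 1, 1, 1, 1]
v2 = [0, 0, 0, 0, 0]
print(cosine_similarity(v1, v2))                  # None (undefined case)
print(cosine_similarity(v1, v2, zero_value=0.0))  # 0.0, if "no similarity" is the chosen convention
```

Whether an all-zero vector should count as similarity 0, as missing, or as an error depends on the application, which is why the sketch leaves that choice to the caller.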
17
votes
5 answers

Can stop-words be found automatically?

In NLP, stop-words removal is a typical pre-processing step. And it is typically done in an empirical way based on what we think stop-words should be. But in my opinion, we should generalize the concept of stop-words. And the stop-words could vary…
smwikipedia
  • 61,609
  • 92
  • 309
  • 482
17
votes
2 answers

dbscan - setting limit on maximum cluster span

By my understanding of DBSCAN, it's possible for you to specify an epsilon of, say, 100 meters and — because DBSCAN takes into account density-reachability and not direct density-reachability when finding clusters — end up with a cluster in which…
user139014
  • 1,445
  • 2
  • 19
  • 33
17
votes
3 answers

Writing rules generated by Apriori

I'm working with some large transactions data. I've been using read.transactions and apriori (parts of the arules package) to mine for frequent item pairings. My problem is this: when rules are generated (using "inspect()") I can easily view them in…
user2432675
  • 715
  • 1
  • 6
  • 14
17
votes
3 answers

Python tools for out-of-core computation/data mining

I am interested in using Python to mine data sets too big to fit in RAM but small enough to fit on a single hard drive. I understand that I can export the data as HDF5 files using PyTables. Also, numexpr allows for some basic out-of-core computation. What would come…
user17375
  • 529
  • 4
  • 14
16
votes
2 answers

How to read binary files in Python using NumPy?

I know how to read binary files in Python using NumPy's np.fromfile() function. The issue I'm faced with is that when I do so, the array has exceedingly large numbers of the order of 10^100 or so, with random nan and inf values. I need to apply…
Suyash Shetty
  • 513
  • 3
  • 8
  • 17