
I have very limited knowledge of machine learning. I'm looking for a clustering algorithm that can group data points based on their historical data. Consider this example: there are n weather stations (say, 200), and I have 5 years of hourly temperature data for each of them. The data looks like this:

timestamp, station_1, station_2, ...
1900-01-01 00:00:00, 80, 60, 81, ...
1900-01-01 01:00:00, 82, 59, 83, ...

I'm looking for a clustering algorithm that groups weather stations so that, within a cluster, the station temperatures are very close. For example, 80 and 81 are close, while 80 and 60 are not.

It would also be great if the algorithm could calculate how 'close' each data point is to its cluster center.

pythonician_plus_plus
  • You're missing one important point: what are "close" temperatures? You'll have to define this before you even start clustering. The simplest solution would probably be to calculate the average temperature first. –  Aug 20 '15 at 23:24
  • @Paul For example, 80 and 81 are close, while 80 and 60 are not. – pythonician_plus_plus Aug 20 '15 at 23:26
  • This applies to **single values**, yes. But you want to cluster by weather stations (or at least that's how I understood the question), not by single temperatures. So the relationship you're looking for isn't between temperatures, but between sets of temperatures. –  Aug 20 '15 at 23:30
  • @Paul Right, that's the tricky part. I know there is k-means, which might be applied to cluster the temperatures, but how do I do that with weather stations? – pythonician_plus_plus Aug 20 '15 at 23:35
  • The relationship between weather stations (that is, between sets of temperatures) is a statistical problem rather than a coding one. There are tons of approaches for this. The simplest would be some kind of mean, but you could also include some sort of standard deviation in the comparison; the list is endless. It's pretty difficult to recommend anything without knowing the precise requirements, and statistics in general isn't exactly my specialty. –  Aug 20 '15 at 23:39
  • Maybe try Principal Component Analysis: aggregate over 24-hour periods and/or 1-year periods to find which stations vary similarly over time and encode similar responses to environmental changes. – knb Aug 22 '15 at 12:26

1 Answer


There is no free lunch

Don't expect to find an algorithm that exactly does what you need.

Customize algorithms as needed for your problem. That is the whole story behind the Data Science buzz: the need to experiment and customize instead of hoping for a turnkey solution.

You have a very specific idea of what you need. You will have to put this idea into code and plug it into some algorithm. For example, consider complete linkage clustering with the maximum norm. It is probably what you described above, but I don't think it will be useful.
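
To make that suggestion concrete, here is a minimal sketch using SciPy. The function names `cluster_stations` and `distance_to_center`, the `max_gap` threshold, and the tiny sample array are made up for illustration. It assumes the data is a complete array with one row per hour and one column per station, with no missing readings. Hierarchical clustering has no built-in cluster center, so the "center" here is simply the hourly mean of a cluster's members, which is my own assumption and only one of several reasonable choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_stations(temps, max_gap):
    """Group stations whose temperatures never differ by more than max_gap.

    temps: array of shape (n_hours, n_stations), no missing values (assumption).
    Returns one integer cluster label per station.
    """
    series = np.asarray(temps, dtype=float).T  # one row per station
    # Maximum norm (Chebyshev): the largest hourly difference between two stations.
    dists = pdist(series, metric="chebyshev")
    # Complete linkage: the distance between clusters is the worst pairwise distance.
    tree = linkage(dists, method="complete")
    # Cut the tree so that every pair inside a cluster is within max_gap.
    return fcluster(tree, t=max_gap, criterion="distance")


def distance_to_center(temps, labels):
    """Maximum-norm distance from each station to its cluster's hourly mean.

    The hourly mean as "center" is an illustrative assumption.
    """
    series = np.asarray(temps, dtype=float).T
    out = np.empty(len(labels))
    for lab in np.unique(labels):
        members = labels == lab
        center = series[members].mean(axis=0)
        out[members] = np.abs(series[members] - center).max(axis=1)
    return out


# The two sample rows from the question: three stations, two hours.
sample = np.array([[80, 60, 81],
                   [82, 59, 83]])
labels = cluster_stations(sample, max_gap=5)
print(labels)                              # stations 1 and 3 share a label
print(distance_to_center(sample, labels))
```

With the maximum norm, a single hour in which two stations disagree by more than `max_gap` is enough to keep them apart, which is one reason this literal reading of "close" may turn out not to be useful on five years of real data.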

Has QUIT--Anony-Mousse