I'm new to machine learning and recently got a job doing R&D related to Big Data.

The main idea is to get insight from an arbitrary collection of big data (I don't know yet what the data will be), turn it into information, and then turn that information into knowledge. The usual things.

I realized that, in the end, most Big Data analysis uses machine learning to do some of its work automatically. Therefore, my focus for now has shifted to machine learning first.

The first thing I learned is that getting insight from data we know nothing about is most likely a job for unsupervised learning. So I tried clustering first, using K-means.

At this point, I started to have questions:

  1. In K-means, we need to decide K. This seems weird to me: why do we need to decide the number of clusters up front, when I would expect the algorithm to draw its own borders and decide how many clusters it found?

  2. Even once the clusters are decided, how do I know what insight I got, when I don't even know how the clusters were decided? So in the end, do we still need manual analysis for this kind of thing?

  3. I wonder, is there a way to get insight from random data without additional manual analysis, or is it supposed to be like that?

Lyn

1 Answer

Any kind of problem involves some manual analysis. From what you wrote, the problem statement itself is not clear. When you are not even sure what the data is going to be, you first have to look at its features, some basic statistics, null values, duplicates, data types, and so on, and then clean the data. Only after that can you apply ML techniques to get insights.
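
A first pass over an unknown dataset could look something like the sketch below. It assumes the data fits in a pandas DataFrame; the helper names `profile` and `basic_clean` and the toy columns `age` and `city` are made up for illustration, and real cleaning rules will depend on the data.

```python
import pandas as pd


def profile(df):
    """Collect basic facts about a DataFrame before any modelling."""
    return {
        "n_rows": len(df),
        "dtypes": df.dtypes.astype(str).to_dict(),      # data type per column
        "null_counts": df.isna().sum().to_dict(),       # missing values per column
        "n_duplicate_rows": int(df.duplicated().sum()), # exact repeats of an earlier row
        "numeric_summary": df.describe(),               # count, mean, std, quartiles
    }


def basic_clean(df):
    """Drop exact duplicate rows and rows that are entirely empty."""
    return df.drop_duplicates().dropna(how="all").reset_index(drop=True)


# Toy data, invented for this example only.
df = pd.DataFrame({
    "age": [25, 25, None, 40],
    "city": ["Oslo", "Oslo", "Lima", None],
})

report = profile(df)
cleaned = basic_clean(df)

print(report["null_counts"])
print(report["n_duplicate_rows"])
print(len(cleaned))
```

This only surfaces the problems. Whether to drop, fill, or keep rows with missing values is still a decision someone has to make after looking at the report.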

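Coming to K-means, which is an unsupervised learning method, there are techniques for deciding which k to choose. Look into the "elbow method": run K-means for a range of k values, plot the within-cluster sum of squares against k, and pick the k where the improvement levels off. In your case, K-means might help produce a reasonable segmentation of the data for initial analysis.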
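
As a rough illustration of the elbow method, here is a minimal sketch using scikit-learn's `KMeans` on synthetic data. The three blobs are generated for the example, and `pick_elbow` is a hypothetical heuristic (largest second difference of the inertia curve), not a standard library function. In practice people usually plot the curve and judge the bend by eye.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_curve(X, k_values, seed=0):
    """Fit K-means for each k and return the within-cluster sum of squares."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


def pick_elbow(k_values, inertias):
    """Hypothetical heuristic: the k where the curve bends most sharply,
    measured by the largest second difference of the inertia values."""
    second_diff = np.diff(inertias, n=2)
    # second_diff[i] is centred on k_values[i + 1]
    return k_values[int(np.argmax(second_diff)) + 1]


# Synthetic data: three well-separated 2-D blobs (assumed for illustration).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

k_values = list(range(1, 8))
inertias = inertia_curve(X, k_values)
best_k = pick_elbow(k_values, inertias)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
print("suggested k:", best_k)
```

On clean, well-separated data like this, the bend should land at the true number of blobs. On real data the curve is often smooth and the choice of k stays a judgment call, which is why some manual analysis remains.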

I cannot comment further on what to do, as I do not know the data.

My3
  • That is my point exactly. I knew about the elbow method and other methods that can help choose K. I'm just wondering why, in the end, we still have to choose K ourselves. It's like having to do an analysis on top of another analysis. And yes, the plan is that I won't know what kind of data will be used. The plan is to get some insight from that unknown data automatically, within some limits, without manual analysis. From your response, it looks like that is not possible? – Lyn Oct 04 '18 at 10:04