

I have to analyse completely unknown numerical data (I don't know what it represents).
Below are some samples from the training data:

   'yout': array([[  0.00000000e+00,  -7.87464718e-08,  -7.31121013e-08, ...,
     -4.20583628e-07,  -3.62647412e-07,  -2.17680232e-07],
   [ -1.13230235e-13,  -9.38223846e-05,   8.30087034e-05, ...,
     -1.66600921e-07,  -2.18490921e-07,   3.85091720e-07],
   [  3.32348250e-06,  -1.93950410e-04,   1.54892852e-04, ...,
     -7.36868568e-08,  -1.41946370e-07,   2.15633282e-07],
   ..., 
   [  9.72858182e-04,   7.22416022e-05,  -1.68044656e-05, ...,
     -2.90709866e-06,   2.59359588e-06,   3.13502801e-07],
   [  9.71197632e-04,   7.19938095e-05,  -1.67844712e-05, ...,
     -2.91106565e-06,   2.58013028e-06,   3.30935374e-07],
   [  9.80158036e-04,   7.25326131e-05,  -1.69481316e-05, ...,
     -2.94693184e-06,   2.59483672e-06,   3.52095128e-07]]), 
   'uin': array([[ -9.01855411e-03,   0.00000000e+00,   0.00000000e+00, ...,
      0.00000000e+00,  -7.99360578e-14,   0.00000000e+00],
   [ -9.01855411e-03,   0.00000000e+00,   0.00000000e+00, ...,
      0.00000000e+00,  -6.21724894e-14,   0.00000000e+00],
   [ -9.01855411e-03,   0.00000000e+00,   0.00000000e+00, ...,
      0.00000000e+00,   1.41805257e-05,   0.00000000e+00],
   ..., 
   [ -2.50927606e-02,   0.00000000e+00,   0.00000000e+00, ...,
      0.00000000e+00,  -8.40115265e-01,   0.00000000e+00],
   [ -2.50927606e-02,   0.00000000e+00,   0.00000000e+00, ...,
      0.00000000e+00,  -8.40071885e-01,   0.00000000e+00],
   [ -2.50891131e-02,   0.00000000e+00,   0.00000000e+00, ...,
      0.00000000e+00,  -8.40028529e-01,   0.00000000e+00]]),        
   'time': array([[  0.00000000e+00],
   [  1.00000000e-02],
   [  2.00000000e-02],
   ..., 
   [  1.99980000e+02],
   [  1.99990000e+02],
   [  2.00000000e+02]])

The shapes of the output, input and time arrays, respectively:

   ((184112, 63), (184112, 21), (184112, 1))

What have I done with the input data so far?
- tidying: removed a few columns that contain only zeros
- descriptive statistics: mean, min, max, percentiles and the correlation matrix
- visualisation: a histogram of each numerical attribute and a pairplot using seaborn
- clustering: K-Means with the elbow method; the search for the best number of clusters suggested that there are 3 clusters
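For reference, the tidying and elbow steps listed above can be sketched roughly as follows. This is only a minimal sketch, not the code actually used: the helper names `drop_zero_columns` and `elbow_inertias` are hypothetical, the features are standardised before clustering (an assumption), and the `demo` array is synthetic data standing in for `uin`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def drop_zero_columns(x):
    """Return x without its all-zero columns, plus the boolean mask of kept columns."""
    keep = np.any(x != 0, axis=0)
    return x[:, keep], keep


def elbow_inertias(x, k_values, seed=0):
    """Fit K-Means for each k and return the inertia (within-cluster sum of squares)."""
    # Standardising is an assumption: it stops large-scale columns from dominating.
    x_scaled = StandardScaler().fit_transform(x)
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(x_scaled)
        inertias.append(model.inertia_)
    return inertias


# Synthetic stand-in for `uin`: three blobs plus one all-zero column.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
demo = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
demo = np.hstack([demo, np.zeros((300, 1))])

demo_clean, kept = drop_zero_columns(demo)
inertias = elbow_inertias(demo_clean, range(1, 7))
for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 2))
```

The "elbow" is the value of k after which the printed inertia stops dropping sharply. It is a visual heuristic, so it is worth confirming with a second criterion.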

The problem is that I don't know how to verify my suspicion that there are 3 clusters, I have no idea how to make use of the output data (which contains 3 times as many features), and I don't know what to do with the timestamps.
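One common way to cross-check a cluster count is the silhouette score, which should peak near the "right" k if the clusters are well separated. As for the timestamps: a single sweep from 0 to 200 in steps of 0.01 would contain only 20001 rows, far fewer than 184112, so the time column may restart several times or be sampled non-uniformly. That is a guess that needs checking against the real data. The sketch below illustrates both checks on synthetic data; the helper names `silhouette_by_k` and `split_at_time_resets` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(x, k_values, sample_size=2000, seed=0):
    """Return {k: silhouette score} for K-Means fits with k >= 2 clusters."""
    x_scaled = StandardScaler().fit_transform(x)
    size = min(sample_size, len(x_scaled))  # subsample: the full score is O(n^2)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(x_scaled)
        scores[k] = silhouette_score(
            x_scaled, labels, sample_size=size, random_state=seed
        )
    return scores


def split_at_time_resets(time):
    """Split row indices into segments wherever the time value stops increasing."""
    t = np.asarray(time).ravel()
    resets = np.flatnonzero(np.diff(t) <= 0) + 1
    return np.split(np.arange(len(t)), resets)


# Synthetic stand-in for the input features: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
demo = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
scores = silhouette_by_k(demo, range(2, 6))
best_k = max(scores, key=scores.get)
print(scores, best_k)

# Synthetic time column that restarts once (two hypothetical runs).
demo_time = np.concatenate([np.arange(0, 1, 0.25), np.arange(0, 0.5, 0.25)])
segments = split_at_time_resets(demo_time.reshape(-1, 1))
print([len(s) for s in segments])
```

If the real time column does split into several segments, each segment can be analysed as a separate time series, and the cluster labels can be plotted against time to see whether the clusters correspond to phases within a run.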

Can anyone advise me how I should carry on my analysis, please?

(Please bear with me: I am a total beginner in data analysis, and even more so in ML and AI.)

  • Welcome to SO!! Please provide more context!! what techniques you applied so that 21 columns became 63. Provide sample data and so on. How can folks help you here with nothing to work with!! – Rahul Agarwal Oct 10 '18 at 14:12
  • @RahulAgarwal Thank you for your greeting. I've just updated my post and added the essential information. – Ciastko Czekoladowe Oct 10 '18 at 15:21
  • When you say `I have to analyse completely unknown numerical data`, what exactly do you have to do with it? Do you have to build a model that, given an input not present in the training set, predicts its associated output? If this is the case you may not be able to achieve this task with clustering, and should try [multivariate linear regression](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq) instead. Also, [dimensionality reduction](http://scikit-learn.org/stable/modules/unsupervised_reduction.html) may help you better understand your data. – Daneel R. Oct 10 '18 at 19:05
  • @DanielR. I was given the data and no further information (e.g. where it comes from). My task is to gather as much information about it as I can (correlations, anomalies, etc.). It's the first time I've faced such a problem, so I feel that I need some help. – Ciastko Czekoladowe Oct 10 '18 at 19:18
  • Try to implement one of the algorithms described in the comment above, and if you fail, post your code and we can help you with that. This is a forum about programming, and without seeing the actual data and the actual lines of code, you can only receive very general suggestions. Also: `The problem is that I don't know to verify my suspicion that there are 3 clusters` For this to be true, the [cosine distance](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html) between the three clusters should be very high. – Daneel R. Oct 12 '18 at 08:45
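
The two suggestions from the comments (multivariate linear regression with `numpy.linalg.lstsq`, and dimensionality reduction) could look roughly like the sketch below. It is a minimal sketch on synthetic data, not a recommendation for this particular dataset: the helper names `fit_linear_map` and `components_for_variance` are hypothetical, PCA is just one of several possible reduction methods, and the 95% variance target is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA


def fit_linear_map(u, y):
    """Least-squares fit of y ~ [u, 1] @ coef; returns coef and R^2 per output column."""
    design = np.hstack([u, np.ones((len(u), 1))])  # extra column for the intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    ss_res = (residual ** 2).sum(axis=0)
    ss_tot = ((y - y.mean(axis=0)) ** 2).sum(axis=0)  # zero for constant columns
    return coef, 1.0 - ss_res / ss_tot


def components_for_variance(x, target=0.95):
    """Smallest number of principal components explaining at least `target` variance."""
    cumulative = np.cumsum(PCA().fit(x).explained_variance_ratio_)
    return int(np.searchsorted(cumulative, target) + 1)


rng = np.random.default_rng(0)

# Synthetic input/output pair with a known linear relation plus small noise.
u_demo = rng.normal(size=(500, 4))
true_w = rng.normal(size=(4, 3))
y_demo = u_demo @ true_w + 0.5 + rng.normal(scale=0.01, size=(500, 3))
coef, r2 = fit_linear_map(u_demo, y_demo)
print(coef.shape, r2.round(4))

# Synthetic 6-column data that really varies along only two directions.
latent = rng.normal(size=(500, 2))
x_demo = np.hstack([latent, np.zeros((500, 4))]) + rng.normal(scale=1e-3, size=(500, 6))
n_components = components_for_variance(x_demo)
print(n_components)
```

An R² close to 1 for an output column would suggest that it is (nearly) a linear function of the inputs, while low values point to nonlinearity, dynamics over time, or noise. A small component count relative to the 63 output columns would suggest that many of them are redundant.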
