
I am new to Python and data science, and I am currently working on a project based on a very large dataframe with 75 columns. I am doing some data exploration and would like to check for possible correlations between the columns. For smaller dataframes I know I could call pandas.plotting.scatter_matrix() on the dataframe to do so. However, in my case this produces a 75x75 matrix, and I can't even make out the individual plots.

An alternative would be creating lists of 5 columns and using scatter_matrix multiple times, but this method would produce too many scatter matrices. For instance, with 15 columns this would be:


import pandas as pd

df = pd.read_csv('dataset.csv')

# Column labels, five at a time (labels, not the column data itself)
list1 = df.columns[0:5]
list2 = df.columns[5:10]
list3 = df.columns[10:15]

# One 5x5 scatter matrix per group of columns
pd.plotting.scatter_matrix(df[list1])
pd.plotting.scatter_matrix(df[list2])
pd.plotting.scatter_matrix(df[list3])

In order to use this same method with 75 columns, I'd have to go on until list15. This looks very inefficient. I wonder if there would be a better way to explore correlations in my dataset.

JBVasc
  • Please try something first and then post your question with codes. – Sachith Muhandiram Aug 09 '20 at 16:06
  • Do you need plots? Or are you looking for a correlation matrix? ...two way correlations are often not significant, try reading about feature selection in the user guide for the library you are using... here's that section in [scikit learn](https://scikit-learn.org/stable/modules/feature_selection.html) – RichieV Aug 09 '20 at 17:02

1 Answer


The problem here is only to a lesser extent a technical one. Producing the plots (5625 of them for a 75x75 matrix) will take quite a long time. The plots will also take up a fair amount of memory.

So I would ask a few questions to get around the problems:

  • Is it really necessary to have all these scatter plots?
  • Can I reduce the dimensionality in advance?
  • Why do I have such a high number of dimensions?
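
If the answer to the first question is "not really", one option is to compute the correlations as numbers and only look at the strongest pairs. The following is a minimal sketch, not a definitive recipe. It assumes all columns are numeric, and it uses a synthetic dataframe with made-up column names (`col0` to `col74`) as a stand-in for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (an assumption): 75 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 75)),
                  columns=[f"col{i}" for i in range(75)])
# Make one pair strongly correlated so there is something to find.
df["col1"] = 2 * df["col0"] + rng.normal(scale=0.1, size=500)

# Pairwise Pearson correlations: a 75x75 table of numbers instead of plots.
corr = df.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
top_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(top_pairs.head(10))
```

The pairs at the top of `top_pairs` are the ones worth plotting individually. Note that Pearson correlation only captures linear relationships, so a pair with a low value here can still be related in a non-linear way.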

If the plots are really useful, you could produce them on your own and stitch them together, or simply wait until the function finishes.
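
Producing the plots yourself could look roughly like the sketch below, which loops over the columns in groups of five instead of writing out list1 to list15 by hand. It again uses a small synthetic dataframe with hypothetical column names as a stand-in for the real data, and the group size of 5 is simply the one from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (an assumption): 15 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 15)),
                  columns=[f"col{i}" for i in range(15)])

chunk_size = 5  # columns per scatter matrix, as in the question
figures = []
for start in range(0, df.shape[1], chunk_size):
    # Column labels for this group; the last group may be shorter.
    cols = df.columns[start:start + chunk_size]
    axes = pd.plotting.scatter_matrix(df[cols], figsize=(8, 8))
    fig = axes[0, 0].figure
    fig.suptitle(f"Columns {start} to {start + len(cols) - 1}")
    figures.append(fig)
    # To write each matrix to disk instead of keeping it in memory:
    # fig.savefig(f"scatter_{start:02d}.png")
```

With 75 columns the same loop yields 15 figures. Keep in mind that this only shows pairs within each group of five; pairs whose columns fall into different groups are never plotted.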

thomas
  • I think you are right, there are simply too many features to approach in this way. Dimensionality reduction seems to be the way to go! – JBVasc Aug 10 '20 at 21:38