Is there an equivalent corr() function for Python Datatable as exists for Python Pandas - to find the correlation matrix of the Frame columns? Thanks

Pasha
SJain
  • According to [the docs](https://datatable.readthedocs.io/en/latest/changelog/v0.10.0.html#general), a `corr()` function was added in `v0.10.0` in December last year "to compute the covariance and Pearson correlation coefficient between columns of a Frame". Is that what you're looking for? – G. Anderson Feb 10 '20 at 22:33
  • Thanks Anderson. That helps for my current use case. However, what I was looking for is the corr() equivalent from Pandas, where we can do pandas_df.corr() to get the correlation matrix for ALL columns in one go, instead of having to specify each pairwise column. Thanks. – SJain Feb 12 '20 at 17:03

1 Answer

One option is to use the following function:

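import datatable as dt

def frame_corr(dt_frame):
    # Keep only the numeric columns (each `col` is a single-column Frame)
    numcols = [col for col in dt_frame if col.type.is_numeric]
    # One row per column: its correlation with every numeric column
    result = dt.rbind([dt_frame[:, [dt.corr(col1, col2) for col2 in numcols]] for col1 in numcols])
    result.names = dt_frame[:, numcols].names
    return result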

Input Data

import numpy as np

data = dt.Frame(x = np.random.normal(size=10),
                y = np.random.normal(size=10),
                z = np.random.normal(size=10)
               )

Output

frame_corr(data)
   |         x          y          z
   |   float64    float64    float64
-- + ---------  ---------  ---------
 0 |  1         -0.880012   0.26132 
 1 | -0.880012   1         -0.440515
 2 |  0.26132   -0.440515   1       
[3 rows x 3 columns]

data.to_pandas().corr()
          x         y         z
x  1.000000 -0.880012  0.261320
y -0.880012  1.000000 -0.440515
z  0.261320 -0.440515  1.000000

Note: `is_numeric` is available starting with datatable version 1.1.0.

langtang