10

I have a rectangular (can't be assumed to be square) Pandas DataFrame of numbers. Say I pick a diagonal direction (either "upperleft to lowerright" or "upperright to lowerleft"). I'd like to compute a series whose entries are the sums of the values from the original DataFrame along the chosen set of parallel diagonals. To fully specify the goal, you need to decide whether diagonals are "anchored" on the left or "anchored" on the right. For the below, I assume they're "anchored" on the left.

I can do this without too much trouble:

import numpy as np
import pandas as pd

rectdf = pd.DataFrame(np.arange(15).reshape(5,3))

# result:
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14

I can compute the "upperleft to lowerright" diagonal sums as follows:

ullrsums = pd.concat([rectdf.iloc[:, i].shift(-i) for i in range(rectdf.shape[1])], axis=1)\
    .sum(axis=1)

# result:
0    12
1    21
2    30
3    22
4    12

And I can compute the "upperright to lowerleft" diagonal sums by flipping the shift(-i) to shift(i) in the previous:

urllsums = pd.concat([rectdf.iloc[:, i].shift(i) for i in range(rectdf.shape[1])], axis=1)\
    .sum(axis=1)

# result:
0     0
1     4
2    12
3    21
4    30

These results are all correct (i.e. this code does what I want). Is there a more direct way to compute these sums in Pandas or Numpy?

8one6
  • Related: http://stackoverflow.com/q/10792897 and http://stackoverflow.com/q/28917414 –  Jan 28 '16 at 23:37

4 Answers

6

You may be looking for numpy.trace() to get the trace directly, or numpy.diagonal() to get the diagonal vector.

First, convert your dataframe to a numpy array using rectdf.to_numpy() (older pandas used rectdf.as_matrix(), which was removed in pandas 1.0)

Then:

np.trace(matrix, offset)

The offset, which can be either positive or negative, does the shifting you require.

For example, if we do:

a = np.arange(15).reshape(5, 3)
for x in range(-4, 3): print(np.trace(a, x))

We get output:

12
22
30
21
12
6
2

To do this for a general matrix, we want offsets from -(rows - 1) up to columns - 1, i.e. range(-(rows - 1), columns) for variables rows and columns:

a = np.arange(rows * columns).reshape(rows, columns)
for x in range(-(rows - 1), columns): print(np.trace(a, x))
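If you want output matching the question's anchored-on-the-left series exactly, the trace loop can be wrapped up like this (a sketch; diag_sums and its direction argument are my own naming, not a numpy or pandas API):

```python
import numpy as np
import pandas as pd

def diag_sums(df, direction="ullr"):
    """Sum the diagonals of a rectangular DataFrame, anchored on the left.

    direction: "ullr" for upperleft-to-lowerright, "urll" for
    upperright-to-lowerleft, matching the question's conventions.
    """
    a = df.to_numpy()
    rows, cols = a.shape
    if direction == "ullr":
        # the diagonal starting at row r has trace offset -r
        offsets = range(0, -rows, -1)
    else:
        # after flipping left-right, the diagonal starting at
        # row r has trace offset (cols - 1) - r
        a = np.fliplr(a)
        offsets = range(cols - 1, cols - 1 - rows, -1)
    return pd.Series([np.trace(a, k) for k in offsets])

rectdf = pd.DataFrame(np.arange(15).reshape(5, 3))
print(diag_sums(rectdf, "ullr").tolist())  # [12, 21, 30, 22, 12]
print(diag_sums(rectdf, "urll").tolist())  # [0, 4, 12, 21, 30]
```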
Alex Alifimoff
3

For a 2D numpy array A this might be (?) the shortest code to sum diagonals:

np.bincount(sum(np.indices(A.shape)).flat, A.flat)

To sum the opposite diagonals, you can np.fliplr the array.
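Spelled out on the question's array (variable names mine): i + j is constant along each "upperright to lowerleft" diagonal, so it serves as the bincount bin index. Note this sums every diagonal, ordered by i + j, and returns floats; the question's left-anchored version keeps only the first rows of them.

```python
import numpy as np

A = np.arange(15).reshape(5, 3)

# bin each element by its anti-diagonal index i + j; A's values are the weights
urll = np.bincount(sum(np.indices(A.shape)).flat, A.flat)
print(urll.tolist())  # [0.0, 4.0, 12.0, 21.0, 30.0, 24.0, 14.0]

# flipping left-right turns the "upperleft to lowerright" diagonals
# into that same i + j form
ullr = np.bincount(sum(np.indices(A.shape)).flat, np.fliplr(A).flat)
print(ullr.tolist())  # [2.0, 6.0, 12.0, 21.0, 30.0, 22.0, 12.0]
```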

0

Short answer

See the fast, but complicated function at the end.

Development

Iterating over the traces works, but I'm not sure it is better than the pandas solution. Both involve iteration - over diagonals or columns. Conceptually it is simpler or cleaner, but I'm not sure about speed, especially on large arrays.

Each diagonal has a different length: [12], [9, 13], [6, 10, 14], and so on. That is a big red flag, warning us that a block array operation is difficult if not impossible.

With scipy.sparse I can construct a 2d array that can be summed to give these traces (here x is the same np.arange(15).reshape(5, 3) array as rectdf):

In [295]: from scipy import sparse
In [296]: xs=sparse.dia_matrix(x)
In [297]: xs.data
Out[297]: 
array([[12,  0,  0],
       [ 9, 13,  0],
       [ 6, 10, 14],
       [ 3,  7, 11],
       [ 0,  4,  8],
       [ 0,  1,  5],
       [ 0,  0,  2]])
In [298]: np.sum(xs.data,axis=1)
Out[298]: array([12, 22, 30, 21, 12,  6,  2])

This sparse format stores its data in a 2d array, with the necessary shifts. In fact your pd.concat produces something similar:

In [304]: pd.concat([rectdf.iloc[:, i].shift(-i) for i in range(rectdf.shape[1])], axis=1)
Out[304]: 
    0   1   2
0   0   4   8
1   3   7  11
2   6  10  14
3   9  13 NaN
4  12 NaN NaN

It looks like sparse creates this data array by starting with a np.zeros, and filling it with appropriate indexing:

 data[row_indices, col_indices] = x.ravel()

something like:

In [344]: i=[4,5,6,3,4,5,2,3,4,1,2,3,0,1,2]
In [345]: j=[0,1,2,0,1,2,0,1,2,0,1,2,0,1,2]
In [346]: z=np.zeros((7,3),int)
In [347]: z[i,j]=x.ravel()[:len(i)]
In [348]: z
Out[348]: 
array([[12,  0,  0],
       [ 9, 13,  0],
       [ 6, 10, 14],
       [ 3,  7, 11],
       [ 0,  4,  8],
       [ 0,  1,  5],
       [ 0,  0,  2]])

though I still need a way of creating i,j for any shape. For j it is easy:

j=np.tile(np.arange(3),5)
j=np.tile(np.arange(x.shape[1]),x.shape[0])

Reshaping i

In [363]: np.array(i).reshape(-1,3)
Out[363]: 
array([[4, 5, 6],
       [3, 4, 5],
       [2, 3, 4],
       [1, 2, 3],
       [0, 1, 2]])

leads me to recreating it with:

In [371]: ii=(np.arange(3)+np.arange(5)[::-1,None]).ravel()
In [372]: ii
Out[372]: array([4, 5, 6, 3, 4, 5, 2, 3, 4, 1, 2, 3, 0, 1, 2])

So together:

def all_traces(x):
    jj = np.tile(np.arange(x.shape[1]),x.shape[0])
    ii = (np.arange(x.shape[1])+np.arange(x.shape[0])[::-1,None]).ravel()
    z = np.zeros(((x.shape[0]+x.shape[1]-1),x.shape[1]),int)
    z[ii,jj] = x.ravel()
    return z.sum(axis=1)

It needs more testing over a variety of shapes.
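One way to test it is to compare against np.trace across a few shapes (the function body is repeated so the snippet runs standalone):

```python
import numpy as np

def all_traces(x):
    # same function as above, repeated so this check is self-contained
    jj = np.tile(np.arange(x.shape[1]), x.shape[0])
    ii = (np.arange(x.shape[1]) + np.arange(x.shape[0])[::-1, None]).ravel()
    z = np.zeros(((x.shape[0] + x.shape[1] - 1), x.shape[1]), int)
    z[ii, jj] = x.ravel()
    return z.sum(axis=1)

# offsets run from -(rows - 1) up to cols - 1, same order as the trace loop
for rows, cols in [(5, 3), (3, 5), (4, 4), (1, 6), (6, 1)]:
    x = np.arange(rows * cols).reshape(rows, cols)
    expected = [np.trace(x, k) for k in range(-(rows - 1), cols)]
    assert all_traces(x).tolist() == expected, (rows, cols)
```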

This function is faster than the iteration over traces, even with this small size array:

In [387]: timeit all_traces(x)
10000 loops, best of 3: 70.5 µs per loop
In [388]: timeit [np.trace(x,i) for i in range(-(x.shape[0]-1),x.shape[1])]
10000 loops, best of 3: 106 µs per loop
hpaulj
0

Maybe faster, using a strided view:

import numpy as np
from numpy.lib.stride_tricks import as_strided

def sum_all_diagonals(x):
    rows, cols = x.shape
    if cols > rows:
        x = x.T
        rows, cols = x.shape
    fill = np.zeros((cols - 1, cols), dtype=x.dtype)
    stacked = np.vstack((x, fill, np.fliplr(x), fill, x))
    major_stride, minor_stride = stacked.strides
    strides = major_stride, minor_stride * (cols + 1)
    shape = ((rows + cols - 1)*2, cols)
    return as_strided(stacked, shape, strides).sum(1)

credit https://stackoverflow.com/a/10808134/4191074
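As a sanity check on the question's 5×3 array (the function is repeated so the snippet runs on its own): by my reading of the output layout, the first rows entries are the question's left-anchored "upperleft to lowerright" sums, the next rows + cols - 1 entries are the full set of "upperright to lowerleft" sums, and the last cols - 1 entries are the remaining positive-offset traces:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def sum_all_diagonals(x):
    # function from this answer, repeated so the check is self-contained
    rows, cols = x.shape
    if cols > rows:
        x = x.T
        rows, cols = x.shape
    fill = np.zeros((cols - 1, cols), dtype=x.dtype)
    stacked = np.vstack((x, fill, np.fliplr(x), fill, x))
    major_stride, minor_stride = stacked.strides
    strides = major_stride, minor_stride * (cols + 1)
    shape = ((rows + cols - 1) * 2, cols)
    return as_strided(stacked, shape, strides).sum(1)

x = np.arange(15).reshape(5, 3)
out = sum_all_diagonals(x)
print(out.tolist())
# [12, 21, 30, 22, 12, 0, 4, 12, 21, 30, 24, 14, 2, 6]
```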

Ariel