
I'm hoping someone can help me debug an issue we're seeing with subclassed ndarrays in Spark. Specifically, when a subclassed array is broadcast, it seems to lose the extra information. A trivial example is below:

>>> import numpy as np
>>> 
>>> class Test(np.ndarray):
...     def __new__(cls, input_array, info=None):
...         obj = np.asarray(input_array).view(cls)
...         obj.info = info
...         return obj
...     
...     def __array_finalize__(self, obj):
...         if not hasattr(self, "info"):
...             self.info = getattr(obj, 'info', None)
...         else:
...             print("has info attribute: %s" % getattr(self, 'info'))
... 
>>> test = Test(np.array([[1,2,3],[4,5,6]]), info="info")
>>> print(test.info)
info
>>> print(sc.broadcast(test).value)
[[1 2 3]
 [4 5 6]]
>>> print(sc.broadcast(test).value.info)
None
David
  • This thread solved it: http://stackoverflow.com/questions/26598109/preserve-custom-attributes-when-pickling-subclass-of-numpy-array – David Apr 04 '17 at 03:11
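
For reference, the approach in that linked thread is to override __reduce__ and __setstate__ so the extra attribute travels through pickle, which is the serialization sc.broadcast relies on. A minimal sketch, reusing the Test class from the question:

```python
import pickle
import numpy as np

class Test(np.ndarray):
    def __new__(cls, input_array, info=None):
        obj = np.asarray(input_array).view(cls)
        obj.info = info
        return obj

    def __array_finalize__(self, obj):
        self.info = getattr(obj, 'info', None)

    def __reduce__(self):
        # Append the custom attribute to ndarray's pickled state tuple.
        pickled_state = super().__reduce__()
        return (pickled_state[0], pickled_state[1], pickled_state[2] + (self.info,))

    def __setstate__(self, state):
        # Peel our attribute off the end, then let ndarray restore the rest.
        self.info = state[-1]
        super().__setstate__(state[:-1])

test = Test(np.array([[1, 2, 3], [4, 5, 6]]), info="info")
restored = pickle.loads(pickle.dumps(test))
print(restored.info)  # info
```

With this in place, sc.broadcast(test).value.info should survive the round trip as well, since broadcast pickles the value under the hood.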

1 Answer

At a minimum, you have a small typo: you're checking hasattr(obj, "info") when you should be checking hasattr(self, "info"). Because the check is flipped, info isn't being carried over.

test = Test(np.array([[1, 2, 3], [4, 5, 6]]), info="info")
print(test.info)   # info
test2 = test[1:]
print(test2.info)  # info
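
Note that this only covers views and slices, where __array_finalize__ runs with the source array as obj. sc.broadcast serializes with pickle, and ndarray's default pickling does not include custom attributes, so info still comes back as None after a round trip. A sketch illustrating the difference, using a simplified version of the question's class:

```python
import pickle
import numpy as np

class Test(np.ndarray):
    def __new__(cls, input_array, info=None):
        obj = np.asarray(input_array).view(cls)
        obj.info = info
        return obj

    def __array_finalize__(self, obj):
        # obj is None for explicit construction, e.g. during unpickling
        self.info = getattr(obj, 'info', None)

test = Test(np.array([[1, 2, 3], [4, 5, 6]]), info="info")

view = test[1:]                               # slice: info survives
roundtrip = pickle.loads(pickle.dumps(test))  # pickle: info is lost
print(view.info)       # info
print(roundtrip.info)  # None
```

This is why the pickling fix from the thread linked in the comments is needed for broadcast, even after the __array_finalize__ logic is corrected.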
Jeff Tratner