I am trying to use dask.bag to hold objects of a given class, where each instance captures various properties of a document (title, wordcount, etc.).

This object has some associated methods that set different attributes of the object.

For example:

import dask.bag as db

class Item:
    def __init__(self, value):
        self.value = 'My value is: "{}"'.format(value)

    def modify(self):
        # Mutates the instance in place and returns None
        self.value = 'My value used to be: "{}"'.format(self.value)

def generateItems():
    i = 1
    while i <= 100:
        yield Item(i)
        i += 1

b = db.from_sequence(generateItems())
# looks like:
b.take(1)[0].value #'My value is: "1"'

How do I create a bag of each modify-d instance in the first bag (b)?

Desired output: 'My value used to be: "My value is: "1""' etc.

I tried:

c = b.map(lambda x: x.modify() )

c.take(1)[0].value 
#AttributeError: 'NoneType' object has no attribute 'value'

# Also tried:
d = b.map(lambda x: x[0].modify() )
d.take(1) # TypeError: 'Item' object does not support indexing
C8H10N4O2

1 Answer

The problem here is that c holds the results of running your lambda function, and Item.modify() returns nothing (that is, None). In Dask you are typically expected to return new objects based on the input rather than mutate existing ones (see "How does dask.delayed handle mutable inputs?"). Consider what would happen if multiple tasks operated on the same object in multiple threads or in multiple processes.
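To see where the None comes from, here is a minimal sketch that uses only the Item class from the question, with no Dask involved. The names item and result are just for illustration.

```python
class Item:
    def __init__(self, value):
        self.value = 'My value is: "{}"'.format(value)

    def modify(self):
        # No return statement, so the call evaluates to None
        self.value = 'My value used to be: "{}"'.format(self.value)

item = Item(1)
result = item.modify()   # this is what a lambda like `lambda x: x.modify()` hands back
print(result)            # None
print(item.value)        # the instance itself was changed in place
```

Mapping that lambda over a bag therefore yields a bag of None values, which is why c.take(1)[0].value raises the AttributeError.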

In this simplest case, you could get what you want by adding return self to the end of modify(), or by changing the lambda expression to x.modify() or x (not and: modify() returns None, so x.modify() and x also evaluates to None). But DON'T program this way; create a new object with the desired new value instead.
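As a rough sketch of the "create a new object" approach: the method name modified() below is hypothetical (it is not in the question), and copy.copy is just one way to build the new instance. The example uses the built-in map so it runs without Dask.

```python
import copy

class Item:
    def __init__(self, value):
        self.value = 'My value is: "{}"'.format(value)

    def modified(self):
        # Hypothetical non-mutating counterpart of modify():
        # copy the instance, change the copy, and return it.
        new = copy.copy(self)
        new.value = 'My value used to be: "{}"'.format(self.value)
        return new

items = [Item(i) for i in range(1, 4)]
results = list(map(lambda x: x.modified(), items))

print(results[0].value)  # My value used to be: "My value is: "1""
print(items[0].value)    # My value is: "1"  (the original is untouched)
```

With the bag from the question, the same function should slot into b.map(lambda x: x.modified()), giving a second bag of new Item instances while leaving the items in b unchanged.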

mdurant
  • Useful link, thanks. So no mutations in Dask. FWIW I couldn't get the "changing the lambda expression" approach to work. – C8H10N4O2 Mar 28 '18 at 14:37