I'm trying to build a pipeline with Luigi: first get data from an API, then transform it, and then save it to a MongoDB database. I'm still new to Luigi. My question is: how do I implement the output() method so that it specifies output to MongoDB? And how would I write the requires() method for subsequent tasks?

For the first part, I tried to follow the demo here, but it uses MySQL instead of MongoDB. So I tried:

from luigi.contrib.mongodb import MongoTarget
from pymongo import MongoClient

def output(self):
    # connect to db
    connection = MongoClient(self.host, self.port)
    db_client = connection[self.db_name]
    collection_name = 'myCollection'

    return MongoTarget(db_client, '_id', collection_name)

but it gave me an error like this:

TypeError: Can't instantiate abstract class MongoTarget with abstract methods exists

A quick search suggests the error is related to PyMongo, but the suggested solution didn't fix it.

For the requires() part, I'm not sure how to approach it either. I would like to check whether the records already exist so I don't duplicate them. But my API data has no unique index, so I guess I have to somehow scan over all the records to make sure there are no duplicates.

There isn't much documentation or many examples on using MongoDB with Luigi, so any help is appreciated.
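One possible way to avoid a full scan is to derive the `_id` from the record's content, so identical records always map to the same key. This is only a sketch, assuming that two records count as duplicates exactly when all their fields match and that the records are JSON-serializable; the helper names record_id, with_content_id and deduplicate are hypothetical.

```python
import hashlib
import json


def record_id(record):
    """Return a deterministic hex digest for a JSON-serializable record."""
    # sort_keys makes the digest independent of key order
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def with_content_id(record):
    """Return a copy of the record with `_id` set to its content hash."""
    return {**record, "_id": record_id(record)}


def deduplicate(records):
    """Keep the first occurrence of each distinct record, preserving order."""
    seen = {}
    for record in records:
        doc = with_content_id(record)
        seen.setdefault(doc["_id"], doc)
    return list(seen.values())


# Made-up sample data: the first two records differ only in key order
api_records = [
    {"name": "a", "value": 1},
    {"value": 1, "name": "a"},
    {"name": "b", "value": 2},
]
unique_docs = deduplicate(api_records)
print(len(unique_docs))
```

Because MongoDB enforces uniqueness on `_id`, documents keyed this way could then be written with an upsert, and a repeated record would overwrite itself rather than create a second copy.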

Sam

2 Answers

I have never used the mongodb package myself, but it appears your first problem is due to a misuse of the MongoTarget interface. If you look at the code here:

https://github.com/spotify/luigi/blob/master/luigi/contrib/mongodb.py#L25

you will see that you need to pass an instance of MongoClient (connection in your case, instead of db_client). You are right, though: there is hardly any documentation for non-core Luigi packages. I developed a habit of reading the codebase to understand how to use a given package when working with Luigi (which has now turned into a habit with any library I use).

Having said that, I don't think MongoTarget is an actual target: it doesn't implement the exists method. You should instead use the other targets provided in the module, namely MongoCellTarget, MongoRangeTarget, MongoCollectionTarget, etc. Read their docstrings for further information on what they do!
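The TypeError in the question is what Python raises for any class that leaves an abstract method unimplemented. Here is a minimal sketch of that mechanism using only the standard library. The classes Target, BaseMongoTarget and CollectionTarget and the dict-based fake_client are hypothetical stand-ins, not Luigi's actual code:

```python
from abc import ABC, abstractmethod


class Target(ABC):
    """Stand-in for an abstract target: subclasses must define exists()."""

    @abstractmethod
    def exists(self):
        ...


class BaseMongoTarget(Target):
    """Stores connection details but leaves exists() undefined."""

    def __init__(self, client, database, collection):
        self.client = client
        self.database = database
        self.collection = collection


class CollectionTarget(BaseMongoTarget):
    """Concrete target: it supplies exists(), so it can be instantiated."""

    def exists(self):
        # `client` here is a plain dict {database: {collection: [docs]}},
        # not a real MongoClient
        return self.collection in self.client.get(self.database, {})


fake_client = {"mydb": {"myCollection": []}}

try:
    BaseMongoTarget(fake_client, "mydb", "myCollection")
    base_error = None
except TypeError as exc:
    # Raised because exists() is still abstract on BaseMongoTarget
    base_error = exc

target = CollectionTarget(fake_client, "mydb", "myCollection")
print(type(base_error).__name__)
print(target.exists())
```

This is why swapping db_client for connection does not make the error go away on its own: the base class cannot be instantiated with any arguments, whereas a subclass that defines exists() can.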

Ouanixi
  • You are right, I also found out about that from the [issue](https://github.com/spotify/luigi/pull/2062) section of Luigi's repo, but I haven't had a chance to run it yet. Thanks for sharing. I see, then I will look into the other target classes. – Sam Jan 14 '18 at 15:46
  • I tried passing the MongoClient directly to the Target class, but it still doesn't work. Same error. – Sam Jan 15 '18 at 15:24

The Luigi code base has a test for the luigi.contrib.mongodb package. The test uses MongoCellTarget and MongoRangeTarget. That test worked for me and was good enough to build upon.

The following snippet is derived from that test, and assumes an empty MongoDB instance running on localhost.

import pymongo
from luigi.contrib.mongodb import MongoRangeTarget

HOST = 'localhost'
PORT = 27017
INDEX = 'luigi_test'
COLLECTION = 'luigi_collection'

mongo_client = pymongo.MongoClient(HOST, PORT)
collection = mongo_client[INDEX][COLLECTION]

# Add sample data
test_docs = [
    {'_id': 'person_1', 'age': 11, 'experience': 10, 'content': "Lorem ipsum, dolor sit amet. Consectetur adipiscing elit."},
    {'_id': 'person_2', 'age': 12, 'experience': 22, 'content': "Sed purus nisl. Faucibus in, erat eu. Rhoncus mattis velit."},
    {'_id': 'person_3', 'age': 13, 'content': "Nulla malesuada, fringilla lorem at pellentesque."},
    {'_id': 'person_4', 'age': 14, 'content': "Curabitur condimentum. Venenatis fringilla."}
]

collection.insert_many(test_docs)

# Test reading from MongoDB via MongoRangeTarget
test_values = [
    ('age', [], {}),
    ('age', ['unknown_person'], {}),
    ('age', ['person_1', 'person_3'], {'person_1': 11, 'person_3': 13}),
    ('age', ['person_1', 'person_3', 'person_5'], {'person_1': 11, 'person_3': 13}),
    ('experience', ['person_1', 'person_3'], {'person_1': 10}),
    ('experience', ['person_1', 'person_3', 'person_5'], {'person_1': 10}),
]

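for field, ids, result in test_values:
    # Arguments: client, database name, collection name, document ids, field
    target = MongoRangeTarget(mongo_client, INDEX, COLLECTION, ids, field)
    # assertEqual only exists inside a unittest.TestCase; use a plain assert here
    assert result == target.read()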
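To make the expected results in test_values easier to follow, here is an in-memory sketch of the lookup they describe: return the field value for every requested `_id` that exists and actually has that field. The function range_read is a hypothetical stand-in operating on a plain list of dicts; it is not Luigi's implementation and needs no running MongoDB.

```python
def range_read(docs, ids, field):
    """Return {_id: value} for docs whose _id is in ids and that contain field."""
    wanted = set(ids)
    return {
        doc["_id"]: doc[field]
        for doc in docs
        if doc["_id"] in wanted and field in doc
    }


# Same ids and fields as the sample data above, without the text content
docs = [
    {"_id": "person_1", "age": 11, "experience": 10},
    {"_id": "person_2", "age": 12, "experience": 22},
    {"_id": "person_3", "age": 13},
    {"_id": "person_4", "age": 14},
]

# person_5 does not exist, so it is simply absent from the result
print(range_read(docs, ["person_1", "person_3", "person_5"], "age"))
# → {'person_1': 11, 'person_3': 13}

# person_3 has no 'experience' field, so only person_1 is returned
print(range_read(docs, ["person_1", "person_3"], "experience"))
# → {'person_1': 10}
```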
yoda_droid