I am trying to use Luigi to build a small scraping pipeline, and I'm using Pillow to save the images from the pages I scrape. However, I'm struggling with the output when I try to save each image in a loop (e.g. I want to save img_1, img_2, img_3, etc. in the output folder). I tried to pass an "image_id" parameter to the output() method, but it doesn't work and I can't figure out how to accomplish this.

class DownloadImages(luigi.Task):

    def requires(self):
        pass # taking out dependencies for this example

    def output(self, image_id):
        return luigi.LocalTarget(f"img/img_{image_id}.jpeg")

    def run(self):
        resp = requests.get("https://my-site.com")
        soup = BeautifulSoup(resp.content, "html.parser")
        images_list = soup.select("img")
        for image_id in range(len(images_list)):
            image_url = images_list[image_id]["src"]
            img = Image.open(requests.get(image_url, stream=True).raw)
            img.save(self.output(image_id).path)
Andrea
  • What do you mean by "it doesn't work"? What happens? – jarmod Aug 21 '21 at 17:47
  • I get this: TypeError: output() missing 1 required positional argument: 'image_id' – Andrea Aug 21 '21 at 17:49
  • check if you are running what you're seeing on the screen or if you still have an old file in your debugger – Lukas Schmid Aug 21 '21 at 18:22
  • The error message doesn't seem to match the code you've shown us. Please correct one of them. – jarmod Aug 21 '21 at 18:33
  • I run the workflow from the local scheduler as such: "python -m luigi --module scraper DownloadImages". The code works fine as a standard Python class as per answer from @LukasSchmid but in Luigi the task fails with the above error. – Andrea Aug 21 '21 at 18:41
  • Check out the appropriate section of the luigi docs here: https://luigi.readthedocs.io/en/stable/tasks.html#task-output or split up your task into three tasks as explained here: https://luigi.readthedocs.io/en/stable/luigi_patterns.html?highlight=wrappertask#triggering-many-tasks . I would go for the latter option because it looks like your three subtasks are not coupled. – Joooeey Sep 09 '21 at 08:23

1 Answer

New answer, since it's completely different. Run the task with: python -m luigi --module scraper DownloadImages --local-scheduler

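import os

from PIL import Image
import requests
import luigi

class DownloadImages(luigi.Task):
    # "*" is a placeholder that run() swaps for the image name.
    # Caveat: LocalTarget checks this literal path, not a glob pattern,
    # so Luigi will never see this task as already complete.
    save_path = "img/*.jpg"

    def output(self):
        # Luigi calls output() itself, with no extra arguments.
        return luigi.LocalTarget(self.save_path)

    def run(self):
        img_ids = [1, 2, 3]
        # Pillow does not create missing directories, so make sure img/ exists.
        os.makedirs(os.path.dirname(self.save_path), exist_ok=True)
        for img_id in img_ids:
            img = Image.open(requests.get("https://i.kym-cdn.com/entries/icons/original/000/000/007/bd6.jpg", stream=True).raw)
            img.save(self.save_path.replace("*", f"img_{img_id}"))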

The point of Luigi is to define a workflow. It uses output() to pass the location of your data between tasks. output() cannot take additional arguments because Luigi calls it itself, with no arguments, for example to check whether the task is already complete. You are not supposed to call it yourself, except to get the location where you want to save your output.
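
To make the TypeError from the question concrete, here is a minimal sketch using only the standard library. MiniTask is a hypothetical stand-in for luigi.Task, reduced to a completeness check that calls output() with no arguments; it is not Luigi's actual code. BrokenDownload, FixedDownload, describe and the fixed image_ids tuple are likewise made up for illustration, and the sketch assumes the image ids are known before the task runs.

```python
import os


class MiniTask:
    """Hypothetical stand-in for luigi.Task, reduced to the completeness check."""

    def complete(self):
        # The framework, not your code, calls output(), and it passes no arguments.
        outputs = self.output()
        return bool(outputs) and all(os.path.exists(path) for path in outputs)


class BrokenDownload(MiniTask):
    def output(self, image_id):  # extra argument the framework never supplies
        return [f"img/img_{image_id}.jpeg"]


class FixedDownload(MiniTask):
    image_ids = (1, 2, 3)  # assumption: ids are known before the task runs

    def output(self):
        # One path per image, derived from task state instead of an argument.
        return [f"img/img_{i}.jpeg" for i in self.image_ids]


def describe(task):
    """Return 'ok' if complete() can be evaluated, otherwise the error text."""
    try:
        task.complete()
        return "ok"
    except TypeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(describe(BrokenDownload()))  # the "missing 1 required positional argument" error
    print(describe(FixedDownload()))   # ok
    print(FixedDownload().output())
```

In a real pipeline the same idea would mean subclassing luigi.Task and returning a list of luigi.LocalTarget objects from a parameterless output(), then saving each image to the matching target's path inside run().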

Disclaimer: I might have used it wrong too. Please check the documentation.

Lukas Schmid