I would like to crawl 3 million web pages in a day. Because web content varies (HTML, PDF, etc.), I need to use Selenium, Playwright, or a similar tool. I noticed that to use Selenium on Google Dataflow, one has to build a custom container.

  1. Is it a good choice to use Selenium inside ParDo Fns? Can a single instance of Selenium be shared across multiple instances?
  2. Does the same apply to Playwright? Should I build a custom image for it as well?
Poala Astrid
Sunil

1 Answer

You can do anything in a Python DoFn that you can do from Python. Yes, I would definitely use custom containers for complex dependencies like this.

You can share an instance of Selenium (or any other object) per DoFn instance by initializing it in your setup method. You can share it across the whole process by using a module-level global or something like Beam's shared utility (apache_beam.utils.shared), keeping in mind that the object may then be accessed by more than one thread at once.
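As a rough illustration of the per-DoFn-instance approach, here is a minimal sketch. The names FetchPageFn and make_driver are hypothetical, the Chrome flags are assumptions about a typical headless setup, and it assumes Selenium and a Chrome binary are already baked into the custom container. The import fallback is only there so the class can be exercised without Beam installed.

```python
try:
    import apache_beam as beam
    _DoFnBase = beam.DoFn
except ImportError:  # fallback so the sketch can be tried without Beam
    _DoFnBase = object


def make_driver():
    """Hypothetical helper: build a headless Chrome driver.

    Assumes Selenium and Chrome are installed in the worker image.
    """
    from selenium import webdriver  # imported lazily, on the worker

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    return webdriver.Chrome(options=options)


class FetchPageFn(_DoFnBase):
    """Hypothetical DoFn that keeps one browser per DoFn instance."""

    def __init__(self, driver_factory=make_driver):
        super().__init__()
        self._driver_factory = driver_factory
        self._driver = None

    def setup(self):
        # Called once per DoFn instance, before any elements are processed.
        self._driver = self._driver_factory()

    def process(self, url):
        # Reuses the browser created in setup() for every element.
        self._driver.get(url)
        yield url, self._driver.page_source

    def teardown(self):
        # Best-effort cleanup when the DoFn instance is discarded.
        if self._driver is not None:
            self._driver.quit()
            self._driver = None
```

In a pipeline this would be applied as beam.ParDo(FetchPageFn()). Passing the factory as a constructor argument is just a design choice that makes it easy to swap in a different browser or a fake driver when testing.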

robertwb