
I am trying to use a document loader for website URLs. However, with UnstructuredURLLoader some websites return:

(Document(page_content='Please enable JS and disable any ad blocker',
 metadata={'source': 'https://wellfound.com/company/chorus-one'}) 

So I wanted to use SeleniumURLLoader, which the documentation suggests for overcoming this kind of issue.

I installed the dependencies with pip install selenium webdriver_manager, which gave me these versions:

 selenium                      4.10.0
 webdriver-manager             3.8.6

Then I ran:

from langchain.document_loaders import UnstructuredURLLoader, SeleniumURLLoader

loaders = SeleniumURLLoader(urls=urls)
data = loaders.load()

I keep getting errors:

The version of chrome cannot be detected. Trying with latest driver version

WebDriverException: Message: unknown error: cannot find Chrome binary Stacktrace: #0 0x55c7d5ee44e3 #1 0x55c7d5c13c76 #2 0x55c7d5c3a757 #3 0x55c7d5c39029 #4 0x55c7d5c77ccc #5 0x55c7d5c7747f #6 0x55c7d5c6ede3 #7 0x55c7d5c442dd #8 0x55c7d5c4534e #9 0x55c7d5ea43e4 #10 0x55c7d5ea83d7 #11 0x55c7d5eb2b20 #12 0x55c7d5ea9023 #13 0x55c7d5e771aa #14 0x55c7d5ecd6b8 #15 0x55c7d5ecd847 #16 0x55c7d5edd243 #17 0x7fbddad0d609 start_thread

What am I doing wrong?

mCs
Perhaps you should verify whether your URLs are directed towards any of the following non-HTML file types: `jpg`, `jpeg`, `JPG`, `JPEG`, `png`, `PNG`, `svg`, `gif`, `GIF`, `ttf`, `woff`, `js`, `json`, `css`, `css2`, `ico`, `xml`, `mp3`, `mp4`, `php`, `rdf`, `axd`, `eot`, `pdf`, `doc`, `docx`, `xlsx`. If that is indeed the case, you must eliminate such URLs, as these files cannot be processed using either `UnstructuredURLLoader` or `SeleniumURLLoader`. – Carlos Luis Rivera Jul 20 '23 at 13:31
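
To act on the comment above, a rough sketch of such a filter could look like this. The extension list is copied from the comment (lowercased); the helper names `is_probably_html` and `filter_html_urls` are made up for illustration, and the check only looks at the URL path, so it cannot catch a non-HTML resource served from an extensionless URL.

```python
import posixpath
from urllib.parse import urlparse

# Extensions listed in the comment, lowercased; compared case-insensitively.
NON_HTML_EXTENSIONS = {
    "jpg", "jpeg", "png", "svg", "gif", "ttf", "woff", "js", "json",
    "css", "css2", "ico", "xml", "mp3", "mp4", "php", "rdf", "axd",
    "eot", "pdf", "doc", "docx", "xlsx",
}


def is_probably_html(url: str) -> bool:
    """Return False when the URL path ends in one of the listed extensions."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    return ext not in NON_HTML_EXTENSIONS


def filter_html_urls(urls):
    """Keep only URLs that do not point at a listed non-HTML file type."""
    return [u for u in urls if is_probably_html(u)]
```

You would then pass `filter_html_urls(urls)` to the loader instead of `urls`.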

1 Answer


The error "cannot find Chrome binary" means Selenium has no browser to drive. You will need to install Chromium: sudo apt-get install chromium
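
If you want to confirm the browser is actually visible before loading, something along these lines may help. This is only a sketch: the candidate executable names are guesses that vary by distribution, `find_chrome_binary` and `build_loader` are hypothetical helpers, and I am assuming your langchain version's SeleniumURLLoader accepts a `binary_location` argument, so check its signature first.

```python
import shutil
from typing import Iterable, Optional

# Common executable names; an assumption, adjust for your system.
CANDIDATE_NAMES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
)


def find_chrome_binary(names: Iterable[str] = CANDIDATE_NAMES) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_loader(urls):
    """Create a SeleniumURLLoader pointed at the detected browser binary."""
    # Imported lazily so the PATH check works without langchain installed.
    from langchain.document_loaders import SeleniumURLLoader

    binary = find_chrome_binary()
    if binary is None:
        raise RuntimeError("No Chrome/Chromium binary found on PATH; install one first.")
    # binary_location is assumed to exist in your langchain version.
    return SeleniumURLLoader(urls=urls, binary_location=binary)
```

If `find_chrome_binary()` returns None after the install, the package probably put the executable under a different name or outside PATH.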

Jason