
Although this is most likely a newbie question, I struggled to find any information online to help me with my problem.

My code is meant to scrape onion sites. Despite being able to connect to Tor, and the web scraper working fine as a stand-alone, when I tried combining both code blocks I kept getting numerous errors regarding the keyword argument in my code; even attempting to delete it presents me with bugs. I am a bit lost on what I'm supposed to do.

import socket
import socks
import requests
from pywebcopy import save_webpage

socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

def get_tor_session():
    session = requests.session()
    # Tor uses port 9050 as the default SOCKS port
    session.proxies = {'http':  'socks5h://127.0.0.1:9050',
                       'https': 'socks5h://127.0.0.1:9050'}
    return session


session = get_tor_session()
print(session.get("http://httpbin.org/ip").text)
  
kwargs = {'project_name': 'site folder'}

save_webpage(
    # url of the website
    session.get(url="http://elfqv3zjfegus3bgg5d7pv62eqght4h6sl6yjjhe7kjpi2s56bzgk2yd.onion"),

    # folder where the copy will be saved
    project_folder=r"C:\Users\admin\Desktop\WebScraping",
    **kwargs
)

In this case, I'm presented with the following error:

TypeError: Cannot mix str and non-str arguments

Attempting to replace

project_folder=r"C:\Users\admin\Desktop\WebScraping",
**kwargs

with

kwargs, 
project_folder=r"C:\Users\admin\Desktop\WebScraping"

presents me with this error:

TypeError: save_webpage() got multiple values for argument

Traceback for the first error:

  File "C:\Users\admin\Desktop\WebScraping\tor.py", line 43, in <module>
    **kwargs

  File "C:\Users\admin\anaconda3\lib\site-packages\pywebcopy\api.py", line 58, in save_webpage
    config.setup_config(url, project_folder, project_name, **kwargs)

  File "C:\Users\admin\anaconda3\lib\site-packages\pywebcopy\configs.py", line 189, in setup_config
    SESSION.load_rules_from_url(urljoin(project_url, '/robots.txt'))

  File "C:\Users\admin\anaconda3\lib\urllib\parse.py", line 487, in urljoin
    base, url, _coerce_result = _coerce_args(base, url)

  File "C:\Users\admin\anaconda3\lib\urllib\parse.py", line 120, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")

I'd really appreciate an explanation of what causes such a bug, and how to avoid it in the future.

AanTuning
  • Welcome to SO. You use `**kwargs` when you define a function, not when you use it. – ewokx Feb 07 '22 at 02:07
  • Looking at the example usage for `pywebcopy`, the function `save_webpage` expects a string for the `url` keyword parameter, not a `Response` object. Why are you making an HTTP GET request there? That's probably where the initial type error is coming from. – Paul M. Feb 07 '22 at 02:11
  • It's great that you posted the error, but post the full traceback message so we can easily spot the failing line. – tdelaney Feb 07 '22 at 03:02
  • @ewong - you can expand dictionaries into keyword arguments when calling a function. – tdelaney Feb 07 '22 at 03:09
  • Apologies, I missed that, i edited to add the traceback for the first error. – AanTuning Feb 07 '22 at 03:12
  • If so, when should I make an HTTP GET request? @Paul M. – AanTuning Feb 07 '22 at 03:23
  • The problem is in pywebcopy - I'm not familiar with that code, but it seems like `save_webpage` wants a URL (string) as its first parameter, but you are doing a `session.get`, which returns a response object. This confuses urllib, which is expecting a string. – tdelaney Feb 07 '22 at 03:23
  • What module do you recommend I use instead of pywebcopy, Beautiful Soup maybe? @tdelaney – AanTuning Feb 07 '22 at 03:26
  • Not sure. The docs for pywebcopy are at https://pypi.org/project/pywebcopy/. I think the first step is to make sure you are using it right. I haven't used it but if you look at `1.5 Authentication and Cookies` it seems like you want to configure its session instead of using your own requests.session. – tdelaney Feb 07 '22 at 03:31
  • I have read the documentation; correct me if I'm wrong, but doesn't authentication aid in scraping websites that require it? This is certainly useful, but I don't believe it has any relation to my issue, which I suspect is caused by a misconfiguration between the keyword argument and the code @tdelaney – AanTuning Feb 07 '22 at 03:39
  • You are setting up a requests session to handle proxies, but it seems from the documentation that you want to configure pywebcopy's own session info instead of trying to pass in your own. That section was an example for authentication, but it may be a hint about how to configure proxies (a sketch of this idea follows these comments). – tdelaney Feb 07 '22 at 03:43
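
A sketch of the idea from the last comment above, assuming that the module-level SESSION visible in the traceback (pywebcopy/configs.py) behaves like a requests.Session and honors requests-style proxy settings; neither is confirmed against the pywebcopy documentation:

import pywebcopy.configs as configs

# Assumption: pywebcopy's shared SESSION is requests.Session-like
# (the traceback calls SESSION.load_rules_from_url on it), so setting
# socks5h proxies here should route its internal requests through Tor.
configs.SESSION.proxies = {
    'http':  'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}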

2 Answers


Not sure why this hasn't been answered yet. As mentioned in my comment, simply change this:

save_webpage(
    # url of the website
    session.get(url=...),

    # folder where the copy will be saved            
    project_folder=r"C:\Users\admin\Desktop\WebScraping",
    **kwargs
)

To:

save_webpage(
    # url of the website
    url=...,

    # folder where the copy will be saved            
    project_folder=r"C:\Users\admin\Desktop\WebScraping",
    **kwargs
)

save_webpage makes the request internally, so it expects the URL as a string rather than a Response object from session.get.
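
For completeness, a corrected end-to-end sketch built from the question's own values (it keeps the global SOCKS patch from the question so that the request save_webpage makes internally also goes through Tor; note that resolving .onion names may additionally require the getaddrinfo patch from the answer below, since the local resolver cannot handle them):

import socket
import socks
from pywebcopy import save_webpage

# Global patch from the question: every new socket is routed
# through the local Tor SOCKS proxy on port 9050.
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

save_webpage(
    # Pass the URL itself; save_webpage fetches it internally.
    url="http://elfqv3zjfegus3bgg5d7pv62eqght4h6sl6yjjhe7kjpi2s56bzgk2yd.onion",
    # folder where the copy will be saved
    project_folder=r"C:\Users\admin\Desktop\WebScraping",
    project_name="site folder",
)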

Paul M.

SOLVED

Adding the following code resolved the issue:

def getaddrinfo(*args):
    # args[0] is the host, args[1] the port; 6 == IPPROTO_TCP.
    # Returning the hostname unresolved skips the local DNS lookup.
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, '', (args[0], args[1]))]

socket.getaddrinfo = getaddrinfo
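
Presumably this works because socket.create_connection calls socket.getaddrinfo before the SOCKS socket ever sees the hostname, and the local DNS resolver cannot resolve .onion names; the patch hands the hostname back unresolved so the Tor proxy resolves it instead. With the patch above in place, a lookup that would otherwise raise socket.gaierror now passes the name straight through (the hostname is illustrative):

print(socket.getaddrinfo("example.onion", 80))
# [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('example.onion', 80))]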
AanTuning
  • would you be able to explain a bit further the answer? – user7440787 Feb 07 '22 at 14:26
  • Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Community Feb 07 '22 at 14:27