I'm making a Discord bot in Python that scrapes Hack The Box data. It already works, but I want to use async with aiohttp to speed up the requests for each member's profile.

In the synchronous version, I wrote a login function that first makes a GET request to get the token from the login page, then makes a POST request with the token, email and password.

In the asynchronous version with aiohttp, after I make my POST request, my session is not logged in.

Here is the synchronous version, shortened a little for performance testing:

import requests
import re
import json
from os import path
from scrapy.selector import Selector
import config as cfg
from timeit import default_timer

class HTBot():
    def __init__(self, email, password, api_token=""):
        self.email = email
        self.password = password
        self.api_token = api_token

        self.session = requests.Session()
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.85 Safari/537.36"
        }
        self.payload = {'api_token': self.api_token}

        if path.exists("users.txt"):
            with open("users.txt", "r") as f:
                self.users = json.loads(f.read())
        else:
            self.users = []


    def login(self):
        req = self.session.get("https://www.hackthebox.eu/login", headers=self.headers)

        html = req.text
        csrf_token = re.findall(r'type="hidden" name="_token" value="(.+?)"', html)

        if not csrf_token:
            return False

        data = {
            "_token": csrf_token[0],
            "email": self.email,
            "password": self.password
        }

        req = self.session.post("https://www.hackthebox.eu/login", data=data, headers=self.headers)

        if req.status_code == 200:
            print("Logged in to HTB!")
            self.session.headers.update(self.headers)
            return True

        print("Login failed.")
        return False


    def extract_user_info(self, htb_id):
        infos = {}
        req = self.session.get("https://www.hackthebox.eu/home/users/profile/" + str(htb_id), headers=self.headers)

        if req.status_code == 200:
            body = req.text
            html = Selector(text=body)

            infos["username"] = html.css('div.header-title > h3::text').get().strip()
            infos["avatar"] = html.css('div.header-icon > img::attr(src)').get()
            infos["points"] = html.css('div.header-title > small > span[title=Points]::text').get().strip()
            infos["systems"] = html.css('div.header-title > small > span[title="Owned Systems"]::text').get().strip()
            infos["users"] = html.css('div.header-title > small > span[title="Owned Users"]::text').get().strip()
            infos["respect"] = html.css('div.header-title > small > span[title=Respect]::text').get().strip()
            infos["country"] = Selector(text=html.css('div.header-title > small > span').getall()[4]).css('span::attr(title)').get().strip()
            infos["level"] = html.css('div.header-title > small > span::text').extract()[-1].strip()
            infos["rank"] = re.search(r'position (\d+) of the Hall of Fame', body).group(1)
            infos["challs"] = re.search(r'has solved (\d+) challenges', body).group(1)
            infos["ownership"] = html.css('div.progress-bar-success > span::text').get()

            return infos

        return False


    def refresh_user(self, htb_id, new=False):
        users = self.users

        for user in users:
            if user["htb_id"] == htb_id:
                infos = self.extract_user_info(htb_id)


    def refresh_all_users(self):
        users = self.users

        for user in users:
            self.refresh_user(user["htb_id"])

            elapsed = default_timer() - START_TIME
            time_completed_at = "{:5.2f}s".format(elapsed)
            print("{0:<30} {1:>20}".format(user["username"], time_completed_at))

        print("The users have been updated!")

htbot = HTBot(cfg.HTB['email'], cfg.HTB['password'], cfg.HTB['api_token'])
htbot.login()

START_TIME = default_timer()
htbot.refresh_all_users()

Then, my async rewrite, covering only the login function:

import asyncio
import re

import aiohttp
import config as cfg

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.85 Safari/537.36"
}

LOGIN_LOCK = asyncio.Lock()

async def login():
    async with LOGIN_LOCK:
        async with aiohttp.TCPConnector(share_cookies=True) as connector:
            async with aiohttp.ClientSession(connector=connector, headers=headers) as session:
                async with session.get("https://www.hackthebox.eu/login") as req:

                    html = await req.text()
                    csrf_token = re.findall(r'type="hidden" name="_token" value="(.+?)"', html)

                    if not csrf_token:
                        return False

                    payload = {
                        "_token": csrf_token[0],
                        "email": cfg.HTB['email'],
                        "password": cfg.HTB['password']
                    }

                async with session.post('https://www.hackthebox.eu/login', data=payload) as req:
                    print(await req.text())

                exit()


async def main():
    await login()

asyncio.run(main())

I think I'm overcomplicating things with the connector, the lock, etc., but I've been working on this for two days now and I'm running out of ideas. For now I'm just trying to log in with this POST request.

I also compared the two requests, one from Requests and one from aiohttp, in Wireshark. The only difference is that the aiohttp one doesn't send keep-alive and has cookies. (I already tried manually setting the header "connection: keep-alive", but it doesn't change anything.) However, according to the documentation, keep-alive should be enabled by default, so I don't understand.

(In the screenshot, the 301 status codes are expected: to see my HTTP requests I had to use http instead of https.)

Wireshark screenshot: https://files.catbox.moe/bignh0.PNG

Thanks in advance for any help!

Since I'm new to asynchronous programming, any advice is welcome. Unfortunately, almost everything I've read about it online is outdated for Python 3.7+ and doesn't use the new syntax.

mxrch

1 Answer

Okay, I finally switched to httpx and it worked like a charm. I really don't know why aiohttp wouldn't work.
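
The answer does not include the httpx code, so the following is only a minimal sketch of what such a version might look like, not the code that was actually run. It reuses the login URL, profile URL and CSRF regex from the question. The helper names (`extract_csrf_token`, `fetch_profiles`), the `limit` parameter and the `follow_redirects=True` setting (which assumes httpx 0.20 or later) are assumptions made for illustration. The key idea is that one client is shared by every request, so the cookies set during login are sent with the profile requests.

```python
import asyncio
import re

LOGIN_URL = "https://www.hackthebox.eu/login"
PROFILE_URL = "https://www.hackthebox.eu/home/users/profile/{}"
TOKEN_RE = re.compile(r'type="hidden" name="_token" value="(.+?)"')


def extract_csrf_token(html):
    """Return the hidden _token value from the login page, or None."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


async def login(client, email, password):
    """Log in through a shared client so its cookie jar keeps the session."""
    page = await client.get(LOGIN_URL)
    token = extract_csrf_token(page.text)
    if token is None:
        return False
    data = {"_token": token, "email": email, "password": password}
    resp = await client.post(LOGIN_URL, data=data)
    # Same success check as the synchronous version in the question.
    return resp.status_code == 200


async def fetch_profiles(client, htb_ids, limit=5):
    """Fetch profile pages concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)  # hypothetical cap, to be polite to the site

    async def fetch(htb_id):
        async with sem:
            resp = await client.get(PROFILE_URL.format(htb_id))
            return htb_id, resp.text

    return dict(await asyncio.gather(*(fetch(i) for i in htb_ids)))


async def main(email, password, htb_ids):
    import httpx  # third-party; imported here so the helpers load without it

    # Assumption: follow redirects, as requests does by default.
    async with httpx.AsyncClient(follow_redirects=True) as client:
        if not await login(client, email, password):
            return {}
        return await fetch_profiles(client, htb_ids)
```

Something like `asyncio.run(main(cfg.HTB['email'], cfg.HTB['password'], [1, 2, 3]))` would then return a dict mapping each id to its profile HTML, which could be fed to the existing Selector parsing.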

mxrch