
I am writing a Python script that scrapes colors from a website and links them to the elements that use them (for example, I would like to associate every <p> element with its color).

My approach is to fetch the website's CSS, extract every hex color in it, and link each color to its selector.

The problem is in getting the CSS URL. I am using Beautiful Soup to parse the HTML, but when I tried to fetch the CSS I first got this error:

MissingSchema: Invalid URL '/_next/static/css/19cb64e37006115a.css': No scheme supplied. Perhaps you meant http:///_next/static/css/19cb64e37006115a.css?
So I added a snippet that adds the http scheme when it is missing.
However, I now get a new error: InvalidURL: Invalid URL 'http:///_next/static/css/19cb64e37006115a.css': No host supplied.

So there is still something wrong with how I build the CSS URL.

Here is the full code:

import re
import cssutils
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlsplit, urlunsplit


url = 'https://www.endy.com/'

# Retrieve the HTML code of the website
response = requests.get(url)
html = response.text

# Use BeautifulSoup to find the CSS file(s) and extract the URLs
soup = BeautifulSoup(html, 'html.parser')
css_urls = [link['href'] for link in soup.find_all('link', rel='stylesheet')]

# Create a dictionary to store the color and corresponding selectors
color_dict = {}

# Retrieve the CSS file(s) and parse the CSS rules
for css_url in css_urls:
    # Add default scheme if none is provided
    if not urlsplit(css_url).scheme:
        css_url = urlunsplit(('http',) + urlsplit(css_url)[1:])

    css_response = requests.get(css_url)
    css_text = css_response.text
    sheet = cssutils.parseString(css_text)

    # Extract CSS rules containing hexadecimal color codes
    for rule in sheet:
        if rule.type == rule.STYLE_RULE:
            css_text = rule.selectorText + ' {' + rule.style.cssText + '}'
            hex_colors = re.findall(r'#(?:[0-9a-fA-F]{3}){1,2}\b', css_text)
            if hex_colors:
                for color in hex_colors:
                    if color not in color_dict:
                        color_dict[color] = []
                    color_dict[color].append(rule.selectorText)

# Print the color and corresponding selectors
for color, selectors in color_dict.items():
    print(f"Color: {color}")
    print(f"Selectors: {', '.join(selectors)}")
    print("------------------------------")

Any help will be greatly appreciated. Thank you!

The_spider

1 Answer


Anyway, I get the new error: InvalidURL: Invalid URL 'http:///_next/static/css/19cb64e37006115a.css': No host supplied.

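The href is relative to the site root, so adding only a scheme leaves the URL without a host. You need to prepend the site's scheme and host: for example, if the website is served at http://localhost:4200, the full URL becomes http://localhost:4200/_next/static/css/19cb64e37006115a.css.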
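One way to do this is with urljoin from the standard library, which resolves a relative href against the URL of the page it came from. The following is a minimal sketch, not a drop-in fix: the helper name resolve_css_urls and the cdn.example.com and example.com hrefs are made up for illustration, while the page URL and the first href come from the question.

```python
from urllib.parse import urljoin


def resolve_css_urls(page_url, hrefs):
    """Resolve each stylesheet href against the URL of the page it came from.

    resolve_css_urls is a hypothetical helper name used for this sketch.
    """
    return [urljoin(page_url, href) for href in hrefs]


page_url = 'https://www.endy.com/'
hrefs = [
    '/_next/static/css/19cb64e37006115a.css',  # root-relative, from the question
    '//cdn.example.com/a.css',                 # scheme-relative (made-up example)
    'https://example.com/b.css',               # already absolute (made-up example)
]

for css_url in resolve_css_urls(page_url, hrefs):
    print(css_url)
```

In the question's loop, this would mean replacing the scheme-patching lines with css_url = urljoin(url, css_url). Hrefs that are already absolute pass through unchanged, so no separate check for a missing scheme is needed.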

idvr