I have the following code:

pattern = re.compile(r"^\s\s<strong>.*</strong>$")
matches = []

with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

for line in content:
    matches = findall(pattern, line)

print(matches)

I have checked that the pattern works and matches with the strings found in the html file. However, the findall() function still returns an empty list. Is there something I've done wrong here?

EDIT: An error was pointed out and I fixed it. The matches list is still empty once the code is run.

pattern = re.compile(r"^\s\s<strong>.*</strong>$")
matches = []

with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

for line in content:
    if findall(pattern, line) != []:
        matches.append(findall(pattern, line))

print(matches)

Here is a shorter version that produces the same problem. Hope this helps.

matches = []
with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

matches = findall("^\s\s<strong>.*</strong>$", content)

print(matches)

Source HTML: view-source:https://spotifycharts.com/regional/au/daily/latest

kaido
  • This will only find matches if they are in the last line. Maybe you want the print inside the loop? – Mark Sep 16 '20 at 02:49
  • Welcome to SO! Looks like you're parsing HTML with regex. Can you show the input HTML and expected output and provide a [mcve]? See [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) and [this](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Thanks. – ggorlen Sep 16 '20 at 02:52
  • You are mixing `re.compile` and `findall` (without `re.`). It's generally better to share code that works by itself (a minimal, reproducible example, as indicated by @ggorlen). For a useful answer, you'll need to provide a relevant sample of HTML as well, or at least provide a link to a public page that's a good example of the HTML you're parsing. – Grismar Sep 16 '20 at 02:55
  • If you are parsing HTML I suggest using BeautifulSoup. – Shivansh Potdar Sep 16 '20 at 03:04
  • Are you really looking for line(s) with exactly two spaces at the front of them, followed exactly by a quantity inside begin/end tags? That's all your RE is going to match. – CryptoFool Sep 16 '20 at 03:04
  • @Steve Yes, I'm looking for song names which are all within tags. They all have two spaces before the tags. – kaido Sep 16 '20 at 03:06
  • @kaido, and that's the only thing on the line? I think iterating over lines removes white space from the front of each line, so I think your code is never going to match anything. I'd suggest taking away the `^\s\s` and `$` and see what you get. If that tag really is the only thing on a line, you still need to get rid of the `\s\s`. – CryptoFool Sep 16 '20 at 03:08
  • @Steve Ahhh, removing the ^ and $ worked. "\s\s<strong>.*</strong>" returns all the lines with song names in them. Thanks! – kaido Sep 16 '20 at 03:13
  • Cool. There's other stuff on these lines, right? That was your problem. `^` means the beginning of the line, and `$` means the end of the line. You could also have changed your expression to `^.*\s\s<strong>.*</strong>.*$`. – CryptoFool Sep 16 '20 at 03:19
  • I moved the final issue to an answer to wrap things up – CryptoFool Sep 16 '20 at 03:22
  • Note that `content = current_file.read()` means that `content` contains one large string. Doing `for line in content:` literally iterates *by character*, not by line. – Steven Rumbalski Sep 16 '20 at 03:23
  • Instead of editing your question to say "solved," you should mark the answer that solved your problem as accepted. If no answer solved it, you can post your own answer. If someone's comment solved it, you can ask them to post it as an answer. – Brian McCutchon Sep 16 '20 at 03:24
  • Didn't know that haha. Thanks for letting me know! @StevenRumbalski – kaido Sep 16 '20 at 03:40
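
To illustrate the point about `read()` raised in the comments above, here is a minimal sketch. The sample string is made up and only stands in for the real file contents. It suggests that looping over the result of `read()` yields single characters, while `splitlines()` or the `re.MULTILINE` flag lets `^` and `$` apply to each line:

```python
import re

# Hypothetical stand-in for the file contents; the real page differs.
content = "<td>\n  <strong>Song A</strong>\n</td>\n  <strong>Song B</strong>\n"

pattern = re.compile(r"^\s\s<strong>.*</strong>$")

# Iterating over a string yields one character at a time, so nothing can match.
by_char = [c for c in content if pattern.findall(c)]

# splitlines() yields whole lines, with leading whitespace preserved.
by_line = [line for line in content.splitlines() if pattern.findall(line)]

# Alternatively, re.MULTILINE makes ^ and $ match at every line boundary,
# so the whole string can be searched in one call.
multiline = re.findall(r"^\s\s<strong>.*</strong>$", content, flags=re.MULTILINE)

print(by_char)
print(by_line)
print(multiline)
```

Both the per-line loop and the `re.MULTILINE` call only help when the tag pair really is the only thing on the line; otherwise the anchors still have to go.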

2 Answers

Using regex to parse HTML is like using a baseball bat to clean someone's teeth. Baseball bats are nice tools but they solve different problems than dental scalers.

Python has an HTML parser called BeautifulSoup which you can install with pip install beautifulsoup4:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> html = requests.get("https://spotifycharts.com/regional/au/daily/latest").text
>>> bs = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
>>> [e.text for e in bs.select(".chart-table-track strong")][:3]
['WAP (feat. Megan Thee Stallion)', 'Mood (feat. Iann Dior)', 'Head & Heart (feat. MNEK)']

Here we use a CSS selector ".chart-table-track strong" to extract all of the song titles (I assume that's the data you want...).


Another approach is to use Pandas:

>>> from io import StringIO
>>> import pandas as pd
>>> import requests # not needed if you have html5lib
>>> html = requests.get("https://spotifycharts.com/regional/au/daily/latest").text
>>> df = pd.read_html(StringIO(html))[0]  # newer pandas versions warn on a raw HTML string
>>> df[["Track", "Artist"]] = df["Track"].str.split("  by ", expand=True)
>>> df.drop(columns=df.columns[[0, 1, 2]])
                                Track  Streams      Artist
0     WAP (feat. Megan Thee Stallion)   311167     Cardi B
1              Mood (feat. Iann Dior)   295922    24kGoldn
2           Head & Heart (feat. MNEK)   190025  Joel Corry
3    Savage Love (Laxed - Siren Beat)   163776   Jawsh 685
4                         Breaking Me   150560       Topic
..                                ...      ...         ...
195                           Daisies    31092  Katy Perry
196                                21    31088      Polo G
197                     Nobody's Love    31047    Maroon 5
198        Ballin' (with Roddy Ricch)    30862     Mustard
199          Dancing in the Moonlight    30853   Toploader

[200 rows x 3 columns]
ggorlen
  • It's not built in; you need to install it via `pip`: `pip install beautifulsoup4`. – Just for fun Sep 16 '20 at 03:29
  • Fair enough. However my assignment requires the use of default Python modules and I don't think BeautifulSoup is included. – kaido Sep 16 '20 at 03:34
  • also, if you only need to do something simple, and you don't already know BS, maybe the baseball bat is the expedient choice. I love BS and have used it a few times, but being a regex weenie, I could see myself going either way. I wouldn't have been able to come up with the syntax for BS out of my head. – CryptoFool Sep 16 '20 at 03:42
  • I'd just learn BS or some other typical HTML parser in your favorite language. It's well worth it unless you're parsing HTML one time then quitting programming for life. – ggorlen Sep 16 '20 at 03:53
  • The `pandas` solution is nice, but it also isn't a standard library module. – Steven Rumbalski Sep 16 '20 at 03:53
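
Since the comments above note that the assignment is limited to standard library modules, here is a rough sketch using `html.parser`, which ships with Python. The class name `StrongTextParser` and the sample markup are hypothetical, and the real page may need more care (for example, `<strong>` tags that are not song titles):

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collects the text found inside every <strong>...</strong> pair."""

    def __init__(self):
        super().__init__()
        self.in_strong = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            self.titles[-1] += data  # entities such as &amp; arrive decoded


# Made-up sample; the real chart markup may differ.
sample = (
    "<td>\n  <strong>Song A</strong>\n  <span>by Artist A</span>\n</td>\n"
    "<td>\n  <strong>Head &amp; Heart</strong>\n</td>"
)

parser = StrongTextParser()
parser.feed(sample)
parser.close()
titles = parser.titles
print(titles)
```

This avoids any dependence on line breaks or leading spaces, at the cost of a little more code than a regex.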

I expect that there is other stuff on the lines you're trying to match. Your expression only allows for EXACTLY a pair of begin/end tags on a line, with stuff between them, but nothing before or after them on the same line. I bet you want to use this expression:

r"\s\s<strong>.*?</strong>"
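
As a quick check of the difference, here is a small sketch on a made-up line (the real markup may differ). It compares the anchored pattern from the question with the unanchored one, and adds a capture group, which is my own addition, to pull out just the title:

```python
import re

# Hypothetical line with extra content around the <strong> pair.
line = '<td class="chart-table-track">  <strong>Song A</strong> by Artist A</td>'

# Anchored: requires the tag pair to be the only thing on the line.
anchored = re.findall(r"^\s\s<strong>.*</strong>$", line)

# Unanchored and non-greedy: matches the tag pair wherever it appears.
unanchored = re.findall(r"\s\s<strong>.*?</strong>", line)

# With a capture group, findall returns only the text between the tags.
titles = re.findall(r"<strong>(.*?)</strong>", line)

print(anchored)
print(unanchored)
print(titles)
```

The non-greedy `.*?` matters when two tag pairs share a line; a greedy `.*` would swallow everything from the first opening tag to the last closing tag.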
CryptoFool
  • Yes. I was really mostly interested in getting him moving in the right direction. I actually wondered if he might have two pairs on the same line, in which case he'd get the wrong behavior here too. – CryptoFool Sep 16 '20 at 03:40
  • Personally I would go with `r'\s*(.*?)'` or even `r'\s*(.*?)\s*by (.*?)'` to include the artist. (Well, actually I would probably use BeautifulSoup, which is out of scope here.) – Steven Rumbalski Sep 16 '20 at 03:51