Python BeautifulSoup: Loop through elements to strip whitespace from within a function

Question

I am trying to write a function that I can reuse to strip whitespace from scraped elements. I am scraping h2, li and p tags; they are currently being returned as <tag> string </tag> and I'd like to remove the whitespace and save the content back using *.get_text(strip=True).

h_content = soup.select('h2') will store all the h2 tags found.

p_content = soup.select('p') will store all the p tags found.

And so on.

I have been trying this but am not sure how to return the items to the original location, that is to say, return them here --> *_content

def remove_whitespace(tags):
    for item in tags:
        item.get_text(strip=True)
        return item

The ideal situation is to end up with a function that I can reuse.

remove_whitespace(*_content)

When I place return inside the loop and execute p_content = remove_whitespace(p_content), I see that the function worked, but only on the first item. When I place return outside the loop and execute it again, I receive an error: `Traceback (most recent call last): File "", line 1, in \ File "", line 3, in remove_whitespace \ AttributeError: 'unicode' object has no attribute 'get_text'` – Freddy, Feb 07 '22 at 08:30
Are you trying to modify the HTML and save a version with the whitespace removed? Can you edit your question to include a worked example? – Martin Evans, Feb 07 '22 at 10:56
Yes @MartinEvans, that is what I am trying to do. I've edited my question and hopefully it is a bit clearer now. – Freddy, Feb 07 '22 at 15:19

score 2 · Accepted Answer · answered Feb 07 '22 at 10:49

The error you got

AttributeError: 'unicode' object has no attribute 'get_text'

means that at least one element in the given result set is a plain Python 2 unicode string rather than a BeautifulSoup Tag. A built-in string has no get_text method.

See also the section "Miscellaneous common errors" in the Beautiful Soup documentation.
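A minimal sketch of how this error can arise, assuming the list passed in already holds plain strings rather than Tag objects. The list contents are made up for illustration, and on Python 3 the type is reported as 'str' instead of 'unicode':

```python
def remove_whitespace(tags):
    res = []
    for item in tags:
        # Works for Tag objects, fails for built-in strings.
        res.append(item.get_text(strip=True))
    return res


# Hypothetical stand-in for a list that was already cleaned once,
# so it holds plain strings instead of Tag objects.
p_content = ["one item", "another item"]

try:
    remove_whitespace(p_content)
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

Printing type(item) inside the loop, as in the function below, is a quick way to confirm which kind of object is actually being passed in.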

I would suggest using one of the string generators, such as stripped_strings, or the simple text attribute:

def remove_whitespace(tags):
    texts = [] 
    for t in tags:
        print(t, type(t))  # debug print to see the type
        texts.append(t.text.strip())
    return texts
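The options mentioned above behave differently once a tag contains nested tags. The following is only a sketch on a made-up HTML snippet, using the standard library parser "html.parser":

```python
from bs4 import BeautifulSoup

# Made-up snippet with a nested <b> tag and extra whitespace.
html = "<p>  Hello <b> world </b> </p>"
tag = BeautifulSoup(html, "html.parser").p

# Each text piece is stripped, then the pieces are joined with "".
joined = tag.get_text(strip=True)

# Same, but the pieces are joined with a single space.
spaced = tag.get_text(" ", strip=True)

# The stripped pieces themselves, as a list.
pieces = list(tag.stripped_strings)

# Only leading and trailing whitespace of the whole text is removed.
outer = tag.text.strip()

print(repr(joined), repr(spaced), pieces, repr(outer))
```

For tags that hold a single string, all of these give the same result. For nested markup, passing a separator to get_text or joining stripped_strings keeps the words apart.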


This worked and it helped me find out the reason for my `AttributeError` as well. One thing I can't seem to understand clearly is how `texts = []` is storing the results in the function but when I do `print(p_content)` I see the cleaned up results. How is the list `res` mapping back to `p_content`? — Freddy, Feb 07 '22 at 15:43
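On the question of how the list inside the function ends up in p_content: the function builds a new list and returns it, and the assignment p_content = remove_whitespace(p_content) rebinds the name to that returned list. A small sketch, where the HTML and the extra name original are made up for illustration:

```python
from bs4 import BeautifulSoup


def remove_whitespace(tags):
    texts = []                       # a new list, local to this call
    for t in tags:
        texts.append(t.text.strip())
    return texts                     # the caller receives this list object


soup = BeautifulSoup("<p> one </p><p> two </p>", "html.parser")
p_content = soup.select("p")
original = p_content                 # second name for the same result set

# The name p_content now refers to the returned list of strings.
p_content = remove_whitespace(p_content)

print(p_content)                     # cleaned strings
print(original)                      # still the untouched Tag objects
```

Nothing is written back into the original result set. Without the assignment, the cleaned list would simply be discarded.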

score 1 · Answer 2 · answered Feb 07 '22 at 08:57

A return statement inside the loop exits the function during the first iteration. To process every item, collect the results in a list and return it after the loop:

def remove_whitespace(tags):
    res=[]
    for item in tags:
        res.append(item.get_text(strip=True))
    return res
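As a usage sketch, the same function can be reused for each selection in turn. The HTML snippet and the content dictionary are assumptions made for this example, not part of the original question:

```python
from bs4 import BeautifulSoup

# Made-up page fragment with h2, li and p tags.
html = """
<h2>  Title  </h2>
<ul><li> first </li><li> second </li></ul>
<p>  Some text.  </p>
"""
soup = BeautifulSoup(html, "html.parser")


def remove_whitespace(tags):
    # Return one stripped string per tag.
    return [tag.get_text(strip=True) for tag in tags]


# Apply the same function to every tag name of interest.
content = {
    name: remove_whitespace(soup.select(name))
    for name in ("h2", "li", "p")
}
print(content)
```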

pyNeophyte


Collecting the results into a list and returning the list is a good step. However, I am not sure that this will solve the current error with BS: `AttributeError: 'unicode' object has no attribute 'get_text'`. – hc_dev Feb 07 '22 at 09:35

@hc_dev I went back through my log output and realized that somewhere in all my trial and error I had stored an instance of `p_content` that I had already cleaned up, as `u'one item'`, and that is why I was getting the `AttributeError`. – Freddy Feb 07 '22 at 15:39
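One way to make the function tolerant of an already cleaned list is to check the type of each item first. This is a sketch of one possible approach, not something proposed in the answers; the parameter name items and the fallback branch are assumptions:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def remove_whitespace(items):
    cleaned = []
    for item in items:
        if isinstance(item, Tag):
            # A real tag: let BeautifulSoup extract and strip the text.
            cleaned.append(item.get_text(strip=True))
        else:
            # Already a string (for example from an earlier run).
            cleaned.append(str(item).strip())
    return cleaned


soup = BeautifulSoup("<p> one item </p><p> two items </p>", "html.parser")
once = remove_whitespace(soup.select("p"))
twice = remove_whitespace(once)      # safe to call again on the result
print(once, twice)
```

With this variant, running the function a second time on its own output returns the same strings instead of raising an AttributeError.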
