
I am trying to write a function that I can reuse to strip whitespace from scraped elements. I am scraping h2, li and p tags; they are currently being returned as <tag> string </tag> and I'd like to remove the whitespace and save the content back using *.get_text(strip=True).

h_content = soup.select('h2') will store all the h2 tags found.

p_content = soup.select('p') will store all the p tags found.

And so on.

I have been trying this but am not sure how to return the items to the original location, that is to say, return them here --> *_content

def remove_whitespace(tags):
    for item in tags:
        item.get_text(strip=True)
        return item

The ideal situation is to end up with a function that I can reuse.

remove_whitespace(*_content)

Freddy
  • What output are you currently getting? – pyNeophyte Feb 07 '22 at 06:58
  • When I place return inside the loop and execute p_content = remove_whitespace(p_content), the function works, but only on the first item. When I place return outside the loop and execute it again, I receive an error: `Traceback (most recent call last): File "", line 1, in \ File "", line 3, in remove_whitespace \ AttributeError: 'unicode' object has no attribute 'get_text'` – Freddy Feb 07 '22 at 08:30
  • Are you trying to modify the HTML and save a version with whitespace removed? Can you [edit] your question to give a worked example? – Martin Evans Feb 07 '22 at 10:56
  • Yes @MartinEvans that is what I am trying to do. I've edited my question and hopefully it is a bit more clear now. – Freddy Feb 07 '22 at 15:19

2 Answers

2

The error you got

AttributeError: 'unicode' object has no attribute 'get_text'

stems from an item in the given result set that is a plain unicode string rather than a BeautifulSoup Tag. A plain string has no get_text method.

See also the "Miscellaneous common errors" section of the docs.
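The failure can be reproduced without any scraping at all. The following is a minimal sketch, assuming Python 3 (where the type is reported as `str` instead of Python 2's `unicode`) and a hypothetical `p_content` list that already holds cleaned text rather than Tag objects:

```python
# Hypothetical state: p_content was already cleaned by an earlier call,
# so it contains plain strings, not BeautifulSoup Tag objects.
p_content = ["one item"]

try:
    # Plain strings have no get_text method, so this raises AttributeError.
    p_content[0].get_text(strip=True)
except AttributeError as exc:
    error_message = str(exc)

print(error_message)  # 'str' object has no attribute 'get_text'
```

Printing the type of each item, as in the debug line of the function further down, is a quick way to spot this situation.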

I would suggest using a string generator such as stripped_strings, or simply the text attribute:

def remove_whitespace(tags):
    texts = [] 
    for t in tags:
        print(t, type(t))  # debug print to see the type
        texts.append(t.text.strip())
    return texts
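The function above uses the text attribute. For comparison, here is a rough sketch of the stripped_strings variant. The sample markup, the function name `stripped_text` and the `separator` parameter are made up for illustration and are not part of the original question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with nested tags and stray whitespace.
html = "<ul><li>  first <b> bold </b> item </li><li>\n second item \n</li></ul>"
soup = BeautifulSoup(html, "html.parser")


def stripped_text(tags, separator=" "):
    """Return one whitespace-trimmed string per tag.

    stripped_strings yields each text fragment inside the tag with
    surrounding whitespace removed; empty fragments are skipped.
    """
    return [separator.join(tag.stripped_strings) for tag in tags]


li_content = stripped_text(soup.select("li"))
print(li_content)  # ['first bold item', 'second item']
```

Joining with a space keeps words from nested tags apart. By contrast, `get_text(strip=True)` with its default empty separator would return `'firstbolditem'` for the first item here.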


hc_dev
  • This worked and it helped me find out the reason for my `AttributeError` as well. One thing I can't seem to understand clearly: `texts = []` stores the results inside the function, yet when I do `print(p_content)` I see the cleaned-up results. How does the list `texts` map back to `p_content`? – Freddy Feb 07 '22 at 15:43
1

A "return" inside the loop exits the function during the first iteration. Collect the results in a list and return that list after the loop instead:

def remove_whitespace(tags):
    res = []
    for item in tags:
        res.append(item.get_text(strip=True))
    return res
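The function builds a new list and leaves its argument untouched, so the result only ends up in `*_content` if the return value is assigned back to that name. Below is a small usage sketch with made-up sample markup. The `isinstance` check is an addition, not part of the function above; it is assumed here only so that running the function a second time on already-cleaned strings does not raise the `AttributeError`:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only for illustration.
html = "<h2>  Title </h2><p> one item </p><p>\n two items </p>"
soup = BeautifulSoup(html, "html.parser")


def remove_whitespace(tags):
    res = []
    for item in tags:
        if isinstance(item, str):
            # Already a plain string (e.g. from an earlier call): just strip it.
            res.append(item.strip())
        else:
            # A Tag: extract its text with surrounding whitespace removed.
            res.append(item.get_text(strip=True))
    return res


p_content = soup.select("p")
p_content = remove_whitespace(p_content)  # rebind the name to the cleaned list
h_content = remove_whitespace(soup.select("h2"))

print(p_content)  # ['one item', 'two items']
print(h_content)  # ['Title']

# Calling it again on the cleaned list is harmless thanks to the str check.
p_again = remove_whitespace(p_content)
```

After the reassignment, `p_content` no longer holds Tag objects, so any later code that needs the tags themselves should keep the original result set under a different name.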
pyNeophyte
  • Collecting into a list and returning it is a good step. However, I am not sure that this will solve the current error with BS: `AttributeError: 'unicode' object has no attribute 'get_text'`. – hc_dev Feb 07 '22 at 09:35
  • @hc_dev I went back through my log output and realized that somewhere in all my trial and error I had a stored instance of `p_content` that I had already cleaned up and stored as `u'one item'`, and that is why I was getting the `AttributeError`. – Freddy Feb 07 '22 at 15:39