Is there another way besides "strip()" and "replace()" to get rid of the extra white space in the data I scraped?

Question

I am pretty new to Python and I am trying to set up a web scraper that gathers data on characters who have died in the show Game of Thrones. I have the data that I want, but I can't seem to get some of the extra fluff out of it.

I have tried the .strip() method and the .replace() method using .replace(" ", "") but each time nothing changes. Here is a block of my code:

import requests
from bs4 import BeautifulSoup

url = "http://time.com/3924852/every-game-of-thrones-death/"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Find the characters who have died by searching for the text embedded within the <div> tag with class = "headline"
find_deaths = soup.find_all('div', class_="headline")

# Add the contents of each hit to a list (the goal is to strip the extra fluff from the beginning and end)
deaths = []
for hit in find_deaths:
    deaths.append(hit.contents)

This code yields items in the list that look like this:

    deaths = [['\n                            Will\n                          '], ['\n                            Jon Arryn\n                          '], ['\n                            Jory Cassel\n                          ']]

I have tried the following methods to strip out the extra fluff surrounding the data, but they don't change anything in the list at all.

for item in deaths:
    str(item).strip()

for item in deaths:
    str(item).replace("\n ", "")

I expected either of the two approaches above to strip all the extra fluff from the items in the list, but neither seems to change anything.

Is there another method besides strip and replace that will get rid of the extra fluff in this data?

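As pointed out by @Michael Butscher, strip() returns a new string instead of changing the original, so you have to keep the result, e.g. `new_item = item[0].strip()` (each item is a one-element list, so index into it rather than calling str() on it). Then `new_item` is a stripped copy of the string. - Ilia Gilmijarow, Apr 19 '19 at 22:40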
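A minimal sketch of why the loops in the question have no visible effect, using one entry copied from the question's output (the variable names are illustrative only). Calling `str()` on a list produces the printed form of the whole list, brackets and quotes included, so there is no leading or trailing whitespace for `strip()` to remove.

```python
# One entry as stored by deaths.append(hit.contents): a list holding one string.
item = ['\n                            Will\n                          ']

# str() on a list gives its printed form, which starts with "[" and ends
# with "]", so strip() finds no whitespace at either end.
as_text = str(item)
unchanged = as_text.strip()

# Indexing into the list first gives the actual string, which strip() can clean.
name = item[0].strip()

print(unchanged == as_text)
print(name)
```

The same reasoning applies to `replace()`: it would operate on the printed form of the list, and its result would still need to be stored.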

score 0 · Answer 1 · answered Apr 19 '19 at 22:42

You should use a list comprehension. Each element of deaths is itself a one-element list, so index into it before stripping:

deaths = [s[0].strip() for s in deaths]

However, you have a lot of unnecessary intermediate steps here. You can simply build the list with a comprehension directly over the result of find_all:

deaths = [hit.contents[0].strip() for hit in soup.find_all('div', class_="headline")]

With the given website and query, deaths will be

['Will', 'Jon Arryn', 'Jory Cassel', 'Benjen Stark', 'Robert Baratheon', 'Syrio Forel', 'Eddard Stark', 'Viserys Targaryen', 'Drogo', 'Rhaego', 'Mirri Maz Duur', 'Rakharo', 'Yoren', 'Renly Baratheon', 'Rodrik Cassel', 'Irri', 'Maester Luwin', 'Qhorin', 'Pyat Pree', 'Doreah', 'Xaro Xhoan Daxos', 'Hoster Tully', 'Jeor Mormont', 'Craster', 'Kraznys', 'Beric Dondarrion', 'Ros', 'Talisa Stark', 'Robb Stark', 'Catelyn Stark', 'Polliver', 'Tansy', 'Joffrey Baratheon', 'Karl Tanner', 'Locke', 'Rast', 'Lysa Arryn', 'Oberyn Martell', 'The Mountain', 'Grenn', 'Mag the Mighty', 'Pyp', 'Styr', 'Ygritte', 'Jojen Reed', 'Shae', 'Tywin Lannister', 'Mance Rayder', 'Janos Slynt', 'Barristan Selmy', 'Maester Aemon', 'Karsi', 'Shireen Baratheon', 'Hizdahr zo Loraq', 'Selyse Baratheon', 'Stannis Baratheon', 'Myranda', 'Meryn Trant', 'Myrcella Baratheon', 'Jon Snow', 'Areo Hotah', 'Doran Martell', 'Trystane Martell', 'The Flasher', 'Roose Bolton', 'Walda Bolton', 'Unnamed Bolton Child', 'Balon Greyjoy', 'Alliser Thorne', 'Olly', 'Ser Arthur Dayne', 'Osha', 'Khal Moro', 'Three-Eyed Raven', 'Leaf', 'Hodor', 'Aerys II Targaryen, "The Mad King"', 'Brother Ray', 'Lem', 'Brynden Tully (The Blackfish)', 'Lady Crane', 'The Waif', 'Razdal mo Eraz', 'Belicho Paenymion', 'Rickon Stark', 'Jon Umber', 'Wun Weg Wun Dar Wun', 'Ramsay Bolton', 'Grand Maester Pycelle', 'Lancel', 'The High Sparrow', 'Loras Tyrell', 'Mace Tyrell', 'Kevan Lannister', 'Margaery Tyrell', 'Tommen Baratheon', 'Walder Rivers', 'Lothar Frey', 'Walder Frey', 'Lyanna Stark', 'Nymeria Sand', 'Obara Sand', 'Tyene Sand', 'Olenna Tyrell', 'Randyll Tarly', 'Dickon Tarly', 'Thoros of Myr', 'Petyr "Littlefinger" Baelish', 'Ned Umber']
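To address the title's request for something other than `strip()` and `replace()`: one common alternative, sketched here on sample strings rather than the live page, is to split on whitespace and re-join with single spaces. Unlike `strip()`, this also collapses runs of whitespace inside the text. The helper name `normalize_whitespace` is made up for this example, and the second sample has extra internal spaces added on purpose.

```python
def normalize_whitespace(text):
    """Trim the ends and collapse internal whitespace runs to single spaces."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops empty strings, so leading and trailing whitespace disappears.
    return " ".join(text.split())


samples = [
    '\n                            Will\n                          ',
    '\n                            Jon   Arryn\n                          ',
]
cleaned = [normalize_whitespace(s) for s in samples]
print(cleaned)  # ['Will', 'Jon Arryn']
```

For the data in the question, `strip()` alone is enough; this variant is only worth it when the scraped text may also contain line breaks or repeated spaces in the middle.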

score 0 · Answer 2 · answered Apr 19 '19 at 22:42

Strings are immutable. strip() and replace() return new strings; they don't change the original.
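A small sketch of that behaviour, using sample strings shaped like the question's output (the names `raw`, `cleaned`, and `names` are illustrative only). Calling `strip()` without keeping its return value leaves the original untouched, so the cleaned strings have to be assigned somewhere.

```python
raw = '\n                            Jon Arryn\n                          '

# The return value is discarded here, so raw keeps its whitespace.
raw.strip()
print(repr(raw))

# Keeping the return value gives the cleaned string.
cleaned = raw.strip()
print(repr(cleaned))

# The same applies to a list: build a new list of stripped strings
# instead of expecting the existing ones to change in place.
names = ['\n   Will\n  ', '\n   Jory Cassel\n  ']
names = [n.strip() for n in names]
print(names)
```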

Use a list comprehension like the one that @Tomothy32 suggested:

deaths = [hit.contents[0].strip() for hit in soup.find_all('div', class_="headline")]

Alec

score 0 · Answer 3 · answered Apr 20 '19 at 07:33

I can't test this due to my location, but you should be able to avoid the problem by using the already clean string in the name attribute of the elements with class anchor-only:

deaths = [item['name'] for item in soup.select('.anchor-only')]

QHarr
