Clean up a scraped text string with Python

Question

I have been trying to remove unnecessary parts of a scraped string and I'm having difficulty. I'm sure it's simple but I'm probably lacking the terminology to search for an effective solution.

I have all the information I need and am now trying to create a clean output. I am using this code...

for each in soup.findAll('div', attrs={'class': 'className'}):
    print(each.text.split('\n'))

And the output, a mix of numbers and text with variable spaces, is similar to...

['', '', '', '                    1                ', '  Text Example', '                        (4)']

What I need to produce is a list like...

['1', 'Text Example', '(4)']

Ideally, I would also like to remove the brackets "()" around the number 4.

Thanks.

Possible duplicate of [How to remove whitespace in BeautifulSoup](https://stackoverflow.com/questions/4270742/how-to-remove-whitespace-in-beautifulsoup) — Cfreak, Nov 28 '17 at 21:37
I have tried removing the whitespace with split() and strip() variants and I haven't been able to figure out the combination I need. — Toby Booth, Nov 28 '17 at 21:39
`text.strip()` without parameters removes spaces, tabs, and newlines. If you have a list, then you have to do `result = [x.strip() for x in your_list if x.strip() != '']` — furas, Nov 28 '17 at 21:40
@furas and yet when I'm doing it that way, it keeps splitting the two word text I need, eg. ['text', 'example']. I need them together. — Toby Booth, Nov 28 '17 at 21:42
`strip()` only removes at the ends - `split()` splits text into words so don't use it. — furas, Nov 28 '17 at 21:44
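
To make the difference raised in these comments concrete, here is a small sketch using the list from the question. The variable names `raw`, `stripped`, `words` and `result` are only illustrative and do not appear in the thread.

```python
# Sample list copied from the output shown in the question.
raw = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

# strip() trims whitespace only at both ends, so the inner space survives.
stripped = '  Text Example'.strip()

# split() with no arguments breaks on every run of whitespace, separating the words.
words = '  Text Example'.split()

# Strip each item and keep only the ones that are not empty afterwards.
result = [x.strip() for x in raw if x.strip() != '']

print(stripped)  # Text Example
print(words)     # ['Text', 'Example']
print(result)    # ['1', 'Text Example', '(4)']
```

This is why calling `split()` on each item breaks the two-word text apart, while `strip()` keeps it together.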

ahed87 · Accepted Answer · 2017-11-28T23:24:51.797

2

clean = []
for each in soup.findAll('div', attrs={'class': 'className'}):
    # split the div text into lines, then strip each line and drop the empty ones
    clean.append([s.strip() for s in each.text.split('\n') if s.strip()])
print(clean)

That should do it. The full code is shown because of the comment asking where to put it.
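
For experimenting without a live page, the same idea can be tried on plain strings. This is a minimal sketch, assuming the text of each div looks like the output in the question once it is split on newlines. `div_texts` is a hypothetical stand-in for `each.text`, and its second entry is invented sample data.

```python
# Hypothetical stand-ins for each.text of two matching divs.
# The first mimics the question's output; the second is invented sample data.
div_texts = [
    "\n\n\n                    1                \n  Text Example\n                        (4)",
    "\n   2   \n Another Example \n   (7)\n",
]

clean = []
for text in div_texts:
    # Split into lines first, then strip each line and drop the blank ones.
    clean.append([s.strip() for s in text.split('\n') if s.strip()])

print(clean)
# [['1', 'Text Example', '(4)'], ['2', 'Another Example', '(7)']]
```

Each div ends up as its own inner list, so the values from one div stay grouped together.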

Update:

Since there was a comment about inefficiency, I got curious and timed the double strip() against the nested generator on Python 3. It seems there is something to the advice that it's best to profile...

%timeit [s.strip() for s in l if s.strip()]
1.83 µs ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit [i for i in (s.strip() for s in l) if i]
2.16 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

As usual, the results are a bit different with larger amounts of data...

%timeit [s.strip() for s in l*1000 if s.strip()]
1.57 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit [i for i in (s.strip() for s in l*1000) if i]
1.45 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
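
The `%timeit` lines above need IPython. Outside of it, a comparable measurement could be sketched with the standard `timeit` module. This assumes `l` is the sample list from the question; the function names `double_strip` and `nested` are made up for this sketch, and wrapping the expressions in functions adds a little call overhead, so the numbers will not match the ones above exactly.

```python
import timeit

# Assumption: l is the sample list from the question.
l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

def double_strip(items):
    # Calls strip() twice per item: once in the filter, once for the result.
    return [s.strip() for s in items if s.strip()]

def nested(items):
    # Strips once per item in an inner generator, then filters the results.
    return [i for i in (s.strip() for s in items) if i]

big = l * 1000

# Both variants must agree before comparing their speed.
assert double_strip(big) == nested(big)

for func in (double_strip, nested):
    seconds = timeit.timeit(lambda: func(big), number=100)
    print(f"{func.__name__}: {seconds / 100 * 1e3:.3f} ms per call")
```

The timings depend on the machine and the Python version, so it is worth running this on the real data rather than relying on the figures quoted here.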

edited Nov 28 '17 at 23:24

answered Nov 28 '17 at 21:43

ahed87

I've seen that used, but where do I put it. I'm learning, I very much appreciate the help. – Toby Booth Nov 28 '17 at 21:45
How about a string with only spaces? `if s` will not remove it. – furas Nov 28 '17 at 21:45
True, I added a strip() in the if. I have not really tested it myself, but I suppose that works for strings with only spaces. If you want to deal with other characters I would probably put that in its own loop over clean afterwards, which makes it a bit easier to understand what is done where. – ahed87 Nov 28 '17 at 21:50
This is inefficient as you are stripping each string twice – Joe Iddon Nov 28 '17 at 22:16
yepp, can't argue with that, but not everything in the world needs to be efficient, you are welcome to do a regex solution or a nested list or something else you had in mind. – ahed87 Nov 28 '17 at 22:18
I just wanted to say thank you. You have helped me out a great deal. – Toby Booth Nov 29 '17 at 13:27
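
The separate pass over `clean` suggested in the comments above could also handle the brackets mentioned in the question. This is only a sketch: `no_brackets` is a name invented here, and it assumes the brackets appear only at the start or end of an item.

```python
# Output of the stripping step, as shown in the question.
clean = [['1', 'Text Example', '(4)']]

# Second pass: remove round brackets from the ends of every item.
# str.strip('()') only touches leading and trailing characters,
# so brackets in the middle of a string would be left alone.
no_brackets = [[item.strip('()') for item in row] for row in clean]

print(no_brackets)  # [['1', 'Text Example', '4']]
```

Keeping this as its own step, separate from the whitespace clean-up, makes it easy to see what is done where.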

score 1 · Answer 2 · answered Nov 28 '17 at 22:12

Let's reduce your problem down to a basic list:

l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

then use a list-comp:

[i for i in (s.strip() for s in l) if i]

to get your result of:

['1', 'Text Example', '(4)']
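
On Python 3.8 or newer (which is later than this thread), an assignment expression is another way to strip each string only once. This is an alternative sketch and not part of the original answer; the names `t` and `result` are chosen here for illustration.

```python
l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

# (t := s.strip()) stores the stripped string in t and also uses it as the filter,
# so strip() runs once per item. Requires Python 3.8+.
result = [t for s in l if (t := s.strip())]

print(result)  # ['1', 'Text Example', '(4)']
```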

Joe Iddon

I just wanted to say thank you. You have helped me out a great deal. – Toby Booth Nov 29 '17 at 13:27
