I have been trying to remove unnecessary parts of a scraped string and I'm having difficulty. I'm sure it's simple but I'm probably lacking the terminology to search for an effective solution.

I have all the information I need and am now trying to create a clean output. I am using this code...

for each in soup.findAll('div', attrs={'class': 'className'}):
    print(each.text.split('\n'))

And the output, a mix of numbers and text with variable spaces, is similar to...

['', '', '', '                    1                ', '  Text Example', '                        (4)']

What I need to produce is a list like...

['1', 'Text Example', '(4)']

Perhaps even removing the brackets "()" from the number 4.

Thanks.

Toby Booth
  • Possible duplicate of [How to remove whitespace in BeautifulSoup](https://stackoverflow.com/questions/4270742/how-to-remove-whitespace-in-beautifulsoup) – Cfreak Nov 28 '17 at 21:37
  • I have tried removing the whitespace with split() and strip() variants and I haven't been able to figure out the combination I need. – Toby Booth Nov 28 '17 at 21:39
  • `text.strip()` without parameters removes spaces, tabs and newlines. If you have a list then you have to do `result = [x.strip() for x in your_list if x.strip() != '']` – furas Nov 28 '17 at 21:40
  • @furas and yet when I'm doing it that way, it keeps splitting the two word text I need, eg. ['text', 'example']. I need them together. – Toby Booth Nov 28 '17 at 21:42
  • `strip()` only removes at the ends - `split()` splits text into words so don't use it. – furas Nov 28 '17 at 21:44

2 Answers

clean = []
for each in soup.findAll('div', attrs={'class': 'className'}):
    # Split the div's text into lines, strip each line, and drop the empty ones.
    clean.append([s.strip() for s in each.text.split('\n') if s.strip()])
print(clean)

That should do it. The full loop is included since you asked where to put it.
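To try the per-line cleanup without a live page, here is a minimal sketch. The `sample_text` string is an assumption: a hard-coded stand-in for `each.text`, reconstructed from the output shown in the question. `clean_lines` is just an illustrative helper name.

```python
# Hypothetical stand-in for each.text, rebuilt from the question's output.
sample_text = (
    "\n\n\n"
    "                    1                \n"
    "  Text Example\n"
    "                        (4)"
)

def clean_lines(text):
    """Split text into lines, strip each line, and drop the empty ones."""
    return [s.strip() for s in text.split('\n') if s.strip()]

print(clean_lines(sample_text))  # ['1', 'Text Example', '(4)']
```

Because the text is split on newlines only, 'Text Example' stays together as one item.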

Update:

Since there was a comment about inefficiency, out of curiosity I timed the double strip against the nested generator on Python 3. It seems there is something to the advice that it's best to profile...

%timeit [s.strip() for s in l if s.strip()]
1.83 µs ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit [i for i in (s.strip() for s in l) if i]
2.16 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

As usual, the results are a bit different with larger amounts of data...

%timeit [s.strip() for s in l*1000 if s.strip()]
1.57 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit [i for i in (s.strip() for s in l*1000) if i]
1.45 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
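
`%timeit` is an IPython magic. Outside IPython, roughly the same comparison can be sketched with the standard-library `timeit` module. The list `l` is the sample from the question; the function names and the call count of 100,000 are arbitrary assumptions, and the numbers will vary by machine.

```python
import timeit

l = ['', '', '', '                    1                ',
     '  Text Example', '                        (4)']

def double_strip(items):
    # Strips each string twice: once in the filter, once for the result.
    return [s.strip() for s in items if s.strip()]

def nested(items):
    # Strips each string once via an inner generator, then filters.
    return [i for i in (s.strip() for s in items) if i]

# Both variants should give the same cleaned list.
assert double_strip(l) == nested(l)

for fn in (double_strip, nested):
    seconds = timeit.timeit(lambda: fn(l), number=100_000)
    print(f"{fn.__name__}: {seconds:.4f} s for 100,000 calls")
```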
ahed87
  • I've seen that used, but where do I put it. I'm learning, I very much appreciate the help. – Toby Booth Nov 28 '17 at 21:45
  • how about string only with spaces ? `if s` will not remove it. – furas Nov 28 '17 at 21:45
  • true, added a strip() in the if, not really tested myself, but I suppose that works for strings with only spaces. If you want to deal with other characters I would probably put that in its own loop over clean afterwards, which makes it a bit easier to understand what is done where. – ahed87 Nov 28 '17 at 21:50
  • This is inefficient as you are stripping each string twice – Joe Iddon Nov 28 '17 at 22:16
  • yepp, can't argue with that, but not everything in the world needs to be efficient, you are welcome to do a regex solution or a nested list or something else you had in mind. – ahed87 Nov 28 '17 at 22:18
  • I just wanted to say thank you. You have helped me out a great deal. – Toby Booth Nov 29 '17 at 13:27

Let's reduce your problem down to a basic list:

l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

then use a list-comp:

[i for i in (s.strip() for s in l) if i]

to get your result of:

['1', 'Text Example', '(4)']
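
The question also mentions possibly dropping the brackets around the 4. One way to sketch that, assuming the brackets only ever appear at the start or end of an item, is to call `str.strip('()')` on each value after the whitespace has been removed:

```python
l = ['', '', '', '                    1                ',
     '  Text Example', '                        (4)']

# Strip whitespace first, filter out the blanks, then trim any
# leading/trailing parentheses from what is left.
cleaned = [i.strip('()') for i in (s.strip() for s in l) if i]
print(cleaned)  # ['1', 'Text Example', '4']
```

Note that `strip('()')` only touches the ends of each string, so brackets in the middle of an item would be left alone.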
Joe Iddon