-2

I want to scrape this page and get the content without whitespace.

```
import requests
from bs4 import BeautifulSoup
def getdat(url):
    r = requests.get(url)
    return r.text
newsurl = "https://www.msei.in/downloads/equity-reports/fii-dii-activities"
data = getdat(newsurl)
soup = BeautifulSoup(data, 'html.parser')
result = soup.findAll('tr')
for i in result:
    print(i.text)
```

The output of the code is: [image not available]

My requirement is to get the text without the blank space. How can I do that?

Swagoto
  • Does this answer your question? [How to remove whitespace in BeautifulSoup](https://stackoverflow.com/questions/4270742/how-to-remove-whitespace-in-beautifulsoup) – Russ J Mar 20 '21 at 04:20
  • No, I have already tried those solutions, but my problem is not solved. – Swagoto Mar 20 '21 at 04:44

5 Answers

2

Using regular expressions you can remove ALL of the whitespace easily (or less if you want to, with a little more effort).

```
import re

import requests
from bs4 import BeautifulSoup


def getdat(url):
    r = requests.get(url)
    return r.text


newsurl = "https://www.msei.in/downloads/equity-reports/fii-dii-activities"
data = getdat(newsurl)
soup = BeautifulSoup(data, 'html.parser')
result = soup.findAll('tr')

# Swap newline runs for a NUL placeholder, delete all other whitespace,
# then turn the placeholders back into single newlines.
ans = [re.sub(r"\u0000+", "\n", re.sub(r"\s+", "", re.sub(r"\n+", "\u0000", x.text))).strip() for x in result]

for i in ans:
    print(i)
```
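The nested `re.sub` calls are easier to follow when written as separate steps. The sketch below is only an illustration: the helper name `squeeze_row` and the sample string are made up, and the real row text from the page will differ.

```python
import re


def squeeze_row(text: str) -> str:
    """Remove all whitespace except one newline between non-empty lines."""
    # 1. Protect line breaks: collapse each run of newlines into a NUL placeholder.
    marked = re.sub(r"\n+", "\x00", text)
    # 2. Delete all remaining whitespace (NUL is not matched by \s).
    compact = re.sub(r"\s+", "", marked)
    # 3. Turn each run of placeholders back into one newline, then trim the ends.
    return re.sub(r"\x00+", "\n", compact).strip()


# Hypothetical text shaped like the .text of a table row.
sample = "\n Category \n\n Date \n Buy Value \n"
print(squeeze_row(sample))
```

Note that whitespace inside a cell is removed as well, so a cell such as `Buy Value` comes out as `BuyValue`.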
Cresht
1

You do know that the .strip function removes whitespace from both ends of a string?

```
for i in result:
    txt = i.text.strip()
    if txt:
        print(txt)
```
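Keep in mind that `.strip()` only trims the two ends of the string. If a row's text has newlines or padding between cells, that interior whitespace stays. A small sketch, using a made-up `row_text` value rather than real data from the page:

```python
# Hypothetical row text; a <tr> with several cells often looks roughly like this.
row_text = "\n  FII/FPI \n\n 19-Mar-2021 \n\n  24,193.67  \n"

stripped = row_text.strip()

# repr() makes the remaining interior newlines and spaces visible.
print(repr(stripped))
```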
Tim Roberts
1

I tried using the approach which Tim Roberts mentioned, but, to my surprise, it did not work. Here's what I came up with:

```
import bs4
import requests

res = requests.get("https://www.msei.in/downloads/equity-reports/fii-dii-activities")
soup = bs4.BeautifulSoup(res.text, features="html.parser")

elems = soup.select("tr")
for e in elems:
    # split() with no arguments splits on any run of whitespace
    print(e.getText().split())
```

I found that calling the split() method was the easiest way to get a clean list of strings, with no whitespace.

```
['Category', 'Date', 'Buy', 'Value', 'Sell', 'Value', 'Net', 'Value']
['FII/FPI', '19-Mar-2021', '24,193.67', '22,775.24', '1,418.43']
['As', 'on', '19', 'Mar,', '2021']
['Category', 'Date', 'Buy', 'Value', 'Sell', 'Value', 'Net', 'Value']
['DII', '19-Mar-2021', '7,503.70', '6,944.08', '559.62']
['As', 'on', '19', 'Mar,', '2021']
```
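If one clean string per row is preferred over a list, the pieces returned by `split()` can be joined back together. This is a minimal sketch with a made-up `row_text` value; the variable names are only for illustration.

```python
# Hypothetical row text; real rows from the page will differ.
row_text = "\n  FII/FPI \n\n 19-Mar-2021 \n\n  24,193.67  \n"

# split() with no arguments splits on any run of whitespace and drops empty strings.
tokens = row_text.split()

# Join with a single space to keep the cells readable...
spaced = " ".join(tokens)
# ...or with an empty string to drop whitespace entirely.
compact = "".join(tokens)

print(spaced)
print(compact)
```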
Jacob Lee
0

If you want to remove all leading and trailing whitespace from the printed text, you would do:

```
print(i.text.strip())
```

If you want to remove all whitespace everywhere in the text, you would do something like:

```
import re
...
removedWhiteSpaceText = re.sub(r'\s+', '', i.text)
print(removedWhiteSpaceText)
...
```
Majin Bui
0

Try using replace when printing the output.

```
for i in result:
    print(i.text.replace(" ", ""))
```
John Holmes
  • This removes the literal spaces, but does not remove newlines. OP wanted the content without *any* whitespace, which includes newlines. – Jacob Lee Mar 20 '21 at 04:30
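The difference described in the comment above can be seen on a small string. The sketch below uses a made-up `row_text` value and filters characters with `str.isspace()`, which is just one of several ways to drop every kind of whitespace.

```python
# Hypothetical row text containing spaces, a tab, and newlines.
row_text = "FII/FPI \n\t19-Mar-2021 \n 24,193.67"

# replace(" ", "") removes only the space character; "\n" and "\t" survive.
only_spaces_removed = row_text.replace(" ", "")

# Dropping every character for which isspace() is true removes all whitespace.
all_whitespace_removed = "".join(ch for ch in row_text if not ch.isspace())

print(repr(only_spaces_removed))
print(repr(all_whitespace_removed))
```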