4

I'm trying to scrape some data for my app and need to extract the text from a table cell. Here is the HTML code:

<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>

I want the output to look like this:

This is a first sentence.
This is a second sentence.
This is a third sentence.

Is it possible to do that?

jack45j
  • Have you tried the below solutions? People are trying to solve your issue but you don't even care to respond @user4937980!! – SIM Feb 11 '18 at 05:12
  • Sorry I just woke up for hours. Finally I used SIM's method and it just work like a boss. All below solutions are brilliant. BTW web-Scraping is really hard to learn :'( – jack45j Feb 11 '18 at 10:00

4 Answers

2

It's certainly possible. I'll answer in slightly greater generality because I doubt that you want merely to process that chunk of HTML.

First, get a pointer to the td element,

td = soup.find('td')

Now, notice that you can get a list of this element's children,

>>> td_kids = list(td.children)
>>> td_kids
['\n    This\n    ', <a class="tip info" href="blablablablabla">is a first</a>, '\n    sentence.\n    ', <br/>, '\n    This\n    ', <a class="tip info" href="blablablablabla">is a second</a>, '\n    sentence.\n    ', <br/>, 'This\n    ', <a class="tip info" href="blablablablabla">is a third</a>, '\n    sentence.\n    ', <br/>, '\n']

Some of the items in this list are strings, some are HTML elements. Crucially, some are br elements.

You could first split the list into one or more sublists at the br elements. To tell tags apart from strings, check

isinstance(td_kids[<some k>], bs4.element.Tag)

for each item in the list.

Then, you could go through each of the sublists repeatedly replacing tags by turning them into soup and then getting the lists of children for these. Eventually, you will have several sublists containing only what BeautifulSoup calls 'navigable strings' that you can manipulate as usual.

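Join the elements together; then I would suggest collapsing the white space with a regex sub like this (replacing with a single space rather than an empty string, so that neighbouring words do not run together):

result = re.sub(r'\s+', ' ', <joined list>).strip()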
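The steps above can be put together roughly as follows. This is only a minimal sketch, not necessarily what the answer's author had in mind: the helper name sentences_from_cell is made up for illustration, nested tags are flattened with get_text() instead of being re-parsed into soup, and the built-in html.parser is assumed as the parser.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""


def sentences_from_cell(td):
    """Hypothetical helper: split a cell's children at <br> and clean each group."""
    groups, current = [], []
    for child in td.children:
        if isinstance(child, Tag) and child.name == "br":
            # A <br> closes the current group of pieces.
            groups.append(current)
            current = []
        elif isinstance(child, Tag):
            # Any other tag (here <a>) contributes its text content.
            current.append(child.get_text())
        else:
            # Navigable strings are kept as plain text.
            current.append(str(child))
    groups.append(current)

    # Join each group, collapse whitespace runs to one space, drop empty groups.
    lines = (re.sub(r"\s+", " ", "".join(group)).strip() for group in groups)
    return [line for line in lines if line]


soup = BeautifulSoup(html, "html.parser")
result = sentences_from_cell(soup.find("td"))
print("\n".join(result))
```

Because the split happens at the br elements, this does not depend on the sentences ending with a period.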
Bill Bell
1
htmlText = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""
from bs4 import BeautifulSoup
# these two steps are to put everything into one line. may not be necessary for you
htmlText = htmlText.replace("\n", " ")
while "  " in htmlText:
    htmlText = htmlText.replace("  ", " ")

# import into bs4
soup = BeautifulSoup(htmlText, "lxml")

# using https://stackoverflow.com/a/34640357/5702157
for br in soup.find_all("br"):
    br.replace_with("\n")

parsedText = soup.get_text()
while "\n " in parsedText:
    parsedText = parsedText.replace("\n ", "\n") # remove spaces at the start of new lines
print(parsedText.strip())
yenter
1

Try this; it should give you the desired output. The content variable in the script below is assumed to hold the HTML you pasted above.

from bs4 import BeautifulSoup

soup = BeautifulSoup(content,"lxml")
items = ','.join([''.join([item.previous_sibling,item.text,item.next_sibling]) for item in soup.select(".tip.info")])
data = ' '.join(items.split()).replace(",","\n")
print(data)

Output:

This is a first sentence. 
This is a second sentence. 
This is a third sentence.
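One caveat with the script above is that it uses a comma as a temporary delimiter, so a comma inside a sentence would be turned into a line break. A possible variant that keeps the same sibling-based idea but collects the sentences in a list is sketched below. The helper text_of is hypothetical, html.parser is assumed instead of lxml so that no extra package is needed, and the sketch still assumes one link per sentence.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

content = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""


def text_of(node):
    """Hypothetical helper: return the text of a string sibling, else ''."""
    # previous_sibling/next_sibling can be None or a tag; ignore those cases.
    return str(node) if isinstance(node, NavigableString) else ""


soup = BeautifulSoup(content, "html.parser")
sentences = []
for link in soup.select(".tip.info"):
    pieces = [text_of(link.previous_sibling), link.get_text(), text_of(link.next_sibling)]
    # split() + join() collapses every run of whitespace into one space.
    sentences.append(" ".join("".join(pieces).split()))

data = "\n".join(sentences)
print(data)
```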
SIM
1

You can easily do this using bs4 and basic string manipulation like so:

from bs4 import BeautifulSoup

data = '''
<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>
'''

soup = BeautifulSoup(data, 'html.parser')
for i in soup.find_all('td'):
    print(' '.join(i.text.split()).replace('. ', '.\n'))

This will give as output:

This is a first sentence.
This is a second sentence.
This is a third sentence.
game0ver
  • @novice-coder yes, I know - but web-scraping depends a lot on the content format (in this case the OP wants full sentences - thus the dot). Anyway, this can be easily fixed by the OP depending on the actual content. The important thing in this answer is `i.text`, since many programmers tend to forget or ignore that it even exists! – game0ver Feb 10 '18 at 22:51