Here is my html:

<html>
<body>
<h2>Pizza</h2>
<p>This is some random paragraph without child tags.</p>
<p>Delicious homebaked pizza.<br><em>$8.99 pp</em></p>
<h2>Eggplant Parmesan</h2>
<p>Try the authentic <i>Italian flavor</i> of baked aubergine.<br><em>$6.99 pp</em></p>
<h2>Italian Ice Cream</h2>
<p>Our dessert specialty.<br><em>$3.99 pp</em></p>
</body>
</html>

Using BeautifulSoup, I want to grab the text that is displayed for the h2 and p tags, replace them with a prefixed version in the tree, and also print them out on screen. For the h2 tags, this works fine:

from bs4 import BeautifulSoup

with open("/var/www/html/Test/index.html", "r") as f:
    soup = BeautifulSoup(f, "lxml")

f = open("/var/www/html/Test/I18N_index.html", "w+")

for h2 in soup.find_all('h2'):
    i18n_string = "I18N_"+h2.string
    h2.string.replace_with(i18n_string)
    print(h2.string)

f.write(str(soup))


###Output:##############################################
# $ python ./test.py
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
########################################################

In my I18N_index.html, all 3 strings appear correctly prefixed with 'I18N_'.

However, some of my p tags contain child tags, and for these p.string returns None. As a result, the concatenation no longer works:

for p in soup.find_all('p'):
    i18n_string = "I18N_"+p.string
    p.string.replace_with(i18n_string)
    print(p.string)

f.write(str(soup))

###Output:##################################################
# $ python ./test.py
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
# I18N_This is some random paragraph without child tags.
# Traceback (most recent call last):
  # File "./test.py", line 15, in <module>
    # i18n_string = "I18N_"+p.string
# TypeError: cannot concatenate 'str' and 'NoneType' objects
############################################################
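
For reference, a minimal sketch of where the None comes from: .string only returns text when a tag has exactly one child, while .strings yields every text fragment. The two-paragraph snippet and the "html.parser" backend are assumptions chosen to keep the example self-contained; the script above uses "lxml".

```python
from bs4 import BeautifulSoup

# Two paragraphs taken from the page above: one without and one with child tags
html = (
    "<p>This is some random paragraph without child tags.</p>"
    "<p>Our dessert specialty.<br><em>$3.99 pp</em></p>"
)
soup = BeautifulSoup(html, "html.parser")
plain, mixed = soup.find_all("p")

plain_string = plain.string        # single text child: a NavigableString
mixed_string = mixed.string        # text plus <br> plus <em>: None
mixed_parts = list(mixed.strings)  # all text fragments, in document order
```

The second paragraph has three children (a text node, the br tag, and the em tag), so there is no single string for .string to return.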

From this thread I learned about the join function. It lets me do the concatenation and print the resulting strings on screen, but not the replacement in the soup tree:

for p in soup.find_all('p'):
    joined = ''.join(p.strings)
    i18n_string = "I18N_"+joined
    #joined.replace_with(i18n_string)
    print (i18n_string)

###Output with 'joined.replace_with(i18n_string)' DISABLED:###
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
# I18N_This is some random paragraph without child tags.
# I18N_Delicious homebaked pizza.$8.99 pp
# I18N_Try the authentic Italian flavor of baked aubergine.$6.99 pp
# I18N_Our dessert specialty.$3.99 pp
############################################################

###Output with 'joined.replace_with(i18n_string)' ENABLED:#####
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
# Traceback (most recent call last):
  # File "./test.py", line 41, in <module>
    # joined.replace_with(i18n_string)
# AttributeError: 'unicode' object has no attribute 'replace_with'
############################################################

In that thread, another solution based on isinstance is mentioned, but I could not make that work.

If I understand correctly, the join function joins the strings but returns a 'unicode' object, not a string object, and this is why the 'replace_with' attribute doesn't work. How can I work around this? Any help is much appreciated.

cbp

2 Answers

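replace_with() fails not because joined is a unicode object, but because it is a method of bs4 tree objects (tags and the strings inside them), and the result of join() is a plain string that is no longer part of the tree. See this: BeautifulSoup-replace_with

By the way, the join() method returns a str in Python 3 (in Python 2, joining unicode strings returns a unicode object). See this: python3-join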
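
To make the distinction concrete, here is a small sketch comparing the joined text with a string that still lives in the tree. The one-line HTML snippet and the "html.parser" backend are assumptions for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup(
    "<p>Our dessert specialty.<br><em>$3.99 pp</em></p>", "html.parser"
)
p = soup.p

# join() builds a new plain string that has no link back to the soup tree
joined = "".join(p.strings)
joined_has_method = hasattr(joined, "replace_with")

# The individual fragments are NavigableString objects attached to the tree
first_text = next(p.strings)
is_tree_node = isinstance(first_text, NavigableString)
node_has_method = hasattr(first_text, "replace_with")
```

Here joined_has_method is False while node_has_method is True, which matches the AttributeError in the question.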

Now, for a solution: I would simply replace each p tag with its prefixed text (note that this removes the p tag itself and its child tags from the tree, leaving only the text):

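from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    soup = BeautifulSoup(f, "lxml")

for h2 in soup.find_all('h2'):
    i18n_string = "I18N_" + h2.string
    h2.string.replace_with(i18n_string)
    print(h2.string)

for p in soup.find_all('p'):
    # Join every text fragment inside the p tag, including text in child tags
    joined = ''.join(p.strings)
    i18n_string = "I18N_" + joined
    # Replace the whole p tag in the tree with the prefixed plain text
    p.replace_with(i18n_string)
    print(i18n_string)

# Write the modified tree; the with block closes the file afterwards
with open("I18N_index.html", "w") as f:
    f.write(str(soup))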

OUTPUT:

I18N_Pizza
I18N_Eggplant Parmesan
I18N_Italian Ice Cream
I18N_This is some random paragraph without child tags.
I18N_Delicious homebaked pizza.$8.99 pp
I18N_Try the authentic Italian flavor of baked aubergine.$6.99 pp
I18N_Our dessert specialty.$3.99 pp
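
If the p tags and their child tags should survive in the output file, one possible variation is to prefix only the first text fragment of each tag and leave the rest of the markup alone. This is a sketch under assumptions, not a tested drop-in: add_prefix is a hypothetical helper name, the HTML is a shortened copy of the page from the question, and "html.parser" stands in for "lxml".

```python
from bs4 import BeautifulSoup

html = """<html>
<body>
<h2>Pizza</h2>
<p>This is some random paragraph without child tags.</p>
<p>Delicious homebaked pizza.<br><em>$8.99 pp</em></p>
</body>
</html>"""


def add_prefix(soup, prefix="I18N_"):
    """Prefix the first text fragment of every h2 and p tag, in place."""
    # find_all returns a complete list, so editing the tree in the loop is safe
    for tag in soup.find_all(["h2", "p"]):
        first = next(tag.strings, None)  # None if the tag holds no text at all
        if first is None:
            continue
        # first is a NavigableString in the tree, so replace_with is available
        first.replace_with(prefix + first)
    return soup


soup = add_prefix(BeautifulSoup(html, "html.parser"))
headings = [str(h2) for h2 in soup.find_all("h2")]
paragraphs = [str(p) for p in soup.find_all("p")]
```

With this approach the br and em tags stay in place and only the leading text changes. It assumes each tag starts with meaningful text; a tag that begins with whitespace or a child tag would need extra handling.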

Maaz

With a simplified version of your code (that is, just taking care of the p tags issue), it looks like you have to replace p.string with p.text:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html: your HTML document as a string

for p in soup.find_all('p'):
    print('before: ', p.text)
    i18n_string = "I18N_" + p.text
    print('after ', i18n_string)

Output:

before:  This is some random paragraph without child tags.
after  I18N_This is some random paragraph without child tags.
before:  Delicious homebaked pizza.$8.99 pp
after  I18N_Delicious homebaked pizza.$8.99 pp
before:  Try the authentic Italian flavor of baked aubergine.$6.99 pp
after  I18N_Try the authentic Italian flavor of baked aubergine.$6.99 pp
before:  Our dessert specialty.$3.99 pp
after  I18N_Our dessert specialty.$3.99 pp
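
To also write the result back into the tree, one option is to assign the prefixed text to p.string, which replaces the tag's contents while keeping the p tag itself. A minimal sketch, assuming a one-paragraph snippet and the "html.parser" backend for illustration:

```python
from bs4 import BeautifulSoup

html = "<p>Our dessert specialty.<br><em>$3.99 pp</em></p>"
soup = BeautifulSoup(html, "html.parser")

for p in soup.find_all("p"):
    # Assigning to .string clears the tag's children and inserts one text node
    p.string = "I18N_" + p.get_text()

flattened = str(soup)
```

The trade-off is that the br and em child tags are discarded, so the paragraph ends up as flat text inside the p tag.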
Jack Fleeting
  • Thanks for your reply. I had tried 'text' before, but it did not resolve my inability to use 'replace_with'. – cbp Mar 06 '19 at 14:40