I need to wrap each word from a list of words in a span using BeautifulSoup.

For example

<html><body>word-one word-two word-one</body></html>

needs to be

<html><body><span>word-one</span> word-two <span>word-one</span></body></html>

where each occurrence of word-one needs to be wrapped in a span.

So far I am able to find those elements using:

for html_element in soup(text=re.compile('word-one')):
    print(html_element)

However, it isn't clear to me how to replace such text with a span.

Nishant
  • Are you also using [`lxml`](http://lxml.de/)? See [python lxml append element after another element](https://stackoverflow.com/questions/7474972/python-lxml-append-element-after-another-element) – Peter Wood May 13 '17 at 17:04
  • No just trying BS as I am finding it easier – Nishant May 13 '17 at 17:17

1 Answer

I've done something like this, where the variable html holds your markup, <html><body>word-one word-two word-one</body></html>. I separated the text from the markup, then put them back together.

from bs4 import BeautifulSoup

html = '<html><body>word-one word-two word-one</body></html>'
soup = BeautifulSoup(html, 'html.parser')
text = soup.text  # Only the text from the soup

soup.body.clear()  # Remove everything between the body tags

new_text = text.split()  # Split on whitespace into a list of words

# Note: this wraps every word and drops the whitespace between them
for i in new_text:
    new_tag = soup.new_tag('span')  # Create a new, empty <span> tag
    new_tag.append(i)  # Put the word inside it, e.g. <span>word-one</span>
    soup.body.append(new_tag)  # Append the finished tag to the body

We could also do this with regular expressions, to wrap only a word that matches a pattern.

import re

soup = BeautifulSoup(html, 'html.parser')
text = soup.text  # Only the text from the soup

soup.body.clear()  # Remove everything between the body tags

# Match the first run of word characters (here 'word', since \w does not match '-')
theword = re.search(r'\w+', text)
beginning, end = theword.start(), theword.end()

soup.body.append(text[:beginning])  # Add the text before the match

new_tag = soup.new_tag('span')  # Create a new tag

new_tag.append(text[beginning:end])  # Put the matched word inside the new tag
soup.body.append(new_tag)  # Append the tag, with the word inside it
soup.body.append(text[end:])  # Append everything that's left

I'm sure we could use .insert in a similar manner.
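
The regex version above only handles the first match. Here is a rough sketch of how it might be extended to every occurrence of word-one, assuming (as in the question) that the body holds nothing but plain text. The trick is re.split with a capturing group, which keeps the matched words in the returned list; the name `piece` is just for illustration.

```python
import re
from bs4 import BeautifulSoup

html = "<html><body>word-one word-two word-one</body></html>"
soup = BeautifulSoup(html, "html.parser")

text = soup.body.get_text()  # Only the text inside the body
soup.body.clear()            # Assumption: the body has no child tags worth keeping

# Splitting on a capturing group keeps the matched words in the result list
for piece in re.split(r"(word-one)", text):
    if piece == "word-one":
        span = soup.new_tag("span")
        span.string = piece
        soup.body.append(span)   # Matched word goes inside a <span>
    elif piece:
        soup.body.append(piece)  # Everything else is re-added as plain text

result = str(soup)
print(result)
```

Unlike text.split(), this keeps the whitespace between the words, but it still throws away any tags that were inside the body.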

innicoder
  • But won't you lose some important tags like div etc. if you do it like this? I mean I want to span them but not disturb divs, p's or tables – Nishant May 14 '17 at 06:00
  • I'm not quite sure what you mean; you gave me some HTML and I gave you a way of doing it in a case like this. You're probably doing it on a real website, so you'll have to narrow down to the parent tag and do the same thing there. Example: a = soup.find('p') and then a.div.clear(), and you'll clear everything inside it. I did put comments so you can understand what's going on. Please try to understand the code; I can refer you to the BeautifulSoup docs for each of these if it's easier. – innicoder May 14 '17 at 13:39
  • Wanted to clarify. Imagine body also had a div, so taking the text and clearing won't work no? Anyways lots of ideas in this one. – Nishant May 14 '17 at 13:43
  • Right, you're right, if the body had a div it would clear the div too. You can use .extract() to pull out only a portion, or, if the text is in the div, work on soup.body.div and clear only within the div. In the second example I showed how to use regular expressions; that would work even better in your case. – innicoder May 14 '17 at 13:45
  • An example with a div would help if you have time. I am not familiar with extract and all. – Nishant May 14 '17 at 13:47
  • I'd be happy to help you over chat, Skype or something else. Is there a way to set up a chat on Stack Overflow? Provide me with the particular HTML and I'll do my best. – innicoder May 14 '17 at 13:50
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/144179/discussion-between-elvir-muslic-and-nishant). – innicoder May 14 '17 at 14:54
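
For the case raised in the comments, where the body also contains a div or p that must not be disturbed, one possible approach is to leave the tags alone and rewrite each matching text node in place. This is only a sketch and not the answer's method: `wrap_word` and its parameters are hypothetical names, and it assumes a BeautifulSoup version that accepts the `string=` argument to find_all (older versions use `text=`, as in the question). It will also touch matching text inside comments or script tags, which may not be what you want.

```python
import re
from bs4 import BeautifulSoup, NavigableString

def wrap_word(soup, word, tag_name="span"):
    """Wrap every occurrence of `word` in a new tag, leaving other tags alone."""
    pattern = re.compile("(" + re.escape(word) + ")")
    # Materialise the list first, because the tree is modified inside the loop
    for node in list(soup.find_all(string=pattern)):
        pieces = []
        for piece in pattern.split(node):
            if piece == word:
                new_tag = soup.new_tag(tag_name)
                new_tag.string = piece
                pieces.append(new_tag)
            elif piece:
                pieces.append(NavigableString(piece))
        # Put the pieces in front of the old text node, then drop the old node
        for piece in pieces:
            node.insert_before(piece)
        node.extract()
    return soup

html = "<html><body><div>word-one in a div</div><p>word-two word-one</p></body></html>"
soup = wrap_word(BeautifulSoup(html, "html.parser"), "word-one")
result = str(soup)
print(result)
```

Because only text nodes are replaced, the div and p stay where they were; nothing is cleared from the body.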