How to crawl through HTML string content (tag by tag) using Python

Question

I have an HTML string and would like to find the text elements and replace them with tokens. I used BeautifulSoup to get the data, but get_text() returns only the text, not the corresponding elements. How can I go through the HTML string from the root node to the last node, find the text elements, and replace them with tokens I define? The HTML string comes from a dynamic source and is large; a small example is given below.

<html>
<body>
<p>Hi</p>
<p>Hello</p>
</body>
</html>

To
----
<html>
<body>
<p>Token1</p>
<p>Token2</p>
</body>
</html>

I think this is what you are looking for. https://stackoverflow.com/questions/36108621/get-all-html-tags-with-beautiful-soup — Kapil Lamichhane, Jan 13 '20 at 20:27
Does this answer your question? [Get all HTML tags with Beautiful Soup](https://stackoverflow.com/questions/36108621/get-all-html-tags-with-beautiful-soup) — Prayson W. Daniel, Jan 13 '20 at 20:29

score 0 · Answer 1 · answered Jan 14 '20 at 01:35

You are probably looking for something like this:

from bs4 import BeautifulSoup as bs
subst = """
<html>
<body>
<p>Hi</p>
<p>Hello</p>
</body>
</html>
"""
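tokens = ['Token1', 'Token2']
soup = bs(subst, 'lxml')  # the 'lxml' parser requires the lxml package
targets = soup.find_all('p')
# Pair each <p> with its token by position. zip avoids targets.index(target),
# which compares tags by value and rescans the list on every iteration.
for target, token in zip(targets, tokens):
    # target.string is None when a tag has more than one child node
    target.string.replace_with(token)
print(soup)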

Output:

<html>
<body>
<p>Token1</p>
<p>Token2</p>
</body>
</html>

answered Jan 14 '20 at 01:35 by Jack Fleeting

The HTML string is dynamic, meaning it does not always contain p tags. The code should go through all tags, find the text elements, and replace them with the tokens I have – Bobby Nagendra Jan 14 '20 at 07:13
@BobbyNagendra - You should have made this clear in your question and have an html string that is representative of the actual one. Please edit your question accordingly. – Jack Fleeting Jan 14 '20 at 11:29
I have mentioned it in the question – Bobby Nagendra Jan 16 '20 at 07:36
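
For the case raised in the comments, where the tag names are not known in advance, one possible approach is to iterate over the text nodes rather than over specific tags. The following is only a sketch under some assumptions: the function name replace_text_with_tokens and the Token1, Token2, ... naming scheme are made up for illustration, whitespace-only strings are skipped, comments and script/style content are left alone, and the built-in 'html.parser' is used instead of 'lxml' so that no extra package is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def replace_text_with_tokens(html, parser="html.parser"):
    """Replace every visible text node with a generated token.

    Hypothetical helper: returns the rewritten HTML and a dict that maps
    each token back to the (stripped) text it replaced.
    """
    soup = BeautifulSoup(html, parser)
    mapping = {}
    counter = 0

    # Copy the result into a list first, because the tree is modified
    # while we loop over it.
    for node in list(soup.find_all(string=True)):
        # Skip comments, doctypes and other special string subclasses.
        if type(node) is not NavigableString:
            continue
        # Skip text that is not rendered as page content.
        if node.parent is not None and node.parent.name in ("script", "style"):
            continue
        text = node.strip()
        if not text:
            continue  # whitespace between tags
        counter += 1
        token = f"Token{counter}"  # assumed naming scheme
        mapping[token] = text
        node.replace_with(token)

    return str(soup), mapping


if __name__ == "__main__":
    sample = "<div><h1>Hi</h1><span>Hello</span></div>"
    new_html, mapping = replace_text_with_tokens(sample)
    print(new_html)
    print(mapping)
```

Because the loop visits text nodes in document order, it does not depend on which tags wrap the text. Note that this sketch replaces the whole text node, so leading or trailing whitespace around the original text is not preserved. If you already have your own list of tokens, the generated name could be swapped for the next item from that list.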
