I have a webpage of unordered lists, and I want to turn them into a pandas dataframe as the first step of an NLP workflow.
import pandas as pd
from bs4 import BeautifulSoup
html = '''<html>
<body>
<ul>
<li>
Name
<ul>
<li>Many</li>
<li>Stories</li>
</ul>
</li>
</ul>
<ul>
<li>
More
</li>
</ul>
<ul>
<li>Stuff
<ul>
<li>About</li>
</ul>
</li>
</ul>
</body>
</html>'''
soup = BeautifulSoup(html, 'lxml')
The goal is for each top-level list to become one row of a dataframe, which would look something like this:
       0      1        2
0   Name   Many  Stories
1   More   null     null
2  Stuff  About     null
I tried the following code to get all the list items (complete with sublists):
target = soup.find_all('li')
But it returns the nested items twice, once inside their parent item and again on their own:
[<li>
Name
<ul>
<li>Many</li>
<li>Stories</li>
</ul>
</li>, <li>Many</li>, <li>Stories</li>, <li>
More
</li>, <li>Stuff
<ul>
<li>About</li>
</ul>
</li>, <li>About</li>]
Really lost here. Thanks.
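For context, the duplication appears to come from find_all searching all descendants by default, so a nested item matches both as part of its parent and by itself. The sketch below is only an illustration on a shortened copy of the HTML; it uses the built-in "html.parser" instead of lxml (an assumption made to avoid the extra dependency), and the names all_items, top_lists and top_items are made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the page, just two top-level lists (illustrative only).
html = (
    "<body>"
    "<ul><li>Name<ul><li>Many</li><li>Stories</li></ul></li></ul>"
    "<ul><li>More</li></ul>"
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: every <li> at any depth is a match, so the nested
# items show up inside "Name" and again as separate results.
all_items = soup.find_all("li")

# recursive=False limits the search to direct children of the tag it is
# called on, so only the top-level lists and their own items are returned.
top_lists = soup.body.find_all("ul", recursive=False)
top_items = [li for ul in top_lists for li in ul.find_all("li", recursive=False)]

print(len(all_items), len(top_lists), len(top_items))
```

With recursive=False the nested items are still reachable through their parent, they are just no longer returned as separate matches.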
If there is a `<li>` following another `<li>` without a `</li>`, then clearly those elements will be nested and thus you need to treat them differently. – W Stokvis May 09 '18 at 15:58
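One possible way to treat the nested items differently is to loop over the top-level lists only and flatten each one's text into a single row. This is a rough sketch, not necessarily the best approach: it assumes each top-level list should become exactly one row, it uses "html.parser" rather than lxml to keep the example dependency-free, and the rows variable is just a name chosen for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """<html><body>
<ul><li>Name<ul><li>Many</li><li>Stories</li></ul></li></ul>
<ul><li>More</li></ul>
<ul><li>Stuff<ul><li>About</li></ul></li></ul>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

rows = []
# Only direct <ul> children of <body>, so nested lists are not visited twice.
for ul in soup.body.find_all("ul", recursive=False):
    # stripped_strings yields each non-empty text fragment in document
    # order: the top-level item's text first, then its nested items.
    rows.append(list(ul.stripped_strings))

# Rows of unequal length are padded with missing values by pandas.
df = pd.DataFrame(rows)
print(df)
```

If a top-level list can hold several top-level items, they would all end up in the same row here, so that case would need its own handling.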