
I have a webpage of unordered lists, and I want to turn them into a pandas dataframe as the first step of an NLP workflow.

import pandas as pd
from bs4 import BeautifulSoup
html = '''<html>
        <body>
          <ul>
              <li>
              Name
                    <ul>
                        <li>Many</li>
                        <li>Stories</li>
                    </ul>
                </li> 
          </ul>
          <ul>
              <li>
              More
              </li>
         </ul>
         <ul>
             <li>Stuff 
                     <ul>
                         <li>About</li>
                    </ul>
            </li>
        </ul>
        </body>
        </html>'''

soup = BeautifulSoup(html, 'lxml')

The goal is for each top-level list to become one row of a dataframe, which would look something like this:

       0      1        2
0   Name   Many  Stories
1   More   null     null
2  Stuff  About     null

I tried to use the following code to get all the list items (complete with sublists):

target = soup.find_all('ul')

But it returns duplicate entries:

[<li>
                   Name
                         <ul>
 <li>Many</li>
 <li>Stories</li>
 </ul>
 </li>, <li>Many</li>, <li>Stories</li>, <li>
                   More
                   </li>, <li>Stuff 
                          <ul>
 <li>About</li>
 </ul>
 </li>, <li>About</li>]

Really lost here. Thanks.

epic556
  • For one, you're looking for all `ul` elements on the page, so because "Many" and "Stories" are contained under two different `ul` elements, you're gonna see them twice. It may be better to specify which `ul` elements you want by using specific pathing rather than a blanket search for all – W Stokvis May 07 '18 at 18:38
  • To add to the point above about pathing, it seems you're looking for `ul` elements that appear after the `body` tag so define that path. The duplicated instances are due to `body > ul > li > ul` which holds true with `soup.find_all('ul')` – W Stokvis May 07 '18 at 18:41
  • I recommend looking [here](https://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup) so that you can use XPaths, which will help you resolve this issue. Your XPath would end up being something like this: `body/ul`. The single slash indicates any `ul` element that is a direct child of the `body` element – W Stokvis May 07 '18 at 18:54
  • Thanks @WStokvis, I can use xpath to get the top level of each list, but cannot figure out how to iterate down each one. I have thousands of these to sort, and it's all untagged html from the 1990s. – epic556 May 08 '18 at 18:18
  • Without more content to work with, I can't provide insight into how to properly code it so you get what you want. I guess the first thing I'd look at is whether you can write some logic that searches the tree for nested `ul` elements. For instance, if you see a `<ul>` following another `<ul>` without a closing `</ul>` in between, then clearly those elements will be nested and thus you need to treat them differently. – W Stokvis May 09 '18 at 15:58
  • Thanks. Literally this is what the content looks like. Thousands of unsorted lists, all completely untagged. Some with one level, one entry, i.e. the item "More." Many more with two levels -- like "Name" and "Stuff" with anything from two to a dozen sublist items. That's really it. – epic556 May 10 '18 at 16:38
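
A rough sketch of what the comments describe, staying in BeautifulSoup: restrict the search to `ul` elements that are direct children of `body` (so nested lists are not matched a second time), then pull the parent label and the nested sub-items out of each one. This is only illustrative and assumes the `soup` object and `pd` import from the question; the `top_level` and `rows` names are made up here.

# only the <ul> elements that sit directly under <body>, so nested lists are skipped
top_level = soup.body.find_all('ul', recursive=False)

rows = []
for ul in top_level:
    li = ul.find('li')                          # the single top-level <li> in each list
    label = next(li.stripped_strings)           # first non-blank string, e.g. "Name"
    subitems = [sub.get_text(strip=True) for sub in li.find_all('li')]   # nested <li> texts
    rows.append([label] + subitems)

df = pd.DataFrame(rows)   # rows of unequal length are padded with None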

1 Answer


Breakdown in the comments, enjoy!

from lxml import etree
import pandas as pd
import re

# Convert the html string from the question into an lxml tree so we can run XPath queries
root = etree.XML(html)
tree = etree.ElementTree(root)

# Your XPath selector: every <ul> that is a direct child of <body>
xpathselector = 'body/ul'

# List of lxml elements (one per top-level list) that need to be decoded
hold = tree.xpath(xpathselector)

'''
1. Serialize each element in hold back to a string
2. Decode the bytes to str
3. Strip all tags and \n from each list
4. Split on whitespace to create a list of lists
'''
df = pd.DataFrame(
    [re.sub(r'(\n)|(<.{0,3}>)', '', etree.tostring(i).decode('utf-8')).split() for i in hold]
)
df
       0      1        2
0   Name   Many  Stories
1   More   None     None
2  Stuff  About     None
W Stokvis
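
One caveat with the regex-and-split approach above: `.split()` breaks on every space, so a list item made up of more than one word (likely in the untagged 1990s pages mentioned in the comments) would spill across several columns. A hedged alternative, reusing the `hold` list built in the answer, is to take each element's text nodes with lxml's itertext() instead of stripping tags with a regex, so every `<li>` stays in one cell:

# keep each <li>'s text intact, even if it contains spaces
rows = []
for ul in hold:
    texts = (t.strip() for t in ul.itertext())   # every text node under this <ul>
    rows.append([t for t in texts if t])         # drop whitespace-only nodes

df = pd.DataFrame(rows)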