
I'm trying to create a list of filter facets. I've loaded all the <span> elements into a list with bs4 and now need to grab a specific substring out of the larger string that is each <span>. I want to load each filter facet name into a list, ending up with a list that looks like this: [size, width, colour, etc].

List generated with bs4:

[<span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Size" data-v-05f803b1="">Size</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Width" data-v-05f803b1="">Width</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Colour" data-v-05f803b1="">Colour</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Heel Height" data-v-05f803b1="">Heel Height</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Product Type" data-v-05f803b1="">Product Type</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Function" data-v-05f803b1="">Function</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Age" data-v-05f803b1="">Age</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Technology" data-v-05f803b1="">Technology</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Material" data-v-05f803b1="">Material</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Price" data-v-05f803b1="">Price</span>]

What I've tried, which doesn't seem to get me anywhere:

facetcode = [str(i) for i in spans]

facets = []

for i in facetcode:
    facetcode1 = i.split(' ')
    for y in facetcode1:
        if 'data-facet-name' == True:
            print(y)

When I print(y) it gives me a blank list, but I'm expecting something like: data-facet-name="Size"

The result I want:

[size, width, colour, etc]

Am I overcomplicating this? The idea is to iterate over each list element and load only the text I want into a new list.

QHarr
LvP
  • Converting to string and then parsing those strings seems like totally the wrong approach. You have a structured markup language there; use a tool which understands its structure instead of writing your own ad hoc parser. – tripleee Sep 06 '19 at 18:15
  • This is one of the best first posts I've seen in a while! Well done – Kevin Welch Sep 06 '19 at 18:15

4 Answers


You want to extract the data-facet-name attribute from the spans that have that attribute. The set comprehension below also removes duplicates; if you really want a list, you can convert the set to a list afterwards.

from bs4 import BeautifulSoup as bs

html = '''
<html>
 <head></head>
 <body>
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Size" data-v-05f803b1="">Size</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Width" data-v-05f803b1="">Width</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Colour" data-v-05f803b1="">Colour</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Heel Height" data-v-05f803b1="">Heel Height</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Product Type" data-v-05f803b1="">Product Type</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Function" data-v-05f803b1="">Function</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Age" data-v-05f803b1="">Age</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Technology" data-v-05f803b1="">Technology</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Material" data-v-05f803b1="">Material</span>, 
  <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Price" data-v-05f803b1="">Price</span>
 </body>
</html>
  '''
soup = bs(html, 'lxml') #or 'html.parser'
print({i['data-facet-name'] for i in soup.select('span[data-facet-name]')})


QHarr
  • Thanks, this works! What is causing the printed result to be in a different order than the spans in the html? – LvP Sep 06 '19 at 21:33
  • Sets don't have order. I suspect you could use a list comprehension instead and then de-duplicate the result. – QHarr Sep 06 '19 at 21:34
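
The ordering point from the comments can be sketched as follows. This is only an illustration, assuming the names have already been pulled out with a list comprehension; the sample list and its duplicate entry are hypothetical. It relies on dicts preserving insertion order (Python 3.7+).

```python
# Hypothetical result of a list comprehension over the spans, e.g.
# [i['data-facet-name'] for i in soup.select('span[data-facet-name]')],
# with a duplicate added here purely for illustration.
names = ['Size', 'Width', 'Colour', 'Size', 'Heel Height']

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# so this de-duplicates without shuffling the way a set would.
unique_in_order = list(dict.fromkeys(names))

print(unique_in_order)  # ['Size', 'Width', 'Colour', 'Heel Height']
```

If duplicates are not a concern, the plain list comprehension already keeps document order on its own.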

Here's a list comprehension, assuming your data has been converted to a single string named bs4_arr (for example with str(spans)), since a list has no split() method:

attributes = ['='.join(word.split('=')[1:]).strip('"') for word in bs4_arr.split() if word.split('=')[0] == 'data-facet-name']

Here's what it's doing:

  • iterate through every whitespace-separated word in your HTML string
  • split the word on =
  • if the attribute name is data-facet-name, then we append the attribute value to our result

This is wasteful because it calls word.split('=') twice.

You can do it without a list comprehension as well (less wasteful):

attributes = []
for word in bs4_arr.split():
    tokens = word.split('=')
    name = tokens[0]
    value = '='.join(tokens[1:]).strip('"')
    if name == 'data-facet-name':
        attributes.append(value)

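Note that splitting on whitespace truncates multi-word values: "Heel Height" comes out as "Heel". A better approach, therefore, would be to continue using BeautifulSoup to parse your HTML.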
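
If you do stay with string matching, a regular expression is one way to keep multi-word values such as "Heel Height" intact. This is a swapped-in technique rather than part of the code above, and only a sketch: the shortened markup and the class name "c" are stand-ins for the real string, and it assumes the attribute is always written with double quotes.

```python
import re

# Stand-in for str(spans): a shortened version of the list from the question.
bs4_arr = (
    '[<span class="c" data-facet-name="Size" data-v-05f803b1="">Size</span>, '
    '<span class="c" data-facet-name="Heel Height" data-v-05f803b1="">Heel Height</span>]'
)

# Capture everything between the double quotes after data-facet-name=,
# so spaces inside the value are kept.
attributes = re.findall(r'data-facet-name="([^"]*)"', bs4_arr)

print(attributes)  # ['Size', 'Heel Height']
```

This still treats HTML as plain text, so it will break if the markup changes shape; parsing with BeautifulSoup avoids that.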

wcarhart

I think you might be missing some of the power of BS4!

import bs4

soup = bs4.BeautifulSoup('''<span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Size" data-v-05f803b1="">Size</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Width" data-v-05f803b1="">Width</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Colour" data-v-05f803b1="">Colour</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Heel Height" data-v-05f803b1="">Heel Height</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Product Type" data-v-05f803b1="">Product Type</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Function" data-v-05f803b1="">Function</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Age" data-v-05f803b1="">Age</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Technology" data-v-05f803b1="">Technology</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Material" data-v-05f803b1="">Material</span>, <span class="col-sm-8 col-xs-9 facet-menu-facet__filter-name-spacing" data-facet-name="Price" data-v-05f803b1="">Price</span>''', 'html.parser')

for span in soup.find_all('span', **{'data-facet-name': True}):
    print(span['data-facet-name'])

mattbornski
  • I think you're right! Thanks for taking a moment to teach.
    ```
    facets = []
    for span in soup.findAll('span', **{'data-facet-name': True}):
        facets.append(span['data-facet-name'])
    print(facets)
    ```
    – LvP Sep 06 '19 at 21:39

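You're printing y only when the string "data-facet-name" is equal to True, which it never is; the comparison doesn't involve y at all. I think you want that line to be if 'data-facet-name' in y, because each y is a whole token such as data-facet-name="Size" and so never equals "data-facet-name" exactly.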
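
A minimal sketch of the question's loop with a membership test on each token, to show what the condition needs to look at. The two strings stand in for str(i) on the real spans, and the shortened class attribute is an assumption made for readability.

```python
# Stand-ins for the strings produced by [str(i) for i in spans].
facetcode = [
    '<span class="col-sm-8" data-facet-name="Size" data-v-05f803b1="">Size</span>',
    '<span class="col-sm-8" data-facet-name="Width" data-v-05f803b1="">Width</span>',
]

facets = []
for markup in facetcode:
    for token in markup.split(' '):
        # Test the token itself rather than comparing a literal with True.
        if 'data-facet-name' in token:
            # token looks like data-facet-name="Size"; keep the quoted value.
            facets.append(token.split('=', 1)[1].strip('"'))

print(facets)  # ['Size', 'Width']
```

Because the markup is split on spaces, a value like "Heel Height" would still be cut to "Heel" here, which is one more reason to read the attribute through BeautifulSoup instead.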