
I've used the map function on a dataframe column of postcodes to create a new Series of tuples which I can then manipulate into a new dataframe.

def scrape_data(series_data):
    #A bit of code to create the URL goes here

    r = requests.get(url)
    root_content = r.content
    root = lxml.html.fromstring(root_content)
    
    address = root.cssselect(".lr_results ul")
    for place in address:
        address_property = place.cssselect("li a")[0].text
        house_type = place.cssselect("li")[1].text
        house_sell_price = place.cssselect("li")[2].text
        house_sell_date = place.cssselect("li")[3].text
        return address_property, house_type, house_sell_price, house_sell_date

df = postcode_subset['Postcode'].map(scrape_data)

While it works where there is only one property on a results page, it fails to create a tuple for multiple properties.

What I'd like to be able to do is iterate through a series of pages and then add that content to a dataframe. I know that Pandas can convert nested dicts into dataframes, but I'm really struggling to make it work. I've tried to use the answers at "How to make a nested dictionary and dynamically append data", but I'm getting lost.

halfer
elksie5000

2 Answers


At the moment your function returns on the first place in address, so only the first result ever comes back. In Python you would usually yield (rather than return) to retrieve all the results as a generator.
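To illustrate the difference, here is a minimal sketch that uses made-up address strings in place of the scraped elements. The function names `first_only` and `all_places` are hypothetical and only serve the comparison:

```python
def first_only(places):
    # `return` inside the loop exits the function on the first iteration
    for place in places:
        return place.upper()


def all_places(places):
    # `yield` hands back one result per iteration, making this a generator
    for place in places:
        yield place.upper()


# Made-up stand-ins for the rows found on a results page
places = ["1 high street", "2 high street"]

first = first_only(places)             # only the first item is processed
everything = list(all_places(places))  # the generator is consumed in full
```

Wrapping the generator in `list(...)` is what actually runs the loop to the end; until it is consumed, nothing has been processed.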

When subsequently doing an apply/map, you'll usually want the function to return a Series...

However, I think you just want to return the following DataFrame:

return pd.DataFrame([{'address_property': place.cssselect("li a")[0].text,
                      'house_type': place.cssselect("li")[1].text,
                      'house_sell_price': place.cssselect("li")[2].text,
                      'house_sell_date': place.cssselect("li")[3].text}
                     for place in address],
                    index=address)
Andy Hayden
  • Thanks for that - I'll give it a go. I'm still very green to Python and its data structures although finding Pandas is definitely worth the effort to learn in my work. (Also thank you for tidying up my code) – elksie5000 Jun 05 '13 at 09:37
  • That returned - SyntaxError: Generator expression must be parenthesized if not sole argument - Not quite sure where to put the parentheses, though. – elksie5000 Jun 05 '13 at 09:48
  • @elksie5000 whoops! corrected by making it a list comprehension (sorry I had done this on my test). – Andy Hayden Jun 05 '13 at 09:50
  • That's brilliant - and super fast with your help. Really appreciate your time. – elksie5000 Jun 05 '13 at 09:55

To make the code work, I eventually reworked Andy Hayden's solution to:

    # This replaces the loop inside scrape_data; postcode_bit is assumed to be defined earlier in the function
    listed = []
    for place in address:
        # One dict per property on the results page
        listed.append({'postcode': postcode_bit,
                       'address_property': place.cssselect("li a")[0].text,
                       'house_type': place.cssselect("li")[1].text,
                       'house_sell_price': place.cssselect("li")[2].text,
                       'house_sell_date': place.cssselect("li")[3].text})
    return listed
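
Because the function now returns a list of dicts per postcode, mapping it over the postcodes gives a list of lists, which still needs flattening before it can become one table. A rough sketch of that step follows. The `scrape_data` here is a hypothetical stand-in that returns canned data rather than scraping anything, and the postcodes are made up:

```python
from itertools import chain


def scrape_data(postcode):
    # Hypothetical stand-in for the real scraper: one dict per property
    fake_pages = {
        "AB1 2CD": [{"postcode": "AB1 2CD", "house_type": "Terraced"},
                    {"postcode": "AB1 2CD", "house_type": "Flat"}],
        "EF3 4GH": [{"postcode": "EF3 4GH", "house_type": "Detached"}],
    }
    return fake_pages.get(postcode, [])


postcodes = ["AB1 2CD", "EF3 4GH"]

# One list of dicts per postcode, i.e. a list of lists
per_postcode = [scrape_data(p) for p in postcodes]

# Flatten into a single list with one dict per property
rows = list(chain.from_iterable(per_postcode))
```

With pandas available, `pd.DataFrame(rows)` should then give one row per property, with the dictionary keys as columns.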

At least I understand a bit more about how Python data structures work now.

elksie5000