2

This might be a completely foolish question, but Google has been no help. First, of course, I import the libraries I need:

from lxml import html
from lxml import etree
import requests

Simple enough. Now to fetch and parse some HTML. The link in this case is the weekly lunch menu of a local restaurant. Here I prepare the page for extracting the bits I want:

page = requests.get("http://www.farozon.se/lunchmeny-20207064")
tree = html.fromstring(page.text)
htmlparser = etree.HTMLParser()
tree2 = etree.parse(page.raw, htmlparser)

Now let's take a look at the menu! As you can see, I am testing several different ways of getting the desired output.

friday = tree.cssselect("#block_82470858 > div > div > div.h24_frame_personal_text.h24_frame_padding > div > table > tbody > tr:nth-child(4)")
test = tree.xpath("/html/body")

Let's just print the output to see what we get.

print page
print tree.cssselect('#block_82470858 > div > div > div.h24_frame_personal_text.h24_frame_padding > div > table > tbody > tr:nth-child(4)')
print tree2
print friday
print test

Looking forward to eating some... wait, that ain't food. The heck is that? In my attempts above, and in my IDE, I've tried the approaches from Google's top 20 results for lxml and requests; they all produce output like this, even though they claim to print the actual HTML. I have no clue what's going on.

<Response [200]>
[<Element tr at 0x30139f0>]
<lxml.etree._ElementTree object at 0x2db0dd0>
[<Element tr at 0x30139f0>]
[<Element body at 0x3013a48>]
Ruhpun
  • try adding `.text` on to the end of some of your objects... – MattDMo Jan 09 '15 at 02:42
  • I have tried adding `.text` literally anywhere I can; it either outputs the same thing or an error. This is my first Python project, so if you had a specific place in mind, please do share. – Ruhpun Jan 09 '15 at 02:45

3 Answers

2

Going through the lxml.etree and requests tutorials should help with understanding the basics.

<Response [200]>

This is a requests.Response object, returned in this case by the requests.get() call.

<lxml.etree._ElementTree object at 0x2db0dd0>

This is an ElementTree object returned by the parse() method.

tree.cssselect() and tree.xpath() in this case return a list of lxml.etree.Element instances; every item in the list corresponds to an HTML element on the page.
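To see what that means in practice, here is a tiny illustrative snippet (hand-made HTML, not the restaurant page) showing that xpath() returns Element objects, whose text you pull out with text_content():

```python
from lxml import html

# parse a small hand-made snippet instead of the live page
tree = html.fromstring(
    "<div><table><tr><td>Fredag</td><td>Nasi goreng</td></tr></table></div>"
)

rows = tree.xpath("//tr")        # a list of Element objects
print(rows)                      # prints the reprs, e.g. [<Element tr at 0x...>]
print(rows[0].text_content())    # prints the text inside the row
```

Printing the list itself gives the `[<Element tr at 0x...>]` reprs from the question; you have to ask each element for its content.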


Here is some example code for extracting the menu items:

from lxml import html
import requests

page = requests.get("http://www.farozon.se/lunchmeny-20207064")
tree = html.fromstring(page.text)

days = tree.cssselect("#block_82470858 table tr")[1:-1]
for item in days:
    cells = item.findall('td')
    day = cells[0].text_content().strip()
    dishes = cells[-1].text_content().strip()

    print day
    print dishes
    print "----"

Prints:

Måndag
----
Tisdag
----
Onsdag
  Helstekt kalkonbröstfile med rödkål, gele
  Panpizza med skinka,ananas,lök,bacon, vitkålssallad
 
----
Torsdag
 Ärtsoppa med fläsk, pannkaka, sylt, grädde
 Köttfärslimpa pampas med gräddsås, lingonsylt
...

As you can see, I'm using the text_content() method to extract the text contents of an Element object.
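As an illustration of the difference (on a hand-made element, not one from the page above): text_content() strips the tags and keeps only the text, while etree.tostring() serializes the markup itself:

```python
from lxml import html, etree

p = html.fromstring("<p>Ärtsoppa <b>med</b> fläsk</p>")

print(p.text_content())                       # tags stripped: all inner text
print(etree.tostring(p, encoding="unicode"))  # the markup, tags included
```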

alecxe
  • I think your encoding is a bit off ;) – Padraic Cunningham Jan 09 '15 at 03:12
  • This is quite genius. If I understand this correctly, you first define days as the html elements starting at [1] and going to [-1] (the last element?). Then you cycle through all the cells (`td` in HTML) and extract the day, followed by the dishes, as they are next to each other in tags. And print. – Ruhpun Jan 09 '15 at 03:17
  • @PadraicCunningham I don't know that language, who knows what letters they have :) (fixed, thanks) – alecxe Jan 09 '15 at 04:01
  • @Ruhpun right, iterating over `tr` tags (`[1:-1]` here just eliminates the first and the last row - header and the last `Veckans Sallad` row). day and dishes are extracted from the `td` elements inside the row. Hope this is clear. – alecxe Jan 09 '15 at 04:04
2

You might find BeautifulSoup an easier tool to use:

import requests
page = requests.get("http://www.farozon.se/lunchmeny-20207064")
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, "html.parser")
s = soup.find("div",attrs={"class":"h24_frame_personal_text h24_frame_padding"}).find("table").text

print "\n".join(s.strip().splitlines())

Dagens v. 2

Måndag

Tisdag

Onsdag
  Helstekt kalkonbröstfile med rödkål, gele
  Panpizza med skinka,ananas,lök,bacon, vitkålssallad

Torsdag
 Ärtsoppa med fläsk, pannkaka, sylt, grädde
 Köttfärslimpa pampas med gräddsås, lingonsylt

Fredag
 Brässerad skinkstek med äppelchutney
 Nasi goreng med sweetchili creme

Lördag
 10/1

Söndag
 11/1
Padraic Cunningham
  • You did get the encoding right, which makes my Swedish a lot easier to read, but I've been at this darn lxml thing for hours now, and it'd be a shame if I couldn't get it working. I'll definitely revert to this if I can't get the lxml way looking as pretty. Thanks for your time, man. – Ruhpun Jan 09 '15 at 03:22
  • @Ruhpun, no worries, alecxe's answer is nicer, personally I just prefer using beautifulSoup. – Padraic Cunningham Jan 09 '15 at 03:26
  • Yeah, if only I could figure out how to get tasty Köttfärslimpa (meatloaf) instead of Köttfärslimpa (deadloaf). Either way I go with the solution, the fact that you put your time into helping me is great, and I thank you for that. – Ruhpun Jan 09 '15 at 03:29
  • @Ruhpun, try `page.text` – Padraic Cunningham Jan 09 '15 at 03:34
  • @Ruhpun, or `parser = html.HTMLParser(encoding=page.encoding) tree = html.fromstring(page.content, parser=parser)` – Padraic Cunningham Jan 09 '15 at 03:41
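The encoding fix from the last comment, sketched offline on a hand-made byte string so it can be run without fetching the page (with a live response you would pass `encoding=page.encoding` and parse `page.content` instead):

```python
from lxml import html

raw = u"<p>Köttfärslimpa med gräddsås</p>".encode("utf-8")

# tell lxml which encoding the bytes use instead of letting it guess -
# a wrong guess is what turns å/ä/ö into garbage
parser = html.HTMLParser(encoding="utf-8")
tree = html.fromstring(raw, parser=parser)
print(tree.text_content())   # prints: Köttfärslimpa med gräddsås
```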
1

If you are looking for the HTML, you need etree.tostring(). Searches give you back lists of elements, so print each one individually. Like so:

for e in friday:
    print etree.tostring(e)

Or, in the case of unique items:

print etree.tostring(friday[0])

The docs are here; the pretty_print, method, and with_tail options are the most important.
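A small sketch of two of those options on a hand-made element (illustrative only): pretty_print re-indents the markup, and method="text" drops the tags entirely:

```python
from lxml import html, etree

el = html.fromstring("<div><p>Måndag</p><p>Tisdag</p></div>")

# pretty_print adds indentation and newlines to the serialized markup
print(etree.tostring(el, pretty_print=True, encoding="unicode"))

# method="text" keeps only the text content, no tags
print(etree.tostring(el, method="text", encoding="unicode"))
```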

Jonathan Eunice