
[To open the example URLs you need to log in to Shazam]

So I'm writing a script that downloads my Shazam history so I can then manipulate it to build playlists on other services. I can't parse the history directly from http://www.shazam.com/myshazam because the page relies heavily on JavaScript reloading, which would be harder to deal with. That's why I want to work with the downloadable file instead, which you can find here: http://www.shazam.com/myshazam/download-history

I'm trying to find a way to do this, but I'm running into some problems.

First, I was planning to use urlretrieve:

import urllib
urllib.urlretrieve("http://www.shazam.com/myshazam/download-history", "myshazam-history.html")

but I'm not even sure that's going to work, because there isn't an actual URL path to the file like http://www.shazam.com/myshazam/download-history/myshazam-history.html (that gives a 404 error). Instead, when you hit the download URL it immediately redirects to http://www.shazam.com and prompts the browser's download window.
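You can see the same behavior from Python. Here's a minimal sketch (without session cookies it will likely just redirect or fail with an error, but it shows how to inspect the final URL and the Content-Disposition header that triggers the download prompt):

import urllib2

# Unauthenticated request: used only to see where the redirect lands
# and what the server claims to be sending back.
try:
    response = urllib2.urlopen("http://www.shazam.com/myshazam/download-history")
    print(response.geturl())  # final URL after any redirects
    print(response.info().getheader("Content-Disposition"))  # e.g. attachment; filename=...
except urllib2.HTTPError as e:
    print(e.code)  # without session cookies this may well be 401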

The second problem is that I still need to hold the session cookies, and I don't know how to pass them to urlretrieve to test whether it works. Below is some test code I wrote that logs in, holds the session, and then parses a webpage.

import urllib2
import cookielib
from bs4 import BeautifulSoup

def LoginFB(username, password):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
    url = "https://www.facebook.com/login.php?skip_api_lo....allthe_loginshazam_stuff)"
    data = "email=" + username + "&pass=" + password
    socket = opener.open(url, data)  # POST the credentials so the session cookies get set
    return socket, opener

def shazamParse(opener):
    # The opener carries the session cookies set by LoginFB.
    url = "http://www.shazam.com/myshazam/"
    content = opener.open(url).read()
    soup = BeautifulSoup(content)
    finalParse = soup.prettify()
    return finalParse.encode("utf-8")

(socket, opener) = LoginFB("email","password")

shazamParse(opener)    

What I want to do is hit the download URL as a logged-in user (holding the session cookies), download the file into memory, put the contents of the file into a string, and then parse it with BeautifulSoup. It's exactly the same approach as my shazamParse function, only reading from a string with the contents of the myshazam-history.html file.
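In other words, something like this sketch (reusing the opener from LoginFB; whether the session cookies are enough for this to work is exactly what I'm unsure about):

def shazamHistoryParse(opener):
    url = "http://www.shazam.com/myshazam/download-history"
    # The opener holds the session cookies, so this should receive the
    # same myshazam-history.html file the browser offers for download.
    html = opener.open(url).read()
    soup = BeautifulSoup(html)
    return soup.prettify().encode("utf-8")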

Any ideas or hints on how I can do this?

whoisjuan

1 Answer


While I'll provide a direct answer here, there are several other libraries that will do this kind of thing in a much cleaner, more maintainable way for you. They are:

  1. Scrapy - A web spider framework which handles auth. It's a large tool, but it works well if you do a lot of scraping.
  2. requests library - This is what urllib2 should have been. Highly recommended for this job! (See the sketch after this list.)
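For example, with requests the whole flow collapses to a few lines. This is just a sketch: the login form fields mirror the ones in the question's code and are assumptions, not verified against Facebook's actual form.

import requests

session = requests.Session()  # holds cookies across requests automatically

# Log in; the field names here are assumptions copied from the question.
session.post("https://www.facebook.com/login.php",
             data={"email": "you@example.com", "pass": "your_password"})

# The session cookies are sent automatically with this request.
response = session.get("http://www.shazam.com/myshazam/download-history")
print(response.text[0:200])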

To do this with urllib2, you'll need to use a CookieJar from the cookielib library, so that urllib2 can hold onto the session and cookie variables set by the initial auth request.

import urllib2
from cookielib import CookieJar

cj = CookieJar()
# Creates a custom page opener, which is cookie aware
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Make the actual url request to the server
data = None  # Optional POST body (a urlencoded string); None sends a GET
response = opener.open("http://www.example.com/page/page2", data)
page_data = response.read()

# Look at the HTML from the response
print page_data[0:200]

Once you set up a urllib2 opener with a CookieJar, all future requests made through that opener will send the cookies set by previous requests.
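If you don't want to pass the opener object around, you can also install it globally, so that plain urllib2.urlopen calls share the same cookie jar:

# After this, every urllib2.urlopen() call uses our cookie-aware opener.
urllib2.install_opener(opener)
page_data = urllib2.urlopen("http://www.example.com/page/page2").read()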

There is another issue you might run into with using Facebook auth to access their site, but that should be posted as a new question if you run into it!

Hope that helps!

VooDooNOFX
  • Thanks. So I think I have the cookie problem covered with the same implementation you just posted. My problem is that I don't know how to pass the cookie to the .urlretrieve() method so I can use it to download a file. – whoisjuan Oct 21 '14 at 03:31
  • @user3381594: I edited my answer so you can see how to retrieve the page's HTML data. There's no need to use `urlretrieve`, since that stores it on the drive. We can work with it all in memory. – VooDooNOFX Oct 21 '14 at 22:57
  • Thanks man. I tried your method but it's not working: I get an HTTP 401 error. The thing is that the URL I'm trying to parse doesn't render HTML at all. What it does is prompt you to DOWNLOAD an HTML file - that's your Shazam history. What I'm trying to achieve is to download that file into memory and, once it's there, parse it as normal HTML. But I want to repeat that this URL DOESN'T RENDER HTML; instead it gives you an HTML file to download. A couple of weeks ago they were serving the list as a PDF, but they changed to an HTML file some days ago. Any hint? – whoisjuan Oct 23 '14 at 19:38
  • A 401 error means Shazam doesn't think you're authorized to view the page. Work on the auth section first. `urllib2` will download whatever the server returns for the request, even if that is a file download response. – VooDooNOFX Oct 23 '14 at 22:23