Scraping a webpage that is using a Firebase database

Question

DISCLAIMER: I'm just learning by doing, I have no bad intentions

So, I would like to fetch the list of the applications listed on this website: http://roaringapps.com/apps

I've done similar things in the past, but with simpler websites; this time I'm having problems getting my hands on the data behind this webpage.

The scrolling from page to page is blazing fast so, to understand how the webpage works, I've fired up a packet sniffer and analyzed the traffic. I've noticed that, after the initial loading, no traffic is exchanged between the server and my client, even if I scroll over 2500 records in the browser. How is that possible?

Anyhow. My understanding is that the website is loading the data from a stream of some sort, and renders it via JavaScript. Am I correct?

So, I've fired up Chromium devtools and looked at the "network" tab, and saw that a WebSocket request is made to the following address: wss://s-usc1c-nss-123.firebaseio.com

At this point, after googling a bit, I've tried to query the very same server, using the "v=5&ns=roaringapps" query I saw on the devtools window:

import json
from websocket import create_connection

# Open the WebSocket seen in devtools, send the query string, and print the reply
ws = create_connection('wss://s-usc1c-nss-123.firebaseio.com')
ws.send('v=5&ns=roaringapps')
print(json.loads(ws.recv()))

And got this reply:

{u't': u'c', u'd': {u't': u'h', u'd': {u'h': u's-usc1c-nss-123.firebaseio.com', u's': u'JUL5t1nC2SXfGaIjwecB6G13j1OsmMVv', u'ts': 1476799051047L, u'v': u'5'}}}

I was expecting to see a JSON response with the raw data about applications and so on. What am I doing wrong?

Thanks a lot!

UPDATE

Actually, I just found out that the website is using json to load its data. I was not seeing it in iterated requests probably because of caching - but disabling it in chromium did the trick.

score 7 · Accepted Answer · answered Oct 18 '16 at 15:23

The Firebase Database allows you to read and write JSON data, but its SDKs don't simply transfer the raw JSON; they do many tricks on top of that to ensure an efficient and smooth experience.

What you're getting there is Firebase's wire protocol. The protocol is not publicly documented and (if you're new to it) trying to unravel it is going to give you an unpleasant time.

To retrieve the actual JSON at a location, it's easiest to use Firebase's REST API. You can do that by simply appending .json to the URL and firing an HTTP GET request against it.

So if the initial data is being loaded from:

https://mynamespace.firebaseio.com/path/to/data

You'd get the raw JSON by firing an HTTP GET against:

https://mynamespace.firebaseio.com/path/to/data.json
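
As a rough illustration of that request, here is a minimal Python 3 sketch using only the standard library. The helper names rest_url and fetch_json are hypothetical, the URL is the placeholder from above rather than a real endpoint, and it assumes the database rules allow unauthenticated reads at that location.

```python
import json
from urllib.request import urlopen


def rest_url(location):
    """Build the REST endpoint for a database location by appending .json."""
    return location.rstrip("/") + ".json"


def fetch_json(location, timeout=10):
    """Send an HTTP GET to the REST endpoint and decode the JSON body.

    Assumes the location is readable without authentication.
    """
    with urlopen(rest_url(location), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder location; swap in the real namespace and path.
    print(rest_url("https://mynamespace.firebaseio.com/path/to/data"))
```

Calling fetch_json with the real location should return the decoded data as ordinary Python dicts and lists, with no need to speak the WebSocket protocol at all.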

Thanks Frank, now I have at least a clue of what I was messing with :) I will have a look at the docs, it's all new to me. - Delta, Oct 19 '16 at 06:39
