
I am trying to fetch a Facebook user's profile page using "wget", but I keep getting a non-profile page called "browser.php" which has nothing to do with that particular user. The profile page's URL, as I see it in the browser, has the following format:

http://www.facebook.com/user-name

and that's what I have been using as the argument to the wget command:

wget http://www.facebook.com/user-name

I am also interested in using wget to fetch a user's friends list, but even that gives me the same unhelpful result ("browser.php"):

wget "http://www.facebook.com/user-name?sk=friends&v=friends"

Could someone kindly advise me on what I'm doing wrong here? In other words, am I missing some key wget options, or does wget not fit this scenario at all?

Any help will be greatly appreciated.

To add context to this query: I need to figure out how to fetch these pages from Facebook using wget, as that would then help me write a script/program to look up friends' profile URLs in the HTML source code and then search for some other keywords on them, etc. I am basically hoping that this would help me do some kind of selective crawling (with Facebook's permission, of course) of people I am not connected to.
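For reference, the extraction step I have in mind is a rough sketch like the following, where user-name.html stands in for whatever file wget saves (the pattern is only a first approximation of what a profile link looks like):

# pull unique facebook.com links out of a saved page (rough sketch)
grep -oE 'href="https?://www\.facebook\.com/[^"?#]+"' user-name.html | sort -u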

rogerchucker

6 Answers


First, Facebook has probably created a condition where certain user agents (e.g. wget) cannot crawl its pages: it redirects those user agents to a different page, which probably says something like "your browser is not supported". They do that to prevent people from doing exactly what you are doing. However, you can tell wget to identify itself as a different agent using the -U option (read the wget man page), e.g. wget -U Mozilla http://....
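For example, a minimal sketch that impersonates a desktop Firefox (the user-agent string is just an illustration, not anything Facebook specifically checks for):

wget -U "Mozilla/5.0 (X11; Linux x86_64; rv:5.0) Gecko/20100101 Firefox/5.0" http://www.facebook.com/user-name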

Second, Facebook's privacy settings rarely allow you to read much information unless you are logged in as a user, and probably only as a user who is a friend of the profile you are trying to scrape.

Third, there is a Facebook API which you should use to crawl and extract information from Facebook -- you are likely in violation of the acceptable use policy if you try to obtain information in any other way.
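For example, a public profile lookup through the Graph API is roughly the following sketch (user-name is a placeholder, and most non-public fields additionally require a valid access token):

curl https://graph.facebook.com/user-name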

Soren
  • If a person who is not in my network publishes her friends list, is it still private information from Facebook's perspective? – rogerchucker Jul 25 '11 at 20:28
  • I was first thinking of going the Facebook API route (specifically the Graph API), but it seems like all friends information requires an access token, and I wanted to do this unobtrusively. Would this be in violation of Facebook's policy even if it is strictly academic research? – rogerchucker Jul 25 '11 at 20:29
  • The general rule for Facebook data via the Facebook API is simple -- *if you can get it via the Facebook API, then it is either data which you have been granted access to or data which is public.* Most of the Facebook APIs will allow you to ask for data about a user, and the API will return the data which you have asked for **and** which you are allowed to see. Hence your application, when dealing with Facebook data, should be built so that it can accept both data **and** no data being returned for similar requests. – Soren Jul 26 '11 at 04:09
  • @user611846 -- I'm not sure if there is a precise line where Facebook considers it a violation of the TOS (I have no affiliation with Facebook); however, I believe they look at abnormal behaviour patterns and react at their discretion. Many companies **do** however want to support academic research, and they often have programs for it where they will put contractual conditions in place and potentially grant you access to some anonymized data -- if this is truly for academic research, why don't you contact Facebook directly and ask if they have such a program? – Soren Jul 26 '11 at 04:19

If you want to save the logged-in page, you can log in with Firefox with "Keep me logged in" selected, then copy those cookies to a file and use them with wget's --load-cookies option. You will still have quite a bit of dynamically script-loaded content that wget isn't going to save.
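For example, a rough sketch, assuming the Firefox cookies have been exported to a Netscape-format cookies.txt file (the filename is arbitrary):

wget --load-cookies=cookies.txt --keep-session-cookies "http://www.facebook.com/user-name"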

There are many ways to skin this cat. If you need to extract a specific item, check out the API. If you simply want to archive a snapshot of the page as it would appear in a web browser, try CutyCapt. It's much like wget, except it renders the entire document as a web browser would and stores an image of the page.
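For example, a CutyCapt call is roughly the following sketch (the output filename is arbitrary, and the binary may be installed as CutyCapt depending on the platform):

cutycapt --url=http://www.facebook.com/user-name --out=profile.png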

David

Check the following open-source projects:

  • facebook-cli, a command-line utility to interact with the Facebook API.
  • facebook-friends, which can generate an HTML page of all of your Facebook friends.
kenorb

I don't know why you want to use wget; Facebook offers an excellent API.

wget --user-agent=Firefox http://www.facebook.com/markzuckerberg

will save the publicly available content to a file.

You should consider using their API:

Facebook Developers

Vamsi Krishna B
  • Thanks Krish. Unfortunately that doesn't work, since it gives me a file (where the filename is the username) that doesn't have the information I'm after. Also, if I was "lying" to Facebook by changing the user-agent, wouldn't Facebook actually object when I seek real permission for crawling (since that's my final objective)? I am trying to use wget because I don't know anything else. Any other suggestion would be greatly helpful as well - I am looking for anything that could work from within a script or a program. – rogerchucker Jul 25 '11 at 20:25
  • Krish, the Facebook API requires an access token for every user whose profile I'm trying to fetch. That would be impractical for unobtrusive data collection. – rogerchucker Jul 25 '11 at 20:40

You can easily reuse Firefox cookies to log in, as described in an earlier answer.

Who can see your friend list is configurable, so if someone configures it to Friends only, you cannot extract that information.

Also, I recommend using the mobile site, which uses pagination instead of AJAX loading and has much simpler, smaller HTML: https://m.facebook.com/USER/friends?startindex=24
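For example, a rough sketch that walks the paginated list, assuming logged-in cookies exported to cookies.txt and a page size of 24 (both are assumptions; USER is a placeholder):

# fetch successive pages of the friends list; adjust the range as needed
for i in 0 24 48 72; do
  wget --load-cookies=cookies.txt -O "friends-$i.html" "https://m.facebook.com/USER/friends?startindex=$i"
done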

And here are the (very restrictive) scrape terms: https://www.facebook.com/apps/site_scraping_tos_terms.php

Ciro Santilli OurBigBook.com

To download a Facebook page using wget, you can use Chrome DevTools in your web browser (similar tools exist in Firefox, Opera, and others).

First, capture the request as a curl command: go to the Network tab (refresh the page if necessary, or tick Preserve log), find the page of interest (you can filter the list), right-click the request, then select Copy as cURL. Then paste the command into the terminal.

To convert from curl format to wget, do the following conversions:

  • remove the --compressed parameter,
  • change -H to --header in all places.

Consider also adding the following wget parameters:

  • -k or --convert-links, to convert the links in the document to make them suitable for local viewing.
  • -p or --page-requisites, to download all the files that are necessary to properly display a page.
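
The converted command then looks roughly like the following sketch (every header value here is a placeholder for whatever DevTools actually copied):

wget --header 'User-Agent: Mozilla/5.0 (placeholder)' --header 'Cookie: c_user=PLACEHOLDER; xs=PLACEHOLDER' --convert-links --page-requisites 'https://www.facebook.com/user-name'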


kenorb