I am using crawler4j to pull some data from the Google Play store (HTTPS pages). However, when I checked the downloaded HTML, I found that it is slightly different from the page source I see in the browser. Why? Is it because Google detected that I am using a bot client, so my HTTP request is handled differently?

Can anyone help me? Thanks a lot!

I have solved the problem. THANKS for all the help :)

andrew
  • I think so, yes. Can you change the useragent string for crawler4j? – Mr Lister Feb 02 '13 at 08:37
  • Yes. I checked my machine's user agent string and it is "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17". Then I set it as crawler4j's user agent, but this does not work. The pulled content is the same as before (different from what I see in browsers). Anyway, thanks a lot for your comment. Please give more advice if you know other solutions. Thanks! – andrew Feb 02 '13 at 13:48
  • Not sure I have any other ideas, sorry. – Mr Lister Feb 02 '13 at 13:50
  • Are you executing JavaScript? If you're not and the browser is, then you might get different content as the HTML document might be modified by the script. – Kiril Feb 05 '13 at 15:51
  • You are right. I found three reasons why I get different web pages: (1) User agent: Google's server may detect that I am a robot. (2) The HTTP Accept-Language header: I need to set it, as otherwise the returned page could be in another language. (3) JavaScript: some scripts are executed when a page is loaded. Thanks for your comments. – andrew Feb 06 '13 at 02:01
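
For anyone hitting the same issue, here is a minimal sketch of points (1) and (2) from the last comment: sending a browser-like User-Agent and an explicit Accept-Language header. It uses the JDK's own `java.net.http` API (Java 11+) rather than crawler4j so that it runs without extra dependencies. The class name `BrowserLikeRequest`, the helper `build`, the target URL, and the `en-US,en;q=0.9` value are assumptions for illustration, not something from the original thread. It does not address point (3): content generated by JavaScript would still need something that actually executes scripts, such as a headless browser.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class BrowserLikeRequest {

    // Hypothetical helper: builds a GET request that carries the two headers
    // discussed above. Nothing is sent over the network here.
    static HttpRequest build(String url, String userAgent, String acceptLanguage) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)            // point (1): look like a browser
                .header("Accept-Language", acceptLanguage)  // point (2): ask for a specific language
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // The user agent string is the one quoted in the comments above;
        // the URL and language value are illustrative assumptions.
        HttpRequest request = build(
                "https://play.google.com/store",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 "
                        + "(KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17",
                "en-US,en;q=0.9");

        // Print the headers that would be sent with the request.
        System.out.println(request.headers().map());
    }
}
```

To actually fetch the page you would pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. In crawler4j itself, the user agent is set through `CrawlConfig.setUserAgentString(...)`; how to add other headers depends on the crawler4j version in use.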

0 Answers