
OK, I found this link that explains how to make a GWTP app crawlable: https://code.google.com/p/gwt-platform/wiki/CrawlerSupport#Using_gwtp-crawler-service

I have some GWTP experience, but I know nothing about App Engine.

Google says its "crawlservice.appspot.com" can parse any Ajax page. Suppose I have a page "http://mydomain.com#!article" that shows an article pulled from a database, and that the page contains the text "this is my article". Now I open this link:

crawlservice.appspot.com/?key=123456&url=http://mydomain.com#!article. I can see all the JavaScript, but I can't find the text "this is my article".

Why?

Now let's check a real-life example.

Open this link, https://groups.google.com/forum/#!topic/google-web-toolkit/Syi04ArKl4k, and you will see the text "If i open that url in IE".

Now open http://crawlservice.appspot.com/?key=123456&url=https://groups.google.com/forum/#!topic/google-web-toolkit/Syi04ArKl4k and you can see all the JavaScript, but the text "If i open that url in IE" is not there.

Why is that?

So if I use http://crawlservice.appspot.com/?key=123456&url=mydomain#!article, will the Google crawler be able to see the text in mydomain#!article?

Also, why key=123456? Does it mean everyone can use this service? Do we get our own key? Does Google limit the number of calls to the service?

Could you explain all these things?

Extra Info:

Christopher suggested that I use this example: https://github.com/ArcBees/GWTP-Samples/tree/master/gwtp-samples/gwtp-sample-crawler-service

However, I ran into another problem. My app is pure GWTP; it doesn't have appengine-web.xml in WEB-INF. I have no idea what App Engine (GAE) is, or what Maven is.

Do I need to register for App Engine?

My app may get a lot of traffic, and I am using a GoDaddy VPS. I don't want to register for App Engine because I would have to pay Google for the extra traffic.

Everything in my GWTP app works right now except the crawler function.

So if I don't use Google App Engine, how can I build the crawler function for GWTP?

I tried to use HtmlUnit for my app, but HtmlUnit doesn't work for GWTP (see the details in "Why HTMLUnit always shows the HostPage no matter what url I type in (Crawlable GWT APP)?").

Tum
  • possible duplicate of [How to approach Google groups discussions crawler](http://stackoverflow.com/questions/2211887/how-to-approach-google-groups-discussions-crawler) – Howli May 18 '14 at 14:07
  • @Howli, this is about a crawlable solution for GWTP, not for Python. This is a very valuable question, so please do not close it. – Tum May 20 '14 at 22:58

2 Answers


I believe you are not allowed to crawl Google Groups. Probably they are actively trying to prevent this, so you do not see the expected content.

Peter Knego
  • So what about my own domain? I didn't see any text in http://crawlservice.appspot.com/?key=123456&url=mydomain#!article, and I did not set anything up on my website. – Tum May 17 '14 at 10:56
  • I can't test it, as you did not provide the URL of your app. Also, you are asking why a particular tool (crawlservice) isn't working: this is not a programming question, nor is it related to App Engine or GWT. – Peter Knego May 17 '14 at 12:32

There are a couple of points I wish to elaborate on:

  1. The Google Code documentation is no longer maintained. You should look on GitHub instead: https://github.com/ArcBees/GWTP/wiki/Crawler-Support
  2. You shouldn't use http://crawlservice.appspot.com. This isn't a Google service, it's out of date and we may decide to delete it down the road. This only serves as a public example. You should create your own application on App Engine (https://appengine.google.com/)
  3. There is a sample here (https://github.com/ArcBees/GWTP-Samples/tree/master/gwtp-samples/gwtp-sample-crawler-service) using GWTP's Crawler Service. You can basically copy-paste it. Just make sure you update the <application> tag in appengine-web.xml to the name of your application and use your own service key in CrawlerModule.

Finally, if your client uses GWTP and you followed the documentation, it will work. If you want to try it manually, you must encode the query parameters. For example, http://crawlservice.appspot.com/?key=123456&url=http://www.arcbees.com#!service will not work because the hash (the # and everything after it) is never sent to the server. On the other hand, http://crawlservice.appspot.com/?key=123456&url=http%3A%2F%2Fwww.arcbees.com%2F%23!service will work.
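
To make the encoding point concrete, here is a minimal Java sketch using only the JDK. The class and method names (`CrawlUrlBuilder`, `buildCrawlUrl`, `fragmentOf`) are made up for illustration and are not part of GWTP. Note that `java.net.URLEncoder` also encodes `!` as `%21`; that should decode to the same URL as the hand-encoded example above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical helper, not part of GWTP: builds a request URL for a crawl service.
class CrawlUrlBuilder {

    // Percent-encodes the key and the page URL so the "#!..." part
    // travels inside the query string instead of being cut off as a fragment.
    static String buildCrawlUrl(String serviceBase, String key, String pageUrl) {
        try {
            return serviceBase
                    + "?key=" + URLEncoder.encode(key, "UTF-8")
                    + "&url=" + URLEncoder.encode(pageUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns the part of a URL that a browser keeps to itself (after '#').
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String raw = "http://crawlservice.appspot.com/?key=123456"
                + "&url=http://www.arcbees.com#!service";
        // The unencoded hash is parsed as a fragment, so the server never sees it.
        System.out.println(fragmentOf(raw)); // !service

        String encoded = buildCrawlUrl("http://crawlservice.appspot.com/",
                "123456", "http://www.arcbees.com/#!service");
        // Prints http://crawlservice.appspot.com/?key=123456&url=http%3A%2F%2Fwww.arcbees.com%2F%23%21service
        System.out.println(encoded);
        // Nothing is left over as a fragment once the URL is encoded.
        System.out.println(fragmentOf(encoded)); // null
    }
}
```

Running `main` shows that the unencoded URL carries `!service` only as a fragment, while the encoded one keeps the whole target URL inside the `url` parameter.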

Chris
  • Hi Christopher, I have asked many questions about crawlable GWTP over the past few days, but no one could clearly explain to me what is going on. Your answer is like a light at the end of the tunnel. It's a very valuable piece of info. However, I have another problem. I updated my question, so please read the end of it, from "Extra Info". – Tum May 20 '14 at 22:53