Getting your GWT app indexed by search engines is very important, but there is very little information that gives step-by-step guidelines on how to make a GWT app crawlable dynamically.

OK, here is what I have understood so far, but I am not 100% sure it is correct, so please correct me if you can.

To make a GWT page, e.g. myDomain.com#article;articleID=1, indexable by search engines, you need to:

- 1st, convert myDomain.com#article;articleID=1 to myDomain.com#!article;articleID=1

- 2nd, when the Google/Yahoo bot visits that page (myDomain.com#!article;articleID=1), it converts the URL into myDomain.com?_escaped_fragment_=article;articleID=1 and requests that URL from your web server. The web server then has to render that page for the bot.
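The rewrite in the second step can be sketched in plain Java. This is only an illustration of what the crawler does on its side, not code you need in your app; the class and method names are made up, and the set of escaped characters (%00..20, %23, %25..26, %2B, %7F..FF) is taken from my reading of Google's AJAX crawling scheme, so treat it as an assumption.

```java
import java.nio.charset.StandardCharsets;

public class EscapedFragmentUrl {

    // Percent-encode the bytes the AJAX crawling scheme is assumed to escape:
    // %00..%20, %23 (#), %25 (%), %26 (&), %2B (+) and %7F..%FF.
    public static String escapeFragment(String fragment) {
        StringBuilder out = new StringBuilder();
        for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c <= 0x20 || c == 0x23 || c == 0x25 || c == 0x26 || c == 0x2B || c >= 0x7F) {
                out.append(String.format("%%%02X", c));
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    // Rewrite "base#!fragment" as "base?_escaped_fragment_=fragment".
    // URLs without "#!" are returned unchanged.
    public static String toCrawlerUrl(String prettyUrl) {
        int i = prettyUrl.indexOf("#!");
        if (i < 0) {
            return prettyUrl;
        }
        String base = prettyUrl.substring(0, i);
        String fragment = prettyUrl.substring(i + 2);
        // Append to an existing query string if there is one.
        String separator = base.contains("?") ? "&" : "?";
        return base + separator + "_escaped_fragment_=" + escapeFragment(fragment);
    }

    public static void main(String[] args) {
        System.out.println(toCrawlerUrl("http://myDomain.com#!article;articleID=1"));
    }
}
```

Note that ";" and "=" are not in the escaped set, so under this assumption the fragment article;articleID=1 arrives at the server unchanged.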

So if we have a static page for myDomain.com?_escaped_fragment_=article;articleID=1, the bot can read and index the content of that page.

But the articles are dynamic, because a user could request myDomain.com?_escaped_fragment_=article;articleID=2 or ...article=3..., and we can't manually create a static page for each article.

So the solution is HtmlUnit.

A tool like HtmlUnit will dynamically render myDomain.com?_escaped_fragment_=article;articleID=2 into a static HTML page that the bot can read.
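Before HtmlUnit can render anything, the server has to turn the crawler's request back into the original #! URL. A minimal sketch of that step, using only the JDK, might look like this. The class name and method are hypothetical, it works on a raw query string rather than a servlet request, and it assumes _escaped_fragment_ is the last query parameter; the actual rendering with HtmlUnit is not shown.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PrettyUrlRebuilder {

    private static final String PARAM = "_escaped_fragment_=";

    // Returns the "#!" URL for a crawler request, or null when the query
    // string carries no _escaped_fragment_ parameter (a normal browser request).
    public static String toPrettyUrl(String baseUrl, String queryString) {
        if (queryString == null) {
            return null;
        }
        int i = queryString.indexOf(PARAM);
        if (i < 0) {
            return null;
        }
        // Keep any ordinary parameters that precede _escaped_fragment_.
        String otherParams = queryString.substring(0, i);
        if (otherParams.endsWith("&")) {
            otherParams = otherParams.substring(0, otherParams.length() - 1);
        }
        String fragment;
        try {
            // Undo the percent-escaping applied by the crawler.
            fragment = URLDecoder.decode(queryString.substring(i + PARAM.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return baseUrl + (otherParams.isEmpty() ? "" : "?" + otherParams) + "#!" + fragment;
    }

    public static void main(String[] args) {
        System.out.println(toPrettyUrl("http://myDomain.com", "_escaped_fragment_=article;articleID=2"));
    }
}
```

A servlet filter would typically call something like this, load the returned URL in a headless browser, and write the rendered HTML back to the bot; requests for which it returns null just continue down the filter chain.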

But I can't find step-by-step guidelines on how to set HtmlUnit up.

Is HtmlUnit just a jar file that we put into lib? What should we do next?

I used the solution from Patrick, compiled it, and put it into the webapps directory of Tomcat 7, but got this error in the browser:

type Exception report

message Filter execution threw an exception

description The server encountered an internal error that prevented it from fulfilling this request.

exception

javax.servlet.ServletException: Filter execution threw an exception
    com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
    com.gwtplatform.dispatch.server.AbstractHttpSessionSecurityCookieFilter.doFilter(AbstractHttpSessionSecurityCookieFilter.java:67)
    com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
root cause

java.lang.NoClassDefFoundError: org/w3c/css/sac/ErrorHandler
    myproject.server.CrawlFilter.doFilter(CrawlFilter.java:101)
    com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
    com.gwtplatform.dispatch.server.AbstractHttpSessionSecurityCookieFilter.doFilter(AbstractHttpSessionSecurityCookieFilter.java:67)
    com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
root cause

java.lang.ClassNotFoundException: org.w3c.css.sac.ErrorHandler
    org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1720)
    org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1571)
    myproject.server.CrawlFilter.doFilter(CrawlFilter.java:101)
    com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
    com.gwtplatform.dispatch.server.AbstractHttpSessionSecurityCookieFilter.doFilter(AbstractHttpSessionSecurityCookieFilter.java:67)
    com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
note The full stack trace of the root cause is available in the Apache Tomcat/7.0.53 logs.
Tum

1 Answer

See this working SO example, inspired by this gwt-platform example and this Google documentation, among others.

You set up HtmlUnit on the server side like any other library, by adding its jars to your lib folder. As far as I recall, that's all there is to it.
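If the webapp then fails with a NoClassDefFoundError like the org/w3c/css/sac/ErrorHandler one in the question, one of HtmlUnit's dependency jars is probably missing from the lib folder. A small check like the following can list which classes cannot be loaded. It is only a sketch: the class name ClasspathCheck is made up, and the two class names probed in main are just the one from the stack trace plus HtmlUnit's WebClient, not a complete list of what HtmlUnit needs.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathCheck {

    // Returns the subset of the given class names that cannot be loaded
    // by the class loader that loaded this class.
    public static List<String> findMissing(String... classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                // "false" avoids running static initializers; we only test presence.
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("Missing: " + findMissing(
                "org.w3c.css.sac.ErrorHandler",
                "com.gargoylesoftware.htmlunit.WebClient"));
    }
}
```

Run from inside the webapp (for example from the filter's init method), this reports what the webapp class loader actually sees, which can differ from the classpath used at compile time.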

Patrick
  • Do you use htmlunit-r5662-gae.jar? Why don't I have the methods .setThrowExceptionOnScriptError, .setJavaScriptEnabled and .pumpEventLoop? – Tum May 16 '14 at 22:44
  • It couldn't run because of [WARN] FAILED CrawlFilter: java.lang.InstantiationException: myapp.server.CrawlFilter – Tum May 16 '14 at 23:02
  • I used BrowserVersion.FIREFOX_17, maybe that is the problem. – Tum May 16 '14 at 23:22
  • Finally it runs without errors, but it still shows the host page. Why? Do you have that problem? – Tum May 17 '14 at 01:06
  • I use htmlunit-2.12.jar but I see the latest release is 2.14. Make sure to have in your classpath all the lib dependencies found in the htmlunit download lib directory. What do you mean by "showing the hostpage"? If your webapp link is myDomain.com#!article;articleID=1 then your static crawler link should be something like myDomain.com?_escaped_fragment_=article;articleID=1 and you should see the same page as static html so without the dynamic javascript capabilities. Not sure whether you'd need to encode in your crawler link ";" or/and "=" as "%3B" and "%3D". – Patrick May 17 '14 at 03:51
  • No, it showed the content of the host page. Please read this question: http://stackoverflow.com/questions/23697516/why-htmlunit-always-shows-the-hostpage-no-matter-what-url-i-type-in-crawlable-g – Tum May 17 '14 at 04:22
  • By the way, my app uses the GWTP platform, and maybe that is the main reason. HtmlUnit may not be able to parse a GWTP page. Is your app built on GWTP? – Tum May 17 '14 at 04:24
  • The URL is very simple, mydomain.com#!article, so nothing needs to be encoded. It actually showed a result, so HtmlUnit works, but not properly. – Tum May 17 '14 at 04:27
  • I do not use the GWTP platform, so I am not sure what could go wrong with that one. I am also not clear about what problem you are having: can you describe what "HtmlUnit works but not properly" means? I posted an answer to your other question http://stackoverflow.com/a/23712812/1143684. So perhaps the issues I pointed out there are causing your grief? – Patrick May 17 '14 at 15:09