Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
0
votes
1 answer

Convert a basic crawler4j crawler to a focused crawler

I've implemented a basic crawler that retrieves data from seed URLs and is able to download the pages. Further, I am able to keep my crawler on the same seed website until the specified depth is reached. How can I impose more restrictions on my…
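A focused crawler typically adds a content-relevance check on top of the depth limit. A minimal sketch, assuming crawler4j 4.x (older releases use shouldVisit(WebURL) without the referring page); the topic terms and seed host are placeholders:

    import java.util.Arrays;
    import java.util.List;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class FocusedCrawler extends WebCrawler {

        // Hypothetical topic terms; substitute your own focus vocabulary.
        private static final List<String> TOPIC_TERMS =
                Arrays.asList("computer science", "machine learning");

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Keep the existing restriction: stay on the seed host
            // (placeholder); the depth limit is set on CrawlConfig.
            return url.getURL().startsWith("https://example.com/");
        }

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                String text = ((HtmlParseData) page.getParseData()).getText().toLowerCase();
                // Focused part: only keep pages whose text matches the topic.
                if (TOPIC_TERMS.stream().anyMatch(text::contains)) {
                    System.out.println("Relevant: " + page.getWebURL().getURL());
                }
            }
        }
    }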
0
votes
0 answers

How to crawl latest articles in a specific domain using a specific set of websites?

I'm interested in building a program to get all the latest articles in a specific domain ("computer science") from a specific set of websites ("ScienceDirect", for example). As you know, some websites publish a page for each research article, such as:…
AmirHJ
  • 827
  • 1
  • 11
  • 21
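One way to restrict a crawl to a fixed set of publishers is a whitelist in shouldVisit. A sketch, assuming crawler4j 4.x; the site prefix is a placeholder for whatever sources you track:

    import java.util.Arrays;
    import java.util.List;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class SiteListCrawler extends WebCrawler {

        // Placeholder prefixes; one entry per website you want to follow.
        private static final List<String> SITE_PREFIXES =
                Arrays.asList("https://www.sciencedirect.com/");

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String target = url.getURL().toLowerCase();
            return SITE_PREFIXES.stream().anyMatch(target::startsWith);
        }
    }

Deciding which articles are "latest" still has to happen in visit(), e.g. by parsing the publication date out of each page.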
0
votes
1 answer

Parse a page (partly generated by JavaScript) by using Selenium

I've got a problem: I want to parse a page (e.g. this one) to collect information about the offered apps and save this information into a database. Moreover, I am using crawler4j for visiting every (available) page. But the problem - as I can see -…
Hisushi
  • 67
  • 1
  • 11
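crawler4j only sees the HTML the server returns, so JavaScript-generated content is missing from the parse data. A common workaround is to re-fetch the page with Selenium inside visit() so the scripts run first. A sketch, assuming Selenium WebDriver with a chromedriver on the PATH:

    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;

    public class SeleniumAssistedCrawler extends WebCrawler {

        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL();
            WebDriver driver = new ChromeDriver();
            try {
                // Load the page in a real browser so JavaScript executes.
                driver.get(url);
                String renderedHtml = driver.getPageSource();
                // Parse renderedHtml (e.g. with jsoup) and write to the database.
            } finally {
                driver.quit();
            }
        }
    }

Starting a browser per page is slow; in practice you would keep one WebDriver per crawler thread.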
0
votes
1 answer

Check HTTP Status for jpg files using jsoup

I am getting HTTP status codes for URLs using jsoup as follows: Connection.Response response = null Document doc = Jsoup.connect(url).ignoreContentType(true).get() response = Jsoup.connect(url) …
clever_bassi
  • 2,392
  • 2
  • 24
  • 43
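For image URLs the trick is to call execute() rather than get(), since get() tries to parse the response body as an HTML Document. A sketch using plain jsoup:

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;

    public class StatusCheck {
        public static int statusOf(String url) throws java.io.IOException {
            Connection.Response response = Jsoup.connect(url)
                    .ignoreContentType(true)  // accept image/jpeg and other non-HTML bodies
                    .ignoreHttpErrors(true)   // report 4xx/5xx codes instead of throwing
                    .execute();
            return response.statusCode();
        }
    }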
0
votes
1 answer

Get seed of URL in crawler4j visit()

How do I get the seed a page came from inside crawler4j's visit() function? So far I have the URL of the page, but I can't figure out which seed led there. public void visit(Page page) { String url =…
pinpox
  • 179
  • 2
  • 10
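crawler4j does not hand the seed to visit() directly, but each WebURL carries its parent docid, so the seed can be propagated as the crawl proceeds. A sketch; the sentinel value for "no parent" is an assumption and should be verified against your crawler4j version:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;

    public class SeedTrackingCrawler extends WebCrawler {

        // docid -> docid of the seed it descends from, shared across threads.
        private static final Map<Integer, Integer> SEED_OF = new ConcurrentHashMap<>();

        @Override
        public void visit(Page page) {
            int docid = page.getWebURL().getDocid();
            int parent = page.getWebURL().getParentDocid();
            // Seeds have no parent docid (assumed <= 0 here); other pages
            // inherit the seed recorded for their parent.
            int seed = (parent <= 0) ? docid : SEED_OF.getOrDefault(parent, parent);
            SEED_OF.put(docid, seed);
            System.out.println(page.getWebURL().getURL() + " <- seed docid " + seed);
        }
    }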
0
votes
1 answer

Why is crawler4j hanging randomly?

I've been using crawler4j for a few months now. I recently started noticing that it hangs on some sites and never returns. The recommended solution is to set resumable to true. This is not an option for me as I am limited on space. I ran…
Salim
  • 199
  • 3
  • 18
0
votes
1 answer

Get Http status using crawler4j & Jsoup

I am creating a Groovy & Grails app using MongoDB in the backend. I am using crawler4j for crawling and JSoup for parsing. I need to get the HTTP status of a URL and save it to the database. I am trying the following: @Override void…
clever_bassi
  • 2,392
  • 2
  • 24
  • 43
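WebCrawler has a hook that fires for every fetched URL before parsing, which is a natural place to record the status. A sketch, assuming crawler4j 4.x; replace the println with your database write:

    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class StatusRecordingCrawler extends WebCrawler {

        @Override
        protected void handlePageStatusCode(WebURL webUrl, int statusCode,
                                            String statusDescription) {
            // Called for every fetched URL, including non-200 responses.
            System.out.println(webUrl.getURL() + " -> " + statusCode);
        }
    }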
0
votes
1 answer

"Operation not allowed after ResultSet closed" with Datasource and crawler4j

After reading through a lot of similar questions I have not been able to find a solution that works for me. I have these methods: In a crawler4j Controller I do this: ArrayList urls = Urls.getURLs(100); for (String s : urls) { …
pinpox
  • 179
  • 2
  • 10
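That error usually means something reads the ResultSet after its statement or connection has been closed, which is easy to do when crawler threads share JDBC objects. The safe pattern is to copy the rows into a plain list before the crawl starts. A sketch; the table and column names are hypothetical:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;
    import javax.sql.DataSource;

    public class Urls {
        // Materialize the rows so nothing touches the ResultSet after close.
        public static List<String> getURLs(DataSource ds, int limit) throws SQLException {
            List<String> urls = new ArrayList<>();
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement("SELECT url FROM urls LIMIT ?")) {
                ps.setInt(1, limit);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        urls.add(rs.getString(1));
                    }
                }
            }
            return urls;
        }
    }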
0
votes
1 answer

Crawler4j not working for https urls

I am developing a Grails app using crawler4j. I know this is an old question, and I came across this solution here. I tried the solution provided but am not sure where to put the other fetcher and mockssl Java files. Also, I am not sure how…
clever_bassi
  • 2,392
  • 2
  • 24
  • 43
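In later crawler4j releases the custom-fetcher workaround is usually unnecessary, because HTTPS support is just a CrawlConfig flag. A sketch (the storage path is a placeholder); if you do use the linked custom fetcher instead, its Java files simply go into your own source tree and the fetcher instance is passed to the CrawlController in place of the default PageFetcher:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawl");  // placeholder path
    config.setIncludeHttpsPages(true);           // allow https:// seeds and links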
0
votes
1 answer

Crawler4j: calculate the depth of a page

I am developing a web crawler using Groovy & Grails and MongoDB. Is there any way to calculate the depth of a page using crawler4j? I know I can limit to what depth I want to crawl, but I haven't come across anything that suggests how to calculate depth…
clever_bassi
  • 2,392
  • 2
  • 24
  • 43
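crawler4j already tracks depth on each WebURL, so no calculation is needed. A minimal sketch:

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;

    public class DepthAwareCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            // Seeds are depth 0, their outgoing links depth 1, and so on.
            short depth = page.getWebURL().getDepth();
            System.out.println("depth " + depth + ": " + page.getWebURL().getURL());
        }
    }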
0
votes
1 answer

Implementing Crawler4j with Selenium in Java doesn't work

I'm trying to use Crawler4j simultaneously with Selenium for some website testing. After a webpage is crawled, Selenium should immediately start a test with the parameters it got from the crawler, such as the URL it should open or the IDs of…
juzwani
  • 53
  • 2
  • 7
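One way to decouple the two tools is a producer-consumer hand-off: the crawler queues URLs and a separate Selenium thread tests them. A sketch, assuming Selenium WebDriver and crawler4j's non-blocking start:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;

    public class HandOffCrawler extends WebCrawler {

        // Crawler threads produce URLs; the Selenium worker consumes them.
        public static final BlockingQueue<String> TO_TEST = new LinkedBlockingQueue<>();

        @Override
        public void visit(Page page) {
            TO_TEST.offer(page.getWebURL().getURL());
        }

        // Run on its own thread next to controller.startNonBlocking(...).
        public static void seleniumWorker() throws InterruptedException {
            WebDriver driver = new ChromeDriver();
            try {
                while (true) {
                    driver.get(TO_TEST.take());
                    // ... run the test assertions against the loaded page ...
                }
            } finally {
                driver.quit();
            }
        }
    }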
0
votes
1 answer

Crawler4j: crawl jQuery live content

I have a website, but on its category page the product list is generated via JavaScript after the page loads, so my crawler visits it and can't find any products. How can I solve that problem? CrawlConfig config = new CrawlConfig(); …
Muhammet Arslan
  • 975
  • 1
  • 9
  • 33
0
votes
1 answer

Running crawler4j on multiple computers | different instances | Root Folder Lock

I'm trying to implement a crawler using crawler4j. It runs fine as long as: I run only one copy of it, and I run it continuously without restarting. If I restart the crawler, the URLs collected are not unique, because the crawler locks the root…
Lavneet
  • 516
  • 5
  • 19
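The lock comes from the Berkeley DB environment under the crawl storage folder, so two instances can never share one. Giving each instance its own folder and enabling resumable crawling keeps the collected URLs unique across restarts. A sketch; the path and the instance property are placeholders:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;

    CrawlConfig config = new CrawlConfig();
    String instanceId = System.getProperty("crawler.instance", "1");  // hypothetical property
    config.setCrawlStorageFolder("/data/crawl/instance-" + instanceId);
    config.setResumableCrawling(true);  // keep the frontier across restarts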
0
votes
1 answer

Crawler4j Stops Silently

In my application I am using crawler4j. Though the application is big, I have even tested the code with the sample code given here: https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/ The problem is, it works…
akshayb
  • 1,219
  • 2
  • 18
  • 44
0
votes
1 answer

Java: The return type is incompatible with WebCrawler.visit(Page)

I'm using some crawler code from http://code.google.com/p/crawler4j/. Now, what I'm trying to do is to access every URL found in the MyCrawler class from another class. I start the crawler with: // * Start the crawl. This is a blocking operation,…
PinkPanties
  • 35
  • 1
  • 5
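That compiler error means the visit(Page) override declares a non-void return type; visit() must return void, so results have to be stashed somewhere the calling class can read them. A sketch; since controller.start(...) blocks, the list is complete when it returns:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;

    public class MyCrawler extends WebCrawler {

        // Thread-safe store the calling class can read after the crawl.
        public static final List<String> FOUND_URLS = new CopyOnWriteArrayList<>();

        @Override
        public void visit(Page page) {
            FOUND_URLS.add(page.getWebURL().getURL());
        }
    }

crawler4j also offers getMyLocalData()/getCrawlersLocalData() for collecting per-crawler results, which avoids the static field.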