Questions tagged [crawler4j]

Crawler4j is an open-source Java web crawler that provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
0
votes
1 answer

collect only relevant links from url

What I need is to collect the relevant links from a URL. For example, from a link like http://beechplane.wordpress.com/, I need to collect the links that contain the actual articles, i.e., links like…
Dinoop Nair
  • 2,663
  • 6
  • 31
  • 51
0
votes
0 answers

Quartz scheduler + crawler4J http connection error

I'm trying to combine the Quartz scheduler with crawler4j. The problem is that when I execute the C4J code in a main method it works well, but in the Quartz Job execute() method there is an HTTP connection error. We are working behind a proxy but it's…
strategesim
  • 327
  • 2
  • 3
  • 13
0
votes
3 answers

crawl https pages with crawler4j

For months we have used crawler4j to crawl an HTTPS site. Suddenly, since last Friday, we are no longer able to crawl the very same HTTPS site. Has something changed in the HTTPS protocol? The site is https://enot.publicprocurement.be/enot-war/home.do As a…
Heinz Uller
  • 33
  • 2
  • 5
0
votes
1 answer

Crawler4j ImageCrawler String args

I'm trying to start the crawler4j example from: crawler4j. When I start the ImageCrawlController, I already fail at the first step, args.length < 3, because it's 0. How can I make sure that args is bigger than 3? public class ImageCrawlController { …
csnewb
  • 1,190
  • 2
  • 19
  • 37
0
votes
1 answer

Calling Controller.Start in loop in Crawler4j?

I asked a question here, but this is a different question that sounds similar. Using crawler4j, I want to crawl multiple seed URLs with a restriction on the domain name (that is, a domain name check in shouldVisit). Here is an example of how to do it.…
akshayb
  • 1,219
  • 2
  • 18
  • 44
0
votes
1 answer

How to fix error "Failed to load Main-Class manifest from ..."

I downloaded crawler4j from [https://code.google.com/p/crawler4j/downloads/detail?name=crawler4j-3.5.zip&can=2&q=] and saved it to my desktop. When I run crawler4j-3.5.jar, an error is displayed: "Failed to load Main-Class manifest from ..." How can I fix…
MP3
  • 13
  • 2
  • 7
0
votes
1 answer

What is a .lck file and why can't I read it with a buffered reader?

I'm trying to use crawler4j to crawl websites. I was able to follow the instructions on the crawler4j website. When it is done it creates a folder with two different .lck files, one .jdb file and one .info.0 file. I tried to read in the file using…
j.jerrod.taylor
  • 1,120
  • 1
  • 13
  • 33
0
votes
2 answers

Some information about pattern matching in a Java web crawler using the crawler4j library

I want to implement a very simple web crawler using Java and I have found this library: crawler4j: http://code.google.com/p/crawler4j/ I need a crawler that does the following: start from a URL (specified by me) and recognize whether in the current…
AndreaNobili
  • 40,955
  • 107
  • 324
  • 596
0
votes
1 answer

How to run crawler4j.jar with MyCrawler.java Controller.java files

I am new to crawlers and I want to run my first crawler program. I have three files: Crawler4j.jar, Mycrawler.java, and Controller.java. When I enter javac -cp crawler4j-3.1.jar MyCrawler.java Controller.java at the terminal, I get the following…
0
votes
3 answers

Erroneous tree type in java

I am trying to run the following code for BasicCrawlController in Java but I get an error: /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this…
orezvani
  • 3,595
  • 8
  • 43
  • 57
0
votes
2 answers

Accessing .lck and jdb files stored via web crawler

I'm currently using crawler4j as my web crawler of choice, and I am trying to teach myself how web crawlers work. I've started the crawl and I expected it to quickly return the crawled data at crawlStorageFolder (/data/crawl/root) seen below public…
Octavius
  • 583
  • 5
  • 19
0
votes
3 answers

Determining parameters on crawler4j

I am trying to use crawler4j like it was shown to be used in this example and no matter how I define the number of crawlers or change the root folder I continue to get this error from the code stating: "Needed parameters: rootFolder (it will…
Octavius
  • 583
  • 5
  • 19
0
votes
1 answer

Java - Eclipse - The declared package "edu.uci.ics.crawler4j.examples.basic" does not match the expected package ""

I am trying to set up the example code for Crawler4j, but Eclipse is throwing an error that I don't understand. The error is: The declared package "edu.uci.ics.crawler4j.examples.basic" does not match the expected package "" The path…
Crayl
  • 1,883
  • 7
  • 27
  • 43
0
votes
2 answers

Selectively disable log4j debug log in Play console

I have a Play 2.0 app and ran play console from the command line. One of the libraries I use relies on log4j and started streaming debug output for crawler4j. I'm trying to figure out how to selectively disable that output in the play…
Bob
  • 8,424
  • 17
  • 72
  • 110
0
votes
2 answers

Controlling the list of URL(s) to be crawled at runtime

In crawler4j we can override the function boolean shouldVisit(WebURL url) and control whether a particular URL is allowed to be crawled by returning 'true' or 'false'. But can we add URL(s) at runtime? If yes, what are the ways to do that…
user801154