crawler4j does not recognize all links on a page

Question

I am facing a problem where crawler4j does not recognize all the links on a page.

For example, if there are 5 links on the page, only 3 of them are recognized and fetched. The other 2 are not recognized at all.

What is the expected output? What do you see instead? All links on a page should be recognized so that they can be fetched.

What version of the product are you using? crawler4j 4.1

Please provide any additional information below. The only difference I found is that the links that are not recognized contain angle brackets in the href value.

Example:

<a title="some text" href="http://www.example.com/abc/xyz-<sometext>-abc-xyz/abc_xyz" >some text</a>

score 0 · Answer 1 · answered Aug 24 '15 at 14:08

Yes, it seems like a bug in the crawler4j page parser.

It finds the tag and then searches for the closing bracket. I assume that is the failure point: the '>' inside the href value is presumably mistaken for the end of the tag.
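To make the suspected failure mode concrete, here is a minimal sketch in Java. It is not crawler4j's actual parser code; the class TagScanDemo and its methods are hypothetical names made up for illustration. It only contrasts a scan that stops at the first '>' with one that skips '>' characters inside quoted attribute values.

```java
// Hypothetical illustration only: this is NOT crawler4j's actual parser code.
final class TagScanDemo {

    /** Naive scan: treats the first '>' after "<a " as the end of the tag. */
    static String naiveHref(String html) {
        int start = html.indexOf("<a ");
        if (start < 0) return null;
        int end = html.indexOf('>', start);
        if (end < 0) return null;
        return extractHref(html.substring(start, end));
    }

    /** Quote-aware scan: ignores '>' characters inside quoted attribute values. */
    static String quoteAwareHref(String html) {
        int start = html.indexOf("<a ");
        if (start < 0) return null;
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;      // closing quote found
            } else if (c == '"' || c == '\'') {
                quote = c;                      // opening quote found
            } else if (c == '>') {
                return extractHref(html.substring(start, i));
            }
        }
        return null;
    }

    /** Returns the double-quoted href value, or null if it is missing or unterminated. */
    private static String extractHref(String tag) {
        String key = "href=\"";
        int from = tag.indexOf(key);
        if (from < 0) return null;
        from += key.length();
        int to = tag.indexOf('"', from);
        return to < 0 ? null : tag.substring(from, to);
    }
}
```

With the anchor tag from the question, naiveHref cuts the tag off in the middle of the href value and finds no complete link, while quoteAwareHref returns the full URL including the angle brackets. If the real parser behaves like the naive variant, that would match the symptom described in the question.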

Please submit an issue on the new crawler4j site on GitHub: https://github.com/yasserg/crawler4j/issues

Thanks
