Crawl Image using Apache Nutch

Question

I installed Apache Nutch 2.3.1, Solr 6.5.1, and MongoDB 3.4.7. After crawling URLs that contain many images, there are no images or videos in Solr or MongoDB. I edited the regex-urlfilter.txt file in Apache Nutch and deleted the suffixes related to images (.png, .jpeg, .gif, ...). I then edited the suffix-urlfilter.txt file and commented out jpeg, gif, and png as well.
Even after these changes, Apache Nutch doesn't crawl images. How can I crawl images and see them in Solr? From what I have read, I understand that I should create plug-ins. Is my impression correct?

score 0 · Answer 1 · answered Dec 03 '17 at 13:28


Nutch supports several formats: plain text, HTML/XHTML+XML, XML, MS Office files, Adobe PDF, RSS, RTF, and MP3. Unfortunately, there is no support for any sort of image files. Apart from this, I'm curious: what do you want to index from an image file?


Mysterion


Thanks for your response. Actually, I want to get all of the images that are inside a specific URL. So, is there any solution for crawling images? – Sajjad Rostami Dec 03 '17 at 14:50
Nutch is an indexing tool. What do you want to index from an image? – Mysterion Dec 04 '17 at 07:56
I just want to crawl images and build a big data set for image processing. I want to use Apache Nutch instead of downloading the images one by one! – Sajjad Rostami Dec 06 '17 at 00:00
How did you solve your problem? Have you used Nutch to download images? Could you give a few details? – Hafiz Muhammad Shafiq Aug 18 '20 at 10:34

score 0 · Answer 2 · answered Dec 04 '17 at 10:12


If I understand your question, what you want to accomplish is extracting all the metadata from the images and indexing only this in Solr, right?

If Nutch is not even fetching your images, then most likely one of the URL filters is excluding those URLs from being fetched (check the logs). You need to describe your changes to the different files; otherwise it will be impossible to help you.
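To reason about which rule might be rejecting an image URL, here is a minimal Python sketch of how rules in the style of regex-urlfilter.txt are commonly understood to work: each rule starts with `+` (accept) or `-` (reject), the first rule whose pattern matches decides, and a URL that matches no rule is rejected. This is a simplified model, not Nutch's own code; `SAMPLE_RULES`, `parse_rules`, and `is_accepted` are hypothetical names, and the rules shown are illustrative rather than the contents of your file.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt (an assumption,
# not a copy of any real file). '+' accepts a match, '-' rejects it.
SAMPLE_RULES = r"""
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# skip some non-image suffixes (image suffixes deliberately left out)
-\.(css|zip|exe|gz)$
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_accepted(url, rules):
    """First matching rule wins; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    for url in ("http://example.com/photo.png", "http://example.com/archive.zip"):
        print(url, "->", "accepted" if is_accepted(url, rules) else "rejected")
```

Pasting your own rules into `SAMPLE_RULES` and testing one image URL can show whether a filter rule is the culprit before you dig through the logs.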

Now, back to the original question: if you want to index only image URLs (along with the metadata), then you need to filter what you index into Solr. Unfortunately, Nutch 2.3 doesn't offer this feature out of the box. In Nutch 1.x you could use the mimetype-filter plugin, which lets you specify what you want to index into Solr/ES depending on the MIME type of the URL. My suggestion is to use Nutch 1.x unless you have a very good reason to use Nutch 2.x. Otherwise, you could port the mimetype-filter plugin to 2.x or write your own IndexingFilter that implements your own logic.
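The filtering idea itself is small. The following Python sketch only illustrates the decision logic (keep a document if its MIME type is on an allow-list, skip it otherwise); it is not the mimetype-filter plugin and not Nutch's API, where the equivalent would be written in Java. `ALLOWED_PREFIXES`, `filter_document`, and the `content_type` field are hypothetical names chosen for this example.

```python
from typing import Optional

# Hypothetical allow-list: index only documents whose MIME type is an image.
ALLOWED_PREFIXES = ("image/",)


def filter_document(doc: dict) -> Optional[dict]:
    """Return the document if its MIME type is allowed, else None (skip it)."""
    mime = doc.get("content_type", "").lower()
    if mime.startswith(ALLOWED_PREFIXES):
        return doc
    return None


if __name__ == "__main__":
    docs = [
        {"url": "http://example.com/photo.png", "content_type": "image/png"},
        {"url": "http://example.com/index.html", "content_type": "text/html"},
    ]
    # Keep only the documents that pass the filter.
    kept = [d for d in docs if filter_document(d) is not None]
    print([d["url"] for d in kept])
```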

Keep in mind that the information you'll get in Solr is limited to what Tika can extract from the image file (metadata), which is usually not very well curated.


Jorge Luis


Thanks for your reply. Actually, I want to crawl and save data such as images, videos, text, and other formats separately in MongoDB, and then do image processing and text mining on the extracted data. But now, after I crawl a URL, I only see parts of the text in MongoDB. – Sajjad Rostami Dec 05 '17 at 23:45
After a lot of searching these days, and following your suggestion, I understand that the mimetype plugin may be my solution. I want to test this plugin too, but I don't know how to add it to my installed Apache Nutch. I cannot find a step-by-step tutorial. Another thing on my mind: could Taika be my solution? What is it? And what is the difference between Taika and the mimetype plugin? – Sajjad Rostami Dec 05 '17 at 23:45
If you're referring to Tika, Nutch already uses Tika to extract the metadata that I mentioned in my answer. Since you're using Nutch 2.3.1, the `mimetype-filter` plugin is not available for this version of Nutch. So you can either start using Nutch 1.x or try to port the plugin to Nutch 2.x. – Jorge Luis Dec 06 '17 at 15:31
Thanks for the response. I still have trouble storing images in MongoDB using Apache Nutch. As I understand it, I have to create a plugin to crawl images. Do you know of a standard image plugin for Apache Nutch? – Sajjad Rostami Dec 09 '17 at 00:08
Actually, I also want to store text in MongoDB, but after crawling I only see many links instead of text. Is there any way to store the content of the links directly in MongoDB? – Sajjad Rostami Dec 09 '17 at 00:11
