
I installed Apache Nutch 2.3.1, Solr 6.5.1, and MongoDB 3.4.7. After crawling URLs that contain many images, neither Solr nor MongoDB contains any images or videos. I edited the regex-urlfilter.txt file in Apache Nutch and deleted the suffixes related to images (.png, .jpeg, .gif, ...). I then edited the suffix-urlfilter.txt file and commented out jpeg, gif, and png as well.
Even after these changes, Apache Nutch doesn't crawl images. How can I crawl images and see them in Solr? From what I have read, I understand that I need to create plugins. Is my impression correct?

Sajjad Rostami

2 Answers

Nutch supports several formats: plain text, HTML/XHTML+XML, XML, MS Office files, Adobe PDF, RSS, RTF, and MP3. Unfortunately, there is no support for any sort of image file. Apart from this, I'm curious: what do you want to index from an image file?

Mysterion

If I understand your question, what you want to accomplish is extracting all the metadata from the images and indexing only that in Solr, right?

If Nutch is not even fetching your images, then it is most likely that one of the URL filters is excluding those URLs from being fetched (check the logs). You need to describe your changes to the different files; otherwise it will be impossible to help you.
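
To narrow down which rule is rejecting an image URL, it can help to replay the rules outside of Nutch. The following is only a rough sketch in Python, not Nutch code: it assumes that regex-urlfilter.txt rules are evaluated top to bottom, that the first pattern found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The rule set in `SAMPLE_RULES` and the helper names `load_rules` and `first_match` are made up for illustration, and Python's `re` is not identical to Java's regex engine.

```python
import re


def load_rules(text):
    """Parse regex-urlfilter-style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def first_match(url, rules):
    """Return (accepted, pattern_text) for the first rule found in the URL.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept, pattern.pattern
    return False, None


# Hypothetical rule set for illustration, not the file shipped with Nutch.
SAMPLE_RULES = r"""
# skip image and asset suffixes
-\.(gif|jpg|jpeg|png|css|js)$
# accept anything else
+.
"""

if __name__ == "__main__":
    rules = load_rules(SAMPLE_RULES)
    for url in ("http://example.com/photo.png", "http://example.com/index.html"):
        print(url, first_match(url, rules))
```

Pasting the contents of your own filter file in place of `SAMPLE_RULES` shows which line, if any, still rejects a given image URL. Remember that other filters (such as the suffix filter) may also be active.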

Now, back to the original question: if you want to index only image URLs (along with their metadata), then you need to filter what you index into Solr. Unfortunately, Nutch 2.3 doesn't offer this feature out of the box. In Nutch 1.x you could use mimetype-filter, which lets you specify what to index into Solr/ES depending on the MIME type of the URL. My suggestion is to use Nutch 1.x unless you have a very good reason to use Nutch 2.x. Otherwise, you could port the mimetype-filter plugin to 2.x or write your own IndexingFilter that implements your own logic.
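
The decision such a filter makes is small. Here is a minimal sketch of the logic in Python, just to show the idea; it is not the mimetype-filter plugin and not the Nutch IndexingFilter API (a real plugin would be written in Java). The document is modelled as a plain dict, and the `type` field name, the `filter_document` function, and the `allowed_prefixes` parameter are all assumptions for illustration. Returning `None` stands for "do not index this document".

```python
def filter_document(doc, allowed_prefixes=("image/",)):
    """Return doc if its MIME type is allowed, otherwise None (skip it).

    doc is a plain dict; the "type" key holding the MIME type is a
    hypothetical field name chosen for this sketch.
    """
    mime = (doc.get("type") or "").lower()
    if any(mime.startswith(prefix) for prefix in allowed_prefixes):
        return doc
    return None


if __name__ == "__main__":
    docs = [
        {"url": "http://example.com/a.png", "type": "image/png"},
        {"url": "http://example.com/index.html", "type": "text/html"},
    ]
    kept = [d["url"] for d in docs if filter_document(d) is not None]
    print(kept)
```

Matching on a prefix such as `image/` keeps the rule short; an explicit list of full MIME types would be stricter if only a few formats matter.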

Keep in mind that the information you'll get in Solr is limited to what Tika can extract from the image file (its metadata), which is usually not very well curated.

Jorge Luis
  • Thanks for your reply. Actually, I want to crawl and save data such as images, videos, text, and other formats separately in MongoDB, and then do my image processing and text mining on the extracted data. But now, after crawling a URL, I see only parts of the text in MongoDB. – Sajjad Rostami Dec 05 '17 at 23:45
  • After a lot of searching these past days, and based on your suggestion, I understand that the mimetype plugin may be my solution. I want to test this plugin too, but I don't know how to add it to my installed Apache Nutch. I cannot find a step-by-step tutorial. Another thing on my mind: could Taika be my solution? What is it? And what is the difference between Taika and the mimetype plugin? – Sajjad Rostami Dec 05 '17 at 23:45
  • If you're referring to Tika, Nutch already uses Tika to extract the metadata that I mentioned in my answer. Since you're using Nutch 2.3.1, the `mimetype-plugin` is not available for this version of Nutch. So you can either start using Nutch 1.x or try to port the plugin to Nutch 2.x. – Jorge Luis Dec 06 '17 at 15:31
  • Thanks for the response. I still have trouble storing images in MongoDB using Apache Nutch. As I understand it, I have to create a plugin to crawl images. Do you know of a standard image plugin for Apache Nutch? – Sajjad Rostami Dec 09 '17 at 00:08
  • Actually, I want to store text in MongoDB, but after the crawl I see only many links instead of text. Is there any way to store the content of the links directly in MongoDB? – Sajjad Rostami Dec 09 '17 at 00:11