I am using Nutch to crawl pages into a Solr index. To mark which documents have been read, I added a boolean field called "read_flag" to schema.xml, with a default value of false. When a user reads a document, the application sends a Solr update query that sets its read_flag to true. On the application side, I run a Solr query for all documents whose "read_flag" is false, to list the documents that have not been read yet. I also defined url as the uniqueKey, so Solr overwrites the existing document when it finds a duplicate url. My problem is that when a recrawled document is sent to Solr for indexing, its read_flag is false, and it overwrites the existing document, which might have read_flag=true
! I have thought of some solutions, but all of them have a performance cost:
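For reference, the app-side "mark as read" update can be expressed as a Solr atomic update, so only the flag changes rather than the whole document being resent. This is a sketch assuming Solr 4+ (where atomic updates were introduced); the url value is just a placeholder:

```xml
<add>
  <doc>
    <field name="url">http://example.com/some-page</field>
    <!-- update="set" replaces only this field on the matching document -->
    <field name="read_flag" update="set">true</field>
  </doc>
</add>
```

One caveat: atomic updates rebuild the document from stored fields, so every field (except copyField targets) must be stored="true" for this to work. The unread list is then just the query q=read_flag:false.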
1) Use two different documents, one for the read_flag and another for the other fields, and use a Solr join at query time. The problem is that I cannot use SolrCloud and multi-sharding this way!
2) Change Nutch to send an update query instead of an add for every document. Indexing performance would probably decrease with this method, and since an update does not apply to documents being inserted for the first time, it is probably not a feasible solution.
3) Change the Solr code to skip certain fields when overwriting a document with a duplicate unique key.
4) Use SolrJ inside Nutch to read the existing document's read_flag value and send the new document for indexing with that value.
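For option 4, the core of what I have in mind is a small merge step before indexing: fetch the existing document by url via SolrJ, and carry its read_flag over to the outgoing document. The sketch below uses plain Maps standing in for SolrJ's SolrDocument/SolrInputDocument so it is self-contained; the class and method names are made up for illustration, not actual Nutch code:

```java
import java.util.HashMap;
import java.util.Map;

public class ReadFlagMerge {

    // Copy read_flag from the document already in the index (fetched by
    // a SolrJ query on the url uniqueKey) onto the freshly crawled one,
    // so re-indexing does not reset the flag to the schema default.
    static Map<String, Object> mergeReadFlag(Map<String, Object> newDoc,
                                             Map<String, Object> existingDoc) {
        if (existingDoc != null && existingDoc.containsKey("read_flag")) {
            newDoc.put("read_flag", existingDoc.get("read_flag"));
        }
        return newDoc;
    }

    public static void main(String[] args) {
        Map<String, Object> fresh = new HashMap<>();
        fresh.put("url", "http://example.com/some-page");
        fresh.put("read_flag", Boolean.FALSE); // schema default from Nutch

        Map<String, Object> existing = new HashMap<>();
        existing.put("url", "http://example.com/some-page");
        existing.put("read_flag", Boolean.TRUE); // user already read it

        System.out.println(mergeReadFlag(fresh, existing).get("read_flag")); // prints true
    }
}
```

The cost is one extra Solr query per indexed document, which is the performance hit mentioned above; batching the lookups (one query covering many urls) would reduce the number of round trips.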
My question is: which of these methods is best suited to my situation, and how can I implement it?
Regards.