On Cloudera I think you have these options:
About MapReduceIndexerTool, here is a quick guide:
Index a CSV to Solr using MapReduceIndexerTool
This guide shows you how to index/upload a .csv file to Solr using MapReduceIndexerTool. The procedure reads the CSV from HDFS and writes the index directly inside HDFS.
See also https://www.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html .
Assuming that you have:
- a valid Cloudera installation (see THIS_IS_YOUR_CLOUDERA_HOST; if you are using the Docker Quickstart it should be quickstart.cloudera)
- a CSV file stored in HDFS (see THIS_IS_YOUR_INPUT_CSV_FILE, like /your-hdfs-dir/your-csv.csv); a sketch for uploading it is shown right after this list
- a valid destination Solr collection with the expected fields already configured (see THIS_IS_YOUR_DESTINATION_COLLECTION)
- an output directory, which will be the Solr configured instanceDir (see THIS_IS_YOUR_CORE_INSTANCEDIR) and should be an HDFS path
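If the CSV is not in HDFS yet, copying it there is usually just a matter of the following commands (a minimal sketch; the paths are the placeholders used above, adapt them to your case):

# create the target directory and copy the local file into HDFS
hdfs dfs -mkdir -p /your-hdfs-dir
hdfs dfs -put ./your-csv.csv /your-hdfs-dir/your-csv.csv
# check that the file is there
hdfs dfs -ls /your-hdfs-dir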
For this example we will process a TAB-separated file with uid, firstName and lastName columns. The first row contains the headers. The Morphlines configuration file will skip the first line, so the actual column names don't matter; what matters is that the physical order of the columns matches the columns list in the readCSV command below.
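For instance, keeping the same physical order as the columns list used in the Morphlines file below (uid, lastName, firstName), the first lines of such a file could look like this (made-up values, columns separated by TAB):

uid	lastName	firstName
1	Doe	John
2	Smith	Jane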
On Solr we should configure the fields with something like this:
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="uid" type="string" indexed="true" stored="true" required="true" />
<field name="firstName" type="text_general" indexed="true" stored="true" />
<field name="lastName" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" multiValued="true" />
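If the destination collection does not exist yet, on Cloudera it can typically be created with solrctl. A rough sketch (the names are the placeholders used above; the generated conf/schema.xml is where the fields above go):

# generate a local config template and edit conf/schema.xml with the fields above
solrctl instancedir --generate $HOME/solr_configs
# upload the config to ZooKeeper and create the collection (1 shard here)
solrctl instancedir --create THIS_IS_YOUR_DESTINATION_COLLECTION $HOME/solr_configs
solrctl collection --create THIS_IS_YOUR_DESTINATION_COLLECTION -s 1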
Then you should create a Morphlines configuration file (csv-to-solr-morphline.conf) with the following code:
# Specify server locations in a SOLR_LOCATOR variable; used later in
# variable substitutions:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : THIS_IS_YOUR_DESTINATION_COLLECTION

  # ZooKeeper ensemble
  zkHost : "THIS_IS_YOUR_CLOUDERA_HOST:2181/solr"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        readCSV {
          separator : "\t"
          # These columns should map to the ones configured in Solr and are
          # expected in this order inside the CSV
          columns : [uid,lastName,firstName]
          ignoreFirstLine : true
          quoteChar : ""
          commentPrefix : ""
          trim : true
          charset : UTF-8
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that is not specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
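Before launching the real job you can check that the morphline parses your file as expected: MapReduceIndexerTool has a --dry-run option that runs the morphline locally and prints the resulting documents to stdout instead of loading them into Solr. A sketch, using the same placeholders as the full command below:

hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--dry-run \
--output-dir hdfs://quickstart.cloudera/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
--morphline-file ./csv-to-solr-morphline.conf \
--zk-host quickstart.cloudera:2181/solr \
--collection THIS_IS_YOUR_DESTINATION_COLLECTION \
hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE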
To import, run the following command from inside the cluster:
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--output-dir hdfs://quickstart.cloudera/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
--morphline-file ./csv-to-solr-morphline.conf \
--zk-host quickstart.cloudera:2181/solr \
--solr-home-dir /THIS_IS_YOUR_CORE_INSTANCEDIR \
--collection THIS_IS_YOUR_DESTINATION_COLLECTION \
--go-live \
hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE
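When the job finishes (with --go-live the built index is merged into the live Solr servers), you can check that the documents are there with a simple query, assuming Solr is listening on the default 8983 port:

curl "http://THIS_IS_YOUR_CLOUDERA_HOST:8983/solr/THIS_IS_YOUR_DESTINATION_COLLECTION/select?q=*:*&rows=5&wt=json"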
Some considerations:
- You may have to use sudo -u hdfs to run the above command, because your user probably does not have permission to write to the HDFS output directory.
- By default the Cloudera QuickStart VM has a very small memory and heap configuration. If you get out-of-memory or heap-space exceptions, I suggest increasing them via Cloudera Manager -> Yarn -> Configurations (http://THIS_IS_YOUR_CLOUDERA_HOST:7180/cmf/services/11/config#filterdisplayGroup=Resource+Management). I used 1 GB of memory and 500 MB of heap for both the map and reduce jobs.
Consider also changing yarn.app.mapreduce.am.command-opts, mapreduce.map.java.opts, mapreduce.map.memory.mb and mapreduce.reduce.memory.mb inside /etc/hadoop/conf/mapred-site.xml (see the sketch right after this list).
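A sketch of how those properties could look in /etc/hadoop/conf/mapred-site.xml, with the values I mentioned above (1 GB containers, roughly 500 MB heap); on a managed cluster it is safer to change them through Cloudera Manager so they are not overwritten:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx512m</value>
</property>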
Other resources: