On Cloudera I think you have these options:
About MapReduceIndexerTool, here is a quick guide:
Index a CSV to Solr using MapReduceIndexerTool
This guide shows you how to index/upload a .csv file to Solr using MapReduceIndexerTool. The procedure reads the CSV from HDFS and writes the index directly inside HDFS.
See also https://www.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html .
Assuming that you have:
- a valid Cloudera installation (see THIS_IS_YOUR_CLOUDERA_HOST; if you are using the Docker Quickstart it should be quickstart.cloudera)
- a CSV file stored in HDFS (see THIS_IS_YOUR_INPUT_CSV_FILE, like /your-hdfs-dir/your-csv.csv); a sketch for uploading it is shown right after this list
- a valid destination Solr collection with the expected fields already configured (see THIS_IS_YOUR_DESTINATION_COLLECTION)
- an output directory, which will be the Solr configured instanceDir (see THIS_IS_YOUR_CORE_INSTANCEDIR) and should be an HDFS path
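If the CSV is not in HDFS yet, copying it there is usually just a matter of the following commands (a minimal sketch; the paths are the placeholders used above, adapt them to your case):

# create the target directory and copy the local file into HDFS
hdfs dfs -mkdir -p /your-hdfs-dir
hdfs dfs -put ./your-csv.csv /your-hdfs-dir/your-csv.csv
# check that the file is there
hdfs dfs -ls /your-hdfs-dir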
For this example we will process a TAB-separated file with uid, firstName and lastName columns. The first row contains the headers. The Morphlines configuration file will skip the first line, so the actual column names don't matter; what matters is that the physical order of the columns matches the columns list in the readCSV command below.
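For instance, keeping the same physical order as the columns list used in the Morphlines file below (uid, lastName, firstName), the first lines of such a file could look like this (made-up values, columns separated by TAB):

uid	lastName	firstName
1	Doe	John
2	Smith	Jane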
On Solr we should configure the fields with something like this:
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="uid" type="string" indexed="true" stored="true" required="true" />
<field name="firstName" type="text_general" indexed="true" stored="true" />
<field name="lastName" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" multiValued="true" />
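If the destination collection does not exist yet, on Cloudera it can typically be created with solrctl. A rough sketch (the names are the placeholders used above; the generated conf/schema.xml is where the fields above go):

# generate a local config template and edit conf/schema.xml with the fields above
solrctl instancedir --generate $HOME/solr_configs
# upload the config to ZooKeeper and create the collection (1 shard here)
solrctl instancedir --create THIS_IS_YOUR_DESTINATION_COLLECTION $HOME/solr_configs
solrctl collection --create THIS_IS_YOUR_DESTINATION_COLLECTION -s 1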
Then you should create a Morphlines configuration file (csv-to-solr-morphline.conf) with the following code:
# Specify server locations in a SOLR_LOCATOR variable; used later in
# variable substitutions:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : THIS_IS_YOUR_DESTINATION_COLLECTION

  # ZooKeeper ensemble
  zkHost : "THIS_IS_YOUR_CLOUDERA_HOST:2181/solr"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        readCSV {
          separator : "\t"
          # These columns should map to the ones configured in Solr and are
          # expected in this order inside the CSV
          columns : [uid,lastName,firstName]
          ignoreFirstLine : true
          quoteChar : ""
          commentPrefix : ""
          trim : true
          charset : UTF-8
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that is not specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
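Before launching the real job you can check that the morphline parses your file as expected: MapReduceIndexerTool has a --dry-run option that runs the morphline locally and prints the resulting documents to stdout instead of loading them into Solr. A sketch, using the same placeholders as the full command below:

hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--dry-run \
--output-dir hdfs://quickstart.cloudera/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
--morphline-file ./csv-to-solr-morphline.conf \
--zk-host quickstart.cloudera:2181/solr \
--collection THIS_IS_YOUR_DESTINATION_COLLECTION \
hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE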
To import, run the following command from inside the cluster:
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--output-dir hdfs://quickstart.cloudera/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
--morphline-file ./csv-to-solr-morphline.conf \
--zk-host quickstart.cloudera:2181/solr \
--solr-home-dir /THIS_IS_YOUR_CORE_INSTANCEDIR \
--collection THIS_IS_YOUR_DESTINATION_COLLECTION \
--go-live \
hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE
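When the job finishes (with --go-live the built index is merged into the live Solr servers), you can check that the documents are there with a simple query, assuming Solr is listening on the default 8983 port:

curl "http://THIS_IS_YOUR_CLOUDERA_HOST:8983/solr/THIS_IS_YOUR_DESTINATION_COLLECTION/select?q=*:*&rows=5&wt=json"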
Some considerations:
- You may have to use sudo -u hdfs to run the above command, because your user probably does not have permission to write to the HDFS output directory.
- By default the Cloudera QuickStart VM has a very small memory and heap configuration. If you get out-of-memory or heap-space exceptions, I suggest increasing them via Cloudera Manager -> Yarn -> Configurations (http://THIS_IS_YOUR_CLOUDERA_HOST:7180/cmf/services/11/config#filterdisplayGroup=Resource+Management). I used 1 GB of memory and 500 MB of heap for both the map and reduce jobs.
Consider also changing yarn.app.mapreduce.am.command-opts, mapreduce.map.java.opts, mapreduce.map.memory.mb and mapreduce.reduce.memory.mb inside /etc/hadoop/conf/mapred-site.xml (see the sketch right after this list).
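A sketch of how those properties could look in /etc/hadoop/conf/mapred-site.xml, with the values I mentioned above (1 GB containers, roughly 500 MB heap); on a managed cluster it is safer to change them through Cloudera Manager so they are not overwritten:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx512m</value>
</property>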
Other resources: