While going through the Cloudera QuickStart tutorial, I ran into this error:

Input path does not exist: hdfs://quickstart/user/hive/warehouse/products

The problem occurred while I was executing this Spark code:

val orders = order_items.map { x => (
    x.get("order_item_product_id"),
    (x.get("order_item_order_id"), x.get("order_item_quantity")))
}.join(
  products.map { x => (
    x.get("product_id"),
    (x.get("product_name")))
  }
).map(x => (
    scala.Int.unbox(x._2._1._1), // order_id
    (
        scala.Int.unbox(x._2._1._2), // quantity
        x._2._2.toString // product_name
    )
)).groupByKey()

How can I resolve this?

Dennis Jaheruddin

1 Answer


The HDFS path refers to the Hive warehouse. A quick check confirmed that the path does not exist on HDFS (and neither does the Hive table).
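
You can run the same check from a terminal on the QuickStart VM (a quick sketch assuming the default warehouse location from the error message; adjust the path if yours differs):

# Check whether the Parquet directory exists on HDFS;
# before the import this fails with "No such file or directory"
hdfs dfs -ls /user/hive/warehouse/products

# Check whether the table shows up in Hive's default database
hive -e "SHOW TABLES;"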

In this specific case, the cause is that the tutorial's topics are not independent: the code from the first topic (the Sqoop import) must be run first to ensure that the data is actually in place for the Spark section.

You can go back a few steps in the tutorial to find the relevant code; in my case it was:

sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import

Note that you will want to exit the Spark shell before running the Sqoop command.
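
Putting it together, the sequence looks roughly like this (a sketch assuming the QuickStart VM defaults; the table names in the comment come from the tutorial's retail_db database):

:quit                              # inside spark-shell: leave the Spark shell first

# ...run the sqoop import-all-tables command above, then verify:
hdfs dfs -ls /user/hive/warehouse  # expect one directory per imported table,
                                   # e.g. categories, customers, departments,
                                   # order_items, orders, products

spark-shell                        # restart the shell and re-run the Spark code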

Dennis Jaheruddin