I am trying to use HIVE 0.13 to access cassandra 2.0.8 column families created with CQL3.
Here is how I created my column families:
CREATE KEYSPACE IF NOT EXISTS Identification
WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
'DC1' : 2 };
USE Identification;
CREATE TABLE IF NOT EXISTS entitylookup (
name varchar,
value varchar,
entity_id uuid,
PRIMARY KEY ((name, value), entity_id))
WITH
caching=all
;
I followed the instructions from the README of this project: https://github.com/tuplejump/cash/tree/master/cassandra-handler
I generated hive-cassandra-1.2.6.jar, copied it and cassandra-all-1.2.6.jar, cassandra-thrift-1.2.6.jar to hive lib folder.
Then I started hive and tried the following:
CREATE EXTERNAL TABLE identification.entitylookup(name string, value string, entity_id binary)
STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler' WITH SERDEPROPERTIES("cql.primarykey" = "name, value", "cassandra.host" = "localhost", "cassandra.port "= "9160")
TBLPROPERTIES ("cassandra.ks.name" = "identification", "cassandra.ks.stratOptions"="'DC1':2", "cassandra.ks.strategy"="NetworkTopologyStrategy");
Here is the output:
hive> mvalle@mvalle:~/hadoop$ hive
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Logging initialized using configuration in jar:file:/home/mvalle/hadoop/apache-hive-0.13.0-bin/lib/hive-common-0.13.0.jar!/hive-log4j.properties
OpenJDK 64-Bit Server VM warning: You have loaded library /home/mvalle/hadoop/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
hive> CREATE EXTERNAL TABLE identification.entitylookup(name string, value string, entity_id binary)
> STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler' WITH SERDEPROPERTIES("cql.primarykey" = "name, value", "cassandra.host" = "ident.s1mbi0se.com", "cassandra.port "= "9160")
> TBLPROPERTIES ("cassandra.ks.name" = "identification", "cassandra.ks.stratOptions"="'DC1':2", "cassandra.ks.strategy"="NetworkTopologyStrategy");
FAILED: SemanticException [Error 10072]: Database does not exist: identification
Question: how do I do to get more information about what is going wrong? I tried the same hive commando using "Identification" (capital I), but same result. Is it possible to access CQL3 column families in cassandra community? It seems the keyspace has not been mapped, but I don't see how to map then. In DSE, they are automatically mapped...
EDIT:
To clarify more, if I create an empty database and then try to create the external table, here is what I get:
hive> create database identification;
OK
Time taken: 0.154 seconds
hive> CREATE EXTERNAL TABLE identification.entity_lookup(name string, value string, entity_id binary)
> STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler' WITH SERDEPROPERTIES("cql.primarykey" = "name, value", "cassandra.host" = "localhost", "cassandra.port "= "9160")
> TBLPROPERTIES ("cassandra.ks.name" = "identification", "cassandra.ks.stratOptions"="'DC1':3", "cassandra.ks.strategy"="NetworkTopologyStrategy");
OK
Time taken: 3.58 seconds
hive> select * from identification.entity_lookup limit 10;
OK
Exception in thread "main" java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
at org.apache.hadoop.hive.cassandra.input.cql.HiveCqlInputFormat.getSplits(HiveCqlInputFormat.java:166)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:418)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:561)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:534)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:137)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1488)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:285)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)