
I processed data with PySpark and sqlContext using the following query:

(sqlContext.sql("select LastUpdate, count(1) as Count from temp_t group by LastUpdate")
           .rdd.coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count"))

It is stored in the following format:

Row(LastUpdate=u'2016-03-14 12:27:55.01', Count=1)
Row(LastUpdate=u'2016-02-18 11:56:54.613', Count=1)
Row(LastUpdate=u'2016-04-13 13:53:32.697', Count=1)
Row(LastUpdate=u'2016-02-22 17:43:37.257', Count=5)
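As far as I can tell, saveAsTextFile writes str() of each element, which would explain why the Row(...) wrappers end up in the file. A stand-alone illustration, using a namedtuple as a stand-in for pyspark's Row (the u'' prefix in my output is just the Python 2 unicode repr):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row -- not a real Spark session here.
Row = namedtuple("Row", ["LastUpdate", "Count"])
r = Row(LastUpdate="2016-03-14 12:27:55.01", Count=1)

# saveAsTextFile writes str(element) for each element, so this is
# exactly what lands in the text file:
print(str(r))  # Row(LastUpdate='2016-03-14 12:27:55.01', Count=1)
```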

But I want to store the data in a Hive table as

LastUpdate                           Count

2016-03-14 12:27:55.01                   1
.                                        .
.                                        .

Here is how I create the table in Hive:

CREATE TABLE Data_Count(LastUpdate STRING, Count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

I have tried many options but have not been successful. How can I store the data in this format?

Alicia Garcia-Raboso
Prasad

2 Answers


Why not load the data into Hive directly, instead of saving it to a file and then loading that file into Hive?

from datetime import datetime, timedelta
from pyspark.sql import HiveContext, Row

hiveCtx = HiveContext(sc)

# Create sample data: two distinct timestamps, repeated
currTime = datetime.now()
currRow = Row(LastUpdate=currTime)
futureTime = currTime + timedelta(days=1)
futureRow = Row(LastUpdate=futureTime)
lst = [currRow, currRow, futureRow, futureRow, futureRow]

# Parallelize the list, convert to a DataFrame, and register it as a temp table
myRdd = sc.parallelize(lst)
df = myRdd.toDF()
df.registerTempTable("temp_t")

# Aggregate and save the result directly as a Hive table
aggDF = hiveCtx.sql("select LastUpdate, count(1) as Count from temp_t group by LastUpdate")
aggDF.saveAsTable("Data_Count")
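The group by LastUpdate aggregation amounts to counting occurrences per distinct timestamp. A plain-Python sketch of the same computation (the timestamp strings here are made-up stand-ins for the sample rows above):

```python
from collections import Counter

# Mirror the sample list: currRow twice, futureRow three times.
timestamps = ["2016-07-28 10:00:00"] * 2 + ["2016-07-29 10:00:00"] * 3

# count(1) ... group by LastUpdate is an occurrence count per timestamp.
counts = Counter(timestamps)
for last_update, count in sorted(counts.items()):
    print("%s  %d" % (last_update, count))
```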
phuclv
Ravi
  • Thanks, I used the same approach, but the data is stored as 2016-03-14 12:27:55.01 1 2016-02-18 11:56:54.613 1, not in table format with column names, and I cannot query it like we do on a table, e.g. dl commands – Prasad Jul 28 '16 at 08:15
  • Can you please post a sample of the dl commands that wouldn't work here? – Ravi Jul 28 '16 at 15:43

You created a table; now you need to fill it with the data you generated.

This could be run from a Spark HiveContext, I believe:

LOAD DATA INPATH '/apps/hive/warehouse/Count' INTO TABLE Data_Count

Alternatively, you may want to build an external table over the existing data:

CREATE EXTERNAL TABLE IF NOT EXISTS Data_Count(
    LastUpdate STRING,
    Count INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/apps/hive/warehouse/Count';
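Either way, the text file must actually use the '|' delimiter the DDL declares, which saveAsTextFile on a Row RDD does not produce. A minimal sketch of formatting each row before saving — the to_pipe_line helper is hypothetical plain Python, and the Spark call is shown only as a comment:

```python
# Hypothetical helper: render one (LastUpdate, Count) pair as the
# '|'-delimited line the table definition above expects.
def to_pipe_line(last_update, count):
    return "%s|%d" % (last_update, count)

# In Spark this would be applied before saving (sketch, not run here):
#   rdd.map(lambda r: to_pipe_line(r.LastUpdate, r.Count)) \
#      .coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count")

print(to_pipe_line("2016-03-14 12:27:55.01", 1))  # 2016-03-14 12:27:55.01|1
```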
phuclv
OneCricketeer
  • I stored the result in a variable (e.g. result) in Spark; after running the above query, result.show() displayed the data in 2 columns with a pipe as separator. Yes, I did "LOAD DATA INPATH '/apps/hive/warehouse/Count' INTO TABLE Data_Count", but the result shows both date and count under one column, "LastUpdate", and the other column, "Count", shows as NULL. – Prasad Jul 28 '16 at 05:44
  • When you show an RDD, it formats with a pipe; it does not save data to a text file with a pipe. You can cat the HDFS file to inspect the actual delimiter. You are getting NULL in the second column because everything is shoved into the first column. – OneCricketeer Jul 28 '16 at 05:56
  • Hello, I checked: it's delimited by ",", and when I changed your query accordingly and executed it, I am now getting both columns as NULL. – Prasad Jul 28 '16 at 06:44