
I processed data with PySpark and sqlContext using the following query:

(sqlContext.sql("select LastUpdate, count(1) as Count from temp_t group by LastUpdate")
           .rdd.coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count"))

It is stored in the following format:

Row(LastUpdate=u'2016-03-14 12:27:55.01', Count=1)
Row(LastUpdate=u'2016-02-18 11:56:54.613', Count=1)
Row(LastUpdate=u'2016-04-13 13:53:32.697', Count=1)
Row(LastUpdate=u'2016-02-22 17:43:37.257', Count=5)
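As far as I can tell, saveAsTextFile writes str() of each element, which would explain why the Row(...) wrappers end up in the file. A stand-alone illustration, using a namedtuple as a stand-in for pyspark's Row (the u'' prefix in my output is just the Python 2 unicode repr):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row -- not a real Spark session here.
Row = namedtuple("Row", ["LastUpdate", "Count"])
r = Row(LastUpdate="2016-03-14 12:27:55.01", Count=1)

# saveAsTextFile writes str(element) for each element, so this is
# exactly what lands in the text file:
print(str(r))  # Row(LastUpdate='2016-03-14 12:27:55.01', Count=1)
```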

But I want to store the data in a Hive table as

LastUpdate                           Count

2016-03-14 12:27:55.01                   1
.                                        .
.                                        .

Here is how I create the table in Hive:

CREATE TABLE Data_Count(LastUpdate STRING, Count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

I have tried many options but have not been successful. How can I store the data in this format?

Alicia Garcia-Raboso
Prasad

2 Answers


Why not load the data into Hive directly, instead of saving it to a file and then loading that file into Hive?

from datetime import datetime, timedelta
from pyspark.sql import HiveContext, Row

hiveCtx = HiveContext(sc)

# Create sample data: two distinct timestamps, repeated
currTime = datetime.now()
currRow = Row(LastUpdate=currTime)
futureTime = currTime + timedelta(days=1)
futureRow = Row(LastUpdate=futureTime)
lst = [currRow, currRow, futureRow, futureRow, futureRow]

# Parallelize the list, convert to a DataFrame, and register it as a temp table
myRdd = sc.parallelize(lst)
df = myRdd.toDF()
df.registerTempTable("temp_t")

# Aggregate and save the result directly as a Hive table
aggDF = hiveCtx.sql("select LastUpdate, count(1) as Count from temp_t group by LastUpdate")
aggDF.saveAsTable("Data_Count")
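The group by LastUpdate aggregation amounts to counting occurrences per distinct timestamp. A plain-Python sketch of the same computation (the timestamp strings here are made-up stand-ins for the sample rows above):

```python
from collections import Counter

# Mirror the sample list: currRow twice, futureRow three times.
timestamps = ["2016-07-28 10:00:00"] * 2 + ["2016-07-29 10:00:00"] * 3

# count(1) ... group by LastUpdate is an occurrence count per timestamp.
counts = Counter(timestamps)
for last_update, count in sorted(counts.items()):
    print("%s  %d" % (last_update, count))
```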
phuclv
Ravi
  • Thanks, I used the same approach, but the data is stored as 2016-03-14 12:27:55.01 1 2016-02-18 11:56:54.613 1, not in table format with column names, and I cannot query it like we do on a table, e.g. dl commands – Prasad Jul 28 '16 at 08:15
  • Can you please post a sample of the dl commands that wouldn't work here? – Ravi Jul 28 '16 at 15:43

You created a table; now you need to fill it with the data you generated.

This could be run from a Spark HiveContext, I believe:

LOAD DATA INPATH '/apps/hive/warehouse/Count' INTO TABLE Data_Count

Alternatively, you may want to build an external table over the existing data:

CREATE EXTERNAL TABLE IF NOT EXISTS Data_Count(
    LastUpdate STRING,
    Count INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/apps/hive/warehouse/Count';
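Either way, the text file must actually use the '|' delimiter the DDL declares, which saveAsTextFile on a Row RDD does not produce. A minimal sketch of formatting each row before saving — the to_pipe_line helper is hypothetical plain Python, and the Spark call is shown only as a comment:

```python
# Hypothetical helper: render one (LastUpdate, Count) pair as the
# '|'-delimited line the table definition above expects.
def to_pipe_line(last_update, count):
    return "%s|%d" % (last_update, count)

# In Spark this would be applied before saving (sketch, not run here):
#   rdd.map(lambda r: to_pipe_line(r.LastUpdate, r.Count)) \
#      .coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count")

print(to_pipe_line("2016-03-14 12:27:55.01", 1))  # 2016-03-14 12:27:55.01|1
```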
phuclv
OneCricketeer
  • I stored the result in a variable (e.g. result) in Spark; after running the above query, result.show() displayed the data in 2 columns with a pipe as separator. Yes, I did "LOAD DATA INPATH '/apps/hive/warehouse/Count' INTO TABLE Data_Count", but the result shows both date and count under one column, "LastUpdate", and the other column, "Count", shows as NULL. – Prasad Jul 28 '16 at 05:44
  • When you show an RDD, it formats with a pipe; it does not save data to a text file with a pipe. You can cat the HDFS file to inspect the actual delimiter. You are getting NULL in the second column because everything is shoved into the first column. – OneCricketeer Jul 28 '16 at 05:56
  • Hello, I checked: it's delimited by ",", and when I changed your query accordingly and executed it, I am now getting both columns as NULL. – Prasad Jul 28 '16 at 06:44