
I am new to the spark shell and I am trying to add a new table and read it. I have created this file:

workers.txt:

1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23

and run the commands:

spark-shell
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE workers (id INT, name VARCHAR(64), age INT)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'workers.txt' INTO TABLE workers")
 >> res5: org.apache.spark.sql.DataFrame = []
val resultW = sqlContext.sql("FROM workers SELECT *")
>> resultW: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
resultW.show()
>>
+----+----+----+                                                                
|  id|name| age|
+----+----+----+
|null|null|null|
|null|null|null|
|null|null|null|
|null|null|null|
|null|null|null|
+----+----+----+

but as you can see, the table contains only nulls. Why is that? The file workers.txt is in the same working directory.

Atheel Massalha

1 Answer


The reason you get nulls in your DataFrame is how Spark splits the input into records:

The line separator handles all \r, \r\n and \n by default

That means a whole line such as

1201, satish, 25

is handled as one single value.

Spark then tries to fit that whole string into your declared type INT, which is not possible because the line contains commas, spaces and letters. That is why every column comes back as null, even though the row count is correct.
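
If you would rather keep the HiveContext approach from your question, the LOAD would need the table to be declared with a comma as the field delimiter, so each line is split into three fields instead of being kept as one. A minimal sketch, assuming the default Hive text format (this DDL is illustrative, not from the original answer, and the leading spaces in your file may still make the INT columns come out as null unless the values are trimmed first):

// sketch: same DDL as the question, plus an explicit ',' field delimiter
sqlContext.sql("CREATE TABLE workers (id INT, name VARCHAR(64), age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
sqlContext.sql("LOAD DATA LOCAL INPATH 'workers.txt' INTO TABLE workers")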

The best way in your case, though, is to read the file through the SparkSession CSV reader:

spark.read.option("delimiter", ",").csv("./workers.txt").show()

You can then cast the columns to the types you like, or apply a schema and persist the result into a table:

import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType()
  .add("Id", IntegerType, true)
  .add("Name", StringType, true)
  .add("Age", IntegerType, true)

spark.read.option("delimiter", ",").schema(schema).csv("./workers.txt")
  .write.option("path", "/some/path").saveAsTable("t")

However, pay attention to your file content: there is a leading space before the names and numbers, and the reader will have problems parsing those values once the INT schema is applied.
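
One way around that, sketched here assuming the schema defined above is in scope, is the CSV reader's ignoreLeadingWhiteSpace option, which strips those spaces while parsing:

// sketch: reuse the schema above and drop the leading spaces during parsing
spark.read
  .option("delimiter", ",")
  .option("ignoreLeadingWhiteSpace", true)
  .schema(schema)
  .csv("./workers.txt")
  .show()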

For more info about saving a DataFrame to a table in Spark, see:

https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables

Young
  • thanks! the command you wrote really works, but how can I then create a table from this? What I tried is: sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) Row FORMAT DELIMITED FIELDS TERMINATED BY ',' USING hive") but it didn't work – Atheel Massalha Sep 21 '22 at 05:25
  • edited answer for your question – Young Sep 21 '22 at 16:45