
I have 3 non-partitioned tables in Hive.

drop table default.test1;
CREATE EXTERNAL TABLE `default.test1`(
  `c1` string,
  `c2` string,
  `c3` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket_name/dev/sri/sri_test1/';

drop table default.test2;
CREATE EXTERNAL TABLE `default.test2`(
  `c1` string,
  `c2` string,
  `c3` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket_name/dev/sri/sri_test2/';


drop table default.test3;
CREATE EXTERNAL TABLE `default.test3`(
  `c1` string,
  `c2` string,
  `c3` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket_name/dev/sri/sri_test3/';

--INSERT:
insert into default.test1 values("a","b","c");
insert into default.test2 values("d","e","f");
insert overwrite table default.test3 select * from default.test1 union all select * from default.test2;

If I look in S3, two subfolders have been created because of the UNION ALL operation:

aws s3 ls s3://bucket_name/dev/sri/sri_test3/

PRE 1/
PRE 2/

Now the issue is that if I read the default.test3 table in PySpark into a DataFrame, it comes back empty:

df = spark.sql("select * from default.test3")
df.count()
0
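The mismatch above can be reproduced locally without Spark or S3 (a minimal sketch: plain files stand in for the Parquet output, and `1/` and `2/` mirror the subfolders shown in the `aws s3 ls` listing):

```python
import os
import tempfile

# Hypothetical local copy of the layout Hive produced; the real data
# lives in S3, this just mimics the directory structure.
root = tempfile.mkdtemp()
for sub in ("1", "2"):
    os.makedirs(os.path.join(root, sub))
    # stand-in for a Parquet data file written by one UNION ALL branch
    open(os.path.join(root, sub, "000000_0"), "w").close()

# A non-recursive listing of the table location finds no data files,
# which matches the empty DataFrame Spark returns.
top_level = [e for e in os.listdir(root)
             if os.path.isfile(os.path.join(root, e))]
# A recursive walk finds both files, which matches what Hive reads.
recursive = [os.path.join(d, f)
             for d, _, files in os.walk(root) for f in files]
print(len(top_level))   # 0
print(len(recursive))   # 2
```

In other words, the data files are there, but only a reader that descends into subdirectories will see them.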

How can I fix this issue?

Sri
  • If I try to read the table default.test3 in Hive, I am able to get the data, because Hive is able to read the subfolders in S3 but Spark is not. – Sri Aug 05 '17 at 14:42
  • Since the subfolders are created under 'sri_test3', I modified the DDL of the test3 table so its S3 path points to "s3://bucket_name/dev/sri/sri_test3/1/". Then I tried to create a DataFrame for the test3 table. If I show the DataFrame, I can see only one record, because the other record is in the second subfolder. – Sri Aug 05 '17 at 15:24
  • Is there any parameter in Spark which needs to be enabled so that it can read subfolders of a Hive table in S3? – Sri Aug 07 '17 at 10:31
  • Can anyone let me know how to fix this, please? – Sri Oct 11 '17 at 16:54
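Regarding the parameter asked about in the comments, a commonly suggested configuration sketch (untested here; the Hive/Hadoop settings are real, but whether Spark's own Parquet reader honours them depends on the Spark version) is to enable recursive directory reads in the session before querying the table:

```python
# Hedged sketch: assumes a Hive-enabled SparkSession named `spark`,
# as in the question. These settings tell Hive-style input formats
# to descend into subdirectories; their effect on Spark's native
# Parquet reader varies by version, so treat this as a starting point.
spark.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
spark.sql("SET hive.mapred.supportSubDirectories=true")
df = spark.sql("select * from default.test3")
```

This is a configuration fragment, not a verified fix for the setup in the question.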

0 Answers