There are two Hive tables built from the same source and with the same logic, but with slightly different queries:
Table 1 query is:
create table test.table1 stored as orc as
select
f1,
mc.f2 as f2,
mc.f3 as f3,
f4
from src.test_table lateral view explode(multiple_field) mcTable as mc
union all
select
f1,
f5 as f2,
f6 as f3,
f4
from src.test_table
where multiple_field is null or size(multiple_field) < 1
;
Next, Table 2's query, which implements the same logic but is shortened by using LATERAL VIEW OUTER EXPLODE:
create table test.table2 stored as orc as
select
f1,
if(mc is null, f5, mc.f2) as f2,
if(mc is null, f6, mc.f3) as f3,
f4
from src.test_table lateral view outer explode(multiple_field) mcTable as mc
;
Both tables were created successfully, contain the same number of rows, and hold identical data (checked via the Hive Beeline client). Then I try to read the tables' data with Spark:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hive-enabled Spark session for reading the metastore tables.
SparkSession sparkSession = SparkSession
        .builder()
        .config("hive.execution.engine", "mr")
        .appName("OrcExportJob")
        .enableHiveSupport()
        .getOrCreate();

String hql = "select * from test.table1"; // or test.table2
Dataset<Row> sqlDF = sparkSession.sql(hql);
In the case of test.table2 everything is fine: sqlDF contains all the data. Reading test.table1 leads to a different result: sqlDF contains no data at all (0 rows). The Spark logs show no errors, just as if the table were really empty.
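For reference, this is roughly how the difference shows up (a minimal sketch reusing sqlDF from above; count() and show() are standard Dataset methods):

// Returns the full row count for test.table2, but 0 for test.table1.
System.out.println(sqlDF.count());
// Prints the data for test.table2, but only the header for test.table1.
sqlDF.show(10, false);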
I've heard Spark has problems reading transactional or partitioned Hive tables, but neither applies here.
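This can be double-checked from the same session (a sketch; describe formatted is standard HiveQL, and for both tables the output shows no partition columns and no transactional=true property):

// Inspect table metadata to rule out partitioning and ACID/transactional storage.
sparkSession.sql("describe formatted test.table1").show(100, false);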
Digging around, I discovered that Hive stores the ORC files for my tables in different ways (a listing sketch follows the tree):
/
├─ user/
│ ├─ hive/
│ │ ├─ warehouse/
│ │ │ ├─ test.db/
│ │ │ │ ├─ table1/
│ │ │ │ │ ├─ 1/
│ │ │ │ │ │ ├─ 1/
│ │ │ │ │ │ │ ├─ 000000_0
│ │ │ │ │ ├─ 2/
│ │ │ │ │ │ ├─ 000000_0
│ │ │ │ │ │ ├─ 000001_0
│ │ │ │ │ │ ├─ 000002_0
│ │ │ │ │ │ ├─ 000003_0
│ │ │ │ ├─ table2/
│ │ │ │ │ ├─ 000000_0
│ │ │ │ │ ├─ 000001_0
│ │ │ │ │ ├─ 000002_0
│ │ │ │ │ ├─ 000003_0
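The tree above can be reproduced programmatically; here is a minimal sketch using the Hadoop FileSystem API, reusing sparkSession from earlier (the path is the warehouse location shown in the tree; IOException handling omitted for brevity):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Recursively walk a table directory and print every data file it contains.
Configuration conf = sparkSession.sparkContext().hadoopConfiguration();
FileSystem fs = FileSystem.get(conf);
RemoteIterator<LocatedFileStatus> files =
        fs.listFiles(new Path("/user/hive/warehouse/test.db/table1"), true);
while (files.hasNext()) {
    System.out.println(files.next().getPath());
}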
Could someone help me figure out why Spark doesn't see Table 1's data?
Why does Hive keep 5 files in a nested directory structure for Table 1 (apparently one subdirectory per UNION ALL branch) and only 4 files in a flat directory for Table 2?
Could this somehow affect Spark's reading process?
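(For anyone reproducing this: a single nested file can also be probed outside the table definition, using the paths from the tree above, to check whether the files themselves are readable:)

// Read one of Table 1's nested ORC files directly, bypassing the metastore.
Dataset<Row> probe = sparkSession.read()
        .orc("/user/hive/warehouse/test.db/table1/1/1/000000_0");
probe.show(5, false);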
P.S. Hive version is 2.3.3, Spark version is 2.4.4