select distinct filename from dfs.contoso.`folder/CSVs/`
> 2021-01.csv
> 2021-02.csv
> ...
or
select count(*) as cnt from dfs.contoso.`folder/CSVs/`
where filename = '2021-01.csv'
> 4562751239
The problem is both of these queries take AN HOUR. From the plan is obvious that Drill goes through ALL files in destination folder and AFTER THEN it filters the data by filename. That's absolutely unusable for bigger datasets.
Unfortunately, I cannot change the data structure and I cannot have single file in the from clause (from dfs.contoso.
folder/CSVs/2021-01.csv`) because at that point Drill does not use created CSV schema which I need.
Is there any reason why Drill does this? How can we do it effectively?
Drill 1.19
UPDATE
The main problem is not enumerating files in a folder but reading a data from a single file from many in a directory.
Having this filesystem:
CsvHistory/2019-01.csv [2GB]
CsvHistory/2019-02.csv [3GB]
...
CsvHistory/2021-09.csv [6GB]
We needed to do a query directly from one file without reading the others from the folder and without changing the filesystem structure since it's not allowed.
We needed this query not to traverse all the other files because it's huge waste of performance.