WHERE filename in Apache Drill does a full scan in all files

Question

select distinct filename from dfs.contoso.`folder/CSVs/`
> 2021-01.csv
> 2021-02.csv
> ...

or

select count(*) as cnt from dfs.contoso.`folder/CSVs/`
where filename = '2021-01.csv'
> 4562751239

The problem is both of these queries take AN HOUR. From the plan is obvious that Drill goes through ALL files in destination folder and AFTER THEN it filters the data by filename. That's absolutely unusable for bigger datasets.

Unfortunately, I cannot change the data structure and I cannot have single file in the from clause (from dfs.contoso.folder/CSVs/2021-01.csv`) because at that point Drill does not use created CSV schema which I need.

Is there any reason why Drill does this? How can we do it effectively?

Drill 1.19

UPDATE

The main problem is not enumerating files in a folder but reading a data from a single file from many in a directory.

Having this filesystem:

CsvHistory/2019-01.csv  [2GB]
CsvHistory/2019-02.csv  [3GB]
...
CsvHistory/2021-09.csv  [6GB]

We needed to do a query directly from one file without reading the others from the folder and without changing the filesystem structure since it's not allowed.

We needed this query not to traverse all the other files because it's huge waste of performance.

cgivre · Answer 1 · 2021-09-24T03:36:10.153

I'm sorry you gave up on Drill, but I'm going to post this for anyone else who might be reading this.

You do have to understand a bit about how Drill handles schemas. Firstly, Drill attempts to infer the schema from the underlying data. For the queries listed above, it looks like you are trying to find the file names in a given directory and count rows in each file. Neither of these requires a schema at all. As I mentioned, you should use the INFORMATION_SCHEMA to query directories or a SHOW FILES IN <dir> query for that.

Providing a Schema

If the schema Drill infers isn't cutting it for you, you can provide a schema to Drill either at query time or by running a CREATE SCHEMA query which will create a hidden schema file. Here is a link to the docs for that functionality: https://drill.apache.org/docs/create-or-replace-schema/. I've not used this functionality extensively, but I do know that you can certainly provide a schema file for single files. Not sure for entire directories but I believe it was meant to do that.

The documentation for the inline schema is a bit lacking but you can also do this at query time as follows:

SELECT Year 
FROM table(dfs.test.`test.csvh` (schema=>'inline=(`Year` int)')) 
WHERE Make = 'Ford'

The result here would be that the Year column would be interpreted as an INT rather than the default VARCHAR.

**UPDATE: ** It turns out that you CAN also provide a schema in the table() function above. See below

SELECT Year 
FROM table(dfs.test.`test.csvh` (schema => 'path=`/path/to/schema`')) 
WHERE Make = 'Ford'

Thanks for detailed explaination about the schemas. 1) I wasn't familiar with the `table(...)` notation and had no idea you can specify an inline schema, mostly due to lack of the docs. Don't you know how we can pass an existing schema file directly? I tried variants of `(schema=>'path=dfs.test.`example/.drill.schema`')` but it says that the file cannot be found in all of the attempts. 2) `Not sure for entire directories but I believe it was meant to do that.`: CREATE SCHEMA works perfectly when querying whole directory but when using on single file, it will fail. — rudolfdobias, Sep 17 '21 at 14:45
In this thread, I was thinking that a useful feature to add would be to be able to specify a schema file in the `table()` function. I'm going to create a JIRA for that. — cgivre, Sep 17 '21 at 14:52
Thanks for that. FYI – I updated the question to be more clear about what we were trying to achieve. — rudolfdobias, Sep 17 '21 at 14:58

score 1 · Answer 2 · answered Sep 09 '21 at 14:09

Possibly Drill shuffles filenames and the records in it. You might move each file to directories and group by on dir0:

# hadoop fs -mv /folder/CSVs/2021-01.csv /folder/CSVs/2021-01/2021-01.csv
sql> select dir0 as fileprefix,count(1) from dfs.contoso.`/folder/CSVs/` group by dir0

score 1 · Answer 3 · answered Sep 14 '21 at 03:05

I'm a little unclear as to what you're trying to do.

If you are querying an individual file, you should not specify the file name as you are doing. As you've noted, Drill will do a full recursive directory scan, then open the file you requested. See below as to how to query a specific file. The better way to query an individual file is to specify the file name in the FROM clause as shown below.

SELECT count(*) as cnt 
FROM dfs.contoso.`folder/CSVs/2021-01.csv`

You can also use globs and wildcards as well in the file path. You might also want to look at the drill docs 1 for some more info about efficiently querying directories.

If you are looking to explore what files are present in various directories, you'll want to use the INFORMATION_SCHEMA for that. Take a look here 2 and take a look at the section about files.

For instance:

SELECT * 
FROM information_schema.`files` 
WHERE schema_name = 'dfs.contoso'

Unfortunately, as I mentioned in the second paragraph, when you use `FROM `, Drill wont use the table schema declared for the directory. You cannot declare schema for a single file and without that schema it's unusable for us, among other things, due to absent error handling, [see another question](https://stackoverflow.com/questions/59226235/how-to-get-more-error-details-in-apache-drill). — rudolfdobias, Sep 17 '21 at 12:10
As Drill unacceptable for such behavior and tons of errors, we already solved this issue by migrating company to Trino in which we dont have to solve such problems anymore. Thanks for your answer, though. — rudolfdobias, Sep 17 '21 at 12:14

WHERE filename in Apache Drill does a full scan in all files

3 Answers3

Providing a Schema