I have a service that constantly updates files in a GCS bucket laid out in Hive partition format:
bucket
    device_id=aaaa
        month=01
            part-0.parquet
        month=02
            part-0.parquet
        ....
    device_id=bbbb
        month=01
            part-0.parquet
        month=02
            part-0.parquet
        ....
Suppose the current month is month=02, so those are the files being updated. If I run the following query in BigQuery:
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '02';
I get the error: Not found: Files /bigstore/bucket_name/device_id=aaaa/month=02/part-0.parquet
I checked, and the file existed at the time the query ran.
If I instead run the same query for the previous month:
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '01';
I get results without any errors. My guess is that the error happens because I'm modifying the data while querying it. But as I understand it, this should not be a problem with GCS, according to their docs:
Because uploads are strongly consistent, you will never receive a 404 Not Found response or stale data for a read-after-write or read-after-metadata-update operation.
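To make my guess concrete, here is a minimal Python sketch of the race I suspect. It is a toy in-memory model, not the GCS or BigQuery API: ToyBucket, run_query, and the idea that a query first lists the files and then reads the exact object generation it saw are all my own assumptions for illustration.

```python
class ToyBucket:
    """Toy in-memory stand-in for a bucket (hypothetical, not the GCS client)."""

    def __init__(self):
        self._objects = {}  # object name -> (generation, data)
        self._next_generation = 1

    def write(self, name, data):
        # Overwriting replaces the live object with a new generation.
        # Assumption: versioning is off, so the old generation is gone.
        self._objects[name] = (self._next_generation, data)
        self._next_generation += 1

    def list_generations(self):
        # Snapshot of name -> live generation at listing time.
        return {name: gen for name, (gen, _) in self._objects.items()}

    def read(self, name, generation=None):
        # An unpinned read always sees the latest write (strong consistency).
        # A read pinned to a replaced generation fails with "not found".
        if name not in self._objects:
            raise FileNotFoundError(name)
        live_generation, data = self._objects[name]
        if generation is not None and generation != live_generation:
            raise FileNotFoundError(f"{name}#{generation}")
        return data


def run_query(bucket, partition, concurrent_writer=None):
    """Assumed two-phase read: list matching files, then read pinned generations."""
    plan = {
        name: gen
        for name, gen in bucket.list_generations().items()
        if f"/{partition}/" in name
    }
    if concurrent_writer is not None:
        concurrent_writer()  # the service rewrites a file between list and read
    return [bucket.read(name, generation=gen) for name, gen in plan.items()]


def make_bucket():
    bucket = ToyBucket()
    bucket.write("device_id=aaaa/month=01/part-0.parquet", "jan-v1")
    bucket.write("device_id=aaaa/month=02/part-0.parquet", "feb-v1")
    return bucket


def rewrite_current_month(bucket):
    bucket.write("device_id=aaaa/month=02/part-0.parquet", "feb-v2")
```

In this model, run_query(bucket, "month=01", ...) succeeds even while the service rewrites month=02, but run_query(bucket, "month=02", ...) raises FileNotFoundError although the file exists the whole time. If something like this is what happens, it would not contradict the read-after-write guarantee quoted above, since a plain read of the object still returns the latest data.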
I saw some posts suggesting that this could be related to my bucket being multi-region.
Any other insights?