
I am trying to read some Parquet files using the dask.dataframe.read_parquet method. In the data I have a column named timestamp, which contains values such as:

0     2018-12-20 19:00:00
1     2018-12-20 20:00:00
2     2018-12-20 21:00:00
3     2018-12-20 22:00:00
4     2018-12-20 23:00:00
5     2018-12-21 00:00:00
6     2018-12-21 01:00:00
7     2018-12-21 02:00:00
8     2018-12-21 03:00:00
9     2018-12-21 04:00:00
10    2018-12-21 05:00:00
11    2018-12-21 06:00:00
12    2018-12-21 07:00:00
13    2018-12-21 08:00:00
14    2018-12-21 09:00:00
15    2018-12-21 10:00:00
16    2018-12-21 11:00:00
17    2018-12-21 12:00:00
18    2018-12-21 13:00:00
19    2018-12-21 14:00:00
20    2018-12-21 15:00:00

and I would like to filter on the timestamp and return, say, only the data from the last 10 days. How do I do this?

I tried something like:

import pandas as pd
import dask.dataframe as dask_df
from datetime import datetime, timedelta
filter_timestamp_days = pd.Timestamp(datetime.today() - timedelta(days=days))
filters = [('timestamp', '>', filter_timestamp_days)]
df = dask_df.read_parquet(DATA_DIR, engine='pyarrow', filters=filters)

But I am getting the error:

TypeError: Cannot compare type 'Timestamp' with type 'bytes_'

  • You might be hitting the following issue: https://github.com/pandas-dev/pandas/issues/20089 – Emre Sevinç Sep 16 '19 at 09:57
  • What do you get if you run `df.dtypes` *without* applying any filter? – Emre Sevinç Sep 16 '19 at 09:58
  • Thanks, but not exactly. While datetime might not be a first-class datatype in pandas, Timestamp is. In my case, before converting to a pandas dataframe with `.compute()`, I would like to filter the data being loaded, so that I do not have to load the data I do not need. It is much easier to filter the dates in pandas once I have loaded all the data, but I do not want to do this; loading would be faster if I only load what I need. In simple terms, my problem is: `Dask read_parquet should load parquet files from the given date`. – Aladejubelo Oluwashina Sep 16 '19 at 10:05
  • ` timestamp datetime64[ns] revenue float64 conversions int64 cogs float64 ... ` – Aladejubelo Oluwashina Sep 16 '19 at 10:07
  • Does it work if you try `filters = [('timestamp', '>', filter_timestamp_days.to_datetime64())]` ? – Emre Sevinç Sep 16 '19 at 10:26
  • It didn't work. Testing a new idea inspired by this suggestion. – Aladejubelo Oluwashina Sep 16 '19 at 10:50
  • `TypeError: '>=' not supported between instances of 'numpy.ndarray' and 'numpy.bytes_'` – Aladejubelo Oluwashina Sep 16 '19 at 11:19
  • How is data partitioned? – rpanai Sep 16 '19 at 18:01
  • Have you read this part of the docstring for read_parquet: *to prevent the loading of some chunks of the data, and only if relevant statistics have been included in the metadata*? – rpanai Sep 16 '19 at 18:03
  • You should answer @EmreSevinç question about `dtype`. Do you mind trying to produce a [mcve](/help/mcve)? In particular creating a sample of your dataframe. I can't reproduce your error. – rpanai Sep 16 '19 at 18:32
  • I actually did. It's `datetime64[ns]`, that was why he suggested trying `filters = [('timestamp', '>', filter_timestamp_days.to_datetime64())]`. Which did not work. – Aladejubelo Oluwashina Sep 16 '19 at 18:46
  • The data is partitioned by row based on the number of processors. – Aladejubelo Oluwashina Sep 16 '19 at 18:47
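
For anyone trying to reproduce the situation discussed in these comments, a minimal, self-contained example might look like the sketch below. The path, column names, and partition count are illustrative, and it assumes fastparquet is installed.

import pandas as pd
import dask.dataframe as dask_df

# Synthetic hourly data standing in for the real dataset (column names are made up).
pdf = pd.DataFrame({
    'timestamp': pd.date_range('2018-12-20 19:00:00', periods=500, freq='H'),
    'revenue': range(500),
})

# Writing with Dask records per-row-group min/max statistics in the Parquet
# metadata, which is what `filters` needs in order to skip chunks.
dask_df.from_pandas(pdf, npartitions=4).to_parquet('sample_parquet', engine='fastparquet')

# Read back with a timestamp filter; partitions whose statistics fall entirely
# outside the range can be skipped before any data is loaded.
cutoff = pd.Timestamp('2018-12-30')
ddf = dask_df.read_parquet('sample_parquet', engine='fastparquet',
                           filters=[('timestamp', '>', cutoff)])
print(ddf.compute())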

1 Answer


It turned out that the problem was with the data source I was working with. I tested a different data source, originally written with Dask, and it worked simply as:

filter_timestamp_days = pd.Timestamp(datetime.today() - timedelta(days=days))
filters = [('timestamp', '>', filter_timestamp_days)]
df = dask_df.read_parquet(DATA_DIR, engine='fastparquet', filters=filters)

I did not need to convert filter_timestamp_days any further. The former data source was written with a Scala client, and it seems its metadata is not readable by Dask.
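
If you are stuck with files whose metadata Dask cannot use, a fallback is to read lazily and filter on the column afterwards. This is only a sketch (reusing the same DATA_DIR and a 10-day window): every file is still opened, so there is no partition skipping, but the computed result only contains the matching rows.

import pandas as pd
import dask.dataframe as dask_df
from datetime import datetime, timedelta

cutoff = pd.Timestamp(datetime.today() - timedelta(days=10))

# Lazy read without filter pushdown, followed by a row-wise filter that is
# applied per partition when the result is computed.
ddf = dask_df.read_parquet(DATA_DIR, engine='pyarrow')
df = ddf[ddf['timestamp'] > cutoff].compute()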

Thank you all for your contributions.

  • Thanks for taking the time to share what you've discovered. I find it interesting, and concerning, that there's such a compatibility issue between Scala ↔ Dask ↔ PyArrow ↔ Parquet. – Emre Sevinç Sep 17 '19 at 07:13
  • The problem was that the Scala client writes an empty metadata file with a name like `*_SUCCESS`. I guess that's enough to show that the write process completed successfully, but it lacks the optimizations that come with writing through engines such as `pyarrow` and `fastparquet`, which add metadata containing statistics about partitions that would have been useful in my case. – Aladejubelo Oluwashina Sep 17 '19 at 07:18
  • Spark writes metadata statistics in the Parquet file footer, per the Parquet spec. You can write out Parquet files with Spark and query the metadata with PyArrow if you'd like to check ;) @Emre Sevinç - I'm not sure there is a compatibility issue between Dask / PyArrow / Parquet / Spark. Think they all write Parquet files, per the Parquet spec, so they're interoperable. – Powers Oct 04 '21 at 12:18
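
Following up on the last comment, here is a small sketch of how one might query the footer metadata with PyArrow to check whether a given file actually carries the column statistics that `filters` relies on. The file path below is only a placeholder.

import pyarrow.parquet as pq

# Point this at one of the Parquet files in question (the path is a placeholder).
meta = pq.ParquetFile('data/part.0.parquet').metadata
print(meta.num_row_groups)

# Per-column min/max statistics in the first row group; these are what an
# engine consults to decide whether a row group can be skipped for a filter.
row_group = meta.row_group(0)
for i in range(row_group.num_columns):
    column = row_group.column(i)
    print(column.path_in_schema, column.statistics)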