
I'm writing in Python and would like to use PyArrow to generate Parquet files.

As I understand it from the Implementation Status page, the C++ library (which PyArrow wraps) already implements the MAP type. The Data Types documentation also lists the type map_(key_type, item_type[, keys_sorted]).

I tried several different approaches in Python/PyArrow, but all of them failed.

E.g.:

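import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

FILE_NAME = 'rand_gen_test_map.parquet'

# Each row of col1 is a list of (key, value) tuples, i.e. one map per row.
df = pd.DataFrame({
    'col1': pd.Series([
        [('key', 'aaaa'), ('value', '1111')],
        [('key', 'bbbb'), ('value', '2222')],
    ]),
    'col2': pd.Series(['foo', 'bar']),
})

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, FILE_NAME)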

When I read the file with parquet-tools cat rand_gen_test_map.parquet, I got:

col1:
.key_value:
.key_value:
col2 = foo

col1:
.key_value:
.key_value:
col2 = bar

It seems that the map values are not written correctly (or are missing), even though the schema is correct:

message schema {
  optional group col1 (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional binary value (UTF8);
    }
  }
  optional binary col2 (UTF8);
}

So I have two questions (both about Python):

  1. What is the best way to generate Parquet files with the MAP data type? (It would be great if an example could be attached.)

  2. I understand that we can use a STRUCT to mimic a map structure, but since Parquet provides the MAP type, we would still like to use it. If MAP data can't be generated, what is the reason for providing a MAP type?

Yucan

1 Answer

There was a bug in writing map types. This should be fixed in pyarrow 2.0 (reading map columns is now also supported natively).

Micah Kornfield