Write Parquet MAP datatype by PyArrow

Question

I'm writing in Python and would like to use PyArrow to generate Parquet files.

As I understand it, and according to the Implementation Status page, the C++ (Python) library already implements the MAP type. The Data Types documentation also lists the type map_(key_type, item_type[, keys_sorted]).

So I tried several different approaches in Python/PyArrow, but all of them failed.

For example:

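import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

FILE_NAME = 'rand_gen_test_map.parquet'

# Each map value is given as a list of (key, value) tuples
df = pd.DataFrame({
    'col1': pd.Series([
        [('key', 'aaaa'), ('value', '1111')],
        [('key', 'bbbb'), ('value', '2222')],
    ]),
    'col2': pd.Series(['foo', 'bar']),
})

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, FILE_NAME)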

When I read the file with parquet-tools cat rand_gen_test_map.parquet, I got:

col1:
.key_value:
.key_value:
col2 = foo

col1:
.key_value:
.key_value:
col2 = bar

It seems that the map values are not written correctly (or are missing), although the schema is correct:

message schema {
  optional group col1 (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional binary value (UTF8);
    }
  }
  optional binary col2 (UTF8);
}

In summary, I have two questions (both about Python):

1. What is the best way to generate Parquet files with the MAP datatype? (It would be great if an example could be attached.)
2. I understand that we can use a STRUCT to mimic a map structure, but since Parquet provides the MAP type, we would still like to use it. If MAP data can't be generated, what is the reason for providing a MAP type?

score 1 · Answer 1 · answered Oct 21 '20 at 17:59

There was a bug in writing map types. It should be fixed in pyarrow 2.0, which also supports reading map types natively.

Micah Kornfield


Thank you @micah-kornfield, the new version did indeed solve the issue for me. Much appreciated! – Holger Nösekabel Oct 22 '20 at 10:32
Note that there were also some bugs in 2.0, unfortunately; 3.0 hopefully has them all fixed. – Micah Kornfield Feb 11 '21 at 08:00