
I converted a sample DataFrame to a .arrow file using PyArrow:

import numpy as np
import pandas as pd
import pyarrow as pa

# Build a small DataFrame and write it out as an Arrow IPC file
df = pd.DataFrame({"a": [10, 2, 3]})
df['a'] = pd.to_numeric(df['a'], errors='coerce')
table = pa.Table.from_pandas(df)
writer = pa.RecordBatchFileWriter('test.arrow', table.schema)
writer.write_table(table)
writer.close()

This creates a file `test.arrow`.

df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 1 columns):
    a    3 non-null int64
    dtypes: int64(1)
    memory usage: 104.0 bytes
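
To rule out a problem on the writing side, the file can also be read back with PyArrow itself. This is a minimal sketch, assuming the same `test.arrow` produced above; `pa.ipc.open_file` returns a `RecordBatchFileReader`:

import pyarrow as pa

# Re-open the Arrow IPC file written above and load it as a single Table
reader = pa.ipc.open_file('test.arrow')
table_back = reader.read_all()
print(table_back.num_rows)     # expected: 3
print(table_back.to_pandas())  # should match the original DataFrame

If this round-trips correctly, the file itself is valid and the problem is on the reading side.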

Then in Node.js I load the file with the Arrow JS library (https://arrow.apache.org/docs/js/):

const fs = require('fs');
const arrow = require('apache-arrow');

const data = fs.readFileSync('test.arrow');
const table = arrow.Table.from(data);

console.log(table.schema.fields.map(f => f.name));
console.log(table.count());
console.log(table.get(0));

This prints:

[ 'a' ]
0
null

I was expecting the table to have length 3 and `table.get(0)` to return the first row instead of null.

This is what the table schema looks like (`console.log(table._schema)`):

[ Int_ [Int] { isSigned: true, bitWidth: 16 } ]
Schema {
  fields:
   [ Field { name: 'a', type: [Int_], nullable: true, metadata: Map {} } ],
  metadata:
   Map {
     'pandas' => '{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 5, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int16", "numpy_type": "int16", "metadata": null}], "creator": {"library": "pyarrow", "version": "0.15.0"}, "pandas_version": "0.22.0"}' },
  dictionaries: Map {} }

Any idea why it is not getting the data as expected?

Sarath
  • Can you write to one of the project's mailing lists or open a JIRA issue? We can help you more there – Wes McKinney Oct 10 '19 at 14:29
  • @WesMcKinney I am not on the mailing list; please give me the link to JIRA and I can create an issue. – Sarath Oct 10 '19 at 19:10
  • Looking at the doc for arrowJS it looks like you have to do `const table = arrow.Table.from([data]);` – 0x26res Oct 16 '19 at 08:32
  • @WesMcKinney I think this is a regression in pyarrow 0.15. I can reproduce this with pyarrow `0.15.0-py37h8b68381_0` on conda-forge, but rolling back to `0.14.1-py37h8b68381_2` works. – Joe Quigley Oct 16 '19 at 20:25
  • Raised a JIRA ticket for this: https://issues.apache.org/jira/browse/ARROW-6921 – Joe Quigley Oct 17 '19 at 17:15

1 Answer

This is due to a format change in Arrow 0.15, as mentioned by Wes on the Apache JIRA. This means that all Arrow libraries, not just PyArrow, will surface this issue when sending IPC files to older versions of Arrow. The fix is to upgrade ArrowJS to 0.15.0, so that you can round-trip between other Arrow libraries and the JS library. If you can't update for some reason, you can instead use one of the workarounds below:

Pass use_legacy_format=True as a kwarg to RecordBatchFileWriter:

with pa.RecordBatchFileWriter('file.arrow', table.schema, use_legacy_format=True) as writer:
    writer.write_table(table)

Set the environment variable ARROW_PRE_0_15_IPC_FORMAT to 1:

$ export ARROW_PRE_0_15_IPC_FORMAT=1
$ python
>>> import pyarrow as pa
>>> table = pa.Table.from_pydict( {"a": [1, 2, 3], "b": [4, 5, 6]} )
>>> with pa.RecordBatchFileWriter('file.arrow', table.schema) as writer:
...   writer.write_table(table)
...
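
Assuming PyArrow checks this environment variable when the writer is created, it should also be possible to set it from inside Python rather than in the shell. A sketch of that approach (setting the variable before importing pyarrow keeps things safe):

import os

# Must be set before the Arrow IPC writer is created; this assumes pyarrow
# consults ARROW_PRE_0_15_IPC_FORMAT at writer-creation time
os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'

import pyarrow as pa

table = pa.Table.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
with pa.RecordBatchFileWriter('file.arrow', table.schema) as writer:
    writer.write_table(table)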

Or downgrade PyArrow to 0.14.x:

$ conda install -c conda-forge pyarrow=0.14.1
Joe Quigley