How can I read a CSV into a DataFusion DataFrame with datafusion-python?

Here's what I have so far:

import datafusion

ctx = datafusion.SessionContext()

I couldn't find any instructions in the docs.

I am using DataFusion v0.6.0.

Powers

1 Answer

There is some documentation here: https://github.com/apache/arrow-datafusion/blob/master/docs/source/python/index.rst

Here is one of the examples:

import datafusion
from datafusion import functions as f
from datafusion import col
import pyarrow

# create a context
ctx = datafusion.SessionContext()

# register a CSV
ctx.register_csv('example', 'example.csv')

# create a new statement via SQL
df = ctx.sql("SELECT a+b, a-b FROM example")

# execute and collect the first (and only) batch
result = df.collect()[0]

assert result.column(0) == pyarrow.array([5, 7, 9])
assert result.column(1) == pyarrow.array([-3, -3, -3])
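
The example above assumes that an example.csv with integer columns a and b already exists; the two assertions imply a = 1, 2, 3 and b = 4, 5, 6. Here is a minimal sketch that writes such a file and, since the question asks for a DataFrame rather than a SQL result, returns the registered table as one. The helper names write_example_csv and csv_to_dataframe are made up for illustration, and the use of SessionContext.table() is an assumption about your installed version, so check that it is available before relying on it.

```python
import csv
import os
import tempfile


def write_example_csv(path):
    """Write the three rows implied by the assertions in the example."""
    rows = [(1, 4), (2, 5), (3, 6)]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["a", "b"])  # header row becomes the column names
        writer.writerows(rows)
    return path


def csv_to_dataframe(path, table_name="example"):
    """Register a CSV file and return it as a DataFusion DataFrame.

    Assumption: SessionContext.table() exists in the installed version.
    """
    import datafusion  # imported lazily so the CSV helper works without it

    ctx = datafusion.SessionContext()
    ctx.register_csv(table_name, path)
    return ctx.table(table_name)


# Write the file to a temporary directory and read it back as a sanity check.
csv_path = write_example_csv(os.path.join(tempfile.mkdtemp(), "example.csv"))
with open(csv_path, newline="") as handle:
    written_rows = list(csv.reader(handle))
```

With datafusion installed, csv_to_dataframe(csv_path) should give you a DataFrame that you can pass to collect() or show(), in the same way as the df returned by ctx.sql().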

There is work underway to move the documentation to the datafusion-python repo (see https://github.com/apache/arrow-datafusion/issues/2866).

Andy Grove
  • Andy, attempting to read an 'example.csv.gz' file results in '++++' instead of a DataFrame. Is this a DataFusion or Arrow issue? Thx – Frank Jan 06 '23 at 19:10
  • Andy, when I use the above code in a VS Code notebook or a Jupyter notebook, the `df.show()` command shows nothing in the notebook; it prints to the Jupyter console instead. Is this a notebook issue or a DataFusion issue? Thanks – Frank Jan 06 '23 at 19:42
  • No, I was not and no answer from Andy either. If you need this functionality, please use polars instead. – Frank Jan 17 '23 at 16:20