I would like to read data from DataTap using CPython.

In Spark, I can do something like:

df = spark.read.csv("dtap://MaprClus2/tmp/airline-safety.csv")

How can I do the same with plain CPython, for example when I don't have a PySpark Jupyter kernel?

Chris Snow

1 Answer

One option is to use a subprocess to call out to the hadoop cli command:

from subprocess import check_output
import pandas as pd
from io import BytesIO

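def hdfs_read(fpath):
    """Return the contents of fpath as an in-memory bytes buffer."""
    # check_output raises CalledProcessError if the hadoop command fails
    out = check_output(['hadoop', 'fs', '-cat', fpath])
    return BytesIO(out)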

data = hdfs_read("dtap://MaprClus2/tmp/airline-safety.csv")

# The first line of the output is a Hadoop CLI warning, so skip it
df = pd.read_csv(data, sep=",", skiprows=1)
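
If the file is large, or pandas is not installed, a streaming variant of the same idea may be useful. The following is only a sketch using the standard library: the name `stream_csv_rows` and the `skip_lines` parameter are my own and not part of any Hadoop or DataTap API, and it assumes the output is UTF-8 text. It reads the command's stdout row by row instead of buffering the whole file in memory.

```python
import csv
import io
import subprocess


def stream_csv_rows(cmd, skip_lines=0):
    """Run cmd and yield each CSV row from its stdout as a list of strings.

    cmd is an argument list, e.g. ['hadoop', 'fs', '-cat', path].
    skip_lines drops leading non-CSV lines (such as a CLI warning).
    Both names are hypothetical helpers for this sketch.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        # newline="" lets the csv module handle line endings itself
        text = io.TextIOWrapper(proc.stdout, encoding="utf-8", newline="")
        for _ in range(skip_lines):
            text.readline()
        yield from csv.reader(text)
    finally:
        # Always release the pipe and reap the child process
        proc.stdout.close()
        returncode = proc.wait()
    # Only reached when the output was read to the end
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
```

It could be called as `stream_csv_rows(['hadoop', 'fs', '-cat', 'dtap://MaprClus2/tmp/airline-safety.csv'], skip_lines=1)`. Whether a warning line needs skipping probably depends on your environment, so check the raw output first.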
Chris Snow
  • That's a valid answer. Is there any more pythonic solution? Any package that supports dtap reads in python? – NightFurry Apr 02 '21 at 13:04
  • Yes - using PyArrow and Pydoop. The setup is manual at the moment, however, I'm hoping the next release will automate the setup. I'll post a new answer if that happens. – Chris Snow Apr 03 '21 at 07:29