I have created a Cloudera 5.x cluster with the Spark option enabled.

I would like to run a simple test using PySpark to read data from one DataTap and write it to another DataTap.

What are the steps for doing this with PySpark?

Chris Snow

1 Answer

For this example, I'm going to use the TenantStorage DataTap that is created by default for my tenant.

I've uploaded a dataset from https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv

Next, locate the controller node and ssh into it.

Because the tenant is set up with the default Cluster Superuser Privileges (Site Admin and Tenant Admin), I can download the tenant ssh key from the cluster page and use that to ssh into the controller node:

ssh bluedata@x.x.x.x -p 10007 -i ~/Downloads/BD_Demo\ Tenant.pem

Here, x.x.x.x is the public IP address of my BlueData gateway.

Note that we are connecting to port 10007, which is the port of the controller.

Run pyspark:

$ pyspark --master yarn --deploy-mode client --packages com.databricks:spark-csv_2.10:1.4.0
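
Before using the CSV package, a quick way to confirm that the DataTap is reachable at all is to read the file as plain text with the core RDD API (sc is the SparkContext that pyspark creates for you; this assumes the dtap:// scheme is wired into the cluster's Hadoop configuration, which BlueData sets up for you):

>>> sc.textFile('dtap://TenantStorage/airline-safety.csv').first()

This should print the CSV header line.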

Access the datafile and retrieve the first record:

>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('dtap://TenantStorage/airline-safety.csv')
>>> df.take(1)

The result is:

[Row(airline=u'Aer Lingus', avail_seat_km_per_week=320906734, incidents_85_99=2, fatal_accidents_85_99=0, fatalities_85_99=0, incidents_00_14=0, fatal_accidents_00_14=0, fatalities_00_14=0)]
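
To double-check what inferschema='true' actually inferred, the standard DataFrame introspection calls work here too:

>>> df.printSchema()
>>> df.count()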

If you want to read the data from one DataTap, process it, and save it to another DataTap, it would look something like this:

>>> df_filtered = df.filter(df.incidents_85_99 == 0)
>>> df_filtered.write.parquet('dtap://OtherDataTap/airline-safety_zero_incidents.parquet')
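
To verify the write, you can read the Parquet data back from the second DataTap and compare row counts (a quick sanity check; OtherDataTap is just a placeholder for whatever your second DataTap is named):

>>> df_check = sqlContext.read.parquet('dtap://OtherDataTap/airline-safety_zero_incidents.parquet')
>>> df_check.count() == df_filtered.count()

Once the interactive session works, the whole flow can also be submitted as a script. A minimal sketch (the file name dtap_copy.py and the app name are my own inventions; everything else just consolidates the steps above):

# dtap_copy.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName='dtap-copy-test')
sqlContext = SQLContext(sc)

# Read the CSV from the first DataTap
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('dtap://TenantStorage/airline-safety.csv')

# Keep rows with zero incidents in 1985-1999 and write Parquet to the second DataTap
df.filter(df.incidents_85_99 == 0) \
  .write.parquet('dtap://OtherDataTap/airline-safety_zero_incidents.parquet')

sc.stop()

Run it with the same package flag as before:

$ spark-submit --master yarn --deploy-mode client --packages com.databricks:spark-csv_2.10:1.4.0 dtap_copy.py
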
Chris Snow