
I use Docker with several containers: one for JupyterLab, one for Spark, and three for the ELK stack (Elasticsearch, Kibana and Logstash).

I also use sparkmagic in my Jupyter notebooks.

What I'm trying to do is send the output of a cell to Spark and then use it to create a Spark DataFrame.

First of all, I created a Python script that uses pandas to analyze an Excel file (sys.argv[1] is my Excel file and sys.argv[2] is my sheet's name) and return some data (in my case the data is stored in a dict).

Here is my Python code:

import pandas as pd
import numpy as np
import json
import sys

def prct_KPY():
    # Build a dict of percentages from the first data row of the sheet
    perct_dep = {}
    perct_dep['val1'] = round(df.iloc[0, 1] * 100)
    perct_dep['val2'] = round(df.iloc[0, 2] * 100)
    perct_dep['val3'] = round(df.iloc[0, 3] * 100)
    perct_dep['val4'] = round(df.iloc[0, 4] * 100)
    return perct_dep

# sys.argv[1] is the Excel file path, sys.argv[2] is the sheet name
df = pd.read_excel(sys.argv[1], sys.argv[2], skiprows=50)
var = prct_KPY()
print(var)

This Python code is stored in a file named "test.py".

Afterwards, I want to use this dict as an argument to build a Spark DataFrame (and then send it to my Elasticsearch).

So I call my script with this code in a notebook cell:

%%!
python3 test.py "Path_Of_My_Excel_File" "Name_Of_My_Sheet"

and I get this output:

["{'val1': 96, 'val2': 94, 'val3': 96, 'val4': 96}", '']

This is the object's type: .
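For reference, the first element of that list can be turned back into a Python dict in a local cell, for example with ast.literal_eval (this parsing step is just a sketch, not part of my script):

import ast

raw = _                         # output of the %%! cell above
kpi = ast.literal_eval(raw[0])  # -> {'val1': 96, 'val2': 94, 'val3': 96, 'val4': 96}
print(kpi['val1'])              # 96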

I can use the result with "_" in another cell, but when I try to use it in a Spark cell, it doesn't work! I get this error message:

An error was encountered: name '' is not defined Traceback (most recent call last): NameError: name '' is not defined

How can I send this output to a Spark cell?

Thanks for any help!

hlas95

2 Answers


Is there a reason you can't do all of this in a single cell? As long as the version of Python your PySpark job is using has access to pandas, this should technically be possible.

If you can do this, it would be a lot easier. You can just use the SparkSession.createDataFrame function, which can take a pandas DataFrame and give you a Spark DataFrame back.

http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
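For example, a minimal sketch of that approach, run directly inside the Spark session (the file path and sheet name are placeholders, and `spark` is the SparkSession provided by the session):

import pandas as pd

# Read the Excel file with pandas, just like in test.py
pdf = pd.read_excel("Path_Of_My_Excel_File", "Name_Of_My_Sheet", skiprows=50)

# Convert the pandas DataFrame into a Spark DataFrame
sdf = spark.createDataFrame(pdf)
sdf.show()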

If not, you might try pickling the pandas DataFrame, then pulling that into the Spark side, unpickling it, and doing the same as above. I'm not familiar with SparkMagic at all, so I don't know the specifics of referencing previous outputs, but as long as that works, this should work as well.
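Something along these lines, assuming the pickle file ends up on a path that both the local machine and the Spark cluster can read (the path below is a placeholder):

import pickle
import pandas as pd

# Locally: analyze the Excel file and pickle the pandas DataFrame
df = pd.read_excel("Path_Of_My_Excel_File", "Name_Of_My_Sheet", skiprows=50)
with open("/tmp/kpi_df.pkl", "wb") as f:
    pickle.dump(df, f)

# On the Spark side: unpickle it and convert, as above
with open("/tmp/kpi_df.pkl", "rb") as f:
    pdf = pickle.load(f)
sdf = spark.createDataFrame(pdf)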

Dave McGinnis

Sparkmagic has some specifics since it works with a remote Spark context. In your case, what you basically need to do is use the Sparkmagic magic command %%send_to_spark. Please refer to the example here.

Please note the warning: this example assumes that the (Py)Spark cluster and your local machine have the same Python package versions.
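A minimal sketch of that workflow, assuming the dict is rebuilt in a local cell first (the variable names are placeholders, not from the question). In a local cell:

%%local
import json
kpi_json = json.dumps({'val1': 96, 'val2': 94, 'val3': 96, 'val4': 96})

Then push it to the Spark session:

%%send_to_spark -i kpi_json -t str -n kpi_json

And finally use it in a regular (Spark) cell:

import json
from pyspark.sql import Row

kpi = json.loads(kpi_json)
df = spark.createDataFrame([Row(**kpi)])
df.show()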

  • Hi, I've already tried to use "send_to_spark" but I get the following error: UsageError: Cell magic `%%local` not found. My Spark cluster and my Jupyter have the same Python package versions. – hlas95 Dec 17 '19 at 10:32
  • Could you share the example so I can reproduce your issue? And please share the Sparkmagic version you're using. – Aliaksandr Sasnouskikh Dec 17 '19 at 15:06
  • Hi. I fixed the problem. The issue was my notebook's kernels! PySpark wasn't up. – hlas95 Dec 18 '19 at 10:55