When working with Spark DataFrames imported from Hive, I sometimes end up with several columns that I don't need. Supposing that I don't want to filter them with
df = sqlContext.sql('select cols from mytable')
and I'm importing the entire table with…
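A minimal sketch of one way to prune them after the fact, assuming a SQLContext bound to sqlContext and hypothetical column names; drop() returns a new DataFrame without the given column (in Spark 1.x it takes one column per call, so chain it):

df = sqlContext.table("mytable")
# drop() is non-destructive: it returns a new DataFrame without the column
df_trimmed = df.drop("col_a").drop("col_b")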
I'm splitting an HTTP request to look at its elements, and I was wondering if there is a way to specify the element I'd like to look at in the same call, without having to do another operation.
For example:
from pyspark.sql import functions as…
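For reference, a sketch of indexing the split result in the same expression, with a hypothetical request column holding strings like "GET /index.html HTTP/1.1"; split() yields an array column, and getItem(n) selects element n in the same call:

from pyspark.sql import functions as F

# take the nth element of the split without a separate step
df = df.withColumn("method", F.split(F.col("request"), " ").getItem(0))
# bracket indexing on the resulting Column also works
df = df.withColumn("path", F.split(F.col("request"), " ")[1])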
I'm running Spark in a standalone cluster: a Python driver application on the same node as the master, plus 2 worker nodes. The business logic is Python code run by the executors created on the worker nodes.
I'm ending up in a situation where…
Lately I've been tuning the performance of some large, shuffle-heavy jobs. Looking at the Spark UI, I noticed an option called "Shuffle Read Blocked Time" under the additional metrics section.
This "Shuffle Read Blocked Time" seems to account for…
I have a dataframe with columns time, a, b, c, d, val.
I would like to create a dataframe with an additional column that contains the row number of each row within its group, where (a, b, c, d) is the group key.
I tried with Spark SQL, by defining a window…
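A sketch of that window, assuming Spark 1.6+ (where the function is named row_number; older releases call it rowNumber) and ordering within each group by time:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# partition by the group key, order within each group, number the rows
w = Window.partitionBy("a", "b", "c", "d").orderBy("time")
df_numbered = df.withColumn("row_num", F.row_number().over(w))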
I'm trying to write a simple PySpark job that receives data from a Kafka topic, does some transformation on that data, and puts the transformed data on a different Kafka topic.
I have the following code, which reads data from a…
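A sketch of the overall shape with the DStream API, under these assumptions: the spark-streaming-kafka package is on the classpath, kafka-python is installed on the executors, and the broker and topic names are placeholders. Spark ships no Kafka sink for DStreams, so each partition writes with a plain producer:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-transform")
ssc = StreamingContext(sc, batchDuration=5)

stream = KafkaUtils.createDirectStream(
    ssc, ["input_topic"], {"metadata.broker.list": "broker:9092"})

def publish_partition(records):
    # one producer per partition; records are (key, value) pairs
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="broker:9092")
    for _, value in records:
        producer.send("output_topic", value.upper().encode("utf-8"))
    producer.flush()

stream.foreachRDD(lambda rdd: rdd.foreachPartition(publish_partition))

ssc.start()
ssc.awaitTermination()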
The problem I'm actually trying to solve is to take the first/last N rows of a PySpark dataframe and have the result be a dataframe. Specifically, I want to be able to do something like this:
my_df.head(20).toPandas()
However, because head()…
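A sketch of the usual workaround: limit() returns a DataFrame (unlike head(), which returns a list of Rows), so it composes with toPandas(); for the last N rows there is no built-in, but inverting the sort gets there, assuming some ordering column (here a hypothetical time):

first_20 = my_df.limit(20).toPandas()
# last N: invert the ordering, then take the first N
last_20 = my_df.orderBy(my_df["time"].desc()).limit(20).toPandas()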
I'm trying to serialize a PySpark Pipeline object so that it can be saved and retrieved later. I tried using the Python pickle library as well as PySpark's PickleSerializer, but the dumps() call itself fails.
Providing the code snippet while…
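For context, a sketch of the ML persistence route, assuming Spark 2.x: pipeline stages wrap JVM objects through Py4J, which is why pickle's dumps() fails, and the save/load API writes to a path instead (the pipeline, model, and path names below are placeholders):

from pyspark.ml import Pipeline, PipelineModel

pipeline.write().overwrite().save("/tmp/my_pipeline")
loaded_pipeline = Pipeline.load("/tmp/my_pipeline")

# a fitted pipeline round-trips through PipelineModel
model.write().overwrite().save("/tmp/my_model")
loaded_model = PipelineModel.load("/tmp/my_model")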
Can someone please give an example of how you would save an ML model in PySpark?
For
ml.classification.LogisticRegressionModel
I tried the following:
model.save("path")
but it does not seem to work.
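A sketch of what's version-dependent here: in Spark 2.0+ the ml API supports persistence directly, while in Spark 1.x ml models cannot be saved and only the mllib counterpart can, with the SparkContext as the first argument:

# Spark 2.0+
model.write().overwrite().save("path")
from pyspark.ml.classification import LogisticRegressionModel
loaded = LogisticRegressionModel.load("path")

# Spark 1.x, mllib variant only:
# model.save(sc, "path")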
I need some suggestions for building a good recommendation model using Spark's collaborative filtering. There is sample code on the official website; I've pasted it below:
from pyspark.mllib.recommendation import ALS,…
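For reference, a sketch along the lines of the official MLlib example: ratings are (user, product, rating) triples, and rank plus the iteration count are the main knobs to tune; the file path and parameter values are placeholders:

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')) \
              .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

model = ALS.train(ratings, rank=10, iterations=10)

# evaluate with mean squared error on the training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()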
Is it possible to register a UDF (or function) written in Scala and use it in PySpark?
E.g.:
val mytable = sc.parallelize(1 to 2).toDF("spam")
mytable.registerTempTable("mytable")
def addOne(m: Integer): Integer = m + 1
// Spam: 1, 2
In Scala, the…
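One known pattern (not an official API), sketched here with placeholder package names: expose a registration helper from the jar and invoke it through the Py4J gateway, then call the UDF from SQL on the Python side. Note that _ssql_ctx is an internal, unstable handle to the underlying Scala SQLContext:

# Scala side, shipped with --jars:
#   object Functions {
#     def addOne(m: Integer): Integer = m + 1
#     def register(ctx: org.apache.spark.sql.SQLContext): Unit =
#       ctx.udf.register("addOne", addOne _)
#   }
sc._jvm.com.example.Functions.register(sqlContext._ssql_ctx)
sqlContext.sql("SELECT addOne(spam) FROM mytable").show()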
I am using Spark 1.6.0 on three VMs: 1x master (standalone), 2x workers with 8 GB RAM and 2 CPUs each.
I am using the kernel configuration below:
{
  "display_name": "PySpark ",
  "language": "python3",
  "argv": [
    "/usr/bin/python3",
    "-m",
    …
I'm building standalone Python programs that will use PySpark (and the elasticsearch-hadoop connector). I am also addicted to the Python Debugger (PDB) and want to be able to step through my code.
It appears I can't run pyspark with the PDB like I…
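A sketch of the workaround I'd expect to apply, assuming pyspark is importable as a plain module (pip-installed, or SPARK_HOME/python on PYTHONPATH): create the SparkContext inside an ordinary script so python -m pdb works. Breakpoints only fire in driver-side code, since functions shipped to executors run in separate worker processes that pdb cannot attach to:

import pdb
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("debuggable-job").setMaster("local[2]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10))
pdb.set_trace()  # inspect driver-side state here
print(rdd.map(lambda x: x * 2).collect())
sc.stop()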
I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket. This is fine when using boto (as it's part of the API), but I can't find a…
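A sketch of one approach, assuming hadoop-aws with the s3a connector (2.8+, for session-token support) is on the classpath: assume the role via STS on the driver, then hand the temporary credentials to the Hadoop configuration; the role ARN and bucket are placeholders:

import boto3

creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-bucket-role",
    RoleSessionName="pyspark")["Credentials"]

hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", creds["AccessKeyId"])
hconf.set("fs.s3a.secret.key", creds["SecretAccessKey"])
hconf.set("fs.s3a.session.token", creds["SessionToken"])
hconf.set("fs.s3a.aws.credentials.provider",
          "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")

df = sqlContext.read.json("s3a://my-bucket/path/")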