Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
8
votes
2 answers

Select specific columns in a PySpark dataframe to improve performance

When working with Spark dataframes imported from Hive, I sometimes end up with several columns that I don't need. Suppose I don't want to filter them with df = SqlContext.sql('select cols from mytable') and I'm importing the entire table with…
Ivan
  • 19,560
  • 31
  • 97
  • 141
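
A minimal sketch of the usual answer, assuming the modern SparkSession entry point and hypothetical table/column names: select() prunes the projection, and with columnar sources Spark can push that pruning down so the dropped columns are never read at all.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the whole table, then keep only what's needed; "mytable",
    # "col_a" and "col_b" are hypothetical names.
    df = spark.table("mytable")
    slim = df.select("col_a", "col_b")
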
8
votes
3 answers

Select array element from Spark Dataframes split method in same call?

I'm splitting an HTTP request to look at the elements, and I was wondering if there was a way to specify the element I'd like to look at in the same call without having to do another operation. For example: from pyspark.sql import functions as…
flybonzai
  • 3,763
  • 11
  • 38
  • 72
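
A sketch under the same-call requirement, with a made-up request column: Column.getItem() indexes the array returned by split() inside the same expression.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("GET /index.html HTTP/1.1",)], ["request"])

    # getItem(0) picks the first element of the split array in the same
    # call, so no second select/withColumn pass is needed.
    df.select(F.split(F.col("request"), " ").getItem(0).alias("method")).show()
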
8
votes
0 answers

Spark Application Not Recovering when Executor Lost

I'm running Spark on a standalone cluster: a Python driver application on the same node as the master, plus two worker nodes. The business logic is Python code run by the executors created on the worker nodes. I'm ending up in a situation where…
user481a
  • 123
  • 6
8
votes
1 answer

Spark - Shuffle Read Blocked Time

Lately I've been tuning the performance of some large, shuffle-heavy jobs. Looking at the Spark UI, I noticed an option called "Shuffle Read Blocked Time" under the additional metrics section. This "Shuffle Read Blocked Time" seems to account for…
dayman
  • 680
  • 5
  • 10
8
votes
1 answer

pyspark row number dataframe

I have a dataframe with columns time, a, b, c, d, val. I would like to create a dataframe with an additional column containing each row's number within its group, where (a, b, c, d) is the group key. I tried with Spark SQL, by defining a window…
matlabit
  • 838
  • 2
  • 13
  • 31
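
A sketch of the window-function route, with made-up data matching the described schema:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "x", "y", "z", "w", 1.0), (2, "x", "y", "z", "w", 2.0)],
        ["time", "a", "b", "c", "d", "val"],
    )

    # row_number() numbers the rows inside each (a, b, c, d) partition,
    # ordered by time.
    w = Window.partitionBy("a", "b", "c", "d").orderBy("time")
    df.withColumn("row_num", F.row_number().over(w)).show()
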
8
votes
1 answer

How to properly use pyspark to send data to a Kafka broker?

I'm trying to write a simple pyspark job that would receive data from a Kafka topic, apply some transformation to that data, and put the transformed data on a different Kafka topic. I have the following code, which reads data from a…
Eugene Goldberg
  • 14,286
  • 20
  • 94
  • 167
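
A hedged sketch of one common answer: build the producer inside foreachPartition on the executors, here with the third-party kafka-python package (broker address and topic name are hypothetical):

    from pyspark.sql import SparkSession
    from kafka import KafkaProducer  # kafka-python; not part of PySpark

    spark = SparkSession.builder.getOrCreate()
    transformed = spark.createDataFrame([("hello",), ("world",)], ["value"])

    def send_partition(rows):
        # The producer is created inside the closure so nothing
        # unpicklable is shipped from the driver to the executors.
        producer = KafkaProducer(bootstrap_servers="broker:9092")
        for row in rows:
            producer.send("output-topic", row.value.encode("utf-8"))
        producer.flush()

    transformed.rdd.foreachPartition(send_partition)
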
8
votes
1 answer

PySpark -- Convert List of Rows to Data Frame

The problem I'm actually trying to solve is to take the first/last N rows of a PySpark dataframe and have the result be a dataframe. Specifically, I want to be able to do something like this: my_df.head(20).toPandas() However, because head()…
TuringMachin
  • 391
  • 1
  • 4
  • 10
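
A sketch of the usual workaround: head(n) returns a plain Python list of Row objects, and createDataFrame() can rebuild a DataFrame from that list by reusing the original schema.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    my_df = spark.range(100)  # stand-in for the real dataframe

    # head(20) collects a list of Rows on the driver; createDataFrame
    # turns it back into a DataFrame, so .toPandas() works again.
    top20 = spark.createDataFrame(my_df.head(20), schema=my_df.schema)
    top20.toPandas()

Note that my_df.limit(20) yields the same first-N rows as a DataFrame without collecting them to the driver first.
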
8
votes
1 answer

How to create a z-score in Spark SQL for each group

I have a dataframe which looks like this:

        dSc  TranAmount
    1: 100021      79.64
    2: 100021      79.64
    3: 100021       0.16
    4: 100022      11.65
    5: 100022       0.36
    6: 100022       0.47
    7: 100025       0.17
    8: 100037       0.27
    9: …
Bg1850
  • 3,032
  • 2
  • 16
  • 30
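
A minimal sketch of the window-aggregate answer (sample data abbreviated from the question; single-row groups get a null stddev and hence a null z-score):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(100021, 79.64), (100021, 0.16), (100022, 11.65), (100022, 0.36)],
        ["dSc", "TranAmount"],
    )

    # z-score = (x - group mean) / group stddev, computed per dSc group.
    w = Window.partitionBy("dSc")
    df.withColumn(
        "zscore",
        (F.col("TranAmount") - F.avg("TranAmount").over(w))
        / F.stddev("TranAmount").over(w),
    ).show()
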
8
votes
1 answer

How to serialize a pyspark Pipeline object?

I'm trying to serialize a PySpark Pipeline object so that it can be saved and retrieved later. I tried using the Python pickle library as well as PySpark's PickleSerializer, but the dumps() call itself is failing. Providing the code snippet while…
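
For context, a hedged sketch: pickle fails here because ML objects wrap JVM references, but from Spark 2.0 onward pipelines ship their own save()/load() persistence (the path below is hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer

    spark = SparkSession.builder.getOrCreate()

    # Native ML persistence instead of pickle.
    pipeline = Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])
    pipeline.save("/tmp/my_pipeline")
    restored = Pipeline.load("/tmp/my_pipeline")
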
8
votes
2 answers

pySpark: Save ML Model

Can someone please give an example of how you would save an ML model in pySpark? For ml.classification.LogisticRegressionModel I tried the following: model.save("path") but it does not seem to work.
ml_0x
  • 302
  • 1
  • 3
  • 18
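
A hedged sketch, assuming Spark 2.0+, where pyspark.ml models expose save()/load(); the 1.x Python API lacked this, which matches the failure described (training data and path are made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    train_df = spark.createDataFrame(
        [(Vectors.dense([0.0]), 0.0), (Vectors.dense([1.0]), 1.0)],
        ["features", "label"],
    )

    model = LogisticRegression().fit(train_df)
    model.save("/tmp/lr_model")
    restored = LogisticRegressionModel.load("/tmp/lr_model")
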
8
votes
3 answers

How to make RMSE (root mean square error) small when using ALS in Spark?

I need some suggestions for building a good recommendation model using Spark's collaborative filtering. There is sample code on the official website, which I paste below: from pyspark.mllib.recommendation import ALS,…
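
A sketch of the standard tuning loop: hold out some ratings, sweep rank / iterations / lambda_ (the regularization term usually moves RMSE the most), and score with predictAll. The ratings here are made up, and real tuning would evaluate on a held-out split rather than the training data.

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext.getOrCreate()
    ratings = sc.parallelize(
        [Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)]
    )

    model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.1)

    # RMSE: join predictions with true ratings on the (user, product) key.
    preds = model.predictAll(ratings.map(lambda r: (r.user, r.product))) \
                 .map(lambda r: ((r.user, r.product), r.rating))
    truth = ratings.map(lambda r: ((r.user, r.product), r.rating))
    rmse = truth.join(preds).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean() ** 0.5
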
8
votes
2 answers

Register UDF to SqlContext from Scala to use in PySpark

Is it possible to register a UDF (or function) written in Scala to use in PySpark? E.g.: val mytable = sc.parallelize(1 to 2).toDF("spam") mytable.registerTempTable("mytable") def addOne(m: Integer): Integer = m + 1 // Spam: 1, 2 In Scala, the…
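
A heavily hedged sketch of the usual py4j route, assuming the Scala UDF is compiled into a JAR on the classpath and exposed through a registration helper; every name below is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumes a Scala helper on the classpath, e.g.
    #   object MyUdfs { def register(s: SparkSession): Unit =
    #     s.udf.register("addOne", (m: Int) => m + 1) }
    # Trigger the registration through py4j, then call the UDF from SQL.
    spark.sparkContext._jvm.com.example.MyUdfs.register(spark._jsparkSession)
    spark.range(1, 3).toDF("spam").createOrReplaceTempView("mytable")
    spark.sql("SELECT addOne(spam) FROM mytable").show()
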
8
votes
1 answer

Jupyter & PySpark: How to run multiple notebooks

I am using Spark 1.6.0 on three VMs, 1x Master (standalone), 2x workers w/ 8G RAM, 2CPU each. I am using the kernel configuration below: { "display_name": "PySpark ", "language": "python3", "argv": [ "/usr/bin/python3", "-m", …
pltrdy
  • 2,069
  • 1
  • 11
  • 29
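
A hedged sketch of the usual fix: on a standalone cluster the first notebook's SparkContext grabs every core by default, leaving later notebooks waiting for resources, so capping spark.cores.max (and executor memory) per kernel lets several contexts coexist (master URL and values are hypothetical):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://master:7077")
            .setAppName("notebook-1")
            .set("spark.cores.max", "2")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)
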
8
votes
0 answers

PySpark and PDB don't seem to mix

I'm building standalone Python programs that will use pyspark (and the elasticsearch-hadoop connector). I am also addicted to the Python Debugger (PDB) and want to be able to step through my code. It appears I can't run pyspark with the PDB like I…
cybergoof
  • 1,407
  • 3
  • 16
  • 25
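
For what it's worth, a sketch of why the two don't mix: PDB can step through driver-side code (especially in local mode), but functions shipped to executors run in separate worker processes with no terminal attached, so set_trace() cannot stop there. The input path is hypothetical.

    import pdb
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def clean(line):
        # Runs inside executor processes; pdb.set_trace() here would
        # have no terminal to attach to.
        return line.strip()

    rdd = sc.textFile("data.txt")
    pdb.set_trace()  # works: this line still executes on the driver
    print(rdd.map(clean).collect())  # collect() brings results back for inspection
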
8
votes
5 answers

PySpark using IAM roles to access S3

I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket. This is fine when using boto (as it's part of the API), but I can't find a…
Nick
  • 93
  • 1
  • 7
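
A hedged sketch of one common workaround: assume the role with boto3/STS on the driver and hand the temporary credentials to the s3a connector (role ARN and bucket are hypothetical, and hadoop-aws must be on the classpath):

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    creds = boto3.client("sts").assume_role(
        RoleArn="arn:aws:iam::123456789012:role/my-role",
        RoleSessionName="pyspark-session",
    )["Credentials"]

    # Point s3a at the temporary credentials from the assumed role.
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.aws.credentials.provider",
              "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    hconf.set("fs.s3a.access.key", creds["AccessKeyId"])
    hconf.set("fs.s3a.secret.key", creds["SecretAccessKey"])
    hconf.set("fs.s3a.session.token", creds["SessionToken"])

    df = spark.read.text("s3a://my-bucket/path/")
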