Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
8
votes
2 answers

Select specific columns in a PySpark dataframe to improve performance

When working with Spark dataframes imported from Hive, I sometimes end up with several columns that I don't need. Suppose I don't want to filter them with df = SqlContext.sql('select cols from mytable') and I'm importing the entire table with…
Ivan
  • 19,560
  • 31
  • 97
  • 141
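
A minimal sketch of the usual answer, assuming the modern SparkSession entry point and hypothetical table/column names: select() prunes the projection, and with columnar sources Spark can push that pruning down so the dropped columns are never read at all.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the whole table, then keep only what's needed; "mytable",
    # "col_a" and "col_b" are hypothetical names.
    df = spark.table("mytable")
    slim = df.select("col_a", "col_b")
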
8
votes
3 answers

Select array element from Spark Dataframes split method in same call?

I'm splitting an HTTP request to look at the elements, and I was wondering if there was a way to specify the element I'd like to look at in the same call without having to do another operation. For example: from pyspark.sql import functions as…
flybonzai
  • 3,763
  • 11
  • 38
  • 72
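
A sketch under the same-call requirement, with a made-up request column: Column.getItem() indexes the array returned by split() inside the same expression.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("GET /index.html HTTP/1.1",)], ["request"])

    # getItem(0) picks the first element of the split array in the same
    # call, so no second select/withColumn pass is needed.
    df.select(F.split(F.col("request"), " ").getItem(0).alias("method")).show()
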
8
votes
0 answers

Spark Application Not Recovering when Executor Lost

I'm running Spark on a standalone cluster: a Python driver application on the same node as the master, plus two worker nodes. The business logic is Python code run by the executors created on the worker nodes. I'm ending up in a situation where…
user481a
  • 123
  • 6
8
votes
1 answer

Spark - Shuffle Read Blocked Time

Lately I've been tuning the performance of some large, shuffle-heavy jobs. Looking at the Spark UI, I noticed an option called "Shuffle Read Blocked Time" under the additional metrics section. This "Shuffle Read Blocked Time" seems to account for…
dayman
  • 680
  • 5
  • 10
8
votes
1 answer

pyspark row number dataframe

I have a dataframe with columns time, a, b, c, d, val. I would like to create a dataframe with an additional column containing each row's number within its group, where (a, b, c, d) is the group key. I tried with Spark SQL, by defining a window…
matlabit
  • 838
  • 2
  • 13
  • 31
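
A sketch of the window-function route, with made-up data matching the described schema:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "x", "y", "z", "w", 1.0), (2, "x", "y", "z", "w", 2.0)],
        ["time", "a", "b", "c", "d", "val"],
    )

    # row_number() numbers the rows inside each (a, b, c, d) partition,
    # ordered by time.
    w = Window.partitionBy("a", "b", "c", "d").orderBy("time")
    df.withColumn("row_num", F.row_number().over(w)).show()
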
8
votes
1 answer

How to properly use pyspark to send data to a Kafka broker?

I'm trying to write a simple pyspark job that would receive data from a Kafka topic, apply some transformation to that data, and put the transformed data on a different Kafka topic. I have the following code, which reads data from a…
Eugene Goldberg
  • 14,286
  • 20
  • 94
  • 167
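
A hedged sketch of one common answer: build the producer inside foreachPartition on the executors, here with the third-party kafka-python package (broker address and topic name are hypothetical):

    from pyspark.sql import SparkSession
    from kafka import KafkaProducer  # kafka-python; not part of PySpark

    spark = SparkSession.builder.getOrCreate()
    transformed = spark.createDataFrame([("hello",), ("world",)], ["value"])

    def send_partition(rows):
        # The producer is created inside the closure so nothing
        # unpicklable is shipped from the driver to the executors.
        producer = KafkaProducer(bootstrap_servers="broker:9092")
        for row in rows:
            producer.send("output-topic", row.value.encode("utf-8"))
        producer.flush()

    transformed.rdd.foreachPartition(send_partition)
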
8
votes
1 answer

PySpark -- Convert List of Rows to Data Frame

The problem I'm actually trying to solve is to take the first/last N rows of a PySpark dataframe and have the result be a dataframe. Specifically, I want to be able to do something like this: my_df.head(20).toPandas() However, because head()…
TuringMachin
  • 391
  • 1
  • 4
  • 10
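
A sketch of the usual workaround: head(n) returns a plain Python list of Row objects, and createDataFrame() can rebuild a DataFrame from that list by reusing the original schema.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    my_df = spark.range(100)  # stand-in for the real dataframe

    # head(20) collects a list of Rows on the driver; createDataFrame
    # turns it back into a DataFrame, so .toPandas() works again.
    top20 = spark.createDataFrame(my_df.head(20), schema=my_df.schema)
    top20.toPandas()

Note that my_df.limit(20) yields the same first-N rows as a DataFrame without collecting them to the driver first.
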
8
votes
1 answer

How to create a z-score in Spark SQL for each group

I have a dataframe which looks like this:

        dSc  TranAmount
    1: 100021      79.64
    2: 100021      79.64
    3: 100021       0.16
    4: 100022      11.65
    5: 100022       0.36
    6: 100022       0.47
    7: 100025       0.17
    8: 100037       0.27
    9: …
Bg1850
  • 3,032
  • 2
  • 16
  • 30
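
A minimal sketch of the window-aggregate answer (sample data abbreviated from the question; single-row groups get a null stddev and hence a null z-score):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(100021, 79.64), (100021, 0.16), (100022, 11.65), (100022, 0.36)],
        ["dSc", "TranAmount"],
    )

    # z-score = (x - group mean) / group stddev, computed per dSc group.
    w = Window.partitionBy("dSc")
    df.withColumn(
        "zscore",
        (F.col("TranAmount") - F.avg("TranAmount").over(w))
        / F.stddev("TranAmount").over(w),
    ).show()
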
8
votes
1 answer

How to serialize a pyspark Pipeline object?

I'm trying to serialize a PySpark Pipeline object so that it can be saved and retrieved later. I tried using the Python pickle library as well as PySpark's PickleSerializer, but the dumps() call itself is failing. Providing the code snippet while…
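
For context, a hedged sketch: pickle fails here because ML objects wrap JVM references, but from Spark 2.0 onward pipelines ship their own save()/load() persistence (the path below is hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer

    spark = SparkSession.builder.getOrCreate()

    # Native ML persistence instead of pickle.
    pipeline = Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])
    pipeline.save("/tmp/my_pipeline")
    restored = Pipeline.load("/tmp/my_pipeline")
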
8
votes
2 answers

pySpark: Save ML Model

Can someone please give an example of how you would save an ML model in pySpark? For ml.classification.LogisticRegressionModel I tried the following: model.save("path") but it does not seem to work.
ml_0x
  • 302
  • 1
  • 3
  • 18
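
A hedged sketch, assuming Spark 2.0+, where pyspark.ml models expose save()/load(); the 1.x Python API lacked this, which matches the failure described (training data and path are made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    train_df = spark.createDataFrame(
        [(Vectors.dense([0.0]), 0.0), (Vectors.dense([1.0]), 1.0)],
        ["features", "label"],
    )

    model = LogisticRegression().fit(train_df)
    model.save("/tmp/lr_model")
    restored = LogisticRegressionModel.load("/tmp/lr_model")
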
8
votes
3 answers

How to make RMSE (root mean square error) small when using ALS in Spark?

I need some suggestions for building a good recommendation model using Spark's collaborative filtering. There is sample code on the official website, which I paste below: from pyspark.mllib.recommendation import ALS,…
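
A sketch of the standard tuning loop: hold out some ratings, sweep rank / iterations / lambda_ (the regularization term usually moves RMSE the most), and score with predictAll. The ratings here are made up, and real tuning would evaluate on a held-out split rather than the training data.

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext.getOrCreate()
    ratings = sc.parallelize(
        [Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)]
    )

    model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.1)

    # RMSE: join predictions with true ratings on the (user, product) key.
    preds = model.predictAll(ratings.map(lambda r: (r.user, r.product))) \
                 .map(lambda r: ((r.user, r.product), r.rating))
    truth = ratings.map(lambda r: ((r.user, r.product), r.rating))
    rmse = truth.join(preds).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean() ** 0.5
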
8
votes
2 answers

Register UDF to SqlContext from Scala to use in PySpark

Is it possible to register a UDF (or function) written in Scala to use in PySpark? E.g.: val mytable = sc.parallelize(1 to 2).toDF("spam") mytable.registerTempTable("mytable") def addOne(m: Integer): Integer = m + 1 // Spam: 1, 2 In Scala, the…
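
A heavily hedged sketch of the usual py4j route, assuming the Scala UDF is compiled into a JAR on the classpath and exposed through a registration helper; every name below is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumes a Scala helper on the classpath, e.g.
    #   object MyUdfs { def register(s: SparkSession): Unit =
    #     s.udf.register("addOne", (m: Int) => m + 1) }
    # Trigger the registration through py4j, then call the UDF from SQL.
    spark.sparkContext._jvm.com.example.MyUdfs.register(spark._jsparkSession)
    spark.range(1, 3).toDF("spam").createOrReplaceTempView("mytable")
    spark.sql("SELECT addOne(spam) FROM mytable").show()
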
8
votes
1 answer

Jupyter & PySpark: How to run multiple notebooks

I am using Spark 1.6.0 on three VMs, 1x Master (standalone), 2x workers w/ 8G RAM, 2CPU each. I am using the kernel configuration below: { "display_name": "PySpark ", "language": "python3", "argv": [ "/usr/bin/python3", "-m", …
pltrdy
  • 2,069
  • 1
  • 11
  • 29
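
A hedged sketch of the usual fix: on a standalone cluster the first notebook's SparkContext grabs every core by default, leaving later notebooks waiting for resources, so capping spark.cores.max (and executor memory) per kernel lets several contexts coexist (master URL and values are hypothetical):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://master:7077")
            .setAppName("notebook-1")
            .set("spark.cores.max", "2")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)
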
8
votes
0 answers

PySpark and PDB don't seem to mix

I'm building standalone Python programs that will use pyspark (and the elasticsearch-hadoop connector). I am also addicted to the Python Debugger (PDB) and want to be able to step through my code. It appears I can't run pyspark with the PDB like I…
cybergoof
  • 1,407
  • 3
  • 16
  • 25
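
For what it's worth, a sketch of why the two don't mix: PDB can step through driver-side code (especially in local mode), but functions shipped to executors run in separate worker processes with no terminal attached, so set_trace() cannot stop there. The input path is hypothetical.

    import pdb
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def clean(line):
        # Runs inside executor processes; pdb.set_trace() here would
        # have no terminal to attach to.
        return line.strip()

    rdd = sc.textFile("data.txt")
    pdb.set_trace()  # works: this line still executes on the driver
    print(rdd.map(clean).collect())  # collect() brings results back for inspection
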
8
votes
5 answers

PySpark using IAM roles to access S3

I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket. This is fine when using boto (as it's part of the API), but I can't find a…
Nick
  • 93
  • 1
  • 7
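
A hedged sketch of one common workaround: assume the role with boto3/STS on the driver and hand the temporary credentials to the s3a connector (role ARN and bucket are hypothetical, and hadoop-aws must be on the classpath):

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    creds = boto3.client("sts").assume_role(
        RoleArn="arn:aws:iam::123456789012:role/my-role",
        RoleSessionName="pyspark-session",
    )["Credentials"]

    # Point s3a at the temporary credentials from the assumed role.
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.aws.credentials.provider",
              "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    hconf.set("fs.s3a.access.key", creds["AccessKeyId"])
    hconf.set("fs.s3a.secret.key", creds["SecretAccessKey"])
    hconf.set("fs.s3a.session.token", creds["SessionToken"])

    df = spark.read.text("s3a://my-bucket/path/")
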