Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
8
votes
1 answer

check if a row value is null in spark dataframe

I am using a custom function in pyspark to check a condition for each row in a spark dataframe and add columns if the condition is true. The code is as below: from pyspark.sql.types import * from pyspark.sql.functions import * from pyspark.sql import…
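For the null-check question above, a minimal sketch of one common approach: use when/otherwise with Column.isNull() instead of a row-wise custom function. The column names here are made up, not taken from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 25), (2, None)], ["id", "age"])

# Add a column whose value depends on whether "age" is null in that row.
df = df.withColumn(
    "age_known",
    when(col("age").isNull(), lit(False)).otherwise(lit(True)),
)
df.show()
```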
8
votes
2 answers

How to load a table from a SQLite db file in PySpark?

I am trying to load a table from a SQLite .db file stored on local disk. Is there any clean way to do this in PySpark? Currently, I am using a solution that works but is not as elegant. First I read the table using pandas through sqlite3. One concern is…
Bin
  • 3,645
  • 10
  • 33
  • 57
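For the SQLite question above, a sketch of the JDBC route, assuming the xerial sqlite-jdbc driver jar is on the classpath (for example via --jars or spark.jars.packages); the file path and table name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read one table straight from the .db file, without going through pandas.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlite:/path/to/local.db")
      .option("dbtable", "my_table")
      .option("driver", "org.sqlite.JDBC")
      .load())
df.show()
```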
8
votes
2 answers

How to convert ML VectorUDT features from .mllib to .ml type

Using the PySpark ML API in version 2.0.0 for a simple linear regression example, I get an error with the new ML library. The code is: from pyspark.sql import SQLContext sqlContext = SQLContext(sc) from pyspark.mllib.linalg import…
kgnete
  • 214
  • 2
  • 5
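For the VectorUDT question above, a sketch using MLUtils.convertVectorColumnsToML, which rewrites old pyspark.mllib vector columns into the new pyspark.ml type; the toy DataFrame and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.mllib.util import MLUtils

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, MLLibVectors.dense([1.0, 2.0]))], ["label", "features"])

# Convert the mllib vector column; the result uses pyspark.ml's VectorUDT.
converted = MLUtils.convertVectorColumnsToML(df, "features")
converted.printSchema()
```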
8
votes
2 answers

Timestamp parsing in pyspark

df1: Timestamp: 1995-08-01T00:00:01.000+0000 Is there a way to extract the day of the month from the timestamp column of the data frame using pyspark? I am not able to provide code, as I am new to Spark and do not have a clue how to proceed.
data_person
  • 4,194
  • 7
  • 40
  • 75
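For the timestamp question above, a sketch that parses the string column and extracts the day of the month; the format pattern is an assumption based on the single sample value shown.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, dayofmonth

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("1995-08-01T00:00:01.000+0000",)], ["Timestamp"])

df1 = (df1
       .withColumn("ts", to_timestamp(col("Timestamp"), "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
       .withColumn("day_of_month", dayofmonth(col("ts"))))
df1.show(truncate=False)
```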
8
votes
2 answers

Does Spark Dataframe have an equivalent option to Pandas' merge indicator?

The Python Pandas library contains the following function: DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, …
mnos
  • 143
  • 1
  • 8
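For the merge-indicator question above, Spark's DataFrame join has no indicator option, but a common workaround (sketched here with made-up keys) tags each side with a literal flag before a full outer join and derives the indicator column afterwards.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "l_val"])
right = spark.createDataFrame([(2, "x"), (3, "y")], ["key", "r_val"])

joined = (left.withColumn("_left", lit(True))
          .join(right.withColumn("_right", lit(True)), on="key", how="full_outer")
          .withColumn("_merge",
                      when(col("_left") & col("_right"), "both")
                      .when(col("_left"), "left_only")
                      .otherwise("right_only"))
          .drop("_left", "_right"))
joined.show()
```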
8
votes
2 answers

Any way to access methods from individual stages in PySpark PipelineModel?

I've created a PipelineModel for doing LDA in Spark 2.0 (via PySpark API): def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern='[\W]+'): """ Create a pipeline for running an LDA model on a corpus. This…
Evan Zamir
  • 8,059
  • 14
  • 56
  • 83
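For the PipelineModel question above, the fitted stages are available through the model's stages list, so stage-specific methods can be reached by index; below is a small self-contained sketch with a toy corpus and an LDA stage at the end, mirroring the question's setup.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()
corpus = spark.createDataFrame(
    [(0, "spark pyspark dataframe"), (1, "topic model lda pipeline")],
    ["id", "text"])

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="[\\W]+"),
    CountVectorizer(inputCol="tokens", outputCol="features"),
    LDA(k=2, seed=42, featuresCol="features"),
])
model = pipeline.fit(corpus)

# The fitted stages live in model.stages; the last one is the LDAModel.
lda_model = model.stages[-1]
lda_model.describeTopics(maxTermsPerTopic=3).show()
```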
8
votes
1 answer

find the closest time between two tables in spark

I am using pyspark and I have two dataframes like this: user time bus A 2016/07/18 12:00:00 1 B 2016/07/19 12:00:00 2 C 2016/07/20 12:00:00 3 bus time stop 1 2016/07/18 11:59:40 sA 1 …
Finn
  • 103
  • 1
  • 5
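For the closest-time question above, one possible sketch: join the two frames on the bus id, compute the absolute difference between the two timestamps, and keep the smallest per user with a window. The join key and the time format are assumptions about the data.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, unix_timestamp, abs as abs_, row_number

spark = SparkSession.builder.getOrCreate()
users = spark.createDataFrame(
    [("A", "2016/07/18 12:00:00", 1)], ["user", "time", "bus"])
stops = spark.createDataFrame(
    [(1, "2016/07/18 11:59:40", "sA"), (1, "2016/07/18 12:10:00", "sB")],
    ["bus", "time", "stop"])

fmt = "yyyy/MM/dd HH:mm:ss"
joined = (users.join(stops.withColumnRenamed("time", "stop_time"), on="bus")
          .withColumn("diff", abs_(unix_timestamp(col("time"), fmt)
                                   - unix_timestamp(col("stop_time"), fmt))))

# Keep only the stop with the smallest time difference for each user.
w = Window.partitionBy("user").orderBy("diff")
closest = (joined.withColumn("rn", row_number().over(w))
           .filter("rn = 1").drop("rn", "diff"))
closest.show()
```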
8
votes
2 answers

Why is it possible to have "serialized results of n tasks (XXXX MB)" be greater than `spark.driver.memory` in pyspark?

I launched a spark job with these settings (among others): spark.driver.maxResultSize 11GB spark.driver.memory 12GB I was debugging my pyspark job, and it kept giving me the error: serialized results of 16 tasks (17.4 GB) is bigger than…
makansij
  • 9,303
  • 37
  • 105
  • 183
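Related to the question above, a small sketch for inspecting the two limits at runtime. Note that spark.driver.memory generally has to be set before the driver JVM starts (spark-submit or spark-defaults.conf), while spark.driver.maxResultSize only caps how much serialized task output the driver will accept.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

print(conf.get("spark.driver.memory", "not set"))
print(conf.get("spark.driver.maxResultSize", "1g"))  # 1g is Spark's default
```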
8
votes
10 answers

The SPARK_HOME env variable is set but Jupyter Notebook doesn't see it. (Windows)

I'm on Windows 10. I was trying to get Spark up and running in a Jupyter Notebook alongside Python 3.5. I installed a pre-built version of Spark and set the SPARK_HOME environment variable. I installed findspark and ran the code: import…
Andrea
  • 83
  • 1
  • 1
  • 6
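For the SPARK_HOME/Jupyter question above, a sketch of the usual findspark workaround; the install path is a placeholder for wherever Spark was unpacked on the Windows machine.

```python
import findspark
# Point findspark at the Spark installation explicitly if the notebook
# does not inherit the SPARK_HOME environment variable.
findspark.init("C:/spark/spark-2.0.0-bin-hadoop2.7")

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)
```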
8
votes
3 answers

How to know deploy mode of PySpark application?

I am trying to fix an issue with running out of memory, and I want to know whether I need to change these settings in the default configuration file (spark-defaults.conf) in the Spark home folder, or whether I can set them in code. I saw this…
makansij
  • 9,303
  • 37
  • 105
  • 183
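For the deploy-mode question above, a sketch that reads the effective master and deploy mode from the running application's configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

print(conf.get("spark.master"))                       # e.g. local[*], yarn
print(conf.get("spark.submit.deployMode", "client"))  # client or cluster
```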
8
votes
1 answer

Which is the fastest way to read JSON files from S3: Spark

I have a directory with folders and each folder contains a compressed JSON file (.gz). Currently I am doing this: val df = sqlContext.jsonFile("s3://testData/*/*/*") df.show() Eg: testData/May/01/00/File.json.gz Each compressed file is about 11 to…
user4479371
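The excerpt above shows Scala; a PySpark sketch of the same read is below. Supplying an explicit schema (not shown) is the usual way to skip the schema-inference pass, which is often the main cost; whether the path needs s3:// or s3a:// depends on the cluster's S3 connector.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark decompresses .gz files transparently; the glob mirrors the
# testData/May/01/00/File.json.gz layout from the question.
df = spark.read.json("s3://testData/*/*/*")
df.show()
```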
8
votes
3 answers

integrating scikit-learn with pyspark

I'm exploring pyspark and the possibilities of integrating scikit-learn with pyspark. I'd like to train a model on each partition using scikit-learn. That means, when my RDD is defined and gets distributed among different worker nodes, I'd like…
HHH
  • 6,085
  • 20
  • 92
  • 164
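For the scikit-learn question above, a sketch of fitting one scikit-learn model per partition with mapPartitions; the toy data, feature layout, and choice of LinearRegression are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# (features, label) pairs spread over two partitions.
rdd = sc.parallelize([([1.0, 2.0], 3.0), ([2.0, 1.0], 4.0),
                      ([0.5, 0.5], 1.0), ([3.0, 3.0], 9.0)], numSlices=2)

def fit_partition(rows):
    rows = list(rows)
    if not rows:
        return iter([])
    X = np.array([r[0] for r in rows])
    y = np.array([r[1] for r in rows])
    model = LinearRegression().fit(X, y)
    return iter([model.coef_.tolist()])  # yield something picklable per partition

per_partition_coefs = rdd.mapPartitions(fit_partition).collect()
print(per_partition_coefs)
```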
8
votes
1 answer

How to build a sparse matrix in PySpark?

I am new to Spark. I would like to make a sparse matrix, specifically a user-id by item-id matrix, for a recommendation engine. I know how I would do this in Python. How does one do this in PySpark? Here is how I would have done it in matrix. The table…
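For the sparse-matrix question above, one possible representation is a CoordinateMatrix built from (user_id, item_id, rating) entries; the ids and values below are made up.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

entries = sc.parallelize([
    MatrixEntry(0, 1, 5.0),   # user 0 rated item 1 with 5.0
    MatrixEntry(2, 0, 3.0),
    MatrixEntry(1, 2, 4.0),
])
mat = CoordinateMatrix(entries)
print(mat.numRows(), mat.numCols())

# Convert to one sparse row vector per user if that layout is needed.
row_mat = mat.toIndexedRowMatrix()
```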
8
votes
1 answer

Forward fill missing values in Spark/Python

I am attempting to fill in missing values in my Spark dataframe with the previous non-null value (if it exists). I've done this type of thing in Python/Pandas, but my data is too big for Pandas (on a small cluster) and I'm a Spark noob. Is this…
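For the forward-fill question above, a window-based sketch using last(..., ignorenulls=True) over an unbounded-preceding frame; the grouping and ordering columns are assumptions about the data's layout.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, last

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, None), ("a", 3, None), ("b", 1, 5.0)],
    ["id", "time", "value"])

# Carry the most recent non-null value forward within each id, ordered by time.
w = (Window.partitionBy("id").orderBy("time")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
filled = df.withColumn("value_filled", last(col("value"), ignorenulls=True).over(w))
filled.show()
```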
8
votes
1 answer

boto3 cannot create client on pyspark worker?

I'm trying to send data from the workers of a Pyspark RDD to an SQS queue, using boto3 to talk with AWS. I need to send data directly from the partitions, rather than collecting the RDD and sending data from the driver. I am able to send messages to…
EmmaOnThursday
  • 167
  • 4
  • 9
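For the boto3 question above, the usual pattern is to build the client inside foreachPartition so nothing unpicklable has to be shipped from the driver; the queue URL and region below are placeholders.

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(["msg-1", "msg-2", "msg-3"], numSlices=2)

def send_partition(messages):
    # The client is created on the worker, once per partition.
    sqs = boto3.client("sqs", region_name="us-east-1")
    for body in messages:
        sqs.send_message(
            QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
            MessageBody=body)

rdd.foreachPartition(send_partition)
```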