Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
8
votes
1 answer

check if a row value is null in spark dataframe

I am using a custom function in pyspark to check a condition for each row in a spark dataframe and add columns if the condition is true. The code is as below: from pyspark.sql.types import * from pyspark.sql.functions import * from pyspark.sql import…
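For the null-check question above, a minimal sketch of one common approach: use when/otherwise with Column.isNull() instead of a row-wise custom function. The column names here are made up, not taken from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 25), (2, None)], ["id", "age"])

# Add a column whose value depends on whether "age" is null in that row.
df = df.withColumn(
    "age_known",
    when(col("age").isNull(), lit(False)).otherwise(lit(True)),
)
df.show()
```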
8
votes
2 answers

How to load a table from a SQLite db file in PySpark?

I am trying to load a table from a SQLite .db file stored on local disk. Is there any clean way to do this in PySpark? Currently, I am using a solution that works but is not as elegant. First I read the table using pandas through sqlite3. One concern is…
Bin
  • 3,645
  • 10
  • 33
  • 57
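For the SQLite question above, a sketch of the JDBC route, assuming the xerial sqlite-jdbc driver jar is on the classpath (for example via --jars or spark.jars.packages); the file path and table name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read one table straight from the .db file, without going through pandas.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlite:/path/to/local.db")
      .option("dbtable", "my_table")
      .option("driver", "org.sqlite.JDBC")
      .load())
df.show()
```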
8
votes
2 answers

How to convert ML VectorUDT features from .mllib to .ml type

Using the PySpark ML API in version 2.0.0 for a simple linear regression example, I get an error with the new ML library. The code is: from pyspark.sql import SQLContext sqlContext = SQLContext(sc) from pyspark.mllib.linalg import…
kgnete
  • 214
  • 2
  • 5
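For the VectorUDT question above, a sketch using MLUtils.convertVectorColumnsToML, which rewrites old pyspark.mllib vector columns into the new pyspark.ml type; the toy DataFrame and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.mllib.util import MLUtils

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, MLLibVectors.dense([1.0, 2.0]))], ["label", "features"])

# Convert the mllib vector column; the result uses pyspark.ml's VectorUDT.
converted = MLUtils.convertVectorColumnsToML(df, "features")
converted.printSchema()
```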
8
votes
2 answers

Timestamp parsing in pyspark

df1: Timestamp: 1995-08-01T00:00:01.000+0000 Is there a way to extract the day of the month from the timestamp column of the data frame using pyspark? I am not able to provide code, as I am new to Spark and do not have a clue how to proceed.
data_person
  • 4,194
  • 7
  • 40
  • 75
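For the timestamp question above, a sketch that parses the string column and extracts the day of the month; the format pattern is an assumption based on the single sample value shown.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, dayofmonth

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("1995-08-01T00:00:01.000+0000",)], ["Timestamp"])

df1 = (df1
       .withColumn("ts", to_timestamp(col("Timestamp"), "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
       .withColumn("day_of_month", dayofmonth(col("ts"))))
df1.show(truncate=False)
```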
8
votes
2 answers

Does Spark Dataframe have an equivalent option to Pandas' merge indicator?

The Python Pandas library contains the following function: DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, …
mnos
  • 143
  • 1
  • 8
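For the merge-indicator question above, Spark's DataFrame join has no indicator option, but a common workaround (sketched here with made-up keys) tags each side with a literal flag before a full outer join and derives the indicator column afterwards.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "l_val"])
right = spark.createDataFrame([(2, "x"), (3, "y")], ["key", "r_val"])

joined = (left.withColumn("_left", lit(True))
          .join(right.withColumn("_right", lit(True)), on="key", how="full_outer")
          .withColumn("_merge",
                      when(col("_left") & col("_right"), "both")
                      .when(col("_left"), "left_only")
                      .otherwise("right_only"))
          .drop("_left", "_right"))
joined.show()
```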
8
votes
2 answers

Any way to access methods from individual stages in PySpark PipelineModel?

I've created a PipelineModel for doing LDA in Spark 2.0 (via PySpark API): def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern='[\W]+'): """ Create a pipeline for running an LDA model on a corpus. This…
Evan Zamir
  • 8,059
  • 14
  • 56
  • 83
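For the PipelineModel question above, the fitted stages are available through the model's stages list, so stage-specific methods can be reached by index; below is a small self-contained sketch with a toy corpus and an LDA stage at the end, mirroring the question's setup.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()
corpus = spark.createDataFrame(
    [(0, "spark pyspark dataframe"), (1, "topic model lda pipeline")],
    ["id", "text"])

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="[\\W]+"),
    CountVectorizer(inputCol="tokens", outputCol="features"),
    LDA(k=2, seed=42, featuresCol="features"),
])
model = pipeline.fit(corpus)

# The fitted stages live in model.stages; the last one is the LDAModel.
lda_model = model.stages[-1]
lda_model.describeTopics(maxTermsPerTopic=3).show()
```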
8
votes
1 answer

find the closest time between two tables in spark

I am using pyspark and I have two dataframes like this: user time bus A 2016/07/18 12:00:00 1 B 2016/07/19 12:00:00 2 C 2016/07/20 12:00:00 3 bus time stop 1 2016/07/18 11:59:40 sA 1 …
Finn
  • 103
  • 1
  • 5
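For the closest-time question above, one possible sketch: join the two frames on the bus id, compute the absolute difference between the two timestamps, and keep the smallest per user with a window. The join key and the time format are assumptions about the data.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, unix_timestamp, abs as abs_, row_number

spark = SparkSession.builder.getOrCreate()
users = spark.createDataFrame(
    [("A", "2016/07/18 12:00:00", 1)], ["user", "time", "bus"])
stops = spark.createDataFrame(
    [(1, "2016/07/18 11:59:40", "sA"), (1, "2016/07/18 12:10:00", "sB")],
    ["bus", "time", "stop"])

fmt = "yyyy/MM/dd HH:mm:ss"
joined = (users.join(stops.withColumnRenamed("time", "stop_time"), on="bus")
          .withColumn("diff", abs_(unix_timestamp(col("time"), fmt)
                                   - unix_timestamp(col("stop_time"), fmt))))

# Keep only the stop with the smallest time difference for each user.
w = Window.partitionBy("user").orderBy("diff")
closest = (joined.withColumn("rn", row_number().over(w))
           .filter("rn = 1").drop("rn", "diff"))
closest.show()
```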
8
votes
2 answers

Why is it possible to have "serialized results of n tasks (XXXX MB)" be greater than `spark.driver.memory` in pyspark?

I launched a spark job with these settings (among others): spark.driver.maxResultSize 11GB spark.driver.memory 12GB I was debugging my pyspark job, and it kept giving me the error: serialized results of 16 tasks (17.4 GB) is bigger than…
makansij
  • 9,303
  • 37
  • 105
  • 183
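Related to the question above, a small sketch for inspecting the two limits at runtime. Note that spark.driver.memory generally has to be set before the driver JVM starts (spark-submit or spark-defaults.conf), while spark.driver.maxResultSize only caps how much serialized task output the driver will accept.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

print(conf.get("spark.driver.memory", "not set"))
print(conf.get("spark.driver.maxResultSize", "1g"))  # 1g is Spark's default
```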
8
votes
10 answers

The SPARK_HOME env variable is set but Jupyter Notebook doesn't see it. (Windows)

I'm on Windows 10. I was trying to get Spark up and running in a Jupyter Notebook alongside Python 3.5. I installed a pre-built version of Spark and set the SPARK_HOME environment variable. I installed findspark and ran the code: import…
Andrea
  • 83
  • 1
  • 1
  • 6
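For the SPARK_HOME/Jupyter question above, a sketch of the usual findspark workaround; the install path is a placeholder for wherever Spark was unpacked on the Windows machine.

```python
import findspark
# Point findspark at the Spark installation explicitly if the notebook
# does not inherit the SPARK_HOME environment variable.
findspark.init("C:/spark/spark-2.0.0-bin-hadoop2.7")

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)
```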
8
votes
3 answers

How to know deploy mode of PySpark application?

I am trying to fix an issue with running out of memory, and I want to know whether I need to change these settings in the default configuration file (spark-defaults.conf) in the Spark home folder, or whether I can set them in code. I saw this…
makansij
  • 9,303
  • 37
  • 105
  • 183
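For the deploy-mode question above, a sketch that reads the effective master and deploy mode from the running application's configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

print(conf.get("spark.master"))                       # e.g. local[*], yarn
print(conf.get("spark.submit.deployMode", "client"))  # client or cluster
```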
8
votes
1 answer

Which is the fastest way to read JSON files from S3: Spark

I have a directory with folders and each folder contains a compressed JSON file (.gz). Currently I am doing this: val df = sqlContext.jsonFile("s3://testData/*/*/*") df.show() Eg: testData/May/01/00/File.json.gz Each compressed file is about 11 to…
user4479371
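The excerpt above shows Scala; a PySpark sketch of the same read is below. Supplying an explicit schema (not shown) is the usual way to skip the schema-inference pass, which is often the main cost; whether the path needs s3:// or s3a:// depends on the cluster's S3 connector.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark decompresses .gz files transparently; the glob mirrors the
# testData/May/01/00/File.json.gz layout from the question.
df = spark.read.json("s3://testData/*/*/*")
df.show()
```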
8
votes
3 answers

integrating scikit-learn with pyspark

I'm exploring pyspark and the possibilities of integrating scikit-learn with pyspark. I'd like to train a model on each partition using scikit-learn. That means, when my RDD is defined and gets distributed among different worker nodes, I'd like…
HHH
  • 6,085
  • 20
  • 92
  • 164
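For the scikit-learn question above, a sketch of fitting one scikit-learn model per partition with mapPartitions; the toy data, feature layout, and choice of LinearRegression are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# (features, label) pairs spread over two partitions.
rdd = sc.parallelize([([1.0, 2.0], 3.0), ([2.0, 1.0], 4.0),
                      ([0.5, 0.5], 1.0), ([3.0, 3.0], 9.0)], numSlices=2)

def fit_partition(rows):
    rows = list(rows)
    if not rows:
        return iter([])
    X = np.array([r[0] for r in rows])
    y = np.array([r[1] for r in rows])
    model = LinearRegression().fit(X, y)
    return iter([model.coef_.tolist()])  # yield something picklable per partition

per_partition_coefs = rdd.mapPartitions(fit_partition).collect()
print(per_partition_coefs)
```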
8
votes
1 answer

How to build a sparse matrix in PySpark?

I am new to Spark. I would like to make a sparse matrix, specifically a user-id by item-id matrix, for a recommendation engine. I know how I would do this in Python. How does one do this in PySpark? Here is how I would have done it in matrix. The table…
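For the sparse-matrix question above, one possible representation is a CoordinateMatrix built from (user_id, item_id, rating) entries; the ids and values below are made up.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

entries = sc.parallelize([
    MatrixEntry(0, 1, 5.0),   # user 0 rated item 1 with 5.0
    MatrixEntry(2, 0, 3.0),
    MatrixEntry(1, 2, 4.0),
])
mat = CoordinateMatrix(entries)
print(mat.numRows(), mat.numCols())

# Convert to one sparse row vector per user if that layout is needed.
row_mat = mat.toIndexedRowMatrix()
```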
8
votes
1 answer

Forward fill missing values in Spark/Python

I am attempting to fill in missing values in my Spark dataframe with the previous non-null value (if it exists). I've done this type of thing in Python/Pandas, but my data is too big for Pandas (on a small cluster) and I'm a Spark noob. Is this…
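For the forward-fill question above, a window-based sketch using last(..., ignorenulls=True) over an unbounded-preceding frame; the grouping and ordering columns are assumptions about the data's layout.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, last

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, None), ("a", 3, None), ("b", 1, 5.0)],
    ["id", "time", "value"])

# Carry the most recent non-null value forward within each id, ordered by time.
w = (Window.partitionBy("id").orderBy("time")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
filled = df.withColumn("value_filled", last(col("value"), ignorenulls=True).over(w))
filled.show()
```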
8
votes
1 answer

boto3 cannot create client on pyspark worker?

I'm trying to send data from the workers of a Pyspark RDD to an SQS queue, using boto3 to talk with AWS. I need to send data directly from the partitions, rather than collecting the RDD and sending data from the driver. I am able to send messages to…
EmmaOnThursday
  • 167
  • 4
  • 9
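For the boto3 question above, the usual pattern is to build the client inside foreachPartition so nothing unpicklable has to be shipped from the driver; the queue URL and region below are placeholders.

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(["msg-1", "msg-2", "msg-3"], numSlices=2)

def send_partition(messages):
    # The client is created on the worker, once per partition.
    sqs = boto3.client("sqs", region_name="us-east-1")
    for body in messages:
        sqs.send_message(
            QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
            MessageBody=body)

rdd.foreachPartition(send_partition)
```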