I am using a custom function in PySpark to check a condition for each row of a Spark DataFrame and add columns if the condition is true.
The code is as below:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import…
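For reference, a minimal sketch of the usual column-expression approach (withColumn with when/otherwise) rather than a row-wise function; the column names and threshold below are hypothetical, not the asker's schema:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit

spark = SparkSession.builder.getOrCreate()
# Hypothetical input data; "amount" and the threshold are assumptions
df = spark.createDataFrame([(1, 50), (2, 150)], ["id", "amount"])
# Add a new column whose value depends on the per-row condition
df = df.withColumn("is_large", when(col("amount") > 100, lit(True)).otherwise(lit(False)))
df.show()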
I am trying to load a table from an SQLite .db file stored on local disk. Is there a clean way to do this in PySpark?
Currently, I am using a solution that works but is not very elegant. First I read the table using pandas through sqlite3. One concern is…
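For reference, a sketch of reading the table straight through JDBC instead of going via pandas, assuming the Xerial sqlite-jdbc driver jar is available; the paths and table name are placeholders:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars", "/path/to/sqlite-jdbc.jar")  # assumed location of the driver jar
         .getOrCreate())
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlite:/path/to/local.db")       # placeholder .db path
      .option("dbtable", "my_table")                        # placeholder table name
      .option("driver", "org.sqlite.JDBC")
      .load())
df.show()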
Using the PySpark ML API in version 2.0.0 for a simple linear regression example, I get an error with the new ML library.
The code is:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.mllib.linalg import…
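For context, a sketch of the 2.0-style setup that keeps everything in the new pyspark.ml namespace instead of mixing in pyspark.mllib.linalg vectors; the toy data below is made up:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors              # note: ml, not mllib
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()
# "features"/"label" are the default column names LinearRegression expects
train = spark.createDataFrame(
    [(Vectors.dense([1.0]), 2.0),
     (Vectors.dense([2.0]), 4.1),
     (Vectors.dense([3.0]), 6.2)],
    ["features", "label"])
lr = LinearRegression(maxIter=10)
model = lr.fit(train)
print(model.coefficients, model.intercept)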
df1:
Timestamp:
1995-08-01T00:00:01.000+0000
Is there a way to extract the day of the month from the timestamp column of the DataFrame using PySpark? I am not able to provide code; I am new to Spark and do not have a clue how to proceed.
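For reference, a minimal sketch using the built-in dayofmonth function, assuming the column is a string named Timestamp in the format shown above:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp, dayofmonth

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("1995-08-01T00:00:01.000+0000",)], ["Timestamp"])
# Parse the string into a timestamp, then pull out the day of the month
df1 = df1.withColumn("ts", unix_timestamp(col("Timestamp"), "yyyy-MM-dd'T'HH:mm:ss.SSSZ").cast("timestamp"))
df1 = df1.withColumn("day", dayofmonth(col("ts")))
df1.show(truncate=False)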
I've created a PipelineModel for doing LDA in Spark 2.0 (via PySpark API):
def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern='[\W]+'):
"""
Create a pipeline for running an LDA model on a corpus. This…
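For context, a hedged sketch of what such a pipeline commonly looks like (regex tokenizer, count vectorizer, then LDA); the stage wiring and column names here are assumptions, not the asker's actual code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

def build_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern=r'[\W]+'):
    # Split raw text on the regex, build term-count vectors, then fit LDA on them
    tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens",
                               pattern=pattern, minTokenLength=minTokenLength)
    vectorizer = CountVectorizer(inputCol="tokens", outputCol="features",
                                 minDF=minDF, minTF=minTF)
    lda = LDA(k=numTopics, seed=seed, featuresCol="features")
    return Pipeline(stages=[tokenizer, vectorizer, lda])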
I am using pyspark and I have two dataframes like this:
user time bus
A 2016/07/18 12:00:00 1
B 2016/07/19 12:00:00 2
C 2016/07/20 12:00:00 3
bus time stop
1 2016/07/18 11:59:40 sA
1 …
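For reference, a minimal sketch of joining the two DataFrames on the shared bus column; the full question is cut off here, so any time-based matching condition is left out and the DataFrame names are assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
users = spark.createDataFrame(
    [("A", "2016/07/18 12:00:00", 1),
     ("B", "2016/07/19 12:00:00", 2),
     ("C", "2016/07/20 12:00:00", 3)],
    ["user", "time", "bus"])
stops = spark.createDataFrame(
    [(1, "2016/07/18 11:59:40", "sA")],
    ["bus", "time", "stop"])
# Rename the second time column to avoid an ambiguous name, then equi-join on bus
joined = users.join(stops.withColumnRenamed("time", "stop_time"), on="bus", how="inner")
joined.show()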
I launched a spark job with these settings (among others):
spark.driver.maxResultSize 11GB
spark.driver.memory 12GB
I was debugging my pyspark job, and it kept giving me the error:
serialized results of 16 tasks (17.4 GB) is bigger than…
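For context, this error is raised when the combined size of task results pulled back to the driver (for example by collect()) exceeds spark.driver.maxResultSize; a sketch of raising the limit programmatically, with an illustrative value:
from pyspark import SparkConf, SparkContext

# Illustrative value only; results collected to the driver still have to fit in driver memory
conf = SparkConf().set("spark.driver.maxResultSize", "11g")
sc = SparkContext(conf=conf)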
I'm on Windows 10. I was trying to get Spark up and running in a Jupyter Notebook alongside Python 3.5. I installed a pre-built version of Spark and set the SPARK_HOME environment variable. I installed findspark and ran the code:
import…
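For reference, the usual findspark pattern looks roughly like this, assuming SPARK_HOME already points at the pre-built Spark directory:
import findspark
findspark.init()                      # reads SPARK_HOME and puts pyspark on sys.path

import pyspark
sc = pyspark.SparkContext(appName="jupyter-test")   # hypothetical app name
print(sc.version)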
I am trying to fix an issue with running out of memory, and I want to know whether I need to change these settings in the default configuration file (spark-defaults.conf) in the Spark home folder, or whether I can set them in the code.
I saw this…
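For context, a sketch of setting such values in code via SparkConf; the keys and sizes are illustrative, and spark.driver.memory in particular only takes effect if set before the driver JVM starts (spark-defaults.conf or spark-submit), not from inside an already running driver:
from pyspark import SparkConf, SparkContext

# Roughly equivalent to lines in spark-defaults.conf such as:
#   spark.executor.memory        4g
#   spark.driver.maxResultSize   2g
conf = (SparkConf()
        .setAppName("memory-settings-example")      # hypothetical app name
        .set("spark.executor.memory", "4g")
        .set("spark.driver.maxResultSize", "2g"))
sc = SparkContext(conf=conf)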
I have a directory with folders, and each folder contains a compressed JSON file (.gz). Currently I am doing it like this:
val df = sqlContext.jsonFile("s3://testData/*/*/*")
df.show()
Eg:
testData/May/01/00/File.json.gz
Each compressed file is about 11 to…
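For reference, the PySpark equivalent with the non-deprecated reader API; Spark decompresses .gz files transparently, and the glob below just mirrors the layout shown above:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="read-gz-json")            # hypothetical app name
sqlContext = SQLContext(sc)
# jsonFile is the deprecated pre-1.4 call; DataFrameReader.json handles the .gz files the same way
df = sqlContext.read.json("s3://testData/*/*/*")
df.show()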
I'm exploring PySpark and the possibilities of integrating scikit-learn with PySpark. I'd like to train a model on each partition using scikit-learn. That means that when my RDD is defined and gets distributed among different worker nodes, I'd like…
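For reference, a sketch of the usual per-partition training pattern with mapPartitions; the (features, label) record layout and the SGDClassifier choice are assumptions:
import numpy as np
from sklearn.linear_model import SGDClassifier
from pyspark import SparkContext

sc = SparkContext(appName="sklearn-per-partition")   # hypothetical app name

def train_partition(rows):
    # rows is an iterator over this partition's (features, label) pairs
    data = list(rows)
    if not data:
        return iter([])
    X = np.array([features for features, _ in data])
    y = np.array([label for _, label in data])
    model = SGDClassifier().fit(X, y)
    return iter([model])                              # one fitted model per partition

# Toy data: each record is ([x1, x2], label)
rdd = sc.parallelize([([0.0, 1.0], 0), ([1.0, 0.0], 1)] * 50, 4)
models = rdd.mapPartitions(train_partition).collect()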
I am new to Spark. I would like to make a sparse matrix, specifically a user-id by item-id matrix, for a recommendation engine. I know how I would do this in Python. How does one do this in PySpark? Here is how I would have done it as a matrix. The table…
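For reference, one hedged sketch using CoordinateMatrix from pyspark.mllib, which stores only the non-zero (user_id, item_id, rating) entries; the toy triples are made up:
from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

sc = SparkContext(appName="sparse-user-item")        # hypothetical app name
# Each MatrixEntry is (row=user_id, col=item_id, value=rating); missing cells are implicit zeros
entries = sc.parallelize([
    MatrixEntry(0, 3, 5.0),
    MatrixEntry(1, 0, 1.0),
    MatrixEntry(2, 2, 4.0),
])
mat = CoordinateMatrix(entries)
print(mat.numRows(), mat.numCols())
rows = mat.toRowMatrix()                              # row-oriented sparse form, if needed downstream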
I am attempting to fill in missing values in my Spark DataFrame with the previous non-null value (if it exists). I've done this type of thing in Python/pandas, but my data is too big for pandas (on a small cluster) and I'm a Spark noob. Is this…
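For reference, a common forward-fill sketch in Spark uses a window plus last(..., ignorenulls=True); the key, ordering, and value column names here are assumptions:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, last

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, None), ("a", 3, None), ("b", 1, 5.0)],
    ["key", "ts", "value"])
# Within each key, carry the last non-null value forward in ts order
w = Window.partitionBy("key").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
filled = df.withColumn("value_filled", last(col("value"), ignorenulls=True).over(w))
filled.show()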
I'm trying to send data from the workers of a Pyspark RDD to an SQS queue, using boto3 to talk with AWS. I need to send data directly from the partitions, rather than collecting the RDD and sending data from the driver.
I am able to send messages to…
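For reference, a sketch of the usual pattern: build the boto3 client inside foreachPartition so it is created on the executor rather than pickled from the driver; the queue URL and region are placeholders:
import boto3
from pyspark import SparkContext

sc = SparkContext(appName="sqs-from-workers")                 # hypothetical app name
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"   # placeholder

def send_partition(records):
    # boto3 clients are not picklable, so create one per partition on the worker
    sqs = boto3.client("sqs", region_name="us-east-1")
    for record in records:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=str(record))

rdd = sc.parallelize(["msg-1", "msg-2", "msg-3"])
rdd.foreachPartition(send_partition)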