Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
7
votes
2 answers

Pyspark from Spark installation VS Pyspark python package

I just started learning Spark and I'm a bit confused by this concept. From the Spark installation, we get pyspark under the installation's sub-folders; I understand it's a shell. From the Python package index we also can install the…
JYBLTN
  • 71
  • 5
7
votes
0 answers

Incorrect nullability of column after saving pyspark dataframe

When saving a PySpark dataframe with a new column added via the 'withColumn' function, the nullability changes from false to true. Version info: Python 3.7.3 / Spark 2.4.0-cdh6.1.1 >>> l = [('Alice', 1)] >>> df = spark.createDataFrame(l) >>>…
7
votes
0 answers

Intermittently getting can not create the managed table error while creating table from spark

We are intermittently facing the error below in Spark 2.4 when saving a managed table from Spark. Error: pyspark.sql.utils.AnalysisException: u"Can not create the managed table('hive_issue.table'). The associated…
7
votes
3 answers

How can I integrate xgboost in spark? (Python)

I am trying to train a model using XGBoost on data I have in Hive. The data is too large to convert to a pandas df, so I have to use XGBoost with a Spark df. When creating an XGBoostEstimator, an error occurs: TypeError: 'JavaPackage'…
Elad Cohen
  • 453
  • 3
  • 16
7
votes
2 answers

Best practice for feeding spark dataframes for training Tensorflow network

I want to feed data coming from Spark clusters to train a deep network. I do not have GPUs in the nodes, so distributed TensorFlow or packages like elephas are not an option. I have come up with the following generator, which does the job. It just…
Hamed
  • 474
  • 5
  • 17
7
votes
2 answers

How to dynamically slice an Array column in Spark?

Spark 2.4 introduced the new SQL function slice, which can be used to extract a certain range of elements from an array column. I want to define that range dynamically per row, based on an Integer column that has the number of elements I want to pick…
harppu
  • 384
  • 4
  • 13
7
votes
2 answers

How to save a spark DataFrame back into a Google BigQuery project using pyspark?

I am loading a dataset from BigQuery and after some transformations, I'd like to save the transformed DataFrame back into BigQuery. Is there a way of doing this? This is how I am loading the data: df = spark.read \ .format('bigquery') \ …
7
votes
1 answer

SparkSession status

How can I check the SparkSession status in my PySpark code? The requirement is to check whether the SparkSession is active or not; if it is not active, create another Spark session and call some function. I am writing and running this code in…
rakesh
  • 81
  • 1
  • 4
7
votes
2 answers

Spark: How to aggregate/reduce records based on time difference?

I have time-series data in CSV from a vehicle with the following fields: trip-id, timestamp, speed. The data looks like this: trip-id | timestamp | speed 001 | 1538204192 | 44.55 001 | 1538204193 | 47.20 <-- start of brake 001 |…
Shumail
  • 3,103
  • 4
  • 28
  • 35
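The standard trick for this kind of sessionization is: flag each row whose gap from the previous row exceeds a threshold, then take a cumulative sum of the flags to get a group id, and aggregate per group. A pure-Python sketch of the logic (the 1-second threshold is an assumption), with the equivalent Spark window expression in comments:

```python
def sessionize(timestamps, max_gap=1):
    """Assign a group id to each (sorted) timestamp: a new group starts
    whenever the gap from the previous reading exceeds `max_gap` seconds.
    Mirrors the lag + cumulative-sum pattern used with Spark windows."""
    groups, current = [], 0
    for i, t in enumerate(timestamps):
        if i > 0 and t - timestamps[i - 1] > max_gap:
            current += 1
        groups.append(current)
    return groups

# The same idea in PySpark (column names assumed from the question):
# from pyspark.sql import functions as F, Window
# w = Window.partitionBy("trip-id").orderBy("timestamp")
# df = (df.withColumn("gap", F.col("timestamp") - F.lag("timestamp").over(w))
#         .withColumn("grp", F.sum((F.col("gap") > 1).cast("int")).over(w)))
# df.groupBy("trip-id", "grp").agg(F.min("timestamp"), F.max("timestamp"))
```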
7
votes
1 answer

Spark: Is a col of a datetime on a weekday or weekend?

I'm writing Spark code in Python. I have a col (execution_date) that is a timestamp. How would I turn that into a column called is_weekend, with a value of 1 if the date is a weekend and 0 if it's a weekday?
Daniel Kaplan
  • 62,768
  • 50
  • 234
  • 356
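The weekday/weekend test itself is a one-liner; the only subtlety is the day-numbering convention. A sketch in plain Python, with the Spark expression in comments (note Spark's `dayofweek` returns 1=Sunday through 7=Saturday, unlike Python's `weekday()`, where Monday=0):

```python
from datetime import date

def is_weekend(d: date) -> int:
    """1 if d falls on Saturday or Sunday, else 0 (Python: Mon=0 .. Sun=6)."""
    return 1 if d.weekday() >= 5 else 0

# PySpark equivalent on a timestamp column named execution_date:
# from pyspark.sql import functions as F
# df = df.withColumn(
#     "is_weekend",
#     F.dayofweek("execution_date").isin(1, 7).cast("int"))
```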
7
votes
5 answers

Remove spaces from all column names in pyspark

I am new to PySpark. I have received a CSV file which has around 1000 columns. I am using Databricks. Most of these column names have spaces in them, e.g. "Total Revenue", "Total Age", etc. I need to update all the column names that have spaces with…
user11704694
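Renaming all columns in one pass is usually done with `toDF(*new_names)` rather than 1000 chained `withColumnRenamed` calls. A minimal sketch, assuming underscores are an acceptable replacement character:

```python
import re

def sanitize(name: str) -> str:
    """Replace runs of whitespace in a column name with a single underscore."""
    return re.sub(r"\s+", "_", name.strip())

# Applied to a DataFrame in one pass:
# df = df.toDF(*[sanitize(c) for c in df.columns])
```

If the names also contain other characters that Parquet or Delta reject (commas, parentheses), the regex can be widened to `[^0-9a-zA-Z_]+`.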
7
votes
2 answers

AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

I am running into this problem with the Apache Arrow Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine there. I set these in spark-env.sh: export…
thePurplePython
  • 2,621
  • 1
  • 13
  • 34
7
votes
3 answers

How to save dataframe to Elasticsearch in PySpark?

I have a Spark dataframe that I am trying to push to AWS Elasticsearch, but before that I was testing this sample code snippet to push to ES: from pyspark.sql import SparkSession spark = SparkSession.builder.appName('ES_indexer').getOrCreate() df =…
Cyber_Tron
  • 299
  • 1
  • 6
  • 17
7
votes
2 answers

List all additional jars loaded in pyspark

I want to see the jars my spark context is using. I found the code in Scala: $ spark-shell --jars --master=spark://datasci:7077 --jars /opt/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar --packages elsevierlabs-os:spark-xml-utils:1.6.0 scala>…
Eli Simhayev
  • 164
  • 1
  • 11
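From Python, the jar list can be read back from the session's configuration; inspecting the running context directly goes through an internal API that may change between Spark versions. A hedged sketch, with a small helper for the comma-separated `spark.jars` value:

```python
def parse_jar_conf(value: str) -> list:
    """Split the comma-separated value of the `spark.jars` config
    into individual jar paths."""
    return [p for p in (s.strip() for s in (value or "").split(",")) if p]

# With a live session (the second call uses a private attribute, `_jsc`,
# so treat it as version-specific):
# print(parse_jar_conf(spark.conf.get("spark.jars", "")))
# print(spark.sparkContext._jsc.sc().listJars())
```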
7
votes
1 answer

GCP Dataproc custom image Python environment

I have an issue with a Dataproc custom image and PySpark. My custom image is based on Dataproc 1.4.1-debian9; with my initialisation script I install python3 and some packages from a requirements.txt file, then set the python3 env…