I just started learning Spark, and I'm a bit confused by this concept: from the Spark installation we get pyspark under the installation's sub-folders, which I understand is a shell, and from the Python package we can also install the…
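For reference, a minimal sketch assuming the pip-installed package (the app name is illustrative): the pyspark script bundled with the installation is the interactive shell, while the pip package exposes the same library to ordinary Python scripts.

from pyspark.sql import SparkSession

# After `pip install pyspark`, the same session that bin/pyspark
# pre-creates as `spark` can be built from a plain script.
spark = SparkSession.builder.appName('demo').getOrCreate()
print(spark.version)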
When saving a PySpark dataframe with a new column added with the withColumn function, the nullability changes from false to true.
Version info: Python 3.7.3 / Spark 2.4.0-cdh6.1.1
>>> l = [('Alice', 1)]
>>> df = spark.createDataFrame(l)
>>>…
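Continuing the truncated snippet, a hedged sketch of how this behaviour typically shows up (the lit value and output path are mine): Spark infers a literal column as non-nullable, but Parquet columns are written as nullable for compatibility, so the schema flips on a write/read round trip.

from pyspark.sql.functions import lit

# df as created above; the literal column is inferred non-nullable.
df2 = df.withColumn('age', lit(10))
df2.printSchema()   # age: integer (nullable = false)

# Spark writes Parquet columns as nullable for compatibility, so the
# re-read schema reports nullable = true.
df2.write.mode('overwrite').parquet('/tmp/nullability_demo')
spark.read.parquet('/tmp/nullability_demo').printSchema()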
We are intermittently facing the error below in Spark 2.4 when saving a managed table from Spark.
Error -
pyspark.sql.utils.AnalysisException: u"Can not create the managed table('hive_issue.table'). The associated…
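For context, a commonly cited Spark 2.4-era workaround (not taken from the truncated post above): Spark 2.4 refuses to create a managed table over a leftover location, and shipped a legacy flag to relax that check. The flag below existed in 2.4 but was removed in Spark 3.x, and the table name is illustrative.

# Sketch of the workaround: either clean up the leftover table
# location, or relax the non-empty-location check.
spark.conf.set(
    'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation',
    'true')
df.write.mode('overwrite').saveAsTable('hive_issue.table')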
I am trying to train a model using XGBoost on data I have in Hive. The data is too large and I can't convert it to a pandas DataFrame, so I have to use XGBoost with a Spark DataFrame.
When creating an XGBoostEstimator, an error occurs:
TypeError: 'JavaPackage'…
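For context, the usual reading of that TypeError is that the xgboost4j jars never reached the JVM, so the Python wrapper finds an empty JavaPackage. A hedged sketch of wiring the jars in before the session starts (paths and versions are placeholders):

from pyspark.sql import SparkSession

# The jars must be on the driver and executor classpaths before the
# JVM starts; the paths and versions here are illustrative.
spark = (SparkSession.builder
         .appName('xgboost-train')
         .config('spark.jars',
                 '/opt/jars/xgboost4j-0.90.jar,'
                 '/opt/jars/xgboost4j-spark-0.90.jar')
         .getOrCreate())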
I want to feed data coming from Spark clusters to train a deep network. I do not have GPUs in the nodes, so distributed TensorFlow or packages like elephas are not an option.
I have come up with the following generator which does the job. It just…
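Since the generator itself is truncated above, here is a minimal sketch of one way such a generator is often written (the column names and batch size are mine, not the poster's):

import numpy as np

def batch_generator(df, batch_size=128):
    # Stream rows from the executors to the driver one partition at a
    # time, yielding fixed-size numpy batches for the training loop.
    batch = []
    for row in df.toLocalIterator():
        batch.append([row.feature, row.label])  # illustrative columns
        if len(batch) == batch_size:
            yield np.array(batch)
            batch = []
    if batch:
        yield np.array(batch)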
Spark 2.4 introduced the new SQL function slice, which can be used to extract a certain range of elements from an array column.
I want to define that range dynamically per row, based on an Integer column that has the number of elements I want to pick…
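For reference, the workaround usually given for Spark 2.4 (column names below are placeholders): the Python slice() helper there only accepts literal start and length arguments, but the SQL form accepts columns, so expr() does the trick.

from pyspark.sql.functions import expr

# 'arr' is the array column, 'n' the per-row element count (placeholders).
df.withColumn('first_n', expr('slice(arr, 1, n)'))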
I am loading a dataset from BigQuery and after some transformations, I'd like to save the transformed DataFrame back into BigQuery. Is there a way of doing this?
This is how I am loading the data:
df = spark.read \
.format('bigquery') \
…
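A hedged sketch of the corresponding write with the same spark-bigquery connector (the table and staging bucket are placeholders; the connector stages the data through GCS):

# Mirror of the read above; 'temporaryGcsBucket' is the connector's
# staging option, and the table/bucket names are placeholders.
df.write \
    .format('bigquery') \
    .option('table', 'mydataset.mytable') \
    .option('temporaryGcsBucket', 'my-staging-bucket') \
    .save()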
How can I check the SparkSession status in my PySpark code? The requirement is to check whether the SparkSession is active or not; if it is not active, create another SparkSession and call some function.
I am writing and running this code in…
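A minimal sketch of one common approach (the isStopped check reaches through PySpark's private _jsc handle, so it is internals rather than a public API):

from pyspark.sql import SparkSession

def get_active_session():
    # getOrCreate() returns the existing active session or builds one.
    spark = SparkSession.builder.appName('job').getOrCreate()
    # Optional sanity check via JVM internals (not a public API):
    if spark.sparkContext._jsc.sc().isStopped():
        spark = SparkSession.builder.appName('job').getOrCreate()
    return spark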
I have time series data in CSV from a vehicle with the following information:
trip-id
timestamp
speed
The data looks like this:
trip-id | timestamp | speed
001 | 1538204192 | 44.55
001 | 1538204193 | 47.20 <-- start of brake
001 |…
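As a sketch of one way to detect those braking points (the lag window and the "speed drops from the previous row" rule are my assumptions based on the sample rows above):

from pyspark.sql import functions as F
from pyspark.sql import Window

# Compare each row's speed to the previous row within the same trip.
w = Window.partitionBy('trip-id').orderBy('timestamp')
flagged = (df
           .withColumn('prev_speed', F.lag('speed').over(w))
           .withColumn('is_braking',
                       (F.col('prev_speed') > F.col('speed')).cast('int')))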
I'm writing Spark code in Python. I have a col(execution_date) that is a timestamp. How would I turn that into a column called is_weekend that has a value of 1 if the date is a weekend and 0 if it's a weekday?
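A minimal sketch (note that dayofweek() numbers days 1=Sunday through 7=Saturday in Spark SQL):

from pyspark.sql import functions as F

# 1 = Sunday and 7 = Saturday, so those two values mark the weekend.
df = df.withColumn(
    'is_weekend',
    F.dayofweek(F.col('execution_date')).isin(1, 7).cast('int'))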
I am new to PySpark. I have received a csv file which has around 1000 columns. I am using Databricks. Most of these columns have spaces in between, e.g. "Total Revenue", "Total Age" etc. I need to update all the column names with spaces with…
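Since the replacement character is truncated above, a sketch assuming underscores:

# toDF() renames all columns in one pass; underscore is my assumption.
df = df.toDF(*[c.replace(' ', '_') for c in df.columns])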
I am running into this problem with the Apache Arrow Spark integration.
Using AWS EMR with Spark 2.4.3.
I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine.
I set these in spark-env.sh:
export…
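For context (the poster's exports are truncated above), the Arrow path itself is toggled with a SQL conf; the property name below is the Spark 2.x one, renamed under spark.sql.execution.arrow.pyspark.* in Spark 3.x.

spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
pdf = df.toPandas()  # uses Arrow when the column types are supported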
I have a Spark dataframe that I am trying to push to AWS Elasticsearch, but before that I was testing this sample code snippet to push to ES:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ES_indexer').getOrCreate()
df =…
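For reference, a hedged sketch of how the write usually looks with the elasticsearch-hadoop connector (the endpoint, port, and index are placeholders):

# Requires the elasticsearch-hadoop/elasticsearch-spark jar on the classpath.
df.write \
    .format('org.elasticsearch.spark.sql') \
    .option('es.nodes', 'my-domain.us-east-1.es.amazonaws.com') \
    .option('es.port', '443') \
    .option('es.nodes.wan.only', 'true') \
    .mode('append') \
    .save('my-index/_doc')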
I want to see the jars my Spark context is using.
I found the code in Scala:
$ spark-shell --master=spark://datasci:7077 --jars /opt/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar --packages elsevierlabs-os:spark-xml-utils:1.6.0
scala>…
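The PySpark equivalent that usually gets cited reaches through the private _jsc gateway (internals, not a public API):

# listJars() lives on the Scala SparkContext; _jsc is PySpark internals.
print(spark.sparkContext._jsc.sc().listJars())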
I have an issue when I create a Dataproc custom image and use it with PySpark.
My custom image is based on Dataproc 1.4.1-debian9, and with my initialisation script I install Python 3 and some packages from a requirements.txt file, then set the python3 env…
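For context, these are the Spark properties that usually pin the interpreter on the cluster side (standard Spark configuration keys; the paths are placeholders for whatever the image installs), set in spark-defaults.conf or passed as cluster properties:

spark.pyspark.python         /usr/bin/python3
spark.pyspark.driver.python  /usr/bin/python3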