I just started learning Spark, and I'm a bit confused by this concept: from the Spark installation we get pyspark under the installation's sub-folders, which I understand is a shell, and from the Python package we can also install the…
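For reference, a minimal sketch assuming the pip-installed package (the app name is illustrative): the pyspark script bundled with the installation is the interactive shell, while the pip package exposes the same library to ordinary Python scripts.

from pyspark.sql import SparkSession

# After `pip install pyspark`, the same session that bin/pyspark
# pre-creates as `spark` can be built from a plain script.
spark = SparkSession.builder.appName('demo').getOrCreate()
print(spark.version)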
When saving a PySpark dataframe with a new column added with the withColumn function, the nullability changes from false to true.
Version info: Python 3.7.3 / Spark 2.4.0-cdh6.1.1
>>> l = [('Alice', 1)]
>>> df = spark.createDataFrame(l)
>>>…
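Continuing the truncated snippet, a hedged sketch of how this behaviour typically shows up (the lit value and output path are mine): Spark infers a literal column as non-nullable, but Parquet columns are written as nullable for compatibility, so the schema flips on a write/read round trip.

from pyspark.sql.functions import lit

# df as created above; the literal column is inferred non-nullable.
df2 = df.withColumn('age', lit(10))
df2.printSchema()   # age: integer (nullable = false)

# Spark writes Parquet columns as nullable for compatibility, so the
# re-read schema reports nullable = true.
df2.write.mode('overwrite').parquet('/tmp/nullability_demo')
spark.read.parquet('/tmp/nullability_demo').printSchema()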
We are intermittently facing the error below in Spark 2.4 when saving a managed table from Spark.
Error -
pyspark.sql.utils.AnalysisException: u"Can not create the managed table('hive_issue.table'). The associated…
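For context, a commonly cited Spark 2.4-era workaround (not taken from the truncated post above): Spark 2.4 refuses to create a managed table over a leftover location, and shipped a legacy flag to relax that check. The flag below existed in 2.4 but was removed in Spark 3.x, and the table name is illustrative.

# Sketch of the workaround: either clean up the leftover table
# location, or relax the non-empty-location check.
spark.conf.set(
    'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation',
    'true')
df.write.mode('overwrite').saveAsTable('hive_issue.table')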
I am trying to train a model using XGBoost on data I have in Hive. The data is too large and I can't convert it to a pandas DataFrame, so I have to use XGBoost with a Spark DataFrame.
When creating an XGBoostEstimator, an error occurs:
TypeError: 'JavaPackage'…
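For context, the usual reading of that TypeError is that the xgboost4j jars never reached the JVM, so the Python wrapper finds an empty JavaPackage. A hedged sketch of wiring the jars in before the session starts (paths and versions are placeholders):

from pyspark.sql import SparkSession

# The jars must be on the driver and executor classpaths before the
# JVM starts; the paths and versions here are illustrative.
spark = (SparkSession.builder
         .appName('xgboost-train')
         .config('spark.jars',
                 '/opt/jars/xgboost4j-0.90.jar,'
                 '/opt/jars/xgboost4j-spark-0.90.jar')
         .getOrCreate())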
I want to feed data coming from Spark clusters to train a deep network. I do not have GPUs in the nodes, so distributed TensorFlow or packages like elephas are not an option.
I have come up with the following generator which does the job. It just…
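Since the generator itself is truncated above, here is a minimal sketch of one way such a generator is often written (the column names and batch size are mine, not the poster's):

import numpy as np

def batch_generator(df, batch_size=128):
    # Stream rows from the executors to the driver one partition at a
    # time, yielding fixed-size numpy batches for the training loop.
    batch = []
    for row in df.toLocalIterator():
        batch.append([row.feature, row.label])  # illustrative columns
        if len(batch) == batch_size:
            yield np.array(batch)
            batch = []
    if batch:
        yield np.array(batch)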
Spark 2.4 introduced the new SQL function slice, which can be used to extract a certain range of elements from an array column.
I want to define that range dynamically per row, based on an Integer column that has the number of elements I want to pick…
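For reference, the workaround usually given for Spark 2.4 (column names below are placeholders): the Python slice() helper there only accepts literal start and length arguments, but the SQL form accepts columns, so expr() does the trick.

from pyspark.sql.functions import expr

# 'arr' is the array column, 'n' the per-row element count (placeholders).
df.withColumn('first_n', expr('slice(arr, 1, n)'))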
I am loading a dataset from BigQuery and after some transformations, I'd like to save the transformed DataFrame back into BigQuery. Is there a way of doing this?
This is how I am loading the data:
df = spark.read \
.format('bigquery') \
…
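A hedged sketch of the corresponding write with the same spark-bigquery connector (the table and staging bucket are placeholders; the connector stages the data through GCS):

# Mirror of the read above; 'temporaryGcsBucket' is the connector's
# staging option, and the table/bucket names are placeholders.
df.write \
    .format('bigquery') \
    .option('table', 'mydataset.mytable') \
    .option('temporaryGcsBucket', 'my-staging-bucket') \
    .save()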
How can I check the SparkSession status in my PySpark code? The requirement is to check whether the SparkSession is active or not; if it is not active, create another SparkSession and call some function.
I am writing and running this code in…
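A minimal sketch of one common approach (the isStopped check reaches through PySpark's private _jsc handle, so it is internals rather than a public API):

from pyspark.sql import SparkSession

def get_active_session():
    # getOrCreate() returns the existing active session or builds one.
    spark = SparkSession.builder.appName('job').getOrCreate()
    # Optional sanity check via JVM internals (not a public API):
    if spark.sparkContext._jsc.sc().isStopped():
        spark = SparkSession.builder.appName('job').getOrCreate()
    return spark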
I have time series data in CSV from a vehicle with the following information:
trip-id
timestamp
speed
The data looks like this:
trip-id | timestamp | speed
001 | 1538204192 | 44.55
001 | 1538204193 | 47.20 <-- start of brake
001 |…
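As a sketch of one way to detect those braking points (the lag window and the "speed drops from the previous row" rule are my assumptions based on the sample rows above):

from pyspark.sql import functions as F
from pyspark.sql import Window

# Compare each row's speed to the previous row within the same trip.
w = Window.partitionBy('trip-id').orderBy('timestamp')
flagged = (df
           .withColumn('prev_speed', F.lag('speed').over(w))
           .withColumn('is_braking',
                       (F.col('prev_speed') > F.col('speed')).cast('int')))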
I'm writing Spark code in Python. I have a col(execution_date) that is a timestamp. How would I turn that into a column called is_weekend that has a value of 1 if the date is a weekend and 0 if it's a weekday?
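A minimal sketch (note that dayofweek() numbers days 1=Sunday through 7=Saturday in Spark SQL):

from pyspark.sql import functions as F

# 1 = Sunday and 7 = Saturday, so those two values mark the weekend.
df = df.withColumn(
    'is_weekend',
    F.dayofweek(F.col('execution_date')).isin(1, 7).cast('int'))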
I am new to PySpark. I have received a csv file which has around 1000 columns. I am using Databricks. Most of these columns have spaces in between, e.g. "Total Revenue", "Total Age" etc. I need to update all the column names with spaces with…
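Since the replacement character is truncated above, a sketch assuming underscores:

# toDF() renames all columns in one pass; underscore is my assumption.
df = df.toDF(*[c.replace(' ', '_') for c in df.columns])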
I am running into this problem with the Apache Arrow Spark integration.
Using AWS EMR with Spark 2.4.3.
I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine.
I set these in spark-env.sh:
export…
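For context (the poster's exports are truncated above), the Arrow path itself is toggled with a SQL conf; the property name below is the Spark 2.x one, renamed under spark.sql.execution.arrow.pyspark.* in Spark 3.x.

spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
pdf = df.toPandas()  # uses Arrow when the column types are supported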
I have a Spark dataframe that I am trying to push to AWS Elasticsearch, but before that I was testing this sample code snippet to push to ES:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ES_indexer').getOrCreate()
df =…
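For reference, a hedged sketch of how the write usually looks with the elasticsearch-hadoop connector (the endpoint, port, and index are placeholders):

# Requires the elasticsearch-hadoop/elasticsearch-spark jar on the classpath.
df.write \
    .format('org.elasticsearch.spark.sql') \
    .option('es.nodes', 'my-domain.us-east-1.es.amazonaws.com') \
    .option('es.port', '443') \
    .option('es.nodes.wan.only', 'true') \
    .mode('append') \
    .save('my-index/_doc')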
I want to see the jars my Spark context is using.
I found the code in Scala:
$ spark-shell --master=spark://datasci:7077 --jars /opt/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar --packages elsevierlabs-os:spark-xml-utils:1.6.0
scala>…
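The PySpark equivalent that usually gets cited reaches through the private _jsc gateway (internals, not a public API):

# listJars() lives on the Scala SparkContext; _jsc is PySpark internals.
print(spark.sparkContext._jsc.sc().listJars())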
I have an issue when I create a Dataproc custom image and use it with PySpark.
My custom image is based on Dataproc 1.4.1-debian9, and with my initialisation script I install Python 3 and some packages from a requirements.txt file, then set the python3 env…
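For context, these are the Spark properties that usually pin the interpreter on the cluster side (standard Spark configuration keys; the paths are placeholders for whatever the image installs), set in spark-defaults.conf or passed as cluster properties:

spark.pyspark.python         /usr/bin/python3
spark.pyspark.driver.python  /usr/bin/python3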