Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
7 votes • 4 answers

PySpark remove special characters in all column names for all special characters

I am trying to remove all special characters from all the columns. I am using the following commands: import pyspark.sql.functions as F df_spark = spark_df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns]) df_spark1 =…
user13766314
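
A common approach is to sanitize every name with a regex and rename all columns in one pass; a minimal sketch (the frame and its column names are invented for illustration):

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame whose column names contain spaces and punctuation
df = spark.createDataFrame([(1, 2)], ["col one!", "col#two"])

# Keep letters, digits, and underscores; replace everything else with "_"
clean = [re.sub(r"[^0-9a-zA-Z_]", "_", c) for c in df.columns]
df = df.toDF(*clean)  # positional rename, no need to quote odd names
df.printSchema()
```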
7 votes • 1 answer

Job 65 cancelled because SparkContext was shut down

I'm working on a shared Apache Zeppelin server. Almost every day, I try to run a command and get this error: Job 65 cancelled because SparkContext was shut down I would love to learn more about what causes the SparkContext to shut down. My…
Cauder • 2,157 • 4 • 30 • 69
7 votes • 1 answer

Custom sorting in pyspark dataframes

Are there any recommended methods for implementing custom sort ordering for categorical data in pyspark? I'm ideally looking for the functionality the pandas categorical data type offers. So, given a dataset with a Speed column, the possible…
Daveed • 149 • 2 • 8
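
One way to emulate pandas' ordered categoricals is to map each category to its rank and sort on that rank; a sketch where the Speed categories and their order are assumptions:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; the category names and ordering are invented
df = spark.createDataFrame([("medium",), ("slow",), ("fast",)], ["Speed"])
order = ["slow", "medium", "fast"]

# Chain when() clauses to turn the desired ordering into a numeric rank
rank = F.when(F.col("Speed") == order[0], 0)
for i, cat in enumerate(order[1:], start=1):
    rank = rank.when(F.col("Speed") == cat, i)

df.withColumn("_rank", rank).orderBy("_rank").drop("_rank").show()
```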
7 votes • 2 answers

How to execute a stored procedure in Azure Databricks PySpark?

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried. #initialize pyspark import…
Ajay • 247 • 1 • 5 • 15
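
spark.read.jdbc only runs queries, so one workaround is to open a plain JDBC connection through the Py4J gateway and call the procedure directly; a sketch, where the URL, credentials, and procedure name are placeholders and the SQL Server JDBC driver is assumed to be on the cluster classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection string
jdbc_url = ("jdbc:sqlserver://<server>.database.windows.net:1433;"
            "database=<db>;user=<user>;password=<password>")

# Reuse the JVM's JDBC driver via the Py4J gateway
conn = spark.sparkContext._gateway.jvm.java.sql.DriverManager.getConnection(jdbc_url)
try:
    stmt = conn.prepareCall("{call dbo.my_procedure(?)}")  # hypothetical procedure
    stmt.setInt(1, 42)
    stmt.execute()
finally:
    conn.close()
```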
7 votes • 3 answers

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from Kafka using PySpark. I am using Spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just started ZooKeeper and Kafka and created a new topic: /usr/local/kafka/bin/zookeeper-server-start.sh…
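
This error usually means the Structured Streaming Kafka package is missing: the "kafka" source lives in spark-sql-kafka-0-10, not spark-streaming-kafka. A sketch of pulling it in at session start, with the version matched to the asker's 3.0.0-preview2 build and broker/topic names invented:

```python
from pyspark.sql import SparkSession

# The artifact version must match the Spark build (assumed 3.0.0-preview2 here)
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview2")
    .getOrCreate()
)

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "my_topic")                      # placeholder topic
    .load()
)
```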
7 votes • 1 answer

Spark SQL - Regex for matching only numbers

I am trying to make sure that a particular column in a dataframe does not contain any illegal values (non-numerical data). For this purpose I am trying to use regex matching with rlike to collect illegal values in the data: I need to collect…
Hemanth • 705 • 2 • 16 • 32
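
Anchoring the pattern and negating rlike is one way to collect the offending rows; a minimal sketch with an invented column:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: "abc" and "12a" are illegal, "123" is fine
df = spark.createDataFrame([("123",), ("abc",), ("12a",)], ["value"])

# ^ and $ anchor the regex so rlike must match the whole string
illegal = df.filter(~F.col("value").rlike(r"^[0-9]+$"))
illegal.show()
```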
7 votes • 2 answers

Append to PySpark array column

I want to check if the column values are within some boundaries. If they are not I will append some value to the array column "F". This is the code I have so far: df = spark.createDataFrame( [ (1, 56), (2, 32), (3, 99) …
Lossa • 341 • 2 • 3 • 9
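
In Spark 2.4+ one way is concat on array columns wrapped in when/otherwise; a sketch with invented bounds and an invented initial value for "F":

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 56), (2, 32), (3, 99)], ["id", "value"])
df = df.withColumn("F", F.array(F.lit("init")))  # assumed starting array

# Append a flag to F only when value falls outside assumed bounds [40, 90]
df = df.withColumn(
    "F",
    F.when(
        (F.col("value") < 40) | (F.col("value") > 90),
        F.concat(F.col("F"), F.array(F.lit("out_of_range"))),
    ).otherwise(F.col("F")),
)
df.show()
```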
7 votes • 1 answer

Using partitions (with partitionBy) when writing a delta lake has no effect

When I initially write a Delta Lake table, using partitions (with partitionBy) or not makes no difference. Using a repartition on the same column before writing only changes the number of Parquet files. Making the column to partition explicitly…
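
For reference, partitionBy on a Delta write should produce one sub-directory per distinct value; a sketch with an invented path and column, noting that overwriting an existing table keeps its old partitioning unless the schema overwrite option is set:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Delta Lake package is available

df = spark.createDataFrame([("DE", 1), ("FR", 2)], ["country", "value"])

(df.write
   .format("delta")
   .partitionBy("country")             # expect country=DE/ and country=FR/ dirs
   .option("overwriteSchema", "true")  # needed to change an existing table's layout
   .mode("overwrite")
   .save("/tmp/delta/events"))         # placeholder path
```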
7 votes • 2 answers

spark 2.4 Parquet column cannot be converted in file, Column: [Impressions], Expected: bigint, Found: BINARY

I'm facing a weird issue that I cannot understand. I have source data with a column "Impressions" that is sometimes a bigint and sometimes a string (when I manually explore the data). The Hive schema registered for this column is Long. Thus, when…
Jay Cee • 1,855 • 5 • 28 • 48
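
When some files' physical type disagrees with the registered Hive schema, one workaround is to read the Parquet files directly as string and cast afterwards; a sketch with an invented path, assuming the problem files store the column as strings:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the files directly, bypassing the Hive table's registered schema
df = spark.read.schema("Impressions STRING").parquet("/data/impressions/")  # placeholder path

# Cast once the raw values are in hand; unparseable values become NULL
df = df.withColumn("Impressions", F.col("Impressions").cast("bigint"))
```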
7 votes • 1 answer

How to use date_add with two columns in pyspark?

I have a dataframe with some columns: +------------+--------+----------+----------+ |country_name| ID_user|birth_date| psdt| +------------+--------+----------+----------+ | Россия|16460783| 486|1970-01-01| | Россия|16467391| …
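
In Spark 2.x the Python date_add helper only accepts a literal day count, so the usual workaround is an SQL expression, where both arguments can be columns; a sketch built on the asker's column names:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal stand-in for the asker's frame: day offsets plus a base date
df = spark.createDataFrame([(486, "1970-01-01")], ["birth_date", "psdt"])

# expr() lets date_add take a column for the number of days
df = df.withColumn("birthday", F.expr("date_add(to_date(psdt), birth_date)"))
df.show()
```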
7 votes • 2 answers

Pyspark from Spark installation VS Pyspark python package

I just started learning Spark, and I'm a bit confused by this concept: from the Spark installation we get pyspark under the Spark installation sub-folders, and I understand it's a shell; from the Python package we can also install the…
JYBLTN • 71 • 5
7 votes • 0 answers

Incorrect nullability of column after saving pyspark dataframe

When saving a pyspark dataframe with a new column added with 'withColumn' function, the nullability changes from false to true. Version info : Python 3.7.3/Spark2.4.0-cdh6.1.1 >>> l = [('Alice', 1)] >>> df = spark.createDataFrame(l) >>>…
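
If strict nullability matters after a round trip, one way to reimpose it is to rebuild the DataFrame against an explicit schema; a sketch, with the field names and types assumed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Alice", 1)])
print(df.schema)  # inferred nullability may not be what you declared

# Explicit schema carrying the desired nullable flags
schema = StructType([
    StructField("name", StringType(), nullable=True),
    StructField("age", LongType(), nullable=False),
])

# Rebuilding from the RDD makes Spark adopt the declared nullability
strict_df = spark.createDataFrame(df.rdd, schema)
print(strict_df.schema)
```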
7 votes • 0 answers

Intermittently getting can not create the managed table error while creating table from spark

We are intermittently facing the error below in Spark 2.4 when saving a managed table from Spark. Error - pyspark.sql.utils.AnalysisException: u"Can not create the managed table('hive_issue.table'). The associated…
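
In Spark 2.4 this error is typically raised because the table's location already exists on disk; a sketch of the legacy flag that restores the pre-2.4 behaviour (enabling it trades safety for convenience, and the table name below is taken from the error message):

```python
from pyspark.sql import SparkSession

# Spark 2.4 refuses to create a managed table over a non-empty location;
# this legacy flag restores the older, permissive behaviour
spark = (
    SparkSession.builder
    .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
            "true")
    .getOrCreate()
)

df = spark.createDataFrame([(1,)], ["id"])
df.write.mode("overwrite").saveAsTable("hive_issue.table")
```

Alternatively, delete the stale warehouse directory left behind by a previous failed write before calling saveAsTable.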
7 votes • 0 answers

How to fix "BlockManagerMasterEndpoint - No more replicas available for rdd" issue?

I am using spark 2.4.1 version and java8 to copy data into cassandra-3.0. My spark job script is $SPARK_HOME/bin/spark-submit \ --master yarn \ --deploy-mode cluster \ --name MyDriver \ --jars "/local/jars/*.jar" \ --files…
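
The message generally means the executors holding the only copy of a cached block were lost; one common mitigation (an assumption about the job, not a guaranteed fix) is a storage level that can spill to disk or keep a second replica:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10**6)  # stand-in for the real dataset

# MEMORY_AND_DISK lets evicted blocks be read back from disk instead of
# disappearing; MEMORY_AND_DISK_2 would also keep a replica on a second node
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # materialize the cache
```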
7 votes • 2 answers

Best practice for feeding spark dataframes for training Tensorflow network

I want to feed data coming from spark clusters to train a deep network. I do not have GPUs in the nodes, so distributed TensorFlow or packages like elephas are not an option. I have come up with the following generator which does the job. It just…
Hamed • 474 • 5 • 17
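
A driver-side generator over toLocalIterator is one simple pattern: it streams one partition at a time instead of collecting the whole frame; a sketch, with the feature layout and batch size assumed:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the real training frame: two features plus a label
df = spark.createDataFrame([(0.1, 0.2, 0.0), (0.3, 0.4, 1.0)], ["f1", "f2", "label"])

def batches(df, batch_size=128):
    """Yield (features, labels) numpy batches without collecting the frame."""
    buf = []
    for row in df.toLocalIterator():  # pulls one partition at a time
        buf.append((row["f1"], row["f2"], row["label"]))
        if len(buf) == batch_size:
            arr = np.asarray(buf)
            yield arr[:, :2], arr[:, 2]
            buf = []
    if buf:
        arr = np.asarray(buf)
        yield arr[:, :2], arr[:, 2]

# The generator can feed model.fit batch by batch or be wrapped in
# tf.data.Dataset.from_generator on the driver.
```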