Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
7 votes • 4 answers

PySpark remove special characters in all column names for all special characters

I am trying to remove all special characters from all the columns. I am using the following commands: import pyspark.sql.functions as F df_spark = spark_df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns]) df_spark1 =…
user13766314
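
A common approach is to sanitize every name with a regex and rename all columns in one pass; a minimal sketch (the frame and its column names are invented for illustration):

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame whose column names contain spaces and punctuation
df = spark.createDataFrame([(1, 2)], ["col one!", "col#two"])

# Keep letters, digits, and underscores; replace everything else with "_"
clean = [re.sub(r"[^0-9a-zA-Z_]", "_", c) for c in df.columns]
df = df.toDF(*clean)  # positional rename, no need to quote odd names
df.printSchema()
```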
7 votes • 1 answer

Job 65 cancelled because SparkContext was shut down

I'm working on a shared Apache Zeppelin server. Almost every day, I try to run a command and get this error: Job 65 cancelled because SparkContext was shut down I would love to learn more about what causes the SparkContext to shut down. My…
Cauder • 2,157 • 4 • 30 • 69
7 votes • 1 answer

Custom sorting in pyspark dataframes

Are there any recommended methods for implementing custom sort ordering for categorical data in pyspark? I'm ideally looking for the functionality the pandas categorical data type offers. So, given a dataset with a Speed column, the possible…
Daveed • 149 • 2 • 8
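
One way to emulate pandas' ordered categoricals is to map each category to its rank and sort on that rank; a sketch where the Speed categories and their order are assumptions:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; the category names and ordering are invented
df = spark.createDataFrame([("medium",), ("slow",), ("fast",)], ["Speed"])
order = ["slow", "medium", "fast"]

# Chain when() clauses to turn the desired ordering into a numeric rank
rank = F.when(F.col("Speed") == order[0], 0)
for i, cat in enumerate(order[1:], start=1):
    rank = rank.when(F.col("Speed") == cat, i)

df.withColumn("_rank", rank).orderBy("_rank").drop("_rank").show()
```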
7 votes • 2 answers

How to execute a stored procedure in Azure Databricks PySpark?

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried. #initialize pyspark import…
Ajay • 247 • 1 • 5 • 15
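
spark.read.jdbc only runs queries, so one workaround is to open a plain JDBC connection through the Py4J gateway and call the procedure directly; a sketch, where the URL, credentials, and procedure name are placeholders and the SQL Server JDBC driver is assumed to be on the cluster classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection string
jdbc_url = ("jdbc:sqlserver://<server>.database.windows.net:1433;"
            "database=<db>;user=<user>;password=<password>")

# Reuse the JVM's JDBC driver via the Py4J gateway
conn = spark.sparkContext._gateway.jvm.java.sql.DriverManager.getConnection(jdbc_url)
try:
    stmt = conn.prepareCall("{call dbo.my_procedure(?)}")  # hypothetical procedure
    stmt.setInt(1, 42)
    stmt.execute()
finally:
    conn.close()
```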
7 votes • 3 answers

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from Kafka using PySpark. I am using Spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just started ZooKeeper and Kafka and created a new topic: /usr/local/kafka/bin/zookeeper-server-start.sh…
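
This error usually means the Structured Streaming Kafka package is missing: the "kafka" source lives in spark-sql-kafka-0-10, not spark-streaming-kafka. A sketch of pulling it in at session start, with the version matched to the asker's 3.0.0-preview2 build and broker/topic names invented:

```python
from pyspark.sql import SparkSession

# The artifact version must match the Spark build (assumed 3.0.0-preview2 here)
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview2")
    .getOrCreate()
)

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "my_topic")                      # placeholder topic
    .load()
)
```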
7 votes • 1 answer

Spark SQL - Regex for matching only numbers

I am trying to make sure that a particular column in a dataframe does not contain any illegal values (non-numerical data). For this purpose I am trying to use regex matching with rlike to collect illegal values in the data: I need to collect…
Hemanth • 705 • 2 • 16 • 32
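
Anchoring the pattern and negating rlike is one way to collect the offending rows; a minimal sketch with an invented column:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: "abc" and "12a" are illegal, "123" is fine
df = spark.createDataFrame([("123",), ("abc",), ("12a",)], ["value"])

# ^ and $ anchor the regex so rlike must match the whole string
illegal = df.filter(~F.col("value").rlike(r"^[0-9]+$"))
illegal.show()
```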
7 votes • 2 answers

Append to PySpark array column

I want to check if the column values are within some boundaries. If they are not I will append some value to the array column "F". This is the code I have so far: df = spark.createDataFrame( [ (1, 56), (2, 32), (3, 99) …
Lossa • 341 • 2 • 3 • 9
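
In Spark 2.4+ one way is concat on array columns wrapped in when/otherwise; a sketch with invented bounds and an invented initial value for "F":

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 56), (2, 32), (3, 99)], ["id", "value"])
df = df.withColumn("F", F.array(F.lit("init")))  # assumed starting array

# Append a flag to F only when value falls outside assumed bounds [40, 90]
df = df.withColumn(
    "F",
    F.when(
        (F.col("value") < 40) | (F.col("value") > 90),
        F.concat(F.col("F"), F.array(F.lit("out_of_range"))),
    ).otherwise(F.col("F")),
)
df.show()
```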
7 votes • 1 answer

Using partitions (with partitionBy) when writing a delta lake has no effect

When I initially write a Delta Lake table, using partitions (with partitionBy) or not makes no difference. Using a repartition on the same column before writing only changes the number of Parquet files. Making the column to partition explicitly…
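
For reference, partitionBy on a Delta write should produce one sub-directory per distinct value; a sketch with an invented path and column, noting that overwriting an existing table keeps its old partitioning unless the schema overwrite option is set:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Delta Lake package is available

df = spark.createDataFrame([("DE", 1), ("FR", 2)], ["country", "value"])

(df.write
   .format("delta")
   .partitionBy("country")             # expect country=DE/ and country=FR/ dirs
   .option("overwriteSchema", "true")  # needed to change an existing table's layout
   .mode("overwrite")
   .save("/tmp/delta/events"))         # placeholder path
```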
7 votes • 2 answers

spark 2.4 Parquet column cannot be converted in file, Column: [Impressions], Expected: bigint, Found: BINARY

I'm facing a weird issue that I cannot understand. I have source data with a column "Impressions" that is sometimes a bigint and sometimes a string (when I manually explore the data). The Hive schema registered for this column is Long. Thus, when…
Jay Cee • 1,855 • 5 • 28 • 48
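
When some files' physical type disagrees with the registered Hive schema, one workaround is to read the Parquet files directly as string and cast afterwards; a sketch with an invented path, assuming the problem files store the column as strings:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the files directly, bypassing the Hive table's registered schema
df = spark.read.schema("Impressions STRING").parquet("/data/impressions/")  # placeholder path

# Cast once the raw values are in hand; unparseable values become NULL
df = df.withColumn("Impressions", F.col("Impressions").cast("bigint"))
```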
7 votes • 1 answer

How to use date_add with two columns in pyspark?

I have a dataframe with some columns: +------------+--------+----------+----------+ |country_name| ID_user|birth_date| psdt| +------------+--------+----------+----------+ | Россия|16460783| 486|1970-01-01| | Россия|16467391| …
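
In Spark 2.x the Python date_add helper only accepts a literal day count, so the usual workaround is an SQL expression, where both arguments can be columns; a sketch built on the asker's column names:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal stand-in for the asker's frame: day offsets plus a base date
df = spark.createDataFrame([(486, "1970-01-01")], ["birth_date", "psdt"])

# expr() lets date_add take a column for the number of days
df = df.withColumn("birthday", F.expr("date_add(to_date(psdt), birth_date)"))
df.show()
```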
7 votes • 2 answers

Pyspark from Spark installation VS Pyspark python package

I just started learning Spark, and I'm a bit confused by this concept: from the Spark installation we get pyspark under the Spark installation sub-folders, and I understand it's a shell; from the Python package we can also install the…
JYBLTN • 71 • 5
7 votes • 0 answers

Incorrect nullability of column after saving pyspark dataframe

When saving a pyspark dataframe with a new column added with 'withColumn' function, the nullability changes from false to true. Version info : Python 3.7.3/Spark2.4.0-cdh6.1.1 >>> l = [('Alice', 1)] >>> df = spark.createDataFrame(l) >>>…
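
If strict nullability matters after a round trip, one way to reimpose it is to rebuild the DataFrame against an explicit schema; a sketch, with the field names and types assumed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Alice", 1)])
print(df.schema)  # inferred nullability may not be what you declared

# Explicit schema carrying the desired nullable flags
schema = StructType([
    StructField("name", StringType(), nullable=True),
    StructField("age", LongType(), nullable=False),
])

# Rebuilding from the RDD makes Spark adopt the declared nullability
strict_df = spark.createDataFrame(df.rdd, schema)
print(strict_df.schema)
```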
7 votes • 0 answers

Intermittently getting can not create the managed table error while creating table from spark

We are intermittently facing the error below in Spark 2.4 when saving a managed table from Spark. Error - pyspark.sql.utils.AnalysisException: u"Can not create the managed table('hive_issue.table'). The associated…
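
In Spark 2.4 this error is typically raised because the table's location already exists on disk; a sketch of the legacy flag that restores the pre-2.4 behaviour (enabling it trades safety for convenience, and the table name below is taken from the error message):

```python
from pyspark.sql import SparkSession

# Spark 2.4 refuses to create a managed table over a non-empty location;
# this legacy flag restores the older, permissive behaviour
spark = (
    SparkSession.builder
    .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
            "true")
    .getOrCreate()
)

df = spark.createDataFrame([(1,)], ["id"])
df.write.mode("overwrite").saveAsTable("hive_issue.table")
```

Alternatively, delete the stale warehouse directory left behind by a previous failed write before calling saveAsTable.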
7 votes • 0 answers

How to fix "BlockManagerMasterEndpoint - No more replicas available for rdd" issue?

I am using spark 2.4.1 version and java8 to copy data into cassandra-3.0. My spark job script is $SPARK_HOME/bin/spark-submit \ --master yarn \ --deploy-mode cluster \ --name MyDriver \ --jars "/local/jars/*.jar" \ --files…
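
The message generally means the executors holding the only copy of a cached block were lost; one common mitigation (an assumption about the job, not a guaranteed fix) is a storage level that can spill to disk or keep a second replica:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10**6)  # stand-in for the real dataset

# MEMORY_AND_DISK lets evicted blocks be read back from disk instead of
# disappearing; MEMORY_AND_DISK_2 would also keep a replica on a second node
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # materialize the cache
```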
7 votes • 2 answers

Best practice for feeding spark dataframes for training Tensorflow network

I want to feed data coming from spark clusters to train a deep network. I do not have GPUs in the nodes, so distributed TensorFlow or packages like elephas are not an option. I have come up with the following generator which does the job. It just…
Hamed • 474 • 5 • 17
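
A driver-side generator over toLocalIterator is one simple pattern: it streams one partition at a time instead of collecting the whole frame; a sketch, with the feature layout and batch size assumed:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the real training frame: two features plus a label
df = spark.createDataFrame([(0.1, 0.2, 0.0), (0.3, 0.4, 1.0)], ["f1", "f2", "label"])

def batches(df, batch_size=128):
    """Yield (features, labels) numpy batches without collecting the frame."""
    buf = []
    for row in df.toLocalIterator():  # pulls one partition at a time
        buf.append((row["f1"], row["f2"], row["label"]))
        if len(buf) == batch_size:
            arr = np.asarray(buf)
            yield arr[:, :2], arr[:, 2]
            buf = []
    if buf:
        arr = np.asarray(buf)
        yield arr[:, :2], arr[:, 2]

# The generator can feed model.fit batch by batch or be wrapped in
# tf.data.Dataset.from_generator on the driver.
```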