Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
7
votes
3 answers

Count Non Null values in column in PySpark

I have a dataframe which contains null values: from pyspark.sql import functions as F df = spark.createDataFrame( [(125, '2012-10-10', 'tv'), (20, '2012-10-10', 'phone'), (40, '2012-10-10', 'tv'), (None, '2012-10-10', 'tv')], …
newleaf
  • 2,257
  • 8
  • 32
  • 52
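A minimal sketch of one approach: F.count(column) skips nulls, so comparing it with a total row count gives the non-null count. The column names below are assumptions, since the excerpt does not show them.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Data from the excerpt; the column names are made up for illustration
df = spark.createDataFrame(
    [(125, '2012-10-10', 'tv'), (20, '2012-10-10', 'phone'),
     (40, '2012-10-10', 'tv'), (None, '2012-10-10', 'tv')],
    ['amount', 'date', 'device'])

# F.count(col) ignores nulls; F.count(F.lit(1)) counts all rows
df.select(
    F.count('amount').alias('non_null_amount'),
    F.count(F.lit(1)).alias('total_rows'),
).show()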
7
votes
3 answers

How to change case of whole pyspark dataframe to lower or upper

I am trying to apply the pyspark sql functions hash algorithm to every row in two dataframes to identify the differences. The hash algorithm is case sensitive, i.e. if a column contains 'APPLE' and 'Apple', they are considered two different values, so I want to…
Jack
  • 957
  • 3
  • 10
  • 23
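One common way to handle this, sketched below: lower-case every column with a list comprehension before hashing, so that 'APPLE' and 'Apple' hash identically. The sample data and column names are made up for illustration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('APPLE', 'Red'), ('Apple', 'RED')], ['fruit', 'colour'])

# Lower-case every column, then hash the whole row
lowered = df.select([F.lower(F.col(c)).alias(c) for c in df.columns])
lowered.withColumn('row_hash', F.hash(*lowered.columns)).show()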
7
votes
2 answers

How to read spark table back again in a new spark session?

I can read the table just after it is created, but how do I read it again in another Spark session? Given code: spark = SparkSession \ .builder \ .getOrCreate() df = spark.read.parquet("examples/src/main/resources/users.parquet") (df .write …
petertc
  • 3,607
  • 1
  • 31
  • 36
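A sketch of the usual approach: persist the DataFrame as a metastore-backed table with saveAsTable, then look it up with spark.table in the later session. This assumes both sessions point at the same warehouse directory/metastore; the table name 'users' is arbitrary.

from pyspark.sql import SparkSession

# First session: save as a persistent table rather than a temp view
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet("examples/src/main/resources/users.parquet")
df.write.mode("overwrite").saveAsTable("users")
spark.stop()

# Later session: the table's metadata lives in the metastore, so it can be read back
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.table("users").show()   # or spark.sql("SELECT * FROM users")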
7
votes
3 answers

Add a new key/value pair to a Spark MapType column

I have a Dataframe with a MapType field. >>> from pyspark.sql.functions import * >>> from pyspark.sql.types import * >>> fields = StructType([ ... StructField('timestamp', TimestampType(), True), ... StructField('other_field', …
zemekeneng
  • 1,660
  • 2
  • 15
  • 26
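A sketch, assuming Spark 2.4+ where map_concat is available (older versions need a UDF); the key and value shown are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, {'a': 'b'})], ['id', 'other_field'])

# Merge the existing map with a one-entry map built by create_map
df = df.withColumn(
    'other_field',
    F.map_concat('other_field', F.create_map(F.lit('new_key'), F.lit('new_value'))))
df.show(truncate=False)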
7
votes
2 answers

Spark SQL Split with Period (.)

I encountered a problem in Spark 2.2 while using pyspark sql: I tried to split a column on a period (.) and it did not behave well even after providing escape chars: >>> spark.sql("select split('a.aaa','.')").show() +---------------+ |split(a.aaa,…
some_user
  • 315
  • 2
  • 14
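The second argument of split() is a Java regular expression, so a literal dot has to be escaped as \. — a sketch of both the DataFrame-API and SQL forms:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# DataFrame API: the pattern is passed straight through as a regex
spark.range(1).select(F.split(F.lit('a.aaa'), r'\.').alias('parts')).show()

# SQL: the string literal itself also unescapes backslashes, so double them
spark.sql(r"SELECT split('a.aaa', '\\.') AS parts").show()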
7
votes
1 answer

pyspark, referencing the outer query are not supported outside of WHERE

I need to join 2 tables in pyspark, and to do this join not on an exact value from the right table but on the nearest value (as there is no exact match). It works fine in regular SQL, but does not work in Spark SQL. I am using Spark 2.2.1. In regular SQL: SELECT…
Gary Marten
  • 71
  • 1
  • 2
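Spark 2.2 does not support correlated subqueries outside WHERE/HAVING, so a common workaround is a non-equi join plus a window that keeps the closest match. The tables and columns below are hypothetical stand-ins for the ones in the question:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, 10), (2, 25)], ['id', 'ts'])
right = spark.createDataFrame([(8, 'a'), (22, 'b'), (30, 'c')], ['ts', 'value'])

# Cross join, then keep the right-hand row with the smallest distance per id
joined = left.crossJoin(right.withColumnRenamed('ts', 'right_ts'))
w = Window.partitionBy('id').orderBy(F.abs(F.col('ts') - F.col('right_ts')))
nearest = joined.withColumn('rn', F.row_number().over(w)).where('rn = 1').drop('rn')
nearest.show()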
7
votes
1 answer

Fuzzy matching a word inside a pyspark dataframe string

I have some data in which column 'X' contains strings. I am writing a function, using pyspark, where a search_word is passed and all rows which do not contain the substring search_word within the column 'X' string are filtered out. The function must…
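For the plain substring case, a sketch of such a filter (true fuzzy matching would need something like F.levenshtein or a UDF); the sample data is made up:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('the quick brown fox',), ('lazy dog',)], ['X'])

def filter_by_word(df, search_word):
    # Keep rows whose column 'X' contains the search word, case-insensitively
    return df.where(F.lower(F.col('X')).contains(search_word.lower()))

filter_by_word(df, 'Fox').show()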
7
votes
7 answers

How to convert empty arrays to nulls?

I have the below dataframe and I need to convert empty arrays to null. +----+---------+-----------+ | id|count(AS)|count(asdr)| +----+---------+-----------+ |1110| [12, 45]| [50, 55]| |1111| []| []| |1112| [45, 46]| [50, 50]|…
Alice
  • 165
  • 2
  • 4
  • 13
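A sketch of one way to do it: when() with no otherwise() yields NULL, so any array whose size() is zero becomes null. Simplified column names are used in place of the ones in the excerpt:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1110, [12, 45], [50, 55]), (1111, [], []), (1112, [45, 46], [50, 50])],
    ['id', 'count_as', 'count_asdr'])

# Arrays with size 0 fall through to NULL because no otherwise() is given
for c in ['count_as', 'count_asdr']:
    df = df.withColumn(c, F.when(F.size(df[c]) > 0, df[c]))
df.show()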
7
votes
1 answer

Why does Spark application fail with "IOException: (null) entry in command string: null chmod 0644"?

I'm trying to write the dataset results into a single CSV using the below Java code: dataset.write().mode(SaveMode.Overwrite).option("header",true).csv("C:\\tmp\\csvs"); But it times out and the file is not written. Throws…
John Humanyun
  • 915
  • 3
  • 10
  • 25
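On Windows this error usually means Hadoop's winutils.exe is not available to Spark. A sketch (in pyspark for consistency with the rest of the page; the install path is hypothetical) of pointing HADOOP_HOME at a winutils build before the session starts:

import os

# Hypothetical winutils install dir; %HADOOP_HOME%\bin must contain winutils.exe
os.environ['HADOOP_HOME'] = r'C:\hadoop'
os.environ['PATH'] += os.pathsep + r'C:\hadoop\bin'

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.range(3).write.mode('overwrite').option('header', True).csv(r'C:\tmp\csvs')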
7
votes
1 answer

How to transpose/pivot the rows data to column in Spark Scala?

I am new to Spark SQL. I have information in a Spark Dataframe like this: Company Type Status A X done A Y done A Z done C X done C Y done B Y done I want it to be displayed like the…
Vikrant Sonawane
  • 207
  • 1
  • 5
  • 15
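The question is in Scala, but the same groupBy/pivot API exists in both languages; a pyspark sketch with the data from the excerpt:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('A', 'X', 'done'), ('A', 'Y', 'done'), ('A', 'Z', 'done'),
     ('C', 'X', 'done'), ('C', 'Y', 'done'), ('B', 'Y', 'done')],
    ['Company', 'Type', 'Status'])

# One row per Company, one column per distinct Type
df.groupBy('Company').pivot('Type').agg(F.first('Status')).show()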
7
votes
1 answer

Does Spark do one pass through the data for multiple withColumn?

Does Spark do one or multiple passes through data when multiple withColumn functions are chained? For example: val dfnew = df.withColumn("newCol1", f1(col("a"))) .withColumn("newCol2", f2(col("b"))) .withColumn("newCol3",…
astro_asz
  • 2,278
  • 3
  • 15
  • 31
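A quick way to check is to look at the optimized plan: Catalyst collapses adjacent projections, so independent withColumn calls typically end up in a single Project over one scan. A sketch, with trivial expressions standing in for f1/f2/f3:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['a', 'b', 'c'])

dfnew = (df
         .withColumn('newCol1', F.col('a') + 1)
         .withColumn('newCol2', F.col('b') * 2)
         .withColumn('newCol3', F.col('c') - 1))

# The optimized logical plan shows one Project containing all three expressions
dfnew.explain(True)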
7
votes
0 answers

Timeout of PySpark countApprox() is not working

I'm working with Pyspark and Dataframes and I would like to know approximately whether a Dataframe is greater than some size. I'm trying to use the countApprox() function: df.rdd.countApprox(1000, 0.5) But it seems that in PySpark the timeout is not working.…
Javier Montón
  • 4,601
  • 3
  • 21
  • 29
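If the goal is only to know whether the DataFrame is bigger than some threshold, one hedged workaround is to count a limited slice instead of the whole thing, which bounds the work Spark has to do:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)

def has_more_than(df, n):
    # Counts at most n + 1 rows instead of counting the whole DataFrame
    return df.limit(n + 1).count() > n

print(has_more_than(df, 1000))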
7
votes
3 answers

What is the correct way to sum different dataframe columns in a list in pyspark?

I want to sum different columns in a spark dataframe. Code from pyspark.sql import functions as F cols = ["A.p1","B.p1"] df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols) # 1. Works df = df.withColumn('sum1', sum([df[col] for col in…
GeorgeOfTheRF
  • 8,244
  • 23
  • 57
  • 80
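A sketch of the row-wise approach: combine the Column objects with +, e.g. via functools.reduce (F.sum is an aggregate over rows, which is why it behaves differently), and backtick-quote the names since they contain dots:

from functools import reduce
from operator import add
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# Backticks stop the dots from being parsed as struct field access
sum_expr = reduce(add, [F.col("`{}`".format(c)) for c in cols])
df.withColumn('sum1', sum_expr).show()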
7
votes
2 answers

spark inconsistency when running count command

A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.: imp_sample.where(col("location").isNotNull()).count() And I am getting slightly different results every time I…
user3245256
  • 1,842
  • 4
  • 24
  • 51
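One common cause (an assumption, since the excerpt does not show the upstream code) is a non-deterministic transformation such as sample() without a fixed seed being re-evaluated on every action; fixing the seed and caching pins a single result:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
base = spark.range(100000).withColumn('location', F.when(F.rand() > 0.1, F.col('id')))

# With a fixed seed and a cache, repeated counts agree
imp_sample = base.sample(False, 0.5, 42).cache()
imp_sample.count()  # materialize the cache
print(imp_sample.where(F.col('location').isNotNull()).count())
print(imp_sample.where(F.col('location').isNotNull()).count())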
7
votes
2 answers

IF Statement Pyspark

My data looks like the following: +----------+-------------+-------+--------------------+--------------+---+ |purch_date| purch_class|tot_amt| serv-provider|purch_location|…
Bisbot
  • 127
  • 2
  • 3
  • 9
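Row-level conditionals in pyspark are written with when()/otherwise() rather than a Python if; a sketch using a couple of the columns named in the excerpt (the values and the threshold are made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('2012-10-10', 'grocery', 25.0), ('2012-10-11', 'fuel', 60.0)],
    ['purch_date', 'purch_class', 'tot_amt'])

# Equivalent of an IF/ELSE applied to every row
df.withColumn('amt_band',
              F.when(F.col('tot_amt') >= 50, 'high').otherwise('low')).show()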