Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
7
votes
3 answers

Count Non Null values in column in PySpark

I have a dataframe which contains null values: from pyspark.sql import functions as F df = spark.createDataFrame( [(125, '2012-10-10', 'tv'), (20, '2012-10-10', 'phone'), (40, '2012-10-10', 'tv'), (None, '2012-10-10', 'tv')], …
newleaf
  • 2,257
  • 8
  • 32
  • 52
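A minimal sketch of one approach: F.count(column) skips nulls, so comparing it with a total row count gives the non-null count. The column names below are assumptions, since the excerpt does not show them.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Data from the excerpt; the column names are made up for illustration
df = spark.createDataFrame(
    [(125, '2012-10-10', 'tv'), (20, '2012-10-10', 'phone'),
     (40, '2012-10-10', 'tv'), (None, '2012-10-10', 'tv')],
    ['amount', 'date', 'device'])

# F.count(col) ignores nulls; F.count(F.lit(1)) counts all rows
df.select(
    F.count('amount').alias('non_null_amount'),
    F.count(F.lit(1)).alias('total_rows'),
).show()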
7
votes
3 answers

How to change case of whole pyspark dataframe to lower or upper

I am trying to apply the pyspark sql functions hash algorithm to every row in two dataframes to identify the differences. The hash algorithm is case sensitive, i.e. if a column contains 'APPLE' and 'Apple', they are considered two different values, so I want to…
Jack
  • 957
  • 3
  • 10
  • 23
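One common way to handle this, sketched below: lower-case every column with a list comprehension before hashing, so that 'APPLE' and 'Apple' hash identically. The sample data and column names are made up for illustration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('APPLE', 'Red'), ('Apple', 'RED')], ['fruit', 'colour'])

# Lower-case every column, then hash the whole row
lowered = df.select([F.lower(F.col(c)).alias(c) for c in df.columns])
lowered.withColumn('row_hash', F.hash(*lowered.columns)).show()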
7
votes
2 answers

How to read spark table back again in a new spark session?

I can read the table just after it is created, but how do I read it again in another Spark session? Given code: spark = SparkSession \ .builder \ .getOrCreate() df = spark.read.parquet("examples/src/main/resources/users.parquet") (df .write …
petertc
  • 3,607
  • 1
  • 31
  • 36
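A sketch of the usual approach: persist the DataFrame as a metastore-backed table with saveAsTable, then look it up with spark.table in the later session. This assumes both sessions point at the same warehouse directory/metastore; the table name 'users' is arbitrary.

from pyspark.sql import SparkSession

# First session: save as a persistent table rather than a temp view
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet("examples/src/main/resources/users.parquet")
df.write.mode("overwrite").saveAsTable("users")
spark.stop()

# Later session: the table's metadata lives in the metastore, so it can be read back
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.table("users").show()   # or spark.sql("SELECT * FROM users")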
7
votes
3 answers

Add a new key/value pair to a Spark MapType column

I have a Dataframe with a MapType field. >>> from pyspark.sql.functions import * >>> from pyspark.sql.types import * >>> fields = StructType([ ... StructField('timestamp', TimestampType(), True), ... StructField('other_field', …
zemekeneng
  • 1,660
  • 2
  • 15
  • 26
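A sketch, assuming Spark 2.4+ where map_concat is available (older versions need a UDF); the key and value shown are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, {'a': 'b'})], ['id', 'other_field'])

# Merge the existing map with a one-entry map built by create_map
df = df.withColumn(
    'other_field',
    F.map_concat('other_field', F.create_map(F.lit('new_key'), F.lit('new_value'))))
df.show(truncate=False)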
7
votes
2 answers

Spark SQL Split with Period (.)

I encountered a problem in Spark 2.2 while using pyspark sql: I tried to split a column on a period (.) and it did not behave well even after providing escape chars: >>> spark.sql("select split('a.aaa','.')").show() +---------------+ |split(a.aaa,…
some_user
  • 315
  • 2
  • 14
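The second argument of split() is a Java regular expression, so a literal dot has to be escaped as \. — a sketch of both the DataFrame-API and SQL forms:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# DataFrame API: the pattern is passed straight through as a regex
spark.range(1).select(F.split(F.lit('a.aaa'), r'\.').alias('parts')).show()

# SQL: the string literal itself also unescapes backslashes, so double them
spark.sql(r"SELECT split('a.aaa', '\\.') AS parts").show()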
7
votes
1 answer

pyspark, referencing the outer query are not supported outside of WHERE

I need to join 2 tables in pyspark, and to do this join not on an exact value from the right table but on the nearest value (as there is no exact match). It works fine in regular SQL, but does not work in Spark SQL. I am using Spark 2.2.1. In regular SQL: SELECT…
Gary Marten
  • 71
  • 1
  • 2
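Spark 2.2 does not support correlated subqueries outside WHERE/HAVING, so a common workaround is a non-equi join plus a window that keeps the closest match. The tables and columns below are hypothetical stand-ins for the ones in the question:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, 10), (2, 25)], ['id', 'ts'])
right = spark.createDataFrame([(8, 'a'), (22, 'b'), (30, 'c')], ['ts', 'value'])

# Cross join, then keep the right-hand row with the smallest distance per id
joined = left.crossJoin(right.withColumnRenamed('ts', 'right_ts'))
w = Window.partitionBy('id').orderBy(F.abs(F.col('ts') - F.col('right_ts')))
nearest = joined.withColumn('rn', F.row_number().over(w)).where('rn = 1').drop('rn')
nearest.show()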
7
votes
1 answer

Fuzzy matching a word inside a pyspark dataframe string

I have some data in which column 'X' contains strings. I am writing a function, using pyspark, where a search_word is passed and all rows which do not contain the substring search_word within the column 'X' string are filtered out. The function must…
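For the plain substring case, a sketch of such a filter (true fuzzy matching would need something like F.levenshtein or a UDF); the sample data is made up:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('the quick brown fox',), ('lazy dog',)], ['X'])

def filter_by_word(df, search_word):
    # Keep rows whose column 'X' contains the search word, case-insensitively
    return df.where(F.lower(F.col('X')).contains(search_word.lower()))

filter_by_word(df, 'Fox').show()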
7
votes
7 answers

How to convert empty arrays to nulls?

I have the below dataframe and I need to convert empty arrays to null. +----+---------+-----------+ | id|count(AS)|count(asdr)| +----+---------+-----------+ |1110| [12, 45]| [50, 55]| |1111| []| []| |1112| [45, 46]| [50, 50]|…
Alice
  • 165
  • 2
  • 4
  • 13
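A sketch of one way to do it: when() with no otherwise() yields NULL, so any array whose size() is zero becomes null. Simplified column names are used in place of the ones in the excerpt:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1110, [12, 45], [50, 55]), (1111, [], []), (1112, [45, 46], [50, 50])],
    ['id', 'count_as', 'count_asdr'])

# Arrays with size 0 fall through to NULL because no otherwise() is given
for c in ['count_as', 'count_asdr']:
    df = df.withColumn(c, F.when(F.size(df[c]) > 0, df[c]))
df.show()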
7
votes
1 answer

Why does Spark application fail with "IOException: (null) entry in command string: null chmod 0644"?

I'm trying to write the dataset results into a single CSV using the below Java code: dataset.write().mode(SaveMode.Overwrite).option("header",true).csv("C:\\tmp\\csvs"); But it times out and the file is not written. Throws…
John Humanyun
  • 915
  • 3
  • 10
  • 25
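On Windows this error usually means Hadoop's winutils.exe is not available to Spark. A sketch (in pyspark for consistency with the rest of the page; the install path is hypothetical) of pointing HADOOP_HOME at a winutils build before the session starts:

import os

# Hypothetical winutils install dir; %HADOOP_HOME%\bin must contain winutils.exe
os.environ['HADOOP_HOME'] = r'C:\hadoop'
os.environ['PATH'] += os.pathsep + r'C:\hadoop\bin'

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.range(3).write.mode('overwrite').option('header', True).csv(r'C:\tmp\csvs')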
7
votes
1 answer

How to transpose/pivot the rows data to column in Spark Scala?

I am new to Spark SQL. I have information in a Spark Dataframe like this: Company Type Status A X done A Y done A Z done C X done C Y done B Y done I want it to be displayed like the…
Vikrant Sonawane
  • 207
  • 1
  • 5
  • 15
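The question is in Scala, but the same groupBy/pivot API exists in both languages; a pyspark sketch with the data from the excerpt:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('A', 'X', 'done'), ('A', 'Y', 'done'), ('A', 'Z', 'done'),
     ('C', 'X', 'done'), ('C', 'Y', 'done'), ('B', 'Y', 'done')],
    ['Company', 'Type', 'Status'])

# One row per Company, one column per distinct Type
df.groupBy('Company').pivot('Type').agg(F.first('Status')).show()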
7
votes
1 answer

Does Spark do one pass through the data for multiple withColumn?

Does Spark do one or multiple passes through data when multiple withColumn functions are chained? For example: val dfnew = df.withColumn("newCol1", f1(col("a"))) .withColumn("newCol2", f2(col("b"))) .withColumn("newCol3",…
astro_asz
  • 2,278
  • 3
  • 15
  • 31
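A quick way to check is to look at the optimized plan: Catalyst collapses adjacent projections, so independent withColumn calls typically end up in a single Project over one scan. A sketch, with trivial expressions standing in for f1/f2/f3:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['a', 'b', 'c'])

dfnew = (df
         .withColumn('newCol1', F.col('a') + 1)
         .withColumn('newCol2', F.col('b') * 2)
         .withColumn('newCol3', F.col('c') - 1))

# The optimized logical plan shows one Project containing all three expressions
dfnew.explain(True)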
7
votes
0 answers

Timeout of PySpark countApprox() is not working

I'm working with Pyspark and Dataframes and I would like to know approximately whether a Dataframe is greater than some size. I'm trying to use the countApprox() function: df.rdd.countApprox(1000, 0.5) But it seems that in PySpark the timeout is not working.…
Javier Montón
  • 4,601
  • 3
  • 21
  • 29
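If the goal is only to know whether the DataFrame is bigger than some threshold, one hedged workaround is to count a limited slice instead of the whole thing, which bounds the work Spark has to do:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)

def has_more_than(df, n):
    # Counts at most n + 1 rows instead of counting the whole DataFrame
    return df.limit(n + 1).count() > n

print(has_more_than(df, 1000))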
7
votes
3 answers

What is the correct way to sum different dataframe columns in a list in pyspark?

I want to sum different columns in a spark dataframe. Code from pyspark.sql import functions as F cols = ["A.p1","B.p1"] df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols) # 1. Works df = df.withColumn('sum1', sum([df[col] for col in…
GeorgeOfTheRF
  • 8,244
  • 23
  • 57
  • 80
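A sketch of the row-wise approach: combine the Column objects with +, e.g. via functools.reduce (F.sum is an aggregate over rows, which is why it behaves differently), and backtick-quote the names since they contain dots:

from functools import reduce
from operator import add
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# Backticks stop the dots from being parsed as struct field access
sum_expr = reduce(add, [F.col("`{}`".format(c)) for c in cols])
df.withColumn('sum1', sum_expr).show()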
7
votes
2 answers

spark inconsistency when running count command

A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.: imp_sample.where(col("location").isNotNull()).count() And I am getting slightly different results every time I…
user3245256
  • 1,842
  • 4
  • 24
  • 51
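One common cause (an assumption, since the excerpt does not show the upstream code) is a non-deterministic transformation such as sample() without a fixed seed being re-evaluated on every action; fixing the seed and caching pins a single result:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
base = spark.range(100000).withColumn('location', F.when(F.rand() > 0.1, F.col('id')))

# With a fixed seed and a cache, repeated counts agree
imp_sample = base.sample(False, 0.5, 42).cache()
imp_sample.count()  # materialize the cache
print(imp_sample.where(F.col('location').isNotNull()).count())
print(imp_sample.where(F.col('location').isNotNull()).count())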
7
votes
2 answers

IF Statement Pyspark

My data looks like the following: +----------+-------------+-------+--------------------+--------------+---+ |purch_date| purch_class|tot_amt| serv-provider|purch_location|…
Bisbot
  • 127
  • 2
  • 3
  • 9
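Row-level conditionals in pyspark are written with when()/otherwise() rather than a Python if; a sketch using a couple of the columns named in the excerpt (the values and the threshold are made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('2012-10-10', 'grocery', 25.0), ('2012-10-11', 'fuel', 60.0)],
    ['purch_date', 'purch_class', 'tot_amt'])

# Equivalent of an IF/ELSE applied to every row
df.withColumn('amt_band',
              F.when(F.col('tot_amt') >= 50, 'high').otherwise('low')).show()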