Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
107
votes
5 answers

Split Spark dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very…
Peter Gaultney
  • 3,269
  • 4
  • 16
  • 20
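
A minimal PySpark sketch of the usual non-explode approach, splitting the string and selecting individual array elements; the sample data, column names, and separator are placeholders, not from the original question.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a,b,c",)], ["raw"])  # placeholder data

# split() yields an array column; getItem() pulls each element into its own column,
# so the row count stays the same (unlike explode).
parts = F.split(F.col("raw"), ",")
df2 = (df
       .withColumn("first", parts.getItem(0))
       .withColumn("second", parts.getItem(1))
       .withColumn("third", parts.getItem(2)))
df2.show()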
106
votes
6 answers

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name. for( i <- 0 to origCols.length - 1) { df.withColumnRenamed( df.columns(i),…
Sam
  • 1,227
  • 3
  • 11
  • 13
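
The question targets Scala, but the same idea (handing toDF the full list of new names instead of looping over withColumnRenamed) exists in both APIs; a PySpark sketch with hypothetical names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])  # placeholder columns

# Rename every column in one call instead of one withColumnRenamed per column.
new_names = [c + "_renamed" for c in df.columns]
renamed = df.toDF(*new_names)
renamed.printSchema()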
105
votes
11 answers

How to save DataFrame directly to Hive?

Is it possible to save a DataFrame in Spark directly to Hive? I have tried converting the DataFrame to an RDD, saving it as a text file, and then loading it into Hive. But I am wondering if I can save the DataFrame directly to Hive.
Gourav
  • 1,245
  • 2
  • 10
  • 12
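
A hedged sketch of writing straight to Hive with saveAsTable, assuming a SparkSession built with Hive support and a metastore that already contains the target database; the table name is a placeholder.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("save-to-hive")
         .enableHiveSupport()          # requires Hive support on the cluster
         .getOrCreate())

df = spark.createDataFrame([(1, "a")], ["id", "value"])  # placeholder data

# Writes the DataFrame as a managed Hive table, no detour through RDDs or text files.
df.write.mode("overwrite").saveAsTable("my_db.my_table")  # placeholder table name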
103
votes
12 answers

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np data = [ (1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float("nan")), (1, 6, float("nan")), ] df = spark.createDataFrame(data, ("session", "timestamp1",…
GeorgeOfTheRF
  • 8,244
  • 23
  • 57
  • 80
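
One common single-pass approach, counting NULLs and NaNs per column with count plus when; the toy data below stands in for the question's, and isnan only applies cleanly to numeric columns.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 1.0, None), (2.0, float("nan"), 5.0)],
    ["session", "timestamp1", "id2"])  # placeholder data

# count() ignores NULLs, so counting a when() that is NULL unless the value
# is NaN or NULL gives the per-column tally in a single job.
counts = df.select([
    F.count(F.when(F.isnan(F.col(c)) | F.col(c).isNull(), c)).alias(c)
    for c in df.columns
])
counts.show()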
102
votes
14 answers

Overwrite specific partitions in spark dataframe write method

I want to overwrite specific partitions instead of all of them in Spark. I am trying the following command: df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4') where df is a dataframe holding the incremental data to be…
yatin
  • 1,023
  • 2
  • 8
  • 7
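
A sketch of dynamic partition overwrite, available from Spark 2.3 onward; with this setting only the partitions present in the incoming DataFrame are replaced. The path and data are placeholders (the question used a maprfs:// path).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([(1, "2020-01-01")], ["value", "col4"])  # placeholder data

# Only partitions whose col4 values appear in df are overwritten; the rest stay untouched.
(df.write
   .mode("overwrite")
   .partitionBy("col4")
   .orc("/tmp/incremental-output"))  # placeholder path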
99
votes
3 answers

How does createOrReplaceTempView work in Spark?

I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of objects as a table, will Spark keep all the data in memory?
Abir Chokraborty
  • 1,695
  • 4
  • 15
  • 23
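
A small sketch showing what the call does: it registers a name for the DataFrame's logical plan so SQL can refer to it, and nothing is pulled into memory unless you cache explicitly. View and column names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# The view is lazy metadata only; add df.cache() (or spark.catalog.cacheTable)
# if you actually want the data held in memory.
df.createOrReplaceTempView("people")  # placeholder view name
spark.sql("SELECT id FROM people WHERE value = 'a'").show()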
98
votes
10 answers

Removing duplicate columns after a DF join in Spark

When you join two DFs with similar column names: df = df1.join(df2, df1['id'] == df2['id']) The join works fine, but you can't select the id column because it is ambiguous, and you would get the following exception: pyspark.sql.utils.AnalysisException:…
thecheech
  • 2,041
  • 3
  • 18
  • 25
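
One frequently used fix, sketched here with placeholder data: join on a list of column names rather than an expression, so only a single id column survives and nothing is ambiguous afterwards.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "a")], ["id", "x"])
df2 = spark.createDataFrame([(1, "b")], ["id", "y"])

# Passing ["id"] instead of df1['id'] == df2['id'] deduplicates the join key column.
joined = df1.join(df2, ["id"])
joined.select("id", "x", "y").show()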
97
votes
7 answers

Cannot find col function in pyspark

In PySpark 1.6.2, I can import the col function with from pyspark.sql.functions import col, but when I try to look it up in the GitHub source code I find no col function in functions.py. How can Python import a function that doesn't exist?
Bamqf
  • 3,382
  • 8
  • 33
  • 47
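
The short answer is that older PySpark versions generate col (and its siblings) at runtime from a name-to-docstring mapping rather than defining them literally in functions.py, so a text search misses them; the import still works. A tiny sketch with placeholder data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col   # resolved at runtime even if not written out in the file

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])
df.select(col("id")).show()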
97
votes
8 answers

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,'CA',2), ('Bar',72,'CA',2), …
Jason
  • 2,834
  • 6
  • 31
  • 35
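
A sketch of the usual answer, dropDuplicates with a column subset, which keeps one (arbitrary) row per key combination; the column names below are hypothetical since the original data is unnamed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Foo", 41, "US", 3), ("Foo", 39, "UK", 1),
     ("Bar", 57, "CA", 2), ("Bar", 72, "CA", 2)],
    ["name", "age", "country", "score"])  # hypothetical column names

# Keep one row per (name, country) pair; which row survives is not guaranteed.
deduped = df.dropDuplicates(["name", "country"])
deduped.show()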
96
votes
5 answers

Get current number of partitions of a DataFrame

Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (Spark 1.6) and didn't find a method for that, or did I just miss it? (In case of JavaRDD there's a getNumPartitions() method.)
kecso
  • 2,387
  • 2
  • 18
  • 29
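
The DataFrame API in 1.6 indeed has no direct method; dropping down to the underlying RDD is the usual workaround, sketched below with a placeholder DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100)  # placeholder DataFrame

# The RDD behind the DataFrame exposes the partition count.
print(df.rdd.getNumPartitions())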
96
votes
4 answers

Spark SQL: apply aggregate functions to a list of columns

Is there a way to apply an aggregate function to all (or a list of) columns of a dataframe, when doing a groupBy? In other words, is there a way to avoid doing this for every column: df.groupBy("col1") .agg(sum("col2").alias("col2"),…
lilloraffa
  • 1,367
  • 3
  • 17
  • 22
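
A sketch of the standard pattern: build the aggregation expressions in a list comprehension and unpack them into agg, instead of spelling out one call per column. Data and column names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1, 2), ("a", 3, 4)], ["col1", "col2", "col3"])

# One sum() expression per column in the list, all applied in a single agg() call.
agg_cols = ["col2", "col3"]
exprs = [F.sum(c).alias(c) for c in agg_cols]
df.groupBy("col1").agg(*exprs).show()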
95
votes
5 answers

Add an empty column to Spark DataFrame

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment) especially…
architectonic
  • 2,871
  • 2
  • 21
  • 35
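
A minimal sketch of the common workaround: add a column of typed NULLs with lit(None) and an explicit cast, so the schema stays well defined. The column name and type are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ["id"])

# lit(None) alone has NullType; the cast gives the new column a usable type.
df2 = df.withColumn("new_col", F.lit(None).cast(StringType()))
df2.printSchema()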
95
votes
5 answers

Updating a dataframe column in spark

Looking at the new spark DataFrame API, it is unclear whether it is possible to modify dataframe columns. How would I go about changing a value in row x column y of a dataframe? In pandas this would be: df.ix[x,y] = new_value Edit: Consolidating…
Luke
  • 6,699
  • 13
  • 50
  • 88
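
DataFrames are immutable, so there is no cell-level assignment like pandas' df.ix[x,y]; the usual pattern, sketched here with placeholder data and values, derives a new DataFrame with withColumn plus when/otherwise.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 20)], ["x", "y"])

# Replace y with 999 where x == 1, keep the old value everywhere else.
updated = df.withColumn("y", F.when(F.col("x") == 1, 999).otherwise(F.col("y")))
updated.show()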
93
votes
4 answers

Create Spark DataFrame. Can not infer schema for type

Could someone help me solve this problem I have with Spark DataFrame? When I do myFloatRDD.toDF() I get an error: TypeError: Can not infer schema for type: type 'float' I don't understand why... Example: myFloatRdd =…
Breach
  • 1,288
  • 1
  • 11
  • 25
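
The error comes from calling toDF() on an RDD of bare floats: schema inference needs record-like rows. A sketch of the usual fix, wrapping each value in a one-element tuple (a Row or an explicit schema also works); the column name is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

myFloatRdd = sc.parallelize([1.0, 2.0, 3.0])

# Each tuple becomes a one-column row, so Spark can infer DoubleType for it.
df = myFloatRdd.map(lambda x: (x,)).toDF(["value"])
df.show()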
92
votes
6 answers

How to write unit tests in Spark 2.0+?

I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession, even though it…
bbarker
  • 11,636
  • 9
  • 38
  • 62
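
The question asks about JUnit, which is Scala/Java territory; as a hedged illustration of the same pattern (one shared local SparkSession per test suite, stopped afterwards), here is a PySpark sketch using pytest as a stand-in test framework.

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="module")
def spark():
    # One local session shared by all tests in the module, torn down at the end.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("unit-tests")
               .getOrCreate())
    yield session
    session.stop()


def test_filter_keeps_matching_rows(spark):
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.filter(df.value == "a").count() == 1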