Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
107
votes
5 answers

Split Spark dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very…
Peter Gaultney
  • 3,269
  • 4
  • 16
  • 20
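
A minimal PySpark sketch of the usual non-explode approach, splitting the string and selecting individual array elements; the sample data, column names, and separator are placeholders, not from the original question.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a,b,c",)], ["raw"])  # placeholder data

# split() yields an array column; getItem() pulls each element into its own column,
# so the row count stays the same (unlike explode).
parts = F.split(F.col("raw"), ",")
df2 = (df
       .withColumn("first", parts.getItem(0))
       .withColumn("second", parts.getItem(1))
       .withColumn("third", parts.getItem(2)))
df2.show()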
106
votes
6 answers

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name. for( i <- 0 to origCols.length - 1) { df.withColumnRenamed( df.columns(i),…
Sam
  • 1,227
  • 3
  • 11
  • 13
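
The question targets Scala, but the same idea (handing toDF the full list of new names instead of looping over withColumnRenamed) exists in both APIs; a PySpark sketch with hypothetical names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])  # placeholder columns

# Rename every column in one call instead of one withColumnRenamed per column.
new_names = [c + "_renamed" for c in df.columns]
renamed = df.toDF(*new_names)
renamed.printSchema()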
105
votes
11 answers

How to save DataFrame directly to Hive?

Is it possible to save a DataFrame in Spark directly to Hive? I have tried converting the DataFrame to an RDD, saving it as a text file, and then loading it into Hive. But I am wondering if I can save the DataFrame directly to Hive.
Gourav
  • 1,245
  • 2
  • 10
  • 12
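
A hedged sketch of writing straight to Hive with saveAsTable, assuming a SparkSession built with Hive support and a metastore that already contains the target database; the table name is a placeholder.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("save-to-hive")
         .enableHiveSupport()          # requires Hive support on the cluster
         .getOrCreate())

df = spark.createDataFrame([(1, "a")], ["id", "value"])  # placeholder data

# Writes the DataFrame as a managed Hive table, no detour through RDDs or text files.
df.write.mode("overwrite").saveAsTable("my_db.my_table")  # placeholder table name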
103
votes
12 answers

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np data = [ (1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float("nan")), (1, 6, float("nan")), ] df = spark.createDataFrame(data, ("session", "timestamp1",…
GeorgeOfTheRF
  • 8,244
  • 23
  • 57
  • 80
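
One common single-pass approach, counting NULLs and NaNs per column with count plus when; the toy data below stands in for the question's, and isnan only applies cleanly to numeric columns.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 1.0, None), (2.0, float("nan"), 5.0)],
    ["session", "timestamp1", "id2"])  # placeholder data

# count() ignores NULLs, so counting a when() that is NULL unless the value
# is NaN or NULL gives the per-column tally in a single job.
counts = df.select([
    F.count(F.when(F.isnan(F.col(c)) | F.col(c).isNull(), c)).alias(c)
    for c in df.columns
])
counts.show()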
102
votes
14 answers

Overwrite specific partitions in spark dataframe write method

I want to overwrite specific partitions instead of all of them in Spark. I am trying the following command: df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4') where df is a dataframe holding the incremental data to be…
yatin
  • 1,023
  • 2
  • 8
  • 7
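
A sketch of dynamic partition overwrite, available from Spark 2.3 onward; with this setting only the partitions present in the incoming DataFrame are replaced. The path and data are placeholders (the question used a maprfs:// path).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([(1, "2020-01-01")], ["value", "col4"])  # placeholder data

# Only partitions whose col4 values appear in df are overwritten; the rest stay untouched.
(df.write
   .mode("overwrite")
   .partitionBy("col4")
   .orc("/tmp/incremental-output"))  # placeholder path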
99
votes
3 answers

How does createOrReplaceTempView work in Spark?

I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of objects as a table, will Spark keep all the data in memory?
Abir Chokraborty
  • 1,695
  • 4
  • 15
  • 23
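
A small sketch showing what the call does: it registers a name for the DataFrame's logical plan so SQL can refer to it, and nothing is pulled into memory unless you cache explicitly. View and column names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# The view is lazy metadata only; add df.cache() (or spark.catalog.cacheTable)
# if you actually want the data held in memory.
df.createOrReplaceTempView("people")  # placeholder view name
spark.sql("SELECT id FROM people WHERE value = 'a'").show()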
98
votes
10 answers

Removing duplicate columns after a DF join in Spark

When you join two DFs with similar column names: df = df1.join(df2, df1['id'] == df2['id']) The join works fine, but you can't select the id column because it is ambiguous, and you would get the following exception: pyspark.sql.utils.AnalysisException:…
thecheech
  • 2,041
  • 3
  • 18
  • 25
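
One frequently used fix, sketched here with placeholder data: join on a list of column names rather than an expression, so only a single id column survives and nothing is ambiguous afterwards.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "a")], ["id", "x"])
df2 = spark.createDataFrame([(1, "b")], ["id", "y"])

# Passing ["id"] instead of df1['id'] == df2['id'] deduplicates the join key column.
joined = df1.join(df2, ["id"])
joined.select("id", "x", "y").show()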
97
votes
7 answers

Cannot find col function in pyspark

In PySpark 1.6.2, I can import the col function with from pyspark.sql.functions import col, but when I try to look it up in the GitHub source code I find no col function in functions.py. How can Python import a function that doesn't exist?
Bamqf
  • 3,382
  • 8
  • 33
  • 47
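
The short answer is that older PySpark versions generate col (and its siblings) at runtime from a name-to-docstring mapping rather than defining them literally in functions.py, so a text search misses them; the import still works. A tiny sketch with placeholder data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col   # resolved at runtime even if not written out in the file

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])
df.select(col("id")).show()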
97
votes
8 answers

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,'CA',2), ('Bar',72,'CA',2), …
Jason
  • 2,834
  • 6
  • 31
  • 35
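
A sketch of the usual answer, dropDuplicates with a column subset, which keeps one (arbitrary) row per key combination; the column names below are hypothetical since the original data is unnamed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Foo", 41, "US", 3), ("Foo", 39, "UK", 1),
     ("Bar", 57, "CA", 2), ("Bar", 72, "CA", 2)],
    ["name", "age", "country", "score"])  # hypothetical column names

# Keep one row per (name, country) pair; which row survives is not guaranteed.
deduped = df.dropDuplicates(["name", "country"])
deduped.show()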
96
votes
5 answers

Get current number of partitions of a DataFrame

Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (Spark 1.6) and didn't find a method for that, or did I just miss it? (In case of JavaRDD there's a getNumPartitions() method.)
kecso
  • 2,387
  • 2
  • 18
  • 29
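
The DataFrame API in 1.6 indeed has no direct method; dropping down to the underlying RDD is the usual workaround, sketched below with a placeholder DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100)  # placeholder DataFrame

# The RDD behind the DataFrame exposes the partition count.
print(df.rdd.getNumPartitions())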
96
votes
4 answers

Spark SQL: apply aggregate functions to a list of columns

Is there a way to apply an aggregate function to all (or a list of) columns of a dataframe, when doing a groupBy? In other words, is there a way to avoid doing this for every column: df.groupBy("col1") .agg(sum("col2").alias("col2"),…
lilloraffa
  • 1,367
  • 3
  • 17
  • 22
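
A sketch of the standard pattern: build the aggregation expressions in a list comprehension and unpack them into agg, instead of spelling out one call per column. Data and column names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1, 2), ("a", 3, 4)], ["col1", "col2", "col3"])

# One sum() expression per column in the list, all applied in a single agg() call.
agg_cols = ["col2", "col3"]
exprs = [F.sum(c).alias(c) for c in agg_cols]
df.groupBy("col1").agg(*exprs).show()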
95
votes
5 answers

Add an empty column to Spark DataFrame

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment) especially…
architectonic
  • 2,871
  • 2
  • 21
  • 35
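
A minimal sketch of the common workaround: add a column of typed NULLs with lit(None) and an explicit cast, so the schema stays well defined. The column name and type are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ["id"])

# lit(None) alone has NullType; the cast gives the new column a usable type.
df2 = df.withColumn("new_col", F.lit(None).cast(StringType()))
df2.printSchema()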
95
votes
5 answers

Updating a dataframe column in spark

Looking at the new spark DataFrame API, it is unclear whether it is possible to modify dataframe columns. How would I go about changing a value in row x column y of a dataframe? In pandas this would be: df.ix[x,y] = new_value Edit: Consolidating…
Luke
  • 6,699
  • 13
  • 50
  • 88
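
DataFrames are immutable, so there is no cell-level assignment like pandas' df.ix[x,y]; the usual pattern, sketched here with placeholder data and values, derives a new DataFrame with withColumn plus when/otherwise.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 20)], ["x", "y"])

# Replace y with 999 where x == 1, keep the old value everywhere else.
updated = df.withColumn("y", F.when(F.col("x") == 1, 999).otherwise(F.col("y")))
updated.show()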
93
votes
4 answers

Create Spark DataFrame. Can not infer schema for type

Could someone help me solve this problem I have with Spark DataFrame? When I do myFloatRDD.toDF() I get an error: TypeError: Can not infer schema for type: type 'float' I don't understand why... Example: myFloatRdd =…
Breach
  • 1,288
  • 1
  • 11
  • 25
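
The error comes from calling toDF() on an RDD of bare floats: schema inference needs record-like rows. A sketch of the usual fix, wrapping each value in a one-element tuple (a Row or an explicit schema also works); the column name is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

myFloatRdd = sc.parallelize([1.0, 2.0, 3.0])

# Each tuple becomes a one-column row, so Spark can infer DoubleType for it.
df = myFloatRdd.map(lambda x: (x,)).toDF(["value"])
df.show()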
92
votes
6 answers

How to write unit tests in Spark 2.0+?

I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession, even though it…
bbarker
  • 11,636
  • 9
  • 38
  • 62
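
The question asks about JUnit, which is Scala/Java territory; as a hedged illustration of the same pattern (one shared local SparkSession per test suite, stopped afterwards), here is a PySpark sketch using pytest as a stand-in test framework.

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="module")
def spark():
    # One local session shared by all tests in the module, torn down at the end.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("unit-tests")
               .getOrCreate())
    yield session
    session.stop()


def test_filter_keeps_matching_rows(spark):
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.filter(df.value == "a").count() == 1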