Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
91
votes
7 answers

Pyspark: display a spark data frame in a table format

I am using pyspark to read a parquet file like below: my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**') Then when I do my_df.take(5), it will show [Row(...)], instead of a table format like when we use the pandas data frame. Is…
Edamame
  • 23,718
  • 73
  • 186
  • 320
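
For reference, a minimal sketch of the usual fixes, assuming an active SparkSession and the (hypothetical) path from the question: show() renders rows as an ASCII table, and a small limit() plus toPandas() gives the pandas-style display.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    my_df = spark.read.parquet("hdfs://myPath/myDB.db/myTable")  # hypothetical path

    my_df.show(5, truncate=False)  # first 5 rows as an ASCII table
    my_df.limit(5).toPandas()      # pandas DataFrame; renders as a table in notebooks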
91
votes
4 answers

How to join on multiple columns in Pyspark?

I am using Spark 1.3 and would like to join on multiple columns using the Python interface (SparkSQL). The following works: I first register them as temp tables. numeric.registerTempTable("numeric") Ref.registerTempTable("Ref") test = numeric.join(Ref,…
user3803714
  • 5,269
  • 10
  • 42
  • 61
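
A hedged sketch of the DataFrame-API answer (data and column names made up): passing a list of column names to join() joins on all of them at once and keeps a single copy of each key column.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    numeric = spark.createDataFrame([(1, "a", 10.0)], ["id", "code", "value"])
    ref = spark.createDataFrame([(1, "a", "desc")], ["id", "code", "label"])

    # A list of column names joins on every one of them.
    test = numeric.join(ref, on=["id", "code"], how="inner")
    test.show()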
90
votes
10 answers

How to pivot Spark DataFrame?

I am starting to use Spark DataFrames and I need to be able to pivot the data to create multiple columns out of one column with multiple rows. There is built-in functionality for that in Scalding and I believe in Pandas in Python, but I can't find…
J Calbreath
  • 2,665
  • 4
  • 22
  • 31
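
Since Spark 1.6 the DataFrame API has had a pivot() method on grouped data; a minimal sketch with invented column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("A", "x", 1), ("A", "y", 2), ("B", "x", 3)],
        ["id", "key", "val"],
    )

    # One output column per distinct value of "key".
    df.groupBy("id").pivot("key").agg(F.first("val")).show()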
89
votes
2 answers

Spark - SELECT WHERE or filtering?

What's the difference between selecting with a where clause and filtering in Spark? Are there any use cases in which one is more appropriate than the other? When do I use DataFrame newdf =…
lte__
  • 7,175
  • 25
  • 74
  • 131
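
The short answer is that in the DataFrame API where() is simply an alias for filter(), so the choice is stylistic; both also accept SQL-expression strings. A quick sketch:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "c1"])

    # All three produce the same plan.
    df.filter(F.col("id") > 1).show()
    df.where(F.col("id") > 1).show()
    df.where("id > 1").show()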
86
votes
4 answers

Pyspark: Split multiple array columns into rows

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as…
Steve
  • 2,401
  • 3
  • 24
  • 28
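
On Spark 2.4+ one common pattern (column names hypothetical) is to zip the equal-length arrays into a single array of structs, explode it once, and unpack the struct fields:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, [1, 2], ["a", "b"])], ["id", "xs", "ys"])

    # arrays_zip keeps positions aligned; one explode gives one row per index.
    df.withColumn("z", F.explode(F.arrays_zip("xs", "ys"))) \
      .select("id", F.col("z.xs").alias("x"), F.col("z.ys").alias("y")) \
      .show()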
86
votes
12 answers

how to filter out a null value from spark dataframe

I created a dataframe in spark with the following schema: root |-- user_id: long (nullable = false) |-- event_id: long (nullable = false) |-- invited: integer (nullable = false) |-- day_diff: long (nullable = true) |-- interested: integer…
Steven Li
  • 901
  • 1
  • 9
  • 9
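
A minimal sketch of the usual answers, reusing the day_diff column from the question's schema (data invented):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, None), (2, 5)], "user_id long, day_diff long")

    df.filter(F.col("day_diff").isNotNull()).show()  # drop rows where day_diff is null
    df.na.drop().show()                              # drop rows with a null in any column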
85
votes
22 answers

How to perform union on two DataFrames with different amounts of columns in Spark?

I have 2 DataFrames: I need a union like this: The unionAll function doesn't work because the number and names of the columns differ. How can I do this?
Allan Feliph
  • 862
  • 1
  • 8
  • 8
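
On Spark 3.1+ unionByName can fill the missing columns directly; on older versions the usual trick is to add them as null literals first. A sketch with hypothetical frames:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a")], ["id", "x"])
    df2 = spark.createDataFrame([(2, "b")], ["id", "y"])

    # Spark 3.1+: absent columns are filled with nulls.
    df1.unionByName(df2, allowMissingColumns=True).show()

    # Older versions: align the schemas by hand, then union by name.
    df1.withColumn("y", F.lit(None).cast("string")) \
       .unionByName(df2.withColumn("x", F.lit(None).cast("string"))) \
       .show()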
84
votes
13 answers

Provide schema while reading csv file as a dataframe in Scala Spark

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be, since I know my csv file. Also, I am using the spark-csv package to read the file. I am trying to specify the schema like below. val pagecount =…
Pa1
  • 861
  • 1
  • 7
  • 6
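
The standard answer is to build the schema explicitly and hand it to the reader instead of inferring it. The question is in Scala; a PySpark sketch of the same idea (field names and path invented):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("project", StringType(), True),
        StructField("page", StringType(), True),
        StructField("count", IntegerType(), True),
    ])

    # .schema(...) bypasses inference entirely.
    pagecount = spark.read.schema(schema).csv("/path/to/pagecounts.csv")
    pagecount.printSchema()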
83
votes
4 answers

How to make good reproducible Apache Spark examples

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them…
pault
  • 41,343
  • 15
  • 107
  • 149
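
The heart of the advice lends itself to a short demonstration: paste a small createDataFrame call so others can rebuild your exact data, then show the actual and expected output. A minimal sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Pasteable data plus explicit column names makes a question reproducible.
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
    df.show()  # include this output (and the output you want) in the question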
83
votes
4 answers

Spark functions vs UDF performance?

Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be on which is faster, but I did some testing myself and found the spark functions to be about 10 times…
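
The contrast behind the measurement is easy to sketch: built-in functions run inside the JVM under the Catalyst optimizer, while a Python UDF ships every row out to a Python worker and back. Hypothetical example:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("spark",), ("sql",)], ["word"])

    # Built-in: optimized, no Python round-trip.
    df.select(F.upper("word")).show()

    # UDF: a per-row round-trip to a Python worker.
    upper_udf = F.udf(lambda s: s.upper(), StringType())
    df.select(upper_udf("word")).show()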
82
votes
3 answers

How to convert column with string type to int form in pyspark data frame?

I have a dataframe in pyspark. Some of its numerical columns contain nan, so when I am reading the data and checking the schema of the dataframe, those columns have string type. How can I change them to int type? I replaced the nan values with 0…
neha
  • 1,858
  • 5
  • 21
  • 35
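
A minimal sketch of the usual cast (column name invented); strings that don't parse, such as "nan", become null rather than raising:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("1",), ("0",), ("nan",)], ["score"])

    df = df.withColumn("score", F.col("score").cast("int"))  # "nan" -> null
    df.printSchema()
    df.show()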
82
votes
5 answers

How to use Column.isin with list?

val items = List("a", "b", "c") sqlContext.sql("select c1 from table") .filter($"c1".isin(items)) .collect .foreach(println) The code above throws the following exception. Exception in thread "main"…
Nabegh
  • 3,249
  • 6
  • 25
  • 26
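
The Scala error comes from isin taking varargs, so the list must be expanded as isin(items: _*); PySpark, by contrast, accepts the list directly. A PySpark sketch:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("d",)], ["c1"])

    items = ["a", "b", "c"]
    df.filter(F.col("c1").isin(items)).show()  # Scala would need items: _*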
81
votes
7 answers

How to loop through each row of dataFrame in pyspark

E.g. sqlContext = SQLContext(sc) sample=sqlContext.sql("select Name ,age ,city from user") sample.show() The above statement prints the entire table on the terminal. But I want to access each row in that table using for or while to perform further…
Arti Berde
  • 1,182
  • 1
  • 11
  • 23
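
The usual caveat is that any row-by-row loop runs on the driver, not the cluster; a hedged sketch of the two common options:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sample = spark.createDataFrame([("Ann", 30, "Pune")], ["Name", "age", "city"])

    # collect() pulls everything to the driver; fine only for small results.
    for row in sample.collect():
        print(row["Name"], row["age"], row["city"])

    # toLocalIterator() streams one partition at a time instead.
    for row in sample.toLocalIterator():
        print(row.Name)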
81
votes
3 answers

How to use JDBC source to write and read data in (Py)Spark?

The goal of this question is to document: steps required to read and write data using JDBC connections in PySpark; possible issues with JDBC sources and known solutions. With small changes these methods should work with other supported languages…
zero323
  • 322,348
  • 103
  • 959
  • 935
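
A skeletal read/write pair for orientation; the URL, table, and credentials are placeholders, and the matching JDBC driver jar must be on the classpath (e.g. via the spark.jars config):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    jdbc_opts = {
        "url": "jdbc:postgresql://host:5432/db",  # placeholder
        "dbtable": "schema.tablename",            # placeholder
        "user": "username",
        "password": "password",
        "driver": "org.postgresql.Driver",
    }

    df = spark.read.format("jdbc").options(**jdbc_opts).load()
    df.write.format("jdbc").options(**jdbc_opts).mode("append").save()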
80
votes
8 answers

How to get name of dataframe column in PySpark?

In pandas, this can be done by column.name. But how to do the same when it's a column of Spark dataframe? E.g. the calling program has a Spark dataframe: spark_df >>> spark_df.columns ['admit', 'gre', 'gpa', 'rank'] This program calls my function:…
Kaushik Acharya
  • 1,520
  • 2
  • 16
  • 25
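
For a DataFrame the names live in df.columns (or the schema); a bare Column object only exposes its name through its string form. A sketch using the question's columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark_df = spark.createDataFrame([(0, 660, 3.67, 1)], ["admit", "gre", "gpa", "rank"])

    print(spark_df.columns)       # ['admit', 'gre', 'gpa', 'rank']
    print(spark_df.schema.names)  # the same list, via the schema

    col = spark_df["gre"]
    print(str(col))               # string form shows the name (exact format varies by version)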