Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
91
votes
7 answers

Pyspark: display a spark data frame in a table format

I am using pyspark to read a parquet file like below: my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**') Then when I do my_df.take(5), it will show [Row(...)], instead of a table format like when we use the pandas data frame. Is…
Edamame
  • 23,718
  • 73
  • 186
  • 320
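
For reference, a minimal sketch of the usual fixes, assuming an active SparkSession and the (hypothetical) path from the question: show() renders rows as an ASCII table, and a small limit() plus toPandas() gives the pandas-style display.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    my_df = spark.read.parquet("hdfs://myPath/myDB.db/myTable")  # hypothetical path

    my_df.show(5, truncate=False)  # first 5 rows as an ASCII table
    my_df.limit(5).toPandas()      # pandas DataFrame; renders as a table in notebooks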
91
votes
4 answers

How to join on multiple columns in Pyspark?

I am using Spark 1.3 and would like to join on multiple columns using the Python interface (SparkSQL). The following works: I first register them as temp tables. numeric.registerTempTable("numeric") Ref.registerTempTable("Ref") test = numeric.join(Ref,…
user3803714
  • 5,269
  • 10
  • 42
  • 61
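
A hedged sketch of the DataFrame-API answer (data and column names made up): passing a list of column names to join() joins on all of them at once and keeps a single copy of each key column.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    numeric = spark.createDataFrame([(1, "a", 10.0)], ["id", "code", "value"])
    ref = spark.createDataFrame([(1, "a", "desc")], ["id", "code", "label"])

    # A list of column names joins on every one of them.
    test = numeric.join(ref, on=["id", "code"], how="inner")
    test.show()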
90
votes
10 answers

How to pivot Spark DataFrame?

I am starting to use Spark DataFrames and I need to be able to pivot the data to create multiple columns out of one column with multiple rows. There is built-in functionality for that in Scalding and I believe in Pandas in Python, but I can't find…
J Calbreath
  • 2,665
  • 4
  • 22
  • 31
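
Since Spark 1.6 the DataFrame API has had a pivot() method on grouped data; a minimal sketch with invented column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("A", "x", 1), ("A", "y", 2), ("B", "x", 3)],
        ["id", "key", "val"],
    )

    # One output column per distinct value of "key".
    df.groupBy("id").pivot("key").agg(F.first("val")).show()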
89
votes
2 answers

Spark - SELECT WHERE or filtering?

What's the difference between selecting with a where clause and filtering in Spark? Are there any use cases in which one is more appropriate than the other? When do I use DataFrame newdf =…
lte__
  • 7,175
  • 25
  • 74
  • 131
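
The short answer is that in the DataFrame API where() is simply an alias for filter(), so the choice is stylistic; both also accept SQL-expression strings. A quick sketch:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "c1"])

    # All three produce the same plan.
    df.filter(F.col("id") > 1).show()
    df.where(F.col("id") > 1).show()
    df.where("id > 1").show()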
86
votes
4 answers

Pyspark: Split multiple array columns into rows

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as…
Steve
  • 2,401
  • 3
  • 24
  • 28
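
On Spark 2.4+ one common pattern (column names hypothetical) is to zip the equal-length arrays into a single array of structs, explode it once, and unpack the struct fields:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, [1, 2], ["a", "b"])], ["id", "xs", "ys"])

    # arrays_zip keeps positions aligned; one explode gives one row per index.
    df.withColumn("z", F.explode(F.arrays_zip("xs", "ys"))) \
      .select("id", F.col("z.xs").alias("x"), F.col("z.ys").alias("y")) \
      .show()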
86
votes
12 answers

how to filter out a null value from spark dataframe

I created a dataframe in spark with the following schema: root |-- user_id: long (nullable = false) |-- event_id: long (nullable = false) |-- invited: integer (nullable = false) |-- day_diff: long (nullable = true) |-- interested: integer…
Steven Li
  • 901
  • 1
  • 9
  • 9
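
A minimal sketch of the usual answers, reusing the day_diff column from the question's schema (data invented):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, None), (2, 5)], "user_id long, day_diff long")

    df.filter(F.col("day_diff").isNotNull()).show()  # drop rows where day_diff is null
    df.na.drop().show()                              # drop rows with a null in any column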
85
votes
22 answers

How to perform union on two DataFrames with different amounts of columns in Spark?

I have 2 DataFrames: I need a union like this: The unionAll function doesn't work because the number and names of the columns differ. How can I do this?
Allan Feliph
  • 862
  • 1
  • 8
  • 8
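
On Spark 3.1+ unionByName can fill the missing columns directly; on older versions the usual trick is to add them as null literals first. A sketch with hypothetical frames:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a")], ["id", "x"])
    df2 = spark.createDataFrame([(2, "b")], ["id", "y"])

    # Spark 3.1+: absent columns are filled with nulls.
    df1.unionByName(df2, allowMissingColumns=True).show()

    # Older versions: align the schemas by hand, then union by name.
    df1.withColumn("y", F.lit(None).cast("string")) \
       .unionByName(df2.withColumn("x", F.lit(None).cast("string"))) \
       .show()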
84
votes
13 answers

Provide schema while reading csv file as a dataframe in Scala Spark

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be, since I know my csv file. Also, I am using the spark-csv package to read the file. I am trying to specify the schema like below. val pagecount =…
Pa1
  • 861
  • 1
  • 7
  • 6
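
The standard answer is to build the schema explicitly and hand it to the reader instead of inferring it. The question is in Scala; a PySpark sketch of the same idea (field names and path invented):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("project", StringType(), True),
        StructField("page", StringType(), True),
        StructField("count", IntegerType(), True),
    ])

    # .schema(...) bypasses inference entirely.
    pagecount = spark.read.schema(schema).csv("/path/to/pagecounts.csv")
    pagecount.printSchema()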
83
votes
4 answers

How to make good reproducible Apache Spark examples

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them…
pault
  • 41,343
  • 15
  • 107
  • 149
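
The heart of the advice lends itself to a short demonstration: paste a small createDataFrame call so others can rebuild your exact data, then show the actual and expected output. A minimal sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Pasteable data plus explicit column names makes a question reproducible.
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
    df.show()  # include this output (and the output you want) in the question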
83
votes
4 answers

Spark functions vs UDF performance?

Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be on which is faster, but I did some testing myself and found the spark functions to be about 10 times…
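
The contrast behind the measurement is easy to sketch: built-in functions run inside the JVM under the Catalyst optimizer, while a Python UDF ships every row out to a Python worker and back. Hypothetical example:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("spark",), ("sql",)], ["word"])

    # Built-in: optimized, no Python round-trip.
    df.select(F.upper("word")).show()

    # UDF: a per-row round-trip to a Python worker.
    upper_udf = F.udf(lambda s: s.upper(), StringType())
    df.select(upper_udf("word")).show()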
82
votes
3 answers

How to convert column with string type to int form in pyspark data frame?

I have a dataframe in pyspark. Some of its numerical columns contain nan, so when I am reading the data and checking the schema of the dataframe, those columns have string type. How can I change them to int type? I replaced the nan values with 0…
neha
  • 1,858
  • 5
  • 21
  • 35
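
A minimal sketch of the usual cast (column name invented); strings that don't parse, such as "nan", become null rather than raising:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("1",), ("0",), ("nan",)], ["score"])

    df = df.withColumn("score", F.col("score").cast("int"))  # "nan" -> null
    df.printSchema()
    df.show()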
82
votes
5 answers

How to use Column.isin with list?

val items = List("a", "b", "c") sqlContext.sql("select c1 from table") .filter($"c1".isin(items)) .collect .foreach(println) The code above throws the following exception. Exception in thread "main"…
Nabegh
  • 3,249
  • 6
  • 25
  • 26
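
The Scala error comes from isin taking varargs, so the list must be expanded as isin(items: _*); PySpark, by contrast, accepts the list directly. A PySpark sketch:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("d",)], ["c1"])

    items = ["a", "b", "c"]
    df.filter(F.col("c1").isin(items)).show()  # Scala would need items: _*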
81
votes
7 answers

How to loop through each row of dataFrame in pyspark

E.g. sqlContext = SQLContext(sc) sample=sqlContext.sql("select Name ,age ,city from user") sample.show() The above statement prints the entire table on the terminal. But I want to access each row in that table using for or while to perform further…
Arti Berde
  • 1,182
  • 1
  • 11
  • 23
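
The usual caveat is that any row-by-row loop runs on the driver, not the cluster; a hedged sketch of the two common options:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sample = spark.createDataFrame([("Ann", 30, "Pune")], ["Name", "age", "city"])

    # collect() pulls everything to the driver; fine only for small results.
    for row in sample.collect():
        print(row["Name"], row["age"], row["city"])

    # toLocalIterator() streams one partition at a time instead.
    for row in sample.toLocalIterator():
        print(row.Name)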
81
votes
3 answers

How to use JDBC source to write and read data in (Py)Spark?

The goal of this question is to document: steps required to read and write data using JDBC connections in PySpark; possible issues with JDBC sources and known solutions. With small changes these methods should work with other supported languages…
zero323
  • 322,348
  • 103
  • 959
  • 935
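
A skeletal read/write pair for orientation; the URL, table, and credentials are placeholders, and the matching JDBC driver jar must be on the classpath (e.g. via the spark.jars config):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    jdbc_opts = {
        "url": "jdbc:postgresql://host:5432/db",  # placeholder
        "dbtable": "schema.tablename",            # placeholder
        "user": "username",
        "password": "password",
        "driver": "org.postgresql.Driver",
    }

    df = spark.read.format("jdbc").options(**jdbc_opts).load()
    df.write.format("jdbc").options(**jdbc_opts).mode("append").save()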
80
votes
8 answers

How to get name of dataframe column in PySpark?

In pandas, this can be done by column.name. But how to do the same when it's a column of Spark dataframe? E.g. the calling program has a Spark dataframe: spark_df >>> spark_df.columns ['admit', 'gre', 'gpa', 'rank'] This program calls my function:…
Kaushik Acharya
  • 1,520
  • 2
  • 16
  • 25
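
For a DataFrame the names live in df.columns (or the schema); a bare Column object only exposes its name through its string form. A sketch using the question's columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark_df = spark.createDataFrame([(0, 660, 3.67, 1)], ["admit", "gre", "gpa", "rank"])

    print(spark_df.columns)       # ['admit', 'gre', 'gpa', 'rank']
    print(spark_df.schema.names)  # the same list, via the schema

    col = spark_df["gre"]
    print(str(col))               # string form shows the name (exact format varies by version)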