Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
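
A minimal sketch of both roles, assuming a local PySpark installation (the data and the view name people are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# The DataFrame abstraction...
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# ...and the distributed SQL engine over the same data.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 1").show()
```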


26,508 questions
152 votes • 9 answers

How to delete columns in pyspark dataframe

>>> a DataFrame[id: bigint, julian_date: string, user_id: bigint] >>> b DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint] >>> a.join(b, a.id==b.id, 'outer') DataFrame[id: bigint, julian_date: string, user_id: bigint,…
xjx0524 • 1,531 rep • 2 gold • 10 silver • 5 bronze
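
One common approach uses DataFrame.drop, which returns a new DataFrame without the named column; a minimal sketch, with the schema borrowed from the excerpt (the row values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.createDataFrame([(1, "2456001", 100)], ["id", "julian_date", "user_id"])

# drop() is not in-place: it returns a new DataFrame without the column
a = a.drop("julian_date")
a.printSchema()  # id and user_id remain
```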
145 votes • 12 answers

Spark Dataframe distinguish columns with duplicated name

As far as I know, in a Spark DataFrame multiple columns can have the same name, as shown in the dataframe snapshot below: [ Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0,…
resec • 2,091 rep • 3 gold • 13 silver • 22 bronze
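
One widely used remedy is to alias each side of the join and qualify column references through the alias; a sketch assuming a self-join on the a and f columns from the excerpt:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(107831, 0.0)], ["a", "f"])

# Alias each side so the duplicated names can be qualified unambiguously.
left, right = df.alias("left"), df.alias("right")
joined = left.join(right, col("left.a") == col("right.a"))
joined.select(col("left.f")).show()  # picks the left-hand `f` only
```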
145 votes • 5 answers

How to define partitioning of DataFrame?

I've started using Spark SQL and DataFrames in Spark 1.4.0. I want to define a custom partitioner on DataFrames, in Scala, but I don't see how to do this. One of the data tables I'm working with contains a list of transactions, by account,…
rake • 2,348 rep • 3 gold • 15 silver • 11 bronze
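
DataFrames do not accept RDD-style custom Partitioners directly; the usual substitute (column-based repartition, available since Spark 1.6) hash-partitions by a column expression so rows with equal keys are co-located. A sketch with a hypothetical account column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("acct-1", 10.0), ("acct-2", 5.0)], ["account", "amount"])

# Hash-partition by the column expression; 8 partitions is an arbitrary choice.
partitioned = df.repartition(8, "account")
print(partitioned.rdd.getNumPartitions())  # 8
```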
139 votes • 8 answers

Sort in descending order in PySpark

I'm using PySpark (Python 2.7.9/Spark 1.3.1) and have a dataframe GroupObject which I need to filter and sort in descending order. I'm trying to achieve it via this piece of code: group_by_dataframe.count().filter("`count` >= 10").sort('count',…
rclakmal • 1,872 rep • 3 gold • 17 silver • 19 bronze
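
The usual fix is to sort by a column expression with .desc() rather than by a plain column name; a self-contained sketch with hypothetical data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",)] * 12 + [("b",)] * 3, ["key"])

result = (df.groupBy("key")
            .count()
            .filter("`count` >= 10")        # keep groups with at least 10 rows
            .orderBy(col("count").desc()))  # descending sort
result.show()
```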
131 votes • 13 answers

Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column. Consider the following example: df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"]) df.show() Which creates: +---+---+ | A| …
xenocyon • 2,409 rep • 3 gold • 20 silver • 22 bronze
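
One idiomatic route aggregates on the executors so only the single maximum reaches the driver; a sketch using the DataFrame from the excerpt:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])

# The aggregation runs on the executors; only one value reaches the driver.
max_a = df.agg(F.max("A")).collect()[0][0]
print(max_a)  # 3.0
```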
130 votes • 6 answers

Convert pyspark string to date format

I have a PySpark dataframe with a string column in the format MM-dd-yyyy and I am attempting to convert it into a date column. I tried: df.select(to_date(df.STRING_COLUMN).alias('new_date')).show() and I get a column of nulls. Can anyone…
Jenks • 1,950 rep • 3 gold • 20 silver • 27 bronze
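
to_date assumes yyyy-MM-dd when no pattern is given, which is why the call in the excerpt yields nulls; passing the pattern explicitly (supported since Spark 2.2) is one common fix. A sketch with a made-up date:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("04-30-2015",)], ["STRING_COLUMN"])

# Without a pattern, to_date expects yyyy-MM-dd and returns null here;
# the explicit pattern parses MM-dd-yyyy correctly.
df.select(to_date("STRING_COLUMN", "MM-dd-yyyy").alias("new_date")).show()
```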
129 votes • 14 answers

Concatenate two PySpark dataframes

I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them: from pyspark.sql.functions import randn, rand df_1 = sqlContext.range(0, 10) +--+ |id| +--+ | 0| | 1| | 2| | 3| | 4| | 5| | 6| | 7| | 8| |…
Ivan • 19,560 rep • 31 gold • 97 silver • 141 bronze
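
On Spark 3.1+ one common answer is unionByName with allowMissingColumns=True, which null-fills columns present on only one side; on older versions the missing columns are typically added with lit(None) first. A sketch with hypothetical column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_1 = spark.range(0, 3)                                 # column: id
df_2 = spark.range(0, 3).withColumnRenamed("id", "uid")  # column: uid

# Union by column name; columns missing on one side become nulls (Spark 3.1+).
combined = df_1.unionByName(df_2, allowMissingColumns=True)
combined.show()
```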
126 votes • 13 answers

Load CSV file with PySpark

I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing: sc.textFile('file.csv') .map(lambda line: (line.split(',')[0], line.split(',')[1])) .collect() I would expect this call to give me a list of…
Kernael • 3,270 rep • 4 gold • 22 silver • 42 bronze
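
Since Spark 2.0 a CSV data source is built in, so the manual line splitting shown in the excerpt is unnecessary; a minimal sketch (header and inferSchema are optional conveniences):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# file.csv is the path from the question; adjust as needed.
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()
```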
122 votes • 4 answers

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism? I have tried to set both of them in SparkSQL, but the task number of the second stage is always 200.
Edison • 1,225 rep • 2 gold • 10 silver • 8 bronze
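
In short: spark.sql.shuffle.partitions controls the partition count of DataFrame/SQL shuffles (default 200, hence the constant 200 tasks in the second stage), while spark.default.parallelism applies to the RDD API. A sketch of setting both; the value 50 is arbitrary:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # DataFrame/SQL shuffles (joins, aggregations); defaults to 200
         .config("spark.sql.shuffle.partitions", "50")
         # RDD operations (reduceByKey, RDD joins, default parallelize)
         .config("spark.default.parallelism", "50")
         .getOrCreate())

# The SQL setting can also be changed per session at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "50")
```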
120 votes • 15 answers

Join two data frames, select all columns from one and some columns from the other

Let's say I have a spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other. Is there a way to replicate the following command: sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2…
Francesco Sambo • 1,213 rep • 2 gold • 9 silver • 6 bronze
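
The DataFrame equivalent selects df1["*"] alongside the single column from the other side; a sketch with hypothetical schemas:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "x")], ["id", "payload"])
df2 = spark.createDataFrame([(1, "y")], ["id", "other"])

# Equivalent of: SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id
result = df1.join(df2, df1["id"] == df2["id"]).select(df1["*"], df2["other"])
result.show()
```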
118 votes • 9 answers

How to export a table dataframe in PySpark to csv?

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns.…
PyRsquared • 6,970 rep • 11 gold • 50 silver • 86 bronze
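
On Spark 2.0+ DataFrameWriter.csv writes a directory of part-files in parallel; for small results (or on Spark 1.x, where that writer is unavailable) a common fallback collects through pandas. A sketch; table_out and table.csv are hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = spark.sql("SELECT 1 AS a, 2 AS b")  # stand-in for the queried table

# Spark 2.0+: distributed write, producing a directory of part-files.
table.write.csv("table_out", header=True, mode="overwrite")

# Small results only (requires pandas): one local file via the driver.
table.toPandas().to_csv("table.csv", index=False)
```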
117 votes • 10 answers

How to create an empty DataFrame with a specified schema?

I want to create a DataFrame with a specified schema in Scala. I have tried using a JSON read (I mean reading an empty file) but I don't think that's the best practice.
user1735076 • 3,225 rep • 7 gold • 19 silver • 16 bronze
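
The question asks for Scala; the same idea in PySpark, shown here for consistency with the other sketches, is to build a StructType and pass it with an empty row list (the name/age fields are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),  # hypothetical fields
    StructField("age", IntegerType(), True),
])
empty_df = spark.createDataFrame([], schema)  # zero rows, fixed schema
empty_df.printSchema()
```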
112 votes • 3 answers

pyspark dataframe filter or include based on list

I am trying to filter a dataframe in pyspark using a list. I want to either filter based on the list or include only those records with a value in the list. My code below does not work: # define a dataframe rdd = sc.parallelize([(0,1), (0,1),…
user3133475 • 2,951 rep • 3 gold • 13 silver • 11 bronze
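
Column.isin covers both directions, keeping or excluding rows whose value appears in the list; a sketch with hypothetical data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0, 1), (0, 2), (1, 3)], ["k", "v"])
wanted = [1, 3]  # hypothetical filter list

kept = df.filter(col("v").isin(wanted))      # only values in the list
dropped = df.filter(~col("v").isin(wanted))  # everything else
kept.show()
```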
112 votes • 10 answers

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a data frame to a list. What I can find in the DataFrame API is RDD, so I tried converting it back to an RDD first, and then applying the toArray function to the RDD. In this case, the length and SQL work just fine.…
SH Y. • 1,709 rep • 3 gold • 20 silver • 21 bronze
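
Two common routes: unpack the Row objects returned by collect(), or go through the RDD API; a sketch with a hypothetical name column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",)], ["name"])

# collect() returns Row objects on the driver; unpack the single field.
values = [row["name"] for row in df.select("name").collect()]

# The same result through the RDD API.
values = df.select("name").rdd.flatMap(lambda r: r).collect()
print(values)  # ['a', 'b']
```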
110 votes • 9 answers

Renaming columns for PySpark DataFrame aggregates

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating: (df.groupBy("group") .agg({"money":"sum"}) .show(100) ) This will give me: group SUM(money#2L) A …
cantdutchthis • 31,949 rep • 17 gold • 74 silver • 114 bronze
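
The column-expression form of agg() accepts alias(), which the dict form shown in the excerpt does not; a sketch using the group/money columns from the question (the row values are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A", 10), ("A", 5), ("B", 1)], ["group", "money"])

# alias() names the aggregate; the dict form of agg() offers no such hook.
result = df.groupBy("group").agg(F.sum("money").alias("money_sum"))
result.show()
```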