Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

26,508 questions
7 votes · 2 answers

How to select multiple columns of dataset, given a list of column names?

How can I select multiple columns of dataset ds in Spark 2.3 Java by passing a list argument? For example, this works fine: ds.select("col1","col2","col3").show(); However, this fails: List columns =…
ScalaBoy · 3,254 · 13 · 46 · 84
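A minimal PySpark sketch of the list-based selection (the question itself targets the Java API, where Dataset.select(Column...) can be fed a List mapped through functions.col; the column names below are hypothetical):

```python
# Assumes a live SparkSession `spark`, as in the pyspark shell.
ds = spark.createDataFrame([(1, 2, 3, 4)], ["col1", "col2", "col3", "col4"])

columns = ["col1", "col2", "col3"]
ds.select(*columns).show()   # unpacked names...
ds.select(columns).show()    # ...or the list itself; PySpark accepts both
```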
7 votes · 4 answers

Select columns which contain a string in pyspark

I have a pyspark dataframe with a lot of columns, and I want to select the ones which contain a certain string, and others. For example: df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index'] I want to select the ones…
Manrique · 2,083 · 3 · 15 · 38
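A minimal sketch, reusing the column names from the excerpt: filter df.columns with a plain list comprehension and hand the result to select.

```python
# Columns whose name contains "hello", plus "index".
wanted = [c for c in df.columns if "hello" in c] + ["index"]
df.select(wanted).show()
```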
7 votes · 0 answers

Convert Spark DataSet to Java Pojo class

I am trying to convert a Dataset to a Java object. The schema is like root |-- deptId: long (nullable = true) |-- depNameName: string (nullable = true) |-- employee: array (nullable = true) | |-- element: struct (containsNull = true) | | …
Aslan · 71 · 1 · 2
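In the Java API this kind of mapping is usually done with ds.as(Encoders.bean(Dept.class)); as a hedged PySpark analogue of the same technique, rows can be collected into typed objects (the Dept class is hypothetical; field names come from the schema in the excerpt):

```python
from dataclasses import dataclass

@dataclass
class Dept:              # hypothetical class mirroring the printed schema
    deptId: int
    depNameName: str
    employee: list

# Map each Row into a typed object; `or []` guards the nullable array field.
depts = [Dept(r.deptId, r.depNameName, list(r.employee or [])) for r in df.collect()]
```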
7 votes · 2 answers

Write Spark dataframe to a single Parquet file

I am trying to do something very simple and I'm having some very stupid struggles. I think it must have to do with a fundamental misunderstanding of what spark is doing. I would greatly appreciate any help or explanation. I have a very large (~3 TB,…
seth127 · 2,594 · 5 · 30 · 43
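A common sketch: coalesce(1) funnels all output through a single task, which produces exactly one part file but is only sensible for small results; at ~3 TB a single file is usually impractical. The output path is hypothetical.

```python
# One task writes one part file inside the target directory.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/single_file_output")
```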
7 votes · 4 answers

Spark: Union can only be performed on tables with the compatible column types. Struct != Struct

Error : Union can only be performed on tables with the compatible column types. struct(tier:string,skyward_number:string,skyward_points:string) <> struct(skyward_number:string,tier:string,skyward_points:string) at the first column of the…
Ravi · 198 · 1 · 1 · 8
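union matches struct fields by position, and the two sides declare them in different orders. A sketch, assuming the struct column is named skyward (the real name is truncated in the excerpt): rebuild the struct with an explicit field order before the union.

```python
from pyspark.sql import functions as F

fields = ["tier", "skyward_number", "skyward_points"]   # order from the error message
df2_fixed = df2.withColumn(
    "skyward",
    F.struct(*[F.col(f"skyward.{f}").alias(f) for f in fields]),
)
result = df1.union(df2_fixed)
```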
7 votes · 3 answers

Spark Dataframe Write to CSV creates _temporary directory file in Standalone Cluster Mode

I am running a Spark job in a cluster which has 2 worker nodes. I am using the code below (Spark Java) for saving the computed dataframe as CSV to the worker nodes.…
Omkar Puttagunta · 4,036 · 3 · 22 · 35
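When the target path is worker-local, each node keeps only its own partial output and the _temporary staging directory is never consolidated. A sketch of the usual fix, with a hypothetical HDFS path: write to storage that every executor can reach.

```python
# Shared storage (HDFS/S3) lets the output commit protocol finish and
# clean up the _temporary directory.
df.write.mode("overwrite").option("header", True).csv("hdfs:///user/output/result_csv")
```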
7 votes · 3 answers

PySpark: Insert or update dataframe with another dataframe

I have two dataframes, DF1 and DF2. DF1 is the master and DF2 is the delta. The data from DF2 should be inserted into DF1 or used to update the DF1 data. Let's say DF1 is of the following…
navin · 384 · 2 · 4 · 15
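One common sketch, assuming both frames share a schema and id is the key column: drop the DF1 rows whose key appears in DF2 (a left anti join), then append all of DF2, so delta rows both update existing keys and insert new ones.

```python
# DF1 rows untouched by the delta, plus the delta itself.
result = df1.join(df2, on="id", how="left_anti").unionByName(df2)
```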
7 votes · 3 answers

Compare a pyspark dataframe to another dataframe

I have 2 data frames to compare; both have the same number of columns, and the comparison result should have the field that is mismatching and the values along with the ID. Dataframe one +-----+---+--------+ | name| id| City| +-----+---+--------+ |…
Shijo · 9,313 · 3 · 19 · 31
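A sketch assuming id is the key and name/City are the compared columns (taken from the truncated table in the excerpt): join on the key and emit one row per mismatching field, carrying both values.

```python
from functools import reduce
from pyspark.sql import functions as F

joined = df1.alias("a").join(df2.alias("b"), "id")
diffs = [
    joined.filter(F.col(f"a.{c}") != F.col(f"b.{c}"))
          .select("id",
                  F.lit(c).alias("field"),
                  F.col(f"a.{c}").alias("df1_value"),
                  F.col(f"b.{c}").alias("df2_value"))
    for c in ["name", "City"]
]
mismatches = reduce(lambda x, y: x.unionByName(y), diffs)
```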
7 votes · 4 answers

pyspark - Convert sparse vector obtained after one hot encoding into columns

I am using Apache Spark MLlib to handle categorical features using one-hot encoding. After writing the below code I am getting a vector c_idx_vec as output of one-hot encoding. I do understand how to interpret this output vector but I am unable to…
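On Spark 3.0+, a hedged sketch uses pyspark.ml.functions.vector_to_array to turn the encoded vector into an array, then one column per slot (the column name c_idx_vec is from the excerpt; the vector length is an assumption):

```python
from pyspark.ml.functions import vector_to_array   # available since Spark 3.0
from pyspark.sql import functions as F

n = 3                                              # assumed number of categories
arr_df = df.withColumn("arr", vector_to_array("c_idx_vec"))
expanded = arr_df.select("*", *[F.col("arr")[i].alias(f"c_{i}") for i in range(n)])
```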
7 votes · 1 answer

ModuleNotFoundError in PySpark Worker on rdd.collect()

I am running an Apache Spark program in python, and I am getting an error that I can't understand and can't begin to debug. I have a driver program that defines a function called hound in a file called hound.py. In the same directory, I have a file…
Brian Nieves · 338 · 2 · 11
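The usual cause: hound.py exists on the driver but was never shipped to the workers, so tasks cannot unpickle functions defined in it. A sketch of the two standard fixes:

```python
# Option 1: ship the module from the driver before triggering any action.
spark.sparkContext.addPyFile("hound.py")

# Option 2: attach it at submit time instead:
#   spark-submit --py-files hound.py driver.py
```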
7 votes · 0 answers

Efficient Spark left join on multiple columns when dataframes are partitioned by a single column

I have two large dataframes df1 and df2 partitioned by column a, and I want to efficiently compute a left join on both a and another column b: df1.join(df2, on=['a', 'b'], how='left_outer') When written as above, Spark reshuffles both dataframes by…
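Exploiting the existing partitioning by a alone is not something the DataFrame API exposes for a join on (a, b); a hedged alternative, plainly a different technique, is to bucket both tables by the full join key so Spark can plan a shuffle-free sort-merge join when reading them back (table names and bucket count are hypothetical):

```python
# Requires saveAsTable; the bucket metadata lives in the catalog.
df1.write.bucketBy(16, "a", "b").sortBy("a", "b").saveAsTable("t1")
df2.write.bucketBy(16, "a", "b").sortBy("a", "b").saveAsTable("t2")
joined = spark.table("t1").join(spark.table("t2"), on=["a", "b"], how="left_outer")
```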
7 votes · 0 answers

how to get spark executor process PID in pyspark

Suppose a Spark job running in cluster mode launches 3 executors; how can I fetch the process ID (PID) of each of the executor processes in the Spark cluster? Is there any API for this in PySpark? EDIT: The question is about the…
TheCodeCache · 820 · 1 · 7 · 27
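No public PySpark API returns executor PIDs directly. A common probe (a sketch, with the caveat that os.getpid() inside a PySpark task reports the Python worker process, not the JVM executor) runs a trivial task on every core and collects host/PID pairs:

```python
import os
import socket

sc = spark.sparkContext
pids = (sc.parallelize(range(sc.defaultParallelism * 2))
          .map(lambda _: (socket.gethostname(), os.getpid()))
          .distinct()
          .collect())
print(pids)
```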
7 votes · 1 answer

How to read bz2 files into dataframes using pyspark?

I can read a JSON file into a dataframe in PySpark using spark = SparkSession.builder.appName('GetDetails').getOrCreate() df = spark.read.json("path to json file") However, when I try to read a bz2 (compressed CSV) into a dataframe it gives me an…
Leonius · 71 · 1 · 2
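Spark picks the decompression codec from the file extension, so a bz2-compressed CSV goes through the ordinary CSV reader rather than read.json (the path is hypothetical):

```python
df = spark.read.option("header", True).csv("path/to/data.csv.bz2")
```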
7 votes · 0 answers

How can InfluxDB be used as a Spark source

How can an InfluxDB database (which has streaming data coming in) be used as a source for Spark Streaming? Also, is it possible to use InfluxDB instead of Spark SQL for performing computations on datasets?
Mark B. · 329 · 2 · 16
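Spark ships no built-in InfluxDB source. A hedged batch sketch pulls a query result into pandas with the InfluxDB 1.x Python client and lifts it into Spark; the database, measurement, and query are assumptions, and true streaming is typically bridged through Kafka instead.

```python
from influxdb import DataFrameClient   # pip install influxdb (1.x client)

client = DataFrameClient(host="localhost", port=8086, database="metrics")
frames = client.query("SELECT * FROM cpu WHERE time > now() - 1h")
pdf = frames["cpu"].reset_index()      # the 'cpu' measurement is an assumption
sdf = spark.createDataFrame(pdf)
```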
7 votes · 1 answer

Pyspark sql: Create a new column based on whether a value exists in a different DataFrame's column

I tried to follow this answer but my question is slightly different. I have two pyspark data frames df2 and bears2. Both have an integer variable, and I want to create a boolean like this pseudocode: df3 = df2.withColumn("game",…
mlewis · 101 · 1 · 2 · 6
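A sketch assuming the shared integer column is named value (the excerpt truncates before naming it): left-join against the distinct values of the other frame and flag the matches.

```python
from pyspark.sql import functions as F

flags = bears2.select("value").distinct().withColumn("game", F.lit(True))
df3 = (df2.join(flags, on="value", how="left")
          .withColumn("game", F.coalesce(F.col("game"), F.lit(False))))
```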