Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

26,508 questions
7 votes · 2 answers

How to select multiple columns of dataset, given a list of column names?

How can I select multiple columns of dataset ds in Spark 2.3 Java by passing a list argument? For example, this works fine: ds.select("col1","col2","col3").show(); However, this fails: List columns =…
ScalaBoy · 3,254 · 13 · 46 · 84
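A minimal PySpark sketch of the list-based selection (the question itself targets the Java API, where Dataset.select(Column...) can be fed a List mapped through functions.col; the column names below are hypothetical):

```python
# Assumes a live SparkSession `spark`, as in the pyspark shell.
ds = spark.createDataFrame([(1, 2, 3, 4)], ["col1", "col2", "col3", "col4"])

columns = ["col1", "col2", "col3"]
ds.select(*columns).show()   # unpacked names...
ds.select(columns).show()    # ...or the list itself; PySpark accepts both
```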
7 votes · 4 answers

Select columns which contain a string in pyspark

I have a pyspark dataframe with a lot of columns, and I want to select the ones which contain a certain string, and others. For example: df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index'] I want to select the ones…
Manrique · 2,083 · 3 · 15 · 38
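A minimal sketch, reusing the column names from the excerpt: filter df.columns with a plain list comprehension and hand the result to select.

```python
# Columns whose name contains "hello", plus "index".
wanted = [c for c in df.columns if "hello" in c] + ["index"]
df.select(wanted).show()
```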
7 votes · 0 answers

Convert Spark DataSet to Java Pojo class

I am trying to convert a Dataset to a Java object. The schema is like root |-- deptId: long (nullable = true) |-- depNameName: string (nullable = true) |-- employee: array (nullable = true) | |-- element: struct (containsNull = true) | | …
Aslan · 71 · 1 · 2
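In the Java API this kind of mapping is usually done with ds.as(Encoders.bean(Dept.class)); as a hedged PySpark analogue of the same technique, rows can be collected into typed objects (the Dept class is hypothetical; field names come from the schema in the excerpt):

```python
from dataclasses import dataclass

@dataclass
class Dept:              # hypothetical class mirroring the printed schema
    deptId: int
    depNameName: str
    employee: list

# Map each Row into a typed object; `or []` guards the nullable array field.
depts = [Dept(r.deptId, r.depNameName, list(r.employee or [])) for r in df.collect()]
```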
7 votes · 2 answers

Write Spark dataframe to a single Parquet file

I am trying to do something very simple and I'm having some very stupid struggles. I think it must have to do with a fundamental misunderstanding of what spark is doing. I would greatly appreciate any help or explanation. I have a very large (~3 TB,…
seth127 · 2,594 · 5 · 30 · 43
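A common sketch: coalesce(1) funnels all output through a single task, which produces exactly one part file but is only sensible for small results; at ~3 TB a single file is usually impractical. The output path is hypothetical.

```python
# One task writes one part file inside the target directory.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/single_file_output")
```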
7 votes · 4 answers

Spark: Union can only be performed on tables with the compatible column types. Struct != Struct

Error : Union can only be performed on tables with the compatible column types. struct(tier:string,skyward_number:string,skyward_points:string) <> struct(skyward_number:string,tier:string,skyward_points:string) at the first column of the…
Ravi · 198 · 1 · 1 · 8
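union matches struct fields by position, and the two sides declare them in different orders. A sketch, assuming the struct column is named skyward (the real name is truncated in the excerpt): rebuild the struct with an explicit field order before the union.

```python
from pyspark.sql import functions as F

fields = ["tier", "skyward_number", "skyward_points"]   # order from the error message
df2_fixed = df2.withColumn(
    "skyward",
    F.struct(*[F.col(f"skyward.{f}").alias(f) for f in fields]),
)
result = df1.union(df2_fixed)
```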
7 votes · 3 answers

Spark Dataframe Write to CSV creates _temporary directory file in Standalone Cluster Mode

I am running a Spark job in a cluster which has 2 worker nodes. I am using the code below (Spark Java) for saving the computed dataframe as CSV to the worker nodes.…
Omkar Puttagunta · 4,036 · 3 · 22 · 35
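When the target path is worker-local, each node keeps only its own partial output and the _temporary staging directory is never consolidated. A sketch of the usual fix, with a hypothetical HDFS path: write to storage that every executor can reach.

```python
# Shared storage (HDFS/S3) lets the output commit protocol finish and
# clean up the _temporary directory.
df.write.mode("overwrite").option("header", True).csv("hdfs:///user/output/result_csv")
```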
7 votes · 3 answers

PySpark: Insert or update dataframe with another dataframe

I have two dataframes, DF1 and DF2. DF1 is the master and DF2 is the delta. The data from DF2 should be inserted into DF1 or used to update the DF1 data. Let's say DF1 is of the following…
navin · 384 · 2 · 4 · 15
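One common sketch, assuming both frames share a schema and id is the key column: drop the DF1 rows whose key appears in DF2 (a left anti join), then append all of DF2, so delta rows both update existing keys and insert new ones.

```python
# DF1 rows untouched by the delta, plus the delta itself.
result = df1.join(df2, on="id", how="left_anti").unionByName(df2)
```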
7 votes · 3 answers

Compare a pyspark dataframe to another dataframe

I have 2 data frames to compare; both have the same number of columns, and the comparison result should have the field that is mismatching and the values along with the ID. Dataframe one +-----+---+--------+ | name| id| City| +-----+---+--------+ |…
Shijo · 9,313 · 3 · 19 · 31
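A sketch assuming id is the key and name/City are the compared columns (taken from the truncated table in the excerpt): join on the key and emit one row per mismatching field, carrying both values.

```python
from functools import reduce
from pyspark.sql import functions as F

joined = df1.alias("a").join(df2.alias("b"), "id")
diffs = [
    joined.filter(F.col(f"a.{c}") != F.col(f"b.{c}"))
          .select("id",
                  F.lit(c).alias("field"),
                  F.col(f"a.{c}").alias("df1_value"),
                  F.col(f"b.{c}").alias("df2_value"))
    for c in ["name", "City"]
]
mismatches = reduce(lambda x, y: x.unionByName(y), diffs)
```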
7 votes · 4 answers

pyspark - Convert sparse vector obtained after one hot encoding into columns

I am using Apache Spark MLlib to handle categorical features using one-hot encoding. After writing the below code I am getting a vector c_idx_vec as output of one-hot encoding. I do understand how to interpret this output vector but I am unable to…
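On Spark 3.0+, a hedged sketch uses pyspark.ml.functions.vector_to_array to turn the encoded vector into an array, then one column per slot (the column name c_idx_vec is from the excerpt; the vector length is an assumption):

```python
from pyspark.ml.functions import vector_to_array   # available since Spark 3.0
from pyspark.sql import functions as F

n = 3                                              # assumed number of categories
arr_df = df.withColumn("arr", vector_to_array("c_idx_vec"))
expanded = arr_df.select("*", *[F.col("arr")[i].alias(f"c_{i}") for i in range(n)])
```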
7 votes · 1 answer

ModuleNotFoundError in PySpark Worker on rdd.collect()

I am running an Apache Spark program in python, and I am getting an error that I can't understand and can't begin to debug. I have a driver program that defines a function called hound in a file called hound.py. In the same directory, I have a file…
Brian Nieves · 338 · 2 · 11
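The usual cause: hound.py exists on the driver but was never shipped to the workers, so tasks cannot unpickle functions defined in it. A sketch of the two standard fixes:

```python
# Option 1: ship the module from the driver before triggering any action.
spark.sparkContext.addPyFile("hound.py")

# Option 2: attach it at submit time instead:
#   spark-submit --py-files hound.py driver.py
```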
7 votes · 0 answers

Efficient Spark left join on multiple columns when dataframes are partitioned by a single column

I have two large dataframes df1 and df2 partitioned by column a, and I want to efficiently compute a left join on both a and another column b: df1.join(df2, on=['a', 'b'], how='left_outer') When written as above, Spark reshuffles both dataframes by…
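Exploiting the existing partitioning by a alone is not something the DataFrame API exposes for a join on (a, b); a hedged alternative, plainly a different technique, is to bucket both tables by the full join key so Spark can plan a shuffle-free sort-merge join when reading them back (table names and bucket count are hypothetical):

```python
# Requires saveAsTable; the bucket metadata lives in the catalog.
df1.write.bucketBy(16, "a", "b").sortBy("a", "b").saveAsTable("t1")
df2.write.bucketBy(16, "a", "b").sortBy("a", "b").saveAsTable("t2")
joined = spark.table("t1").join(spark.table("t2"), on=["a", "b"], how="left_outer")
```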
7 votes · 0 answers

how to get spark executor process PID in pyspark

Suppose a Spark job running in cluster mode launches 3 executors; how can I fetch the process ID (PID) of each of the executor processes in the Spark cluster? Is there any API for this in PySpark? EDIT: The question is about the…
TheCodeCache · 820 · 1 · 7 · 27
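No public PySpark API returns executor PIDs directly. A common probe (a sketch, with the caveat that os.getpid() inside a PySpark task reports the Python worker process, not the JVM executor) runs a trivial task on every core and collects host/PID pairs:

```python
import os
import socket

sc = spark.sparkContext
pids = (sc.parallelize(range(sc.defaultParallelism * 2))
          .map(lambda _: (socket.gethostname(), os.getpid()))
          .distinct()
          .collect())
print(pids)
```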
7 votes · 1 answer

How to read bz2 files into dataframes using pyspark?

I can read a JSON file into a dataframe in PySpark using spark = SparkSession.builder.appName('GetDetails').getOrCreate() df = spark.read.json("path to json file") However, when I try to read a bz2 (compressed CSV) into a dataframe it gives me an…
Leonius · 71 · 1 · 2
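Spark picks the decompression codec from the file extension, so a bz2-compressed CSV goes through the ordinary CSV reader rather than read.json (the path is hypothetical):

```python
df = spark.read.option("header", True).csv("path/to/data.csv.bz2")
```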
7 votes · 0 answers

How can InfluxDB be used as a Spark source

How can an InfluxDB database (which has streaming data coming in) be used as a source for Spark Streaming? Also, is it possible to use InfluxDB instead of Spark SQL for performing computations on datasets?
Mark B. · 329 · 2 · 16
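Spark ships no built-in InfluxDB source. A hedged batch sketch pulls a query result into pandas with the InfluxDB 1.x Python client and lifts it into Spark; the database, measurement, and query are assumptions, and true streaming is typically bridged through Kafka instead.

```python
from influxdb import DataFrameClient   # pip install influxdb (1.x client)

client = DataFrameClient(host="localhost", port=8086, database="metrics")
frames = client.query("SELECT * FROM cpu WHERE time > now() - 1h")
pdf = frames["cpu"].reset_index()      # the 'cpu' measurement is an assumption
sdf = spark.createDataFrame(pdf)
```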
7 votes · 1 answer

Pyspark sql: Create a new column based on whether a value exists in a different DataFrame's column

I tried to follow this answer but my question is slightly different. I have two pyspark data frames df2 and bears2. Both have an integer variable, and I want to create a boolean like this pseudocode: df3 = df2.withColumn("game",…
mlewis · 101 · 1 · 2 · 6
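A sketch assuming the shared integer column is named value (the excerpt truncates before naming it): left-join against the distinct values of the other frame and flag the matches.

```python
from pyspark.sql import functions as F

flags = bears2.select("value").distinct().withColumn("game", F.lit(True))
df3 = (df2.join(flags, on="value", how="left")
          .withColumn("game", F.coalesce(F.col("game"), F.lit(False))))
```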