Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

26508 questions
8
votes
2 answers

Getting NullPointerException using spark-csv with DataFrames

Running through the spark-csv README, there's sample Java code like this: import org.apache.spark.sql.SQLContext; import org.apache.spark.sql.types.*; SQLContext sqlContext = new SQLContext(sc); StructType customSchema = new StructType( new…
Dennis Huo
  • 10,517
  • 27
  • 43
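
A minimal PySpark sketch of the pattern the README shows in Java: reading a CSV through the spark-csv package with an explicit schema. The file and column names here are illustrative, and the com.databricks:spark-csv artifact is assumed to be on the classpath (e.g. via --packages):

```python
# Minimal sketch (Spark 1.4+ reader API); file and column names are illustrative.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = SparkContext(appName="spark-csv-example")
sqlContext = SQLContext(sc)

# An explicit schema avoids type inference; all fields are nullable here.
customSchema = StructType([
    StructField("year", IntegerType(), True),
    StructField("make", StringType(), True),
    StructField("model", StringType(), True),
])

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema)
      .load("cars.csv"))
df.printSchema()
```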
8
votes
5 answers

How to create an SQLContext in Spark using Scala?

I am creating a Scala program that uses SQLContext, built with sbt. This is my build.sbt: name := "sampleScalaProject" version := "1.0" scalaVersion := "2.11.7" //libraryDependencies += "org.apache.spark" %% "spark-core" % "2.5.2" libraryDependencies +=…
Amaresh
  • 3,231
  • 7
  • 37
  • 60
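
In Scala the construction is `val sqlContext = new org.apache.spark.sql.SQLContext(sc)`, with the spark-sql artifact added to libraryDependencies alongside spark-core. A minimal PySpark sketch of the same Spark 1.x entry point (app name and data are illustrative):

```python
# Minimal sketch (Spark 1.x): an SQLContext is built on top of a SparkContext.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("sampleScalaProject").setMaster("local[*]")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# SQLContext is the entry point for DataFrame and SQL functionality.
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
```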
8
votes
2 answers

Apache Zeppelin tutorial, error "sql interpreter not found"

In the "Zeppelin tutorial" notebook, I can't use the %sql interpreter. It will output "sql interpreter not found". But the spark commands work fine, as well as %md and %sh. Here's the log : ERROR [2015-10-20 10:13:35,045] ({qtp885851948-51}…
thomas legrand
  • 493
  • 1
  • 5
  • 16
8
votes
2 answers

How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?

We are using the PySpark libraries interfacing with Spark 1.3.1. We have two dataframes, documents_df := {document_id, document_text} and keywords_df := {keyword}. We would like to JOIN the two dataframes and return a resulting dataframe with…
Will Hardman
  • 193
  • 1
  • 2
  • 8
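
A hedged sketch of one way to express such a join in PySpark: use Column.contains (the DSL analogue of LIKE '%kw%') as the join condition. DataFrame and column names follow the question; the API shown is more recent than the 1.3.1 release the question targets, and the non-equi condition implies comparing every document against every keyword:

```python
# Hedged sketch: keep (document, keyword) pairs where the text contains the keyword.
from pyspark.sql import functions as F

result = documents_df.join(
    keywords_df,
    F.col("document_text").contains(F.col("keyword")),  # SQL-esque LIKE '%keyword%'
)
result.select("document_id", "keyword").show()
```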
8
votes
1 answer

How to print an RDD in Python in Spark

I have two files on HDFS and I just want to join them on a column, say employee ID. I am trying to simply print the files to make sure we are reading them correctly from HDFS. lines = sc.textFile("hdfs://ip:8020/emp.txt") print…
yguw
  • 856
  • 6
  • 12
  • 32
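
print(rdd) only prints the RDD object, because the data itself lives on the cluster; it has to be materialized on the driver first. A minimal sketch, reusing the question's path:

```python
# Minimal sketch: pull a sample back to the driver before printing.
lines = sc.textFile("hdfs://ip:8020/emp.txt")

for line in lines.take(10):   # take() avoids collecting the whole file
    print(line)

# For small files only, collect() returns everything:
# print(lines.collect())
```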
8
votes
1 answer

Saving Spark DataFrames as Parquet files - no errors, but data is not being saved

I want to save a dataframe as a parquet file in Python, but I am only able to save the schema, not the data itself. I have reduced my problem down to a very simple Python test case, which I copied below from an IPython notebook. Any advice on what might be…
GrahamM
  • 91
  • 1
  • 3
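
A hedged sketch of the round trip. On Spark 1.4+ the writer API is df.write.parquet; on 1.3 it was saveAsParquetFile. The written path is a directory of part files, so an empty-looking result usually means the DataFrame itself had no rows, and df.count() before writing is a quick sanity check (the path below is illustrative):

```python
# Hedged sketch; the output path is illustrative.
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

print(df.count())                          # sanity check: is there data to save?

df.write.parquet("/tmp/df.parquet")        # Spark 1.4+ writer API
# df.saveAsParquetFile("/tmp/df.parquet")  # Spark 1.3 equivalent

sqlContext.read.parquet("/tmp/df.parquet").show()
```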
8
votes
1 answer

Saving Spark DataFrames with nested User Data Types

I want to save (as a parquet file) a Spark DataFrame that contains a custom class as a column. This class is composed of a Seq of another custom class. To do so, I create a UserDefinedType class for each of these classes, in a similar way to…
João Duarte
  • 93
  • 1
  • 7
8
votes
4 answers

Methods of max() and sum() undefined in the Java Spark Dataframe API (1.4.1)

I put the sample DataFrame.groupBy() code into my program, but it shows the max() and sum() methods as undefined. df.groupBy("department").agg(max("age"), sum("expense")); Which Java package should I import if I want to use the max() and sum() methods?…
Jingyu Zhang
  • 81
  • 1
  • 1
  • 2
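
In the Java API these aggregate functions are static methods on org.apache.spark.sql.functions, so the missing piece is `import static org.apache.spark.sql.functions.*;`. The PySpark analogue, as a minimal sketch:

```python
# Minimal sketch: aggregate functions live in pyspark.sql.functions.
from pyspark.sql import functions as F

# df is assumed to be an existing DataFrame with these columns.
result = df.groupBy("department").agg(
    F.max("age"),       # max()/sum() from the functions module, not Python builtins
    F.sum("expense"),
)
result.show()
```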
8
votes
1 answer

SparkSQL, Thrift Server and Tableau

I am wondering if there is a way to make the SparkSQL table in sqlContext directly visible to other processes, for example Tableau. I did some research on the Thrift server, but I didn't find any specific explanation about it. Is it a middleware…
user3693309
  • 343
  • 4
  • 14
8
votes
3 answers

Merge multiple small files into a few larger files in Spark

I am using Hive through Spark, and I have an INSERT INTO partitioned-table query in my Spark code. The input data is 200+ GB. When Spark writes to a partitioned table, it spits out very small files (files in KBs), so now the output partitioned table…
dheee
  • 1,588
  • 3
  • 15
  • 25
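
A hedged sketch of the usual remedy: control the output file count by repartitioning on the partition column before the write, so each partition receives a few large files instead of many KB-sized ones. Table, path, and column names are illustrative, and the call shown uses the modern (Spark 2.x-style) API:

```python
# Hedged sketch; table, path, and column names are illustrative.
df = sqlContext.table("source_table")

# One shuffle partition per value of the partition column means each
# output partition is written as a small number of larger files.
(df.repartition("partition_col")
   .write
   .mode("overwrite")
   .partitionBy("partition_col")
   .parquet("/warehouse/target_table"))
```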
8
votes
2 answers

Joining two spark dataframes on time (TimestampType) in python

I have two dataframes and I would like to join them based on one column, with the caveat that this column is a timestamp, and the timestamp has to be within a certain offset (5 seconds) in order to join records. More specifically, a record in…
Oleksiy
  • 6,337
  • 5
  • 41
  • 58
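
A hedged sketch of one way to express the tolerance join: cast both timestamps to epoch seconds and join where the absolute difference is within the offset. Names are illustrative, and as a non-equi join it can be expensive without an additional equality key:

```python
# Hedged sketch; DataFrame and column names are illustrative.
from pyspark.sql import functions as F

a = df1.alias("a")
b = df2.alias("b")

# Casting TimestampType to long yields epoch seconds.
joined = a.join(
    b,
    F.abs(F.col("a.ts").cast("long") - F.col("b.ts").cast("long")) <= 5,
)
joined.show()
```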
8
votes
2 answers

Not able to connect to postgres using jdbc in pyspark shell

I am using a standalone cluster on my local Windows machine and trying to load data from one of our servers using the following code - from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.load(source="jdbc",…
Soni Shashank
  • 221
  • 1
  • 3
  • 9
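
A hedged sketch of the JDBC read. The two usual failure points are the driver jar (the PostgreSQL JDBC driver has to be on the classpath, e.g. passed via --jars when launching the shell) and the driver class name. URL, table, and credentials below are illustrative; the .read.format("jdbc") form is the post-1.4 replacement for sqlContext.load(source="jdbc", ...):

```python
# Hedged sketch; URL, table, and credentials are illustrative.
# Launch the shell with the driver jar, e.g.:
#   pyspark --jars /path/to/postgresql-<version>.jar
df = (sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://server:5432/dbname")
      .option("dbtable", "schema.tablename")
      .option("user", "username")
      .option("password", "password")
      .option("driver", "org.postgresql.Driver")
      .load())
df.printSchema()
```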
8
votes
2 answers

compute string length in Spark SQL DSL

Edit: this is an old question concerning Spark 1.2. I've been trying to compute the length of a string column in a SchemaRDD on the fly, for orderBy purposes. I am learning Spark SQL, so my question is strictly about using the DSL or the SQL interface…
Wilmerton
  • 1,448
  • 1
  • 12
  • 31
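
On later releases the DSL answer is the length function from the functions module; a minimal PySpark sketch (the question itself concerns the Spark 1.2 SchemaRDD era, where this helper did not yet exist):

```python
# Minimal sketch (Spark 1.5+): order rows by the length of a string column.
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([("a",), ("abc",), ("ab",)], ["text"])

# DSL form:
df.orderBy(F.length("text").desc()).show()

# Equivalent through the SQL interface:
df.registerTempTable("t")
sqlContext.sql("SELECT text FROM t ORDER BY length(text) DESC").show()
```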
8
votes
2 answers

SparkSQL MissingRequirementError when registering table

I'm a newbie to Scala and Apache Spark and I'm trying to use Spark SQL. After cloning the repo, I started the Spark shell by typing bin/spark-shell and ran the following: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import…
se7entyse7en
  • 4,310
  • 7
  • 33
  • 50
8
votes
1 answer

Saving a >>25T SchemaRDD in Parquet format on S3

I have encountered a number of problems when trying to save a very large SchemaRDD in Parquet format on S3. I have already posted specific questions for those problems, but this is what I really need to do. The code should look something like…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
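
A hedged sketch of the shape such a job usually takes: pick a partition count that yields part files of a sane size, then write with the Parquet writer directly to S3. Bucket, paths, and the partition count are illustrative; a >25 TB dataset brings tuning concerns (S3 committers, memory, retries) that a sketch cannot capture:

```python
# Hedged sketch; bucket, paths, and partition count are illustrative.
df = sqlContext.read.json("s3n://bucket/input/")   # any large source

(df.repartition(10000)       # aim for part files of a few hundred MB each
   .write
   .mode("overwrite")
   .parquet("s3n://bucket/output/"))
```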