Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

26508 questions
8
votes
2 answers

Getting NullPointerException using spark-csv with DataFrames

Running through the spark-csv README, there's sample Java code like this: import org.apache.spark.sql.SQLContext; import org.apache.spark.sql.types.*; SQLContext sqlContext = new SQLContext(sc); StructType customSchema = new StructType( new…
Dennis Huo
  • 10,517
  • 27
  • 43
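
A minimal PySpark sketch of the pattern the README shows in Java: reading a CSV through the spark-csv package with an explicit schema. The file and column names here are illustrative, and the com.databricks:spark-csv artifact is assumed to be on the classpath (e.g. via --packages):

```python
# Minimal sketch (Spark 1.4+ reader API); file and column names are illustrative.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = SparkContext(appName="spark-csv-example")
sqlContext = SQLContext(sc)

# An explicit schema avoids type inference; all fields are nullable here.
customSchema = StructType([
    StructField("year", IntegerType(), True),
    StructField("make", StringType(), True),
    StructField("model", StringType(), True),
])

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema)
      .load("cars.csv"))
df.printSchema()
```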
8
votes
5 answers

How to create an SQLContext in Spark using Scala?

I am creating a Scala program that uses SQLContext, built with sbt. This is my build.sbt: name := "sampleScalaProject" version := "1.0" scalaVersion := "2.11.7" //libraryDependencies += "org.apache.spark" %% "spark-core" % "2.5.2" libraryDependencies +=…
Amaresh
  • 3,231
  • 7
  • 37
  • 60
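
In Scala the construction is `val sqlContext = new org.apache.spark.sql.SQLContext(sc)`, with the spark-sql artifact added to libraryDependencies alongside spark-core. A minimal PySpark sketch of the same Spark 1.x entry point (app name and data are illustrative):

```python
# Minimal sketch (Spark 1.x): an SQLContext is built on top of a SparkContext.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("sampleScalaProject").setMaster("local[*]")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# SQLContext is the entry point for DataFrame and SQL functionality.
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
```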
8
votes
2 answers

Apache Zeppelin tutorial, error "sql interpreter not found"

In the "Zeppelin tutorial" notebook, I can't use the %sql interpreter. It will output "sql interpreter not found". But the spark commands work fine, as well as %md and %sh. Here's the log : ERROR [2015-10-20 10:13:35,045] ({qtp885851948-51}…
thomas legrand
  • 493
  • 1
  • 5
  • 16
8
votes
2 answers

How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?

We are using the PySpark libraries interfacing with Spark 1.3.1. We have two dataframes, documents_df := {document_id, document_text} and keywords_df := {keyword}. We would like to JOIN the two dataframes and return a resulting dataframe with…
Will Hardman
  • 193
  • 1
  • 2
  • 8
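
A hedged sketch of one way to express such a join in PySpark: use Column.contains (the DSL analogue of LIKE '%kw%') as the join condition. DataFrame and column names follow the question; the API shown is more recent than the 1.3.1 release the question targets, and the non-equi condition implies comparing every document against every keyword:

```python
# Hedged sketch: keep (document, keyword) pairs where the text contains the keyword.
from pyspark.sql import functions as F

result = documents_df.join(
    keywords_df,
    F.col("document_text").contains(F.col("keyword")),  # SQL-esque LIKE '%keyword%'
)
result.select("document_id", "keyword").show()
```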
8
votes
1 answer

How to print an RDD in Python in Spark

I have two files on HDFS and I just want to join them on a column, say employee ID. I am trying to simply print the files to make sure we are reading them correctly from HDFS. lines = sc.textFile("hdfs://ip:8020/emp.txt") print…
yguw
  • 856
  • 6
  • 12
  • 32
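
print(rdd) only prints the RDD object, because the data itself lives on the cluster; it has to be materialized on the driver first. A minimal sketch, reusing the question's path:

```python
# Minimal sketch: pull a sample back to the driver before printing.
lines = sc.textFile("hdfs://ip:8020/emp.txt")

for line in lines.take(10):   # take() avoids collecting the whole file
    print(line)

# For small files only, collect() returns everything:
# print(lines.collect())
```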
8
votes
1 answer

Saving Spark DataFrames as Parquet files - no errors, but data is not being saved

I want to save a dataframe as a parquet file in Python, but I am only able to save the schema, not the data itself. I have reduced my problem down to a very simple Python test case, which I copied below from an IPython notebook. Any advice on what might be…
GrahamM
  • 91
  • 1
  • 3
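
A hedged sketch of the round trip. On Spark 1.4+ the writer API is df.write.parquet; on 1.3 it was saveAsParquetFile. The written path is a directory of part files, so an empty-looking result usually means the DataFrame itself had no rows, and df.count() before writing is a quick sanity check (the path below is illustrative):

```python
# Hedged sketch; the output path is illustrative.
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

print(df.count())                          # sanity check: is there data to save?

df.write.parquet("/tmp/df.parquet")        # Spark 1.4+ writer API
# df.saveAsParquetFile("/tmp/df.parquet")  # Spark 1.3 equivalent

sqlContext.read.parquet("/tmp/df.parquet").show()
```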
8
votes
1 answer

Saving Spark DataFrames with nested User Data Types

I want to save (as a parquet file) a Spark DataFrame that contains a custom class as a column. This class is composed of a Seq of another custom class. To do so, I create a UserDefinedType class for each of these classes, in a similar way to…
João Duarte
  • 93
  • 1
  • 7
8
votes
4 answers

Methods of max() and sum() undefined in the Java Spark Dataframe API (1.4.1)

I put the sample DataFrame.groupBy() code into my program, but it shows the max() and sum() methods as undefined. df.groupBy("department").agg(max("age"), sum("expense")); Which Java package should I import if I want to use the max() and sum() methods?…
Jingyu Zhang
  • 81
  • 1
  • 1
  • 2
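
In the Java API these aggregate functions are static methods on org.apache.spark.sql.functions, so the missing piece is `import static org.apache.spark.sql.functions.*;`. The PySpark analogue, as a minimal sketch:

```python
# Minimal sketch: aggregate functions live in pyspark.sql.functions.
from pyspark.sql import functions as F

# df is assumed to be an existing DataFrame with these columns.
result = df.groupBy("department").agg(
    F.max("age"),       # max()/sum() from the functions module, not Python builtins
    F.sum("expense"),
)
result.show()
```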
8
votes
1 answer

SparkSQL, Thrift Server and Tableau

I am wondering if there is a way to make the SparkSQL table in sqlContext directly visible to other processes, for example Tableau. I did some research on the Thrift server, but I didn't find any specific explanation about it. Is it a middleware…
user3693309
  • 343
  • 4
  • 14
8
votes
3 answers

Merge multiple small files into a few larger files in Spark

I am using Hive through Spark, and I have an INSERT INTO partitioned-table query in my Spark code. The input data is 200+ GB. When Spark writes to a partitioned table, it spits out very small files (files in KBs), so now the output partitioned table…
dheee
  • 1,588
  • 3
  • 15
  • 25
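
A hedged sketch of the usual remedy: control the output file count by repartitioning on the partition column before the write, so each partition receives a few large files instead of many KB-sized ones. Table, path, and column names are illustrative, and the call shown uses the modern (Spark 2.x-style) API:

```python
# Hedged sketch; table, path, and column names are illustrative.
df = sqlContext.table("source_table")

# One shuffle partition per value of the partition column means each
# output partition is written as a small number of larger files.
(df.repartition("partition_col")
   .write
   .mode("overwrite")
   .partitionBy("partition_col")
   .parquet("/warehouse/target_table"))
```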
8
votes
2 answers

Joining two spark dataframes on time (TimestampType) in python

I have two dataframes and I would like to join them based on one column, with the caveat that this column is a timestamp, and the timestamp has to be within a certain offset (5 seconds) in order to join records. More specifically, a record in…
Oleksiy
  • 6,337
  • 5
  • 41
  • 58
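
A hedged sketch of one way to express the tolerance join: cast both timestamps to epoch seconds and join where the absolute difference is within the offset. Names are illustrative, and as a non-equi join it can be expensive without an additional equality key:

```python
# Hedged sketch; DataFrame and column names are illustrative.
from pyspark.sql import functions as F

a = df1.alias("a")
b = df2.alias("b")

# Casting TimestampType to long yields epoch seconds.
joined = a.join(
    b,
    F.abs(F.col("a.ts").cast("long") - F.col("b.ts").cast("long")) <= 5,
)
joined.show()
```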
8
votes
2 answers

Not able to connect to postgres using jdbc in pyspark shell

I am using a standalone cluster on my local Windows machine and trying to load data from one of our servers using the following code - from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.load(source="jdbc",…
Soni Shashank
  • 221
  • 1
  • 3
  • 9
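
A hedged sketch of the JDBC read. The two usual failure points are the driver jar (the PostgreSQL JDBC driver has to be on the classpath, e.g. passed via --jars when launching the shell) and the driver class name. URL, table, and credentials below are illustrative; the .read.format("jdbc") form is the post-1.4 replacement for sqlContext.load(source="jdbc", ...):

```python
# Hedged sketch; URL, table, and credentials are illustrative.
# Launch the shell with the driver jar, e.g.:
#   pyspark --jars /path/to/postgresql-<version>.jar
df = (sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://server:5432/dbname")
      .option("dbtable", "schema.tablename")
      .option("user", "username")
      .option("password", "password")
      .option("driver", "org.postgresql.Driver")
      .load())
df.printSchema()
```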
8
votes
2 answers

compute string length in Spark SQL DSL

Edit: this is an old question concerning Spark 1.2. I've been trying to compute the length of a string column in a SchemaRDD on the fly, for orderBy purposes. I am learning Spark SQL, so my question is strictly about using the DSL or the SQL interface…
Wilmerton
  • 1,448
  • 1
  • 12
  • 31
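
On later releases the DSL answer is the length function from the functions module; a minimal PySpark sketch (the question itself concerns the Spark 1.2 SchemaRDD era, where this helper did not yet exist):

```python
# Minimal sketch (Spark 1.5+): order rows by the length of a string column.
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([("a",), ("abc",), ("ab",)], ["text"])

# DSL form:
df.orderBy(F.length("text").desc()).show()

# Equivalent through the SQL interface:
df.registerTempTable("t")
sqlContext.sql("SELECT text FROM t ORDER BY length(text) DESC").show()
```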
8
votes
2 answers

SparkSQL MissingRequirementError when registering table

I'm a newbie to Scala and Apache Spark and I'm trying to use Spark SQL. After cloning the repo, I started the Spark shell by typing bin/spark-shell and ran the following: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import…
se7entyse7en
  • 4,310
  • 7
  • 33
  • 50
8
votes
1 answer

Saving a >>25T SchemaRDD in Parquet format on S3

I have encountered a number of problems when trying to save a very large SchemaRDD in Parquet format on S3. I have already posted specific questions for those problems, but this is what I really need to do. The code should look something like…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
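
A hedged sketch of the shape such a job usually takes: pick a partition count that yields part files of a sane size, then write with the Parquet writer directly to S3. Bucket, paths, and the partition count are illustrative; a >25 TB dataset brings tuning concerns (S3 committers, memory, retries) that a sketch cannot capture:

```python
# Hedged sketch; bucket, paths, and partition count are illustrative.
df = sqlContext.read.json("s3n://bucket/input/")   # any large source

(df.repartition(10000)       # aim for part files of a few hundred MB each
   .write
   .mode("overwrite")
   .parquet("s3n://bucket/output/"))
```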