Questions tagged [apache-spark-1.6]
Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark, use the tag [apache-spark].
111 questions
0
votes
1 answer
How to add double quotes to the string?
I have a JSON-like string like this:
{cid: {ABCD[1]_TYPE, [text]: alphabets, time: 1/12/2010, author: xyz, best_chapter: 10.5}
And I need to add double quotes around every string to make it look like real JSON:
{"cid": {"ABCD[1]_TYPE", "[text]":…

xin
- 135
- 11
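A minimal Scala sketch of one way to approach this, assuming every key and value is a bare token delimited only by braces, colons, and commas (the sample string is more irregular than real JSON, so the regex is purely illustrative):

// Sample string copied from the question, as given.
val raw = "{cid: {ABCD[1]_TYPE, [text]: alphabets, time: 1/12/2010, author: xyz, best_chapter: 10.5}"
// Wrap every run of characters that is not a brace, colon, comma, or leading whitespace in double quotes.
val quoted = raw.replaceAll("""([^{}:,\s][^{}:,]*)""", "\"$1\"")
println(quoted)
// prints: {"cid": {"ABCD[1]_TYPE", "[text]": "alphabets", ...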
0
votes
0 answers
Saving Spark output to CSV in Spark 1.6
Spark 1.6, Scala
How do I save output to a CSV file in Spark 1.6?
I did something like this:
myCleanData.write.mode(SaveMode.Append).csv(path="file:///filepath")
but it throws an error:
cannot resolve symbol csv
I even tried this for the dependency:
…

Sophie Dinka
- 73
- 1
- 8
0
votes
1 answer
UDF in Spark 1.6: reassignment to val error
I am using Spark 1.6
The UDF below is used to clean address data.
sqlContext.udf.register("cleanaddress", (AD1:String,AD2: String, AD3:String)=>Boolean = _.matches("^[a-zA-Z0-9]*$"))
UDF name: cleanaddress
The three input parameters come from…

Sophie Dinka
- 73
- 1
- 8
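The registration above mixes a typed lambda with an underscore function, which is what the compiler rejects. A minimal sketch of one way the registration could look, assuming the intent is "all three address fields are alphanumeric" (that combination logic is an assumption, not taken from the question):

sqlContext.udf.register("cleanaddress",
  (ad1: String, ad2: String, ad3: String) =>
    Seq(ad1, ad2, ad3).forall(s => s != null && s.matches("^[a-zA-Z0-9]*$")))

// Hypothetical usage from SQL, with made-up table and column names:
// sqlContext.sql("SELECT * FROM addresses WHERE cleanaddress(ad1, ad2, ad3)")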
0
votes
1 answer
How to list All Databases using HiveContext in PySpark 1.6
I am trying to list all the databases using HiveContext in Spark 1.6, but it's giving me just the default database.
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
from pyspark.sql import…

Ashish Kumar Singh
- 13
- 1
- 6
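A plain SQLContext only knows its own in-memory "default" database; the Hive metastore databases are visible through a HiveContext. A minimal Scala sketch of the idea (the PySpark equivalent is HiveContext(sc).sql("SHOW DATABASES").show()):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: the existing SparkContext
hiveContext.sql("SHOW DATABASES").show()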
0
votes
1 answer
Iterating over a grouped dataset in Spark 1.6
In an ordered dataset, I want to aggregate data until a condition is met, but grouped by a certain key.
To give my question some context, I have simplified my problem to the statement below:
In Spark I need to aggregate strings, grouped by key…

Havnar
- 2,558
- 7
- 33
- 62
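A minimal Scala sketch of the "aggregate per key until a condition is met" idea using the RDD API, with made-up data and a made-up stopping condition (stop once the accumulated string reaches 10 characters); the real ordering column and condition would come from the full question:

val data = sc.parallelize(Seq(
  ("k1", 1, "foo"), ("k1", 2, "bar"), ("k1", 3, "baz"),
  ("k2", 1, "spark")
))

val aggregated = data
  .map { case (key, order, value) => (key, (order, value)) }
  .groupByKey()
  .mapValues { vs =>
    val ordered = vs.toSeq.sortBy(_._1).map(_._2)   // restore per-key ordering
    ordered.foldLeft("") { (acc, s) =>
      if (acc.length >= 10) acc                     // condition met: stop appending
      else if (acc.isEmpty) s
      else acc + "," + s
    }
  }

aggregated.collect().foreach(println)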
0
votes
0 answers
Spark program twice as slow in Spark 2.2 as in Spark 1.6
We're migrating our Scala Spark programs from 1.6.3 to 2.2.0. The program in question has four parts: let's call them sections A, B, C and D. Section A parses the input (parquet files) and then caches the DF and creates a table. Then sections B, C…

Doug T
- 21
- 2
0
votes
1 answer
cast method results in null values in Java Spark
I have a simple use case of performing a join on two dataframes; I am using Spark version 1.6.3. The issue is that when I try to cast a string column to integer using the cast method, the resulting column is all null values.
I have already tried all…

humblecoder
- 137
- 1
- 7
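Casting a string column to int yields null for any value that is not a clean integer literal (leading or trailing spaces, thousands separators, decimals, and so on), so cleaning the string before the cast usually resolves this. A minimal Scala sketch with a hypothetical DataFrame df and column id:

import org.apache.spark.sql.functions._

// Trim whitespace and strip thousands separators before casting; rows that
// still are not valid integers will remain null after the cast.
val cleaned = df.withColumn("id_int",
  regexp_replace(trim(col("id")), ",", "").cast("int"))

cleaned.select("id", "id_int").show()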
0
votes
1 answer
How to split the input data into several files based on a date field in pyspark?
I have a hive table with a date field in it.
+----------+----+----+
|data_field|col1|col2|
+----------+----+----+
|10/01/2018| 125| abc|
|10/02/2018| 124| def|
|10/03/2018| 127| ghi|
|10/04/2018| 127| klm|
|10/05/2018| …

Bob
- 335
- 1
- 4
- 16
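One common way to get a separate output directory per date in Spark 1.6 is DataFrameWriter.partitionBy; the PySpark call is the same. A minimal Scala sketch, assuming a hypothetical Hive table name and output path:

val df = hiveContext.table("my_table")   // hypothetical table holding data_field, col1, col2

df.write
  .partitionBy("data_field")             // one directory per distinct date value
  .format("parquet")
  .save("/output/path")                  // hypothetical output location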
0
votes
1 answer
How to drop duplicates considering only subset of columns?
I use Spark 1.6 and am doing an inner join on two dataframes as follows:
val filtergroup = metric
.join(filtercndtns, Seq("aggrgn_filter_group_id"), inner)
.distinct()
But I keep getting duplicate values in the aggrgn_filter_group_id column. Can you…

Naveen Yadav
- 11
- 2
- 8
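distinct() deduplicates on the whole row, so any other column that differs keeps the "duplicates" alive; dropping duplicates on just the key column keeps one row per aggrgn_filter_group_id. A minimal sketch against the question's dataframes:

val filtergroup = metric
  .join(filtercndtns, Seq("aggrgn_filter_group_id"), "inner")
  .dropDuplicates(Seq("aggrgn_filter_group_id"))   // deduplicate on the key column only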
0
votes
0 answers
How to define partitions for a DataFrame in pyspark?
Suppose I read a parquet file as a DataFrame in pyspark; how can I specify how many partitions it should have?
I read the parquet file like this:
df = sqlContext.read.format('parquet').load('/path/to/file')
How may I specify the number of partitions…

Ani Menon
- 27,209
- 16
- 105
- 126
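The partition count right after a load is driven by the input splits; it can be changed afterwards with repartition (full shuffle, up or down) or coalesce (narrow, only down). A minimal Scala sketch of the idea (the PySpark calls are the same):

val df = sqlContext.read.format("parquet").load("/path/to/file")

val df100 = df.repartition(100)        // shuffle into exactly 100 partitions
println(df100.rdd.partitions.length)   // verify the partition count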
0
votes
1 answer
PySpark: handling exceptions and raising RuntimeError in a pyspark dataframe
I have a dataframe in which I'm trying to create a new column based on the values of an existing column:
dfg = dfg.withColumn("min_time",
F.when(dfg['list'].isin(["A","B"]),dfg['b_time'])
.when(dfg['list']=="C",dfg['b_time'] +2)
…

Mia21
- 119
- 2
- 10
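when()/otherwise() cannot raise an error by themselves, but the otherwise branch can call a UDF that throws when an unexpected value reaches it; in PySpark a Python UDF raising RuntimeError plays the same role. A minimal Scala sketch of that pattern, reusing dfg, list and b_time from the question:

import org.apache.spark.sql.functions._

val failOnUnknown = udf { (listValue: String) =>
  // Reached only when no when() branch matched.
  sys.error(s"unexpected list value: $listValue")
  0L                                    // unreachable; fixes the UDF's return type
}

val result = dfg.withColumn("min_time",
  when(col("list").isin("A", "B"), col("b_time"))
    .when(col("list") === "C", col("b_time") + 2)
    .otherwise(failOnUnknown(col("list"))))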
0
votes
2 answers
Calculate median and average using a Spark 1.6 dataframe on Hadoop: Failed to start database 'metastore_db'
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
1. Using SQLContext
import org.apache.spark.sql.SQLContext
val sqlctx = new SQLContext(sc)
import sqlctx._
val df =…
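The "Failed to start database 'metastore_db'" error usually means another process (for example a second spark-shell) is already holding the embedded Derby metastore lock, so only one such session can run at a time. For the median and average themselves, a minimal sketch assuming the shell's sqlContext is a HiveContext (so the Hive UDAF percentile_approx is available) and hypothetical file and column names:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/data.csv")            // hypothetical input file

df.registerTempTable("data")
sqlContext.sql(
  "SELECT avg(value) AS mean, percentile_approx(value, 0.5) AS median FROM data"
).show()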
0
votes
0 answers
Apache Toree 0.1.x - NoSuchMethodError: org.apache.spark.repl.SparkIMain.classServerUri()
I have created a Scala kernel for my Jupyter notebook using Spark 1.6 on CDH 5.12. I am using Apache Toree 0.1.x.
I have installed the python package toree 0.1.0 (https://pypi.python.org/pypi/toree/0.1.0).
And the kernel was installed with the…
0
votes
1 answer
Spark 1.6 Streaming consumer reading Kafka offsets stuck at createDirectStream
I am trying to read the Spark Streaming offsets into my consumer, but I cannot seem to do it correctly.
Here is my code.
val dfoffset = hiveContext.sql(s"select * from $db")
dfoffset.show()
val dfoffsetArray = dfoffset.collect()
println("printing…

javadev
- 277
- 3
- 19
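With the Spark 1.6 / Kafka 0.8 direct API, resuming from saved offsets means building a Map[TopicAndPartition, Long] and passing it to createDirectStream as fromOffsets together with a message handler. A minimal Scala sketch continuing from the question's dfoffsetArray, with hypothetical column names (topic, partition, offset), a hypothetical broker address, and an existing StreamingContext ssc:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val fromOffsets: Map[TopicAndPartition, Long] = dfoffsetArray.map { row =>
  TopicAndPartition(row.getAs[String]("topic"), row.getAs[Int]("partition")) ->
    row.getAs[Long]("offset")
}.toMap

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker

val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))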
0
votes
2 answers
Error accessing Spark Thrift Server
Spark version: 1.6.3
I am running Spark Thrift Server as a proxy, but it does not stay up as long as I expected. It always stops when it gets high load.
This is the error I get when I access it.

Mercury Trivival
- 11
- 3