Questions tagged [apache-spark-1.6]

Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark, use the tag [apache-spark].

111 questions
0
votes
1 answer

How to add double quotes to the string?

I have a JSON-like string like this: {cid: {ABCD[1]_TYPE, [text]: alphabets, time: 1/12/2010, author: xyz, best_chapter: 10.5} and I need to add double quotes around every string to make it look like real JSON: {"cid": {"ABCD[1]_TYPE", "[text]":…
xin
  • 135
  • 11
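For the JSON-quoting question above, a minimal Scala sketch of one possible approach, assuming a simplified, flat input; the two regexes only quote bare keys and simple scalar values and do not cover every edge case in the string shown in the excerpt:

object QuoteJsonLike {
  def main(args: Array[String]): Unit = {
    // Hypothetical, simplified input; the real string in the question is messier.
    val raw = "{cid: abc, time: 1/12/2010, author: xyz, best_chapter: 10.5}"

    val quoted = raw
      // wrap bare keys (text between '{' or ',' and ':') in double quotes
      .replaceAll("""([{,]\s*)([^":,{}]+?)\s*:""", "$1\"$2\":")
      // wrap bare scalar values (anything after ':' that is not already quoted or nested)
      .replaceAll(""":\s*([^"{\[][^,}]*)""", ": \"$1\"")

    println(quoted) // {"cid": "abc", "time": "1/12/2010", "author": "xyz", "best_chapter": "10.5"}
  }
}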
0
votes
0 answers

Saving Spark output to CSV in Spark 1.6

Spark 1.6, Scala. How do I save output to a CSV file in Spark 1.6? I did something like this: myCleanData.write.mode(SaveMode.Append).csv(path="file:///filepath") but it throws an error: cannot resolve symbol csv. I even tried it like this. For the dependency …
Sophie Dinka
  • 73
  • 1
  • 8
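Spark 1.6 has no built-in csv method on DataFrameWriter (it arrived in 2.0), which is why the symbol cannot be resolved. A minimal sketch using the external spark-csv package, assuming myCleanData is the DataFrame from the question; the package coordinates and path are illustrative:

// Launch with the package on the classpath, e.g.:
//   spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
import org.apache.spark.sql.SaveMode

myCleanData.write
  .format("com.databricks.spark.csv")   // external CSV data source for Spark 1.x
  .option("header", "true")
  .mode(SaveMode.Append)
  .save("file:///filepath")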
0
votes
1 answer

UDF in Spark 1.6: reassignment to val error

I am using Spark 1.6. The UDF below is used to clean address data: sqlContext.udf.register("cleanaddress", (AD1:String,AD2: String, AD3:String)=>Boolean = _.matches("^[a-zA-Z0-9]*$")) UDF name: cleanaddress. The three input parameters come from…
Sophie Dinka
  • 73
  • 1
  • 8
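The registration in the excerpt mixes a lambda with a function-literal shorthand and will not compile. A sketch of one way to register a three-argument UDF in Spark 1.6; the alphanumeric pattern is taken from the question, the null check is an added assumption:

// Returns true only when all three address parts are non-null and alphanumeric.
sqlContext.udf.register("cleanaddress",
  (ad1: String, ad2: String, ad3: String) =>
    Seq(ad1, ad2, ad3).forall(s => s != null && s.matches("^[a-zA-Z0-9]*$")))

// Callable from SQL afterwards, e.g.:
// sqlContext.sql("SELECT cleanaddress(AD1, AD2, AD3) FROM some_table")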
0
votes
1 answer

How to list all databases using HiveContext in PySpark 1.6

I am trying to list all the databases using HiveContext in Spark 1.6, but it's giving me just the default database. from pyspark import SparkContext from pyspark.sql import SQLContext sc = SparkContext.getOrCreate() from pyspark.sql import…
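A plain SQLContext only sees its in-memory catalog, so it reports just the default database; listing the metastore databases needs a HiveContext. A Scala sketch of the idea (the same SHOW DATABASES statement works from PySpark's HiveContext):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Queries the Hive metastore rather than the in-memory catalog,
// so every database is listed, not just "default".
hiveContext.sql("SHOW DATABASES").show()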
0
votes
1 answer

Iterating over a grouped dataset in Spark 1.6

In an ordered dataset, I want to aggregate data until a condition is met, but grouped by a certain key. To give my question some context, I have simplified my problem to the problem statement below: in Spark I need to aggregate strings, grouped by key…
Havnar
  • 2,558
  • 7
  • 33
  • 62
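The stop condition is not shown in the excerpt, so this is only a hypothetical RDD sketch: concatenating strings per key with a fold that stops appending once an arbitrary length limit is reached. Note that groupByKey does not preserve a global ordering of values.

// Hypothetical data and stop condition (length limit of 5 characters).
val rdd = sc.parallelize(Seq(("k1", "aa"), ("k1", "bb"), ("k1", "cc"), ("k2", "dd")))

val aggregated = rdd
  .groupByKey()
  .mapValues { values =>
    values.foldLeft("") { (acc, s) =>
      if (acc.length >= 5) acc   // condition met: stop appending further values
      else acc + s
    }
  }

aggregated.collect().foreach(println)   // e.g. (k1,aabbcc), (k2,dd)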
0
votes
0 answers

Spark program twice as slow in Spark 2.2 as in Spark 1.6

We're migrating our Scala Spark programs from 1.6.3 to 2.2.0. The program in question has four parts: let's call them sections A, B, C and D. Section A parses the input (parquet files) and then caches the DF and creates a table. Then sections B, C…
0
votes
1 answer

cast method results in null values in Java Spark

I have a simple use case of performing a join on two dataframes; I am using Spark 1.6.3. The issue is that when I try to cast the string type to integer type using the cast method, the resulting column is all null values. I have already tried all…
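cast returns null for any value it cannot parse, so stray whitespace or non-numeric characters in the string column are the usual culprit. A Scala sketch of the same idea (the question uses Java; the dataframes, join key and column names here are hypothetical):

import org.apache.spark.sql.functions._

val joined = df1.join(df2, Seq("id"))          // hypothetical join key
val casted = joined.withColumn("amount_int",
  trim(col("amount")).cast("int"))             // trim first; truly non-numeric strings still become null
casted.select("amount", "amount_int").show()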
0
votes
1 answer

How to split the input data into several files based on a date field in pyspark?

I have a hive table with a date field in it.
+----------+------+-----+
|data_field| col1| col2|
+----------+------+-----+
|10/01/2018| 125| abc|
|10/02/2018| 124| def|
|10/03/2018| 127| ghi|
|10/04/2018| 127| klm|
|10/05/2018| …
Bob
  • 335
  • 1
  • 4
  • 16
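partitionBy on the DataFrameWriter writes one directory per distinct value of a column, which is the usual way to split output by a date field in Spark 1.6. A Scala sketch; the table name and output path are assumptions:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.table("my_hive_table")   // hypothetical Hive table

// One output directory per distinct data_field value
// (special characters in the values are escaped in the directory names).
df.write
  .mode("overwrite")
  .partitionBy("data_field")
  .parquet("/output/path")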
0
votes
1 answer

How to drop duplicates considering only a subset of columns?

I use Spark 1.6 and am doing an inner join on two dataframes as follows: val filtergroup = metric .join(filtercndtns, Seq("aggrgn_filter_group_id"), inner) .distinct() But I keep getting duplicate values in the aggrgn_filter_group_id column. Can you…
Naveen Yadav
  • 11
  • 2
  • 8
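distinct() removes only rows that are identical across all columns; deduplicating on a subset of columns is what dropDuplicates is for. A sketch based on the code in the question (the join type argument is dropped because inner is the default in 1.6):

val filtergroup = metric
  .join(filtercndtns, Seq("aggrgn_filter_group_id"))     // inner join is the default
  // keep one row per aggrgn_filter_group_id instead of one per fully distinct row
  .dropDuplicates(Seq("aggrgn_filter_group_id"))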
0
votes
0 answers

How to define partitions for a DataFrame in PySpark?

Suppose I read a Parquet file as a DataFrame in PySpark; how can I specify how many partitions it should have? I read the Parquet file like this: df = sqlContext.read.format('parquet').load('/path/to/file') How can I specify the number of partitions…
Ani Menon
  • 27,209
  • 16
  • 105
  • 126
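The partition count of a freshly read Parquet DataFrame is driven by the file splits; to force a specific number you repartition afterwards. A Scala sketch (16 is an arbitrary example):

val df = sqlContext.read.format("parquet").load("/path/to/file")

// repartition shuffles to exactly n partitions; coalesce(n) merges existing
// partitions without a full shuffle when you only need to reduce the count.
val df16 = df.repartition(16)

println(df16.rdd.partitions.length)   // 16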
0
votes
1 answer

PySpark: handling exceptions and raising RuntimeError in a PySpark dataframe

I have a dataframe in which I'm trying to create a new column based on the values of an existing column: dfg = dfg.withColumn("min_time", F.when(dfg['list'].isin(["A","B"]),dfg['b_time']) .when(dfg['list']=="C",dfg['b_time'] +2) …
Mia21
  • 119
  • 2
  • 10
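A Scala sketch of the same when-chain; rows that match none of the branches silently end up as null in min_time, which is usually what needs detecting before deciding to raise an error. The column names come from the excerpt; the final filter is an added illustration:

import org.apache.spark.sql.functions._

// Chained conditions; unmatched rows get null in min_time.
val result = dfg.withColumn("min_time",
  when(col("list").isin("A", "B"), col("b_time"))
    .when(col("list") === "C", col("b_time") + 2))

result.filter(col("min_time").isNull).show()   // inspect rows that matched no branch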
0
votes
2 answers

Calculate median and average using a Hadoop Spark 1.6 dataframe; Failed to start database 'metastore_db'

spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
1. Using SQLContext:
import org.apache.spark.sql.SQLContext
val sqlctx = new SQLContext(sc)
import sqlctx._
val df =…
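The "Failed to start database 'metastore_db'" error usually means another session is holding the embedded Derby metastore lock. Beyond that, avg is a built-in aggregate and an approximate median is available through Hive's percentile_approx when a HiveContext is used. A sketch; the file path and column name are assumptions:

// spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
// (only one session at a time can open the embedded Derby metastore_db)
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val df = hc.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/data.csv")

df.registerTempTable("t")
// avg is built in; percentile_approx is a Hive UDAF, hence the HiveContext
hc.sql("SELECT avg(col1) AS mean, percentile_approx(col1, 0.5) AS median FROM t").show()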
0
votes
0 answers

Apache Toree 0.1.x - NoSuchMethodError: org.apache.spark.repl.SparkIMain.classServerUri()

I have created a Scala kernel for my Jupyter notebook using Spark 1.6 on CDH 5.12. I am using Apache Toree 0.1.x. I have installed the Python package toree 0.1.0 (https://pypi.python.org/pypi/toree/0.1.0), and the kernel was installed with the…
0
votes
1 answer

Spark 1.6 Streaming consumer reading in Kafka offsets stuck at createDirectStream

I am trying to read the Spark Streaming offsets into my consumer but I cannot seem to do it correctly. Here is my code: val dfoffset = hiveContext.sql(s"select * from $db") dfoffset.show() val dfoffsetArray = dfoffset.collect() println("printing…
javadev
  • 277
  • 3
  • 19
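To resume from stored offsets, the direct stream needs a Map[TopicAndPartition, Long] plus a message handler. A sketch of building that map from the collected rows, assuming a hypothetical (topic, partition, offset) column layout in the Hive table and that ssc is the existing StreamingContext:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Assumed row layout: topic (string), partition (int), offset (long).
val fromOffsets: Map[TopicAndPartition, Long] = dfoffsetArray.map { row =>
  TopicAndPartition(row.getString(0), row.getInt(1)) -> row.getLong(2)
}.toMap

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker list

val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))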
0
votes
2 answers

Error accessing Spark Thrift Server

Spark version: 1.6.3. I am running the Spark Thrift Server as a proxy, but it does not stay up as long as I expected; it always stops under high load. This is the error I get when I access it.