Questions tagged [apache-spark-1.6]

Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark, use the tag [apache-spark].

111 questions
1
vote
1 answer

How to extract the ElementType of an Array as an instance of StructType

I am trying to decompose the structure of a complex DataFrame in Spark. I am only interested in the nested arrays under the root. The issue is that I can't retrieve the ElementType from the type of a StructField. Here is an example; this schema of a…
Ismail Addou
  • 383
  • 1
  • 2
  • 17
1
vote
1 answer

How to unregister Spark UDF

I use Spark 1.6.0 with Java. I'd like to unregister a Spark UDF. Is there a way, like dropping a temporary table with sqlContext.dropTempTable(tableName)? sqlContext.udf().register("isNumeric", value -> { …
JasonG
  • 13
  • 1
  • 3
1
vote
1 answer

How to find the schema of values in DStream at runtime?

I use Spark 1.6 and Kafka 0.8.2.1. I am trying to fetch some data from Kafka using Spark Streaming and perform some operations on it. For that I need to know the schema of the fetched data. Is there some way to do this, or can we get values from…
JSR29
  • 354
  • 1
  • 5
  • 17
1
vote
1 answer

Why does reading from Hive fail with "java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found"?

I use Spark v1.6.1 and Hive v1.2.x with Python v2.7. For Hive, I have some tables (ORC files) stored in HDFS and some stored in S3. When I try to join two tables, where one is in HDFS and the other in S3, a java.lang.RuntimeException:…
Jane Wayne
  • 8,205
  • 17
  • 75
  • 120
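The usual diagnosis for this error: `org.apache.hadoop.fs.s3a.S3AFileSystem` lives in the `hadoop-aws` artifact, which is not on Spark's classpath by default. A hedged config sketch follows; the jar paths and version numbers are assumptions and must match your Hadoop build.

```shell
# hadoop-aws (which contains org.apache.hadoop.fs.s3a.S3AFileSystem) and a
# matching aws-java-sdk jar are not shipped on Spark's classpath by default.
# Versions below are illustrative; align them with your Hadoop distribution.
spark-submit \
  --jars /path/to/hadoop-aws-2.7.3.jar,/path/to/aws-java-sdk-1.7.4.jar \
  my_app.py

# or equivalently in conf/spark-defaults.conf (paths are placeholders):
# spark.driver.extraClassPath    /path/to/hadoop-aws-2.7.3.jar:/path/to/aws-java-sdk-1.7.4.jar
# spark.executor.extraClassPath  /path/to/hadoop-aws-2.7.3.jar:/path/to/aws-java-sdk-1.7.4.jar
```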
1
vote
2 answers

Why does reading from CSV fail with NumberFormatException?

I use Spark 1.6.0 and Scala 2.10.5. $ spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 import org.apache.spark.sql.SQLContext import sqlContext.implicits._ import org.apache.spark.sql.types.{StructType, StructField, StringType,…
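The excerpt is truncated, but a common cause of NumberFormatException with spark-csv is applying a numeric schema while the file's header row is still read as data, so the cast hits a literal column name. A plain-Python sketch of that failure mode (the file contents are invented):

```python
# Plain-Python sketch of the usual failure mode: a numeric schema applied to
# a CSV whose first line is a header. Spark's numeric cast then hits the
# literal column name and throws NumberFormatException (ValueError here).
csv_lines = ["id,amount", "1,10", "2,20"]   # hypothetical file contents

def parse(lines, skip_header):
    rows = lines[1:] if skip_header else lines
    return [tuple(int(v) for v in line.split(",")) for line in rows]

try:
    parse(csv_lines, skip_header=False)      # like reading without the header option
except ValueError as e:
    print("fails like NumberFormatException:", e)

print(parse(csv_lines, skip_header=True))    # [(1, 10), (2, 20)]
```

In spark-csv the equivalent switch is `.option("header", "true")` on the reader.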
1
vote
3 answers

Calculate maximum number of observations per group

I use Spark 1.6.2. I need to find the maximum count per group. val myData = Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "12"),("aa3", "GROUP_B", "14"),("aa3","GROUP_B", "11"),("aa3","GROUP_B","12" ),("aa2", "GROUP_B",…
Dinosaurius
  • 8,306
  • 19
  • 64
  • 113
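The aggregation asked for here is a two-step one: count rows per (id, group) pair, then take the maximum count within each group. A plain-Python sketch on the complete tuples from the question (the truncated trailing rows are left out):

```python
# Count observations per (id, group), then take the max count per group,
# sketched in plain Python on the question's complete tuples.
from collections import Counter

my_data = [("aa1", "GROUP_A", "10"), ("aa1", "GROUP_A", "12"),
           ("aa2", "GROUP_A", "12"), ("aa3", "GROUP_B", "14"),
           ("aa3", "GROUP_B", "11"), ("aa3", "GROUP_B", "12")]

# step 1: count observations per (id, group) pair
pair_counts = Counter((mid, grp) for mid, grp, _ in my_data)

# step 2: keep the largest count seen in each group
max_per_group = {}
for (mid, grp), n in pair_counts.items():
    max_per_group[grp] = max(max_per_group.get(grp, 0), n)

print(max_per_group)   # {'GROUP_A': 2, 'GROUP_B': 3}
```

In Spark 1.6 the same shape would be two groupBys, roughly `df.groupBy("id", "group").count().groupBy("group").agg(max("count"))`.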
1
vote
2 answers

Pivot Spark Scala DataFrame

I am trying to use the pivot method in Scala Spark: val dfOutput = df_input.groupBy("memberlogin").pivot("country_group2").count() However, although there is no compilation error when creating a jar in Eclipse, during execution in Spark it gives…
1
vote
1 answer

DataFrame: too many arguments in the RDD object

I tried to use this question to convert an RDD object to a DataFrame in Spark. The case class in my use case contains more than 100 arguments (columns): case class MyClass(val1: String, ..., val104: String ) val df = rdd.map({ case Row(val1:…
Zied Hermi
  • 229
  • 1
  • 2
  • 11
1
vote
1 answer

Why does executing SQL against Hive table using SQLContext in application fail (but the same query in spark-shell works fine)?

I am using Spark 1.6. I am trying to connect to a table in my Spark SQL Java code with: JavaSparkContext js = new JavaSparkContext(); SQLContext sc = new SQLContext(js); DataFrame mainFile = sc.sql("Select * from db.table"); It gives me a table…
Aviral Kumar
  • 814
  • 1
  • 15
  • 40
1
vote
0 answers

Spark datasets: Exception when using groupBy MissingRequirementError

I am starting to work with Spark Datasets and am facing this exception when I execute a groupBy in Spark 1.6.1: case class RecordIdDate(recordId: String, date: String) val ds = sc.parallelize(List(RecordIdDate("hello","1"),…
1
vote
0 answers

Apache Spark self join big data set on multiple columns

I'm running Apache Spark on a Hadoop cluster using YARN. I have a big data set, something like 160 million records, and I have to perform a self join. The join is done on an exact match of one column (c1), a date overlap match, and a match on at least 1 of 2…
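The join condition described can be captured precisely: two date ranges [s1, e1] and [s2, e2] overlap when s1 ≤ e2 and s2 ≤ e1. A plain-Python sketch of the condition on invented sample records (the real join-key and column names are not in the excerpt):

```python
# Plain-Python sketch of a self join on exact c1 match plus date-range
# overlap. Ranges [s1, e1] and [s2, e2] overlap iff s1 <= e2 and s2 <= e1.
# The sample records are invented for illustration.
from datetime import date

records = [
    ("k1", date(2016, 1, 1), date(2016, 1, 10)),
    ("k1", date(2016, 1, 5), date(2016, 1, 20)),
    ("k1", date(2016, 2, 1), date(2016, 2, 5)),
    ("k2", date(2016, 1, 1), date(2016, 1, 2)),
]

def self_join(rows):
    out = []
    for i, (c1_a, s_a, e_a) in enumerate(rows):
        for c1_b, s_b, e_b in rows[i + 1:]:
            if c1_a == c1_b and s_a <= e_b and s_b <= e_a:
                out.append(((c1_a, s_a, e_a), (c1_b, s_b, e_b)))
    return out

print(len(self_join(records)))   # 1 -- only the first two k1 rows overlap
```

At 160 million records the pairwise loop above obviously does not scale; in Spark the usual shape is an equi-join on c1 first (so it shuffles by key) with the overlap and remaining predicates applied as a filter on the joined rows.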
1
vote
1 answer

KMeans with Spark 1.6.2 vs Spark 2.0.0

I am using KMeans() in an environment I have no control over and will abandon in less than a month. Spark 1.6.2 is installed. Should I pay the price of urging 'them' to upgrade to Spark 2.0.0 before I leave? In other words, does Spark 2.0.0 introduce any…
1
vote
0 answers

pyspark installation error, "ImportError: No module named pyspark"

I am trying to install Apache Spark 1.6.1 in standalone mode. I followed the guide at "https://github.com/KristianHolsheimer/pyspark-setup-guide". After executing $ sbt/sbt assembly, I tried $ ./bin/run-example SparkPi 10, but…
Sounak
  • 13
  • 3
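This ImportError usually means the Python-side packages under `$SPARK_HOME/python` are not on `sys.path`. A sketch of the usual fix; the install path and py4j zip version are assumptions that must match your Spark download (Spark 1.6 shipped py4j 0.9):

```python
# "ImportError: No module named pyspark" typically means Spark's bundled
# Python packages are missing from sys.path. Paths below are placeholders
# for your install; the py4j zip version varies by Spark release.
import os
import sys

spark_home = os.environ.get("SPARK_HOME", "/opt/spark-1.6.1")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(
    spark_home, "python", "lib", "py4j-0.9-src.zip"))  # version varies

print(sys.path[0].endswith("py4j-0.9-src.zip"))        # True once on the path
```

The same two entries can instead be exported as `PYTHONPATH` in your shell profile, which is what most setup guides do.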
1
vote
1 answer

How to unit test Spark Streaming code?

I use the latest Spark 1.6.0. I looked at another Stack Overflow post, "How can I make Spark Streaming count the words in a file in a unit test?", and am trying to use the sample at https://gist.github.com/emres/67b4eae86fa92df69f61 for writing a sample…
CodeDreamer
  • 444
  • 2
  • 8
0
votes
0 answers

How to combine UDFs when creating a new column in Pyspark 1.6

I am trying to aggregate a table around one key value (id here) so that I can have one row per id and perform some verifications on the rows that belong to each id, in order to identify the 'result' (type of transaction of sorts). Let's…
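One common way to combine UDF logic in PySpark is to compose the plain Python functions first and register the composition once, rather than chaining registered UDFs column-by-column. The two functions below are invented examples; only the composition pattern is the point.

```python
# Combining the logic of two would-be UDFs: compose the plain Python
# functions, then wrap the composition once. Both functions here are
# hypothetical stand-ins for the asker's verification logic.
def classify(amount):            # hypothetical first UDF body
    return "big" if amount >= 100 else "small"

def label(kind):                 # hypothetical second UDF body
    return "txn:" + kind

def combined(amount):
    return label(classify(amount))

print(combined(250))   # txn:big
print(combined(3))     # txn:small

# In PySpark 1.6 the composition is wrapped once (needs a SQLContext):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   result_udf = udf(combined, StringType())
#   df.withColumn("result", result_udf(df["amount"]))
```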