I can't seem to write a JavaRDD<T> to Parquet, where T is, say, a Person class. I've defined it as:
public class Person implements Serializable
{
    private static final long serialVersionUID = 1L;
    private String name;
    private String age;
    private Address address;
    // getters and setters omitted
}
with Address:

public class Address implements Serializable
{
    private static final long serialVersionUID = 1L;
    private String City;
    private String Block;
    // getters and setters omitted
}
I then create a JavaRDD<Person> like so:
JavaRDD<Person> people = sc.textFile("/user/johndoe/spark/data/people.txt")
    .map(new Function<String, Person>()
    {
        public Person call(String line)
        {
            // input lines are comma-separated; only the first field (the name) is used here
            String[] parts = line.split(",");
            Person person = new Person();
            person.setName(parts[0]);
            person.setAge("2");                               // age hard-coded for now
            Address address = new Address("HomeAdd", "141H"); // same address for every record
            person.setAddress(address);
            return person;
        }
    });
Note: I am setting Address manually and it is the same for every record, so this is basically an RDD with a nested structure. On trying to save this as a Parquet file:
DataFrame dfschemaPeople = sqlContext.createDataFrame(people, Person.class);
dfschemaPeople.write().parquet("/user/johndoe/spark/data/out/people.parquet");
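(For completeness, sc and sqlContext come from the standard setup, sketched below; the app name is just a placeholder.)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf().setAppName("NestedBeanToParquet"); // placeholder app name
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);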
The full Address class is:
import java.io.Serializable;
public class Address implements Serializable
{
    public Address(String city, String block)
    {
        super();
        City = city;
        Block = block;
    }

    private static final long serialVersionUID = 1L;
    private String City;
    private String Block;
    // getters and setters omitted
}
I encounter the error:
Caused by: java.lang.ClassCastException: com.test.schema.Address cannot be cast to org.apache.spark.sql.Row
I am running spark-1.4.1.
- Is this a known bug?
- If I do the same by importing a nested JSON file of the same format, I am able to save to Parquet (see the sketch after this list).
- Even if I create a sub-DataFrame like:
  DataFrame dfSubset = sqlContext.sql("SELECT address.city FROM PersonTable");
  I still get the same error.
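To make the last two points concrete, here is roughly what I tried (a sketch only; the people.json file, the output paths, and the PersonTable registration are from my own setup and not shown above):

// Reading the same nested structure from JSON works and writes to Parquet without error:
DataFrame fromJson = sqlContext.read().json("/user/johndoe/spark/data/people.json");
fromJson.write().parquet("/user/johndoe/spark/data/out/people_json.parquet");

// For the SQL query, PersonTable is the bean-derived DataFrame registered as a temp table:
dfschemaPeople.registerTempTable("PersonTable");
DataFrame dfSubset = sqlContext.sql("SELECT address.city FROM PersonTable");
dfSubset.write().parquet("/user/johndoe/spark/data/out/city.parquet"); // same ClassCastException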
So what gives? How can I read a complex data structure from a text file and save it as Parquet? It seems I cannot do so.