
I have written code to access a Hive table using Spark SQL. Here is the code:

SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .master("local[*]")
        .config("hive.metastore.uris", "thrift://localhost:9083")
        .enableHiveSupport()
        .getOrCreate();
Dataset<Row> df = spark.sql("select survey_response_value from health");
df.show();

I would like to know how I can convert the complete output to a String or a String array. I am trying to pass the result to another module that accepts only String or String array values.
I have tried other approaches such as .toString() and typecasting to String, but they did not work for me.
Kindly let me know how I can convert the Dataset values to String.

Jaffer Wilson

4 Answers


Here is sample code in Java:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .master("local[*]")
                .getOrCreate();
        // create a single-column DataFrame
        List<String> myList = Arrays.asList("one", "two", "three", "four", "five");
        Dataset<Row> df = spark.createDataset(myList, Encoders.STRING()).toDF();
        df.show();
        // using df.as: works when the DataFrame has a single string column
        List<String> listOne = df.as(Encoders.STRING()).collectAsList();
        System.out.println(listOne);
        // using df.map: joins all columns of each row into one string
        // (the MapFunction cast disambiguates the overloaded map method in Java)
        List<String> listTwo = df.map(
                (MapFunction<Row, String>) row -> row.mkString(),
                Encoders.STRING()).collectAsList();
        System.out.println(listTwo);
    }
}

"row" is java 8 lambda parameter. Please check developer.com/java/start-using-java-lambda-expressions.html

abaghel
  • Please can you explain what this `row` is in the program? Your code looks pretty optimized to me. – Jaffer Wilson Feb 22 '17 at 14:13
  • "row" is a Java 8 lambda parameter. Please check http://www.developer.com/java/start-using-java-lambda-expressions.html – abaghel Feb 22 '17 at 14:19
  • It's perfect. Thanks. – Jaffer Wilson Feb 22 '17 at 14:31
  • I am getting this error when I use df.as: Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to map struct<***my column names here***> to Tuple1, but failed as the number of fields does not line up.; – user812142 May 15 '19 at 11:23
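
Regarding the AnalysisException in the last comment: df.as(Encoders.STRING()) only works when the DataFrame has exactly one column. For a multi-column DataFrame, a sketch of the usual workaround is to map each Row to a string instead (the ", " separator is chosen just for illustration):

// map each Row to one string; mkString concatenates all of its columns
List<String> rows = df.map(
        (MapFunction<Row, String>) row -> row.mkString(", "),
        Encoders.STRING()).collectAsList();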

You can use the map function to convert every row into a string, e.g.:

df.map(row => row.mkString())

Instead of just mkString you can of course do more sophisticated work.

The collect method can then retrieve the whole thing into an array:

val strings = df.map(row => row.mkString()).collect

(This is Scala syntax; Java is quite similar. A Java sketch follows below.)
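
For reference, a minimal Java sketch of the same approach (assuming df is the Dataset<Row> from the question, and Spark 2.x, where the lambda needs a MapFunction cast to pick the right overload):

// map each Row to a string, collect to a list, then copy into an array
List<String> strings = df.map(
        (MapFunction<Row, String>) row -> row.mkString(),
        Encoders.STRING()).collectAsList();
String[] array = strings.toArray(new String[0]);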

hage

If you are planning to read the dataset line by line, you can use an iterator over the dataset:

Dataset<Row> csv = session.read().format("csv")
        .option("sep", ",")
        .option("inferSchema", true)
        .option("escape", "\"")
        .option("header", true)
        .option("multiline", true)
        .load("users/abc/....");

for (Iterator<Row> iter = csv.toLocalIterator(); iter.hasNext(); ) {
    String item = iter.next().toString();
    System.out.println(item);
}

toLocalIterator() streams rows to the driver one partition at a time, so the driver only needs enough memory for the largest partition rather than the whole dataset.
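
If you still need a String array at the end, a small sketch on top of the same iterator (the "," separator is just for illustration):

// stream rows through the local iterator and collect them as strings
List<String> lines = new ArrayList<>();
csv.toLocalIterator().forEachRemaining(row -> lines.add(row.mkString(",")));
String[] array = lines.toArray(new String[0]);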
Areeha

To get everything as a single string, from a SparkSession you can do:

sparkSession.read.textFile(filePath).collect.mkString

assuming your Dataset is of type String: Dataset[String]. (This is Scala syntax.)
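
A rough Java equivalent (a sketch; String.join with an empty separator plays the role of mkString):

// read the file as Dataset<String>, collect, and join into one string
Dataset<String> ds = sparkSession.read().textFile(filePath);
String all = String.join("", ds.collectAsList());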

ForkPork