
I am trying to use the `map` function on a DataFrame in Spark using Java. I am following the documentation, which says:

`map(scala.Function1 f, scala.reflect.ClassTag evidence$4)`: Returns a new RDD by applying a function to all rows of this DataFrame.

To use `Function1` in `map`, I would need to implement all of its methods. I have seen some related questions, but the solutions they provide convert the DataFrame into an RDD. How can I use the `map` function on a DataFrame without converting it into an RDD? Also, what is the second parameter of `map`, i.e. `scala.reflect.ClassTag<R> evidence$4`?

I am using Java 7 and Spark 1.6.
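
As far as I can tell, calling this overload directly from Java 7 would require something like the sketch below; I am assuming `df` is the DataFrame, `SerializableFunction1` is a helper class I would have to write myself, and the result is an `RDD<String>`, not a DataFrame:

import java.io.Serializable;

import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Row;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;
import scala.runtime.AbstractFunction1;

// AbstractFunction1 fills in Function1's helper methods (andThen, compose),
// so only apply() has to be written; Serializable is needed because Spark
// ships the function to the executors.
abstract class SerializableFunction1<T, R>
    extends AbstractFunction1<T, R> implements Serializable {
}

// The ClassTag argument (evidence$4) tells Scala the element type of the
// resulting RDD; ClassTag$.MODULE$ is how Java reaches the companion object.
ClassTag<String> stringTag = ClassTag$.MODULE$.<String>apply(String.class);

RDD<String> mapped = df.map(
    new SerializableFunction1<Row, String>() {
      @Override
      public String apply(Row row) {
        return row.getString(0);
      }
    },
    stringTag);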

– talin
  • The map function will return you an RDD, as the documentation says... In any case, what's preventing you from getting Spark 2 or at least Java 8? – OneCricketeer Oct 25 '17 at 14:09
  • Yes, the `map` function returns an RDD anyway. But why does DataFrame provide a `map` function if we can't use it directly? Actually, I am in a learning phase, and we don't know whether the client uses Spark 1.7 or Spark 2, so we have to work with both. – talin Oct 26 '17 at 05:14

4 Answers


I know your question is about Java 7 and Spark 1.6, but in Spark 2 (and obviously Java 8), you can have the map function as part of a class, so you do not need to deal with Java lambdas.

The call would look like:

Dataset<String> dfMap = df.map(
    new CountyFipsExtractorUsingMap(),
    Encoders.STRING());
dfMap.show(5);

The class would look like:

// imports needed for the class below
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Row;

  /**
   * Returns a substring of the values in the id2 column.
   * 
   * @author jgp
   */
  // static, so the class does not capture the enclosing instance,
  // which Spark would otherwise try to serialize as well
  private static final class CountyFipsExtractorUsingMap
      implements MapFunction<Row, String> {
    private static final long serialVersionUID = 26547L;

    @Override
    public String call(Row r) throws Exception {
      // Drop the first two characters of the id2 value
      return r.getAs("id2").toString().substring(2);
    }
  }

You can find more details in this example on GitHub.

– jgp

I think `map` is not the right way to operate on a DataFrame. Maybe you should have a look at the examples in the API documentation; they show how to operate on DataFrames.
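
For example, a per-row map can often be rewritten with the built-in column functions instead. A rough sketch against the Spark 1.6 DataFrame API, where `df` and the column names `id2` and `value` are made up for illustration:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.DataFrame;

// Select, filter and aggregate with column expressions instead of
// mapping over rows by hand.
DataFrame result = df
    .select(col("id2"), col("value"))
    .filter(col("value").gt(0))
    .groupBy(col("id2"))
    .agg(avg(col("value")).alias("avg_value"));
result.show();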

– hage
  • I have used `groupBy`, `agg`, and `orderBy` operations on DataFrames. So which operations do we need to avoid on a DataFrame? I didn't see any documentation about that. Anyway, thanks for your reply. – talin Oct 26 '17 at 05:21

You can use the Dataset directly; there is no need to convert the data you read into an RDD, which is an unnecessary use of resources.

`dataset.map(mapFunction, encoder);` should suffice for your needs.
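
Spelled out for Java 7 (no lambdas), and assuming the Spark 2 Dataset API with a `dataset` of rows and an `id2` column borrowed from the first answer for illustration, that would be something like:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// map() on a Dataset takes a MapFunction plus an Encoder for the result
// type, so there is no detour through an RDD.
Dataset<String> mapped = dataset.map(
    new MapFunction<Row, String>() {
      @Override
      public String call(Row row) throws Exception {
        return row.getAs("id2").toString();
      }
    },
    Encoders.STRING());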


Because you don't give any specific problem, there are some common alternatives to `map` on a DataFrame, like `select`, `selectExpr`, and `withColumn`. If the Spark SQL built-in functions can't handle your task, you can use a UDF.
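
As a sketch of the UDF route on Spark 1.6, where `sqlContext`, `df`, the UDF name `extractSuffix` and the column names are invented for the example:

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a one-argument UDF with the SQLContext, then apply it to a
// column instead of mapping over whole rows.
sqlContext.udf().register("extractSuffix", new UDF1<String, String>() {
  @Override
  public String call(String value) throws Exception {
    return value.substring(2);
  }
}, DataTypes.StringType);

DataFrame withSuffix = df.withColumn(
    "suffix", callUDF("extractSuffix", col("id2")));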

– a.l.