10

I am using Apache Spark with Scala.

I have a CSV file that does not have column names in the first row. It looks like this:

28,Martok,49,476
29,Nog,48,364
30,Keiko,50,175
31,Miles,39,161

The columns represent ID, name, age, numOfFriends.

In my Scala object, I am creating a DataFrame from the CSV file using SparkSession, as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val df = spark.read.option("inferSchema","true").csv("../myfile.csv")
df.printSchema()

When I run the program, the result is:

|-- _c0: integer (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)

How can I add names to the columns in my dataset?

Placid
  • Maybe this helps: https://stackoverflow.com/questions/40653813/how-to-specify-schema-for-csv-file-without-using-scala-case-class – mrks Nov 05 '17 at 11:09

2 Answers

26

You can use toDF to specify column names when reading the CSV file:

val df = spark.read.option("inferSchema","true").csv("../myfile.csv").toDF(
  "ID", "name", "age", "numOfFriends"
)

Or, if you already have the DataFrame created, you can rename its columns as follows:

val newColNames = Seq("ID", "name", "age", "numOfFriends")
val df2 = df.toDF(newColNames: _*)
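
A third option, not part of the answer above, is to pass an explicit schema to the reader, which names and types the columns in one step and makes inferSchema unnecessary. The following is a minimal sketch, assuming the column types shown in the question's printSchema output. The sample rows are inlined as a Dataset[String] only so the snippet runs without the file (this csv overload needs Spark 2.2 or later); with the real file, `.csv("../myfile.csv")` would take its place.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Column names and types taken from the question; adjust to the real data.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true),
  StructField("numOfFriends", IntegerType, nullable = true)
))

// Assumption: sample rows from the question, inlined in place of ../myfile.csv.
val lines = Seq(
  "28,Martok,49,476",
  "29,Nog,48,364",
  "30,Keiko,50,175",
  "31,Miles,39,161"
).toDS()

// An explicit schema replaces both inferSchema and the later toDF rename.
val df = spark.read.schema(schema).csv(lines)
df.printSchema()
```

Supplying the schema also spares Spark the extra pass over the data that inferSchema requires, which can matter for large files.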
Leo C
  • can you provide a similar solution for java. I have a Dataset without headers and I want to select some columns not all from it. – user812142 May 23 '19 at 15:03
  • @user812142, I haven't done much with Java on Spark. Perhaps this [SO answer](https://stackoverflow.com/a/53622909/6316508) might give some hints; if not, I would suggest posting a separate question with specific requirement and sample data. – Leo C May 24 '19 at 01:25
  • Can you please provide a solution for the problem in PySpark? – pnv Jul 17 '19 at 06:54
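
For the case raised in the comments, where only some of the unnamed columns are needed, one approach is to select the default `_cN` columns and alias them in the same call. This is a hedged sketch in Scala rather than Java or PySpark; the choice of columns and the inlined sample rows are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Assumption: two sample rows from the question, inlined so the sketch runs without a file.
val lines = Seq("28,Martok,49,476", "29,Nog,48,364").toDS()
val raw = spark.read.option("inferSchema", "true").csv(lines)

// Keep only the second and fourth columns, naming them while selecting.
val subset = raw.select(col("_c1").as("name"), col("_c3").as("numOfFriends"))
subset.printSchema()
```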
1
The toDF method can also be used from Java with Spark; pass the new column names as arguments.

Example:

Dataset<Row> rowsWithTitle = sparkSession.read()
    .option("header", "true")    // use "false" if the file has no header row
    .option("delimiter", "\t")
    .csv("file")
    .toDF("h1", "h2");