Add column names to data read from csv file without column names

Question

I am using Apache Spark with Scala.

I have a CSV file with no header row, so the first row does not contain column names. It looks like this:

28,Martok,49,476
29,Nog,48,364
30,Keiko,50,175
31,Miles,39,161

The columns represent ID, name, age, numOfFriends.

In my Scala object, I am creating a Dataset from the CSV file using SparkSession as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val df = spark.read.option("inferSchema","true").csv("../myfile.csv")
df.printSchema()

When I run the program, the result is:

|-- _c0: integer (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)

How can I add names to the columns in my dataset?

Maybe this helps: https://stackoverflow.com/questions/40653813/how-to-specify-schema-for-csv-file-without-using-scala-case-class — mrks, Nov 05 '17 at 11:09

Leo C · Accepted Answer · score 26 · 2017-11-05T12:18:37.720

You can use toDF to specify column names when reading the CSV file:

val df = spark.read.option("inferSchema","true").csv("../myfile.csv").toDF(
  "ID", "name", "age", "numOfFriends"
)

Or, if you already have the DataFrame created, you can rename its columns as follows:

val newColNames = Seq("ID", "name", "age", "numOfFriends")
val df2 = df.toDF(newColNames: _*)
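As a further option beyond the two toDF variants above, the names and types can be declared up front with an explicit schema, which also avoids the extra pass over the data that inferSchema needs. The following is only a minimal sketch: the temporary file and the names path, schema and dfWithSchema are introduced here for illustration, and the column types are assumed from the inferred schema shown in the question. It is written script-style, so it is meant to be pasted into spark-shell or placed inside a main method.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").getOrCreate()

// Write the sample rows from the question to a temporary file (hypothetical path)
// so that the sketch does not depend on ../myfile.csv being present.
val path = Files.createTempFile("friends", ".csv")
Files.write(path, "28,Martok,49,476\n29,Nog,48,364\n30,Keiko,50,175\n31,Miles,39,161\n".getBytes("UTF-8"))

// Column names and types are stated explicitly, so inferSchema is not used.
// The types are an assumption based on the schema Spark inferred in the question.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true),
  StructField("numOfFriends", IntegerType, nullable = true)
))

val dfWithSchema = spark.read.schema(schema).csv(path.toString)
dfWithSchema.printSchema()
```

Compared with toDF, this keeps the names and the types in one place, at the cost of having to know the types in advance.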

edited Nov 05 '17 at 12:18

answered Nov 05 '17 at 11:24


Can you provide a similar solution for Java? I have a Dataset without headers and I want to select only some of its columns, not all of them. – user812142 May 23 '19 at 15:03
@user812142, I haven't done much with Java on Spark. Perhaps this [SO answer](https://stackoverflow.com/a/53622909/6316508) might give some hints; if not, I would suggest posting a separate question with your specific requirements and sample data. – Leo C May 24 '19 at 01:25
Can you please provide a solution for the problem in PySpark? – pnv Jul 17 '19 at 06:54

score 1 · Answer 2 · answered Mar 15 '21 at 06:08


In Spark with Java, the toDF method can also be used; you pass the column names to it as arguments.

Example:

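// "header" is set to "false" because the file has no header row; with "true", the first data row would be consumed as column names.
Dataset<Row> rowsWithTitle = sparkSession.read().option("header", "false").option("delimiter", "\t").csv("file").toDF("h1", "h2");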


padmaja ramesh

