
Our setting is PySpark. Suppose I create a DataFrame df using the spark.read.csv function, i.e.

df = spark.read.csv("directory/name_file.csv")

Now I need a way to extract "name_file" without, of course, copying and pasting it by hand. In other words, I want a Spark list or DataFrame that contains only the string "name_file".

Please provide only a solution that involves PySpark SQL or Python code compatible with PySpark.

The problem seems straightforward, but I spent a lot of time looking for a solution without finding anything.

jarlh
WorkBench
  • Hello, what have you tried? You can get the `name_file.csv` part by using [basename](https://docs.python.org/3/library/os.path.html#os.path.basename) and then remove the extension using [splitext](https://docs.python.org/3/library/os.path.html#os.path.splitext) –  Jun 27 '19 at 11:31
  • Maybe this post will be helpful: https://stackoverflow.com/questions/39868263/spark-load-data-and-add-filename-as-dataframe-column – Tomasz Jun 27 '19 at 11:32
  • Why not create a variable `filename` with your `name_file` and then use it in `"directory/{}.csv".format(filename)` and anywhere else you need it? – furas Jun 27 '19 at 11:37
  • Thanks Tomasz and Reportgunner, using the linked answer I was able to extract the whole path. The problem now is that I would like to keep only the filename, but PySpark does not allow me to combine the basename function with that code. Do you have any ideas about that? (See the sketch after these comments.) – WorkBench Jun 27 '19 at 11:58
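The comments above sketch two ingredients: input_file_name from the linked post and os.path from the standard library. A minimal way to combine them, as hinted at, is to wrap os.path in a UDF, since plain Python functions cannot be applied to a Column directly, which is exactly what the last comment runs into. This is a sketch, not from the original post; the function name get_stem and the column name name_file are illustrative assumptions.

import os
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# os.path works on plain strings, not Columns; a UDF bridges the gap
get_stem = F.udf(lambda p: os.path.splitext(os.path.basename(p))[0], StringType())

# assumed column name; yields "name_file" without the directory or extension
df = df.withColumn("name_file", get_stem(F.input_file_name()))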

2 Answers

5

There is a function for that: `input_file_name`. Then, you split:

from pyspark.sql import functions as F

# input_file_name() returns the full path of the source file for each row
df = df.withColumn("path", F.input_file_name())
# split the path on "/" into an array of components
df = df.withColumn("path_splitted", F.split("path", "/"))
# the file name is the last element of that array
df = df.withColumn("name", F.col("path_splitted").getItem(F.size("path_splitted") - 1))

df.show()
+---+------------+----------------+-------+
| id|        path|   path_splitted|   name|
+---+------------+----------------+-------+
|  1|/foo/bar.csv|[, foo, bar.csv]|bar.csv|
+---+------------+----------------+-------+



EDIT: with Spark 2.4, you can use `reverse` to get the last element easily:

F.reverse("path_splitted").getItem(0)
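For context, this expression would replace the getItem/size line above; a small sketch reusing the path_splitted column from the earlier snippet:

df = df.withColumn("name", F.reverse("path_splitted").getItem(0))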

Steven
  • This is awesome. I had already understood the need for the split function, but I wasn't aware of how to use getItem, size and reverse to close the problem. Thank you very much. – WorkBench Jun 27 '19 at 14:27
1

If you don't want to create an extra column that needs to be dropped afterwards, you can chain the pyspark.sql.functions calls. You can also take advantage of pyspark.sql.functions.element_at (Spark 2.4+), which saves one operation (F.size):

df = df.withColumn("filename", F.element_at(F.split(F.input_file_name(), "/"),-1))

or, if you are interested in the parent directory name:

df = df.withColumn("dirname", F.element_at(F.split(F.input_file_name(), "/"),-2))
asiera