
Our setting is PySpark. Suppose I create a DataFrame df using the spark.read.csv function, i.e.

df = spark.read.csv("directory/name_file.csv")

Now I need a way to extract "name_file" without, of course, copying and pasting it by hand. In other words, I want a Spark list or DataFrame that contains only the string "name_file".

Please provide only a solution that involves PySpark SQL or Python code compatible with PySpark.

The problem seems straightforward, but I spent a lot of time looking for a solution without finding anything.

jarlh
WorkBench
  • Hello, what have you tried? You can get the `name_file.csv` part by using [basename](https://docs.python.org/3/library/os.path.html#os.path.basename) and then remove the extension using [splitext](https://docs.python.org/3/library/os.path.html#os.path.splitext) –  Jun 27 '19 at 11:31
  • Maybe this post will be helpful: https://stackoverflow.com/questions/39868263/spark-load-data-and-add-filename-as-dataframe-column – Tomasz Jun 27 '19 at 11:32
  • Why not create a variable `filename` with your `name_file` and then use it in `"directory/{}.csv".format(filename)` and anywhere else you need it? – furas Jun 27 '19 at 11:37
  • Thanks Tomasz and Reportgunner, using the linked answer I was able to extract the whole path. The problem now is that I would like to keep only the filename, but PySpark does not allow me to combine the basename function with that code. Do you have any ideas about that? (See the sketch after these comments.) – WorkBench Jun 27 '19 at 11:58
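The comments above sketch two ingredients: input_file_name from the linked post and os.path from the standard library. A minimal way to combine them, as hinted at, is to wrap os.path in a UDF, since plain Python functions cannot be applied to a Column directly, which is exactly what the last comment runs into. This is a sketch, not from the original post; the function name get_stem and the column name name_file are illustrative assumptions.

import os
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# os.path works on plain strings, not Columns; a UDF bridges the gap
get_stem = F.udf(lambda p: os.path.splitext(os.path.basename(p))[0], StringType())

# assumed column name; yields "name_file" without the directory or extension
df = df.withColumn("name_file", get_stem(F.input_file_name()))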

2 Answers

5

There is a function for that: `input_file_name`. Then, you split:

from pyspark.sql import functions as F

# input_file_name() returns the full path of the source file for each row
df = df.withColumn("path", F.input_file_name())
# split the path on "/" into an array of components
df = df.withColumn("path_splitted", F.split("path", "/"))
# the file name is the last element of that array
df = df.withColumn("name", F.col("path_splitted").getItem(F.size("path_splitted") - 1))

df.show()
+---+------------+----------------+-------+
| id|        path|   path_splitted|   name|
+---+------------+----------------+-------+
|  1|/foo/bar.csv|[, foo, bar.csv]|bar.csv|
+---+------------+----------------+-------+



EDIT: with Spark 2.4, you can use `reverse` to get the last element easily:

F.reverse("path_splitted").getItem(0)
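For context, this expression would replace the getItem/size line above; a small sketch reusing the path_splitted column from the earlier snippet:

df = df.withColumn("name", F.reverse("path_splitted").getItem(0))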

Steven
  • This is awesome. I had already understood the need for the split function, but I wasn't aware of how to use getItem, size and reverse to close the problem. Thank you very much. – WorkBench Jun 27 '19 at 14:27
1

If you don't want to create an extra column that needs to be dropped afterwards, you can chain the pyspark.sql.functions calls. You can also take advantage of pyspark.sql.functions.element_at (Spark 2.4+), which saves one operation (F.size):

df = df.withColumn("filename", F.element_at(F.split(F.input_file_name(), "/"),-1))

or, if you are interested in the parent directory name:

df = df.withColumn("dirname", F.element_at(F.split(F.input_file_name(), "/"),-2))
asiera