
I'm facing a very unusual issue. The DataFrame shows data when I run df.show(); however, when I try to write it out as CSV, the operation completes without error but writes a 0-byte empty file.

Is this a bug? Am I missing something?

--PySpark version

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_352
Branch HEAD
Compiled by user ubuntu on 2022-06-09T19:58:58Z
Revision f74867bddfbcdd4d08076db36851e88b15e66556
Url https://github.com/apache/spark

--Python Version

Python 3.9.13 (main, Aug 25 2022, 23:26:10)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

--DataFrame shows data when fetched

>>> result.show()
+--------------------+
|     review_keywords|
+--------------------+
|        [love, echo]|
|         [loved, it]|
|[sometimes, playi...|
|[lot, fun, thing,...|
|             [music]|
|[received, echo, ...|
|[without, cellpho...|
|[think, 5th, one,...|
|      [looks, great]|
|[love, it, i’ve, ...|
|[sent, 85, year, ...|
|[love, it, learni...|
|[purchased, mothe...|
|[love, love, love, ]|
|          [expected]|
|[love, it, wife, ...|
|[really, happy, p...|
|[using, alexa, co...|
|[love, size, 2nd,...|
|[liked, original,...|
+--------------------+
only showing top 20 rows

--however, the write operation creates a 0-byte empty file

>>> from pyspark.sql.functions import col
>>> result.withColumn('review_keywords', col('review_keywords').cast('string')).write.option("header", "true").mode('overwrite').csv("hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt")

--the HDFS file gets created but is 0 bytes

$ hadoop fs -ls hdfs:///tmp/some_dir/some_other_dir/
Found 2 items
drwxr-xr-x   - xyz supergroup          0 2023-03-30 09:18 hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt

1 Answer


What Spark really does when you call df.write.csv is write out a directory, not a single file. As discussed in this SO question, hadoop fs -ls displays the disk usage of the directory entry itself as 0.

If you're interested in the size of the data you've just written, try hadoop fs -dus hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt (note that -dus is deprecated in newer Hadoop releases in favor of hadoop fs -du -s). More info on that here.
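For illustration, here is a minimal self-contained sketch (the local path and DataFrame contents are made up for the example) showing that the CSV output is a directory of part files whose data round-trips correctly:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([("love echo",), ("loved it",)], ["review_keywords"])

# Despite the .txt-style name, this creates a *directory* containing
# part-00000-*.csv files and a _SUCCESS marker, not a single file.
df.write.option("header", "true").mode("overwrite").csv("/tmp/word_tokens.txt")

# Reading the directory back shows the rows were written after all.
spark.read.option("header", "true").csv("/tmp/word_tokens.txt").show()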

Koedlt
  • I feel really stupid now!! A classic example of what not paying attention to details in the documentation can do! Thank you for answering so patiently! XD @Koedlt P.S.: However, it seems the docs also mention that `path` can be a single CSV. I guess that was misleading: https://spark.apache.org/docs/latest/sql-data-sources-csv.html – StrangerThinks Mar 30 '23 at 17:20
  • Haha no worries, I've been very guilty of not reading documentation well in the past :) Good luck for the rest! – Koedlt Mar 30 '23 at 17:40
  • By the way, those docs talk about *reading* a file: that can indeed be a single CSV. Writing a CSV will always write a directory :) – Koedlt Mar 30 '23 at 18:34
  • But you can use `repartition(1)` to force it to a single file within a directory – OneCricketeer Mar 30 '23 at 22:30
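
A minimal sketch of that last suggestion (assuming `df` is the DataFrame from the example above; `coalesce(1)` would also work and avoids a full shuffle):

# A single partition yields a single part-00000-*.csv inside the output directory.
df.repartition(1).write.option("header", "true").mode("overwrite").csv("/tmp/single_part_out")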