28

If we have a folder `folder` containing all .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if I have a folder `folder` containing even more folders named by date, like 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?

In my case, the structure is even more nested & complex, so a general answer is preferred.

kamalbanga

4 Answers

40

If the directory structure is regular, let's say something like this:

folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt

you can use a * wildcard for each level of nesting, as shown below:

>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()

[u'file:/folder/a/a/aa.txt',
 u'file:/folder/a/b/ab.txt',
 u'file:/folder/b/a/ba.txt',
 u'file:/folder/b/b/bb.txt']
zero323
    This solved my particular issue. Btw, what if the directory structure is not regular? – kamalbanga Aug 27 '15 at 08:27
  • Then things start getting messy :) The idea is more or less the same, but it is unlikely you can prepare patterns that can be easily reused. You can always use normal tools to traverse the filesystem and collect paths instead of hardcoding them (see the sketch after these comments). – zero323 Aug 27 '15 at 12:01
  • Why does this not work with `/folder/**/*.txt`? I have basically the exact same directory structure and I'd like to open all with `sc.wholeTextFiles('data/**/*.json')` but that does not seem to work ..? – Stefan Falk Feb 13 '18 at 11:12
  • @zero323, I couldn't use the wildcard in wholeTextFiles; I'm getting the error IllegalArgumentException: 'java.net.URISyntaxException: Expected scheme-specific part at index'. – VSe Dec 07 '19 at 09:15
  • In this case, since all of the files under the "a" directory are .txt, the last /*.txt is unnecessary. – bloodrootfc Oct 11 '21 at 14:56
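
Following up on zero323's comment above, here is a minimal sketch of collecting paths with ordinary filesystem tools, assuming the files live on the local filesystem and sc is an existing SparkContext (e.g. in the pyspark shell):

import os

# Walk the tree and collect the path of every .txt file, however
# irregularly the directories are nested.
paths = []
for root, _, files in os.walk("/folder"):
    for name in files:
        if name.endswith(".txt"):
            paths.append(os.path.join(root, name))

# textFile (like wholeTextFiles) accepts a comma-separated list of paths.
rdd = sc.textFile(",".join(paths))
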
29

Spark 3.0 provides an option, recursiveFileLookup, to recursively load files from nested subfolders.

val df = sparkSession.read
  .option("recursiveFileLookup", "true")
  .option("header", "true")
  .csv("src/main/resources/nested")

This recursively loads the files from src/main/resources/nested and its subfolders.
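
The same option is available from PySpark; a minimal sketch, assuming an existing SparkSession named spark:

# Recursively read every CSV under the folder, including nested subfolders.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .option("header", "true")
      .csv("src/main/resources/nested"))
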

NNK
5

If you want to use only the files whose names start with "a", you can use

sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")

as well. We can use * as a wildcard.
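
Hadoop-style globs also support character classes and alternation, so, as a sketch against the directory layout from the accepted answer, you could match several prefixes in one pattern:

sc.wholeTextFiles("/folder/[ab]/*/*.txt")   # single-character class
sc.wholeTextFiles("/folder/{a,b}/*/*.txt")  # alternation
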

Arun Goudar
2

sc.wholeTextFiles("/directory/201910*/part-*.lzo") gets all the matching file names, not the file contents.

If you want to load the contents of all matched files in a directory, you should use

sc.textFile("/directory/201910*/part-*.lzo")

and set recursive directory reading:

sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

Tip: Scala differs from Python here; use the setting below for Scala:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
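
Putting the Python pieces together, a minimal sketch (paths as in the example above, sc an existing SparkContext): set the flag first, then read.

# Enable recursive directory listing on the underlying Hadoop configuration,
# then load the contents of every matching file.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.input.fileinputformat.input.dir.recursive", "true")
rdd = sc.textFile("/directory/201910*/part-*.lzo")
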
Colin Wang