28

If we have a folder `folder` containing all .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if I have a folder `folder` containing even more folders named by date, like 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?

In my case, the structure is even more nested & complex, so a general answer is preferred.

kamalbanga

4 Answers

40

If the directory structure is regular, let's say something like this:

folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt

you can use a * wildcard for each level of nesting, as shown below:

>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()

[u'file:/folder/a/a/aa.txt',
 u'file:/folder/a/b/ab.txt',
 u'file:/folder/b/a/ba.txt',
 u'file:/folder/b/b/bb.txt']
zero323
    This solved my particular issue. Btw, what if the directory structure is not regular? – kamalbanga Aug 27 '15 at 08:27
  • Then things start getting messy :) The idea is more or less the same, but it is unlikely you can prepare patterns that can be easily reused. You can always use normal tools to traverse the filesystem and collect paths instead of hardcoding them (see the sketch after these comments). – zero323 Aug 27 '15 at 12:01
  • Why does this not work with `/folder/**/*.txt`? I have basically the exact same directory structure and I'd like to open all with `sc.wholeTextFiles('data/**/*.json')` but that does not seem to work ..? – Stefan Falk Feb 13 '18 at 11:12
  • @zero323, I couldn't use the wildcard in wholeTextFiles; I'm getting the error IllegalArgumentException: 'java.net.URISyntaxException: Expected scheme-specific part at index'. – VSe Dec 07 '19 at 09:15
  • In this case, since all of the files under the "a" directory are .txt, the last /*.txt is unnecessary. – bloodrootfc Oct 11 '21 at 14:56
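
Following up on zero323's comment above, here is a minimal sketch of collecting paths with ordinary filesystem tools, assuming the files live on the local filesystem and sc is an existing SparkContext (e.g. in the pyspark shell):

import os

# Walk the tree and collect the path of every .txt file, however
# irregularly the directories are nested.
paths = []
for root, _, files in os.walk("/folder"):
    for name in files:
        if name.endswith(".txt"):
            paths.append(os.path.join(root, name))

# textFile (like wholeTextFiles) accepts a comma-separated list of paths.
rdd = sc.textFile(",".join(paths))
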
29

Spark 3.0 provides an option, recursiveFileLookup, to recursively load files from nested subfolders.

val df = sparkSession.read
  .option("recursiveFileLookup", "true")
  .option("header", "true")
  .csv("src/main/resources/nested")

This recursively loads the files from src/main/resources/nested and its subfolders.
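
The same option is available from PySpark; a minimal sketch, assuming an existing SparkSession named spark:

# Recursively read every CSV under the folder, including nested subfolders.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .option("header", "true")
      .csv("src/main/resources/nested"))
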

NNK
5

If you want to use only the files whose names start with "a", you can use

sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")

as well. We can use * as a wildcard.
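
Hadoop-style globs also support character classes and alternation, so, as a sketch against the directory layout from the accepted answer, you could match several prefixes in one pattern:

sc.wholeTextFiles("/folder/[ab]/*/*.txt")   # single-character class
sc.wholeTextFiles("/folder/{a,b}/*/*.txt")  # alternation
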

Arun Goudar
2

sc.wholeTextFiles("/directory/201910*/part-*.lzo") gets all the matching file names, not the file contents.

If you want to load the contents of all matched files in a directory, you should use

sc.textFile("/directory/201910*/part-*.lzo")

and set recursive directory reading:

sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

Tip: Scala differs from Python here; use the setting below for Scala:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
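
Putting the Python pieces together, a minimal sketch (paths as in the example above, sc an existing SparkContext): set the flag first, then read.

# Enable recursive directory listing on the underlying Hadoop configuration,
# then load the contents of every matching file.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.input.fileinputformat.input.dir.recursive", "true")
rdd = sc.textFile("/directory/201910*/part-*.lzo")
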
Colin Wang