Cannot stream files in subfolders with wildcards in pySpark streaming

Question

This code works only if I set directory = "s3://bucket/folder/2022/10/18/4/*". With the wildcard pattern below, it does not pick up the files in the subfolders:

from pyspark.sql.functions import from_json
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 30)

directory = "s3://bucket/folder/*/*/*/*/*"
stream_data = ssc.textFileStream(directory)

def readMyStream(rdd):
  # Skip empty micro-batches.
  if not rdd.isEmpty():
    # Parse the batch's text as JSON.
    df = spark.read.option("multiline", "true").json(rdd)
    print('Started the Process')
    print('Selection of Columns')
    df = df.select("c1", "c2", "c3", "c4", "c5")
    df.show()

# Run readMyStream on each micro-batch RDD.
stream_data.foreachRDD(readMyStream)

ssc.start()
ssc.awaitTermination()

The docs say it supports POSIX glob patterns. Any help is appreciated. Thank you.

Have you tried directory = "s3://bucket/folder/"? If yes, what issue are you getting? — Anjaneya Tripathi, Jul 20 '22 at 04:00

score 0 · Answer 1 · answered Jul 20 '22 at 14:23

The issue is that the final * should not be there. The docs say "it is a pattern of directories, not of files in directories". I didn't understand that the first time I read it.

A POSIX glob pattern can be supplied, such as "hdfs://namenode:8040/logs/2017/*". Here, the DStream will consist of all files in the directories matching the pattern. That is: it is a pattern of directories, not of files in directories.

directory = "s3://bucket/folder/*/*/*/*/*" should be directory = "s3://bucket/folder/*/*/*/*/"
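
To see the directory-versus-file distinction without a cluster, here is a minimal local sketch. It uses Python's glob module as a stand-in for the glob matching Spark does, which is an assumption made only for illustration; the temporary folder tree and the file name part-0000.json are hypothetical.

```python
import glob
import os
import tempfile

# Build a small local tree that mimics folder/<year>/<month>/<day>/<hour>/<file>.
root = tempfile.mkdtemp()
hour_dir = os.path.join(root, "2022", "10", "18", "4")
os.makedirs(hour_dir)
with open(os.path.join(hour_dir, "part-0000.json"), "w") as f:
    f.write("{}")

# Five wildcards: the last one matches the files inside the hour directories.
file_matches = glob.glob(os.path.join(root, "*", "*", "*", "*", "*"))

# Four wildcards plus a trailing separator: matches the hour directories themselves.
dir_matches = glob.glob(os.path.join(root, "*", "*", "*", "*", ""))

files_only = all(os.path.isfile(p) for p in file_matches)
dirs_only = all(os.path.isdir(p) for p in dir_matches)

print("five wildcards match files:", files_only)
print("four wildcards match directories:", dirs_only)
```

textFileStream wants the second kind of pattern: one that resolves to directories, whose files it then monitors.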
