I have parquet dirs named like so:
parquetNames = ['NAME1', 'NAME1_MS', 'NAME2', 'NAME2_MQ']
I want to load only the parquets in NAME1 and NAME2, but I'm having trouble with the negative lookahead and alternation. If I do:
s3BaseDir+'NAME*'
then, as expected, all parquet dirs are loaded. From here and here, I understood that I could use a negative lookahead with alternation, like so, to exclude any name containing the full substring "_MS" or "_MQ":
s3BaseDir+'NAME*(?!{_MS,_MQ})'
But I'm getting AnalysisException: 'Path does not exist'.
It seems it's taking the more complex pattern literally instead of interpreting it as a regex.
Are negative lookaheads doable in the path passed to pyspark's spark.read.parquet? Is it possible to combine them with alternation too? If so, how?
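
To make the intent concrete, here is a minimal sketch in plain Python of the selection I'm after, using the `re` module on the directory names rather than on the path string. The bucket prefix is a hypothetical stand-in for s3BaseDir, and the last line is only a possible fallback (passing the explicit paths), not something I have confirmed is the recommended approach:

```python
import re

# Directory names from the question, written as string literals.
parquet_names = ["NAME1", "NAME1_MS", "NAME2", "NAME2_MQ"]

# Negative lookahead combined with alternation:
# accept names starting with "NAME" unless they end in "_MS" or "_MQ".
pattern = re.compile(r"^(?!.*(?:_MS|_MQ)$)NAME.*$")

wanted = [name for name in parquet_names if pattern.match(name)]
print(wanted)

# Hypothetical base path, standing in for s3BaseDir.
s3_base_dir = "s3://my-bucket/some/prefix/"
paths = [s3_base_dir + name for name in wanted]

# Possible fallback, assuming an active SparkSession named `spark`:
# df = spark.read.parquet(*paths)
```

With the four names above, only NAME1 and NAME2 survive the filter, which is the result I would like to get directly from the path pattern if that is possible.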