
I have parquet dirs named like so:

parquetNames = [NAME1, NAME1_MS, NAME2, NAME2_MQ]

I want to load only the parquets in NAME1 and NAME2, but I'm having trouble with the negative lookahead and alternation. If I do:

s3BaseDir+'NAME*'

then, as expected, all parquet dirs are loaded. Based on here and here, I thought I could use a negative lookahead combined with alternation to exclude the full substrings "_MS" and "_MQ":

s3BaseDir+'NAME*(?!{_MS,_MQ})'

But I'm getting

AnalysisException: 'Path does not exist'.

It seems it's taking the more complex regex pattern literally.
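For what it's worth, this behaviour can be reproduced outside Spark: glob matchers have no regex features, so lookahead syntax is matched character-for-character. Python's `fnmatch` is not Hadoop's glob implementation (Hadoop globs additionally support `{a,b}` alternation), but it illustrates the same point with the directory names from above:

```python
from fnmatch import fnmatch

names = ["NAME1", "NAME1_MS", "NAME2", "NAME2_MQ"]

# A plain glob wildcard matches everything, like 'NAME*' in spark.read.parquet:
all_dirs = [n for n in names if fnmatch(n, "NAME*")]
# all_dirs == ["NAME1", "NAME1_MS", "NAME2", "NAME2_MQ"]

# Regex lookahead syntax is treated as (near-)literal characters, so nothing
# matches -- the Spark analogue is the 'Path does not exist' error.
no_match = [n for n in names if fnmatch(n, "NAME*(?!{_MS,_MQ})")]
# no_match == []
```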

Are negative lookaheads supported in PySpark's `spark.read.parquet`? Can they be combined with alternation? If so, how?

mayank agrawal
xv70
  • Could you not just use `re` or do you have to use `spark.read.parquet`? – miike3459 Nov 29 '18 at 21:46
  • I thought about retrieving all the parquet names and filtering them with the standard `re` module, but I think that involves third-party libraries that I don't want/cannot use. – xv70 Nov 29 '18 at 21:48
  • 2
    `re` is not third-party and usually does not involve third party libraries, if you use it correctly. – miike3459 Nov 29 '18 at 21:50
  • Fair enough. I meant to say that I'd like to avoid fetching all the parquet file names with, for example, boto3, regexing them, and then reading only the desired parquets. I'd like to use Spark directly to filter at load time, but it seems that regex functionality is very limited at the moment in `spark.read.parquet`. – xv70 Dec 01 '18 at 18:16
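The workaround discussed in the comments — listing the directory names and filtering them with the standard-library `re` module, where negative lookahead with alternation works as expected — could be sketched like this (the names are hypothetical, and the actual S3 listing via boto3 is omitted):

```python
import re

# Hypothetical parquet directory names; in practice these would come
# from an S3 listing (e.g. via boto3).
parquet_names = ["NAME1", "NAME1_MS", "NAME2", "NAME2_MQ"]

# Negative lookahead with alternation: keep names NOT followed by _MS or _MQ.
pattern = re.compile(r"^NAME\d+(?!_(?:MS|MQ))")
keep = [n for n in parquet_names if pattern.match(n)]
# keep == ["NAME1", "NAME2"]
```

The filtered list could then be passed to Spark, which accepts multiple paths, instead of relying on glob syntax to do the exclusion.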

0 Answers