Dropping rows in PySpark based on indexes

Question

I'm working with a PySpark Pandas DataFrame that looks similar to this:

| col1 |          col2        | col3 |
|------|----------------------|------|
| 1    |'C:\windows\a\folder1'|  3   |
| 2    |'C:\windows\a\folder2'|  4   |
| 3    |'C:\windows\b\folder1'|  4   |
| 4    |'C:\unix\b\folder2'   |  5   |
| 5    |'C:\unix\a\folder3'   |  4   |

The total dataset is quite a bit larger (approx. 55 mill rows), so this example will only be partial.

I'm trying to drop rows from based on two arguments:

That col2 is a path that is a Windows server and has folder 'a'
That col3 equals does not equal 3

The end result should be this:

| col1 |          col2        | col3 |
|------|----------------------|------|
| 1    |'C:\windows\a\folder1'|  3   |
| 3    |'C:\windows\b\folder1'|  4   |
| 4    |'C:\unix\b\folder2'   |  5   |
| 5    |'C:\unix\a\folder3'   |  4   |

NB: Note that 5 remains since it is a Unix-server and 1 remains since the col3 value is 3

Since the ps.df.drop() function cannot drop based on indexes, I've tried to use the ps.df.filter to filter away the redundant data.

However, when I tested it with just the first argument, it can't find any values:

df.filter(like='\\windows\\a\\', axis=0)

Out[1]:
    | col1  | col2  | col3  |
    |-------|-------|-------|
    | blank | blank | blank |

To check that the DataFrame wasn't the problem I wrote:

df[ df.col2.str.contains('\\windows\\a\\', regex=False) ]

Out[2]:
    | col1 |          col2        | col3 |
    |------|----------------------|------|
    | 1    |'C:\windows\a\folder1'|  3   |

I also tried the ps.sql() function but got the same result as with df.filter().
I don't want to use the df.col2.str.contains() function in my dataset because it isn't efficient with such a large dataset.

Can someone please tell me why df.filter fails and how I can drop the requested columns in an efficient manor?

score 0 · Answer 1 · answered Dec 20 '22 at 05:55

The value for like argument in pandas dataframe filter is for the index but not on the actual column values. When I use filter on pandas dataframe using the following code, it gives result as:

enter image description here

Since you are using pandas dataframe on spark API, you can first convert this dataframe to a spark dataframe using to_spark().
I have taken the sample data that you have given and created a spark dataframe directly for demonstration.
To get the desired result, I have used filter along with like to filter out based on the given conditions.

df.filter(~((df.c2.like("%windows%a%")) & (df.c3 != 3))).show()

enter image description here

Dropping rows in PySpark based on indexes

1 Answers1