I'm working with a PySpark Pandas DataFrame that looks similar to this:
| col1 | col2 | col3 |
|------|----------------------|------|
| 1 |'C:\windows\a\folder1'| 3 |
| 2 |'C:\windows\a\folder2'| 4 |
| 3 |'C:\windows\b\folder1'| 4 |
| 4 |'C:\unix\b\folder2' | 5 |
| 5 |'C:\unix\a\folder3' | 4 |
The full dataset is quite a bit larger (approx. 55 million rows), so this example is only a small sample.
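For reference, a small reproduction of the sample above can be built like this (a minimal sketch, assuming pyspark.pandas is available and a Spark session is already running):

```python
import pyspark.pandas as ps

# Minimal reproduction of the table above; the real DataFrame has ~55 million rows.
df = ps.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col2": [
        r"C:\windows\a\folder1",
        r"C:\windows\a\folder2",
        r"C:\windows\b\folder1",
        r"C:\unix\b\folder2",
        r"C:\unix\a\folder3",
    ],
    "col3": [3, 4, 4, 5, 4],
})
```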
I'm trying to drop rows based on two conditions:
- col2 is a path on a Windows server that contains folder 'a'
- col3 does not equal 3
The end result should be this:
| col1 | col2 | col3 |
|------|----------------------|------|
| 1 |'C:\windows\a\folder1'| 3 |
| 3 |'C:\windows\b\folder1'| 4 |
| 4 |'C:\unix\b\folder2' | 5 |
| 5 |'C:\unix\a\folder3' | 4 |
NB: Row 5 remains because it is a Unix server path, and row 1 remains because its col3 value is 3.
Since the ps.df.drop() function cannot drop rows based on indexes, I've tried to use ps.df.filter() to filter out the redundant data.
However, when I tested it with just the first condition, it couldn't find any values:
df.filter(like='\\windows\\a\\', axis=0)
Out[1]:
| col1 | col2 | col3 |
|------|------|------|
(no rows returned)
To check that the DataFrame wasn't the problem, I wrote:
df[ df.col2.str.contains('\\windows\\a\\', regex=False) ]
Out[2]:
| col1 | col2 | col3 |
|------|----------------------|------|
| 1 |'C:\windows\a\folder1'| 3 |
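So the data is there. Conceptually, the full drop I'm after would look something like the sketch below (flag rows under \windows\a\ whose col3 is not 3, then keep the complement), but I doubt it scales well:

```python
# Flag rows that are Windows paths under folder 'a' AND whose col3 is not 3,
# then keep everything else (the complement of the flag).
to_drop = df.col2.str.contains('\\windows\\a\\', regex=False) & (df.col3 != 3)
result = df[~to_drop]
```

On the sample this returns the four expected rows, but it relies on str.contains, which is exactly what I'm trying to avoid.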
I also tried the ps.sql() function, but got the same result as with df.filter().
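From memory, that attempt was shaped roughly like the call below; I no longer have the exact query, and the backslash escaping in the LIKE pattern is the part I'm least sure I reproduced faithfully:

```python
# Rough reconstruction of my ps.sql attempt; {tbl} is substituted with a
# temporary view for the DataFrame passed in as a keyword argument.
result = ps.sql(
    r"SELECT * FROM {tbl} WHERE NOT (col2 LIKE '%\\\\windows\\\\a\\\\%' AND col3 != 3)",
    tbl=df,
)
```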
I don't want to use df.col2.str.contains() on the full dataset because it isn't efficient at that scale.
Can someone please tell me why df.filter() fails here, and how I can drop the requested rows in an efficient manner?