
I have one dataframe (df1) which is my raw data, from which I want to filter or extract a subset. I have another dataframe (df2) which holds my filter conditions. The catch: if a filter-condition column is blank, that column's condition should be skipped and only the other columns' conditions applied.

Example below:

DF1:

City District Town Country Continent
NY WASHIN DC US America
CZCH SEATLLE DC CZCH Europe
NY MARYLAND DC US S America
NY WASHIN NY US America
NY SEAGA NJ UK Europe

DF2: (sample filter-condition table; it can have multiple rows of conditions)

City  District  Town  Country  Continent
NY              DC
                NJ

Notice that I have left the District, Country and Continent columns blank, as I may or may not use them later. I cannot delete these columns.

OUTPUT DF: should look like this

City District Town Country Continent
NY WASHIN DC US America
NY MARYLAND DC US S America
NY SEAGA NJ UK Europe

So basically I need a filter-condition table that extracts, from the raw data, the rows matching the fields I enter in the filter table. I cannot change or delete columns in DF2; I can only leave a column blank if I don't require that filter condition.

Thanks in advance, Nitz


2 Answers


If DF2 always has exactly one row, drop the all-NaN columns and merge on the remaining ones:

df = df1.merge(df2.dropna(axis=1))
print (df)
  City  District Town Country  Continent
0   NY    WASHIN   DC      US    America
1   NY  MARYLAND   DC      US  S America
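For context on why this works (a minimal sketch, assuming the sample data from the question): `dropna(axis=1)` removes every all-NaN column, so the merge keys become only the columns that are actually filled in.

```python
import numpy as np
import pandas as pd

# single-row filter table with the unused columns left as NaN
df2 = pd.DataFrame({'City': ['NY'], 'District': [np.nan], 'Town': ['DC'],
                    'Country': [np.nan], 'Continent': [np.nan]})

keys = df2.dropna(axis=1)  # only the filled-in columns survive
print(keys)
#   City Town
# 0   NY   DC
```

`merge` then uses the intersection of column names (`City`, `Town`) as join keys by default.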

If DF2 has multiple rows with missing values:

Sample data:

import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame({'City': ['NY', 'CZCH', 'NY', 'NY', 'NY'], 'District': ['WASHIN', 'SEATLLE', 'MARYLAND', 'WASHIN', 'SEAGA'], 'Town': ['DC', 'DC', 'DC', 'NY', 'NJ'], 'Country': ['US', 'CZCH', 'US', 'US', 'UK'], 'Continent': ['America', 'Europe', 'S America', 'America', 'Europe']})
df2 = pd.DataFrame({'City': ['NY', nan], 'District': [nan, nan], 'Town': ['DC', 'NJ'], 'Country': [nan, nan], 'Continent': [nan, nan]})

First remove the missing values by reshaping with DataFrame.stack:

print (df2.stack())
0  City    NY
   Town    DC
1  Town    NJ
dtype: object

Then, for each group (one per row of df2), compare the df1 columns named in that group against the corresponding values from df2:

m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
      for k, v in df2.stack().groupby(level=0)]


print (m)
[0     True
1    False
2     True
3    False
4    False
dtype: bool, 0    False
1    False
2    False
3    False
4     True
dtype: bool]
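In the comprehension, `k` and `v` come from the groupby: `k` is the row label of df2 and `v` is that row's stacked, non-missing (column, value) pairs. A minimal sketch with the sample data:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'City': ['NY', np.nan], 'District': [np.nan, np.nan],
                    'Town': ['DC', 'NJ'], 'Country': [np.nan, np.nan],
                    'Continent': [np.nan, np.nan]})

# one group per original df2 row; droplevel(0) leaves the column names as index
groups = {k: dict(v.droplevel(0)) for k, v in df2.stack().groupby(level=0)}
print(groups)
# {0: {'City': 'NY', 'Town': 'DC'}, 1: {'Town': 'NJ'}}
```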

Finally, combine the masks with np.logical_or.reduce and filter with boolean indexing:

print (np.logical_or.reduce(m))
[ True False  True False  True]

df = df1[np.logical_or.reduce(m)]
print (df)
  City  District Town Country  Continent
0   NY    WASHIN   DC      US    America
2   NY  MARYLAND   DC      US  S America
4   NY     SEAGA   NJ      UK     Europe
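A pandas-only way to combine the per-row masks (a sketch equivalent to `np.logical_or.reduce`, using the same sample data) is `pd.concat` plus `DataFrame.any`:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'City': ['NY', 'CZCH', 'NY', 'NY', 'NY'],
                    'District': ['WASHIN', 'SEATLLE', 'MARYLAND', 'WASHIN', 'SEAGA'],
                    'Town': ['DC', 'DC', 'DC', 'NY', 'NJ'],
                    'Country': ['US', 'CZCH', 'US', 'US', 'UK'],
                    'Continent': ['America', 'Europe', 'S America', 'America', 'Europe']})
df2 = pd.DataFrame({'City': ['NY', np.nan], 'District': [np.nan, np.nan],
                    'Town': ['DC', 'NJ'], 'Country': [np.nan, np.nan],
                    'Continent': [np.nan, np.nan]})

# one boolean mask per df2 row, as in the answer above
m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
     for k, v in df2.stack().groupby(level=0)]

# keep rows of df1 matching at least one condition row of df2
mask = pd.concat(m, axis=1).any(axis=1)
out = df1[mask]
print(out)
```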
jezrael
  • No.. it will have multiple rows with various conditions – Nithish Jan 26 '23 at 09:09
  • @Ningkontan - Can you test now? Are the empty values missing `NaN`s? – jezrael Jan 26 '23 at 09:27
  • Are 'k' and 'v' dummy variables? Is it possible for you to explain how it works? – Nithish Jan 26 '23 at 09:42
  • not missing NaNs. columns deliberately left empty...will try the solution suggested and get back – Nithish Jan 26 '23 at 09:44
  • @jezrael...Got this error.. `chunks = self._reader.read_low_memory(nrows) File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows File "pandas\_libs\parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 258104: invalid start byte` – Nithish Jan 26 '23 at 09:51
  • @Ningkontan - Check [this](https://stackoverflow.com/a/18172249/2901002) – jezrael Jan 26 '23 at 09:55
  • Just tried it..only one filter worked the rest did not. so the logic should be filter first for all data with 'NY',x,y,z as city then from this subset, filter with DC,a,b,c as town.. and so on....Is it possible to do a column by column sequential filtering...... and if a column does not have any value just ignore them.. – Nithish Jan 26 '23 at 11:11
  • @Ningkontan - Can you change data sample? Because not sure if understand what need. – jezrael Jan 26 '23 at 11:13
  • have increased the size of sample data for more clarity – Nithish Jan 27 '23 at 05:53
  • @Ningkontan - Do you try example data with my solution? for me working perfectly. – jezrael Jan 27 '23 at 06:19
  • @Ningkontan - added your data to answer - all working perfectly. – jezrael Jan 27 '23 at 06:24

Another possible solution, using numpy broadcasting (it works even when df2 has more than one row):

import numpy as np

df1.loc[np.sum(np.sum(
        df1.values == df2.values[:, None], axis=2) ==
    np.sum(df2.notna().values, axis=1)[:, None], axis=0) == 1]

Output:

  City  District Town Country  Continent
0   NY    WASHIN   DC      US    America
2   NY  MARYLAND   DC      US  S America
4   NY     SEAGA   NJ      UK     Europe
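To unpack the one-liner (a step-by-step sketch with the question's sample data): the broadcast comparison builds a 3-D boolean array of shape `(len(df2), len(df1), n_cols)`; summing over the column axis counts matched cells per (filter row, data row) pair, and a data row is kept when that count equals the filter row's number of non-NaN conditions.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'City': ['NY', 'CZCH', 'NY', 'NY', 'NY'],
                    'District': ['WASHIN', 'SEATLLE', 'MARYLAND', 'WASHIN', 'SEAGA'],
                    'Town': ['DC', 'DC', 'DC', 'NY', 'NJ'],
                    'Country': ['US', 'CZCH', 'US', 'US', 'UK'],
                    'Continent': ['America', 'Europe', 'S America', 'America', 'Europe']})
df2 = pd.DataFrame({'City': ['NY', np.nan], 'District': [np.nan, np.nan],
                    'Town': ['DC', 'NJ'], 'Country': [np.nan, np.nan],
                    'Continent': [np.nan, np.nan]})

eq = df1.values == df2.values[:, None]       # shape (2, 5, 5): filter row x data row x column
matches = np.sum(eq, axis=2)                 # matched columns per (filter row, data row)
needed = np.sum(df2.notna().values, axis=1)  # non-NaN conditions per filter row
# a data row passes if its match count hits exactly one filter row's requirement
mask = np.sum(matches == needed[:, None], axis=0) == 1
out = df1.loc[mask]
print(out)
```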
PaulS
  • Got this error ` File "<__array_function__ internals>", line 180, in sum File "C:\python\pyver\lib\site-packages\numpy\core\fromnumeric.py", line 2298, in sum return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims, File "C:\python\pyver\lib\site-packages\numpy\core\fromnumeric.py", line 86, in _wrapreduction return ufunc.reduce(obj, axis, dtype, out, **passkwargs) numpy.AxisError: axis 2 is out of bounds for array of dimension 0` – Nithish Jan 27 '23 at 05:32
  • I cannot find a good explanation for your error: Are you running the latest `numpy` version? I have just tried with your new data, and it works perfectly on my computer. – PaulS Jan 27 '23 at 10:52