
As the title suggests, I have a dataframe with two columns (the column names are 0 and 1). For example, this is the dataframe content:

A  8
B  6
C  9

Now I have a dictionary:
aliases = {'A': 'P', 'B': 'E', 'C': 'Q'}
and I want to apply it to the first column, so the expected output would be:

P  8
E  6
Q  9

In pandas I used to do this with df = df.replace({0: aliases}), but that won't work with dask.

I also came across this SO question and tried to use mask as follows:
df = df.mask(df[0], aliases)
but I got TypeError("bad operand type for unary ~: 'str'").

EDIT:

I tried to implement it as suggested in the linked post and ran into an error with the metadata.
The code right now is:

new_columns = ['identifier', 'position', 'a', 'b', 'c', 'd']
pileup_df = pileup_df.rename(columns=dict(zip(pileup_df.columns, new_columns)))

pileup_df['identifier'] = pileup_df['identifier'].map(lambda x: alias_dict[x], meta=('identifier', pd.Series))
pileup_df.compute()

and I get the following traceback:

File "filter_pileup_from_lists_with_coordinate_name_conversion.py", line 72, in apply_conversion
    pileup_df['identifier'] = pileup_df['identifier'].map(lambda x: alias_dict[x], meta=('identifier', pd.Series))
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/dask/dataframe/core.py", line 3055, in map
    meta = make_meta(meta, index=getattr(make_meta(self), "index", None))
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/dask/utils.py", line 505, in __call__
    return meth(arg, *args, **kwargs)
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/dask/dataframe/utils.py", line 339, in make_meta_object
    return _empty_series(x[0], x[1], index=index)
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/dask/dataframe/utils.py", line 283, in _empty_series
    return pd.Series([], dtype=dtype, name=name, index=index)
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/pandas/core/series.py", line 249, in __init__
    dtype = self._validate_dtype(dtype)
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/pandas/core/generic.py", line 253, in _validate_dtype
    dtype = pandas_dtype(dtype)
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/pandas/core/dtypes/common.py", line 1778, in pandas_dtype
    raise TypeError(f"dtype '{dtype}' not understood")
TypeError: dtype '<class 'pandas.core.series.Series'>' not understood

I have tried changing pd.Series to pd.DataFrame and dict, and both result in a similar traceback.

Eliran Turgeman

1 Answer


pandas

You can use the map function as follows:

import pandas as pd

aliases = {'A': 'P', 'B': 'E', 'C': 'Q'}

df = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [8, 6, 9]})
df['col1'] = df['col1'].map(aliases)

which outputs the following DataFrame:

  col1  col2
0    P     8
1    E     6
2    Q     9
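
One difference from the replace call in the question is worth keeping in mind: map returns NaN for values that have no key in the dictionary, while replace leaves them unchanged. The sketch below illustrates this on a small Series; the value 'Z' is a made-up unmapped entry, not something from the question's data.

```python
import pandas as pd

aliases = {'A': 'P', 'B': 'E', 'C': 'Q'}

# 'Z' is a hypothetical value with no entry in aliases
s = pd.Series(['A', 'B', 'Z'], name='col1')

mapped = s.map(aliases)          # values without a key become NaN
replaced = s.replace(aliases)    # values without a key are left as they are
kept = s.map(aliases).fillna(s)  # map, then fall back to the original value
```

If every value in the column is guaranteed to be a key of the dictionary, the two approaches should give the same result.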

dask

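For dask you can use the same map method. Assign the lazy result back to the column and call compute() on the whole dataframe at the end:

import dask.dataframe as dd

data_dask = dd.from_pandas(df, npartitions=1)
# meta is (column name, dtype); passing a class such as pd.Series raises the TypeError shown in the question
data_dask['col1'] = data_dask['col1'].map(aliases, meta=('col1', 'object'))
result = data_dask.compute()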
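
As for the metadata error in the question: the traceback ends in pd.Series([], dtype=dtype, name=name, index=index), which suggests that the second element of the meta tuple is used as a dtype. The following pandas-only sketch reproduces that last step under this assumption; empty_meta is a hypothetical helper written for illustration and is not part of dask.

```python
import pandas as pd

def empty_meta(name, dtype):
    # Hypothetical helper mirroring the pd.Series([], dtype=..., name=...)
    # call that appears at the bottom of the traceback in the question.
    return pd.Series([], dtype=dtype, name=name)

# A dtype such as 'object' yields a valid empty Series
ok = empty_meta('identifier', 'object')

# A container class such as pd.Series is not a dtype and is rejected
try:
    empty_meta('identifier', pd.Series)
    failed = False
except TypeError:
    failed = True
```

Under that reading, meta=('identifier', 'object') would be the form to pass to map instead of meta=('identifier', pd.Series).
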
Antoine Dubuis