1

I am a beginner in coding and would highly appreciate your help. I have a file of 4gb and i am trying to select the the most repeated field in column B (that is not similar to column A) and the corresponding column C

For example,

 Column A       Column B    Column C    id
 Sam             Sam          12        001
 Alex            David        10        001
 David           David        15        002
 Sarah           Alice        23        001 
 Alice           Sam          18        002
 Sam             Alice        20        002
 Anna            Sam          26        003

I would like to exclude if names in column A and column B are same and then find the most repeated names in column B. And also I would like to find the corresponding id of the most repeated fields in column B.

When i tried using the following command, i get memory error.

(df.loc[~(df['Column B'].isin(df['Column A']) & df['Column B'].isin(df['Column C'])), 'Column B'])

1 Answers1

1

You're going to have trouble with a 4Gb dataframe using pandas. I recommend you have a look at dask instead which replicates parts of the pandas api but does out of core computation, i.e. does not load everything into memory at once. It should be a drop in replacement for most operations.

https://dask.pydata.org

If I understand your requirements correctly this should work

import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'],
                   ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'],
                   [12, 10, 15, 23, 18, 20, 26]],
                  index=['Column A', 'Column B', 'Column C']).T
# have a look at dd.read_csv to avoid creating a pandas dataframe
# first, there's no need for that actually
dask_df = dd.from_pandas(df)

The dask dataframe now looks the same as the pandas df

In [39]: dask_df.head()
Out[39]: 
  Column A Column B Column C
0      Sam      Sam       12
1     Alex    David       10
2    David    David       15
3    Sarah    Alice       23
4    Alice      Sam       18

then compute the frequencies of elements in the second column

freq = (dask_df[dask_df['Column A'] != dask_df['Column B']]
        .groupby('Column B') # group by the values in the first column
        .value_counts() # the length of the group is the number of elements in it, i.e. the frequency
       )

freq.compute() # dask requires to call .compute()
Out[42]: 
Alice    2
David    1
Sam      2
dtype: int64

Now that we know the frequencies you can just pick out the rows you want from the dask dataframe

In [44]: dask_df[dask_df['Column B'] == 'Alice'][['Column B', 'Column C']].compute()
Out[44]: 
  Column B Column C
3    Alice       23
5    Alice       20

If you just want to top k=5 frequencies use freq.nlargest(n=5).compute()

Matti Lyra
  • 12,828
  • 8
  • 49
  • 67
  • Thank you ! but what is 'meta' ? Because i get this warning : UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected. Before: .apply(func) After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result or: .apply(func, meta=('x', 'f8')) for series result warnings.warn(msg) –  Jun 08 '17 at 09:49
  • but it worked still :) Thank you for introducing dask to me !! –  Jun 08 '17 at 09:52
  • Is it possible to get top 5 alone ? –  Jun 08 '17 at 10:07
  • 1
    I get that warning as well, I don't know what `meta` is but the results seem to be correct. There's some info on that for `.apply` (http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.apply) and more on the page talking about internals (http://dask.pydata.org/en/latest/dataframe-design.html#metadata). Following those instructions throws an `AttributeError` so I don't exactly know what's going on there. I opened a new question on how to define `meta` https://stackoverflow.com/questions/44432868/dask-dataframe-apply-meta – Matti Lyra Jun 08 '17 at 10:13
  • in case, there is another column with 'id' , is it possible to extract the corresponding ids ? –  Jun 08 '17 at 13:31
  • 1
    just add `id` to the last selector `['Column B', 'Column C', 'id']`. I think you can also replace the `.apply` with `.value_counts` - that gets rid of the warning – Matti Lyra Jun 08 '17 at 15:28