2

I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this:

import dask.dataframe as dd

df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())

The file opendns-random-domains.txt contains just a long list of domain names. This is what the output of the above code looks like:

                  domain_name  domain_length
0                webmagnat.ro             12
1     nickelfreesolutions.com             23
2  scheepvaarttelefoongids.nl             26
3                  tursan.net             10
4       plannersanonymous.com             21

domain_name       object
domain_length    float64
dtype: object

Traceback (most recent call last):
  File "nlargest_test.py", line 9, in <module>
    print(top_3.head())
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
    result = result.compute()
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
    return compute(self, **kwargs)[0]
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
    results = get(dsk, keys, **kwargs)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
    **kwargs)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
    raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object

Traceback
---------
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
    f = lambda df: df.nlargest(n, columns)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
    return self._nsorted(columns, n, 'nlargest', keep)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
    ser = getattr(self[columns[0]], method)(n, keep=keep)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
    return algos.select_n(self, n=n, keep=keep, method='nlargest')
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
    raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))

I'm confused, because I'm calling nlargest on the column which is of type float64 but still get this error saying it cannot be called on dtype object. Also this works fine in pandas. How can I get the n longest entries from a DataFrame?

Glorfindel
vollkorn
  • Just noticed that with pandas the type of the domain_length column is int64 instead of float64. But I'd expect nlargest() to work with dask and float64 just as well. – vollkorn Aug 16 '16 at 15:20

5 Answers

3

An explicit type conversion solved this for me:

df['column'].astype(str).astype(float).nlargest(5)
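
To illustrate why the conversion can help, here is a minimal pandas-only sketch. The data and the series name are made up for the example, and it assumes every value in the column parses as a number.

```python
import pandas as pd

# Hypothetical data: numeric values stored as strings in an object column.
values = pd.Series(["12", "23", "26", "10", "21"], dtype=object, name="column")

# Convert explicitly to a numeric dtype before asking for the largest values.
top_2 = values.astype(float).nlargest(2)
print(top_2)
```

If some values might not parse as numbers, the conversion itself will fail, so it is worth checking the data first.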
nnaqa
1

This is how my first data frame looks.

This is how my new data frame looks after getting top 5.

station_count.nlargest(5, 'count')

You have to call nlargest on a column with a numeric dtype (such as int), not on a string column, so that the values can be compared. Pass the number of rows n first, followed by the name of the numeric column.
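
As a rough sketch of the difference, the following pandas example calls nlargest once on an integer column and once on an object column. The station names and the `count_text` column are invented for illustration; only `station_count` and `count` come from the snippet above.

```python
import pandas as pd

# Hypothetical station data; "count_text" deliberately holds strings (dtype object).
station_count = pd.DataFrame({
    "station": ["A", "B", "C", "D"],
    "count": [5, 42, 17, 8],
    "count_text": pd.Series(["5", "42", "17", "8"], dtype=object),
})

# Numeric column: nlargest returns the rows with the highest counts.
top_2 = station_count.nlargest(2, "count")
print(top_2)

# Object column: pandas raises a TypeError instead of comparing strings.
try:
    station_count.nlargest(2, "count_text")
    error = None
except TypeError as exc:
    error = exc
print(error)
```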

guzel6031
0

I tried to reproduce your problem but things worked fine. Can I recommend that you produce a Minimal Complete Verifiable Example?

Pandas example

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})

In [3]: df['y'] = df.x.map(len)

In [4]: df
Out[4]: 
      x  y
0     a  1
1    bb  2
2   ccc  3
3  dddd  4

In [5]: df.nlargest(3, 'y')
Out[5]: 
      x  y
3  dddd  4
2   ccc  3
1    bb  2

Dask dataframe example

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})

In [3]: import dask.dataframe as dd

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: ddf['y'] = ddf.x.map(len)

In [6]: ddf.nlargest(3, 'y').compute()
Out[6]: 
      x  y
3  dddd  4
2   ccc  3
1    bb  2

Alternatively, perhaps this is just working now on the git master version?

Community
MRocklin
  • Your dask dataframe example results on my machine in exactly the same error as my example above. I'm using the current dask version in pip. I'll try to use the current git version. – vollkorn Aug 16 '16 at 16:25
  • Indeed, the current git master version does not show this error and works as expected. I guess it was a bug that was fixed in the past 21 days. Thanks for helping me figure this out. – vollkorn Aug 16 '16 at 16:32
  • [I have a similar case](https://stackoverflow.com/questions/48036296/reading-a-numeric-column-from-excel-file-into-pandas-dataframe-results-in-object) – Krzysztof Słowiński Dec 30 '17 at 17:35
0

You only need to change the type of the respective column to int or float using .astype().

For example, in your case:

top_3 = df['domain_length'].astype(float).nlargest(3)
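
Calling nlargest on a single column returns only the lengths. If the matching rows are needed as well, one option is to look them up by index. The sketch below uses plain pandas rather than dask and reuses the five sample domains from the question's output, so treat it as an illustration rather than a drop-in fix.

```python
import pandas as pd

# Sample rows taken from the question's printed output.
df = pd.DataFrame({"domain_name": [
    "webmagnat.ro",
    "nickelfreesolutions.com",
    "scheepvaarttelefoongids.nl",
    "tursan.net",
    "plannersanonymous.com",
]})
df["domain_length"] = df["domain_name"].map(len)

# Convert the length column explicitly, then take the three largest values.
top_3_lengths = df["domain_length"].astype(float).nlargest(3)

# Use the index of the result to recover the full rows.
top_3_rows = df.loc[top_3_lengths.index]
print(top_3_rows)
```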
Jeremy Caney
0

If you want the most frequent values in a string column, you can use value_counts() followed by nlargest(n), where n is the number of values to return.

df['your_column'].value_counts().nlargest(3)

It returns the three most frequent values in that column, together with their counts.
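
A small pandas sketch of this pattern, with invented sample values in place of a real column:

```python
import pandas as pd

# Hypothetical string column with repeated values.
df = pd.DataFrame({"your_column": ["a", "b", "a", "c", "a", "b", "d"]})

# value_counts() yields integer counts per value, so nlargest() can rank them.
top_2 = df["your_column"].value_counts().nlargest(2)
print(top_2)
```

Since value_counts() sorts by count in descending order by default, `.head(2)` would give the same rows here. Note that this ranks values by frequency, not by string length, so it addresses a different question from the one asked above.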