
With a CSV file laid out like this:

dtime,Ask,Bid,AskVolume,BidVolume
2003-08-04 00:01:06.430000,1.93273,1.93233,2400000,5100000
2003-08-04 00:01:15.419000,1.93256,1.93211,21900000,4000000
2003-08-04 00:01:18.298000,1.93240,1.93220,18700001,7600000
2003-08-04 00:01:24.950000,1.93264,1.93234,800000,600000
2003-08-04 00:01:26.073000,1.93284,1.93244,2800000,800000
2003-08-04 00:01:29.340000,1.93286,1.93246,7100000,2400000
2003-08-04 00:01:50.452000,1.93278,1.93258,4000000,4800000
2003-08-04 00:01:56.979000,1.93294,1.93244,22600000,13500000
2003-08-04 00:02:20.078000,1.93248,1.93238,3200000,5600000
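As a minimal sanity check (not part of the original script, using a small in-memory copy of the sample rows above), the format string `'%Y-%m-%d %H:%M:%S.%f'` does match this data when the first line is consumed as a header row rather than parsed as data:

```python
import io
import pandas as pd

# Small in-memory stand-in for the first rows of the CSV above.
sample = io.StringIO(
    "dtime,Ask,Bid,AskVolume,BidVolume\n"
    "2003-08-04 00:01:06.430000,1.93273,1.93233,2400000,5100000\n"
    "2003-08-04 00:01:15.419000,1.93256,1.93211,21900000,4000000\n"
)
# header=0 consumes the header line, so the literal string 'dtime'
# is never handed to the datetime parser as a value.
df = pd.read_csv(sample, header=0)
parsed = pd.to_datetime(df["dtime"], format="%Y-%m-%d %H:%M:%S.%f")
print(parsed.iloc[0])  # 2003-08-04 00:01:06.430000
```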

Using the following code:

import sys
import pandas as pd
import numpy as np
import json
import psycopg2 as pg
import pandas.io.sql as psql
import dask
import dask.dataframe as dd
import datetime as dt
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.externals.joblib import parallel

def parse_dates(df):
    return pd.to_datetime(df['dtime'], format='%Y-%m-%d %H:%M:%S.%f')

def main():
    meta = ('time', pd.Timestamp)
    dask.set_options(get=dask.multiprocessing.get)
    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Doing Start of Processing CSV")
    df = dd.read_csv('/zdb1/trading/tick_data/GBPJPY.csv', sep=',', names=['dtime', 'Ask', 'Bid', 'AskVolume', 'BidVolume'],)
    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...reading CSV and above datetime")
    df.map_partitions(parse_dates, meta=meta).compute()
    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...finished datetime index above grouped")
    grouped_data = df.dropna()
    ticks_data = grouped_data['Ask'].resample('24H').ohlc()

    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...grouped_data.resample")
    sell_data = grouped_data.as_matrix(columns=['Ask']).compute()

    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...grouped_data.as_matrix")
    bandwidth = estimate_bandwidth(sell_data, quantile=0.1, n_samples=100).compute()
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, n_jobs=-1)

    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...MeanShift setup")
    ms.fit(sell_data).compute()
    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...MeanShift fit")

    ml_results = []
    for k in range(len(np.unique(ms.labels_))):
        my_members = ms.labels_ == k
        values = sell_data[my_members, 0]

        ml_results.append(min(values))
        ml_results.append(max(values))

    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...MeanShift for k")
    ticks_data.to_json('ticks.json', date_format='iso', orient='index')

    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...ticks_data.to_json")
    with open('ml_results.json', 'w') as f:
        f.write(json.dumps(ml_results))

    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...Closing all connections")

if __name__ == "__main__":
    main()

I get a date error and I do not understand why. Could someone kindly point out the error and how to fix it so the code runs? There is something about dask I am not understanding here.

# clear ; python3.5 wtf.py
Sunday, 12. February 2017 08:36:51PM Doing Start of Processing CSV
Sunday, 12. February 2017 08:36:53PM Done...reading CSV and above datetime
/usr/local/lib/python3.5/site-packages/dask/async.py:245: DtypeWarning: Columns (1,2,3,4) have mixed types. Specify dtype option on import or set low_memory=False.
  return [_execute_task(a, cache) for a in arg]
Traceback (most recent call last):
  File "wtf.py", line 56, in <module>
    main()
  File "wtf.py", line 22, in main
    df.map_partitions(parse_dates, meta=meta).compute()
  File "/usr/local/lib/python3.5/site-packages/dask/base.py", line 79, in compute
    return compute(self, **kwargs)[0]
  File "/usr/local/lib/python3.5/site-packages/dask/base.py", line 179, in compute
    results = get(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/dask/multiprocessing.py", line 86, in get
    dumps=dumps, loads=loads, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/dask/async.py", line 493, in get_async
    raise(remote_exception(res, tb))
dask.async.ValueError: time data 'dtime' doesn't match format specified

Traceback
---------
  File "/usr/local/lib/python3.5/site-packages/dask/async.py", line 268, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/usr/local/lib/python3.5/site-packages/dask/dataframe/core.py", line 3013, in apply_and_enforce
    df = func(*args, **kwargs)
  File "wtf.py", line 14, in parse_dates
    return pd.to_datetime(df['dtime'], format = '%Y-%m-%d %H:%M:%S.%f')
  File "/usr/local/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/pandas/tseries/tools.py", line 421, in to_datetime
    values = _convert_listlike(arg._values, False, format)
  File "/usr/local/lib/python3.5/site-packages/pandas/tseries/tools.py", line 413, in _convert_listlike
    raise e
  File "/usr/local/lib/python3.5/site-packages/pandas/tseries/tools.py", line 401, in _convert_listlike
    require_iso8601=require_iso8601
  File "pandas/tslib.pyx", line 2374, in pandas.tslib.array_to_datetime (pandas/tslib.c:44175)
  File "pandas/tslib.pyx", line 2503, in pandas.tslib.array_to_datetime (pandas/tslib.c:42192)

Any ideas on what is wrong here? It works fine with pandas, but I cannot make it work with dask. I cannot figure out how to set the timestamp as the primary index in dask!

A smaller code section showing the problem:

def main():
    dask.set_options(get=dask.multiprocessing.get)
    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Doing Start of Processing CSV")
    df = dd.read_csv('/zdb1/trading/tick_data/GBPJPY.csv', sep=',', names=['dtime', 'Ask', 'Bid', 'AskVolume', 'BidVolume'])
    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...reading CSV and above resample")
    grouped_data = df.dropna()
    ticks_data = grouped_data['Ask'].resample('24H').ohlc()

    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...grouped_data.resample")
    sell_data = grouped_data.as_matrix(columns=['Ask']).compute()

    print (dt.datetime.now().strftime("%A, %d. %B %Y %I:%M:%S%p"),"Done...grouped_data.as_matrix")

if __name__ == "__main__":
    main()

With the following error:

Monday, 13. February 2017 10:50:40AM Doing Start of Processing CSV
Monday, 13. February 2017 10:50:41AM Done...reading CSV and above resample
Traceback (most recent call last):
  File "wtfs.py", line 26, in <module>
    main()
  File "wtfs.py", line 19, in main
    ticks_data = grouped_data['Ask'].resample('24H').ohlc()
  File "/usr/local/lib/python3.5/site-packages/dask/dataframe/core.py", line 1415, in resample
    return _resample(self, rule, how=how, closed=closed, label=label)
  File "/usr/local/lib/python3.5/site-packages/dask/dataframe/tseries/resample.py", line 22, in _resample
    resampler = Resampler(obj, rule, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/dask/dataframe/tseries/resample.py", line 86, in __init__
raise ValueError(msg)
ValueError: Can only resample dataframes with known divisions
See dask.pydata.io/en/latest/dataframe-partitions.html for more information.

In pandas, it works fine. When I go down the path of splitting up the CSV into divisions, I run into the timestamp issue, so I have to find a way to set the index on the timestamp the way pandas does in order to solve it. The following is the pandas code; what would be the equivalent in dask?

pandas.read_csv(filename, parse_dates=[0], index_col=0, names=['Date_Time', 'Ask', 'Bid'], date_parser=lambda x: pandas.to_datetime(x, format="%Y-%m-%d %H:%M:%S.%f"))

Dask does not seem to support parsing the timestamp and setting it as the index at CSV read time! That is where my problem is, and I cannot figure out how to make dask work!

    There is a lot going on in your example. Are you able to reduce the code example to something smaller around just the issue you're facing? See http://stackoverflow.com/help/mcve . Also have you seen the `parse_dates=` keyword to `read_csv` and the `set_index` method? – MRocklin Feb 13 '17 at 13:33
  • Yes, I can cut out everything and just leave the date part. That is the problem: pandas has no problem with the date, but dask does not want to set the date as the index. read_csv does not support parsing the timestamp at read time so that you can set the index to the date. I will cut out all the other stuff and repost if I can edit the posting! – user2777145 Feb 13 '17 at 16:41
  • Were you able to fix this @user2777145??? Facing the same issue, works fine with pandas.... – the_ccalderon Oct 08 '18 at 15:30

0 Answers