I'm running some parallel computations to evaluate the goodness of fit across many regressions. After dispatching about 60K computations, I've somehow managed to get IPython into a strange state.

Pushing objects out to all of the nodes:

%%px
model_analytics = ResultsAnalytics(rows,  store['data_model'])

And dispatching the work:

%%time 
ar = lview.map(lambda x: model_analytics.generate_prediction_heuristic(x), rows.index)

This works fine. In fact, most of the work gets completed:

%%time 
completed = ar.progress
print completed
print "Remaining {0} min".format((ar.elapsed/completed) * (len(rows) - completed)/60)

66229

Remaining 0.0205939930854 min

CPU times: user 211 ms, sys: 163 ms, total: 374 ms

Wall time: 364 ms
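For reference, the estimate above is just the average time per completed task multiplied by the number of tasks left. A minimal sketch of that arithmetic as a standalone helper, assuming `ar.elapsed` is in seconds; the name `remaining_minutes` is hypothetical and not part of IPython:

```python
def remaining_minutes(elapsed_seconds, completed, total):
    """Estimate minutes left from the average time per completed task.

    Hypothetical helper: elapsed_seconds stands in for ar.elapsed,
    completed for ar.progress, and total for len(rows).
    """
    if completed <= 0:
        # No finished tasks yet, so there is no average to extrapolate from.
        return None
    seconds_per_task = float(elapsed_seconds) / completed
    return seconds_per_task * (total - completed) / 60.0


# Stand-in numbers, not taken from the run above.
print(remaining_minutes(120.0, 60, 90))
```

The guard for `completed <= 0` avoids a division by zero if the check runs before any task has finished.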

But there's one job that doesn't complete!

for i, status in enumerate(ar.status): 
    if status != 'ok': print i, status 

35230 None
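The same check can be written as a small helper that splits task indices by status. This is only a sketch: it assumes the statuses behave like a plain list of strings, with `None` for a task that has not reported back (as in the output above), and the name `split_by_status` is hypothetical:

```python
def split_by_status(statuses):
    """Return (ok, not_ok) lists of task indices.

    Hypothetical helper: statuses stands in for ar.status, where a task
    that has not replied yet shows up as None.
    """
    ok, not_ok = [], []
    for i, status in enumerate(statuses):
        if status == 'ok':
            ok.append(i)
        else:
            not_ok.append(i)
    return ok, not_ok


# Stand-in data; with the real run this would be ar.status.
ok, not_ok = split_by_status(['ok', 'ok', None, 'ok'])
print(not_ok)
```

Keeping both lists makes it easy to look up the message ids of the finished tasks separately from the stuck one.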

msg = ar.msg_ids[35230]
lview.abort(msg)
print lview.get_result(msg)
print lview.wait(jobs=msg, timeout=5)

<AsyncResult: unknown>

False

Edit: I was hoping I could at least retrieve all of the results except the defunct one, but no joy.

msgs = ar.msg_ids[0:35230]
res1 = [lview.get_result(msg) for msg in msgs]
print res1[0:10]

[<AsyncResult: unknown>, <AsyncResult: unknown>, <AsyncResult: unknown>, <AsyncResult: unknown>, <AsyncResult: unknown>, <AsyncResult: unknown>, <AsyncResult: unknown>, <AsyncResult: unknown>, <AsyncResult: unknown>, <AsyncResult: unknown>]

I haven't yet tried to reproduce this. What could cause this error? Did I do something wrong? Is there a more graceful way of recovering from this going forward?

Versions:

  • IPython: 3.2.1
  • pyzmq: 14.7.0
  • zeromq: dpkg -l | grep libzmq yields:

    ii libzmq-dev:amd64 2.2.0+dfsg-5 amd64 lightweight messaging kernel (development files)
    ii libzmq1:amd64 2.2.0+dfsg-5 amd64 lightweight messaging kernel (shared library)
