
I use the Datalore kernel on datalore.jetbrains.com. My notebook contains the following three cells (this is a minimal working example that reproduces the error):

#%%
class MyClass:
    def __getattribute__(self, name):
        return 123
#%%
aaa = MyClass()
#%%
aaa

When I try to execute the third cell I get an error: Can't use object "aaa" outside of the cell where it's defined. The message clearly implies that the variable aaa can only be used inside the second cell. But why does the Datalore kernel have such a limitation?

hsestupin

1 Answer


The short answer is: the Datalore kernel saves the runtime environment to disk after executing each cell.

Why does the Datalore kernel need to do this? Here comes the long answer. To understand the root cause of the issue, we need to know how the Datalore kernel executes cells.

It's easier to grasp if we forget everything we know about the Jupyter kernel. The Datalore kernel differs drastically from the Jupyter kernel: it is reproducible and incremental.

Reproducibility

Have you ever been in a situation where you needed to re-run all the cells in a notebook from the very beginning because you lost track of the order in which they were executed? Have you ever shared a notebook together with notes describing the cell execution order? With the Datalore kernel you don't need anything like that. It ensures that cells are always evaluated in exactly the same order: the order in which they are defined in the notebook. Whenever you execute the N-th cell, all the previous cells are automatically evaluated by the kernel. You might think this must be extremely slow, but it's not. This brings us to the second key property of the kernel.
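The ordering guarantee can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not Datalore's actual code, and the function name is made up for illustration:

```python
def run_up_to(cells, n):
    """Run cells 1..n in their notebook order, starting from a fresh
    environment. A simplified model of the reproducibility guarantee;
    not Datalore's real implementation."""
    env = {}
    for source in cells[:n]:
        exec(source, env)
    return env

cells = ["a = 1", "b = a + 1"]
env = run_up_to(cells, 2)
# env["b"] == 2, no matter which cells were run before or in what order
```

Because the result depends only on the cell sources and their order, the output is the same every time, which is what makes the notebook reproducible.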

Incrementality

The Datalore kernel saves the result of every cell execution to disk. That result is simply the runtime environment: in fact, just a dictionary mapping names to objects. This is why the Datalore kernel doesn't need to recalculate unchanged cells: the result is already known and persisted on disk. So in the typical real-world situation, where you work on one cell and run it from time to time, the previous cells are executed only the first time, not on every run. This property naturally imposes the following restriction: if you want to use an object in several cells, the object must be serializable. Otherwise you are limited to using it within a single cell.

P.S. In this particular example the issue is caused by an incorrect implementation of the __getattribute__ method. This implementation means that every invocation of getattr(aaa, attr_name, None) returns 123, which obviously can't work in every case. That's why an error occurred on the attempt to serialize the object aaa, and therefore it wasn't saved to disk.
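You can reproduce the serialization failure outside of Datalore with plain pickle: pickle's own lookup of internal hooks such as __reduce_ex__ also goes through __getattribute__, gets 123 back instead of a method, and fails when it tries to call it:

```python
import pickle

class MyClass:
    def __getattribute__(self, name):
        # Every attribute lookup on an instance returns 123,
        # including pickle's internal lookup of __reduce_ex__
        return 123

aaa = MyClass()

try:
    pickle.dumps(aaa)
except TypeError as exc:
    # pickle received 123 instead of a callable reduction hook
    print("serialization failed:", exc)
```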

  • Oh that's a very interesting answer. I was actually wondering how this is working. What if my intermediate results of a previous cell are many GB in size? Wouldn't that slow down things significantly? I'm imagining a huge dataframe object, which is changed in multiple cells. – lumbric Mar 04 '21 at 08:41
  • Serialization of the cell's state in Datalore kernel indeed comes at some cost. It introduces an additional post-processing delay after cell evaluation and of course requires additional disk space. Unnoticeable for simple calculations, it may introduce significant overhead for operations with large datasets. Every "touched" variable will be persisted, even if it wasn't modified (unfortunately, Python doesn't provide a reliable and universal way of tracking object modifications). – hsestupin Mar 10 '21 at 14:14
  • 1
    To mitigate this problem, we added a special mechanism reducing the serialization overhead by preventing serialization of unmodified objects. If you're pretty sure that a certain object is not going to be modified, you can tell the kernel not to serialize it by adding `readonly` comment annotation at the beginning of the cell: (1st cell) `data = pandas.read_csv('huge.csv')`, (2nd cell) `# readonly(data) do_something_with(data) ` – hsestupin Mar 10 '21 at 14:14
  • 2
    Oh great! Thanks! This is really interesting, I would love to see a longer blog post or a forum entry on this subject. – lumbric Mar 10 '21 at 14:17