
When reading a 4 MB file with pandas.read_pickle(), EOFError: Ran out of input is raised. The file was written with pandas.to_pickle(), but due to a software bug the thread running pandas.to_pickle() may have been killed before the write completed. Is there a way to retrieve at least some of the data from this file?
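
For reference, the symptom can be simulated by truncating a healthy pickle file; the file names below are only illustrative, and depending on where the cut falls the read fails with EOFError: Ran out of input (as in our case) or a pickle.UnpicklingError:

import pandas as pd

# Write a healthy pickle, then simulate the killed writer by keeping only
# the first half of its bytes.
pd.DataFrame({"a": range(1000), "b": ["x"] * 1000}).to_pickle("data.pkl")

with open("data.pkl", "rb") as src:
    payload = src.read()
with open("truncated.pkl", "wb") as dst:
    dst.write(payload[: len(payload) // 2])  # keep only the first half

pd.read_pickle("truncated.pkl")  # fails while deserializing the truncated stream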


1 Answer


I found a hint in this Stack Overflow question. The code below is an example of how I recovered all the relevant data in our case. The structure of the code and the amount of recoverable data obviously depend on the corrupted file. Good luck :-)

import io
import pickle

import numpy as np

# Read the corrupted file fully into memory first.
with open("path_to_file.pkl", "rb") as f:
    corrupted_data = io.BytesIO(f.read())

# Use the pure-Python unpickler; we can't see the internal state of the C version.
unpickler = pickle._Unpickler(corrupted_data)
try:
    unpickler.load()
except EOFError:
    pass

# Everything that was fully deserialized before the stream ran out is still on
# the unpickler's metastack; which indices hold useful data depends on the file.
metastack = unpickler.metastack
mgr = metastack[1]
bool_columns: np.ndarray = mgr[2].values
num_rows = bool_columns.shape[1]
int_columns: np.ndarray = mgr[3].values
object_columns: np.ndarray = metastack[2]
value_list: list[np.ndarray] = object_columns[4]
print(f"{num_rows=}", bool_columns.shape, int_columns.shape)
object_column1: list[np.ndarray] = value_list[:num_rows]
object_column2: list[np.ndarray] = value_list[num_rows:2 * num_rows]
object_column3: list[np.ndarray] = value_list[2 * num_rows:]
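
As a follow-up, here is a minimal sketch of how the salvaged arrays could be put back into a DataFrame and persisted. The column names are placeholders, bool_columns[0] / int_columns[0] assume the column of interest sits in the first row of its block (block values are stored one row per column), and all recovered arrays are assumed to still have num_rows entries:

import pandas as pd

# Reassemble the recovered arrays into a fresh DataFrame and write it back out.
# Column names are placeholders chosen for illustration.
recovered = pd.DataFrame({
    "bool_col": bool_columns[0],
    "int_col": int_columns[0],
    "obj_col1": object_column1,
    "obj_col2": object_column2,
    "obj_col3": object_column3,
})
recovered.to_pickle("recovered.pkl")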