Edit3: Loaded the core file into gdb. Edit2: Included the .cc code. Edit1: Loaded debug symbols.

I'm trying to run the example MNIST program of the attention-sampling GitHub library. The error output is as follows.

root@4d9b40a6f414:/vol/attention-sampling# ./mnist.py ./test_mnist/mnist-small ./test_mnist/mnist-experiment
2023-03-25 02:57:58.914902: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/vol/attention-sampling/ats/ops/extract_patches
Loaded dataset with the following parameters
{
    "n_train": 5000,
    "n_test": 1000,
    "width": 500,
    "height": 500,
    "scale": 0.2,
    "noise": false,
    "seed": 0
}
Segmentation fault (core dumped)

I tracked the problem down, and it appears to be associated with libpatches.so, which was built when I installed the library. I debugged libpatches.so with gdb inside a Docker container; the container was started with the command below, and the gdb output follows.

docker run -it --cap-add=SYS_PTRACE --ulimit core=-1 --security-opt seccomp=unconfined --name attention-gdb-core -v /home/pristina/attention-sampling:/vol/attention-sampling attention-gdb-core

Edit3: gdb with core file

root@7de7ea7530a0:/vol/attention-sampling/ats/ops/extract_patches# gdb libpatches.so core.1679919003.python.510 
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from libpatches.so...done.

warning: core file may not match specified executable file.
[New LWP 510]
[New LWP 580]
[New LWP 600]
[New LWP 582]
[New LWP 561]
[New LWP 578]
[New LWP 596]
[New LWP 584]
[New LWP 562]
[New LWP 565]
[New LWP 586]
[New LWP 563]
[New LWP 564]
[New LWP 581]
[New LWP 585]
[New LWP 566]
[New LWP 598]
[New LWP 570]
[New LWP 590]
[New LWP 574]
[New LWP 592]
[New LWP 597]
[New LWP 576]
[New LWP 594]
[New LWP 588]
[New LWP 568]
[New LWP 572]
[New LWP 589]
[New LWP 567]
[New LWP 575]
[New LWP 599]
[New LWP 577]
[New LWP 583]
[New LWP 595]
[New LWP 573]
[New LWP 587]
[New LWP 579]
[New LWP 593]
[New LWP 569]
[New LWP 591]
[New LWP 571]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python ./mnist.py ./test_mnist/mnist-small ./test_mnist/mnist-experiment'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fe2fd0e3fb2 in tensorflow::shape_inference::InferenceContext::WithRank(tensorflow::shape_inference::ShapeHandle, long long, tensorflow::shape_inference::ShapeHandle*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
[Current thread is 1 (Thread 0x7fe3b715f740 (LWP 510))]
(gdb) bt
#0  0x00007fe2fd0e3fb2 in tensorflow::shape_inference::InferenceContext::WithRank(tensorflow::shape_inference::ShapeHandle, long long, tensorflow::shape_inference::ShapeHandle*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#1  0x00007fe2fd0e517b in tensorflow::shape_inference::InferenceContext::MakeShapeFromShapeTensor(int, tensorflow::shape_inference::ShapeHandle*) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#2  0x00007fe28c4dfd2c in __lambda13::operator() (__closure=0x0, c=0x7fff9f4653e8)
    at /vol/attention-sampling/ats/ops/extract_patches/extract_patches.cc:31
#3  0x00007fe28c4e0057 in __lambda13::_FUN (c=0x7fff9f4653e8) at /vol/attention-sampling/ats/ops/extract_patches/extract_patches.cc:41
#4  0x00007fe28c4e2c1d in std::_Function_handler<tensorflow::Status (tensorflow::shape_inference::InferenceContext*), tensorflow::Status (*)(tensorflow::shape_inference::InferenceContext*)>::_M_invoke(std::_Any_data const&, tensorflow::shape_inference::InferenceContext*)
    (__functor=..., __args#0=0x7fff9f4653e8) at /usr/include/c++/4.8/functional:2057
#5  0x00007fe2fd0dfaf2 in tensorflow::shape_inference::InferenceContext::Run(std::function<tensorflow::Status (tensorflow::shape_inference::InferenceContext*)> const&) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#6  0x00007fe306a28cb5 in tensorflow::ShapeRefiner::RunShapeFn(tensorflow::Node const*, tensorflow::OpRegistrationData const*, tensorflow::ExtendedInferenceContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#7  0x00007fe306a2a97d in tensorflow::ShapeRefiner::AddNode(tensorflow::Node const*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#8  0x00007fe30080a792 in TF_FinishOperation ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#9  0x00007fe300581a96 in _wrap_TF_FinishOperation ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#10 0x000000000050a12f in ?? ()
#11 0x00007fe28bbc0d30 in ?? ()
---Type <return> to continue, or q <return> to quit---
#12 0x00007fe28c33cc00 in ?? ()
#13 0x0000000000000000 in ?? ()
(gdb) l
1   // This file is part of Eigen, a lightweight C++ template library
2   // for linear algebra.
3   //
4   // Copyright (C) 2008 Gael Guennebaud <gael.guennebaud@inria.fr>
5   // Copyright (C) 2007-2011 Benoit Jacob <jacob.benoit.1@gmail.com>
6   //
7   // This Source Code Form is subject to the terms of the Mozilla
8   // Public License v. 2.0. If a copy of the MPL was not distributed
9   // with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
10  
(gdb) where
#0  0x00007fe2fd0e3fb2 in tensorflow::shape_inference::InferenceContext::WithRank(tensorflow::shape_inference::ShapeHandle, long long, tensorflow::shape_inference::ShapeHandle*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#1  0x00007fe2fd0e517b in tensorflow::shape_inference::InferenceContext::MakeShapeFromShapeTensor(int, tensorflow::shape_inference::ShapeHandle*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#2  0x00007fe28c4dfd2c in __lambda13::operator() (__closure=0x0, c=0x7fff9f4653e8) at /vol/attention-sampling/ats/ops/extract_patches/extract_patches.cc:31
#3  0x00007fe28c4e0057 in __lambda13::_FUN (c=0x7fff9f4653e8) at /vol/attention-sampling/ats/ops/extract_patches/extract_patches.cc:41
#4  0x00007fe28c4e2c1d in std::_Function_handler<tensorflow::Status (tensorflow::shape_inference::InferenceContext*), tensorflow::Status (*)(tensorflow::shape_inference::InferenceContext*)>::_M_invoke(std::_Any_data const&, tensorflow::shape_inference::InferenceContext*) (__functor=..., __args#0=0x7fff9f4653e8) at /usr/include/c++/4.8/functional:2057
#5  0x00007fe2fd0dfaf2 in tensorflow::shape_inference::InferenceContext::Run(std::function<tensorflow::Status (tensorflow::shape_inference::InferenceContext*)> const&) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#6  0x00007fe306a28cb5 in tensorflow::ShapeRefiner::RunShapeFn(tensorflow::Node const*, tensorflow::OpRegistrationData const*, tensorflow::ExtendedInferenceContext*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#7  0x00007fe306a2a97d in tensorflow::ShapeRefiner::AddNode(tensorflow::Node const*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#8  0x00007fe30080a792 in TF_FinishOperation () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#9  0x00007fe300581a96 in _wrap_TF_FinishOperation () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#10 0x000000000050a12f in ?? ()
#11 0x00007fe28bbc0d30 in ?? ()
#12 0x00007fe28c33cc00 in ?? ()
#13 0x0000000000000000 in ?? ()

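The code of extract_patches.cc, which frames #2 and #3 of the backtrace point to, is as follows. Note that (gdb) l above listed an Eigen header rather than this file. (The GitHub page for the .cc, .cu, and .h files)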

#include "extract_patches.h"

#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/shape_inference.h"


using namespace tensorflow;

using CPUDevice = Eigen::ThreadPoolDevice;
using GPUDevice = Eigen::GpuDevice;


REGISTER_OP("ExtractPatches")
    .Attr("T: {float, double, uint8, uint16}")
    .Input("input: T")
    .Input("offsets: int32")
    .Input("size: int32")
    .Output("output: T")
    .SetShapeFn([](shape_inference::InferenceContext *c) {
        // Define shape handle variables for all the intermediate shapes
        shape_inference::ShapeHandle size, channels, batch_and_samples;
        shape_inference::ShapeHandle out1, out2;

        // Gather all the intermediate sizes
        TF_RETURN_IF_ERROR(c->MakeShapeFromShapeTensor(2, &size));
        TF_RETURN_IF_ERROR(c->Subshape(c->input(0), -1, &channels));
        TF_RETURN_IF_ERROR(c->Subshape(c->input(1), 0, 2, &batch_and_samples));

        // Make and set the output shape
        TF_RETURN_IF_ERROR(c->Concatenate(batch_and_samples, size, &out1));
        TF_RETURN_IF_ERROR(c->Concatenate(out1, channels, &out2));
        c->set_output(0, out2);

        return Status::OK();
    });
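To make the intent of the shape function above easier to follow, here is a minimal Python sketch of the shape it appears to compute: the first two dimensions of `offsets`, then the patch size, then the last dimension of `input`. The function name, the use of `None` for unknown dimensions, and the example shapes (in particular `offsets` having shape `[batch, n_patches, 2]`) are assumptions made for illustration; they are not taken from the library.

```python
from typing import Optional, Sequence, Tuple

Dim = Optional[int]  # None stands in for an unknown dimension


def extract_patches_output_shape(
    input_shape: Sequence[Dim],
    offsets_shape: Sequence[Dim],
    size_value: Sequence[int],
) -> Tuple[Dim, ...]:
    """Mirror the shape function: offsets.shape[:2] + size + input.shape[-1:]."""
    # Corresponds to c->Subshape(c->input(1), 0, 2, &batch_and_samples)
    batch_and_samples = tuple(offsets_shape[:2])
    # Corresponds to c->Subshape(c->input(0), -1, &channels)
    channels = tuple(input_shape[-1:])
    # Corresponds to c->MakeShapeFromShapeTensor(2, &size), which reads the
    # *value* of the third input as a shape
    size = tuple(size_value)
    # Corresponds to the two Concatenate calls
    return batch_and_samples + size + channels


# Hypothetical shapes based on the mnist.py defaults (500x500 images,
# 10 patches of 50x50); the batch dimension is unknown at graph-build time.
result = extract_patches_output_shape(
    input_shape=(None, 500, 500, 1),
    offsets_shape=(None, 10, 2),
    size_value=(50, 50),
)
print(result)
```

In the backtrace, the crash happens inside the `MakeShapeFromShapeTensor` step (frame #1), so the third input, `size`, is the one being processed when the segfault occurs.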

My environment is:

g++ 4.8.5
gcc 4.8.5
Ubuntu 18.04
Keras                            2.3.1
Keras-Applications               1.0.8
Keras-Preprocessing              1.1.2
tensorflow                       1.15.4+nv
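
One thing that may be worth ruling out with this environment, as an assumption rather than anything the backtrace confirms, is a C++ ABI mismatch between libpatches.so and the installed TensorFlow binary. g++ 4.8.5 predates the dual libstdc++ ABI introduced with GCC 5, so it cannot target the newer ABI if TensorFlow was built with it. The sketch below only parses a list of compiler flags for the `_GLIBCXX_USE_CXX11_ABI` macro; the helper name and the example flag list are hypothetical.

```python
from typing import Iterable, Optional

ABI_MACRO = "-D_GLIBCXX_USE_CXX11_ABI="


def cxx11_abi_from_flags(flags: Iterable[str]) -> Optional[int]:
    """Return the _GLIBCXX_USE_CXX11_ABI value found in the flags, or None."""
    for flag in flags:
        if flag.startswith(ABI_MACRO):
            return int(flag[len(ABI_MACRO):])
    return None


# Hypothetical flag list, shaped like what tf.sysconfig.get_compile_flags()
# returns; the real values must be read from the installed TensorFlow.
example_flags = ["-I/usr/local/include/tensorflow", "-D_GLIBCXX_USE_CXX11_ABI=1"]
print(cxx11_abi_from_flags(example_flags))
```

In the real environment the list would come from `tf.sysconfig.get_compile_flags()`. If the reported value were 1, a library built with g++ 4.8.5 could not match it, and rebuilding libpatches.so with a newer compiler would be one way to test the hypothesis.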

This is the mnist.py code. (The documentation of mnist.py is also available.)

#!/usr/bin/env python
#
# Copyright (c) 2019 Idiap Research Institute, http://www.idiap.ch/
# Written by Angelos Katharopoulos <angelos.katharopoulos@idiap.ch>
#

"""Implement attention sampling for classifying MNIST digits."""

import argparse
import json
from os import path

from keras import backend as K
from keras.callbacks import Callback, ModelCheckpoint
from keras.datasets import mnist
from keras.layers import Input, Conv2D, AveragePooling2D, GlobalMaxPooling2D, \
    Dense
from keras.models import Model, Sequential
from keras.optimizers import SGD, Adam
from keras.utils import Sequence
import numpy as np
from skimage.io import imsave

from ats.core import attention_sampling
from ats.utils.layers import L2Normalize, SampleSoftmax, ResizeImages, \
    TotalReshape
from ats.utils.regularizers import multinomial_entropy
from ats.utils.training import Batcher


class MNIST(Sequence):
    """Load a Megapixel MNIST dataset. See make_mnist.py."""
    def __init__(self, dataset_dir, train=True):
        with open(path.join(dataset_dir, "parameters.json")) as f:
            self.parameters = json.load(f)

        filename = "train.npy" if train else "test.npy"
        N = self.parameters["n_train" if train else "n_test"]
        W = self.parameters["width"]
        H = self.parameters["height"]
        scale = self.parameters["scale"]

        self._high_shape = (H, W, 1)
        self._low_shape = (int(scale*H), int(scale*W), 1)
        self._data = np.load(path.join(dataset_dir, filename))

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        if i >= len(self):
            raise IndexError()

        # Placeholders
        x_low = np.zeros(self._low_shape, dtype=np.float32).ravel()
        x_high = np.zeros(self._high_shape, dtype=np.float32).ravel()

        # Fill the sparse representations
        data = self._data[i]
        x_low[data[0][0]] = data[0][1]
        x_high[data[1][0]] = data[1][1]

        # Reshape to their final shape
        x_low = x_low.reshape(self._low_shape)
        x_high = x_high.reshape(self._high_shape)

        return [x_low, x_high], data[2]


class AttentionSaver(Callback):
    def __init__(self, output, att_model, data):
        self._att_path = path.join(output, "attention_{:03d}.png")
        self._patches_path = path.join(output, "patches_{:03d}_{:03d}.png")
        self._att_model = att_model
        (self._x, self._x_high), self._y = data[0]
        self._imsave(
            path.join(output, "image.png"),
            self._x[0, :, :, 0]
        )

    def on_epoch_end(self, e, logs):
        att, patches = self._att_model.predict([self._x, self._x_high])
        self._imsave(self._att_path.format(e), att[0])
        np.save(self._att_path.format(e)[:-4], att[0])
        for i, p in enumerate(patches[0]):
            self._imsave(self._patches_path.format(e, i), p[:, :, 0])

    def _imsave(self, filepath, x):
        x = (x*65535).astype(np.uint16)
        imsave(filepath, x, check_contrast=False)


def get_model(outputs, width, height, scale, n_patches, patch_size, reg):
    # Define the shapes
    shape_high = (height, width, 1)
    shape_low = (int(height*scale), int(width*scale), 1)

    # Make the attention and feature models
    attention = Sequential([
        Conv2D(8, kernel_size=3, activation="tanh", padding="same",
               input_shape=shape_low),
        Conv2D(8, kernel_size=3, activation="tanh", padding="same"),
        Conv2D(1, kernel_size=3, padding="same"),
        SampleSoftmax(squeeze_channels=True, smooth=1e-5)
    ])
    feature = Sequential([
        Conv2D(32, kernel_size=7, activation="relu", input_shape=shape_high),
        Conv2D(32, kernel_size=3, activation="relu"),
        Conv2D(32, kernel_size=3, activation="relu"),
        Conv2D(32, kernel_size=3, activation="relu"),
        GlobalMaxPooling2D(),
        L2Normalize()
    ])

    # Let's build the attention sampling network
    x_low = Input(shape=shape_low)
    x_high = Input(shape=shape_high)
    features, attention, patches = attention_sampling(
        attention,
        feature,
        patch_size,
        n_patches,
        replace=False,
        attention_regularizer=multinomial_entropy(reg)
    )([x_low, x_high])
    y = Dense(outputs, activation="softmax")(features)

    return (
        Model(inputs=[x_low, x_high], outputs=[y]),
        Model(inputs=[x_low, x_high], outputs=[attention, patches])
    )


def get_optimizer(args):
    optimizer = args.optimizer

    if optimizer == "sgd":
        return SGD(lr=args.lr, momentum=args.momentum, clipnorm=args.clipnorm)
    elif optimizer == "adam":
        return Adam(lr=args.lr, clipnorm=args.clipnorm)

    raise ValueError("Invalid optimizer {}".format(optimizer))


def main(argv):
    parser = argparse.ArgumentParser(
        description=("Train a model with attention sampling on the "
                     "artificial mnist dataset")
    )
    parser.add_argument(
        "dataset",
        help="The directory that contains the dataset (see make_mnist.py)"
    )
    parser.add_argument(
        "output",
        help="An output directory"
    )

    parser.add_argument(
        "--optimizer",
        choices=["sgd", "adam"],
        default="adam",
        help="Choose the optimizer for Q1"
    )
    parser.add_argument(
        "--lr",
        type=float,
        default=0.001,
        help="Set the optimizer's learning rate"
    )
    parser.add_argument(
        "--momentum",
        type=float,
        default=0.9,
        help="Choose the momentum for the optimizer"
    )
    parser.add_argument(
        "--clipnorm",
        type=float,
        default=1,
        help=("Clip the gradient norm to avoid exploding gradients "
              "towards the end of convergence")
    )

    parser.add_argument(
        "--patch_size",
        type=lambda x: tuple(int(xi) for xi in x.split("x")),
        default="50x50",
        help="Choose the size of the patch to extract from the high resolution"
    )
    parser.add_argument(
        "--n_patches",
        type=int,
        default=10,
        help="How many patches to sample"
    )
    parser.add_argument(
        "--regularizer_strength",
        type=float,
        default=0.0001,
        help="How strong should the regularization be for the attention"
    )

    parser.add_argument(
        "--batch_size",
        type=int,
        default=128,
        help="Choose the batch size for SGD"
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=500,
        help="How many epochs to train for"
    )

    args = parser.parse_args(argv)

    # Load the data
    training_dataset = MNIST(args.dataset)
    test_dataset = MNIST(args.dataset, train=False)
    training_batched = Batcher(training_dataset, args.batch_size)
    test_batched = Batcher(test_dataset, args.batch_size)
    print("Loaded dataset with the following parameters")
    print(json.dumps(training_dataset.parameters, indent=4))

    model, att_model = get_model(
        outputs=10,
        width=training_dataset.parameters["width"],
        height=training_dataset.parameters["height"],
        scale=training_dataset.parameters["scale"],
        n_patches=args.n_patches,
        patch_size=args.patch_size,
        reg=args.regularizer_strength
    )
    model.compile(
        loss="categorical_crossentropy",
        optimizer=get_optimizer(args),
        metrics=["accuracy", "categorical_crossentropy"]
    )
    model.summary()

    callbacks = [
        AttentionSaver(args.output, att_model, training_batched),
        ModelCheckpoint(
            path.join(args.output, "weights.{epoch:02d}.h5"),
            save_weights_only=True
        )
    ]
    model.fit_generator(
        training_batched,
        validation_data=test_batched,
        epochs=args.epochs,
        callbacks=callbacks
    )
    loss, accuracy, ce = model.evaluate_generator(test_batched, verbose=1)
    print("Test loss: {}".format(ce))
    print("Test error: {}".format(1-accuracy))


if __name__ == "__main__":
    main(None)
  • You probably want to obtain debug symbols or rebuild the library with them (i.e. gcc -g3) then revise your question with said information. Tag it with either c or c++, not both, and it's a python program that is crashing... so tag it python? – Allan Wind Mar 25 '23 at 03:47
  • @AllanWind Thanks for your suggestions. I rebuilt the library with debug symbols and edited the tags. – Pristina Wang Mar 26 '23 at 00:35
  • Load the core dump into your debugger and see what's going on and how you got into that situation. – Jesper Juhl Mar 26 '23 at 01:55
  • Did you update the backtrace now that you have debug symbols? – Allan Wind Mar 26 '23 at 02:52
  • @AllanWind Yes, I updated the whole output of gdb including the backtrace in my post. – Pristina Wang Mar 26 '23 at 10:47
  • Unfortunately, none of the symbols are resolved in your stacktrace so it tells us very little (it might happen 3 calls in, or stack is corrupted). – Allan Wind Mar 26 '23 at 18:54
  • @JesperJuhl Thanks for your advice! I loaded core dump in gdb. – Pristina Wang Mar 27 '23 at 12:24
  • @AllanWind I loaded the core into gdb and updated my post. Hope it helps. Thanks! – Pristina Wang Mar 27 '23 at 12:25
  • I upvoted your question as it's much better now. Not sure I will be of much help resolving it. The next step would be to do `frame 0`, `where` to figure out what caused the segfault. It says it's in the method `WithRank()` but not seeing it in the source listing. If you haven't I suggest you engage with a more specific community around the python or c library. File a bug perhaps? – Allan Wind Mar 28 '23 at 01:38
  • @AllanWind Ok! Thanks a lot! I'll check out your suggestions. – Pristina Wang Mar 29 '23 at 02:04
