I have (somewhat successfully) dockerized a software repository (KPConv) that I plan to work with and extend, using the following Dockerfile:

FROM tensorflow/tensorflow:1.12.0-devel-gpu-py3

# Install other required python stuff
RUN apt-get update && apt-get install -y --fix-missing --no-install-recommends \
    python3-setuptools python3-pip python3-tk

RUN pip install --upgrade pip
RUN pip3 install numpy scikit-learn psutil matplotlib pyqt5 laspy 

# Compile the custom operations and CPP wrappers
# For some reason this must be done within container, cannot access libcuda.so during docker build
# Ref: https://stackoverflow.com/questions/66575232
#COPY . /kpconv
#WORKDIR /kpconv/tf_custom_ops
#RUN sh compile_op.sh
#WORKDIR /kpconv/cpp_wrappers
#RUN sh compile_wrappers.sh

# Set the working directory to kpconv
WORKDIR /kpconv

# Set root user password so we can su/sudo later if need be
RUN echo "root:pass" | chpasswd

# Create a user and group akin to the host within the container
ARG USER_ID
ARG GROUP_ID
RUN addgroup --gid $GROUP_ID user
RUN adduser --disabled-password --gecos '' --uid $USER_ID --gid $GROUP_ID user
USER user

# Build with:
#sudo docker build -t kpconv-test \
#    --build-arg USER_ID=$(id -u) \
#    --build-arg GROUP_ID=$(id -g) \
#    .

At the end of this Dockerfile I followed a post found here, which describes a way to correctly set the permissions of files generated by/within a container so that the host user can access them without having to alter the file permissions afterwards.
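As a quick sanity check (illustrative only; the uid/gid values depend on the host user), the container user should mirror the host IDs:

$ sudo docker run --rm kpconv-test id
uid=1000(user) gid=1000(user) groups=1000(user)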

Also, this software repository makes use of custom TensorFlow operations in C++ (KPConv/tf_custom_ops) along with Python wrappers for custom C++ code (KPConv/cpp_wrappers). The author of KPConv, Hugues Thomas, provides a bash script for each, which compiles them into various .so files.
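For reference, both scripts follow the usual TensorFlow custom-op recipe. This is only an illustrative sketch of that recipe (file names are placeholders, not the repository's exact script contents):

# Locate the TensorFlow headers/libs of the Python environment in use
TF_INC=$(python3 -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
TF_LIB=$(python3 -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())')

# Compile the op into a shared object that Python loads at runtime
g++ -std=c++11 -shared some_op.cpp -o some_op.so -fPIC \
    -I$TF_INC -L$TF_LIB -ltensorflow_framework -O2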


If I COPY the repository into the image during the build (COPY . /kpconv), start up the container, call both of the compile bash scripts, and then run the code, Python correctly loads the C++ wrapper (the generated grid_subsampling.cpython-35m-x86_64-linux-gnu.so) and the software runs as expected/intended.

$ sudo docker run -it \
>     -v /<myhostpath>/data_sets:/data \
>     -v /<myhostpath>/_output:/output \
>     --runtime=nvidia kpconv-test /bin/bash
user@eec8553dcb5d:/kpconv$ cd tf_custom_ops 
user@eec8553dcb5d:/kpconv/tf_custom_ops$ sh compile_op.sh 
user@eec8553dcb5d:/kpconv/tf_custom_ops$ cd ..
user@eec8553dcb5d:/kpconv$ cd cpp_wrappers/
user@eec8553dcb5d:/kpconv/cpp_wrappers$ sh compile_wrappers.sh 
running build_ext
building 'grid_subsampling' extension
<Redacted for brevity>
user@eec8553dcb5d:/kpconv/cpp_wrappers$ cd ..
user@eec8553dcb5d:/kpconv$ python training_ModelNet40.py 

Dataset Preparation
*******************

Loading training points
1620.2 MB loaded in 0.6s

Loading test points
411.6 MB loaded in 0.2s
<Redacted for brevity>

This works well and allows me to run the KPConv software.

Also, to note for later, the .so file has the hash

user@eec8553dcb5d:/kpconv/cpp_wrappers/cpp_subsampling$ sha1sum grid_subsampling.cpython-35m-x86_64-linux-gnu.so 
a17eef453f6d2370a15bc2a0e6714c978390c5c3  grid_subsampling.cpython-35m-x86_64-linux-gnu.so

It also has the permissions

user@eec8553dcb5d:/kpconv/cpp_wrappers/cpp_subsampling$ ls -al grid_subsampling.cpython-35m-x86_64-linux-gnu.so 
-rwxr-xr-x 1 user user 561056 Mar 14 02:16 grid_subsampling.cpython-35m-x86_64-linux-gnu.so

However, this produces a clumsy workflow for quickly editing the software for my purposes and quickly re-running it within the container: every change to the code requires a new build of the image. I would much rather mount the KPConv code from the host into the container at runtime, so that edits are "live" inside the running container.

Doing this, i.e. building an image with the Dockerfile at the top of the post (no COPY . /kpconv), performing the same compilation steps, and running the code,

$ sudo docker run -it \
>     -v /<myhostpath>/data_sets:/data \
>     -v /<myhostpath>/KPConv_Tensorflow:/kpconv \
>     -v /<myhostpath>/_output:/output \
>     --runtime=nvidia kpconv-test /bin/bash
user@a82e2c1af21a:/kpconv$ cd tf_custom_ops/
user@a82e2c1af21a:/kpconv/tf_custom_ops$ sh compile_op.sh 
user@a82e2c1af21a:/kpconv/tf_custom_ops$ cd ..
user@a82e2c1af21a:/kpconv$ cd cpp_wrappers/
user@a82e2c1af21a:/kpconv/cpp_wrappers$ sh compile_wrappers.sh 
running build_ext
building 'grid_subsampling' extension
<Redacted for brevity>
user@a82e2c1af21a:/kpconv/cpp_wrappers$ cd ..
user@a82e2c1af21a:/kpconv$ python training_ModelNet40.py 

I receive the following Python ImportError:

user@a82e2c1af21a:/kpconv$ python training_ModelNet40.py 
Traceback (most recent call last):
  File "training_ModelNet40.py", line 36, in <module>
    from datasets.ModelNet40 import ModelNet40Dataset
  File "/kpconv/datasets/ModelNet40.py", line 40, in <module>
    from datasets.common import Dataset
  File "/kpconv/datasets/common.py", line 29, in <module>
    import cpp_wrappers.cpp_subsampling.grid_subsampling as cpp_subsampling
ImportError: /kpconv/cpp_wrappers/cpp_subsampling/grid_subsampling.cpython-35m-x86_64-linux-gnu.so: failed to map segment from shared object
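For what it's worth, the failure can be reproduced without the rest of KPConv by loading the shared object directly (a diagnostic sketch, not part of my original session; dlopen raises the same "failed to map segment" message as an OSError):

user@a82e2c1af21a:/kpconv$ python -c "import ctypes; ctypes.CDLL('cpp_wrappers/cpp_subsampling/grid_subsampling.cpython-35m-x86_64-linux-gnu.so')"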

Why is this Python wrapper for the C++ code only usable when the code is COPY'd into the Docker image, and not when it is mounted as a volume?

This .so file has the same hash and permissions as in the first situation:

user@a82e2c1af21a:/kpconv/cpp_wrappers/cpp_subsampling$ sha1sum grid_subsampling.cpython-35m-x86_64-linux-gnu.so 
a17eef453f6d2370a15bc2a0e6714c978390c5c3  grid_subsampling.cpython-35m-x86_64-linux-gnu.so
user@a82e2c1af21a:/kpconv/cpp_wrappers/cpp_subsampling$ ls -al grid_subsampling.cpython-35m-x86_64-linux-gnu.so 
-rwxr-xr-x 1 user user 561056 Mar 14 02:19 grid_subsampling.cpython-35m-x86_64-linux-gnu.so

On my host machine the file has the following permissions (it is on the host because /kpconv was mounted as a volume). For some reason the container also appears to be in the future; note the timestamps (likely just the container clock running on UTC):

$ ls -al grid_subsampling.cpython-35m-x86_64-linux-gnu.so 
-rwxr-xr-x 1 <myusername> <myusername> 561056 Mar 13 21:19 grid_subsampling.cpython-35m-x86_64-linux-gnu.so

After some research on the error message, it looks like most results are specific to a particular situation, though the majority mention that the error is the result of some sort of permissions issue.

This Unix & Linux Stack Exchange answer, I think, identifies the actual problem: the dynamic loader mmaps the segments of the .so with execute permission, and the load fails if that mapping is refused. But I am a bit too far from my days of working with C++ as an intern in college to understand how to use that to fix this issue. I think the issue lies with the permissions between the container and host, and between the users on each (that is, root in the container, user (from the Dockerfile) in the container, root on the host, and <myusername> on the host).
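If it helps, the refused mapping should be observable directly with strace (a sketch I have not run; it assumes strace is available in the container, which may also require running the container with --cap-add=SYS_PTRACE):

user@a82e2c1af21a:/kpconv$ strace -y -e trace=mmap python -c "import ctypes; ctypes.CDLL('cpp_wrappers/cpp_subsampling/grid_subsampling.cpython-35m-x86_64-linux-gnu.so')" 2>&1 | grep grid_subsampling
# a PROT_EXEC mmap of the .so returning -1 EPERM would confirm the loader is being refused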


I have also attempted to first elevate permissions within the container (using the root password created in the Dockerfile), then compile the code and run the software, but this results in the same issue. I have also tried compiling the code as user in the container but running the software as root, again with the same issue.

Thus, another clue I have found and provide: there is seemingly something different about the .so when it is compiled "only within" the container (no --volume) versus when it is compiled inside the --volume (which is why I compared the file hashes). So maybe it's not so much permissions, but rather how the .so is loaded within the container by the kernel, or how its location within the --volume affects that loading process?
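One way to probe that hunch (a sketch based on my setup; mountpoint as above) is to inspect the mount options backing /kpconv from inside the container:

user@a82e2c1af21a:/kpconv$ grep ' /kpconv ' /proc/mounts
# if the options column contains "noexec", the kernel will refuse to map executable segments from this filesystem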


EDIT: As for an SSCCE, you should be able to clone the linked repository to your machine and use the same Dockerfile. You do not need to specify the /data or /output volumes or alter the code in any way; the program attempts to load the .so before loading any data, so it will simply raise the error and end execution.

If you do not have a GPU, or do not want to install nvidia-runtime, you should be able to change the Dockerfile's base image to tensorflow/tensorflow:1.12.0-devel-py3 and run the code on the CPU.
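That is, the first line of the Dockerfile would become (untested on my end):

FROM tensorflow/tensorflow:1.12.0-devel-py3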


1 Answer

Your problem is caused by the dynamic linker trying to load the library. There could be several root causes for this:

  1. Permissions. The user must have permission to load the library. When mounting file systems in Docker, the owner id and group id on the host are not necessarily the same ids in the container, even though they may carry the same names.
  2. Wrong binary format. The binary was compiled for the wrong platform. This can happen if you run the compile on, for example, macOS and use the result in a Linux container.
  3. Wrong mounting. Mounting with, for example, noexec will also prevent the library from being loaded (see the sketch after this list).
  4. Differences in libraries between the two environments. Because the environment where the library was compiled differs from the one where it is loaded, you might be missing some libraries; use ldd grid_subsampling.cpython-35m-x86_64-linux-gnu.so and ldd -r -d -v grid_subsampling.cpython-35m-x86_64-linux-gnu.so to check all the libraries that are linked.
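For case 3 in particular, you can inspect the mount options on the host and, if needed, remount with exec. An illustrative sketch only; the device UUID, filesystem, and mountpoint are placeholders:

# Inspect the options of the mount backing the source directory
$ findmnt -T /path/to/KPConv_Tensorflow

# /etc/fstab: add an explicit "exec" for the partition (some setups mount
# NTFS partitions noexec by default), then re-mount
UUID=XXXX-XXXX  /mnt/shared  ntfs  defaults,exec  0  0

$ sudo mount -a    # or: sudo mount -o remount /mnt/shared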
jordanvrtanoski
  • I need to digest your points a bit more, but I wanted to quickly note that the `.so` is compiled within the container itself (via the `.sh` files provided within the linked repository). // I think that makes points 2 and 4 inapplicable, but I need to look into the 3rd a bit more. I also plan to `ldd` the `.so`s between the two scenarios and compare the outputs. – KDecker Mar 14 '21 at 05:43
  • If compiled with the same build system, then 2 and 4 are out; what remains is the difference in the permissions. – jordanvrtanoski Mar 14 '21 at 05:53
  • Hmm, reviewing this post (https://stackoverflow.com/questions/55902519/mount-volume-into-docker-container-without-noexec-option) and others related to it: the directory representing the linked volume from the host, in which the code and thus the compiled `.so` reside after the container compiles it, is actually a mounted partition on the host (a separate NTFS partition used to share data between Windows and Ubuntu). Potentially it is mounted with `noexec` (I just used the `fstab` `defaults` option when I set it up). Thus, even though the container compiles the `.so`, it does so into a `noexec` location. I need to check. – KDecker Mar 14 '21 at 05:58
  • Though the bash files which actually compile the `.so` reside on the same partition, and those are executed. Not sure if that is the same "type" of execution though. – KDecker Mar 14 '21 at 06:12
  • @KDecker, can you do `ldd -r -d -v grid_subsampling.cpython-35m-x86_64-linux-gnu.so` and paste the output in the question? – jordanvrtanoski Mar 14 '21 at 06:54
  • Actually, I've solved the issue. The partition on the host that is `--volume`ed into the container had the `noexec` option set (based on `/proc/mounts`), so I added the `exec` option to the drive in the `fstab`, ran `mount -a`, and I can now load the `.so` in the container. // Thanks for the answer and for pointing me in the right direction! – KDecker Mar 14 '21 at 07:03