
I am trying to run a Python script using Nextflow and Docker. I use the Dockerfile below to create a Docker image, and the Nextflow script simply launches a Python script. The issue is that when I run the same Python command from within the Docker container (in interactive mode) it works fine, but when I launch it through Nextflow with the Docker container it throws an error.

Dockerfile:

#!/usr/local/bin/docker
# -*- version: 20.10.2 -*-

############################################
## MULTI-STAGE CONTAINER CONFIGURATION ##
FROM python:3.6.2
RUN apt-get update && apt-get install -y \
    apt-transport-https \
    software-properties-common \
    unzip \
    curl
RUN wget -O- https://apt.corretto.aws/corretto.key | apt-key add - && \
    add-apt-repository 'deb https://apt.corretto.aws stable main' && \
    apt-get update && \
    apt-get install -y java-1.8.0-amazon-corretto-jdk


############################################
## PHEKNOWLATOR (PKT_KG) PROJECT SETTINGS ##
# create needed project directories
WORKDIR /PKT
RUN mkdir -p /PKT
RUN mkdir -p /PKT/resources
RUN mkdir -p /PKT/resources/construction_approach
RUN mkdir -p /PKT/resources/edge_data
RUN mkdir -p /PKT/resources/knowledge_graphs
RUN mkdir -p /PKT/resources/node_data
RUN mkdir -p /PKT/resources/ontologies
RUN mkdir -p /PKT/resources/processed_data
RUN mkdir -p /PKT/resources/relations_data

# copy scripts/files needed to run pkt_kg
COPY pkt_kg /PKT/pkt_kg
COPY Main.py /PKT
COPY setup.py /PKT
COPY README.rst /PKT
COPY resources /PKT/resources

# download and copy needed data
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/edge_source_list.txt && mv edge_source_list.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ontology_source_list.txt && mv ontology_source_list.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/resource_info.txt && mv resource_info.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/subclass_construction_map.pkl && mv subclass_construction_map.pkl resources/construction_approach/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PheKnowLator_MergedOntologies.owl && mv PheKnowLator_MergedOntologies.owl resources/knowledge_graphs/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/node_metadata_dict.pkl && mv node_metadata_dict.pkl resources/node_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt && mv DISEASE_MONDO_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt && mv ENSEMBL_GENE_ENTREZ_GENE_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt && mv ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt && mv GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEx_TISSUE_CELL_MAP.txt && mv HPA_GTEx_TISSUE_CELL_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt && mv MESH_CHEBI_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt && mv PHENOTYPE_HPO_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/STRING_PRO_ONTOLOGY_MAP.txt && mv STRING_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt && mv UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/INVERSE_RELATIONS.txt && mv INVERSE_RELATIONS.txt resources/relations_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/RELATIONS_LABELS.txt && mv RELATIONS_LABELS.txt resources/relations_data/

# install needed python libraries
RUN pip install --upgrade pip setuptools
WORKDIR /PKT
RUN pip install .


############################################
## GLOBAL ENVIRONMENT SETTINGS ##
# copy files needed to run docker container
COPY entrypoint.sh /PKT

# update permissions for all files
RUN chmod -R 755 /PKT

# set OWLTools memory (set to a high value; the system will only use available memory)
ENV OWLTOOLS_MEMORY=500g
RUN echo $OWLTOOLS_MEMORY

# set Python environment encoding
RUN export PYTHONIOENCODING=utf-8

Name of the Docker image: pkt:2.0.0

Nextflow script:

process run_PKTBaseRun {

    echo True

    container 'pkt:2.0.0'
    publishDir "${params.outDir}", mode: 'copy'

    output:
    file '*' into output_ch

    script:
    """
    which python
    $PWD
    pwd
    python /PKT/Main.py --onts /PKT/resources/ontology_source_list.txt \
                --edg /PKT/resources/edge_source_list.txt \
                --res /PKT/resources/resource_info.txt \
                --out /PKT/resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no
    """
}

Now when I execute:

nextflow run main.nf

this fails with an error from glob.glob: inside the Docker container it does not list the files it is supposed to find.

However, when I simply run the same Python command inside the Docker container, it runs seamlessly.

> docker run -it pkt:2.0.0 /bin/bash

/PKT> python Main.py --onts resources/ontology_source_list.txt \
            --edg resources/edge_source_list.txt \
            --res resources/resource_info.txt \
            --out resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no

It is only when I combine Nextflow with Docker that this code throws errors. I have verified that the Python being used is the one inside the container.

Questions:

  1. Any ideas/thoughts to make it work?

Interestingly:

  • the output of which python --> the Python inside the container
  • BUT the output of $PWD --> the directory from where Nextflow is launched
  • the output of pwd --> the Nextflow work directory

  2. When we add a container to a Nextflow process, are the commands inside that process (run_PKTBaseRun) not run from the container's workdir? Should the value of pwd therefore not be the container workdir instead of the Nextflow workdir?

All the required files have been added to the docker image.

  3. Is there a way to ensure that the commands within the script section of the Nextflow process are run from the Docker root/workdir?

The idea behind this Nextflow-and-Docker setup is to eventually run it on AWS Batch using the AWS CLI. But before running it on AWS Batch, I want to make sure it runs fine on the local server.

Looking forward to your suggestions and ideas. Thank you.

deepesh
  • Can you post the error message you get inside the container when running it with Nextflow? `error related to glob.glob` is a bit vague – Pallie Mar 02 '21 at 08:31

2 Answers


Try escaping it as \$PWD, which will give you the Nextflow process work dir that is mounted into the Docker container. I'm curious whether you have solved it some other way?

Try running this in the Nextflow process script:

export pdir=\$PWD
echo \$pdir
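
As a minimal sketch of the difference (a toy process, not the asker's pipeline): in the setup above, an unescaped $PWD is interpolated by Nextflow on the host before the task runs (the launch directory), while an escaped \$PWD is resolved by bash inside the container at run time (the mounted task work directory).

process show_dirs {

    echo true

    container 'pkt:2.0.0'

    script:
    """
    echo "Resolved by Nextflow before the task runs (launch dir): $PWD"
    echo "Resolved by bash inside the container (task work dir): \$PWD"
    """
}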
mahaswap

Bit of an old question, but for future Googlers: Nextflow does quite a bit of behind-the-scenes work when running Docker, including mounting files into the container and setting the working directory. This is needed so that commands generally run seamlessly within a process with the expected inputs. However, it means that some Dockerfile settings, such as WORKDIR, will be overridden.
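
To illustrate (a simplification, not the exact command Nextflow generates), the per-task wrapper ends up invoking Docker roughly along these lines, which is why the image's WORKDIR is overridden:

# Simplified illustration of what Nextflow's generated .command.run does for a
# Docker task; the actual flags, mount points and paths vary by version and config.
#   -v bind-mounts the Nextflow work dir into the container
#   -w sets the task directory (not the image WORKDIR) as the working directory
docker run -i \
    -v /path/to/work:/path/to/work \
    -w /path/to/work/ab/1234567890 \
    pkt:2.0.0 /bin/bash -ue .command.sh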

Looking at the examples above, I have a couple of suggestions:

  1. It's usually better to stage external data into the Nextflow process rather than saving it into the container (just give Nextflow the URL as a path and it will know to download it for you); see the sketch after this list.
  2. Try not to rely on a specific working directory within the container; instead go for packaged installs that add command-line tools to the PATH.
    • It's a bit difficult to tell whether you're doing this already: there is a pip install ., but the Nextflow script runs an absolute path directly.
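
As a rough sketch of the first suggestion (the process name and channel are illustrative, not the asker's actual pipeline; it assumes a Nextflow version that can stage remote HTTP(S) files):

// Illustrative only: declare one of the remote resource files as a process
// input instead of baking it into the image at build time.
params.resInfo = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/resource_info.txt'

resInfo_ch = Channel.fromPath(params.resInfo)

process check_staging {

    container 'pkt:2.0.0'

    input:
    file resource_info from resInfo_ch

    script:
    """
    # Nextflow downloads the file and links it into the task work dir,
    # so the script does not need to know where it came from
    ls -l ${resource_info}
    """
}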

One of the benefits of keeping the Dockerfile as slim as possible is that it makes your pipeline more portable. If your installed tool is simple enough, other people are more likely to be able to run it on systems that don't have Docker installed (using Singularity, Conda, etc. instead).

If you really, really need to work within a specific directory in the container, then adding a cd command to the Nextflow script should work. But bear in mind that your input files will be located under the work directory path inside the container, which will vary from task to task.
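
A minimal sketch of that approach, reusing the script block from the question (untested; note that any declared outputs still need to end up under the task work directory for Nextflow to collect and publish them):

script:
"""
# work around Nextflow overriding the image WORKDIR
cd /PKT
python Main.py --onts resources/ontology_source_list.txt \
            --edg resources/edge_source_list.txt \
            --res resources/resource_info.txt \
            --out resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no
"""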

ewels