
I am running an executable in Condor that processes an input image and saves a binary image to a given folder. I run it on 213 images.

The contents of my Condor configuration file are as follows:

universe     = vanilla
executable   = /datasets/me/output_cpen_database/source_codes_techniques/test/vole
arguments    = cmfd -I /datasets/me/cpen_database/scale1/$(Process)/$(Process).png -O /datasets/me/output_cpen_database/scale1/dct/$(Process)/ --numThreads 10 --chan GRAY --featvec DCT --blockSize 16 --minDistEuclidian 50 --kdsort --fastsats --minSameShift 1000 --markRegions --useOrig --writePost --writeMatrix
initialdir   = /datasets/me/output_cpen_database/source_codes_techniques/test
requirements = (OpSysAndVer == "Ubuntu12")
request_cpus   = 5
request_memory = 20000
output       = logs/output-$(Process).log
error        = logs/error-$(Process).log
log          = logs/log-$(Process).log
Notification = Complete
Notify_User = mymail@gmail.com
Queue 214

Some images are processed correctly, but for others I receive the following error by email:

Condor job 1273.47
/datasets/me/output_cpen_database/source_codes_techniques/test/vole cmfd -I /datasets/me/cpen_database/scale1/47/47.png -O    /datasets/me/output_cpen_database/scale1/dct/47/ --numThreads 10 --chan GRAY --featvec DCT --blockSize 16 --minDistEuclidian 50 --kdsort --fastsats --minSameShift 1000 --markRegions --useOrig --writePost --writeMatrix
died on signal 9 (Killed)

I wondered whether this happens because of a lack of memory, but this image (named 47) is no larger than 20 MB (it is actually 16.7 MB).

As I said, Condor runs this executable without problems for some of the other images.

Should I increase request_memory in my configuration file? What is happening here?
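
A rough sketch of why the file size on disk may not settle the memory question: PNG is a compressed format, so the decoded pixel buffer can be much larger than the 16.7 MB file. The function name and the dimensions below are hypothetical (the real size of image 47 is not given here), and the estimate covers only the raw pixels, not whatever the executable allocates while processing them.

```python
def decoded_size_bytes(width, height, channels=1, bytes_per_sample=1):
    """Estimate the size of an uncompressed pixel buffer in bytes.

    This ignores any working memory the processing step needs on top
    of the decoded image, so treat it as a lower bound.
    """
    return width * height * channels * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical dimensions, for illustration only.
    size = decoded_size_bytes(6000, 4000, channels=3)
    print(f"decoded buffer: {size / 1024 ** 2:.1f} MiB")
```

Comparing this kind of estimate for a failing image and a succeeding one could show whether the failures correlate with image dimensions rather than file size.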

mad

1 Answer


Usually, a job dying on signal 9 means problems with some of the shared libraries required by your executable. I would check whether all of the failing jobs die on one particular host. If they do, you could run the code manually on that host and see whether you get a missing shared library error.
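
The host check can be sketched as a small script over the job logs. This is a minimal sketch, assuming the usual HTCondor user log layout: a `001 (cluster.proc.sub) ... Job executing on host: <ip:port...>` event, and a `005 ... Job terminated.` event followed by an indented `Abnormal termination (signal N)` line. The function name `killed_jobs_by_host` is hypothetical, and the regular expressions may need adjusting if your Condor version writes the log differently.

```python
import re
from collections import Counter

# Assumed HTCondor user log event formats (adjust if your logs differ).
EXEC_RE = re.compile(r"^001 \((\d+)\.(\d+)\.\d+\).*Job executing on host: <([^:>?]+)")
TERM_RE = re.compile(r"^005 \((\d+)\.(\d+)\.\d+\).*Job terminated")
SIG_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def killed_jobs_by_host(log_text, signal=9):
    """Count jobs that ended on `signal`, grouped by the host they last ran on."""
    last_host = {}   # (cluster, proc) -> host of the most recent execute event
    current = None   # job whose termination event is currently being read
    counts = Counter()
    for line in log_text.splitlines():
        m = EXEC_RE.match(line)
        if m:
            last_host[(m.group(1), m.group(2))] = m.group(3)
            current = None
            continue
        m = TERM_RE.match(line)
        if m:
            current = (m.group(1), m.group(2))
            continue
        if line.startswith("..."):   # "..." closes each event in the log
            current = None
            continue
        m = SIG_RE.search(line)
        if m and current is not None and int(m.group(1)) == signal:
            counts[last_host.get(current, "unknown")] += 1
    return counts
```

Because the submit file writes one log per process (`logs/log-$(Process).log`), the text of all of those files can be concatenated and passed in as a single string. If every signal 9 count lands on the same host, that host is the one to test manually.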

fpierfed