This is a fairly long post detailing my problem and the results of my research so far.
Context:
I currently run Rundeck on a GCP server, where it launches Python scripts. The scripts interact with other services through their REST APIs.
Problem:
These scripts normally run well and fast (under 20 seconds), but sometimes a run takes 4 or 5 minutes. This happens with every script, and when a slowdown occurs it affects all scripts running during that period:
Script9 20/03/2019 16:50 in 3 s
Script8 20/03/2019 16:49 in 4 m 50 s
Script7 20/03/2019 16:49 in 4 m 45 s
Script5 20/03/2019 16:49 in 4 m 43 s
Script4 20/03/2019 16:49 in 4 m 33 s
Script3 20/03/2019 16:48 in 2 m 1 s
Script2 20/03/2019 16:48 in 2 m 1 s
Script1 20/03/2019 16:48 in 2 m 1 s
Script8 20/03/2019 16:44 in 3 s
To get more information about what slows the scripts down, I profiled one with python -m cProfile -s tottime,
and it shows that the slowest function is {built-in method _openssl.ENGINE_init}.
cProfile result
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 216.560 216.560 216.560 216.560 {built-in method _openssl.ENGINE_init}
1 10.005 10.005 10.005 10.005 {built-in method time.sleep}
43 2.121 0.049 2.121 0.049 {method 'read' of '_ssl._SSLSocket' objects}
12 0.449 0.037 0.449 0.037 {method 'do_handshake' of '_ssl._SSLSocket' objects}
488 0.267 0.001 0.267 0.001 {built-in method marshal.loads}
18 0.206 0.011 0.206 0.011 {method 'connect' of '_socket.socket' objects}
12 0.110 0.009 0.110 0.009 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
18 0.064 0.004 0.064 0.004 {built-in method _socket.getaddrinfo}
1630/1617 0.045 0.000 0.154 0.000 {built-in method builtins.__build_class__}
3799 0.041 0.000 0.041 0.000 {built-in method posix.stat}
529/1 0.035 0.000 230.937 230.937 {built-in method builtins.exec}
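For reference, the profile was collected with an invocation along these lines (using the script path from the Rundeck wrapper shown under Other information):

```shell
# Profile a script and sort by internal (self) time, as in the output above.
python -m cProfile -s tottime /root/repo/Script/folder/scriptname.py
```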
It looks like the problem comes from _openssl, which is used by the requests library.
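Since ENGINE_init is part of OpenSSL's initialisation, one thing I have not yet ruled out (this is an assumption on my part, not something the profile proves) is a depleted kernel entropy pool, which can make OpenSSL's RNG seeding block on headless VMs. A quick Linux-only check:

```python
# Read the kernel's available entropy (Linux only). Very low values
# (well under a few hundred bits) can make crypto initialisation block.
with open("/proc/sys/kernel/random/entropy_avail") as f:
    entropy_bits = int(f.read())

print("entropy_avail = {} bits".format(entropy_bits))
```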
What I tried:
In order to solve this problem I searched on Google and found nothing that matched my case. I've tried a few things:
- Upgrading the GCP server (the new config is listed under Other information)
- Increasing the Rundeck JVM memory
- Shifting the job schedules by a few seconds so that scripts don't start at the exact same second
- Monitoring the Rundeck server, which showed that the number of TCP connections in TIME_WAIT increases when the problem happens (graph)
None of these things solved the problem.
I also tried to reproduce the problem by hand, but even when I launched multiple scripts in parallel from Rundeck, the problem didn't show up.
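For anyone who wants to reproduce the TIME_WAIT monitoring, here is a small stdlib-only sketch of the kind of count I track (Linux only; it parses /proc/net/tcp, where state 06 means TIME_WAIT):

```python
# Count TCP sockets currently in TIME_WAIT by parsing /proc/net/tcp
# (Linux only). Column 4 ("st") holds the socket state in hex;
# 06 is TIME_WAIT.
def count_time_wait(path="/proc/net/tcp"):
    with open(path) as f:
        next(f)  # skip the header line
        return sum(1 for line in f if line.split()[3] == "06")

print("TIME_WAIT sockets: {}".format(count_time_wait()))
```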
Your time to shine:
If you have already faced this problem, or if you have an idea of what I could try, please let me know.
Thank you for reading this.
Other information:
Rundeck launches each script via a job that runs this wrapper:
#!/bin/bash
set -e
source /root/venv/global/bin/activate
export PYTHONPATH=${PYTHONPATH}:/root/repo/Script
python /root/repo/Script/folder/scriptname.py
deactivate
Python (3.5) libraries and their versions in the venv:
argh==0.26.2
asn1crypto==0.24.0
bcrypt==3.1.6
cachetools==3.0.0
certifi==2018.11.29
cffi==1.11.5
chardet==3.0.4
Click==7.0
cryptography==2.5
Deprecated==1.2.4
emoji==0.5.1
et-xmlfile==1.0.1
future==0.17.1
get==2018.11.19
google-api-core==1.7.0
google-api-python-client==1.7.7
google-auth==1.6.2
google-auth-httplib2==0.0.3
google-cloud-core==0.29.1
google-cloud-logging==1.10.0
googleapis-common-protos==1.5.6
grpcio==1.18.0
httplib2==0.12.0
idna==2.8
jdcal==1.4
mysql-connector==2.1.4
numpy==1.16.0
oauth2client==4.1.3
openpyxl==2.5.14
pandas==0.23.4
paramiko==2.4.2
pathtools==0.1.2
Pillow==5.4.1
post==2018.11.20
protobuf==3.6.1
public==2018.11.20
pyasn1==0.4.5
pyasn1-modules==0.2.3
pycparser==2.19
PyGithub==1.43.4
PyJWT==1.7.1
PyNaCl==1.3.0
pyodbc==4.0.25
python-dateutil==2.7.5
python-resize-image==1.1.18
pytz==2018.9
PyYAML==3.13
query-string==2018.11.20
request==2018.11.20
requests==2.21.0
rsa==4.0
salesforce-bulk-api==1.2.0
simple-salesforce==0.74.2
six==1.12.0
unicodecsv==0.14.1
uritemplate==3.0.0
urllib3==1.24.1
watchdog==0.9.0
wrapt==1.11.1
xmltodict==0.11.0
GCP server config:
- name: n1-standard-4
- vCPUs: 4
- Memory: 15 GB
- Max number of persistent disks: 128
- Max total PD size: 64 TB