
My Airflow server periodically fails. When I check the gunicorn logs, the error just before all the workers shut down looks like this:

OperationalError: (psycopg2.OperationalError) could not translate host name "my-airflow-db.l9zijaslosu.us-east-1.rds.amazonaws.com" to address: Name or service not known
 (Background on this error at: http://sqlalche.me/e/e3q8)

I immediately verified that the host name is correct and that the database is accepting requests from other tools.

If I restart the Airflow webserver, the server operates correctly for 4-5 days, and then the same error occurs.

This issue has been asked about before, but it is typically resolved by telling other developers not to use localhost or postgres host names. My host name is a fully qualified host name on AWS's domain. It seems exceedingly unlikely that this is a DNS error on Amazon's part.
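
One way to narrow this down is to check whether name resolution itself fails intermittently on the machine running the webserver, independent of Airflow and psycopg2. The following is a minimal sketch, not a fix: `resolve_host` and `probe` are hypothetical helper names, the port 5432 is an assumed default, and the only library call used is the standard `socket.getaddrinfo`. A `socket.gaierror` here would indicate the same kind of lookup failure that surfaces as "could not translate host name".

```python
import socket
import time


def resolve_host(host, port=5432):
    """Return the sorted unique IP addresses for host.

    Raises socket.gaierror if the name cannot be resolved.
    The port (5432) is an assumed PostgreSQL default.
    """
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def probe(host, attempts=3, delay_seconds=1.0, resolver=resolve_host):
    """Try to resolve host several times and record each outcome.

    Returns a list of ("ok", addresses) or ("dns_error", message) tuples.
    The resolver argument exists only so the lookup can be swapped out.
    """
    results = []
    for attempt in range(attempts):
        try:
            results.append(("ok", resolver(host)))
        except socket.gaierror as exc:
            # A getaddrinfo failure, e.g. "Name or service not known".
            results.append(("dns_error", str(exc)))
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return results
```

Calling `probe("my-airflow-db.l9zijaslosu.us-east-1.rds.amazonaws.com")` on a schedule from the webserver host and logging the results would show whether lookups fail around the time the workers shut down. If they never fail, the resolver used inside the worker process (rather than DNS itself) becomes the more likely suspect.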

Hussein Awala
Brett
  • Can you use the connection if you use the Ad-Hoc Query tool? – Adam Bethke Jul 30 '18 at 16:40
  • Yes. The database is accessible via psql and DataGrip. – Brett Jul 30 '18 at 21:05
  • Sorry, I meant the tool in airflow under the tab Data Profiling>[Ad-Hoc Query](https://airflow.apache.org/profiling.html#adhoc-queries). Also, it looks like this is an RDS instance (I also run my airflow database on RDS); does your `sql_alchemy_conn` look something like this? postgres+psycopg2://$USERNAME:$PASSWORD@my-airflow-db.l9zijaslosu.us-east-1.rds.amazonaws.com:$PORT/$DBNAME – Adam Bethke Jul 30 '18 at 21:43
  • Yes, my `sql_alchemy_conn` looks to be in that format. When I attempt to query the "postgres_default" database via the Ad Hoc tool, I see two errors: "fe_sendauth: no password supplied" and "no data." My specific query was `select count(*) from dag;` – Brett Jul 30 '18 at 21:52
  • Is postgres_default your airflow database? If not, you'd need to add a postgres connection with your airflow database parameters in order to be able to test the connection. I think postgres_default comes without connection credentials (which would explain "no password supplied"). – Adam Bethke Jul 31 '18 at 00:59
  • I added the postgres connection as "airflow_db" and was able to complete a query: `select * from dag` using the Ad Hoc query. – Brett Jul 31 '18 at 13:16
  • Hmmm. I've experienced similar issues on AWS in two instances: 1. the server is overloaded (too small an instance, offline for backup period/resizing), and 2. temporary failure in DNS due to noisy neighbors (but the error message is more explicit). I don't think that's your case, but sharing where I've run into problems JIC – Adam Bethke Aug 02 '18 at 12:19
  • hitting this fairly frequently with a demanding DAG, with around ~50 subtasks. running Airflow on RDS. hoping someone has a workaround. – root Aug 23 '18 at 09:06
  • I am getting the same thing - occasional failures on a DAG with lots of subtasks. Did you ever figure out a fix for this? – kgully Aug 20 '19 at 14:10
  • Please provide the full stack trace. – joebeeson Dec 14 '20 at 21:38
  • I had a similar issue that turned out to be a gevent resolver issue: https://serverfault.com/questions/1080450/intermittent-500-error-caused-by-psycopg2-operationalerror-could-not-translate/1080451#1080451 – Zev Oct 14 '21 at 14:03

0 Answers