0

I'm following this guide on how to set up docker with timescale/wale for continuous archiving.

https://docs.timescale.com/timescaledb/latest/how-to-guides/backup-and-restore/docker-and-wale/#run-the-timescaledb-container-in-docker

Everything runs as expected, but when I get to the final step, I'm seeing:

written to stdout
2022-04-13 07:43:33.349 UTC [27] LOG:  redo done at 0/50000F8
Connecting to wale (172.18.0.3:80)
writing to stdout
-                    100% |********************************|    36  0:00:00 ETA
written to stdout
Connecting to wale (172.18.0.3:80)
wget: server returned error: HTTP/1.0 500 INTERNAL SERVER ERROR
2022-04-13 07:43:34.264 UTC [27] LOG:  selected new timeline ID: 2
2022-04-13 07:43:34.282 UTC [27] LOG:  archive recovery complete
Connecting to wale (172.18.0.3:80)
wget: server returned error: HTTP/1.0 500 INTERNAL SERVER ERROR
2022-04-13 07:43:34.838 UTC [27] LOG:  could not open file "pg_wal/000000010000000000000006": Permission denied
2022-04-13 07:43:34.844 UTC [1] LOG:  database system is ready to accept connections


It looks like the wget to wale is failing? It's connected to the same network as timescaledb_recovered so shouldn't it work? is there some additional config that the docs are missing? Or am I misreading these logs somehow?

Some additional error output from wale log:

['wal-e', '--terse', 'wal-push', '/var/lib/postgresql/data/pg_wal/000000010000000000000012']
Pushing wal file /var/lib/postgresql/data/pg_wal/000000010000000000000012: ['wal-e', '--terse', 'wal-push', '/var/lib/postgresql/data/pg_wal/000000010000000000000012']
172.18.0.2 - - [13/Apr/2022 14:09:17] "GET /wal-push/000000010000000000000012 HTTP/1.1" 200 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000011', '/var/lib/postgresql/data/pg_wal/000000010000000000000011']
Fetching wal 000000010000000000000011: ['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000011', '/var/lib/postgresql/data/pg_wal/000000010000000000000011']
172.18.0.4 - - [13/Apr/2022 14:09:53] "GET /wal-fetch/000000010000000000000011 HTTP/1.1" 200 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000012', '/var/lib/postgresql/data/pg_wal/000000010000000000000012']
Fetching wal 000000010000000000000012: ['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000012', '/var/lib/postgresql/data/pg_wal/000000010000000000000012']
172.18.0.4 - - [13/Apr/2022 14:09:54] "GET /wal-fetch/000000010000000000000012 HTTP/1.1" 200 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000011', '/var/lib/postgresql/data/pg_wal/000000010000000000000011']
Fetching wal 000000010000000000000011: ['wal-e', '--terse', 'wal-fetch', '-p=0', '000000010000000000000011', '/var/lib/postgresql/data/pg_wal/000000010000000000000011']
172.18.0.4 - - [13/Apr/2022 14:09:54] "GET /wal-fetch/000000010000000000000011 HTTP/1.1" 200 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '00000002.history', '/var/lib/postgresql/data/pg_wal/00000002.history']
Fetching wal 00000002.history: ['wal-e', '--terse', 'wal-fetch', '-p=0', '00000002.history', '/var/lib/postgresql/data/pg_wal/00000002.history']
lzop: short read
wal_e.main   CRITICAL MSG: An unprocessed exception has avoided all error handling
        DETAIL: Traceback (most recent call last):
          File "/usr/lib/python3.5/site-packages/wal_e/cmd.py", line 657, in main
            args.prefetch)
          File "/usr/lib/python3.5/site-packages/wal_e/operator/backup.py", line 353, in wal_restore
            self.gpg_key_id is not None)
          File "/usr/lib/python3.5/site-packages/wal_e/worker/worker_util.py", line 58, in do_lzop_get
            return blobstore.do_lzop_get(creds, url, path, decrypt, do_retry=do_retry)
          File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/file_util.py", line 52, in do_lzop_get
            raise exc
          File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/file_util.py", line 64, in write_and_return_error
            key.get_contents_to_file(stream)
          File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/calling_format.py", line 53, in get_contents_to_file
            with open(self.path, 'rb') as f:
        FileNotFoundError: [Errno 2] No such file or directory: '/backups/wal_005/00000002.history.lzo'

        STRUCTURED: time=2022-04-13T14:09:55.216294-00 pid=32
Failed to fetch wal 00000002.history: None
Exception on /wal-fetch/00000002.history [GET]
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.5/site-packages/flask/app.py", line 1816, in full_dispatch_request
    return self.finalize_request(rv)
  File "/usr/lib/python3.5/site-packages/flask/app.py", line 1831, in finalize_request
    response = self.make_response(rv)
  File "/usr/lib/python3.5/site-packages/flask/app.py", line 1957, in make_response
    'The view function did not return a valid response. The'
TypeError: The view function did not return a valid response. The function either returned None or ended without a return statement.
172.18.0.4 - - [13/Apr/2022 14:09:55] "GET /wal-fetch/00000002.history HTTP/1.1" 500 -
['wal-e', '--terse', 'wal-fetch', '-p=0', '00000001.history', '/var/lib/postgresql/data/pg_wal/00000001.history']
Fetching wal 00000001.history: ['wal-e', '--terse', 'wal-fetch', '-p=0', '00000001.history', '/var/lib/postgresql/data/pg_wal/00000001.history']
lzop: short read
wal_e.main   CRITICAL MSG: An unprocessed exception has avoided all error handling
        DETAIL: Traceback (most recent call last):
          File "/usr/lib/python3.5/site-packages/wal_e/cmd.py", line 657, in main
            args.prefetch)
          File "/usr/lib/python3.5/site-packages/wal_e/operator/backup.py", line 353, in wal_restore
            self.gpg_key_id is not None)
          File "/usr/lib/python3.5/site-packages/wal_e/worker/worker_util.py", line 58, in do_lzop_get
            return blobstore.do_lzop_get(creds, url, path, decrypt, do_retry=do_retry)
          File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/file_util.py", line 52, in do_lzop_get
            raise exc
          File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/file_util.py", line 64, in write_and_return_error
            key.get_contents_to_file(stream)
          File "/usr/lib/python3.5/site-packages/wal_e/blobstore/file/calling_format.py", line 53, in get_contents_to_file
            with open(self.path, 'rb') as f:
        FileNotFoundError: [Errno 2] No such file or directory: '/backups/wal_005/00000001.history.lzo'

        STRUCTURED: time=2022-04-13T14:09:55.689548-00 pid=38
Failed to fetch wal 00000001.history: None
Exception on /wal-fetch/00000001.history [GET]
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.5/site-packages/flask/app.py", line 1816, in full_dispatch_request
    return self.finalize_request(rv)
  File "/usr/lib/python3.5/site-packages/flask/app.py", line 1831, in finalize_request
    response = self.make_response(rv)
  File "/usr/lib/python3.5/site-packages/flask/app.py", line 1957, in make_response
    'The view function did not return a valid response. The'
TypeError: The view function did not return a valid response. The function either returned None or ended without a return statement.
172.18.0.4 - - [13/Apr/2022 14:09:55] "GET /wal-fetch/00000001.history HTTP/1.1" 500 -

I've added some additional logs from the wale container that create the error message on running timescaledb-recovered. I'm guessing that there is some issue with the requests timescaledb-recovered is sending because wget works until that continer is started.

Nicole Staline
  • 557
  • 4
  • 15
  • its failing because the server throws an error `server returned error: HTTP/1.0 500 INTERNAL SERVER ERROR`. So your networking works fine / as expected. But the app you have deployed has problems. Perhaps misconfigured. You need to provide more details about that app, config files etc. Otherwise you question cannot be answered. – The Fool Apr 13 '22 at 12:58
  • the error also says `could not open file "pg_wal/000000010000000000000006": Permission denied`. This is likely your issue. Perhaps the container is run as nonroot and has no permissions on that file. – The Fool Apr 13 '22 at 13:02
  • Everything is running through containers though, I would imagine, assuming you aren't changing any of the specs, it would be the same regardless of the machine it's run on. – Nicole Staline Apr 13 '22 at 13:04
  • what spec? You need to add more information to your question. Show how you run the containers. – The Fool Apr 13 '22 at 13:06
  • And I'm fairly certain the permissions issue is unrelated, I've resolved that by adjusting permissions in the past and still see the wget failure. I'm running the containers exactly the same way as the commands are provided in the docs. Not changing anything to rule out potential issues. https://docs.timescale.com/timescaledb/latest/how-to-guides/backup-and-restore/docker-and-wale/#run-the-timescaledb-container-in-docker From step 1 to the final step. – Nicole Staline Apr 13 '22 at 13:08
  • its better to post everything relevant to your question on the question itself instead of refering to external links. You can link the docs for someone who wants to learn more but you should include the exact steps to reproduce your issue on the question. If you would do everything exactly like documented, you likely would not get an error. Even if the permission error is unrelated, your server still throws a 500. So its not a networking issue. The connection is clearly established. – The Fool Apr 13 '22 at 13:18
  • The networking to ` wale (172.18.0.3:80)` is working fine. You can tell this because the web server there provided a response (500). Typically, 5xx level errors indicate an issue on the server side whereas 4xx is reserved for error messages that are caused by a bad client request. Focus on the web server - there is an obvious error that it is telling you about. – Mark Apr 13 '22 at 13:21
  • if you are sure you follow the instruction precisely, and it's not a mistake on your part, you could open an issue on GitHub and point out that the docs seem broken. – The Fool Apr 13 '22 at 13:22
  • I added some more logs/details that will hopefully help. Really struggling to figure out what is causing this. – Nicole Staline Apr 13 '22 at 14:14

1 Answers1

0

This is bizarre, but apparently the critical failure and 500 error are intended to lets postgres know no further segments need to be recovered. Incredibly frustrating.

Nicole Staline
  • 557
  • 4
  • 15