Backups take so long that the firewall closes the connection

Question

A bit of a mashup of systems here, so bear with me. Essentially, I'm having some trouble using the Backup Exec agent for Oracle while trying to back up a remote Linux server. The BE agent appears to use RMAN to back up the databases.

The backup server is on one VLAN and the target server on another, with a Cisco ASA firewall providing the only link between them. This is by design, as the backup server is to support numerous clients and each client must be on its own VLAN to prevent them from accessing each other. I have added the recommended ports to the firewall to at least allow the agent to talk to the media server.

The backup starts well enough (indeed, a smaller Oracle database on the same server completes without issue), but the backup of a 200GB database, which would clearly take a few hours, never completes.

I believe the problem to be related to http://www.symantec.com/business/support/index?page=content&id=TECH59632, which says that a CORBA session is established on port 5633 at the start of the backup and used before each RMAN operation but, while data is being transferred, the CORBA session's socket receives no packets. Since the connection timeout on the firewall is 60 mins, the CORBA session is dropped and, when the RMAN agent tries to perform its next action, the whole process bombs. Symantec say this problem was fixed in an earlier version of Backup Exec, but do not detail any additional settings to enforce it.

Setting the connection timeout on the firewall to something high enough to cover the backup window (e.g. 12 hours) seems like the wrong thing to do, as it is an estate-wide change that would also affect the connection lifetime of (for example) web requests to another client's web server.

Moving the Linux server into the same LAN as the backup server is out of the question.

I'm not a Linux guru, but I roughly know my way around. So far, I have tried using libkeepalive (http://libkeepalive.sourceforge.net/) to force the beremote process' sockets to be created with the KEEPALIVE TCP flag, but a quick netstat -top indicates that it is not taking. Either I'm using libkeepalive incorrectly, or it doesn't work for the beremote binary.

I guess I am looking for an option that fits with the environment I am in. I figure I'm looking for one or more of the following:

a way to configure the BE agent to keep the connection alive?
a way to inject the keepalive flag to the existing TCP connection (e.g. via a cronjob)?
a way to tell the Cisco device to increase the connection timeout for a specific source/target (maybe a policy-map)?

Any/all (other) ideas welcome...

J.

RE: Comment by @Weaver

As requested, class-map, policy-map and service-policy entries...

class-map CLS_INSPECTION_TRAFFIC
 match default-inspection-traffic
class-map CLS_ALL_TRAFFIC
 match any
class-map CLS_BACKUPEXEC_CORBA
 description Oracle/DB2 CORBA port for BackupExec traffic
 match port tcp eq 5633
!
!
policy-map type inspect dns PMAP_DNS_INSPECT_SETTINGS
 parameters
  message-length maximum client auto
  message-length maximum 1280
policy-map PMAP_GLOBAL_SERVICE
 class CLS_INSPECTION_TRAFFIC
  inspect dns PMAP_DNS_INSPECT_SETTINGS 
  inspect ftp 
  inspect h323 h225 
  inspect h323 ras 
  inspect rsh 
  inspect rtsp 
  inspect esmtp 
  inspect sqlnet 
  inspect skinny  
  inspect sunrpc 
  inspect xdmcp 
  inspect sip  
  inspect netbios 
  inspect tftp 
  inspect ipsec-pass-thru 
  inspect icmp 
  inspect snmp 
 class CLS_BACKUPEXEC_CORBA
  set connection timeout idle 1:00:00 dcd 
 class CLS_ALL_TRAFFIC
  set connection decrement-ttl
!

Weaver · Accepted Answer · 2011-07-27T00:47:50.607

Background on ASA Timeout/Timers:

The global timeout conn is the TCP virtual circuit (session) idle timer and defaults to 60 minutes. The global timeout udp is for UDP holes and defaults to 2 minutes. The global timeout xlate is for clearing up translations that linger around after a conn has timed out. The conn (TCP) timeout takes precedence over the xlate timeout. The next paragraph further explains the relationship between conn and xlate timers.

If a conn is successfully torn down via TCP teardown, the conn and its xlate go with it (dynamic xlates only; static NAT and static PAT xlates are never removed). If a conn times out, the xlate timer is then taken into account. If the xlate times out first (because you set it really low), it will not take down the connection; the xlate stays until the conn times out.

The ASA has several methods for dealing with the varying timeouts. Conn is one where the global setting can be overridden based on class-map -- this should be preferred over increasing the global setting if possible.

The other interesting feature the ASA possesses is dead connection detection -- DCD. DCD allows you to keep your [global] conn timeout at 60 minutes (the default). When 60 minutes is reached, the ASA acts as a man-in-the-middle and spoofs null-data ACKs to each endpoint, posing as the other endpoint. Sending null data keeps the sequence numbers from incrementing. If both sides respond, the connection's idle timer resets to 0 and begins again. If either side does not respond after a set number of attempts (configurable) in a given period, the conn is removed and the xlate timer gains relevance as described above.
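
As a rough illustration of the decision described above, here is a toy model in Python. It is not ASA code: the function name, the callback shape and the `max_retries=5` placeholder for the configurable attempt count are all assumptions made up for this sketch.

```python
from typing import Callable


def idle_timer_expired(probe_a: Callable[[], bool],
                       probe_b: Callable[[], bool],
                       dcd_enabled: bool = True,
                       max_retries: int = 5) -> str:
    """Toy model of what happens when a conn reaches its idle timeout.

    probe_a / probe_b are hypothetical stand-ins for "did this endpoint
    answer the spoofed null-data ACK?". Returns "reset" (idle timer goes
    back to 0) or "removed" (conn is dropped from the table).
    """
    if not dcd_enabled:
        # Plain idle timeout: the conn is removed without probing anyone.
        return "removed"

    a_ok = b_ok = False
    for _ in range(max_retries):
        # An endpoint that has already answered is not probed again.
        a_ok = a_ok or probe_a()
        b_ok = b_ok or probe_b()
        if a_ok and b_ok:
            return "reset"
    # At least one side stayed silent for every attempt.
    return "removed"
```

With both endpoints alive, as during a long RMAN transfer where the CORBA socket is merely quiet, the model returns "reset", so the conn survives each hour-long idle period.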

I'd recommend configuring a class-map that enables DCD and adding it to your policy. You can match on an ACL or a port (other match types are available as well). Using the port is quick and easy, and will work well if you are certain that TCP/5633 is where the problem sits.

I have used the global_policy below but feel free to adjust as necessary.

class-map BE-CORBA_class
 description Backup Exec CORBA Traffic Class
 match port tcp eq 5633

policy-map global_policy
 class BE-CORBA_class
  ! Choose one of the two commands below
  ! for 8.2(2) and up:
  set connection timeout idle 1:00:00 dcd
  ! for releases prior to 8.2(2):
  set connection timeout tcp 1:00:00 dcd

service-policy global_policy global

@Comment

According to the reference guide -- "A packet can match only one class map in the policy map **for each feature type**."

The key phrase is in bold. A packet crossing an interface can match multiple classes inside of a policy-map, but only if those classes use different "features." If you scroll up just a tad in the aforementioned link you will see the various features listed. That whole page is a goldmine for MPF tidbits.

You mentioned that you have a match any class-map defined and referenced as a class inside the policy-map. If that class makes any TCP and UDP connection limit or timeout changes, then any later class in the policy-map that also matches the traffic will not apply its own TCP and UDP connection limit and timeout changes to that packet.

If you post all the ACLs, class-maps, policy-maps, and service-policies, we can determine for certain.
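
The "one class per feature type" rule can be sketched as a small Python model. This is not how the ASA is implemented: the function and type names are made up for illustration, and the sketch assumes that every `set connection` action falls under one feature type (labelled "connection" here), which matches the behaviour reported in the comments.

```python
from typing import Callable, Dict, List, Tuple

Packet = Dict[str, object]
# One policy-map entry: (class name, match predicate, {feature type: action}).
PolicyClass = Tuple[str, Callable[[Packet], bool], Dict[str, str]]


def apply_policy(policy: List[PolicyClass],
                 packet: Packet) -> Dict[str, Tuple[str, str]]:
    """Return {feature type: (winning class, action)} for one packet.

    Classes are walked in policy-map order. The first matching class that
    configures a feature type claims it; later matching classes are
    ignored for that feature type only.
    """
    applied: Dict[str, Tuple[str, str]] = {}
    for name, matches, actions in policy:
        if not matches(packet):
            continue
        for feature, action in actions.items():
            applied.setdefault(feature, (name, action))
    return applied


def is_corba(p: Packet) -> bool:
    return p["proto"] == "tcp" and p["dport"] == 5633


def match_any(p: Packet) -> bool:
    return True


# Ordering that hides the DCD timeout: the match-any class comes first.
broken: List[PolicyClass] = [
    ("CLS_ALL_TRAFFIC", match_any, {"connection": "decrement-ttl"}),
    ("CLS_BACKUPEXEC_CORBA", is_corba,
     {"connection": "timeout idle 1:00:00 dcd"}),
]
# Ordering that works: the specific class comes before the match-any class.
working: List[PolicyClass] = [broken[1], broken[0]]

pkt: Packet = {"proto": "tcp", "dport": 5633}
```

In this model, `apply_policy(broken, pkt)` hands the "connection" feature to `CLS_ALL_TRAFFIC`, so the DCD timeout never applies, while `apply_policy(working, pkt)` hands it to `CLS_BACKUPEXEC_CORBA`.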

If this works, you win +1000 internets. I might not know until the next full backup runs on Friday, though, as the daily incrementals might not take longer than an hour. Even if the DCD approach doesn't work, I should certainly be able to modify the timeout specifically for that port. — jimbobmcgee, Jul 25 '11 at 13:40
@Weaver - OK, my first tests with this didn't work out. I did a `packet-tracer` on traffic that would match that port and I could not see the relevant `class-map` in the `CONN_SETTINGS` phase. I think it is because only one `class` is matched in a single `policy-map` and I had a `match any` class prior to my `match port` class. Am I correct in thinking that it will only match one, or is it that the `packet-tracer` command is only showing one? `policy-map` seems very limited if it can't handle multiple matching classes, especially as it does not seem to give you the means to manually order — jimbobmcgee, Jul 26 '11 at 09:40
@user49761 - Responded in the answer. Too much to include here. — Weaver, Jul 27 '11 at 00:48
@Weaver - The `policy-map` I'm using is pretty simple, so I manually removed and re-added the `match any` clause. This has allowed the `packet-tracer` to return the entry I need in the `CONN_SETTINGS` phase. When I have a backup that lasts longer than an hour, I'll know for certain whether it has worked. I have added the `class-map`, `policy-map` and `service-policy` entries to the question above. It must be because I am doing `set connection decrement-ttl` in my `match any`. I had read your linked reference before, but didn't immediately grasp that `set connection` could not run twice. — jimbobmcgee, Jul 27 '11 at 17:58
@user49761 - In your current config a packet with TCP/5633 will match `CLS_BACKUPEXEC_CORBA` class first as it appears earlier in the `policy-map`. Conn/Timeout changes will be applied according to the `class-map`. The packet will then match `CLS_ALL_TRAFFIC` class and the Conn/Timeout changes will ***not*** be applied to the TCP/5633 packet as *only one* Conn/Timeout feature can be performed per packet in a `policy-map`. Sounds like you had it the other way around before -- preventing the wanted class-map Conn/Timeout changes from being applied for the TCP/5633 flows. You got it. — Weaver, Jul 27 '11 at 18:57
Well, my Friday backup ran like a charm; took 15 hours and the connection stayed up for all of it. Thanks @Weaver — jimbobmcgee, Aug 01 '11 at 10:04

score 1 · Answer 2 · answered Jul 20 '11 at 19:28

As much as I'm not a fan of applications taking their toys and going home (and failing the backup) when one single TCP session gets killed, in this case I'd say just up the ASA's TCP session timeout.

Putting a hard limit on session length at all is really just a product of the ASA's need to track all connections to maintain state (and usually, NAT) - if you're running against your device's connection limit, then it may be an issue, but otherwise, just crank it up to 6 hours or something.

Unless both nodes at the ends of a TCP session go dark, the ASA will bear witness to one end or the other ending the connection when it ends naturally, and tear down the connection then (or trigger the shorter half-closed connection timeout), so you're unlikely to end up with a ton of dead connections clogging things up. The endpoint devices have an interest in tearing down useless connections, too - web servers are a good example, as they'll usually have much shorter connection timeouts than your ASA.

Shane Madden

I might try this over the weekend. I guess, by rights, that most connections behave normally and that normally-behaving connections will send the necessary close packet (FIN? RST?) when they are finished, anyway, so there wouldn't be too much of a concern regarding a buildup of connections/translates? – jimbobmcgee Jul 21 '11 at 13:59
Yup - depends on the load and connection count running on the firewall normally, but having a few more stray connections in the table from the small percentage of connections that don't close properly is probably not going to have a lot of impact. – Shane Madden Jul 21 '11 at 14:27
I assume that the xlate timeout would also have to be increased? – jimbobmcgee Jul 21 '11 at 16:36
Assuming the connection is being translated, then yes. If no NAT rule applies to the traffic, then no. – Shane Madden Jul 21 '11 at 16:40
As noted in my answer, the xlate timeout does not need to be increased. The conn timeout takes precedence. xlate timeout is only evaluated after conn timeout, even if xlate is shorter. – Weaver Jul 23 '11 at 06:16
OK, as I didn't get @Weaver's answer until today, I left the backup running with `timeout conn 8:00:00` in place. The backup did finish and there didn't appear to be any notable effect on service, so +1 for @Shane, but I have reverted it now, to try @Weaver's approach. – jimbobmcgee Jul 25 '11 at 13:39
@user49761 `timeout conn` should be increased, `timeout xlate` can be left lower - As @Weaver explained, the translation should stick for the lifetime of the connection. – Shane Madden Jul 25 '11 at 14:18

score 0 · Answer 3 · answered Jul 20 '11 at 19:31

You might consider using a generic TCP proxy on the remote Linux machine that answers and forwards the connections from Backup Exec to the local CORBA port. (You could easily arrange for the Backup Exec server to connect to this proxy by way of the NAT rules on your firewall.) That TCP proxy would need to have the SO_KEEPALIVE option set on the listening socket it creates. My proxy of choice is rinetd, but a quick look at the source shows that it does not set the SO_KEEPALIVE option on its listening socket (so you'd have to modify it to get the behavior you want). There may be another generic TCP proxy that does set SO_KEEPALIVE by default (or as an option), but I'm not aware of one off the top of my head.
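
To make the idea concrete, here is a minimal sketch of such a proxy in Python. It is an illustration under stated assumptions, not a tested replacement for rinetd: the function names and the keepalive timings are hypothetical choices. It sets keepalive explicitly on each accepted connection, and lowers the idle time because the Linux default of 7200 seconds before the first probe is longer than the firewall's 60 minute timeout. Whether the ASA treats keepalive probes as activity is questioned in the comments and is not verified here.

```python
import socket
import threading
from typing import Tuple

Address = Tuple[str, int]


def enable_keepalive(sock: socket.socket, idle: int = 600,
                     interval: int = 60, count: int = 5) -> None:
    """Turn on TCP keepalive; the tuning options are applied where available."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # present on Linux
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


def pump(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes one way until src closes, then half-close dst."""
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def handle(client: socket.socket, target: Address) -> None:
    """Relay one client connection to the target in both directions."""
    enable_keepalive(client)  # this is the leg that crosses the firewall
    try:
        upstream = socket.create_connection(target)
    except OSError:
        client.close()
        return
    back = threading.Thread(target=pump, args=(upstream, client), daemon=True)
    back.start()
    pump(client, upstream)
    back.join()
    client.close()
    upstream.close()


def serve(listen_addr: Address, target: Address) -> None:
    """Accept connections forever, one relay thread per connection."""
    with socket.create_server(listen_addr) as srv:
        while True:
            client, _ = srv.accept()
            threading.Thread(target=handle, args=(client, target),
                             daemon=True).start()
```

Calling `serve(("0.0.0.0", 15633), ("127.0.0.1", 5633))` would relay connections arriving on port 15633 (an arbitrary port picked for this example) to the local CORBA port.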

Another option might be to bring up an SSH tunnel to the remote machine as part of a pre-job script w/ the SSH client set to either use SO_KEEPALIVE or SSH null packets to keep the connection alive.

Would an IPSec connection include Keepalive? I can add a policy to the Backup Exec (Windows) server to force connections between the two IP addresses to use IPSec. I assume I could do the same for Linux, too (not that I know how currently!) — jimbobmcgee, Jul 21 '11 at 14:01
@Evan Does keep-alive mode preclude the ASA's connection timers? The impression that I get is that if you've got `timeout conn 1:00:00` set, then that TCP tunnel's going down at an hour, regardless - but I haven't actually tested! — Shane Madden, Jul 21 '11 at 14:23
