2

We've been seeing our realtime documents not reconnecting themselves properly for some time now after disconnecting/connecting WiFi. In particular Safari has a 100% success rate at reproducing this issue for us. Chrome also is susceptible to this same issue although it does a significantly better job than Safari at getting the Realtime API re-connected.

The issue appears to be related to how/if the bind/save XHR requests will be reported as failed to JS when the browser is not connected to the internet. In the case of Safari these requests never fully fail and instead sit and wait forever as no timeout is specified along with the XHR. Once this happens the Realtime API does not try to reconnect itself.

Below are the files that repro this issue 100% in Safari. They will prompt to wait/disable/enable WiFi. Each step may take ~10-20s. Additional information is in the comments. They are also available directly from https://drive.google.com/file/d/0B1es-bMybSeSUmRqX2JJd3dqRFE/view?usp=sharing. The ClientID and ApiKey may need to be modified to make them work properly.

Is anyone else running into similar issues or seen behavior that could be explained by this? Any reasonable workaround ideas apart from detecting this case and clobbering the document?

Edit: And of course 15min after posting this, a potential workaround popped into my head. Not pretty by any stretch of the imagination and may cause other issues, but hijacking the XHR to manually set a timeout to ensure the requests expire appears to fix the issue. [Edit3: Updated to scoped workaround with 70s timeout.]

Workaround

var timeoutURLs =
[
    'https://drive.google.com/otservice/save',
    'https://drive.google.com/otservice/bind'
];

function shouldSetXHRTimeout(xhr)
{
    var shouldSetTimeout = false;

    var URLMatch = xhr.url.split('?')[0];

    if (timeoutURLs.indexOf(URLMatch) >= 0)
    {
        shouldSetTimeout = true;
    }

    return shouldSetTimeout;
}

function wrapXHR()
{
    var __send = window.XMLHttpRequest.prototype.send;

    window.XMLHttpRequest.prototype.send = function (data)
    {
        if (!this.timeout && shouldWrapXHRTimeout(this)
        {
            this.timeout = 70000;
        }

        __send.call(this, data);
    };
};

wrapXHR();

gdrive_connect.html

<!DOCTYPE html>
<html>
<head>
    <title>Duchess</title>
    <script type="text/javascript" src="//apis.google.com/js/client.js?onload=GoogleApiLoaded" async defer></script>
    <script src="gdrive_connect.js" type="application/javascript" ></script>
</head>

<body>
    <div id="status">Setting up scenario...</div>
    <button id="button" onclick="runScenario()">Run Scenario</button>
</body>

gdrive_connect.js

var clientId = "597847337936.apps.googleusercontent.com";
var REALTIME_MIMETYPE = 'application/vnd.google-apps.drive-sdk';

//
// Problem
//  The realtime lib is not reconnecting itself and pushing changes back to the remote server
//  after getting disconnected from WiFi in some cases. That is, disconnecting WiFi then reconnecting can
//  result in the Realtime API no longer sending or receiving updates for the document that it believes is
//  connected.
//
// More Info
//  We spent a while tracking down the cause of the issue, and it appears to be with how the bind/save calls
//  for the Realtime API are handled if they just 'never return'. In the reconnection failure cases, it
//  appears that these network requests never completely 'fail' and instead indefinitely wait without
//  notifying the XHR handler.
//
// Hardware
//  This was reprod on a MacBook Pro w/ OSX 10.11 on both public WiFi as well as a home WiFi setup.
//
// Software
//  This issue repros in both Chrome and Safari. In Safari it happens 100% of the time and will be a
//  significantly more consistent repro. Chrome it only happens occasionally and can be fixed by opening and
//  closing the laptop. In Chrome this appears to have a different effect on how hung network requests are
//  handled than just disabling/enabling WiFi.
//
//   Safari
//     Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.56 (KHTML, like Gecko) Version/9.0 Safari/601.1.56
//     Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/601.4.4 (KHTML, like Gecko) Version/9.0.3 Safari/601.4.4
//   Chrome
//     Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36
//
// Repro
//  0) Ensure that the browser/machine are properly connected to the internet and authorized.
//  1) Authorize and load realtime doc.
//  2) Make modification to the document.
//  3) Verify modification was saved remotely.
//  4) Disable WiFi on machine.
//  5) Make modification to the document.
//  6) Verify that the modification did not get saved remotely as machine is offline.
//  7) Enable WiFi on machine.
//  8) Check document's saveDelay to verify that the remote change was not propagated.
//  9) Make additional modification to document.
//  10) Verify that none of the changes have been saved remotely.
//
// The repro requires that you manually disable/enable WiFi and should prompt when to do so. All other info
// is spit out into the console about all of the online/offline related status. When a non-network !!!ERRROR
// is displayed, this means that the repro has been successful. The browser is 100% online, the user is
// authenticated but the realtime document is not flushing its changes to the remote server.
//
// Additional Repro
//  While the repro page is enough to 100% consistently cause reconnection issues in Safari, Chrome will fix
//  itself in most cases. Disabling/Enabling the WiFi rapidly right after a 'bind' request is made will
//  generally also get Chrome stuck not reconnecting to the server.
//
// Console Key
//  Delay: The realtime doc's 'saveDelay' field.
//  Modified: Time since last modification was made to the doc.
//  IsOnline: Whether the browser thinks it is online (navigator.onLine).
//  DriveConn: Result of making a separate authenticated request to drive through gapi.
//  TokenValid: Whether the gapi.auth authroization token is still currently valid.
//  IsOpen: The realtime doc's '!isClosed' field.
//
// The realtime document is accessible through 'window.remoteDoc'.
//

var ReproSteps =
[
    'Setting up scenario...',       // 0
    'Click button to start',        // 1
    'Connecting...',                // 2
    'Running... please wait ~10s',  // 3
    'Disable WiFi',                 // 4
    'Enable WiFi',                  // 5
    'Running... please wait ~10s',  // 6
    'Done! See error in console'    // 7
];


var currentStep = 0;
function setReproStep(index)
{
    if (index >= currentStep)
    {
        var msg = ReproSteps[index];

        document.getElementById('status').innerText = msg;
    }
}

function GoogleApiLoaded()
{
    console.log('Google Remote Client Lib Loaded');

    gapi.auth.init();

    gapi.client.setApiKey('AIzaSyB9HEdSJ-nhLJG_ssSSqhI2DX74GSiKSao');

    gapi.load('auth:client,drive-realtime', function ()
    {
        console.log('GAPI Loaded Libs');

        setReproStep(1);
    });
}

function createRealtimeFile(title, description, callback)
{
    console.log('Creating Drive Document');
    gapi.client.drive.files.insert({
        'resource':
        {
            'mimeType': REALTIME_MIMETYPE,
            'title': title,
            'description': description
        }
    }).execute(function (docInfo)
    {
        callback(docInfo, /*newDoc*/true);
    });
}

function openRealtimeFile(title, callback)
{
    gapi.client.load('drive', 'v2', function ()
    {
        gapi.client.drive.files.list(
        {
            'q': 'title='+"'"+title+"' and 'me' in owners and trashed=false"
        }).execute(function (results)
        {
            if (!results.items || results.items.length === 0)
            {
                createRealtimeFile(title, /*DocDescription*/"", callback);
            }
            else
            {
                callback(results.items[0], /*newDoc*/false);
            }
        });
    });
}

function runScenario()
{
    console.log('Page Loaded');

    document.getElementById('button').style.display = 'none';

    setReproStep(2);

    var GScope =
    {
        Drive: 'https://www.googleapis.com/auth/drive.file'
    };

    var handleAuthResult = function (authResult)
    {
        console.log('Requesting Drive Document');

        openRealtimeFile("TESTDOC__", function (docInfo, newDoc)
        {
            if (docInfo && docInfo.id)
            {
                gapi.drive.realtime.load(docInfo.id, onDocLoaded, onDocInitialized, onDocLoadError);
            }
            else
            {
                console.log('Unable to find realtime doc');
                debugger;
            }
        });
    };

    gapi.auth.authorize(
    {
        client_id: clientId,
        scope: [ GScope.Drive ],
        immediate: false
    }, handleAuthResult);
}

function onDocInitialized(model)
{
    console.log('Drive Document Initialized');

    var docRoot = model.createMap();

    model.getRoot().set('docRoot', docRoot);
}

var testMap;
var docDataCounter = 0;
var lastWrite = 0;

var remoteDoc;

function onDocLoaded(doc)
{
    setReproStep(3);

    remoteDoc = doc;

    var docModel = doc.getModel();
    var docRoot = docModel.getRoot();

    console.log('Drive Document Loaded');

    // If the loaded document has already been used to test, delete any previous data.
    if (docRoot.has('testMap'))
    {
        console.log('Previous test detected: ' + docRoot.get('testMap').get('testData'));
        docRoot.delete('testMap');
    }

    docRoot.set('testMap', docModel.createMap());

    testMap = docRoot.get('testMap');
    console.assert(testMap, 'Test map required');

    makeDriveDocChange();

    doc.addEventListener(gapi.drive.realtime.EventType.DOCUMENT_SAVE_STATE_CHANGED, onSaveStateChange);

    beginRunningTest();
}

var VerificationTime = 5000;
var ModificationTime = 10000;

function beginRunningTest()
{
    verifyConnectionState();

    setTimeout(setReproStep, ModificationTime * 2, 4);
}

var verificationCount = 0;
function verifyConnectionState()
{
    setTimeout(verifyConnectionState, VerificationTime);

    var saveDelay = remoteDoc.saveDelay;
    var isClosed = remoteDoc.isClosed;
    var lastModification = Date.now() - lastWrite;
    var browserOnline = navigator.onLine;
    var currentCount = ++verificationCount;

    if (!browserOnline && saveDelay > ModificationTime)
    {
        setReproStep(5);

        setTimeout(setReproStep, ModificationTime * 5, 6);
    }

    var isTokenValid = verifyAuthToken();

    verifyDriveConnection(function (driveConnected)
    {
        console.log('--------------------- ' + currentCount + ' ---------------------');

        console.log(' Delay:      ' + saveDelay);
        console.log(' Modified:   ' + lastModification);
        console.log(' IsOnline:   ' + browserOnline);
        console.log(' DriveConn:  ' + driveConnected);
        console.log(' TokenValid: ' + isTokenValid);
        console.log(' IsOpen:     ' + !isClosed);

        if (saveDelay > VerificationTime && driveConnected && !isClosed && browserOnline && isTokenValid)
        {
            console.error('!!! ERROR: Local document not reconnected to remote server. Scenario done.');

            setReproStep(7);
        }
    });

    if (lastModification > ModificationTime && saveDelay === 0)
    {
        makeDriveDocChange();
    }
}

function onSaveStateChange(e)
{

}

function verifyAuthToken()
{
    var isValid = false;

    var token = gapi.auth.getToken();

    if (token)
    {
        var expireTime = parseInt(token.expires_at) * 1000;

        if (Date.now() < expireTime)
        {
            isValid = true;
        }
    }

    return isValid;
}

function makeDriveDocChange()
{
    testMap.set('testData', ++docDataCounter);

    lastWrite = Date.now();
}

function verifyDriveConnection(cb)
{
    gapi.client.drive.about.get({

    }).execute(function (res)
    {
        cb(res && !res.error);
    });
}

function onDocLoadError(e)
{
    console.log('Doc Load Error: ', e);
    findAndLoadDoc();
}
Grant Watters
  • 660
  • 4
  • 16
  • 1
    Hi Grant, Thanks for the detailed reproduction instructions. I will investigate this issue and see if we can reach a solution. – Brian Cairns Feb 27 '16 at 00:56
  • Thanks for the quick reply Brian! Hope things are going well over there :) – Grant Watters Feb 27 '16 at 01:11
  • OK, I would *not* recommend setting the timeout to 10 seconds. The client to server communication channel used by the Realtime API uses long HTTP requests (60 seconds) which are terminated early (with an error) if you have a timeout set. Setting the timeout to less 60 seconds or less will guarantee that every such request finishes with an error, which triggers the API's exponential backoff logic and will effectively break real-time collaboration. The API will eventually retry and catch up, but this can take up to 60 seconds. If you are going to set a timeout, set it to at least 70 seconds. – Brian Cairns Feb 29 '16 at 19:06
  • Great point Brian, I've updated the post to show a 70s timeout in the workaround and is now much closer to what we are actually using. Were you able to reproduce the issue? – Grant Watters Mar 01 '16 at 02:12
  • How did I miss this question? Can corroborate 90% of the above, and the detailed workaround (should be an answer?) has dropped my shoulders 3 inches. – HeyHeyJC Apr 01 '16 at 15:35
  • @GrantWatters, have you had any experience recently with this? I'm getting an occasional user report of work being lost (or possibly reverted?) after sleeping, and can't track it down. – HeyHeyJC Apr 08 '16 at 17:06
  • We have been successfully using the workaround mentioned in the post to reconnect in all the cases that I have seen. Unfortunately this is one of those problems that is hard to prove as fixed in all cases :(. Assuming that you don't have a local data cache, a potential option for determining if this is the problem is to warn the user (or log) if they are leaving the app/page with a high 'saveDelay' on the doc. – Grant Watters Apr 08 '16 at 17:56
  • @GrantWatters That's what I resorted to, too. I haven't heard any reports since... but we shall see. – HeyHeyJC Apr 19 '16 at 18:42

0 Answers0