2

I'm using Go routines to send queries to PostgreSQL master and slave nodes in parallel. The first host that returns a valid result wins. Error cases are outside the scope of this question.

The caller is the only one that cares about the contents of a *sql.Rows object, so intentionally my function doesn't do any operations on those. I use buffered channels to retrieve return objects from the Go routines, so there should be no Go routine leak. Garbage collection should take care of the rest.

There is a problem I haven't taught about properly: the Rows objects that remain behind in the channel are never closed. When I call this function from a (read only) transaction, tx.Rollback() returns an error for every instance of non-closed Rows object: "unexpected command tag SELECT".

This function is called from higher level objects:

func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, error) {
    rc := make(chan *sql.Rows, len(xs))
    ec := make(chan error, len(xs))
    for _, x := range xs {
        go func(x executor) {
            rows, err := x.QueryContext(ctx, query, args...)
            switch { // Make sure only one of them is returned
            case err != nil:
                ec <- err
            case rows != nil:
                rc <- rows
            }
        }(x)
    }

    var me MultiError
    for i := 0; i < len(xs); i++ {
        select {
        case err := <-ec:
            me.append(err)
        case rows := <-rc: // Return on the first success
            return rows, nil
        }
    }
    return nil, me.check()
}

Executors can be *sql.DB, *sql.Tx or anything that complies with the interface:

type executor interface {
    ExecContext(ctx context.Context, query string, args ...interface{}) (sql.Result, error)
    QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error)
    QueryRowContext(ctx context.Context, query string, args ...interface{}) *sql.Row
}

Rollback logic:

func (mtx MultiTx) Rollback() error {
    ec := make(chan error, len(mtx))
    for _, tx := range mtx {
        go func(tx *Tx) {
            err := tx.Rollback()
            ec <- err
        }(tx)
    }
    var me MultiError
    for i := 0; i < len(mtx); i++ {
        if err := <-ec; err != nil {
            me.append(err)
        }
    }
    return me.check()
}

MultiTx is a collection of open transactions on multiple nodes. It is a higher level object that calls multiQuery

What would be the best approach to "clean up" unused rows? Options I'm thinking about not doing:

  1. Cancel the context: I believe it will work inconsistently, multiple queries might already have returned by the time cancel() is called
  2. Create a deferred Go routine which continues to drain the channels and close the rows objects: If a DB node is slow to respond, Rollback() is still called before rows.Close()
  3. Use a sync.WaitGroup somewhere in the MultiTx type, maybe in combination with (2): This can cause Rollback to hang if one of the nodes is unresponsive. Also, I wouldn't be sure how I would implement that.
  4. Ignore the Rollback errors: Ignoring errors never sounds like a good idea, they are there for a reason.

What would be the recommended way of approaching this?

Edit:

As suggested by @Peter, I've tried canceling the context, but it seems this also invalidates all the returned Rows from the query. On rows.Scan I'm getting context canceled error at the higher level caller.

This is what I've done so far:

func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, error) {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()

    rc := make(chan *sql.Rows, len(xs))
    ec := make(chan error, len(xs))
    for _, x := range xs {
        go func(x executor) {
            rows, err := x.QueryContext(ctx, query, args...)
            switch { // Make sure only one of them is returned
            case err != nil:
                ec <- err
            case rows != nil:
                rc <- rows
                cancel() // Cancel on success
            }
        }(x)
    }

    var (
        me   MultiError
        rows *sql.Rows
    )
    for i := 0; i < len(xs); i++ {
        select {
        case err := <-ec:
            me.append(err)
        case r := <-rc:
            if rows == nil { // Only use the first rows
                rows = r
            } else {
                r.Close() // Cleanup remaining rows, if there are any
            }
        }
    }
    if rows != nil {
        return rows, nil
    }

    return nil, me.check()
}

Edit 2:

@Adrian mentioned:

we can't see the code that's actually using any of this.

This code is reused by type methods. First there is the transaction type. The issues in this question are appearing on the Rollback() method above.

// MultiTx holds a slice of open transactions to multiple nodes.
// All methods on this type run their sql.Tx variant in one Go routine per Node.
type MultiTx []*Tx

// QueryContext runs sql.Tx.QueryContext on the tranactions in separate Go routines.
// The first non-error result is returned immediately
// and errors from the other Nodes will be ignored.
//
// If all nodes respond with the same error, that exact error is returned as-is.
// If there is a variety of errors, they will be embedded in a MultiError return.
//
// Implements boil.ContextExecutor.
func (mtx MultiTx) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
    return multiQuery(ctx, mtx2Exec(mtx), query, args...)
}

Then there is:

// MultiNode holds a slice of Nodes.
// All methods on this type run their sql.DB variant in one Go routine per Node.
type MultiNode []*Node

// QueryContext runs sql.DB.QueryContext on the Nodes in separate Go routines.
// The first non-error result is returned immediately
// and errors from the other Nodes will be ignored.
//
// If all nodes respond with the same error, that exact error is returned as-is.
// If there is a variety of errors, they will be embedded in a MultiError return.
//
// Implements boil.ContextExecutor.
func (mn MultiNode) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
    return multiQuery(ctx, nodes2Exec(mn), query, args...)
}

These methods the public wrappers around the multiQuery() function. Now I realize that just sending the *Rows into a buffered channel to die, is actually a memory leak. In the transaction cases it becomes clear, as Rollback() starts to complain. But in the non-transaction variant, the *Rows inside the channel will never be garbage collected, as the driver might hold reference to it until rows.Close() is called.

I've written this package to by used by an ORM, sqlboiler. My higher level logic passes a MultiTX object to the ORM. From that point, I don't have any explicit control over the returned Rows. A simplistic approach would be that my higher level code cancels the context before Rollback(), but I don't like that:

  1. It gives a non-intuitive API. This (idiomatic) approach would break:
ctx, cancel = context.WithCancel(context.Background())
defer cancel()
tx, _ := db.BeginTx(ctx)
defer tx.Rollback()

  1. The ORM's interfaces also specify the regular, non-context aware Query() variants, which in my package's case will run against context.Background().

I'm starting to worry that this broken by design... Anyway, I will start by implementing a Go routine that will drain the channel and close the *Rows. After that I will see if I can implement some reasonable waiting / cancellation mechanism that won't affect the returned *Rows

Tim
  • 1,585
  • 1
  • 18
  • 23
  • It's hard to offer any kind of recommendation here because we can't see the code that's actually using any of this. Something needs to have clear ownership of each `Rows` and it needs to close it at the appropriate time; that's a requirement of using them. – Adrian Feb 21 '20 at 16:03
  • This is part of a package. `Rows` are only supposed to pass through. The caller, should get one instance of `Rows` returned, the one that returned the quickest. The rest should be discarded. – Tim Feb 21 '20 at 16:08
  • So this is basically about the `Rows` that are not being returned to the caller. The one that is returned, is owned by the caller he should close it. The remaining ones are "stuck" in my channel and are ownership lies within this function. – Tim Feb 21 '20 at 16:10
  • Definitely cancel the context. That's what it's there for and alleviates any problems due to unresponsive nodes. I would probably implement all of the four points, actually. Ignoring the error returned by Rollback is fine in my book (I almost always use it with defer). Most of the time there's nothing useful you can do with it anyway. – Peter Feb 21 '20 at 16:26
  • @Peter, context cancellation didn't work as expected, see my edit. – Tim Feb 22 '20 at 14:13
  • @Adrian I've given more background on usage in Edit 2 – Tim Feb 22 '20 at 18:22

2 Answers2

1

I think that the function below will do what you require with the one provisio being that the context passed in should be cancelled when you are done with the results (otherwise one context.WithCancel will leak; I cannot see a way around that as cancelling it within the function will invalidate the returned sql.Rows).

Note that I have not had time to test this (would need to setup a database, implement your interfaces etc) so there may well be a bug hidden in the code (but I believe the basic algorithm is sound)

// queryResult holds the goroutine# and the result from that gorouting (need both so we can avoid cancelling the relevant context)
type queryResult struct {
    no   int
    rows *sql.Rows
}

// multiQuery - Executes multiple queries and returns either the first to resutn a result or, if all fail, a multierror summarising the errors
// Important: This should be used for READ ONLY queries only (it is possible that more than one will complete)
// Note: The ctx passed in must be cancelled to avoid leaking a context (this routine cannot cancel the context used for the winning query)
func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, error) {
    noOfQueries := len(xs)
    rc := make(chan queryResult) // Channel for results; unbuffered because we only want one, and only one, result
    ec := make(chan error)       // errors get sent here - goroutines must send a result or 1 error
    defer close(ec)              // Ensure the error consolidation go routine will complete

    // We need a way to cancel individual goroutines as we do not know which one will succeed
    cancelFns := make([]context.CancelFunc, noOfQueries)

    // All goroutines must terminate before we exit (otherwise the transaction maybe rolled back before they are cancelled leading to "unexpected command tag SELECT")
    var wg sync.WaitGroup
    wg.Add(noOfQueries)

    for i, x := range xs {
        var queryCtx context.Context
        queryCtx, cancelFns[i] = context.WithCancel(ctx)
        go func(ctx context.Context, queryNo int, x executor) {
            defer wg.Done()

            rows, err := x.QueryContext(ctx, query, args...)
            if err != nil {
                ec <- err // Error collection go routine guaranteed to run until all query goroutines complete
                return
            }

            select {
            case rc <- queryResult{queryNo, rows}:
                return
            case <-ctx.Done(): // If another query has already transmitted its results these should be thrown away
                rows.Close() // not strictly required because closed context should tidy up
                return
            }
        }(queryCtx, i, x)
    }

    // Start go routine that will send a MultiError to a channel if all queries fail
    mec := make(chan MultiError)
    go func() {
        var me MultiError
        errCount := 0
        for err := range ec {
            me.append(err)
            errCount += 1
            if errCount == noOfQueries {
                mec <- me
                return
            }

        }
    }()

    // Wait for one query to succeed or all queries to fail
    select {
    case me := <-mec:
        for _, cancelFn := range cancelFns { // not strictly required so long as ctx is eventually cancelled
            cancelFn()
        }
        wg.Wait()
        return nil, me.check()
    case result := <-rc:
        for i, cancelFn := range cancelFns { // not strictly required so long as ctx is eventually cancelled
            if i != result.no { // do not cancel the query that returned a result
                cancelFn()
            }
        }
        wg.Wait()
        return result.rows, nil
    }
}
Brits
  • 14,829
  • 2
  • 18
  • 31
  • Hi, thanks for the suggestion. There are some helpful pointers. I'm not comfortable with context leak here, but your approach gave me inspiration for other solutions. I will revert once I'm done :) – Tim Feb 24 '20 at 11:59
  • "I'm not comfortable with context leak here" - That is only an issue if you don't pass in a context that is either cancelled or has a timeout set (and in most cases I would expect you to be doing this in any event). The advantage of this over your final solution is that the additional queries are cancelled before finnishing (potentially saving resources on the database server) however the importance of this will depend upon your usage. – Brits Feb 24 '20 at 18:28
0

Thanks to the comments from @Peter and the answer of @Brits, I got fresh ideas on how to approach this.

Blue print

3 out of 4 proposals from the question were needed to be implemented.

1. Cancel the Context

mtx.QueryContext() creates a descendant context and sets the CancelFunc in the MultiTx object.

The cancelWait() helper cancels an old context and waits for MultiTX.Done if its not nil. It is called on Rollback() and before every new query.

2. Drain the channel

In multiQuery(), Upon obtaining the first successful Rows, a Go routine is launched to drain and close the remaining Rows. The rows channel no longer needs to be buffered.

An additional Go routine and a WaitGroup is used to close the error and rows channels.

3. Return a done channel

Instead of the proposed WaitGroup, multiQuery() returns a done channel. The channel is closed once the drain & close routine has finished. mtx.QueryContext() sets done the channel on the MultiTx object.

Errors

Instead of the select block, only drain the error channel if there are now Rows. The error needs to remain buffered for this reason.

Code

// MultiTx holds a slice of open transactions to multiple nodes.
// All methods on this type run their sql.Tx variant in one Go routine per Node.
type MultiTx struct {
    tx      []*Tx
    done    chan struct{}
    cancels context.CancelFunc
}

func (m *MultiTx) cancelWait() {
    if m.cancel != nil {
        m.cancel()
    }
    if m.done != nil {
        <-m.done
    }

    // reset
    m.done, m.cancel = nil, nil
}

// Context creates a child context and appends CancelFunc in MultiTx
func (m *MultiTx) context(ctx context.Context) context.Context {
    m.cancelWait()
    ctx, m.cancel = context.WithCancel(ctx)
    return ctx
}

// QueryContext runs sql.Tx.QueryContext on the tranactions in separate Go routines.
func (m *MultiTx) QueryContext(ctx context.Context, query string, args ...interface{}) (rows *sql.Rows, err error) {
    rows, m.done, err = multiQuery(m.context(ctx), mtx2Exec(m.tx), query, args...)
    return rows, err
}

func (m *MultiTx) Rollback() error {
    m.cancelWait()
    ec := make(chan error, len(m.tx))
    for _, tx := range m.tx {
        go func(tx *Tx) {
            err := tx.Rollback()
            ec <- err
        }(tx)
    }
    var me MultiError
    for i := 0; i < len(m.tx); i++ {
        if err := <-ec; err != nil {
            me.append(err)
        }
    }
    return me.check()
}

func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, chan struct{}, error) {
    rc := make(chan *sql.Rows)
    ec := make(chan error, len(xs))

    var wg sync.WaitGroup
    wg.Add(len(xs))
    for _, x := range xs {
        go func(x executor) {
            rows, err := x.QueryContext(ctx, query, args...)
            switch { // Make sure only one of them is returned
            case err != nil:
                ec <- err
            case rows != nil:
                rc <- rows
            }
            wg.Done()
        }(x)
    }

    // Close channels when all query routines completed
    go func() {
        wg.Wait()
        close(ec)
        close(rc)
    }()

    rows, ok := <-rc
    if ok { // ok will be false if channel closed before any rows
        done := make(chan struct{}) // Done signals the caller that all remaining rows are properly closed
        go func() {
            for rows := range rc { // Drain channel and close unused Rows
                rows.Close()
            }
            close(done)
        }()
        return rows, done, nil
    }

    // no rows, build error return
    var me MultiError
    for err := range ec {
        me.append(err)
    }
    return nil, nil, me.check()
}

Edit: Cancel & wait for old contexts before every Query, as *sql.Tx is not Go routine save, all previous queries have to be done before a next call.

Tim
  • 1,585
  • 1
  • 18
  • 23
  • I think that this version of ```multiQuery``` will wait untill ALL queries have completed before returning. This means that it will take as long as the slowest query which seems to make it slower, on average, than just running a single query? (not an issue if the aim is purerly resilience). – Brits Feb 24 '20 at 18:39
  • 1
    @Brits: No, because drainage & close runs in a Go routine. So does the `wg.Wait()` and channel close. The function returns after the first `rows` is obtained from the channel or if the the channel is closed without rows, in which case `MultiError` is build. – Tim Feb 24 '20 at 18:46