How to make work concurrent while sending data on a stream in golang?

Question

I have a golang grpc server which has streaming endpoint. Earlier I was doing all the work sequentially and sending on the stream but then I realize I can make the work concurrent and then send on stream. From grpc-go docs: I understood that I can make the work concurrent, but you can't make sending on the stream concurrent so I got below code which does the job.

Below is the code I have in my streaming endpoint which sends data back to client in a streaming way. This does all the work concurrently.

// get "allCids" from lot of files and load in memory.
allCids := .....
var data = allCids.([]int64)
out := make(chan *custPbV1.CustomerResponse, len(data))

wg := &sync.WaitGroup{}
wg.Add(len(data))
go func() {
  wg.Wait()
  close(out)
}()

for _, cid := range data {
  go func (id int64) {
    defer wg.Done()
    pd := repo.GetCustomerData(strconv.FormatInt(cid, 10))
    if !pd.IsCorrect {
      return
    }
    resources := us.helperCom.GenerateResourceString(pd)
    val, err := us.GenerateInfo(clientId, resources, cfg)
    if err != nil {
      return
    }
    out <- val
  }(cid)
}

for val := range out {
  if err := stream.Send(val); err != nil {
    log.Printf("send error %v", err)
  }
}

Now problem I have is size of data slice can be approx a million so I don't want to spawn million go routine doing the job. How do I handle that scenario here? If instead of len(data) I use 100 then will that work for me or I need to slice data as well in 100 sub arrays? I am just confuse on what is the best way to deal with this problem?

I recently started with golang so pardon me if there are any mistakes in my above code while making it concurrent.

Pratheesh M · Answer 1 · 2022-03-14T04:49:22.333

Please check this pseudo code

func main() {

    works := make(chan int, 100)
    errChan := make(chan error, 100)
    out := make(chan *custPbV1.CustomerResponse, 100)

    // spawn fixed workers
    var workerWg sync.WaitGroup
    for i := 0; i < 100; i++ {
        workerWg.Add(1)
        go worker(&workerWg, works, errChan, out)
    }

    // give input
    go func() {
        for _, cid := range data {
            // this will be blocked if all the workers are busy and no space is left in the channel.
            works <- cid
        }
        close(works)
    }()

    var analyzeResults sync.WaitGroup
    analyzeResults.Add(2)

    // process errors
    go func() {
        for err := range errChan {
            log.Printf("error %v", err)
        }
        analyzeResults.Done()
    }()

    // process outout
    go func() {
        for val := range out {
            if err := stream.Send(val); err != nil {
                log.Printf("send error %v", err)
            }
        }
        analyzeResults.Done()
    }()

    workerWg.Wait()
    close(out)
    close(errChan)
    analyzeResults.Wait()
}

func worker(job *sync.WaitGroup, works chan int, errChan chan error, out chan *custPbV1.CustomerResponse) {
    defer job.Done()
    // Idle worker takes the work from this channel. 
    for cid := range works {
        pd := repo.GetCustomerData(strconv.FormatInt(cid, 10))
        if !pd.IsCorrect {
            errChan <- errors.New(fmt.Sprintf("pd %d is incorrect", pd))
            // we can not return here as the total number of workers will be reduced. If all the workers does this then there is a chance that no workers are there to do the job
            continue
        }
        resources := us.helperCom.GenerateResourceString(pd)
        val, err := us.GenerateInfo(clientId, resources, cfg)
        if err != nil {
            errChan <- errors.New(fmt.Sprintf("got error", err))
            continue
        }
        out <- val
    }
}

Explanation:

This is a worker pool implementation where we spawn a fixed number of goroutines(100 workers here) to do the same job(GetCustomerData() & GenerateInfo() here) but with different input data(cid here). 100 workers here does not mean that it is parallel but concurrent(depends on the GOMAXPROCS). If one worker is waiting for io result(basically some blocking operation)then that particular goroutine will be context switched and other worker goroutine gets a chance to execute. But increasing goroutuines (workers) may not give much performance but can leads to contention on the channel as more workers are waiting for the input job on that channel.

The benefit over splitting the 1 million data to subslice is that. Lets say we have 1000 jobs and 100 workers. each worker will get assigned to the jobs 1-10, 11-20 etc... What if the first 10 jobs is taking more time than others. In that case the first worker is overloaded and the other workers will finish the tasks and will be idle even though there are pending tasks. So to avoid this situation, this is the best solution as the idle worker will take the next job. So that no worker is more overloaded compared to the other workers

Can you add some explanation on this as well? How this differs than what I have also pros and cons and few other details which can help me understand better. — rosed, Mar 14 '22 at 04:04

How to make work concurrent while sending data on a stream in golang?

1 Answers1