
I have an application that needs to read files from two different paths. After reading all these files, I need to load them into memory in a products map.

Path:

  • Full: This is the path which will have all files that we need to load up during server startup in memory. This path will have around 50 files and each file size is ~60MB.
  • Delta: This is the path which will have all the delta files that we need to load up in memory periodically every 1 minute. These files will only contain difference from the full path files. This path will have around 60 files and each file size is ~20MB.

The watchDeltaPath code below is called during server startup to watch for delta changes. It gets the delta path from the GetDeltaPath method, and I need to load all the files from that path into memory. The delta path changes every few minutes, and I cannot miss any delta path or any of the files in it.

Loading all the files into memory with the loadAllFiles method can take some time (approximately 5 minutes), so I am looking for a way to avoid missing any new delta path (as it changes every few minutes) while still loading all the files from each delta path into memory periodically, reliably, and efficiently.

The code below runs every minute, looks for a new delta path each time, and then loads all the files from that path into memory. It works, but I don't think this is the right approach. What happens if the loadAllFiles method takes more than 10 minutes to load all the files while my ticker fires every minute to look for a new delta path, find all the files in it, and load them into memory? Will it keep creating a lot of background threads and possibly increase CPU usage significantly?

type applicationRepository struct {
  client         customer.Client
  logger         log.Logger
  done           chan struct{}
  products       *cmap.ConcurrentMap
}

// this will be called only once
func (r *applicationRepository) watchDeltaPath() error {
    ticker := time.NewTicker(1 * time.Minute)
    go func() {
        select {
        case <-r.done:
            ticker.Stop()
            return
        case <-ticker.C:
            func() (result error) {
                trans := r.logger.StartTransaction(nil, "delta-changes", "")
                defer trans.End()
                defer func() {
                    if result != nil {
                        trans.Errorf("Recovered from error: %v", result)
                    } else if err := recover(); err != nil {
                        trans.Errorf("Recovered from panic: %v", err)
                    }
                }()
                // get latest delta path everytime as it keeps changing every few minutes
                path, err := r.client.GetDeltaPath("delta")
                if err != nil {
                    return err
                }
                // load all the files in memory in "products" map from that path
                err = r.loadAllFiles(path)
                if err != nil {
                    return err
                }
                return nil
            }()
        }
    }()
    return nil
}

func (r *applicationRepository) Stop() {
    r.done <- struct{}{}
}

What is the best way to do this efficiently in prod?

Here is a playground link showing how the code is executed - https://go.dev/play/p/FS4-B0FWwTe

vader
  • Your `select` needs to be within a loop (e.g. `for`) if you want it to run more than once (I am assuming that `watchDeltaPath` is only called once and should check for new files every minute). If the ticker fires whilst `loadAllFiles` is running then you have a ["slow receiver"](https://pkg.go.dev/time#NewTicker) and "The ticker will adjust the time interval or drop ticks to make up for slow receivers". As you are not starting go routines when responding to a tick only a single instance of `loadAllFiles` will be running at any point in time. – Brits Feb 26 '22 at 22:41
  • yeah `watchDeltaPath` is only called once and you are right it will be a slow receiver if the ticker fires whilst `loadAllFiles` is running. I am just worried I can miss any new `delta path` whenever this condition happens? Because there is a new delta path every few minutes and I need to load all files from that new path and I cannot miss any delta path. That is why I am looking for some inputs on the correct design on how to handle this. – vader Feb 27 '22 at 04:23
  • One way of dealing with that is to separate the two operations. One operation solely waits on new paths and adds them to a queue (perhaps a buffered channel) and the other operation reads paths from that queue and processes them. However this assumes that no files are added to a particular path after you become aware of it... – Brits Feb 27 '22 at 04:45
  • Interesting suggestion. Yes once path is defined and processed there won't be any new files added to it later on. I think that might work for me. Do you think you can provide an example on how will that work combined with my above code? It will help me understand better on how to do this correctly. – vader Feb 27 '22 at 05:02
  • Sorry - without significantly more detail (e.g. how you are informed that there is a new folder) any example would be very generic (and unhelpful). However the concept is fairly simple - one routine feeds into a channel (lets say `deltaChan`) when there is a new path and the other is something like your code but with `case path := <-deltaChan:` instead of `case <-ticker.C:`. – Brits Feb 27 '22 at 07:14
  • I created a [play](https://go.dev/play/p/FS4-B0FWwTe) which shows on how `watchDeltaPath` is being used. Also on your question on how I am being informed if there is a new folder - It is pretty much coming from this line `path, err := r.client.GetDeltaPath("delta")`. I have to call `GetDeltaPath` method which will give me a path, it can be same path or new path depending on whether new path has been generated within that time interval. – vader Feb 27 '22 at 17:38
  • Basically it is something I need to detect internally if path we have is same or it is different (maybe looking at queue), if different then it is a new path which I need to consume but if it is old path then I can discard and keep polling again for new path. – vader Feb 27 '22 at 17:38

2 Answers


As per the comments, the "best way to do this efficiently in prod" depends on a lot of factors and is probably not answerable on a site like Stack Overflow. Having said that, I can suggest an approach that might make it easier to think about how the problem could best be solved.

The code below (playground; pretty rough and untested) demonstrates an approach with three goroutines:

  1. Detects new delta paths and pushes them to a buffered channel
  2. Handles the initial load
  3. Waits for initial load to finish then applies deltas (note that this does process deltas found while the initial load is underway)

As mentioned above, there is insufficient detail in the question to ascertain whether this is a good approach. It may be that the initial load and deltas can run simultaneously without saturating the IO, but that would require testing (and would be a relatively small change).

// Simulation of process to perform initial load and handle deltas
package main

import (
    "fmt"
    "strconv"
    "sync"
    "time"
)

const (
    deltaBuffer         = 100                    // capacity of the delta path queue
    initialLoadTime     = 1500 * time.Millisecond // simulated duration of the initial load
    deltaCheckFrequency = 500 * time.Millisecond  // how often to poll for a new delta path
)

func main() {
    ar := NewApplicationRepository()
    time.Sleep(5 * time.Second)
    ar.Stop()
    fmt.Println(time.Now(), "complete")
}

type applicationRepository struct {
    deltaChan       chan string   // Could be some other type...
    initialLoadDone chan struct{} // Closed when initial load finished

    done chan struct{}
    wg   sync.WaitGroup
}

func NewApplicationRepository() *applicationRepository {
    ar := applicationRepository{
        deltaChan:       make(chan string, deltaBuffer),
        initialLoadDone: make(chan struct{}),
        done:            make(chan struct{}),
    }

    ar.wg.Add(3)
    go ar.detectNewDeltas()
    go ar.initialLoad()
    go ar.deltaLoad()

    return &ar
}

// detectNewDeltas - watch for new delta paths
func (a *applicationRepository) detectNewDeltas() {
    defer a.wg.Done()
    var previousDelta string
    for {
        select {
        case <-time.After(deltaCheckFrequency):
            dp := a.getDeltaPath()
            if dp != previousDelta {
                select {
                case a.deltaChan <- dp:
                default:
                    panic("channel full - no idea what to do here!")
                }
                previousDelta = dp
            }
        case <-a.done:
            return
        }
    }
}

// getDeltaPath - in a real application this would retrieve the delta path
func (a *applicationRepository) getDeltaPath() string {
    return strconv.Itoa(time.Now().Second()) // For now just return the current second..
}

// initialLoad - load the initial data
func (a *applicationRepository) initialLoad() {
    defer a.wg.Done()
    defer close(a.initialLoadDone)
    time.Sleep(initialLoadTime) // Simulate time taken for initial load
}

// deltaLoad - load deltas found by detectNewDeltas
func (a *applicationRepository) deltaLoad() {
    defer a.wg.Done()
    fmt.Println(time.Now(), "deltaLoad started")

    // Wait for initial load to complete before doing anything
    <-a.initialLoadDone
    fmt.Println(time.Now(), "Initial Load Done")

    // Wait for incoming deltas and load them
    for {
        select {
        case newDelta := <-a.deltaChan:
            fmt.Println(time.Now(), newDelta)
        case <-a.done:
            return
        }
    }
}

// Stop - signal loader to stop and wait until this is done
func (a *applicationRepository) Stop() {
    close(a.done)
    a.wg.Wait()
}
marc_s
Brits
  • Thank you. I think this is pretty clear to me and makes sense as well. I have a few questions - I see that in the `NewApplicationRepository` function you have three goroutines: `go ar.detectNewDeltas()`, `go ar.initialLoad()` and `go ar.deltaLoad()`. If any of these three returns an error, how can I propagate it back to the calling code? In my case `NewApplicationRepository` also returns an error to the calling code. As you can see, my `loadInitialData` function returns an error as well if there is one. So I am just confused about that for now. – vader Feb 28 '22 at 05:43
  • "how can I propagate back to calling code" using a channel is probably the simplest approach ([errgroup](https://pkg.go.dev/golang.org/x/sync/errgroup) may be useful). If you do things the way I have (load initial data THEN deltas) you can actually run the initial load in the main go routine (I used three because I do not know what you actually do with the data). – Brits Feb 28 '22 at 07:51
  • I think the design you suggested is what I was looking for: load the initial data, and once that is done, start looking for deltas and keep loading them as new delta paths arrive. But I got confused about how to deal with errors from the `initialLoad`, `detectNewDeltas` and `deltaLoad` methods. Do you think you can provide an example of how we can use a channel here to propagate the errors back? I am kind of confused about how that will work here. And I haven't heard about errgroup before, so I need to read about that. – vader Feb 28 '22 at 08:11
  • I have some getters to access the `products` map. In my `loadAllFiles` method I load all data in the `products` map and I access the data from the map using those getters. – vader Feb 28 '22 at 08:13

I think what you want is the Go concurrency pattern known as fan-in, fan-out. You can search for it on Google.

Here is some example code I created. You can copy and paste it, then create full and delta folders with dummy files inside them.

package main

import (
    "fmt"
    "log"
    "os"
    "path/filepath"
    "sync"
    "time"
)

type MyFile struct {
    full         map[string][]byte
    delta        map[string][]byte
    files        []string
    stopAutoLoad chan struct{}
}

func FilePathWalkDir(root string) ([]string, error) {
    var files []string
    err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err // info may be nil when Walk reports an error
        }
        if !info.IsDir() {
            files = append(files, path)
        }
        return nil
    })
    return files, err
}

func main() {
    mf := NewMyFile()
    mf.StartAutoLoadDelta(10 * time.Second)

    // time.Sleep(15 * time.Second)
    // mf.StopAutoLoadDelta()

    time.Sleep(50 * time.Minute)
    fmt.Println(len(mf.full))
    fmt.Println(len(mf.delta))
}

func NewMyFile() *MyFile {
    mf := &MyFile{
        full:         make(map[string][]byte),
        delta:        make(map[string][]byte),
        stopAutoLoad: make(chan struct{}),
    }

    mf.LoadFile("full", 0)
    mf.LoadFile("delta", 0)
    return mf
}

func (mf *MyFile) StartAutoLoadDelta(d time.Duration) {
    ticker := time.NewTicker(d)

    go func() {
        defer func() {
            ticker.Stop()
        }()

        i := 1
        for {
            select {
            case <-ticker.C:
                // mf.deleteCurrentDelta()
                mf.LoadFile("delta", i)
                fmt.Println("In Memory:")
                for k, v := range mf.delta {
                    fmt.Printf("key : %s\t\tlen: %d\n", k, len(v))
                }
                i++
            case <-mf.stopAutoLoad:
                return
            }
        }
    }()
}

func (mf *MyFile) StopAutoLoadDelta() {
    fmt.Println("Stopping autoload Delta")
    mf.stopAutoLoad <- struct{}{}
}

func (mf *MyFile) deleteCurrentDelta() {
    for k := range mf.delta {
        fmt.Println("data deleted: ", k)
        delete(mf.delta, k)
    }
}

type Fileinfo struct {
    name string
    data []byte
    err  error
}

func (mf *MyFile) LoadFile(prefix string, i int) {
    log.Printf("%s load : %d", prefix, i)
    files, err := FilePathWalkDir(prefix)
    if err != nil {
        // prefix may be "full" or "delta", so report the one that failed
        panic("failed to walk " + prefix + " directory: " + err.Error())
    }

    newFiles := make([]string, 0)

    for _, v := range files {
        if _, ok := mf.delta[v]; !ok {
            newFiles = append(newFiles, v)
        }
    }

    chanJobs := GenerateJobs(prefix, newFiles)
    chanResultJobs := ReadFiles(chanJobs, 8)
    counterTotal := 0
    counterSuccess := 0
    for results := range chanResultJobs {
        if results.err != nil {
            log.Printf("error reading file %s: %v", results.name, results.err)
        } else {
            switch prefix {
            case "delta":
                mf.delta[results.name] = results.data
            case "full":
                mf.full[results.name] = results.data
            default:
                panic("not implemented")
            }
            counterSuccess++
        }
        counterTotal++
    }

    log.Printf("status jobs running: %d/%d", counterSuccess, counterTotal)
}

func GenerateJobs(prefix string, files []string) <-chan Fileinfo {
    chanOut := make(chan Fileinfo)

    go func() {
        for _, v := range files {
            chanOut <- Fileinfo{
                name: v,
            }
        }
        close(chanOut)
    }()

    return chanOut
}

func ReadFiles(chanIn <-chan Fileinfo, worker int) <-chan Fileinfo {
    chanOut := make(chan Fileinfo)

    var wg sync.WaitGroup
    wg.Add(worker)

    // Fan out: start the workers, each reading jobs from the shared input channel.
    for i := 0; i < worker; i++ {
        go func(workerIndex int) {
            defer wg.Done()
            for job := range chanIn {
                log.Printf("worker %d is reading file %s", workerIndex, job.name)
                data, err := os.ReadFile(job.name)
                chanOut <- Fileinfo{
                    name: job.name,
                    data: data,
                    err:  err,
                }
            }
        }(i)
    }

    // Fan in: close the output channel once every worker has finished.
    go func() {
        wg.Wait()
        close(chanOut)
    }()
    return chanOut
}
Rahmat Fathoni
  • Thanks a lot for your help. Just want to clarify few things. Given "full" and "delta" I get the path from where I need to load files in memory. And then periodically I need to check new delta path every 1 minute or so and see if there is a new path that I haven't processed yet and if there is a new path then load all the files in that new path in memory. Problem I had was new delta path comes very frequently in like less than few minutes but loading files in memory can take more time so maybe there is a possibility I can miss any new delta path which I don't want with my design. – vader Feb 28 '22 at 14:48
  • So that is why I was looking for a design where I cannot miss any new delta path and can load all the files from any new delta path. And also handle error scenarios very well from that design. In your case, I think assumption is loading files is fast. Is there any way where we can create a queue or channel which can store all new delta path if it is different and haven't been processed yet and then I can keep pulling new delta path from that channel and start processing the files from it? – vader Feb 28 '22 at 14:52
  • Can the old files inside the delta folder be changed or deleted? – Rahmat Fathoni Feb 28 '22 at 23:36
  • `delta` isn't a folder btw. It is just a key which can tell me the full path from where I should load new delta files in memory. There is a new delta path generated every 6-7 minutes and previous delta path will still be there with files in it and we cannot miss any new delta path and all the files in that new path. So we need to pick up new delta path, load files in memory from that path. Listen for new delta path again and load files from that new path again. I believe Brits had a good idea of using channel for new delta path everytime. – vader Feb 28 '22 at 23:40
  • So this line from the code in my question, `r.client.GetDeltaPath("delta")`, tells me the new delta path from which I need to load files into memory, and I constantly need to call it to see whether there is a new path or not. If there is a new path, then add it to the channel buffer and load all the files from that path into memory. Brits' suggestion works very well for me, but it has no error handling and I am confused about how to add that. – vader Feb 28 '22 at 23:42
  • Do the delta or full folders have one or more subfolders? – Rahmat Fathoni Feb 28 '22 at 23:47
  • nopes. it won't have it. – vader Feb 28 '22 at 23:53
  • OK, I assume the old files in delta and full cannot be deleted, and the number of files in delta can only increase. I have changed my code, so you can paste it again. It is simple to check whether there are one or more new files; check line 107-111. I also decreased the ticker duration. – Rahmat Fathoni Mar 01 '22 at 00:23
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242505/discussion-between-vader-and-rahmat-fathoni). – vader Mar 01 '22 at 04:16
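The queue of unprocessed delta paths asked about in the comments above could be sketched along these lines. This is an illustration under assumptions, not code from either answer: `deltaQueue`, `offer`, `errQueueFull` and the `/delta/N` paths are all hypothetical. The queue remembers which paths it has already accepted, queues each new path once, and returns an error when the buffer is full instead of blocking or panicking.

```go
package main

import (
	"errors"
	"fmt"
)

// errQueueFull is returned when the consumer has fallen too far behind.
var errQueueFull = errors.New("delta queue full")

// deltaQueue remembers the paths it has accepted and queues each one once.
// It is meant to be fed by a single polling goroutine (offer is not
// safe for concurrent callers).
type deltaQueue struct {
	seen  map[string]struct{}
	paths chan string
}

func newDeltaQueue(size int) *deltaQueue {
	return &deltaQueue{
		seen:  make(map[string]struct{}),
		paths: make(chan string, size),
	}
}

// offer queues path if it has not been accepted before. It reports whether
// the path was queued and returns errQueueFull rather than blocking.
func (q *deltaQueue) offer(path string) (bool, error) {
	if _, ok := q.seen[path]; ok {
		return false, nil // same path as an earlier poll: discard it
	}
	select {
	case q.paths <- path:
		// Mark as seen only after a successful enqueue so a retry can succeed.
		q.seen[path] = struct{}{}
		return true, nil
	default:
		return false, errQueueFull
	}
}

func main() {
	q := newDeltaQueue(2)
	for _, p := range []string{"/delta/1", "/delta/1", "/delta/2", "/delta/3"} {
		queued, err := q.offer(p)
		fmt.Println(p, queued, err)
	}
}
```

In this run the repeated path is skipped, and the last offer reports a full queue because nothing has drained it yet. A loader goroutine would receive from `q.paths` and call the real load function for each path, and the polling side could log or retry on `errQueueFull`.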