PROBLEM DESCRIPTION

Hello, I would like to implement a service that receives data from various providers and dumps it into a database (a sort of raw data store).
The issue is that the providers deliver the data I need in different ways. Some stream it through a RabbitMQ exchange, others give me access to an API I can pull from, and others simply share CSV files.
I believe this is a common scenario for those who need a lot of data and have different providers.

QUESTION

Can somebody point me to the right steps to take in the design phase of this data ingestion pipeline, so that it is as scalable and maintainable as possible? Known design patterns, or anything else that comes in handy for these fairly common scenarios, would help.

giulio di zio
  • I guess it depends on what kind of concrete pipeline you're building and with what. Some ETL platforms might have building blocks that allow merging inputs. In general, I'd assume a good goal would be to abstract away the different sources so they can be handled in the same way (somewhat like a repository pattern). – zapl Oct 11 '22 at 13:27

2 Answers

If you have different, interchangeable implementations of data providers, then this is a place where the Strategy pattern can be used:

Strategy pattern is a behavioral software design pattern that enables selecting an algorithm at runtime. Instead of implementing a single algorithm directly, code receives run-time instructions as to which in a family of algorithms to use.

Let me show an example.

We need some common behaviour that is shared across all strategies. In our case, it is a single Get() method that every data provider implements:

public interface IDataProvider
{
    string Get();
}

Then come its concrete implementations. These are the interchangeable strategies (each Get() body is only a placeholder):

public class RabbitMQDataProvider : IDataProvider
{
    public string Get()
    {
        return "I am RabbitMQDataProvider";
    }
}

public class ApiDataProvider : IDataProvider
{
    public string Get()
    {
        return "I am ApiDataProvider";
    }
}

public class CsvDataProvider : IDataProvider
{
    public string Get()
    {
        return "I am CsvDataProvider";
    }
}

We also need a place to store all the strategies and to look up the one we need. A simple factory fits here. Note that a simple factory is neither the Factory Method pattern nor the Abstract Factory pattern.

public enum DataProviderType
{
    RabbitMq, Api, Csv
}

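// Needs "using System.Collections.Generic;" at the top of the file
// (unless implicit usings are enabled in the project).
public class DataProviderFactory
{
    // Maps each provider type to its strategy instance; filled once.
    private readonly Dictionary<DataProviderType, IDataProvider> _dataProviderByType
        = new Dictionary<DataProviderType, IDataProvider>()
        {
            { DataProviderType.RabbitMq, new RabbitMQDataProvider() },
            { DataProviderType.Api, new ApiDataProvider() },
            { DataProviderType.Csv, new CsvDataProvider() },
        };

    // Throws KeyNotFoundException if the type has not been registered above.
    public IDataProvider GetInstanceByType(DataProviderType dataProviderType) =>
        _dataProviderByType[dataProviderType];
}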

Then you can get an instance of the desired data provider like this:

DataProviderFactory dataProviderFactory = new();
IDataProvider dataProvider = dataProviderFactory
    .GetInstanceByType(DataProviderType.Api);
string data = dataProvider.Get();

This design is compliant with the open/closed principle. If you need to add another data provider, then:

  • you add a new class that implements IDataProvider, plus a new DataProviderType value and one dictionary entry in DataProviderFactory
  • you do not edit the existing provider classes or the code that calls IDataProvider.Get()
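
As a minimal sketch of what adding a provider looks like under this design: the SftpDataProvider class and the Sftp enum value below are hypothetical and exist only for illustration, and the provider list is trimmed to keep the example short. The existing CsvDataProvider is left untouched, and the only changes are the new class, the new enum value, and one new dictionary entry.

```csharp
using System;
using System.Collections.Generic;

public interface IDataProvider
{
    string Get();
}

// Existing strategy: not edited when a new provider is added.
public class CsvDataProvider : IDataProvider
{
    public string Get() => "I am CsvDataProvider";
}

// New strategy (hypothetical name, for illustration only).
public class SftpDataProvider : IDataProvider
{
    public string Get() => "I am SftpDataProvider";
}

public enum DataProviderType
{
    Csv,
    Sftp // new value for the new provider
}

public class DataProviderFactory
{
    private readonly Dictionary<DataProviderType, IDataProvider> _dataProviderByType = new()
    {
        { DataProviderType.Csv, new CsvDataProvider() },
        { DataProviderType.Sftp, new SftpDataProvider() }, // the only new entry
    };

    public IDataProvider GetInstanceByType(DataProviderType dataProviderType) =>
        _dataProviderByType[dataProviderType];
}

public static class Demo
{
    public static void Main()
    {
        var factory = new DataProviderFactory();

        // The calling code is identical for every provider type.
        foreach (DataProviderType type in Enum.GetValues(typeof(DataProviderType)))
        {
            Console.WriteLine(factory.GetInstanceByType(type).Get());
        }
    }
}
```

The loop in Main never mentions a concrete provider class, which is the part of the code that stays closed for modification.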

StepUp

One approach among many:

The first thing that came to mind was an integration framework such as Apache Camel or Spring Integration. In practice, implement one route per data source (RabbitMQ consumer, file consumer, HTTP producer, etc.), then shape/map/transform the data so that it can be stored in the database, which is the last step of each route. This keeps things quite simple.
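
To make the "one route per source" shape concrete, here is a rough plain C# sketch. It does not use Apache Camel or Spring Integration and is not their API. IngestionRoute, RawRecord, and RouteDemo are hypothetical names, the sources are hard-coded in-memory stand-ins, and a List takes the place of the database.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of a row in the raw data store.
public record RawRecord(string Source, string Payload);

// One route = one source + one mapping step + the shared sink as the last step.
public class IngestionRoute
{
    private readonly string _name;
    private readonly Func<IEnumerable<string>> _consume; // stand-in for a queue consumer, file reader or HTTP poller
    private readonly Func<string, string> _transform;    // shape/map the incoming payload

    public IngestionRoute(string name, Func<IEnumerable<string>> consume, Func<string, string> transform)
    {
        _name = name;
        _consume = consume;
        _transform = transform;
    }

    // Pulls every message from the source, transforms it, and stores it.
    public void Run(ICollection<RawRecord> sink)
    {
        foreach (var message in _consume())
        {
            sink.Add(new RawRecord(_name, _transform(message)));
        }
    }
}

public static class RouteDemo
{
    public static List<RawRecord> RunAll()
    {
        var store = new List<RawRecord>(); // stands in for the database

        var routes = new[]
        {
            new IngestionRoute("rabbitmq", () => new[] { " msg-1 " }, s => s.Trim()),
            new IngestionRoute("csv", () => new[] { "a,b" }, s => s.Replace(',', ';')),
        };

        foreach (var route in routes)
        {
            route.Run(store);
        }

        return store;
    }

    public static void Main()
    {
        foreach (var row in RunAll())
        {
            Console.WriteLine(row);
        }
    }
}
```

In a real integration framework the consumer, the transformation, and the database write would be the framework's own components; the sketch only shows that each source gets its own route while the final storage step is shared.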

kladderradatsch