I'm using a few datasets from a government data source in my ML model. The problem is that their server is not that great, to put it nicely: whenever my pipeline pulls everything from their API at once, the server goes down for a few minutes.

This is how their data is represented in our catalog.yml:

external-safra-cana:
  type: api.APIDataSet
  url: https://apisidra.ibge.gov.br/values/t/6588/p/all/v/allxp/c48/39456/n3/all

external-safra-algodao:
  type: api.APIDataSet
  url: https://apisidra.ibge.gov.br/values/t/6588/p/all/v/allxp/c48/39429/n3/all

external-safra-arroz:
  type: api.APIDataSet
  url: https://apisidra.ibge.gov.br/values/t/6588/p/all/v/allxp/c48/39432/n3/all

external-safra-milho1:
  type: api.APIDataSet
  url: https://apisidra.ibge.gov.br/values/t/6588/p/all/v/allxp/c48/39441/n3/all

external-safra-milho2:
  type: api.APIDataSet
  url: https://apisidra.ibge.gov.br/values/t/6588/p/all/v/allxp/c48/39442/n3/all

If the data fails to download, I want to sleep for a few seconds and retry, but I could not find anything like that in the documentation. Is there a way to get this behavior from the APIDataSet?

João Areias

1 Answer

I would consider subclassing the APIDataSet and building in a caching mechanism. You could, say, pickle the responses and build some sort of 'expiry' mechanism with the following logic:

  1. If pickle doesn't exist, call API and save response as pickle
  2. If pickle exists and within 'fresh' window, read from pickle
  3. If pickle exists and outside 'fresh' window, call API and create new pickle
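The three cases above can be sketched as a small standalone helper. This is only an illustration, not Kedro code: `load_with_cache`, `fetch`, `cache_path` and `max_age_seconds` are hypothetical names, and the file's modification time is assumed to be a good enough measure of freshness.

```python
import pickle
import time
from pathlib import Path
from typing import Any, Callable


def load_with_cache(
    fetch: Callable[[], Any],   # hypothetical: whatever actually calls the API
    cache_path: Path,           # hypothetical: where the pickled response lives
    max_age_seconds: float,     # hypothetical: length of the 'fresh' window
) -> Any:
    """Return the cached response while it is fresh, otherwise call `fetch`."""
    if cache_path.exists():
        # Assumption: the pickle's modification time marks when it was fetched.
        age = time.time() - cache_path.stat().st_mtime
        if age <= max_age_seconds:
            # Case 2: pickle exists and is fresh, so skip the API call.
            with cache_path.open("rb") as f:
                return pickle.load(f)

    # Case 1 (no pickle) or case 3 (stale pickle): call the API and re-cache.
    response = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(response, f)
    return response
```

In a subclass, `fetch` would presumably wrap the parent dataset's own load call, assuming the object it returns pickles cleanly; each catalog entry would also need its own `cache_path`.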

This isn't something that's come up a lot with users so we haven't built any native way of doing it.
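For the sleep-and-retry behaviour asked about in the question, a plain wrapper around the call is one option. Again, this is a rough sketch rather than anything Kedro provides: `retry_with_sleep`, `attempts` and `delay_seconds` are made-up names, and catching every `Exception` is a simplification you would probably narrow to connection or HTTP errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_sleep(
    fetch: Callable[[], T],                       # hypothetical: the API call to retry
    attempts: int = 3,                            # hypothetical: total tries before giving up
    delay_seconds: float = 5.0,                   # hypothetical: pause between tries
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call `fetch`, sleeping and retrying when it raises."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            # Out of tries: re-raise the last error to the caller.
            if attempt == attempts:
                raise
            # Give the server a moment to recover before the next try.
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

The two ideas combine naturally: passing a retrying callable as the fetch step of a cache means the server is only hit when the cache is missing or stale, and a failed hit is retried after a pause.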

datajoely