
I am testing a Python script that pulls data from an API and sends the data to Splunk. The script is working fine, but my issue is that I will need to send millions of events daily from the API to Splunk. In my local testing, I am only able to send a few thousand events per hour. I eventually need to port this into Lambda for scheduled automation.

I know about the multiprocessing Python module, but my concern is that even if I get that logic up and running, at best I will be able to send tens of thousands of events an hour, and Lambda will time out before I'm even close to sending the full range of data. I'm hoping someone has encountered this challenge before and can suggest some options for me to consider. Thank you!

Code:

import requests
# SplunkSender comes from whichever HEC client library the script uses

splunk_conf = {<config stuff>}

def splunk(splunk_payload, splunk_conf):
    splunk = SplunkSender(**splunk_conf)
    payloads = [splunk_payload]               # each call wraps a single event
    splunk_res = splunk.send_data(payloads)

# page through the API 10,000 records at a time
for offset in range(0, 9000000, 10000):
    r = requests.get(f'{base_url}/<api>?limit=10000&offset={offset}', headers=headers).json()
    for x in r['data']:
        splunk(x, splunk_conf)                # one HEC call per event

I wrote my script and got it working, but with my current understanding of the available options, the sheer volume of data will be the limiting factor.
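For completeness, here is a rough sketch of the parallel approach I had in mind, using a thread pool from concurrent.futures rather than multiprocessing. It is untested at this scale, the worker count is a guess, and base_url, headers, splunk_conf, and the splunk() helper are the same placeholders as above. Note that it still makes one HEC call per event, so it only hides the API latency, not the per-event send overhead.

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_and_forward(offset):
    # pull one page of 10,000 records and forward each event via the splunk() helper above
    r = requests.get(f'{base_url}/<api>?limit=10000&offset={offset}', headers=headers).json()
    for x in r.get('data', []):
        splunk(x, splunk_conf)                # still one HEC call per event
    return len(r.get('data', []))

# fetch pages concurrently; 8 workers is an arbitrary starting point
with ThreadPoolExecutor(max_workers=8) as pool:
    sent = sum(pool.map(fetch_and_forward, range(0, 9000000, 10000)))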

Update: I was able to get this working by taking the elements in the dictionary and adding them to a list to pass in as the Splunk payload. My original code was sending the events one at a time due to my misunderstanding of how to pass in the data properly.

import logging
import requests

splunk_token = <code to retrieve token>

def splunk(splunk_payload, splunk_token):
    splunk_conf = { <splunk conf details> }
    splunk = SplunkSender(**splunk_conf)
    splunk_res = splunk.send_data(splunk_payload)   # whole page sent in one call
    logging.info(splunk_res)

for offset in range(0, 10000000, 10000):
    splunk_payload = []
    try:
        r = requests.get(f'{base_url}<API endpoint URL>limit=10000&offset={offset}', headers=headers).json()
        for event in r['data']:
            splunk_payload.append(event)
        splunk(splunk_payload, splunk_token)
    except Exception as ex:
        print("No more results from API!")
        exit()
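The remaining piece is the Lambda timeout. My plan, sketched below and not yet tested, is to have each scheduled invocation work through only a slice of the offsets so it finishes well within the limit. The handler name, the start_offset/end_offset event fields, and the 500,000-offset slice size are my own assumptions; splunk() and the placeholders are the same as above.

def lambda_handler(event, context):
    start = event.get('start_offset', 0)
    end = event.get('end_offset', start + 500000)   # ~50 pages per invocation, an assumption
    splunk_token = <code to retrieve token>
    sent = 0
    for offset in range(start, end, 10000):
        r = requests.get(f'{base_url}<API endpoint URL>limit=10000&offset={offset}', headers=headers).json()
        splunk_payload = list(r.get('data', []))
        if not splunk_payload:
            break                                   # no more results from the API
        splunk(splunk_payload, splunk_token)        # one batched HEC call per page
        sent += len(splunk_payload)
    return {'events_sent': sent}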
  • I presume you're sending them to the HEC? Is this Splunk Cloud, or on-prem? Splunk, *per se*, has absolutely no issues handling "millions of events per hour" - my last customer is a 30TB/day environment, receiving over a million events per hour (via HEC) on just one sourcetype (that "only" accounts for ~3-4% of the total ingest)...let alone the couple hundreds others *also* sending (via HEC, UF, etc) – warren Aug 12 '23 at 13:03
  • Thank you, @warren. It turns out that the issue was my own lack of technical know-how. – noobuntu Aug 14 '23 at 21:57
  • what was the solution? – warren Aug 15 '23 at 01:29
  • I edited my post to add the solution. – noobuntu Aug 16 '23 at 21:41
  • post it as an answer, and you can then click it as "accepted" :) – warren Aug 17 '23 at 11:49

1 Answer


I was able to get this working by taking the elements in the dictionary and adding them to a list to pass in as the Splunk payload. My original code was sending the events one at a time due to my misunderstanding of how to pass in the data properly.

import logging
import requests

splunk_token = <code to retrieve token>

def splunk(splunk_payload, splunk_token):
    splunk_conf = { <splunk conf details> }
    splunk = SplunkSender(**splunk_conf)
    splunk_res = splunk.send_data(splunk_payload)   # whole page sent in one call
    logging.info(splunk_res)

for offset in range(0, 10000000, 10000):
    splunk_payload = []
    try:
        r = requests.get(f'{base_url}<API endpoint URL>limit=10000&offset={offset}', headers=headers).json()
        for event in r['data']:
            splunk_payload.append(event)
        splunk(splunk_payload, splunk_token)
    except Exception as ex:
        print("No more results from API!")
        exit()
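For what it's worth, the speedup comes from cutting round-trips: the original loop made one send_data() call per event, while this version hands each 10,000-event page to send_data() in a single call. If that still isn't fast enough, the page fetches themselves could be run in parallel along the lines sketched in the question.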