
I am downloading CSV stream data from each URL asynchronously and writing it to a separate file, one URL after another, like below.

async with httpx.AsyncClient(headers={"Authorization": 'Token token="sometoken"'}) as session:
    for url in some_urls_list:
        await download_data(url, session)  # download_data is a coroutine, so it must be awaited

@backoff.on_exception(backoff.expo, exception=(httpx.SomeException,), max_tries=7)
async def download_data(url, session):
    # Stream the response body to disk chunk by chunk instead of buffering it in memory.
    async with session.stream("GET", url) as csv_stream:
        csv_stream.raise_for_status()
        async with aiofiles.open("someuniquepath", "wb") as f:
            async for data in csv_stream.aiter_bytes():
                await f.write(data)
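For completeness, here is a self-contained version of the above that can be run as-is (a minimal sketch: httpx.SomeException in my snippet is a placeholder, swapped here for the real httpx.TransportError, and the example URL list and output path are assumptions):

import asyncio

import aiofiles
import backoff
import httpx


# httpx.TransportError stands in for the placeholder exception above.
@backoff.on_exception(backoff.expo, exception=(httpx.TransportError,), max_tries=7)
async def download_data(url, session):
    # Stream the response body to disk chunk by chunk.
    async with session.stream("GET", url) as csv_stream:
        csv_stream.raise_for_status()
        async with aiofiles.open("someuniquepath", "wb") as f:
            async for data in csv_stream.aiter_bytes():
                await f.write(data)


async def main():
    async with httpx.AsyncClient(headers={"Authorization": 'Token token="sometoken"'}) as session:
        for url in some_urls_list:
            await download_data(url, session)  # sequential: each file finishes before the next starts


some_urls_list = ["https://example.com/export.csv"]  # assumption: real endpoints go here
asyncio.run(main())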

I am ingesting this data into Splunk via inputs.conf and props.conf as below.

inputs.conf:

[monitor:///my_main_dir_path]
disabled = 0
index = xx
sourcetype = xx:xx

props.conf:

[xx:xx]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
CHARSET = UTF-8
INDEXED_EXTRACTIONS = csv
TIMESTAMP_FIELDS = xx

I am running into several issues with this, listed below. (A quick way to sanity-check the downloaded files on disk is sketched after the list.)

  • Some files are not indexed at all.
  • From some files only partial rows are indexed.
  • Some rows are abruptly divided into 2 events on Splunk.
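
Before pointing at Splunk's parsing, it may help to confirm that the files on disk are complete and parse cleanly as CSV. A minimal sanity-check sketch (the path is the placeholder from the code above; this script is not part of the pipeline):

import csv

# Flag any record whose field count differs from the header's;
# csv.reader correctly handles quoted fields containing commas or newlines.
with open("someuniquepath", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for recno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print(f"record {recno}: expected {len(header)} fields, got {len(row)}")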

What can be done on the Splunk configuration side to solve the above issues, while making sure the change does not cause any duplicate data to be indexed?
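One Splunk detail that can explain files being skipped entirely: the monitor input recognizes files by a CRC of (by default) their first 256 bytes, so CSV files that all begin with the same header row can be mistaken for files that were already indexed. A hedged inputs.conf sketch (crcSalt and initCrcLength are real settings; whether either is needed here is an assumption, and note that crcSalt = <SOURCE> re-indexes a file if it is ever renamed):

[monitor:///my_main_dir_path]
index = xx
sourcetype = xx:xx
# Mix the full file path into the checksum so files with identical
# leading bytes (the shared CSV header) are still seen as distinct.
crcSalt = <SOURCE>
# Alternative: checksum more than the default 256 bytes per file.
# initCrcLength = 1024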

Sample Data: (First line is the header.)

A,B B,C D,E,F,G H?,I J K,L M?,N/O P,Q R S,T U V (w x),Y Z,AA BB,CC DD,EE FF,GG HH,II JJ KK,some timestamp field,LL,MM,NN-OO,PP?,QQ RR ss TT UU,VV,WW,XX,YY,ZZ,AAA BBB,CCC,DDD-EEE,FFF GGG,HHH,III JJJ,KKK LLL,MMM MMM,NNN OOO,PPP QQQ,RRR SSS 1,TTT UUU 2,VVV WWW 3,XX YYY,ZZZ AAAA,BBBB CCCC
adata@adata.adata,"bbdata, bbdata",ccdata ccdata,eedata eedata - eedata,ffdata - ffdata - 725 ffdata ffdata,No,,No,,,,,unknown,unknown,unknown,2.0.0,"Sep 26 22:40:18 iidata-iidata-12cb65d081f745a2b iidata/iidata[4783]: iidata: to=<iidata@iidata.iidata>, iidata=iidata.iidata.iidata.iidata[111.111.11.11]:25, iidata=0.35, iidata=0.08/0/0.07/0.2, iidata=2.0.0, iidata=iidata (250 2.0.0 OK  1569537618 iidata.325 - iidata)",9/26/2019 22:40,,,,,,,wwdata,xxdata,5,"zzdata, zzdata",aaadata aaadata aaadata,cccdata - cccdata,ddddata - ddddata,fffdata,hhhdata,25/06/2010,6,2010,"nnndata nnndata nnndata, nnndata.",(pppdata'pppdata) pppdata pppdata,,,,303185,,

Sample Broken Event:

adata@adata.adata,"bbdata, bbdata",ccdata ccdata,eedata eedata - eedata,ffdata - ffdata - 725 ffdata ffdata,No,,No,,,,,unknown,un

known,unknown,2.0.0,"Sep 26 22:40:18 iidata-iidata-12cb65d081f745a2b iidata/iidata[4783]: iidata: to=<iidata@iidata.iidata>, iidata=iidata.iidata.iidata.iidata[111.111.11.11]:25, iidata=0.35, iidata=0.08/0/0.07/0.2, iidata=2.0.0, iidata=iidata (250 2.0.0 OK  1569537618 iidata.325 - iidata)",9/26/2019 22:40,,,,,,,wwdata,xxdata,5,"zzdata, zzdata",aaadata aaadata aaadata,cccdata - cccdata,ddddata - ddddata,fffdata,hhhdata,25/06/2010,6,2010,"nnndata nnndata nnndata, nnndata.",(pppdata'pppdata) pppdata pppdata,,,,303185,,
– SmiP

1 Answer


I hope you are monitoring something much more specific than a top-level directory; otherwise, you risk Splunk running out of open file descriptors and/or memory.
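
For example, the monitor could be limited to the CSV files themselves (a sketch; the .csv naming is an assumption about the downloader's output paths):

[monitor:///my_main_dir_path]
index = xx
sourcetype = xx:xx
# Only ingest files whose full path matches this regular expression.
whitelist = \.csv$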

Partial rows and divided rows are symptoms of incorrect props.conf settings. It's impossible to suggest corrections without seeing some events.

It's also possible Splunk is reading each file too fast, i.e. before your downloader has finished writing it. Try adding these settings to inputs.conf:

multiline_event_extra_waittime = true
time_before_close = 3
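
In context, both settings would sit in the existing monitor stanza (a sketch reusing the question's placeholders):

[monitor:///my_main_dir_path]
index = xx
sourcetype = xx:xx
# Wait extra time before finalizing events from a file that may still be growing.
multiline_event_extra_waittime = true
# Seconds to wait after reaching EOF before Splunk closes the file.
time_before_close = 3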
– RichG
  • Will try monitoring a more specific file pattern, along with the two suggested config lines, and will report back. I have also added a data sample at the bottom of the question. – SmiP Dec 23 '20 at 08:17