
The message appears in the S3 bucket as this string: '\u00001\u00001\u0000/\u00002\u00003\u0000/\u00002\u00000\u00002\u00001\u0000 \u00001\u00007\u0000:\u00004\u00005\u0000,\u0000s\u0000e\u0000v\u0000e\u0000r\u0000i\u0000t\u0000y\u0000l\u0000e\u0000v\u0000e\u0000l\u0000,\u0000T\u0000h\u0000i\u0000s\u0000 \u0000i\u0000s\u0000 \u0000a\u0000 \u0000t\u0000e\u0000s\u0000t\u0000 \u0000m\u0000e\u0000s\u0000s\u0000a\u0000g\u0000e\u0000 \u0000'

<source>
  @type tail
  path PATH_TO_LOG_FILE
  pos_file PATH_TO_LOG_FILE.pos
  read_from_head true
  tag test
  <parse>
    @type none
  </parse>
</source>

<filter test>
  @type record_transformer
  enable_ruby true
  <record>
    message ${ record["message"].gsub(/(\\u\d{4})/, "") }
  </record>
</filter>

<match test>
  @type s3
  
  aws_key_id KEY_ID
  aws_sec_key SEC_KEY
  s3_bucket S3_BUCKET
  s3_region S3_REGION
  #path logs/
  <buffer tag,time>
    @type file
    path PATH_TO_BUFFER
    timekey 60 # 60-second partitions (use 3600 for hourly)
    timekey_wait 10s
    chunk_limit_size 256m
  </buffer>
</match>
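The alternating NUL bytes in the captured string suggest the source log file is UTF-16 rather than UTF-8, so each original character arrives with an extra `\u0000`. A possible alternative to stripping bytes in a filter (a sketch, assuming the file really is UTF-16LE) is to convert the encoding at read time with in_tail's `from_encoding`/`encoding` parameters:

```
<source>
  @type tail
  path PATH_TO_LOG_FILE
  pos_file PATH_TO_LOG_FILE.pos
  read_from_head true
  tag test
  # Assumed fix: decode UTF-16LE input to UTF-8 before parsing,
  # so no NUL bytes reach the record in the first place.
  from_encoding UTF-16LE
  encoding UTF-8
  <parse>
    @type none
  </parse>
</source>
```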

For some reason the filter isn't replacing '\u0000' with ''. I've tried string interpolation, gsub!, and a couple of other approaches. When I run that line in an online Ruby interpreter, the gsub works fine. I was inspired by: How to remove unicode in fluentd tail/s3 plugin

  • You need `record["message"].gsub(/\p{Cc}+/, "")` if your string contains unnecessary control chars. – Wiktor Stribiżew Nov 23 '21 at 18:01
  • @WiktorStribiżew I swapped my record["message"] line with yours but it's still coming through to S3 as the unicode version of the string. The problem is that record["message"] appears to be unchanged by the gsub (starts as unicode and then posts to S3 as unicode w/o any change taking place). – helpaccount321 Nov 23 '21 at 18:09
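The comment above points at the likely cause: the record contains actual NUL control characters, not the six-character literal text `\u0000` that S3's display escapes them as. A regex built from `\\u\d{4}` looks for a real backslash, which never appears in the string. A minimal Ruby sketch (with a shortened stand-in for the logged message) showing the difference:

```ruby
# Stand-in for the record: real NUL characters between the digits,
# just as a UTF-16 file read as UTF-8 would produce.
msg = "1\u00001\u0000/\u00002\u00003"

# Original pattern: matches a literal backslash + "u" + 4 digits.
# The string contains no backslash, so nothing is replaced.
unchanged = msg.gsub(/(\\u\d{4})/, "")

# Matching the control characters themselves does work.
cleaned = msg.gsub(/\p{Cc}+/, "")

puts unchanged == msg  # true: the original regex never matched
puts cleaned           # "11/23"
```

Given that, the fact that the S3 output is still unchanged even with `\p{Cc}+` suggests the problem may not be the regex at all, e.g. stale buffered chunks from before the filter change, or the filter not matching the tag.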

0 Answers