I'm trying to extract multiple column values from an existing column in a streaming PySpark DataFrame.
I read the stream using:

stream_dataframe = spark_session.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", broker) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .load()
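The value column arrives from the Kafka source as binary, so it has to be cast to a string before any text operations (a minimal sketch; sdf is the name the snippets below use):

# Kafka delivers key/value as binary; cast value to a string
# so that split/regexp_extract can work on it.
sdf = stream_dataframe.selectExpr("CAST(value AS STRING)")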
I'm currently splitting the string in the value column and applying a schema to it:
from pyspark.sql.functions import split

assert sdf.isStreaming, "DataFrame doesn't receive streaming data"

# split attributes into a nested array in one column
# (named split_col to avoid shadowing pyspark.sql.functions.col)
split_col = split(sdf[col_name], split_str)

# now expand the array into multiple top-level columns
for idx, field in enumerate(schema):
    sdf = sdf.withColumn(field.name, split_col.getItem(idx).cast(field.dataType))
return sdf
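For context, this lives in a helper that takes the column name, delimiter, and target schema; a sketch of how it gets called (the function name, schema, and delimiter here are illustrative, not my real ones):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema and call, just to show the shape of the helper above.
log_schema = StructType([
    StructField("host", StringType()),
    StructField("response_status", IntegerType()),
])
sdf = split_to_columns(sdf, col_name="value", split_str=" ", schema=log_schema)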
I want to replace the split approach with regex extraction, and first tried the code below:
host_pattern = r'(^\S+\.[\S+\.]+\S+)\s'
ts_pattern = r'\[(\d{2}\/\w{3}\/\d{4}\:\d{2}\:\d{2}\:\d{2} (\+|\-)\d{4})\]'
method = r'\s([A-Z]{3,7})\s'
# url = r'\s((\/((\w+\/*\?*.(\w))+)\s))'
url = r'\s(\/[a-zA-Z0-9\/\S]+)'
protocol = r'\s([A-Z]+\/\d\.\d)\s'
status_pattern_size = r'\s(\d{3})\s(\d+)\s'
uuid_pattern = r'(([A-Za-z0-9\-]+)$)|(([0-9a-f]{32})$)'
df = df.selectExpr(regexp_extract('value', host_pattern, 1).alias('host'),
                   regexp_extract('value', ts_pattern, 1).alias('time'),
                   regexp_extract('value', method, 1).alias('http_method'),
                   regexp_extract('value', url, 1).alias('request_uri'),
                   regexp_extract('value', protocol, 1).alias('http_protocol'),
                   regexp_extract('value', status_pattern_size, 1).cast('integer').alias('response_status'),
                   regexp_extract('value', status_pattern_size, 2).cast('integer').alias('response_time'),
                   regexp_extract('value', uuid_pattern, 1).alias('instance_id'))
This throws an error saying "Column is not iterable", presumably because selectExpr expects SQL expression strings while regexp_extract from pyspark.sql.functions returns Column objects.
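Switching to select, which does accept Column objects, should avoid that particular error (an untested sketch, shown for two fields; the rest follow the same pattern), but it still means one regexp_extract call per column:

from pyspark.sql.functions import regexp_extract

# select (unlike selectExpr) takes Column objects directly.
df = df.select(
    regexp_extract('value', host_pattern, 1).alias('host'),
    regexp_extract('value', status_pattern_size, 1).cast('integer').alias('response_status'),
)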
What I actually want is to use the following single named regex, since the approach above needs a separate regexp_extract call per column:
(?P<host>\S+)\s+\S+\s+(?P<user>\S+)\s+\[(?P<time>.*?)\]\s+(?P<http_method>\S+)\s+(?P<request_uri>\S+)\s+(?P<http_protocol>\S+)\s+(?P<response_status>\S+)\s+(?P<response_time>\S+)\s+(?P<instance_id>\S+)
for extracting the values for the respective columns in one pass. Is it possible to do this on a streaming PySpark DataFrame?
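For reference, this is the behaviour I'm after, shown with Python's standard re module and a made-up log line (groupdict returns every named group from a single match):

import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+)\s+\S+\s+(?P<user>\S+)\s+\[(?P<time>.*?)\]\s+'
    r'(?P<http_method>\S+)\s+(?P<request_uri>\S+)\s+(?P<http_protocol>\S+)\s+'
    r'(?P<response_status>\S+)\s+(?P<response_time>\S+)\s+(?P<instance_id>\S+)'
)

# A fabricated sample line, only to illustrate the named-group output.
line = '10.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] GET /index.html HTTP/1.0 200 2326 abc123'
match = LOG_PATTERN.match(line)
if match:
    print(match.groupdict())  # one dict holding every column value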