
I have implemented the logic below in Scala so far:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration(sc.hadoopConfiguration)
//hadoopConf.set("textinputformat.record.delimiter", "2016-")
hadoopConf.set("textinputformat.record.delimiter", "^([0-9]{4}.*)")

val accessLogs = sc.newAPIHadoopFile("/user/root/sample.log", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], hadoopConf).map(x => x._2.toString)

I want to use a regex so that a line starting with a date is treated as the beginning of a new record, and any other line is appended to the current record.
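To illustrate the grouping I am after, here is a minimal, non-distributed sketch of the logic (plain Scala on a list of lines, just to show the intent; dateStart is my assumed timestamp pattern):

// Plain-Scala sketch of the desired grouping: a line that starts with a
// date begins a new record, any other line is appended to the current one.
val dateStart = "^\\d{4}-\\d{2}-\\d{2} ".r

def groupRecords(lines: List[String]): List[String] =
  lines.foldLeft(List.empty[String]) { (recs, line) =>
    if (dateStart.findPrefixOf(line).isDefined) line :: recs   // new record
    else if (recs.isEmpty) List(line)                          // stray leading line
    else (recs.head + "\n" + line) :: recs.tail                // continuation line
  }.reverse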

The regex delimiter above is not working, though. If I pass the date literal manually it works fine; this is the setting where I want to use the regex instead:

//hadoopConf.set("textinputformat.record.delimiter", "2016-")

Please help with this. Thanks in advance.

Below is the sample log format:

2016-12-23 07:00:09,693 [jetty-51 - /app/service] INFO  org.apache.cxf.interceptor.LoggingOutInterceptor S:METHOD_NAME=METHNAME : WebAppSessionId= : ChannelSessionId=web-xxx-xxx-xxx : ClientIp=xxxxxxx :  - Outbound Message

---------------------------
    ID: 1978
    Address: https://sample.domain.com/SampleService.xxx/basic
    Encoding: UTF-8
    Content-Type: text/xml
    Headers: {Accept=[*/*], SOAPAction=["WebDomain.Service/app"]}
    Payload: <soap:Envelope>
    </soap:Envelope>
2016-12-26 08:00:01,514 [jetty-1195 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip :  - ActiveSpaceCacheHandler:getServiceResponseFromCache(); exception: java.lang.Exception: getServiceResponseData: com.tibco.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey:Request.US
2016-12-26 08:00:01,624 [jetty-979 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip :  - ActiveSpaceCacheHandler:setServiceResponseInCache(); exception: com.test.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey
  • What exactly is the problem? I tried your regex with the given text and it seemed to match it correctly. – Phasmid Dec 24 '16 at 20:32
  • @Phasmid, thanks for your effort. I used (...*)*\s* at the end of the regex, so if I have more than one record like the one shared above, it selects all of them. I need to put logic in my Scala application so that the mapper function splits the input per record rather than per line. – Ashish Tyagi Dec 24 '16 at 22:19
  • Is it Apache Spark related? – evgenii Dec 24 '16 at 22:50
  • Yes it is. I am trying to split the log data by record (each record starting with a date-time format) rather than per line, and then my regex will test the matching pattern on that record. – Ashish Tyagi Dec 24 '16 at 22:56
  • Is the number of lines always the same for each log entry? I'm guessing not because you are capturing a soap envelope? Also, are you interested in any of the lines following the line with the timestamp? – Chris Snow Dec 25 '16 at 19:43
  • @AshishTyagi - do you want the map function to ignore lines that do not start with the date format? Can you please add your expected output to your question? – Ronak Patel Dec 25 '16 at 22:56
  • @BigDataLearner - I resolved the initial issue myself. Actually, a single record spans multiple lines in the log, and each record starts with a date-time format. So I want to pass data to the map function per record. – Ashish Tyagi Dec 26 '16 at 14:31
  • For this, the code below works fine in spark-shell: val conf = new Configuration; val rgx = "^(([0-9]{4}-[0-9]{2}-[0-9]{2}))" // date regex; conf.set("record.delimiter.regex", rgx); sc.newAPIHadoopFile("", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf); val logFile = sc.textFile("/user/root/sample.log"); val accessLogs = logFile.map(parseLogLine) – Ashish Tyagi Dec 26 '16 at 14:31
  • But when I run it in Apache Zeppelin it gives the error org.apache.spark.SparkException: Task not serializable, caused by java.io.NotSerializableException: org.apache.hadoop.conf.Configuration. The same code works fine in spark-shell. Can you please help with this? – Ashish Tyagi Dec 26 '16 at 14:33
  • @SHC I have now added some more information above. – Ashish Tyagi Dec 27 '16 at 14:26
  • @Ashish, if you have found a solution for your original problem, you could also create an answer post stating how you fixed it. If you now have a different question, it would probably be best to create a new question with a different title. – Chris Snow Dec 27 '16 at 14:39
  • @SHC the issue is still the same: I want to retrieve the data from the logs per record rather than per line. However, I have tried some solutions and made the technical details of this question as accurate as possible. – Ashish Tyagi Dec 27 '16 at 14:41
  • I'm still not clear what the question is. Is it the NotSerializableException? – Chris Snow Dec 27 '16 at 14:53
  • @SHC it is not about the exception; as I mentioned, I am not able to retrieve the records from the logs according to my requirement. Do you know the logic to parse my logs so that each record starting with a date like 2016-12-26 is treated as one record, even though a record may span multiple lines? – Ashish Tyagi Dec 27 '16 at 19:25

1 Answer


I couldn't get it working with a regex. The best I could do was hadoopConf.set("textinputformat.record.delimiter", "\n20"), which may work for you as long as those characters never appear at the start of a line in the middle of a log entry. This approach also gives you some future-proofing, supporting dates up to 2099.

If you need a regex, you could try http://dronamk.blogspot.co.uk/2013/03/regex-custom-input-format-for-hadoop.html
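If your log files are small enough to be read whole, a simpler regex-based alternative (a sketch of the idea, not the custom InputFormat from that post) is to split each file on a lookahead for the timestamp, so the date stays attached to its record:

// Sketch: read each file whole and split on a zero-width lookahead for the
// timestamp, so every record keeps its leading date. Assumes each file
// comfortably fits in memory on a single executor.
val records = sc.wholeTextFiles("/user/root/sample.log")
  .flatMap { case (_, content) =>
    content.split("(?m)(?=^[0-9]{4}-[0-9]{2}-[0-9]{2} )").toSeq
  }
  .filter(_.trim.nonEmpty)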

My code:

// Create some dummy data
val s = """2016-12-23 07:00:09,693 [jetty-51 - /app/service] INFO  org.apache.cxf.interceptor.LoggingOutInterceptor S:METHOD_NAME=METHNAME : WebAppSessionId= : ChannelSessionId=web-xxx-xxx-xxx : ClientIp=xxxxxxx :  - Outbound Message
          |---------------------------
          |    ID: 1978
          |    Address: https://sample.domain.com/SampleService.xxx/basic
          |    Encoding: UTF-8
          |    Content-Type: text/xml
          |    Headers: {Accept=[*/*], SOAPAction=["WebDomain.Service/app"]}
          |    Payload: <soap:Envelope>
          |    </soap:Envelope>
          |2016-12-26 08:00:01,514 [jetty-1195 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip :  - ActiveSpaceCacheHandler:getServiceResponseFromCache(); exception: java.lang.Exception: getServiceResponseData: com.tibco.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey:Request.US
          |2016-12-26 08:00:01,624 [jetty-979 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip :  - ActiveSpaceCacheHandler:setServiceResponseInCache(); exception: com.test.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey
""".stripMargin

import java.io._
val pw = new PrintWriter(new File("log.txt"))
pw.write(s)
pw.close

// Now process the data
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.LongWritable
import org.apache.spark.{SparkContext, SparkConf}

// Recreate the SparkContext so that Hadoop's LongWritable is registered
// with Kryo and can be serialized
val conf = sc.getConf
sc.stop()
conf.registerKryoClasses(Array(classOf[org.apache.hadoop.io.LongWritable]))
val sc = new SparkContext(conf)

val hadoopConf = new Configuration(sc.hadoopConfiguration)
// Split records on a newline followed by "20", the start of each timestamp
hadoopConf.set("textinputformat.record.delimiter", "\n20")

val accessLogs = sc.newAPIHadoopFile("log.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
accessLogs.map(x => x._2.toString).zipWithIndex().collect().foreach(println)

Note that I'm using zipWithIndex just for debugging purposes. The output is:

    (2016-12-23 07:00:09,693 [jetty-51 - /app/service] INFO  org.apache.cxf.interceptor.LoggingOutInterceptor S:METHOD_NAME=METHNAME : WebAppSessionId= : ChannelSessionId=web-xxx-xxx-xxx : ClientIp=xxxxxxx :  - Outbound Message
    ---------------------------
        ID: 1978
        Address: https://sample.domain.com/SampleService.xxx/basic
        Encoding: UTF-8
        Content-Type: text/xml
        Headers: {Accept=[*/*], SOAPAction=["WebDomain.Service/app"]}
        Payload: 
        ,0)
    (16-12-26 08:00:01,514 [jetty-1195 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip :  - ActiveSpaceCacheHandler:getServiceResponseFromCache(); exception: java.lang.Exception: getServiceResponseData: com.tibco.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey:Request.US,1)
    (16-12-26 08:00:01,624 [jetty-979 - /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip :  - ActiveSpaceCacheHandler:setServiceResponseInCache(); exception: com.test.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey
    ,2)

Note that the index is the second field in the output. Also note that the "\n20" delimiter is consumed by the split, so every record after the first starts with 16- rather than 2016-.
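If the stripped prefix matters downstream, it can be re-attached after the split (a small sketch; it assumes every truncated record starts with a two-digit year followed by the rest of the timestamp):

// Sketch: restore the "20" that the "\n20" delimiter consumed from every
// record after the first one. The first record still has its full
// timestamp, so it does not match the truncated pattern and is left alone.
val fullRecords = accessLogs
  .map(_._2.toString)
  .map(rec => if (rec.matches("(?s)^\\d{2}-\\d{2}-\\d{2} .*")) "20" + rec else rec)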

I ran this code in an IBM Data Science Experience notebook running Scala 2.10 and Spark 1.6.

Chris Snow