
I have gone through the following questions and pages looking for an answer to my problem, but they did not solve it:

log from spark udf to driver

Logger is not working inside spark UDF on cluster

https://www.javacodegeeks.com/2016/03/log-apache-spark.html

We are using Spark in standalone mode, not on YARN. I have defined a custom logger "myLogger" in log4j.properties and replicated the same file on both the driver and the executors. The file is as follows:

log4j.rootLogger=INFO, Console_Appender, File_Appender

log4j.appender.Console_Appender=org.apache.log4j.ConsoleAppender
log4j.appender.Console_Appender.Threshold=INFO
log4j.appender.Console_Appender.Target=System.out
log4j.appender.Console_Appender.layout=org.apache.log4j.PatternLayout
log4j.appender.Console_Appender.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

log4j.appender.File_Appender=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.File_Appender.Threshold=INFO
log4j.appender.File_Appender.File=/opt/spark_log/app_log.txt
log4j.appender.File_Appender.RollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.File_Appender.TriggeringPolicy=org.apache.log4j.rolling.SizeBasedTriggeringPolicy
log4j.appender.File_Appender.RollingPolicy.FileNamePattern=/opt/spark_log/app_log.%d{MM-dd-yyyy}.%i.txt.gz
log4j.appender.File_Appender.RollingPolicy.ActiveFileName=/opt/spark_log/app_log.txt
log4j.appender.File_Appender.TriggeringPolicy.MaxFileSize=1000
log4j.appender.File_Appender.layout=org.apache.log4j.PatternLayout
log4j.appender.File_Appender.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c - %m%n

log4j.logger.myLogger=INFO,File_Appender
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

In my Java application, I declare the loggers using the following line:

private static Logger logger = LogManager.getLogger("myLogger");
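
For illustration, here is a minimal sketch of the kind of UDF in which I am trying to log (the class name, UDF name, and logic below are placeholders rather than my actual code; only the logger lookup is the same as above):

import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.apache.spark.sql.api.java.UDF1;

public class UpperCaseUdf implements UDF1<String, String> {

    // The logger is looked up by name, so it is resolved again in whichever
    // JVM (driver or executor) loads this class; the Logger object itself is
    // never serialized with the UDF.
    private static final Logger logger = LogManager.getLogger("myLogger");

    @Override
    public String call(String value) {
        // This should go to whatever appenders "myLogger" is bound to in the
        // log4j.properties of the JVM that executes the UDF.
        logger.info("UDF received value: " + value);
        return value == null ? null : value.toUpperCase();
    }
}

The UDF is registered on the SparkSession with something like spark.udf().register("toUpper", new UpperCaseUdf(), DataTypes.StringType); and then used in DataFrame/SQL expressions.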

I am running the application using the following command:

spark-submit --driver-java-options "-Dlog4j.configuration=file:///opt/spark/spark-2.4.4/conf/log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/spark-2.4.4/conf/log4j.properties" --class com.test.SparkApp file:///opt/test/cepbck/test.person.app-0.0.7.jar

When I run the application on the cluster, the logs from the main driver class appear fine in both the console and the log file. But as soon as control passes into a UDF, no logs are printed. I have also opened the log files on the executors, but they do not contain any of my log statements either. Please help me in this regard.

supriyo_basak
  • Basic troubleshooting: when you run the job in local mode, do you see any log from your UDF? When you run the job in cluster mode, do you see any alert from Log4J initialization in `stderr` / `stdout` (captured as YARN "log" files)? – Samson Scharfrichter May 12 '20 at 12:57
  • You are absolutely right. I ran the application in local mode, and even then the logs from my UDF are printed to the console by my custom logger, but they are not written to the file. What could be the reason for this? – supriyo_basak May 13 '20 at 05:12
  • I am not using YARN; as I said, I am using Spark in standalone mode. Could you please tell me where to see those Log4J initialization alerts in that case? – supriyo_basak May 13 '20 at 05:47
  • Ah, my bad. Inspect the scratch directory used by the Worker daemons on their local disks -- not the one used by the jobs to store their temp files, but the one where the Worker stores the JAR (or Python script) for the job and then dumps their output. – Samson Scharfrichter May 13 '20 at 08:45
  • There is a Log4J property to enable "verbose logging of its internals" to stderr, but I can't remember the syntax right now. Google will help you (but don't use the V2 syntax, use the V1.2) – Samson Scharfrichter May 13 '20 at 08:48
  • @SamsonScharfrichter, I have resolved the issue. Thanks for pointing me in the right direction. Please see below for my resolution. – supriyo_basak May 13 '20 at 12:50

1 Answer


I have resolved the logging issue. I found that even in local mode, the logs from the UDFs were not being written to the Spark log file, even though they were displayed in the console. That narrowed the problem down to the UDFs perhaps not being able to access the file system. Then I found the following question:

How to load local file in sc.textFile, instead of HDFS

It contained no solution to my problem, but it did contain the hint that when referring to files from inside Spark, we have to refer to them with a "file:///" path from the root of the file system, as seen by the executing JVM. So I made a change to the log4j.properties file on the driver:

log4j.appender.File_Appender.File=file:///opt/spark_log/app_log.txt

It was originally:

log4j.appender.File_Appender.File=/opt/spark_log/app_log.txt

I also increased the maximum size of the log file, because otherwise the logs rolled out of the current file very quickly:

log4j.appender.File_Appender.TriggeringPolicy.MaxFileSize=1000000

After this, I executed the application once again and found that the logs were being written to the log file. I then replicated the changes to the executors and ran the application in cluster mode. Since then, the logs have been generated on the executors as well.
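
For reference, after both changes the File_Appender section of log4j.properties on every node looks like this (the rest of the file is unchanged from the question):

log4j.appender.File_Appender=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.File_Appender.Threshold=INFO
log4j.appender.File_Appender.File=file:///opt/spark_log/app_log.txt
log4j.appender.File_Appender.RollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.File_Appender.TriggeringPolicy=org.apache.log4j.rolling.SizeBasedTriggeringPolicy
log4j.appender.File_Appender.RollingPolicy.FileNamePattern=/opt/spark_log/app_log.%d{MM-dd-yyyy}.%i.txt.gz
log4j.appender.File_Appender.RollingPolicy.ActiveFileName=/opt/spark_log/app_log.txt
log4j.appender.File_Appender.TriggeringPolicy.MaxFileSize=1000000
log4j.appender.File_Appender.layout=org.apache.log4j.PatternLayout
log4j.appender.File_Appender.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c - %m%n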

supriyo_basak