
I am running a simple batch job in Flink.

The dashboard says the job finished, but it shows that only about 30,000 of roughly 220,000 records were processed.

Otherwise there are no errors, and the output looks as expected.

How can I check why the job finished prematurely?

Here is the source code:

package com.otorio.zeeklogprocessor;

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.core.fs.FileSystem.WriteMode;
import org.apache.flink.api.java.DataSet;
import com.google.gson.*;

import java.lang.reflect.Type;

import com.otorio.zeeklogprocessor.RegulatedZeekConnRecord;

/**
 * Skeleton for a Flink Batch Job.
 *
 * <p>For a tutorial on how to write a Flink batch application, check the
 * tutorials and examples on the <a href="https://flink.apache.org/docs/stable/">Flink Website</a>.
 *
 * <p>To package your application into a JAR file for execution,
 * change the main class in the POM.xml file to this class (simply search for 'mainClass')
 * and run 'mvn clean package' on the command line.
 */
public class BatchJob {

    public static void main(String[] args) throws Exception {

        // set up the batch execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> loglines = env.readTextFile("/Users/ben.reich/software/Flink/zeek/conn.log");
        DataSet<RegulatedZeekConnRecord> jasonized = loglines.map(new Jsonizer());
        DataSet<String> aggregated = jasonized.groupBy("key").reduce(new ReductionLogic()).map(new OutputBuilder());
        aggregated.writeAsText("/Users/ben.reich/software/Flink/zeek/graphdata.log", WriteMode.OVERWRITE);


        // execute program
        env.execute("Zeek conn.log Processor");
    }

    // Deserializes the log record
    public static final class Jsonizer implements MapFunction<String, RegulatedZeekConnRecord> {

        private static GsonBuilder gb = new GsonBuilder();
        private static Gson gson;
        private static RegulatedZeekConnRecord logObject;

        public RegulatedZeekConnRecord map(String record) {
            // Initialize gson with customized deserializer
            if (gson == null) {
                gb.registerTypeAdapter(RegulatedZeekConnRecord.class, new ConnLogDeserializer());
                gson = gb.create();
            }
            logObject = gson.fromJson(record, RegulatedZeekConnRecord.class);
            return logObject;
        }
    }

    public static class ReductionLogic implements ReduceFunction<RegulatedZeekConnRecord> {

        @Override
        public RegulatedZeekConnRecord reduce(RegulatedZeekConnRecord pre, RegulatedZeekConnRecord current) {
            pre.key = current.key;
            pre.ts = current.ts;
            pre.id_orig_h = current.id_orig_h;
            pre.id_orig_p = current.id_orig_p;
            pre.id_resp_h = current.id_resp_h;
            pre.id_resp_p = current.id_resp_p;
            pre.proto = current.proto;
            pre.conn_state = current.conn_state;
            pre.history = current.history;
            pre.service = current.service;
            pre.orig_pkts = current.orig_pkts + pre.orig_pkts;
            pre.orig_ip_bytes = current.orig_ip_bytes + pre.orig_ip_bytes;
            pre.resp_pkts = current.resp_pkts + pre.resp_pkts;
            pre.resp_ip_bytes = current.resp_ip_bytes + pre.resp_ip_bytes;
            pre.missed_bytes = current.missed_bytes + pre.missed_bytes;
            return pre;
        }
    }


    public static class OutputBuilder implements MapFunction<RegulatedZeekConnRecord, String> {
        private static Gson gson = new Gson();

        @Override
        public String map(RegulatedZeekConnRecord record) {
            String jsonTarget = "";
            jsonTarget = gson.toJson(record);
            return jsonTarget;
        }
    }

    public static class ConnLogDeserializer implements JsonDeserializer<RegulatedZeekConnRecord> {

        @Override
        public RegulatedZeekConnRecord deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context) throws JsonParseException {
            JsonObject jsonobj = json.getAsJsonObject();
            RegulatedZeekConnRecord rec = new RegulatedZeekConnRecord(jsonobj);
            return rec;

        }
    }
}
  • Is the output complete (and the dashboard wrong), or is the dashboard correct and some output is missing? – David Anderson Dec 05 '21 at 17:32
  • A few general comments: first, don't use static members for your functions. These are multi-threaded, and avoiding threading issues (even if you _know_ the class is thread safe) is good practice. Second, Jsonizer should extend RichMapFunction and create the Gson object (transient) in its open() call (see the sketch after these comments). Finally, do you know how many reduced records you expect? If so, that's what I'd be checking, in the output graphdata.log file(s). – kkrugler Dec 05 '21 at 18:58
  • It seems that the dashboard is right. Output is missing. – Ben Dec 07 '21 at 08:33
  • @kkrugler Thank you for the tips. Do you mean the problem may be that I create too many Gson objects? One thing I noticed is that the Outside JVM Memory metric shows the task manager used 129 MB out of 129 MB. – Ben Dec 07 '21 at 08:37
  • @Ben - no, just that it's easy to get bit by concurrency issues when you use static members. Flink provides RichXXX versions of all operators so that you can initialize per-task objects in the `open()` method. – kkrugler Dec 08 '21 at 14:46
  • @kkrugler Thank you so much for the help. I changed to RichMapFunction but still have an issue: graphdata.log comes out as a directory with multiple files in it. Is that normal? I saw you mention it could be files and not one file. I am not sure whether my job produces all the records, as I have no idea how many should be output. – Ben Dec 14 '21 at 11:49
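
For reference, here is a minimal sketch of the pattern kkrugler describes, assuming the same RegulatedZeekConnRecord and ConnLogDeserializer classes as in the question: Jsonizer extends RichMapFunction, and the (non-static, transient) Gson instance is built once per parallel task in open().

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public static final class Jsonizer extends RichMapFunction<String, RegulatedZeekConnRecord> {

    // transient: created per task in open(), never serialized with the function
    private transient Gson gson;

    @Override
    public void open(Configuration parameters) {
        gson = new GsonBuilder()
                .registerTypeAdapter(RegulatedZeekConnRecord.class, new ConnLogDeserializer())
                .create();
    }

    @Override
    public RegulatedZeekConnRecord map(String record) {
        return gson.fromJson(record, RegulatedZeekConnRecord.class);
    }
}

On the multi-file output asked about in the last comment: writeAsText runs at the sink's parallelism, so with parallelism greater than 1 the target path becomes a directory with one file per subtask. That is normal Flink behavior, not a sign of lost records; to get a single file, set the sink's parallelism to 1, e.g. aggregated.writeAsText(path, WriteMode.OVERWRITE).setParallelism(1).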
