0

I'm using eBPF to trace context switches on a bare metal Ubuntu 20.04 server. For some reason, the timestamp from the traces jumps in value during the job execution, and I'm not sure why. My eBPF program to trace is as follows:

!/usr/bin/python
#
# trace context switch
import time
start_time = time.time()
from bcc import BPF

print("imports: " + str(time.time() - start_time))

# load BPF program
b = BPF(text="""
TRACEPOINT_PROBE(sched, sched_switch) {
    // cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
    bpf_trace_printk("%d\\n", args->prev_prio);
    return 0;
}
""")

print("BPF program set: " + str(time.time() - start_time))

# data structures
unique_pid_counts = {}
timeline = []
value_errors = 0
time_diffs = []

"""
Collects data from the context switch trace and stores it in the data structures unique_task_counts and timeline
"""

print("Started tracing: " + str(time.time() - start_time))

while True:
    try:
        # get current time
        current_time = time.perf_counter()
        # capture data from trace
        data = b.trace_fields() # (task, pid, cpu, flags, ts, msg)
        # add task to timeline
        timeline.append(data)
        # output
        # if counter % 2000 == 0:
        time_diffs.append((data, current_time - data[4]))
    except ValueError:
        value_errors += 1
        continue
    except KeyboardInterrupt:
        # feedback to the user
        print("Finished tracing: " + str(time.time() - start_time) + " seconds")
        if value_errors:
            print("value errors: " + str(value_errors))
        # write the differences to a file
        with open("Data/time_diffs.txt", "w") as fl:
            fl.write("\n".join(str(x) for x in time_diffs))
        # process the data
        from data_process import tracing_data_process
        tracing_data_process(timeline)
        # finished tracing
        print("Finished processing timeline, total run: " + str(round(time.time() - start_time, 3)) + " seconds")
        exit()

I store the information from the traces in a file, part of which is displayed here. The format is ((task name, process ID, CPU number, flags, timestamp, message), python performance counter - timestamp). The two traces showing the jump are bolded:

((b'kworker/u32:0', 62419, 13, b'd...', 5032.112517, b'120'), 344.77145957000084)
((b'<idle>', 0, 13, b'd...', 5032.112689, b'120'), 344.77129126600084)
((b'kworker/u32:0', 62419, 13, b'd...', 5032.112701, b'120'), 344.77128291699955)
((b'<idle>', 0, 13, b'd...', 5032.114244, b'120'), 344.7697435989994)
((b'kworker/u32:0', 62419, 13, b'd...', 5032.114271, b'120'), 344.7697202419995)
((b'<idle>', 0, 1, b'd...', 5032.114313, b'120'), 344.76968189200034)

((b'sshd', 45772, 1, b'd...', 5032.114459, b'120'), 344.7695395889996)
((b'', 0, 14, b'd...', 5377.338941, b'120'), -0.45493316400006734)

((b'kworker/u32:4', 21613, 14, b'd...', 5377.338969, b'120'), -0.4549573809999856)
((b'<idle>', 0, 1, b'd...', 5377.33901, b'120'), -0.4549946520000958)
((b'sshd', 45772, 1, b'd...', 5377.339155, b'120'), -0.455133453999224)
((b'<idle>', 0, 5, b'd...', 5377.346264, b'120'), -0.4622384320000492)
((b'rcu_sched', 11, 5, b'd...', 5377.346278, b'120'), -0.462244209000346)
((b'<idle>', 0, 14, b'd...', 5377.346296, b'120'), -0.4622557519996917)
((b'kworker/u32:4', 21613, 14, b'd...', 5377.346316, b'120'), -0.4622718830005397)
((b'<idle>', 0, 14, b'd...', 5377.346673, b'120'), -0.46262520700020104)

I'm not sure why this behavior is occurring. It may be an issue with my code (which is why I posted on SO), or it may be an issue with Linux behavior. I would appreciate any suggestions for modifications. This code is being used for research, so I'm open to experimentation.

Update + More details: I've noticed this same timestamp jump across multiple runs and servers. Every time, the difference between the timestamp and the performance counter decreases. This jump takes a bit of time to occur, and in one case, I observed more than one jump (both times reducing the difference). After the timestamp was normalized, then I noticed minor fluctuations around the expected values. After a restart the timestamp values remain constant and expected.

Edit 2: After the computer is restarted, the timestamp behaves as expected. The behavior persists after the first run thought.

Jawand S.
  • 148
  • 10
  • Timestamp changes may happen due to Timezone. You can convert timestamp to datetime format and verify. I might be wrong, but it happened to me once. – user3503711 Aug 29 '22 at 19:11
  • @user3503711 In this case, I don't think the timestamp is a Unix timestamp because it's not large enough, so I'm not sure how to convert it to DateTime. Thank you for the idea, though. – Jawand S. Aug 29 '22 at 19:38

0 Answers0