Generate a key for Kafka messages that have none, in two topics, in order to merge / join them

Question

Given some raw data in an input topic poids_garmin_brut:

Durée,Poids,Variation,IMC,Masse grasse,Masse musculaire squelettique,Masse osseuse,Masse hydrique,
" 14 Fév. 2022",
06:37,72.1 kg,0.3 kg,22.8,26.3 %,29.7 kg,3.5 kg,53.8 %,
" 13 Fév. 2022",
06:48,72.4 kg,0.2 kg,22.9,25.4 %,29.8 kg,3.6 kg,54.4 %,
" 12 Fév. 2022",
06:17,72.2 kg,0.0 kg,22.8,25.3 %,29.7 kg,3.6 kg,54.5 %,

I managed to create two topics, poids_garmin_split_date and poids_garmin_split_valeursPoids, with this small method:

public StreamsBuilder extraire(StreamsBuilder builder) {
   KStream<Void,String> streamBrut = builder.stream("poids_garmin_brut");

   // Lines that start with " " carry dates
   streamBrut.filter((key, value) -> value.startsWith("\" ")).to("poids_garmin_split_date");

   // Lines that neither start with " " nor contain "Durée" (the CSV header) are weight data.
   streamBrut.filter((key, value) -> !value.startsWith("\" ") && !value.contains("Durée")).to("poids_garmin_split_valeursPoids");
   return builder;
}

The topic poids_garmin_split_date now contains:

" 14 Fév. 2022",
" 13 Fév. 2022",
" 12 Fév. 2022",

and poids_garmin_split_valeursPoids:

06:37,72.1 kg,0.3 kg,22.8,26.3 %,29.7 kg,3.5 kg,53.8 %,
06:48,72.4 kg,0.2 kg,22.9,25.4 %,29.8 kg,3.6 kg,54.4 %,
06:17,72.2 kg,0.0 kg,22.8,25.3 %,29.7 kg,3.6 kg,54.5 %,

Both topics have null keys (that is, no key), but I need to add one to each of them in order to link their records pairwise:

123541, " 14 Fév. 2022",
123542, " 13 Fév. 2022",
123543, " 12 Fév. 2022",

and

123541, 06:37,72.1 kg,0.3 kg,22.8,26.3 %,29.7 kg,3.5 kg,53.8 %,
123542, 06:48,72.4 kg,0.2 kg,22.9,25.4 %,29.8 kg,3.6 kg,54.4 %,
123543, 06:17,72.2 kg,0.0 kg,22.8,25.3 %,29.7 kg,3.6 kg,54.5 %,

for example, so that I can merge these topics into a single one that would be:

123541, " 14 Fév. 2022",06:37,72.1 kg,0.3 kg,22.8,26.3 %,29.7 kg,3.5 kg,53.8 %,
123542, " 13 Fév. 2022",06:48,72.4 kg,0.2 kg,22.9,25.4 %,29.8 kg,3.6 kg,54.4 %,
123543, " 12 Fév. 2022",06:17,72.2 kg,0.0 kg,22.8,25.3 %,29.7 kg,3.6 kg,54.5 %,

which I can then work with.

If this is the right way to do things (I'm a beginner with Kafka), how can I do it?

With a map operation? A transform?

Following your answer, @OneCricketeer, I tried this:

KStream<Void,String> streamBrut = builder.stream("poids_garmin_brut");

// Lines that start with " " carry dates
final LongAccumulator compteurDate = new LongAccumulator(Long::sum, 0L);

streamBrut.filter((key, value) -> value.startsWith("\" "))
   .map((key, value) -> {
      compteurDate.accumulate(1L);
      return new KeyValue<>(compteurDate.toString(), value);
   })
   .to("poids_garmin_split_date");

KStream<String, String> streamSplitDate = builder.stream("poids_garmin_split_date");

// Lines that neither start with " " nor contain "Durée" (the CSV header) are weight data.
final LongAccumulator compteurValeursPoids = new LongAccumulator(Long::sum, 0L);

streamBrut.filter((key, value) -> !value.startsWith("\" ") && !value.contains("Durée"))
   .map((key, value) -> {
      compteurValeursPoids.accumulate(1L);
      return new KeyValue<>(compteurValeursPoids.toString(), value);
   })
   .to("poids_garmin_split_valeursPoids");

KStream<String, String> streamSplitValeursPoids = builder.stream("poids_garmin_split_valeursPoids");

streamSplitDate.join(streamSplitValeursPoids, 
   (String date, String valeursPoids) -> date + valeursPoids, 
   JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)))
   .to("poids_garmin_join_date_valeurs");

which results in a topic poids_garmin_join_date_valeurs with this content:

" 14 Fév. 2022",06:37,72.1 kg,0.3 kg,22.8,26.3 %,29.7 kg,3.5 kg,53.8 %,
" 13 Fév. 2022",06:48,72.4 kg,0.2 kg,22.9,25.4 %,29.8 kg,3.6 kg,54.4 %,
" 12 Fév. 2022",06:17,72.2 kg,0.0 kg,22.8,25.3 %,29.7 kg,3.6 kg,54.5 %,
" 11 Fév. 2022",05:54,72.2 kg,0.1 kg,22.8,25.6 %,29.7 kg,3.5 kg,54.3 %,
" 10 Fév. 2022",06:14,72.3 kg,0.0 kg,22.8,25.9 %,29.7 kg,3.5 kg,54.1 %,
" 9 Fév. 2022",06:06,72.3 kg,0.5 kg,22.8,26.3 %,29.7 kg,3.5 kg,53.8 %,
" 8 Fév. 2022",07:14,71.8 kg,0.7 kg,22.7,26.3 %,29.6 kg,3.5 kg,53.8 %,

But I don't know how acceptable this way of doing things is.

score 1 · Answer 1 · answered Feb 21 '22 at 15:36

There are no attributes shared across these two topics for you to join on, but you could consume both in a simple loop and add a simple counter. Or do the same with filter().map().to().
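The counter idea can be illustrated outside Kafka. The sketch below is plain Java with hypothetical names (`CounterJoin`, `keyByCounter`, `join`): it numbers the records of each list in arrival order, then joins the two keyed maps on that number. It only pairs the right lines if both sides keep the same relative order, which is an assumption about how the topics are produced and consumed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CounterJoin {
    /** Assigns 1, 2, 3, ... as keys, in arrival order. */
    public static Map<Long, String> keyByCounter(List<String> values) {
        Map<Long, String> keyed = new LinkedHashMap<>();
        long counter = 0;
        for (String value : values) {
            keyed.put(++counter, value);
        }
        return keyed;
    }

    /** Inner join on the key: concatenates left and right values that share a key. */
    public static Map<Long, String> join(Map<Long, String> left, Map<Long, String> right) {
        Map<Long, String> joined = new LinkedHashMap<>();
        left.forEach((key, leftValue) -> {
            String rightValue = right.get(key);
            if (rightValue != null) {
                joined.put(key, leftValue + rightValue);
            }
        });
        return joined;
    }
}
```

A record that has no counterpart on the other side (for instance a trailing values line) is simply dropped by this inner join.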

The easier solution would be to modify your original producer to iterate over your file in pairs of lines and produce single kg+date events, rather than sending a separate event for every line or having two related events, with no shared information between them, in entirely separate topics.

You also don't need the file header line in your topic.
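A minimal sketch of the pairing idea on the producer side, in plain Java with no Kafka dependency: it walks the exported lines, skips the CSV header, and merges each date line with the measurement line that follows it. The names `PairLines` and `flatten` are hypothetical, and the sketch assumes, as in the sample data, that date lines start with `" ` and that the header contains `Durée`. Each returned string could then be sent as a single record.

```java
import java.util.ArrayList;
import java.util.List;

public class PairLines {
    // Header marker ("Duree" with an acute accent), escaped so the source stays ASCII.
    private static final String HEADER_MARKER = "Dur\u00e9e";

    /** Merges each date line with the measurement line that follows it. */
    public static List<String> flatten(List<String> lines) {
        List<String> merged = new ArrayList<>();
        String pendingDate = null; // last date line seen, waiting for its values

        for (String line : lines) {
            if (line.isBlank() || line.contains(HEADER_MARKER)) {
                continue; // skip empty lines and the CSV header
            }
            if (line.startsWith("\" ")) {
                pendingDate = line; // date line: hold it until the values arrive
            } else if (pendingDate != null) {
                merged.add(pendingDate + line); // values line: emit date + values
                pendingDate = null;
            }
        }
        return merged;
    }
}
```

With this approach no generated key is needed at all, since each event already carries both the date and the measurements.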

OneCricketeer

The `poids_garmin_brut` topic is the input, but I can't change its content. It comes from the weight export of the _Garmin Connect_ website, available if your personal scale recorded your weight for you. The data is downloadable as a `csv` file in which, alas, each measurement record is spread over two lines... Hence the work of trying to flatten them. – Marc Le Bihan Feb 21 '22 at 16:17
How can I iterate over a topic by pairs of lines? That sounds interesting. – Marc Le Bihan Feb 21 '22 at 16:31
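One way to picture iterating by pairs is a small stateful step that sees one record at a time, remembers the most recent date line, and emits a merged record when the matching values line arrives. The sketch below is plain Java, not Kafka Streams API: `LinePairer` and `accept` are hypothetical names, and it assumes the records arrive in file order on a single partition, with the header before the first date line. In a Streams topology, the same logic would likely belong in a stateful processor backed by a state store, so that the pending date survives restarts.

```java
import java.util.Optional;

public class LinePairer {
    private String pendingDate; // date line waiting for its values line

    /** Feeds one record; returns a merged line once a date/values pair is complete. */
    public Optional<String> accept(String line) {
        if (line.startsWith("\" ")) {
            pendingDate = line; // date line: keep it for the next record
            return Optional.empty();
        }
        if (pendingDate == null) {
            return Optional.empty(); // header or orphan values line: nothing to emit
        }
        String merged = pendingDate + line;
        pendingDate = null;
        return Optional.of(merged);
    }
}
```

A consumer loop would call `accept` on every record value and forward only the non-empty results.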
