Remove repeated content Java

Question

I have this text, and I need to filter out the repeated lines and words. I don't know if there's a better way than what I'm doing.

00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:00,413|03:50:25,600|ISDB|PERFEITAMENTE. EU
00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
00:00:02,218|00:00:02,398|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
00:00:02,398|00:00:02,759|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
00:00:02,759|00:00:03,274|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:03,274|00:00:04,357|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A?
00:00:04,357|00:00:05,259|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,259|00:00:05,414|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,414|00:00:05,775|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
00:00:05,775|00:00:06,677|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
00:00:06,677|00:00:06,858|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT?QUE
00:00:07,914|00:00:07,916|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT?QUE EU
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
00:00:08,997|00:00:09,178|ISDB|PAREDE, PARECE AT?QUE EU GOSTO

I'm using this code to put the lines in a HashSet so they aren't repeated.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashSet;
import java.util.Set;

public class Testecc {
   public static void main(String[] args) throws Exception {
      String filePath = "C://teste//teste1.txt";
      String input = null;
      // Buffered reader
      BufferedReader br = new BufferedReader(new FileReader(filePath));
      // Caution: this outer loop consumes the first two lines of the file
      // before the inner loop starts reading
      while ((input = br.readLine()) != null) {
         input = br.readLine();

         // FileWriter (creating the output file)
         FileWriter writer = new FileWriter("C://teste//teste.txt");
         // HashSet to eliminate duplicates
         Set<String> set = new HashSet<>();
         String line;
         // Adding lines to the HashSet
         while ((line = br.readLine()) != null) {
            String line1 = line.substring(0, 31); // timestamps and "ISDB|" prefix
            String line2 = line.substring(31);    // message text
            System.out.println(line);
            // add() returns false if the message was already seen
            if (set.add(line2)) {
               writer.append(line1 + line2 + "\n");
            }
         }
         writer.flush();
         System.out.println("Pronto!");
      }
   }
}

With this I removed the duplicated lines like this:

00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV�S
00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A�.
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A�. ELES
00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT� QUE
00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT� QUE EU
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT� QUE EU GOSTO

But I also need to remove the repeated words.

I'm really out of ideas.

How can I do that?

What is the rule determining which "duplicate" row gets retained? I don't see any obvious rule. – Tim Biegeleisen, Mar 15 '20 at 12:12
What exactly do you mean by "But I also need to remove the repeated words"? That you want to keep "COLOCARAM AS FOTOS COMO PAPEL DE" but not "COLOCARAM AS FOTOS", "COLOCARAM AS FOTOS COMO", ...? – dankito, Mar 15 '20 at 12:46
Please explain what you mean by 'repeated words', and what you need to do with the lines that contain them. – Adamsan, Mar 15 '20 at 13:17
Also, please use English comments; it helps communicate your intent. – Adamsan, Mar 15 '20 at 13:31
I need to keep just the last line of those, for example: 00:00:01,315|00:00:02,218|ISDB|BOBAS PARA 00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS 00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO 00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV�S – Rodrigo Marros, Mar 15 '20 at 14:15

Yevgen · Accepted Answer · 2020-03-16T12:31:09.890

1

Use a map that holds the lines grouped by a key. The key would be the beginning of the message, starting from the words you are interested in: say, its first 5 letters, or its first word as in the code below. Add each line to the map, and if a line is longer than the one already stored under its key, replace it.

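import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Map;

String filepath = "C://teste//teste1.txt"; // input file, as in the question

try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {

  // LinkedHashMap keeps the keys in insertion order
  final Map<String, String> map = new LinkedHashMap<>();

  br.lines().forEach(line -> {
        // Message text: everything after the last '|'
        String message = line.substring(line.lastIndexOf("|") + 1);
        if (message.isEmpty()) {
          return;
        }
        // Group by the first word of the message
        String key = message.split(" ")[0];
        if (map.get(key) == null) {
          map.put(key, line);
        } else if (map.get(key).length() < line.length()) {
          // Longer line for the same key: replace the stored one
          map.remove(key);
          map.put(key, line);
        }
      }
  );

  map.forEach((k, v) -> System.out.println(v));
}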

The above code will give you the following output.

00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
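
Grouping by the first word would also merge two unrelated captions that happen to start with the same word. As a variation (not part of the code above), here is a minimal sketch that treats a line as a repeat when its message is a prefix of a message that is already kept, and replaces a kept line when the new message extends it. The names `PrefixDedup`, `keepLongest` and `message` are made up for illustration, and the prefix check is only a heuristic: a short caption that happens to begin a later, unrelated one would still be merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrefixDedup {

    // Text after the last '|' (the whole line if it contains no '|').
    static String message(String line) {
        return line.substring(line.lastIndexOf('|') + 1);
    }

    // Keeps one line per growing caption: the first line that carries
    // the longest version of the message.
    static List<String> keepLongest(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String msg = message(line);
            if (msg.isEmpty()) {
                continue; // nothing to compare
            }
            boolean handled = false;
            for (int i = 0; i < kept.size() && !handled; i++) {
                String prev = message(kept.get(i));
                if (prev.startsWith(msg)) {
                    // Same or shorter version of a kept message: drop it
                    handled = true;
                } else if (msg.startsWith(prev)) {
                    // Longer version of a kept message: replace in place
                    kept.set(i, line);
                    handled = true;
                }
            }
            if (!handled) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "00:00:01,315|00:00:02,218|ISDB|PERFEITAMENTE. EU PEDI REVISTAS",
            "00:00:01,315|00:00:02,218|ISDB|BOBAS PARA",
            "00:00:02,218|00:00:02,398|ISDB|PERFEITAMENTE. EU PEDI REVISTAS",
            "00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS");
        // Prints the first PERFEITAMENTE line and the BOBAS PARA AMIGOS line
        keepLongest(sample).forEach(System.out::println);
    }
}
```

Every new line is compared against all kept lines, which is fine for a short file; for a long one, comparing only against the last few kept lines would probably be enough, since the repeats appear close together.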

edited Mar 16 '20 at 12:31

answered Mar 15 '20 at 15:06


Your logic which handles the map is superfluous, and you don't need to check for null before inserting. Just do a put (or see my answer). In addition, hash maps do NOT maintain insertion order, so the order in which your answer prints the logs would likely not be correct. – Tim Biegeleisen Mar 15 '20 at 15:08
@TimBiegeleisen You are right about the order, changed it to LinkedHashMap, thanks for the hint! – Yevgen Mar 15 '20 at 15:18
I guess we're almost there, but I got this error: "Exception in thread "main" java.lang.StringIndexOutOfBoundsException: begin 31, end 36, length 33" – Rodrigo Marros Mar 16 '20 at 11:40
You need to handle empty or short messages. Due to lack of data, I can't say what you need to do exactly, but I have tried to generalize the code a bit, please try it. – Yevgen Mar 16 '20 at 12:27
Ok, I will try, but not so sure how. – Rodrigo Marros Mar 16 '20 at 12:30
@RodrigoMarros See the updated code. – Yevgen Mar 16 '20 at 12:31
@Eugene Yeah, now it's ok. All working fine. Thanks a lot. Got a lot to study and learn. – Rodrigo Marros Mar 16 '20 at 13:58

score 0 · Answer 2 · answered Mar 15 '20 at 12:22

You could use the final post-pipe portion of each log line as a key, then insert each line into a LinkedHashMap, to remove duplicates:

String filePath = "C:/log.txt";
Map<String, String> logMap = new LinkedHashMap<>();
try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
    String input;
    while ((input = br.readLine()) != null) {
        // Key on the text after the last pipe; a later duplicate overwrites the stored line
        String key = input.replaceAll("^.*\\|", "");
        logMap.put(key, input);
    }
}

// Now print out the map minus duplicates
for (String line : logMap.values()) {
    System.out.println(line);
}

Instead of printing to the console, you could just as easily write the filtered log out to another file. Note that this approach would retain the last occurring line of each duplicate.
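
To make those two points concrete, here is a minimal sketch that writes the filtered lines to a second file. The class name `LogDedup`, the `dedupe` helper and taking both paths from the command line are assumptions for illustration. `Files.readAllLines` and `Files.write` use UTF-8 by default, so a file in another encoding would need an explicit `Charset` argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LogDedup {

    // Maps the text after the last '|' to the full line. A later line with
    // the same key replaces the stored value but keeps the key's position.
    static Map<String, String> dedupe(List<String> lines) {
        Map<String, String> logMap = new LinkedHashMap<>();
        for (String line : lines) {
            String key = line.replaceAll("^.*\\|", "");
            logMap.put(key, line);
        }
        return logMap;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java LogDedup <input file> <output file>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        // values() iterates in key insertion order
        Files.write(out, dedupe(Files.readAllLines(in)).values());
    }
}
```

Because `put` on an existing key replaces the value without moving the key, each retained line carries the timestamps of the last duplicate but appears at the position of the first one.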

The result was the same: 00:00:00,413|03:50:25,600|ISDB|PERFEITAMENTE. EU 00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS 00:00:01,315|00:00:02,218|ISDB|BOBAS PARA 00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS 00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO 00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV�S 00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A�. 00:00:06,677|00:00:06,858|ISDB|DISSO TROUXERAM ISSO A�. ELES — Rodrigo Marros, Mar 15 '20 at 14:16
