
I'm trying to write a Python script that will remove duplicate strings in a text file. However, the de-duplication should only occur within each line.

For example, the text file might contain:

þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;**10 ABC\ABCD\ABCDE**
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;**12 EFG\EFG**;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;**12 EFG\EFG**

Thus, in the above example, the script should only remove the bold strings.
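
In other words, the desired output for the example above would presumably be (dropping each later duplicate together with the semicolon that precedes it):

þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ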

I've searched Stack Overflow and elsewhere to try to find a solution, but haven't had much luck. There seem to be many solutions that will remove duplicate lines, but I'm trying to remove duplicates within a line, line-by-line.

Update: Just to clarify - þ is the delimiter for each field, and ; is the delimiter for each item within each field. Within each line, I'm attempting to remove any duplicate strings contained between semicolons.

Update 2: Example edited to reflect that the duplicate value may not always follow directly after the first instance of the value.

a-goonie
  • Use a regular expression with a back-reference to detect a string followed by a copy of itself. – Barmar Mar 08 '17 at 23:40
  • Are your fields separated by a semicolon ; ? Because ABC and 123 repeat many times. – chapelo Mar 08 '17 at 23:56
  • And if the fields **are** separated by semi-colons, how should the `þ` characters be handled? Can duplicates occur *anywhere* in the line, or are you only interested in sequential repeats? Also: is the matching case-sensitive, and is whitespace significant? – ekhumoro Mar 09 '17 at 00:52
  • @ekhumoro `þ` delimits each field, and `;` delimits each item under the field. So, essentially, going line-by-line, a duplicate string between semicolons should be removed. – a-goonie Mar 09 '17 at 01:02
  • You still need to answer whether or not whitespace matters: 10 ABC\ABCD\ABCDE; 10 ABC\ABCD\ABCDE aren't duplicates, since the whitespace makes them distinct. – gregory Mar 09 '17 at 01:04
  • @gregory Apologies. The whitespace was an error in the formatting of my example. – a-goonie Mar 09 '17 at 01:07
  • @a-goonie. Case-sensitivity? – ekhumoro Mar 09 '17 at 01:12
  • @ekhumoro Probably not case-sensitive. The duplicates being weeded out are file paths, so it's possible there will be duplicate file paths that aren't necessarily identical in terms of case. – a-goonie Mar 09 '17 at 01:15

2 Answers


@Prune's answer gives the idea but it needs to be modified like this:

input_file = r"""þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ"""

lines = input_file.split("\n")

for line in lines:
    seen_item = []
    for item in line.split(";"):
        # keep only the first occurrence of each item; þ delimiters are always kept
        if item not in seen_item or item == "þ":
            seen_item.append(item)
    print(";".join(seen_item))
Sangbok Lee
import re

with open('file', 'r') as f:
    lines = f.readlines()

for line in lines:
    # replace an item that is immediately followed by a copy of itself with a single copy
    print(re.sub(r'([^;]+;)(\1)', r'\1', line))

Read the file line by line; then replace the adjacent duplicates using re.sub.
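
For example, applied to the second sample line from the question (which contains an adjacent duplicate), the substitution collapses the repeated 12 EFG\EFG item. A quick check, with the line supplied as a literal rather than read from a file:

import re

line = r"þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ"
print(re.sub(r'([^;]+;)(\1)', r'\1', line))
# þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;þ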

gregory
  • Thank you. That's a really great solution. The only issue I have now is removing duplicates that don't necessarily follow immediately after the first instance of the duplicate value. For example: `;þ;MEP.0002.087836;þ;;þ;15 Engineering\03 Personal Folders\ENGBIBLES\Cranes\Liebherr\LT1220;15 Engineering\03 Personal Folders\ENGBIBLES\Cranes\Liebherr\LG1300;15 Engineering\03 Personal Folders\ENGBIBLES\Cranes\Liebherr\LT1220;15 Engineering\03 Personal Folders\ENGBIBLES\Cranes\Liebherr\LG1300;` – a-goonie Mar 10 '17 at 00:12
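
A sketch of one way to handle that follow-up case, tracking the items already seen on each line instead of relying on a back-reference ('file' is the same placeholder path as above, and items are compared case-insensitively as discussed in the comments):

with open('file', 'r') as f:
    for line in f:
        seen = set()
        kept = []
        for item in line.rstrip('\n').split(';'):
            key = item.lower()
            # þ and empty fields are structural, so keep them unconditionally;
            # otherwise keep only the first (case-insensitive) occurrence
            if item in ('þ', '') or key not in seen:
                kept.append(item)
                seen.add(key)
        print(';'.join(kept))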