
I'm trying to write a Python script that will remove duplicate strings in a text file. However, the de-duplication should only occur within each line.

For example, the text file might contain:

þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;**10 ABC\ABCD\ABCDE**
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;**12 EFG\EFG**;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;**12 EFG\EFG**

Thus, in the above example, the script should only remove the bold strings.
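
In other words, the desired output for the example above would presumably be (dropping each later duplicate together with the semicolon that precedes it):

þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ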

I've searched Stack Overflow and elsewhere to try to find a solution, but haven't had much luck. There seem to be many solutions that will remove duplicate lines, but I'm trying to remove duplicates within a line, line-by-line.

Update: Just to clarify - þ is the delimiter for each field, and ; is the delimiter for each item within each field. Within each line, I'm attempting to remove any duplicate strings contained between semicolons.

Update 2: Example edited to reflect that the duplicate value may not always follow directly after the first instance of the value.

a-goonie
  • Use a regular expression with a back-reference to detect a string followed by a copy of itself. – Barmar Mar 08 '17 at 23:40
  • Are your fields separated by a semicolon ; ? Because ABC and 123 repeat many times. – chapelo Mar 08 '17 at 23:56
  • And if the fields **are** separated by semi-colons, how should the `þ` characters be handled? Can duplicates occur *anywhere* in the line, or are you only interested in sequential repeats? Also: is the matching case-sensitive, and is whitespace significant? – ekhumoro Mar 09 '17 at 00:52
  • @ekhumoro `þ` delimits each field, and `;` delimits each item under the field. So, essentially, going line-by-line, a duplicate string between semicolons should be removed. – a-goonie Mar 09 '17 at 01:02
  • You still need to answer whether or not whitespace matters: 10 ABC\ABCD\ABCDE; 10 ABC\ABCD\ABCDE aren't duplicates, since the whitespace makes them distinct. – gregory Mar 09 '17 at 01:04
  • @gregory Apologies. The whitespace was an error in the formatting of my example. – a-goonie Mar 09 '17 at 01:07
  • @a-goonie. Case-sensitivity? – ekhumoro Mar 09 '17 at 01:12
  • @ekhumoro Probably not case-sensitive. The duplicates being weeded out are file paths, so it's possible there will be duplicate file paths that aren't necessarily identical in terms of case. – a-goonie Mar 09 '17 at 01:15

2 Answers


@Prune's answer gives the idea but it needs to be modified like this:

input_file = r"""þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ"""

lines = input_file.split("\n")

for line in lines:
    seen_item = []
    for item in line.split(";"):
        # keep only the first occurrence of each item; þ delimiters are always kept
        if item not in seen_item or item == "þ":
            seen_item.append(item)
    print(";".join(seen_item))
Sangbok Lee
import re

with open('file', 'r') as f:
    lines = f.readlines()

for line in lines:
    # replace an item that is immediately followed by a copy of itself with a single copy
    print(re.sub(r'([^;]+;)(\1)', r'\1', line))

Read the file line by line; then replace the adjacent duplicates using re.sub.
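
For example, applied to the second sample line from the question (which contains an adjacent duplicate), the substitution collapses the repeated 12 EFG\EFG item. A quick check, with the line supplied as a literal rather than read from a file:

import re

line = r"þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ"
print(re.sub(r'([^;]+;)(\1)', r'\1', line))
# þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;þ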

gregory
  • Thank you. That's a really great solution. The only issue I have now is removing duplicates that don't necessarily follow immediately after the first instance of the duplicate value. For example: `;þ;MEP.0002.087836;þ;;þ;15 Engineering\03 Personal Folders\ENGBIBLES\Cranes\Liebherr\LT1220;15 Engineering\03 Personal Folders\ENGBIBLES\Cranes\Liebherr\LG1300;15 Engineering\03 Personal Folders\ENGBIBLES\Cranes\Liebherr\LT1220;15 Engineering\03 Personal Folders\ENGBIBLES\Cranes\Liebherr\LG1300;` – a-goonie Mar 10 '17 at 00:12
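
A sketch of one way to handle that follow-up case, tracking the items already seen on each line instead of relying on a back-reference ('file' is the same placeholder path as above, and items are compared case-insensitively as discussed in the comments):

with open('file', 'r') as f:
    for line in f:
        seen = set()
        kept = []
        for item in line.rstrip('\n').split(';'):
            key = item.lower()
            # þ and empty fields are structural, so keep them unconditionally;
            # otherwise keep only the first (case-insensitive) occurrence
            if item in ('þ', '') or key not in seen:
                kept.append(item)
                seen.add(key)
        print(';'.join(kept))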