
I need to process many thousands of strings (each string is an element in a list, imported from records in a SQL table).

Each string consists of a number of phrases separated by a consistent delimiter. I need to 1) eliminate duplicate phrases in the string, and 2) sort the remaining phrases, then return the deduplicated, sorted phrases as a delimited string.

This is what I've conjured:

def dedupe_and_sort(list_element, delimiter):

    list_element = delimiter.join(set(list_element.split(f'{delimiter}')))
    return( delimiter.join(sorted(list_element.split(f'{delimiter}'))) )

string_input = 'e\\\\a\\\\c\\\\b\\\\a\\\\b\\\\c\\\\a\\\\b\\\\d'
string_delimiter = "\\\\"

output = dedupe_and_sort(string_input, string_delimiter)

print(f"Input: {string_input}")
print(f"Output: {output}")

Output is as follows:

Input: e\\a\\c\\b\\a\\b\\c\\a\\b\\d
Output: a\\b\\c\\d\\e

Is this the most efficient approach, or is there a faster alternative?

evand

1 Answer


You can avoid splitting twice by not joining in the first step: sorted() accepts the set directly. There is also no need to use an f-string when passing delimiter to split().

def dedupe_and_sort(list_element, delimiter):

    distinct_elements = set(list_element.split(delimiter))
    return delimiter.join(sorted(distinct_elements))
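To check this version against the input from the question, and to apply it across a whole list of strings, something like the following should work. This is a minimal sketch: the `records` list is made-up sample data standing in for the rows imported from the SQL table, and its second entry is not from the question.

```python
def dedupe_and_sort(list_element, delimiter):
    # Split once, drop duplicates with a set, then sort and re-join.
    distinct_elements = set(list_element.split(delimiter))
    return delimiter.join(sorted(distinct_elements))


string_delimiter = "\\\\"  # two literal backslashes, as in the question

# Hypothetical sample data standing in for the records from the SQL table.
records = [
    'e\\\\a\\\\c\\\\b\\\\a\\\\b\\\\c\\\\a\\\\b\\\\d',
    'z\\\\y\\\\z',
]

# Apply the function to every element of the list.
cleaned = [dedupe_and_sort(record, string_delimiter) for record in records]

for record in cleaned:
    print(record)
```

The first printed line matches the output shown in the question (`a\\b\\c\\d\\e`), and the second is `y\\z`.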
shriakhilc
  • Thanks. Also, thanks for the note re using an f-string. I recently ran into trouble with not using an f-string to pass a parameter to a df call and couldn't for the life of me figure out where the problem was until I passed the parameter as an f-string. If I find the code I'll post it here - probably my own doing. – evand Jan 03 '22 at 19:41
  • It would be better to post a new question if it is unrelated to the code in this one. `delimiter` is already a string, which is why there was no need to convert or format it in this case. – shriakhilc Jan 03 '22 at 19:44
  • If I need to remove whitespace from each delimited string before using set() to remove duplicates, what would be the most efficient method? `distinct_elements = [x.strip() for x in list_element.split(delimiter)]` – evand Jan 03 '22 at 22:07
  • There's no need to use a list comprehension, which creates an intermediate list. You can pass a generator expression directly to set(), like so: `distinct_elements = set(x.strip() for x in list_element.split(delimiter))` – shriakhilc Jan 03 '22 at 23:12
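Putting the last comment together with the answer, the whitespace-stripping variant might look like this. It is a sketch only: the function name `dedupe_strip_and_sort` and the comma-delimited sample string are invented for illustration and do not come from the thread.

```python
def dedupe_strip_and_sort(list_element, delimiter):
    # Strip each phrase as it is fed into the set, so "a" and " a " collapse
    # into one entry without building an intermediate list.
    distinct_elements = set(x.strip() for x in list_element.split(delimiter))
    return delimiter.join(sorted(distinct_elements))


# Hypothetical input with inconsistent whitespace around the phrases.
result = dedupe_strip_and_sort(" b , a,b,  a ,c", ",")
print(result)  # a,b,c
```

The set comprehension `{x.strip() for x in list_element.split(delimiter)}` is an equivalent spelling of the same step.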