0

I'm trying to use sets and the intersection method to find which elements in a list of Unicode file paths contain specific characters. The goal is to replace these characters with other characters, so I've made a dictionary in which each key is the character to be replaced and each value is its replacement. When I try to generate the intersection of the paths with the characters to be replaced, however, the result is an empty set. What am I doing wrong? I have this working with for loops, but I'd like to make it as efficient as possible. Feedback is appreciated!

Code:

# -*- coding: utf-8 -*-

import os

def GetFilepaths(directory):
    """
    This function will generate all file names in a directory tree using os.walk.
    It returns a list of file paths.
    """
    file_paths = []
    for root, directories, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)
    return file_paths

# dictionary of umlauts (key) and their replacements (value)
umlautDictionary = {u'Ä': 'Ae',
                    u'Ö': 'Oe',
                    u'Ü': 'Ue',
                    u'ä': 'ae',
                    u'ö': 'oe',
                    u'ü': 'ue'
                    }

# get file paths in root directory and subfolders
filePathsList = GetFilepaths(u'C:\\Scripts\\Replace Characters\\Umlauts')
print set(filePathsList).intersection(umlautDictionary)
Crazy Otto

2 Answers

1

filePathsList is a list of strings:

[u'file1Ä.txt', u'file2Ä.txt', ...]

umlautDictionary is being used as a sequence of keys:

{u'Ä':..., ...}

The intersection is empty because the string u'Ä' doesn't appear in your list of strings. You are comparing u'Ä' to u'file1Ä.txt', which are not equal. Set intersection won't check for substrings.
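To make the difference concrete, here is a minimal sketch contrasting whole-element intersection with a per-path substring test. The sample paths and the shortened dictionary are hypothetical stand-ins for the real data, and the code avoids Python 2-only syntax so that it should also run on Python 3.

```python
# -*- coding: utf-8 -*-

# Hypothetical sample data standing in for the real path list.
paths = [u'file1\xc4.txt', u'file2.txt', u'gr\xfc\xdfe.txt']
umlauts = {u'\xc4': u'Ae', u'\xfc': u'ue'}  # shortened version of the dictionary

# Set intersection compares whole elements for equality, so no path matches a key.
whole_matches = set(paths).intersection(umlauts)  # empty set

# A substring test has to look inside each path.
paths_with_umlauts = [p for p in paths if any(ch in p for ch in umlauts)]

# Alternative that still uses sets: intersect the characters of each path
# with the dictionary keys (works because every key is a single character).
paths_with_umlauts_2 = [p for p in paths if set(p) & set(umlauts)]
```

Both list comprehensions select the first and third sample paths; the plain intersection stays empty, as in the question.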

Ned Batchelder
1

Since you want to replace the Unicode characters in the file names with characters of your choice, I would suggest the following approach:

umlautDictionary = {u'\xc4': u'Ae'}
filePathsList = [u'file1Ä.txt', u'file2Ä.txt']

words = [w.replace(key, value) for key, value in umlautDictionary.iteritems() for w in filePathsList]

Output:

[u'file1Ae.txt', u'file2Ae.txt']

You would have to write the Unicode characters as escape sequences, in the form u'\xc4' for u'Ä' and so on.
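With more than one key in the dictionary, the comprehension above produces one entry per (key, path) pair rather than one entry per path. A possible variant, sketched here under the assumption that every replacement should be applied to every path, loops over the dictionary inside a helper. The names umlaut_map and replace_umlauts and the sample paths are hypothetical.

```python
# -*- coding: utf-8 -*-

# Same mapping as in the question, written with escapes (illustrative name).
umlaut_map = {u'\xc4': u'Ae', u'\xd6': u'Oe', u'\xdc': u'Ue',
              u'\xe4': u'ae', u'\xf6': u'oe', u'\xfc': u'ue'}

def replace_umlauts(path, mapping=umlaut_map):
    """Return path with every key of mapping replaced by its value."""
    for old, new in mapping.items():
        path = path.replace(old, new)
    return path

# Hypothetical sample paths.
paths = [u'file1\xc4.txt', u'B\xfccher/\xd6l.txt', u'plain.txt']

# One output entry per input path, with all replacements applied.
cleaned = [replace_umlauts(p) for p in paths]

# Single-pass alternative: translate() accepts a dict that maps code points
# to replacement strings.
table = {ord(old): new for old, new in umlaut_map.items()}
cleaned_translate = [p.translate(table) for p in paths]
```

Both approaches give [u'file1Ae.txt', u'Buecher/Oel.txt', u'plain.txt'] for the sample input. The translate() version scans each path once instead of once per key.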

Vaulstein
  • Nice suggestion, but it doesn't quite work. The umlauts are indeed replaced, but the output list contains both the paths with the umlauts replaced and the paths without the umlauts replaced. – Crazy Otto Oct 22 '15 at 07:33
  • Also, your code example gives me the following output: [u'file1\xc3\x84.txt', u'file2\xc3\x84.txt'] – Crazy Otto Oct 22 '15 at 09:13
  • Worked fine for me. Have you added the other characters to your dictionary, for example u'\xc3' (u'Ã')? Your dictionary should look like: umlautDictionary = {u'\xc4': u'Ae', u'\xc3': 'Replacement for à here', u'\x84': 'Replacement for __ here'}. Basically, every non-ASCII character's representation is the key and its corresponding replacement is the value. – Vaulstein Oct 23 '15 at 05:48
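
One possible reading of the u'file1\xc3\x84.txt' output, offered as an assumption and not confirmed anywhere in the thread: the two characters \xc3 and \x84 are exactly the UTF-8 bytes of 'Ä' interpreted one byte per character, which can happen when the literal is read under a different encoding than the one the file was saved in. The sketch below only demonstrates that relationship.

```python
# -*- coding: utf-8 -*-

a_umlaut = u'\xc4'                       # the single character 'Ä' (U+00C4)

# Encoded as UTF-8, 'Ä' becomes two bytes.
utf8_bytes = a_umlaut.encode('utf-8')    # b'\xc3\x84'

# Decoding those bytes one byte per character (Latin-1) yields two
# characters, matching the sequence seen in the comment above.
misread = utf8_bytes.decode('latin-1')   # u'\xc3\x84'

# Decoding with the right codec recovers the original character.
recovered = utf8_bytes.decode('utf-8')   # u'\xc4'
```

If that is what happened, a dictionary keyed on u'\xc4' would not match the misread text, which would explain why the umlauts were left in place.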