Python 2.7 - Intersect Unicode Dictionary with Unicode List

Question

I'm trying to use sets and the intersection method to find which elements in a list of Unicode file paths contain specific characters. The goal is to replace these characters with others, so I've made a dictionary in which each key is a character to be replaced and each value is its replacement. When I try to compute the intersection of the paths with the characters to be replaced, however, the result is an empty set. What am I doing wrong? I have this working with for loops, but I'd like to make it as efficient as possible. Feedback is appreciated!

Code:

# -*- coding: utf-8 -*-

import os

def GetFilepaths(directory):
    """
    This function generates all file names in a directory tree using os.walk.
    It returns a list of file paths.
    """
    file_paths = []
    for root, directories, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)
    return file_paths

# dictionary of umlauts (key) and their replacements (value)
umlautDictionary = {u'Ä': 'Ae',
                    u'Ö': 'Oe',
                    u'Ü': 'Ue',
                    u'ä': 'ae',
                    u'ö': 'oe',
                    u'ü': 'ue'
                    }

# get file paths in root directory and subfolders
filePathsList = GetFilepaths(u'C:\\Scripts\\Replace Characters\\Umlauts')
print set(filePathsList).intersection(umlautDictionary)

have you tried `set(filePathsList.intersection(umlautDictionary))` ? — Onilol, Oct 21 '15 at 12:14
just tried, get an error: 'list' object has no attribute "intersection" — Crazy Otto, Oct 21 '15 at 12:21

score 1 · Accepted Answer · answered Oct 21 '15 at 12:14

filePathsList is a list of strings:

[u'file1Ä.txt', u'file2Ä.txt', ...]

umlautDictionary is being used as a sequence of keys:

{u'Ä':..., ...}

The intersection is empty because the string u'Ä' doesn't appear in your list of strings. You are comparing u'Ä' to u'file1Ä.txt', which are not equal. Set intersection won't check for substrings.
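To illustrate the difference, here is a minimal sketch with made-up sample paths (the names `paths`, `umlauts`, `matching`, and `matching_by_chars` are hypothetical, and the umlauts are written as escapes so the example does not depend on the file's encoding). It contrasts the whole-element intersection with two ways of testing for characters inside each path:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample data: u'\xc4' is Ä, u'\xfc' is ü, u'\xdf' is ß.
paths = [u'file1\xc4.txt', u'file2.txt', u'gr\xfc\xdfe.txt']
umlauts = {u'\xc4': u'Ae', u'\xfc': u'ue'}

# Set intersection compares whole elements, so no path equals a single umlaut.
exact = set(paths).intersection(umlauts)

# Substring test: keep the paths that contain at least one dictionary key.
matching = [p for p in paths if any(ch in p for ch in umlauts)]

# Same result using sets, by intersecting the characters of each path
# with the dictionary keys.
matching_by_chars = [p for p in paths if set(p).intersection(umlauts)]
```

Here `exact` is empty, while both `matching` and `matching_by_chars` contain the first and third path. The per-path set approach only works because every key is a single character.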

Ned Batchelder


He can't check for key/value ? – Onilol Oct 21 '15 at 12:15
"Set intersection won't check for substrings." That explains the issue, thank you. – Crazy Otto Oct 21 '15 at 12:23

score 1 · Answer 2 · answered Oct 21 '15 at 12:34

Since you want to replace the Unicode characters in the file names with your own replacements, I would suggest the following approach:

umlautDictionary = {u'\xc4': u'Ae'}
filePathsList = [u'file1Ä.txt', u'file2Ä.txt']

words = [w.replace(key, value) for key, value in umlautDictionary.iteritems() for w in filePathsList]

Output:

[u'file1Ae.txt', u'file2Ae.txt']

You can store the Unicode characters as escapes, for example u'\xc4' for u'Ä'; this keeps the keys independent of the source file's encoding.
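Note that with several keys in the dictionary, a nested comprehension of this shape produces one entry per (key, path) pair, so the output mixes replaced and unreplaced paths. A minimal sketch that instead applies every replacement to each path in turn is given here; `replace_umlauts`, `umlaut_map`, and the sample paths are hypothetical names chosen for illustration:

```python
# -*- coding: utf-8 -*-
# Hypothetical mapping, written with escapes: Ä Ö Ü ä ö ü.
umlaut_map = {
    u'\xc4': u'Ae', u'\xd6': u'Oe', u'\xdc': u'Ue',
    u'\xe4': u'ae', u'\xf6': u'oe', u'\xfc': u'ue',
}

def replace_umlauts(text, mapping=umlaut_map):
    """Return text with every key of mapping replaced by its value."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

# Made-up sample paths; the last one contains no umlauts.
paths = [u'file1\xc4.txt', u'B\xfccher\\\xd6l.txt', u'plain.txt']

# Exactly one output entry per input path.
cleaned = [replace_umlauts(p) for p in paths]
```

The result has the same length as the input, and paths without umlauts pass through unchanged. `items()` is used rather than `iteritems()` so the sketch runs on both Python 2.7 and Python 3.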

Vaulstein


nice suggestion, but it doesn't quite work. the umlauts are indeed replaced, but the output list contains both the paths with the umlauts replaced and the paths without the umlauts replaced. – Crazy Otto Oct 22 '15 at 07:33
also, your code example gives me the following output: [u'file1\xc3\x84.txt', u'file2\xc3\x84.txt'] – Crazy Otto Oct 22 '15 at 09:13
Worked fine for me. Have you added the other characters to your dictionary, for example u'\xc3' (u'Ã'), each with a replacement? Your dictionary should look like: umlautDictionary = {u'\xc4': u'Ae', u'\xc3': 'Replacement for Ã here', u'\x84': ' Replacement for __ here' } Basically, every **non-ascii character's representation** is the **key**, and its **corresponding replacement** is the **value**. – Vaulstein Oct 23 '15 at 05:48
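One plausible explanation for the u'\xc3\x84' output reported in the comments, assuming the example's source bytes were UTF-8 but were interpreted as Latin-1: the UTF-8 encoding of u'Ä' is the two bytes C3 84, and decoding those bytes as Latin-1 yields two separate characters. A small sketch of that round trip (the variable names are hypothetical):

```python
# -*- coding: utf-8 -*-
a_umlaut = u'\xc4'  # Ä, code point U+00C4

# UTF-8 encodes Ä as two bytes, C3 84.
utf8_bytes = a_umlaut.encode('utf-8')

# Decoding those bytes with the wrong codec gives two characters.
misread = utf8_bytes.decode('latin-1')

# Reversing the wrong decode recovers the original character.
repaired = misread.encode('latin-1').decode('utf-8')
```

If this is the cause, saving the script as UTF-8 with a matching coding declaration (as in the question's first line), or writing the keys as escapes, would address it more directly than adding u'\xc3' and u'\x84' as dictionary keys.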
