I need help reducing the cyclomatic complexity of the following code:

import numpy

def avg_title_vec(record, lookup):
    avg_vec = []
    word_vectors = []
    for tag in record['all_titles']:
        titles = clean_token(tag).split()
        for word in titles:
            if word in lookup.value:
                word_vectors.append(lookup.value[word])
    if len(word_vectors):
        avg_vec = [
            float(val) for val in numpy.mean(
                numpy.array(word_vectors),
                axis=0)]

    output = (record['id'],
              ','.join([str(a) for a in avg_vec]))
    return output

Example input:

record = {'all_titles': ['hello world', 'hi world', 'bye world']}

lookup.value = {'hello': [0.1, 0.2], 'world': [0.2, 0.3], 'bye': [0.9, -0.1]}

def clean_token(input_string):
    return input_string.replace("-", " ").replace("/", " ").replace(
    ":", " ").replace(",", " ").replace(";", " ").replace(
    ".", " ").replace("(", " ").replace(")", " ").lower()

So for all the words that are present in lookup.value, I am taking the average of their vectors.

Jean-François Fabre
futurenext110

1 Answer

This probably doesn't count as a proper answer, since in the end the cyclomatic complexity isn't reduced.

This variant is a little shorter, but I can't see any way to generalize it further, and it seems that you do need the ifs you have. Note that it returns None instead of an empty string when none of the words is found in lookup.value.

def avg_title_vec(record, lookup):
    word_vectors = [lookup.value[word] for tag in record['all_titles']
                    for word in clean_token(tag).split() if word in lookup.value]
    if not word_vectors:
        return (record['id'], None)
    avg_vec = [float(val) for val in numpy.mean(
               numpy.array(word_vectors),
               axis=0)]

    output = (record['id'],
              ','.join([str(a) for a in avg_vec]))
    return output

Your CC is 6, which is already good (values up to 10 are commonly considered simple). You can reduce the CC of each individual function by using helper functions, like

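import re

def get_tags(record):
    return list(record['all_titles'])

def sanitize_and_split_tags(tags):
    # Replace the separator characters with spaces, then lower-case and split.
    return [word for tag in tags for word in
            re.sub(r'[-/:,;.()]', ' ', tag).lower().split()]

def get_vectors_words(words, lookup):
    # lookup is passed in explicitly instead of being read from a global.
    return [lookup.value[word] for word in words if word in lookup.value]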

This lowers the average CC per function, but the overall CC stays the same or increases. I don't see how you can get rid of the ifs that check whether a word is in lookup.value or whether there are any vectors to work with.
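For what it's worth, here is a rough sketch of how the pieces could fit together, with every function kept small. It is an illustration under some assumptions, not a definitive rewrite: the names tokenize, known_vectors and column_means are made up for this example, lookup is stood in for by a SimpleNamespace with a value attribute, the example record is given an 'id' key, and the mean is computed in plain Python with zip instead of numpy.mean (so the last digits may differ slightly from numpy's result). Averaging column by column with zip yields an empty list when there are no vectors, so the explicit emptiness check is not needed in this version; the membership test is still there, just moved into a comprehension.

```python
import re
from types import SimpleNamespace

# Same separator characters that clean_token replaces with spaces.
_SEPARATORS = re.compile(r"[-/:,;.()]")


def tokenize(titles):
    """Lower-case each title, turn separators into spaces, split into words."""
    return [word
            for title in titles
            for word in _SEPARATORS.sub(" ", title).lower().split()]


def known_vectors(words, table):
    """Return the vectors of the words that are present in table, in order."""
    return [table[word] for word in words if word in table]


def column_means(vectors):
    """Average equal-length vectors component-wise; [] when there are none."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def avg_title_vec(record, lookup):
    words = tokenize(record["all_titles"])
    avg_vec = column_means(known_vectors(words, lookup.value))
    return record["id"], ",".join(str(a) for a in avg_vec)


# Stand-in for the real lookup object (assumption: it only needs .value).
lookup = SimpleNamespace(value={"hello": [0.1, 0.2],
                                "world": [0.2, 0.3],
                                "bye": [0.9, -0.1]})
# The 'id' value is made up for the example.
record = {"id": 1, "all_titles": ["hello world", "hi world", "bye world"]}
print(avg_title_vec(record, lookup))
```

As in the original function, a record with no known words gives an empty string as the second element of the tuple.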

Pavel Gurkov