
The goal is for the variable androidApps to hold the original dataset, androidApps1 the dataset with rows containing missing data removed, and androidApps2 the dataset with duplicates removed. The issue arises when I run the removeDuplicates() function and save its result to androidApps2: the dataset stored in androidApps1 is also updated and ends up identical to androidApps2, and I can't figure out why.

Here's my code:

Here I open and read my dataset file:

from csv import reader

openFile = open(r'C:\Users\Jason Minhas\Profitable App Profiles for the App Store and Google Play Markets\rawData\googleplaystore.csv', encoding="utf8")
readFile = reader(openFile)
androidApps = list(readFile)

This function checks whether a row contains a blank field and returns True or False. It is used in the next function.

def hasBlank(row):
    for colIndex in range(0,len(row)):
        while row[colIndex] != '':
            break
        else:
            return True
    return False
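
As a side note, the loop above appears to behave the same as a plain membership test on the row. A minimal sketch, assuming every field is a string as produced by csv.reader; the name hasBlankSimple is hypothetical and not part of the original code:

```python
def hasBlankSimple(row):
    # True when at least one field in the row is the empty string.
    return '' in row


# Hypothetical sample rows, not taken from the real dataset.
print(hasBlankSimple(['Photo Editor', '', '4.1']))     # True
print(hasBlankSimple(['Photo Editor', 'ART', '4.1']))  # False
```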

This function finds apps with missing datapoints and returns a dataset that only has rows with all the columns filled.

def removeRowsWithMissingData(dataset, hasHeader=True):
    cleanDataset = []
    if hasHeader:
        start = 1
    start = 0

    for row in dataset[start:]:
        if hasBlank(row):
            pass  
        else:
            cleanDataset.append(row)

    return cleanDataset

androidApps1 = removeRowsWithMissingData(androidApps)

This is a function that checks for duplicates and prints the number of duplicates, uniques and total.

def dupeCount(dataset, index):
    UNQItems = []
    dupeItems = []
    for item in dataset:
        if item[index] in UNQItems:
            dupeItems.append(item[index])
        else:
            UNQItems.append(item[index])     
    print('Unique Apps = ' + str(len(UNQItems)))
    print('Duplicate Apps = ' + str(len(dupeItems)))
    print('Total Apps = ' + str(len(dupeItems)+len(UNQItems)))

dupeCount(androidApps1,0)
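
The same three numbers can probably be obtained without the repeated `in` checks on a list, which scan the list on every row. A hedged sketch using collections.Counter; the name dupeCountFast and the sample rows are hypothetical, and it returns the counts instead of printing them:

```python
from collections import Counter


def dupeCountFast(dataset, index):
    # Count how many times each value appears in the chosen column.
    counts = Counter(row[index] for row in dataset)
    unique = len(counts)          # number of distinct values
    total = sum(counts.values())  # number of rows examined
    duplicates = total - unique   # every occurrence beyond the first
    return unique, duplicates, total


# Hypothetical sample rows, not taken from the real dataset.
rows = [['Maps'], ['Notes'], ['Maps'], ['Maps']]
print(dupeCountFast(rows, 0))  # (2, 2, 4)
```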

This function removes the duplicates, keeping the row with the highest review count for each app, and returns a clean dataset.

def removeDuplicates(dataset, nameIndex, reviewCountIndex, hasHeader=True):
    if hasHeader:
        start = 1
    else:
        start = 0

    # create a temp dataset so I don't alter the original
    tempDataset = dataset

    tempDataset[start:] = sorted(tempDataset[start:], key=lambda l: int(l[reviewCountIndex]), reverse=True)

    # create UNQ and dupe list
    UNQApp = []
    cleanDataset = []

    #Iterate through apps and keep only first app which will have highest review count since it's sorted.
    for row in tempDataset[start:]:
        appName = row[nameIndex]
        if appName not in UNQApp:
            UNQApp.append(appName)
            cleanDataset.append(row)

    tempDataset[start:] = cleanDataset
    return tempDataset

androidApps2 = removeDuplicates(androidApps1,0,3,hasHeader=True)

dupeCount(androidApps2,0)
  • That `hasBlank` is bizarre. Couldn't you just do `'' in row`? – user2357112 Mar 19 '20 at 00:54
  • `tempDataset = dataset` is not a copy. – user2357112 Mar 19 '20 at 00:55
  • Python does pass by reference for mutable objects. So when you call the function you are actually sending the reference to the object and not making a new list in the function. Try calling the function as removeDuplicates(androidApps1.copy(), 0, 3, hasHeader=True) – Rashid 'Lee' Ibrahim Mar 19 '20 at 00:56
  • @Rashid'Lee'Ibrahim: Python parameter passing and variable semantics are the same for all objects, mutable or immutable. See https://nedbatchelder.com/text/names.html – user2357112 Mar 19 '20 at 00:59
  • @user2357112supportsMonica Your link says almost exactly what I said. Look at the section titled Assignment. That's what a pass by reference does. – Rashid 'Lee' Ibrahim Mar 19 '20 at 01:01
  • @Rashid'Lee'Ibrahim no, **it absolutely does not**. Python uses a single evaluation strategy, which certainly doesn't depend on the *type of object* being passed, which is *neither* pass by reference nor pass by value. That link you are referring to **does not describe pass by reference at all**. – juanpa.arrivillaga Mar 19 '20 at 01:04
  • `tempDataset = dataset` does not create a new object. It merely assigns a new name, `tempDataset` to **the same object** being referred to by `dataset` – juanpa.arrivillaga Mar 19 '20 at 01:05
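
The point made in the comments can be illustrated with a small sketch. It assumes the intent is to leave the caller's list untouched. The function name removeDuplicatesCopy and the sample rows are hypothetical. It relies on sorted() returning a new list and builds the result in a fresh list instead of assigning back into the argument. The outer list is new, but the row lists themselves are still shared, which is fine here because no row is modified.

```python
def removeDuplicatesCopy(dataset, nameIndex, reviewCountIndex, hasHeader=True):
    start = 1 if hasHeader else 0

    # sorted() returns a new list, so the caller's list is not reordered.
    rows = sorted(dataset[start:], key=lambda r: int(r[reviewCountIndex]), reverse=True)

    seen = set()
    cleanDataset = list(dataset[:start])  # carry the header row over, if any
    for row in rows:
        # The first occurrence has the highest review count because of the sort.
        if row[nameIndex] not in seen:
            seen.add(row[nameIndex])
            cleanDataset.append(row)
    return cleanDataset


# Hypothetical sample data, not taken from the real dataset.
apps = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['Maps', 'TRAVEL', '4.3', '100'],
    ['Maps', 'TRAVEL', '4.3', '250'],
    ['Notes', 'TOOLS', '4.0', '40'],
]

alias = apps                    # a second name for the same list object
print(alias is apps)            # True

deduped = removeDuplicatesCopy(apps, 0, 3)
print(deduped is apps)          # False
print(len(apps), len(deduped))  # 4 3
```

Called as removeDuplicatesCopy(androidApps1, 0, 3), a function along these lines should leave androidApps1 as it was and return the deduplicated rows as a separate list.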
