0
import pandas as pd
import re
import numpy as np

data= [['Empty','CMI-General Liability | 05-9362','Empty','Empty'],['Empty','Central Operations','Empty','Empty'],['Empty','Alarm Central 05-8642','Empty','Empty'],['Empty','Market 466','Empty','Empty'],['Empty','Talent, Experience','Empty','Empty'],['Empty','Food Division','Empty','Empty'],['Empty','Quality WMCC','Empty','Empty'],['Empty','Modular Execution Team | 01-9700','Empty','Empty'],['Empty','US Central Operations','Empty','Empty'],['Empty','CE - Engineering - US','Empty','Empty'],['Empty','Fresh, Freezer & Cooler - 18-8110','Empty','Empty'],['Empty','9701','Empty','Empty'],['Empty','Contact Center','Empty','Empty'],['Empty','Central Operations','Empty','Empty'],['Empty','US Central Operations','Empty','Empty'],['Empty','Private Brands GM - 01-8683','Empty','Empty']]
df2=pd.DataFrame(data,columns=['JobTitle','Department','TrueDepartment','Dept_Function'])
data5 = [[1,'TRUCKING, MARCY, NY','Empty','Empty'],[2,'TRUCKING-GREENVILLE,TN','Empty','Empty'],[3,'DC 40, HOPE MILLS, NC','Empty','Empty'],[4,'TRUCKING, SHARON SPRINGS','Empty','Empty'],[5,'DISP PAULS VALLEY OK FDC','Empty','Empty'],[6,'COLDWATER, MI','Empty','Empty'],[7,'AMERICOLD LOGISTICS','Empty','Empty'],[8,'DFW3N FORT WORTH FC WHS.COM','Empty','Empty'],[9,'PCCC CURRENTLY BEING REVIEWED','Empty','Empty'],[466,'Springfield, MO','Empty','Empty'],[8110,'Fresh Dept','Empty','Empty'],[8642,'Security','Security & Compliance','Empty'],[8683,'General Merchandise','Empty','Empty'],[9362,'General Liability','Empty','Empty'],[9700,'Execution Team','Empty','Empty'],[9701,'Produce TN','Empty','Empty']]

df5=pd.DataFrame(data5,columns=['Dept_Nbr','Dept_Desc_good','Dept_Desc_better','Dept_Abrv'])

is dataframe 2

JobTitle    Department                   TrueDepartment    Dept_Function

            CMI-General Liability | 05-9362     
            Central Operations      
            Alarm Central 05-8642       
            Market 466      
            Talent, Experience      
            Food Division       
            Quality WMCC        
            Modular Execution Team | 01-9700        
            US Central Operations       
            CE - Engineering - US       
            Fresh, Freezer & Cooler - 18-8110       
            9701        
            Contact Center      
            Central Operations      
            US Central Operations       
            Private Brands GM - 01-8683          

df5 is dataframe5

Dept_Nbr    Dept_Desc_good                Dept_Desc_better     Dept_Abrv
1           TRUCKING, MARCY, NY                     
2           TRUCKING-GREENVILLE,TN              
3           DC 40, HOPE MILLS, NC                   
4           TRUCKING, SHARON SPRINGS            
5           DISP PAULS VALLEY OK FDC            
6           COLDWATER, MI                       
7           AMERICOLD LOGISTICS           
8           DFW3N FORT WORTH FC - WHS.COM       
9           PCCC CURRENTLY BEING REVIEWED       
466         Springfield, MO     
8110        Fresh Dept      
8642        Security                      Security & Compliance 
8683        General Merchandise                                    
9362        General Liability       
9700        Execution Team      
9701        Produce TN      

Desired result after running code

JobTitle Department                         TrueDepartment  

         CMI-General Liability | 05-9362    General Liability   
         Central Operations     
         Alarm Central 05-8642              Security & Compliance   
         Market 466     
         Talent, Experience     
         Food Division      
         Quality WMCC       
         Modular Execution Team | 01-9700   Execution Team  
         US Central Operations      
         CE - Engineering - US      
         Fresh, Freezer & Cooler - 18-8110  Fresh Dept  
         9701                               Produce TN  
         Contact Center     
         Central Operations     
         US Central Operations      
         Private Brands GM - 01-8683        General Merchandise     

Current code:

import pandas as pd
import re


numbers = df5['Dept_Nbr'].tolist()
df5['Dept_Nbr'] = [int(i) for i in df5['Dept_Nbr']]
df5.set_index('Dept_Nbr')
for n in numbers:
    for i in df5.index:
        if n in df2.loc[i, 'Department']:
            if df5.at[int(n), 'Dept_Desc_better']: #if values exists
                df2.at[i, 'TrueDepartment'] = df5.at(int(n), 'Dept_Desc_better')
            else:
                df2.at[i, 'TrueDepartment'] = df5.at(int(n), 'Dept_Desc_good')

getting errorTypeError: 'in ' requires string as left operand, not int'
I think I should try and change n to a string type?

Also i have to figure out how to find the substring in "Department" column from df2 that either follow a hypen or are the only number in the cell (i.e. 9701). I probably need to use regular expressions for that (re). for the first department in df2 it would find the string "9362" and match that with the Dept_Nbr in df5 and write "General Liability" to the TrueDepartment column. df5 is actually has Dept_Nbr's that go consecutively from 1 to over 10000.

newest error after making Mr. Armstrong's suggested changes to my code... only getting an error when used on my actual full dataframe not on the example dataframes I gave.

KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in 
get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in 
pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in 
pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Dept_Nbr'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-14-89dd44993593> in <module>()
----> 1 numbers = df5['Dept_Nbr'].tolist()
      2 df5['Dept_Nbr'] = [int(i) for i in df5['Dept_Nbr']]
      3 df5 = df5.set_index('Dept_Nbr')  #<-- need to actually set df5 to the new index
      4 
      5 for n in numbers:

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in 
__getitem__(self, key)
   2683             return self._getitem_multilevel(key)
   2684         else:
-> 2685             return self._getitem_column(key)
   2686 
   2687     def _getitem_column(self, key):

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in 
_getitem_column(self, key)
   2690         # get column
   2691         if self.columns.is_unique:
-> 2692             return self._get_item_cache(key)
   2693 
   2694         # duplicate columns & possible reduce dimensionality

~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in 
_get_item_cache(self, item)
   2484         res = cache.get(item)
   2485         if res is None:
-> 2486             values = self._data.get(item)
   2487             res = self._box_item_values(item, values)
   2488             cache[item] = res

~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3063                 return self._engine.get_loc(key)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066 
   3067         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in 
pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in 
pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Dept_Nbr'
Royale_w_cheese
  • 297
  • 2
  • 9
  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [Minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) applies here. We cannot effectively help you until you post your MCVE code and accurately describe the problem. We should be able to paste your posted code into a text file and reproduce the problem you described. (1) Hard-code your data frames into your code; (2) Provide the full error message, so we don't have to guess the location and traceback details. – Prune Jul 06 '18 at 16:44
  • From your text, I'm guessing that you got the error on the `if n in` line. `if in ` is illegal. Without your debugging trace, without your MCVE to make one of our own, we can't be sure of the problem ... yet. – Prune Jul 06 '18 at 16:47
  • What is the logic behind what has a true department and what doesn't? Most are blank after joining the two dataframes. – dasvootz Jul 06 '18 at 16:47
  • there will only be a TrueDepartment value written to the column if the Department column of df2 contains a department number that is also in df5 in the column called "Dept_Nbr" – Royale_w_cheese Jul 06 '18 at 16:55
  • here is complete error message: TypeError Traceback (most recent call last) in () 5 for n in numbers: 6 for i in df5.index: ----> 7 if n in df2.loc[i, 'Department']: 8 if df5.at[int(n), 'Dept_Desc_AD']: #if values exists 9 df2.at[i, 'TrueDepartment'] = df5.at(int(n), 'Dept_Desc_AD') TypeError: 'in ' requires string as left operand, not int – Royale_w_cheese Jul 06 '18 at 16:56
  • i will start hardcoding the dataframes in now. – Royale_w_cheese Jul 06 '18 at 17:09
  • ok it should be good to go so you can copy and past hard-coded dataframes now Prune – Royale_w_cheese Jul 06 '18 at 18:06
  • Any help is GREATLY APPREACIATED! – Royale_w_cheese Jul 06 '18 at 18:06

1 Answers1

1

To start - your dataframes above do not match your dataframe construction. It took me forever to figure out why 9362 != 9362 :-)

Here so something to consider:

numbers = df5['Dept_Nbr'].tolist()
df5['Dept_Nbr'] = [int(i) for i in df5['Dept_Nbr']]
df5 = df5.set_index('Dept_Nbr')  #<-- need to actually set df5 to the new index

for n in numbers:
    for i in range(len(df5.index)):  #<-- you want to iterate through the number of elements not the elements themselves
        if str(n) == df2.loc[i, 'Department'][-4:]: #<-- convert n to str and slice df2 string for the last 4 chars
            if df5.loc[n, 'Dept_Desc_better'] != "Empty":  #<-- you're actually checking against a string, not a NaN
                df2.loc[i, 'TrueDepartment'] = df5.loc[n, 'Dept_Desc_better']  #<-- use .loc not .at
            else:
                df2.loc[i, 'TrueDepartment'] = df5.loc[n, 'Dept_Desc_good']

df2 = df2.replace(to_replace="Empty", value="")   #<-- your desired output has '' rather than 'Empty' - so replaced.

The assumption is that the df2 description under Department always has the Dept_Nbr at the end and it is always 4 characters long. And here's the output:

df2

   JobTitle                         Department         TrueDepartment  Dept_Function
0              CMI-General Liability | 05-9632
1                           Central Operations
2                        Alarm Central 05-8642  Security & Compliance
3                                   Market 466
4                           Talent, Experience
5                                Food Division
6                                 Quality WMCC
7             Modular Execution Team | 01-9700         Execution Team
8                        US Central Operations
9                        CE - Engineering - US
10           Fresh, Freezer & Cooler - 18-8110             Fresh Dept
11                                        9701             Produce TN
12                              Contact Center
13                          Central Operations
14                       US Central Operations
15                 Private Brands GM - 01-8683    General Merchandise

By the way - the code does work, there isn't an error in index line 0 - this is where your dataframes are different. I also kept as much of your code as possible, but I would guess that there is a better way to iterate.

Bill Armstrong
  • 1,615
  • 3
  • 23
  • 47
  • Thank you so much for your help. I really do appreciate it! Sorry about the incorrect hardcoding of the first dept 9362. – Royale_w_cheese Jul 07 '18 at 11:45
  • This code is working perfectly on my example dataframes. The actual dataframes are just much longer. When apply this code to the actual dataframes, I get the following error: – Royale_w_cheese Jul 07 '18 at 13:28
  • This code is working perfectly on my example dataframes. Thank you. Can you think of a reason why when applied to the full dataframes (only difference is that they have many more rows), i get an error. i can't list the whole error without enough characters. KeyError Traceback (most recent call last) ~/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3062 try: -> 3063 return self._engine.get_loc(key) 3064 except KeyError: – Royale_w_cheese Jul 07 '18 at 13:34
  • KeyError: 'Dept_Nbr' – Royale_w_cheese Jul 07 '18 at 13:35
  • Can you edit your question and at the end put the entire error trace back. There’ll be enough room there. Thanks. – Bill Armstrong Jul 07 '18 at 14:47
  • ok hopefully that helps and again thank you so much for your help. I am truly learning a great deal. – Royale_w_cheese Jul 07 '18 at 16:19
  • The only other thing i would add that could possibly cause a problem (that I can think of anyway) is the fact that there are a couple numbers skipped in the Dept_Nbr column of df5. There are no empty rows, but at least two numbers are missing from dept numbers 1 to over 10,0000. For example the list of dept numbers might look like this [1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,21,22,23,24,25,26,27] but of course going all the way to over 10,000. Thanks again. – Royale_w_cheese Jul 07 '18 at 16:24
  • let me correct that....here is what the last row (A:9833)in the df5 dataframe looks like: this is row A9833 --> 10473 ASSOCIATE RELATIONS CLT – Royale_w_cheese Jul 07 '18 at 16:29
  • so that means there are many numbers skipped in the labeling of Dept_Nbr's since at row 9833, the department number is 10473 – Royale_w_cheese Jul 07 '18 at 16:32