1
import os
import pandas as pd


def concat_df(file_list):
      df = pd.DataFrame()
    for f in file_list:
        tmp_df = pd.read_csv(f)
        df = pd.concat((df, tmp_df), axis=0)

return df



root_dir = 'ce survey'

if os.path.exists('total.csv'):
    os.remove('total.csv')

   years = list(range(2012, 2022))

for year in years:
    print(year)
    dir_path = os.path.join(root_dir, str(year))
    files = os.listdir(dir_path)
    expd_files = []
    fmld_files = []
    fmli_files = []
    memd_files = []
    for file in files:
        if file.endswith('.csv'):
            fp = os.path.join(dir_path, file)
            if 'expd' in file:
                expd_files.append(fp)
            elif 'fmld' in file:
                fmld_files.append(fp)
            elif 'fmli' in file:
                fmli_files.append(fp)
            elif 'memd' in file:
                fmli_files.append(fp)

     expd_df = concat_df(expd_files)
     fmld_df = concat_df(fmld_files)
     fmli_df = concat_df(fmli_files)
     memd_df = concat_df(memd_files)
    expd_df = expd_df.loc[expd_df['UCC'] == 180620 ][['NEWID', 'COST', [['UCC']=290420,320130,410901], 'EXPNYR']]
    if year == 2015 or year == 2016 or year == 2017:
        fmld_df = fmld_df[['NEWID', 'REGION', 'FWAGEX', 'INC_RANK', 'HISP_REF', 'HORREF1', 'HORREF2', 'RACE2', 'REF_RACE', 'EDUC_REF', 'FAM_SIZE', 'INCLASS','FAM_TYPE','CHILDAGE',]]
    elif year == 2018 or year == 2019 or year == 2020 or year == 2021:
        fmld_df = fmld_df[
            ['NEWID', 'REGION', 'FWAGEX', 'INC_RANK', 'HISP_REF', 'HORREF1', 'HORREF2', 'RACE2', 'REF_RACE', 'EDUC_REF', 'FAM_SIZE','FAM_TYPE','CHILDAGE',]]
    else:
        fmld_df = fmld_df[['NEWID', 'REGION', 'FWAGEX', 'INC_RANK', 'HISP_REF', 'HORREF1', 'HORREF2', 'RACE2', 'REF_RACE', 'EDUC_REF', 'FAM_SIZE', 'FINCAFTX', 'INCLASS']]
    if year == 2012:
        fmli_df = fmli_df[['NEWID', 'INCLASS2', 'RACE2']]
    else:
        fmli_df = fmli_df[['NEWID', 'INCLASS2', 'RACE2', 'FINATXEM']]
    memd_df = memd_df[['NEWID','CU_CODE1','EMPLTYPE', 'SEX', 'OCCUEARN', 'OCCULIST', 'AGE', 'WKS_WRKD', 'MARITAL']] 


    expd_df = pd.merge(left=expd_df, right=fmld_df, on='NEWID', how='left')
    expd_df = pd.merge(left=expd_df, right=fmli_df, on='NEWID', how='left')
    expd_df = pd.merge(left=expd_df, right=memd_df, on='NEWID', how='left')
    if year == 2012:
        expd_df['FINATXEM'] = ''
    if year == 2015 or year == 2016 or year == 2017:
        expd_df['FINCAFTX'] = ''
    elif year == 2018 or year == 2019 or year == 2020 or year == 2021:
        expd_df['INCLASS'] = ''
        expd_df['FINCAFTX'] = ''

    expd_df['YEAR'] = year
    expd_df = expd_df[['NEWID', 'COST', 'UCC', 'EXPNYR', 'REGION', 'FWAGEX', 'INC_RANK',
       'HISP_REF', 'HORREF1', 'HORREF2', 'RACE2_x', 'REF_RACE', 'EDUC_REF',
       'FAM_SIZE', 'FINCAFTX', 'INCLASS', 'INCLASS2', 'RACE2_y', 'FINATXEM']]
    if year == 2012:
        expd_df.to_csv('total.csv', mode='a', index=False)
    else:
        expd_df.to_csv('total.csv', mode='a', index=False, header=None)

I'm trying to get information to see how other factors could influence the cost of 180620. 290420,320130,410901 are in the same column with 180620. I tried to use 180620 as the condition to filter other information and 290420,320130,410901 should be contained as other information. In the sheet of expd, UCC is the expenses of NEWIDS, different UCC code could be paired with same newid. I and trying to find NEWIDS paired with 180620 first and get the cost of 180620, then see if the NEWIDS' cost on 290420,320130,410901. In the end, merge all the information from different files into one CSV sheet.

Thank you for your time to look at this. This problem bothers me for really a long time.

Se Youuu
  • 11
  • 2
  • please post minimal reproducible example – Himanshu Poddar Jul 16 '22 at 16:47
  • Thank you for looking at it. what do you mean? Cause my problem is also that I am worrying that the whole code does not work in the end. Cause I tried a lot of different ways to correct my code, even it can be run,variables in the MEMD still does not show in the final sheet. I am sorry that i did not explain this in my question. – Se Youuu Jul 16 '22 at 16:58
  • Please, read how to create a [mre], and in particular how to create a [good reproducible example in pandas](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-example). In brief: try to isolate the problem, show only the code that is relevant for that problem, and generate artificial data representative of yours. We don't have (and we don't want) your csv file. – Ignatius Reilly Jul 16 '22 at 17:23
  • I see. The first problem is:expd_df = expd_df.loc[expd_df['UCC'] == 180620 ][['NEWID', 'COST', [['UCC']=290420,320130,410901], 'EXPNYR']] The second problem is: memd_df = memd_df[['NEWID','CU_CODE1','EMPLTYPE', 'SEX', 'OCCUEARN', 'OCCULIST', 'AGE', 'WKS_WRKD', 'MARITAL']] For the second problem, I think it is right cause I can run it but in the end, I could not see the variables in the final sheet when I get the output. – Se Youuu Jul 16 '22 at 17:30
  • Moreover, after i deleted the problem one and the whole code could run, I still could not get the variables of MEMD in the output. So, the third problem is, python did not show errors but the output is not what it should be so I don't know what is the problem specifically. Because, before adding codes related MEMD, the whole code could run and get the right output, so I guess the problem might be codes related to MEMD. – Se Youuu Jul 16 '22 at 17:48
  • Usually, memd_df[['NEWID','CU_CODE1','EMPLTYPE', 'SEX', 'OCCUEARN', 'OCCULIST', 'AGE', 'WKS_WRKD', 'MARITAL']] , after running this code, I should get a table with all the variables included. However, after I successfully ran this code, I did not get the expected output. So i think i should post the whole code for the inference. – Se Youuu Jul 16 '22 at 17:50

0 Answers0