Looking for a more effective way to prep data for Kmeans analysis. Using the BLS ( Bureau of Labor Statistics ) and trying to learn Kmeans, I am doing the first pass of the data and want to add two columns, percentage of change over time per in median salary and tote employment. The formula is a simply (((current year, state and occ_code) minus the (min of those fourteen data points) / divided by (same min ) ... (( current year, occ_code - min) / min *100), add two columns to dataframe. The data set is ~500K rows by 24 columns. the code is run about 400 rows per min.... current expected to take about 24 hours for a full run, thus the question. Thank you
The sample data is here:
Any columns will work, I am just using a_median and tot_emp
Unnamed: 0 area st state occ_code occ_title tot_emp emp_prse h_mean a_mean ... h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 year tot_emp_growth a_median_growth
128037 128037 31 NE Nebraska 25-2022 Middle school teachers, except special and voc... 3720 4.9 0 45270 ... 0 0 32850 37160 44220 53170 62120 2008 25.566343 0.0
491755 491755 19 IA Iowa 47-2131 Insulation Workers, Floor, Ceiling, and Wall 360 18 19.59 40750 ... 23.56 27.98 27160 32230 38920 49010 58190 2018 25.566343 0.0
470924 470924 42 PA Pennsylvania 19-1021 Biochemists and Biophysicists 1330 24 43.69 90880 ... 50.04 67.42 52300 64700 84400 104070 140240 2017 25.566343 0.0
267336 267336 20 KS Kansas 39-4031 Morticians, Undertakers, and Funeral Directors 460 25.4 16.6 34540 ... 23.98 27.83 20400 21730 23950 49880 57890 2012 25.566343 0.0
491263 491263 19 IA Iowa 11-9033 Education Administrators, Postsecondary 2360 6.8 51.81 107760 ... 61.22 86.07 52120 68670 93650 127330 179020 2018 25.566343 0.0
5 rows × 24 columns
The code currently being tested is:
def occ_code_growths(df):
for i in range(len(df)):
cols_lit = ['year', 'occ_code', 'st' , 'tot_emp', 'a_median']
df_lookup = df.lookup(list([df.index[i]]*len(cols_lit)), cols_lit)
idx_emp_min = df[(df['occ_code'] == df_lookup[1]) & (df['st'] == df_lookup[2]) ]\
['tot_emp'].values.astype(int).min()
idx_median_min = df[(df['occ_code'] == df_lookup[1]) & (df['st'] == df_lookup[2]) ]\
['a_median'].values.astype(int).min()
idx_emp = df[(df['occ_code'] == df_lookup[1]) & (df['st'] == df_lookup[2]) \
& (df['year'] == df_lookup[0]) ]['tot_emp'].values.astype(int)
idx_median = df[(df['occ_code'] == df_lookup[1]) & (df['st'] == df_lookup[2]) \
& (df['year'] == df_lookup[0]) ]['a_median'].values.astype(int)
df['tot_emp_growth'] = float((((idx_emp - idx_emp_min) / idx_emp_min) * 100)[0])
df['a_median_growth'] = float((((idx_median - idx_median_min) / idx_median_min) * 100)[0])
if i % 200 == 0 :
print(df.index[i])
return(df)
df_4 = occ_code_growths(df)
df_4.to_csv('./data/kmeans.csv')