What is a pythonic way of populating in a dataframe from a csv file that has only few rows?

Question

I have a csv file that has only the two following rows:

ID    CODE       DATE        STARTDATE         AVERAGE1    AGGREGATE    MEDIAN1    MEDIAN2    AVERAGE2    AVERAGE3    POSITION
123    112     2016OCT25     2016OCT25         821          3              1021       4000       172         183          S8
124    111     2016OCT25     2016OCT25         834          3              1025       4210       173         185          ENG

The position column takes values from S1-S8 and T1-T8(16 in total).I would like to create a dataframe that looks like the following:

 AVERAGE1  MEDIAN1   MEDIAN2   AVERAGE2   AVERAGE3         POSITION                        
                                                                 S1

                                                                 S2

                                                                 S3

                                                                 S4

                                                                 S5

                                                                 S6

                                                                 S7

                                                                 S8

                                                                 T1

                                                                 T2

                                                                 T3

                                                                 T4

                                                                 T5

                                                                 T6

                                                                 T7

                                                                 T8

Now the problem is if the 'POSITION' column in the csv file reads 'ENG' the same corresponding data in the column data has to be filled for all the positions in the dataframe, from S1-T8. If it reads any particular position , for example S8 in the second column above, the data from that column in the csv has to be filled to the corresponding POSITION in the dataframe. To make it more clear, the output should look like this:

 AVERAGE1     MEDIAN1     MEDIAN2     AVERAGE2      AVERAGE3    POSITION                        
    821          1021        4000       172           183        S1

    821          1021        4000       172           183        S2                                              

    821          1021        4000       172           183        S3                                                     

    821          1021        4000       172           183        S4  

    821          1021        4000       172           183        S5                                         

    821          1021        4000       172           183        S6                                                   

    821          1021        4000       172           183        S7                                                  

    834          1025        4210       173           185        S8                                                    

    821          1021        4000       172           183        T1                                                     

    821          1021        4000       172           183        T2

    821          1021        4000       172           183        T3

    821          1021        4000       172           183        T4

    821          1021        4000       172           183        T5

    821          1021        4000       172           183        T6

    821          1021        4000       172           183        T7

    821          1021        4000       172           183        T8

I have accomplished this by creating an empty dataframe and then filled the cells with df.iloc after reading the csv. Any method that is more efficient or pythonic will be much appreciated. Thanks in advance!

Fyi, "Pythonic" typically means, "As simple and common sense as possible." ;) — jpmc26, Oct 26 '16 at 05:16
@jpmc26 thanks will keep in mind. Fairly new to python here. — Ruffy26, Oct 26 '16 at 05:19

jezrael · Accepted Answer · 2016-10-26T05:28:25.557

I think you can use reindex by pos, fillna by values from row with ENG in column POSITION, drop unnecessary columns, cast to int and last reset_index:

pos = ['S' + str(x) for x in range(1, 9)] + ['T' + str(x) for x in range(1,9)] 
print (pos)
['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 
 'T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8']

df.set_index('POSITION', inplace=True)

print (df.reindex(pos).fillna(df.ix['ENG']))
             ID   CODE       DATE  STARTDATE  AVERAGE1  AGGREGATE  MEDIAN1  \
POSITION                                                                     
S1        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S2        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S3        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S4        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S5        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S6        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S7        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S8        123.0  112.0  2016OCT25  2016OCT25     821.0        3.0   1021.0   
T1        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T2        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T3        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T4        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T5        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T6        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T7        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T8        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   

          MEDIAN2  AVERAGE2  AVERAGE3  
POSITION                               
S1         4210.0     173.0     185.0  
S2         4210.0     173.0     185.0  
S3         4210.0     173.0     185.0  
S4         4210.0     173.0     185.0  
S5         4210.0     173.0     185.0  
S6         4210.0     173.0     185.0  
S7         4210.0     173.0     185.0  
S8         4000.0     172.0     183.0  
T1         4210.0     173.0     185.0  
T2         4210.0     173.0     185.0  
T3         4210.0     173.0     185.0  
T4         4210.0     173.0     185.0  
T5         4210.0     173.0     185.0  
T6         4210.0     173.0     185.0  
T7         4210.0     173.0     185.0  
T8         4210.0     173.0     185.0

df = df.reindex(pos)
       .fillna(df.ix['ENG'])
       .drop(['ID','CODE','DATE','STARTDATE'],axis=1)
       .astype(int)
       .reset_index()

print (df)
   POSITION  AVERAGE1  AGGREGATE  MEDIAN1  MEDIAN2  AVERAGE2  AVERAGE3
0        S1       834          3     1025     4210       173       185
1        S2       834          3     1025     4210       173       185
2        S3       834          3     1025     4210       173       185
3        S4       834          3     1025     4210       173       185
4        S5       834          3     1025     4210       173       185
5        S6       834          3     1025     4210       173       185
6        S7       834          3     1025     4210       173       185
7        S8       821          3     1021     4000       172       183
8        T1       834          3     1025     4210       173       185
9        T2       834          3     1025     4210       173       185
10       T3       834          3     1025     4210       173       185
11       T4       834          3     1025     4210       173       185
12       T5       834          3     1025     4210       173       185
13       T6       834          3     1025     4210       173       185
14       T7       834          3     1025     4210       173       185
15       T8       834          3     1025     4210       173       185

Another solution:

df.set_index('POSITION', inplace=True)
#remove unnecessary columns
df = df.drop(['ID','CODE','DATE','STARTDATE'],axis=1)

#index values
pos = ['S' + str(x) for x in range(1, 9)] + ['T' + str(x) for x in range(1,9)] 
#DataFrame constructor
df1 = pd.DataFrame(data=[df.ix['ENG']], index=pos)
df1.index.name = 'POSITION'

#get all not ENG index values
not_eng_idx = df.drop('ENG').index
print (not_eng_idx)
Index(['S8'], dtype='object', name='POSITION')

#overwrite DataFrame by values from df
df1.ix[not_eng_idx] = df.ix[not_eng_idx]
df1.reset_index(inplace=True)

print (df1)
   POSITION  AVERAGE1  AGGREGATE  MEDIAN1  MEDIAN2  AVERAGE2  AVERAGE3
0        S1       834          3     1025     4210       173       185
1        S2       834          3     1025     4210       173       185
2        S3       834          3     1025     4210       173       185
3        S4       834          3     1025     4210       173       185
4        S5       834          3     1025     4210       173       185
5        S6       834          3     1025     4210       173       185
6        S7       834          3     1025     4210       173       185
7        S8       821          3     1021     4000       172       183
8        T1       834          3     1025     4210       173       185
9        T2       834          3     1025     4210       173       185
10       T3       834          3     1025     4210       173       185
11       T4       834          3     1025     4210       173       185
12       T5       834          3     1025     4210       173       185
13       T6       834          3     1025     4210       173       185
14       T7       834          3     1025     4210       173       185
15       T8       834          3     1025     4210       173       185

Thank you for accepting! I add another solution, maybe similar as your old one. — jezrael, Oct 26 '16 at 05:28
I have a question if you dont mind. Let's say the Position column has duplicated values like say S8 is repeated sometimes. reindex will not work then would it? Any alternative for such cases? Thank! — Ruffy26, Oct 26 '16 at 05:53
Yes, reindex doestn work with duplicates. But if there are multiple `S8` positions, what is desired output? All `S8` rows? Or only one - e.g. first `S8` or last `S8` row? — jezrael, Oct 26 '16 at 05:54
Say all the data in the S8 rows like Averages and Medians are the same too. S8 data should appear only once in the output! — Ruffy26, Oct 26 '16 at 05:55
Yes, but in output `Dataframe` will be multiple `S8` also if `S8` is multiple in source DataFrame? — jezrael, Oct 26 '16 at 05:57
Yes, then need `df = df.drop_duplicates('POSITION')` before `set_index` code. see [`drop_duplicates`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) — jezrael, Oct 26 '16 at 06:02

What is a pythonic way of populating in a dataframe from a csv file that has only few rows?

1 Answers1