Add multiple string values to pandas column based on numpy values

Question

I have a numpy array like the one below, which I calculated by applying some conditions to the price column of the main_df dataframe:

2021-06-09 14:55:00    0
2021-06-09 15:00:00    1
2021-06-09 15:05:00    0
2021-06-09 15:10:00   -1

# save the above numpy array in the study_name_1_result variable
study_name_1_result = above_numpy_array

I have a main_df like this, to which I need to add values:

                     price positive_studies negative_studies
date_time
2021-06-09 14:55:00    100               []               []
2021-06-09 15:00:00    110               []               []
2021-06-09 15:05:00    222               []               []
2021-06-09 15:10:00    332               []               []

I tried the following to add the studies to the appropriate columns:

    # 'study_name_1' is the name of the study I used to generate the study_name_1_result variable (numpy array)
    numpy.where((study_name_1_result > 0),main_df['positive_studies'].append('study_name_1'))
    numpy.where((study_name_1_result < 0),main_df['negative_studies'].append('study_name_1'))

But I get this error: TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid

My expected output is like below

                     price positive_studies negative_studies
date_time
2021-06-09 14:55:00    100               []               []
2021-06-09 15:00:00    110 ['study_name_1']               []
2021-06-09 15:05:00    222               []               []
2021-06-09 15:10:00    332               [] ['study_name_1']

Could someone tell me what I am doing wrong here?

Why do the studies columns hold lists? Isn't a string sufficient for them? – ThePyGuy, Jun 09 '21 at 22:59
@Don'tAccept actually I am looping through multiple studies (numpy values) for the same dataframe. These numpy values are what got generated for, let's say, study_name_1. – Abhilash, Jun 09 '21 at 23:00
If I'm reading this correctly, you don't want to append to the dataframe; you want to replace what's in the cell with the numpy array. So maybe .replace('[]', 'study_name_1') instead of append? – Jonathan Leon, Jun 10 '21 at 00:18
@JonathanLeon not really. I have multiple studies, so if `main_df['positive_studies']` is currently `['study_name_1']`, then later I need to add another, giving `['study_name_1','study_name_2']`. – Abhilash, Jun 10 '21 at 00:44
Ah, then you'll have to get the original value as a list, extend the list, and then update the cell with the new list. – Jonathan Leon, Jun 10 '21 at 00:48
@JonathanLeon you are right. I thought append would work. If you provide this as an answer I will accept it. Thank you. – Abhilash, Jun 10 '21 at 00:50
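
The "get the list, extend it, update the cell" sequence from the comments above can be sketched for a single cell roughly as follows. This is only an illustration, not code from the thread: the two-row frame and the `row` label are made up for the example, and it assumes the column has object dtype so that a cell can hold a Python list.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, object-dtype list column).
df = pd.DataFrame(
    {'price': [100, 110], 'positive_studies': [[], ['study_name_1']]},
    index=pd.to_datetime(['2021-06-09 14:55:00', '2021-06-09 15:00:00']),
)

row = pd.Timestamp('2021-06-09 15:00:00')

# 1. get the current list, 2. build the extended list, 3. write it back
current = df.at[row, 'positive_studies']
df.at[row, 'positive_studies'] = current + ['study_name_2']

print(df)
```

Using `current + [...]` creates a new list instead of mutating the old one, which avoids surprises if several cells happen to reference the same list object.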

score 1 · Answer 1 · answered Jun 10 '21 at 00:56

Not the slickest coding, but this gets you started

import ast
import io

import pandas as pd

data = '''date_time               price   positive_studies   negative_studies
2021-06-09 14:55:00    100         []               []
2021-06-09 15:00:00    110         []               []
2021-06-09 15:05:00    222         []               []
2021-06-09 15:10:00    332         []               []'''
# columns are separated by runs of two or more spaces
df = pd.read_csv(io.StringIO(data), sep=r' \s+', engine='python')
old_list = ast.literal_eval(df.iat[1, 2])  # make it a list from the string '[]'
new_list = old_list + ['my_new_study']

df.iat[1, 2] = new_list

Jonathan Leon

Is there a pandas way to do this for all rows together, like I tried with append in the question? Also, the main_df variable is already a data frame. – Abhilash Jun 10 '21 at 01:09
Probably in a function called by lambda; filter your df, then df.apply(lambda x: myfunc(x['positive_studies']), axis=1) may work. – Jonathan Leon Jun 10 '21 at 01:13
Not sure about that. It would be great if you could update the answer when possible. Thank you. – Abhilash Jun 10 '21 at 01:54
Try it and see if you can get it to work, then post the code you're using. If you aren't familiar with the concept, you can research apply(), groupby().apply() and how to call functions (either with lambda or directly). – Jonathan Leon Jun 10 '21 at 02:39
Will do that. Thanks – Abhilash Jun 10 '21 at 02:53
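
One possible way to handle all rows at once, as asked in the comments above, is to rebuild each list column from the sign of the result array. The sketch below is an assumption-laden illustration rather than anything posted in the thread: `add_study` is a hypothetical helper name, and the frame is a stand-in for `main_df` built with the values from the question.

```python
import numpy as np
import pandas as pd


def add_study(df, result, name):
    """Append `name` to the per-row lists, chosen by the sign of `result`."""
    result = np.asarray(result)
    for col, mask in (('positive_studies', result > 0),
                      ('negative_studies', result < 0)):
        # Build new lists instead of mutating, so rows never share a list.
        df[col] = pd.Series(
            [cell + [name] if hit else list(cell)
             for cell, hit in zip(df[col], mask)],
            index=df.index,
        )
    return df


tidx = pd.date_range('2021-06-09 14:55:00', periods=4, freq='5min')
main_df = pd.DataFrame({
    'price': [100, 110, 222, 332],
    'positive_studies': [[] for _ in range(4)],
    'negative_studies': [[] for _ in range(4)],
}, index=tidx)

# Two studies applied one after the other accumulate in the same cells.
add_study(main_df, np.array([0, 1, 0, -1]), 'study_name_1')
add_study(main_df, np.array([1, 1, 0, -1]), 'study_name_2')
print(main_df)
```

The helper assumes `result` has one entry per row of `df`, in the same order as the index.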

score 1 · Accepted Answer · answered Jun 10 '21 at 04:11

This is often a subject of puzzlement (trying to add elements to a list inside a cell of a dataframe). See for example this SO answer.

Even the initialization of your main_df can be a bit finicky.

Here is a way to do what you are looking for. There might be better/faster ways, but at least this is one way.

# reproducible setup
import pandas as pd

price = [100, 110, 222, 332]
tidx = pd.date_range('2021-06-09 14:55:00', periods=len(price), freq='5min')
df = pd.DataFrame(dict(
    price=price,
    positive_studies=[[]] * len(price),
    negative_studies=[[]] * len(price),
), index=tidx)
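
One reason the initialization can be finicky: in plain Python, `[[]] * n` repeats a reference to a single list, so all n entries are the same object. The sketch below shows that on ordinary lists and builds the column with a list comprehension instead. It is an illustrative aside, and `df_safe` is a made-up name; the setup above is harmless as long as cells are only ever replaced with new lists (for example via `a + sublist`), but in-place calls such as `.append` on a cell would likely affect every row that shares the list.

```python
import pandas as pd

n = 4

# n references to ONE list: mutating any entry is visible through all of them
shared = [[]] * n
shared[0].append('x')
print(shared)

# n separate lists: mutating one entry leaves the others untouched
independent = [[] for _ in range(n)]
independent[0].append('x')
print(independent)

# Same setup as above, but with an independent empty list per row
tidx = pd.date_range('2021-06-09 14:55:00', periods=n, freq='5min')
df_safe = pd.DataFrame({
    'price': [100, 110, 222, 332],
    'positive_studies': [[] for _ in range(n)],
    'negative_studies': [[] for _ in range(n)],
}, index=tidx)
```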

Then:

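def list_append(df, colname, sublist, where):
    # For the rows selected by the boolean mask `where`, replace each list in
    # `colname` with a new list that has `sublist` concatenated onto it.
    df.loc[where, colname] = df.loc[where, colname].apply(lambda a: a + sublist)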

Application:

name = 'study_name_1'
study_name_1_result = pd.Series([0, 1, 0, -1], index=tidx)

list_append(df, 'positive_studies', [name], study_name_1_result > 0)
list_append(df, 'negative_studies', [name], study_name_1_result < 0)

Outcome:

>>> df
                     price positive_studies negative_studies
2021-06-09 14:55:00    100               []               []
2021-06-09 15:00:00    110   [study_name_1]               []
2021-06-09 15:05:00    222               []               []
2021-06-09 15:10:00    332               []   [study_name_1]

Sounds good. I will try it and update. And yes, there could be faster ways, which I definitely need to look for, as I am doing this calculation every 5 minutes on around 100 records with around 70 studies for each record. But I think I will first make it work and then look at performance improvements. Thank you for the support. – Abhilash, Jun 10 '21 at 05:09
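
For the many-studies case mentioned in the comment above, one option that might reduce overhead is to collect all study results first and build both list columns in a single pass, rather than appending study by study. This is an untimed sketch under assumptions: `studies_by_sign` and the `results` dict are hypothetical names, the second study's values are invented for the example, and every result array is assumed to have one entry per row.

```python
import numpy as np
import pandas as pd


def studies_by_sign(results, index):
    """Build positive/negative study lists from a {name: result array} dict."""
    names = np.array(list(results))
    # One row per record, one column per study.
    matrix = np.column_stack([np.asarray(results[n]) for n in names])
    positive = [names[row > 0].tolist() for row in matrix]
    negative = [names[row < 0].tolist() for row in matrix]
    return pd.DataFrame(
        {'positive_studies': positive, 'negative_studies': negative},
        index=index,
    )


tidx = pd.date_range('2021-06-09 14:55:00', periods=4, freq='5min')
results = {
    'study_name_1': np.array([0, 1, 0, -1]),
    'study_name_2': np.array([1, 1, 0, -1]),  # invented values
}
studies = studies_by_sign(results, tidx)
print(studies)
```

The resulting frame shares the index of the price data, so it could be joined onto `main_df` afterwards. Whether this is actually faster for around 100 records and 70 studies would need to be measured.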
