5

Trying to combine two data frames when a datetime object from one dataframe is within a datetime object range in the other.

Keep getting: KeyError: 'cannot use a single bool to index into setitem' on this line of code in the second chunk I posted.

gametaxidf.loc[arrivemask, 'relevant'] = 1

I'm assuming it would happen on the following line with a similar command as well.

This is the part giving me trouble:

with open('/Users/benjaminprice/Desktop/TaxiCombined/Data/combinedtaxifiltered.csv', 'w') as csvfile: 
    fieldnames1 = ['index','pickup_datetime', 'dropoff_datetime', 'pickup_long', 'pickup_lat','dropoff_long','dropoff_lat','passenger_count','trip_distance','fare_amount','tip_amount','total_amount','stadium_code'] 
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames1) 
    writer.writeheader()

for index, row in baseballdf.iterrows(): 
    gametimestart = row['Start.Time'] 
    gametimeend = row['End.Time'] 
    arrivemin = gametimestart - datetime.timedelta(minutes=120) 
    arrivemax = gametimeend - datetime.timedelta(minutes = 30) 
    departmin = gametimeend - datetime.timedelta(minutes = 60) 
    departmax = gametimeend + datetime.timedelta(minutes = 90)

    gametaxidf = combineddf[combineddf.DATE==row.DATE]
    gametaxidf['relevant']=0

    for index, row in gametaxidf.iterrows():
        arrivemask = (arrivemin < row['dropoff_datetime']) and (row['dropoff_datetime'] < arrivemax)
        departmask = (departmin < row['pickup_datetime']) and (row['pickup_datetime'] < departmax) 
        gametaxidf.loc[arrivemask, 'relevant'] = 1
        gametaxidf.loc[departmask, 'relevant'] = 1

        with open('/Users/benjaminprice/Desktop/TaxiCombined/Data/combinedtaxifiltered.csv','a') as combinedtaxi:
            gametaxidf.to_csv(combinedtaxi,header=None)
    print(str(index) + "done")

Gametaxidf.head(5):

   index     pickup_datetime    dropoff_datetime  pickup_long  pickup_lat  \
0    195 2014-04-01 00:08:13 2014-04-01 00:15:32   -73.922218   40.827557   
1    344 2014-04-01 00:16:30 2014-04-01 00:20:38   -73.846046   40.754566   
2    558 2014-04-01 00:28:59 2014-04-01 00:36:36   -73.921692   40.831394   
3    744 2014-04-01 00:42:00 2014-04-01 00:49:46   -73.938080   40.804646   
4    776 2014-04-01 00:43:54 2014-04-01 00:53:22   -73.952652   40.810577   

   dropoff_long  dropoff_lat  passenger_count  trip_distance  fare_amount  \
0    -73.900620    40.856174                1           2.30          9.0   
1    -73.890259    40.753246                1           0.56          4.5   
2    -73.942719    40.823257                1           1.53          7.0   
3    -73.928490    40.830433                1           2.96         11.0   
4    -73.924332    40.827320                1           2.28         10.5   

   tip_amount  total_amount  stadium_code       DATE  relevant  
0           0          10.0           1.1 2014-04-01         0  
1           0           5.5           2.1 2014-04-01         0  
2           0           8.0           1.1 2014-04-01         0  
3           0          12.0           1.0 2014-04-01         0  
4           0          11.5           1.0 2014-04-01         0 

Also getting this warning: A value is trying to be set on a copy of a slice from a DataFrame.

Try using .loc[row_indexer,col_indexer] = value instead

But it's letting me continue through that... any help would be great.

Ben Price
  • 677
  • 3
  • 8
  • 16
  • could you also display the head of the dataframe? – Hypothetical Ninja Nov 20 '15 at 04:09
  • Just added them above... any particular columns in the baseballdf you're interested in? – Ben Price Nov 20 '15 at 04:19
  • want to see gametaxidf, that is the one causing problems, right? – Hypothetical Ninja Nov 20 '15 at 04:23
  • yeah... just realized thats what you meant... added above – Ben Price Nov 20 '15 at 04:23
  • you should edit your question and remove all other parts of code that aren't relevant including the original df you pasted. What are you trying to do with arrivemask and departmask? – Hypothetical Ninja Nov 20 '15 at 04:26
  • gametaxidf has time values for pickup and dropoff times that I converted to datetime objects using datetime.strptime() I'm trying to see if the arrival time before/during the game falls (and/or departure time after the game) within a certain window that I defined with the variables above – Ben Price Nov 20 '15 at 04:31
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/95637/discussion-between-sword-and-ben-price). – Hypothetical Ninja Nov 20 '15 at 04:36
  • the problem is that arrivemask and departmask have values True or False and hence you cant use it to index the rows. try : if arrivemask: gametaxidf.loc[index, 'relevant'] – Hypothetical Ninja Nov 20 '15 at 04:40

1 Answers1

5

Here

gametaxidf.loc[arrivemask, 'relevant'] = 1

you're trying to set dataframe values by .loc operator. Pandas docs for selecting rows says:

.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:

  • A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index. This use is not an integer position along the index)
  • A list or array of labels ['a', 'b', 'c']
  • A slice object with labels 'a':'f', (note that contrary to usual python slices, both the start and the stop are included!)
  • A boolean array

You're trying to use the last type of input, but this

arrivemask = (arrivemin < row['dropoff_datetime']) and 
    (row['dropoff_datetime'] < arrivemax)

is scalar boolean, not array.

You need not to iterate through dataframe. Pandas does it for you. Just use:

gametaxidf.loc[
   (arrivemin < gametaxidf['dropoff_datetime'])
   &
   (gametaxidf['dropoff_datetime'] < arrivemax)
   , 'relevant'] = 1
Community
  • 1
  • 1
tworec
  • 4,409
  • 2
  • 29
  • 34