1

Apologies, I'm very new to Python.

I have the current code:

# Put data into a dataframe
df = pd.DataFrame(ZACKSP_raw_data)

""" Reformat dataframe data """    
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')

# Change date format to DD/MM/YYYY
df['date'] = df['date'].dt.strftime('%d/%m/%Y')

# Round closing share price to 2 digits
df['close'] = df['close'].round(2)

# Filter data for Jan 
ZACKSP_data_StartOfJanYearMinus1 = df[df['date'] == '05/01/%s' % CurrentYearMinus1]

# Test
print(ZACKSP_data_StartOfJanYearMinus1.head())

Which returns data in the format: enter image description here

Now I want the array to only keep the closing price for the first recorded close in January and the last recorded close in Dec (for each ticker). I thought about trying to use a wildcard for the day and then something like a head() or tail() to get achieve this but I'm struggling. Any ideas?

smci
  • 32,567
  • 20
  • 113
  • 146
user3709511
  • 221
  • 3
  • 10

2 Answers2

1

Solution if all datetimes are sorted:

I think you need concat with drop_duplicates for first row and last row for each ticker.

Also is necessary add new column for with years for first and last values per years with tickers.

df['year'] = pd.to_datetime(df['date']).dt.year

df1 = pd.concat([df.drop_duplicates(['ticker', 'year']), 
                 df.drop_duplicates(['ticker', 'year'], keep='last')])  

More general solution with unsorted datetimes:

c = ['ticker','exchange','date','close']
df = pd.DataFrame({'date':pd.to_datetime(['2017-01-04','2017-01-12',
                                          '2017-01-05',
                           '2018-01-02','2018-12-27','2017-12-27',
                           '2018-01-05','2018-01-12','2017-01-05',
                           '2017-01-12','2018-12-22','2017-12-22']),
                   'close':[4.56,5.45,4.32,5.67,5.23,4.78,7.43,8.67,
                            9.32,4.73,2.42,3.45],
                   'ticker':['BA','BA','BA','BA','BA','BA',
                             'AAPL','AAPL','AAPL','AAPL','AAPL','AAPL'],
                    'exchange':['NYSE'] * 6 + ['NSDQ'] * 6}, columns=c)

print (df)
   ticker exchange       date  close
0      BA     NYSE 2017-01-04   4.56
1      BA     NYSE 2017-01-12   5.45
2      BA     NYSE 2017-01-05   4.32
3      BA     NYSE 2018-01-02   5.67
4      BA     NYSE 2018-12-27   5.23
5      BA     NYSE 2017-12-27   4.78
6    AAPL     NSDQ 2018-01-05   7.43
7    AAPL     NSDQ 2018-01-12   8.67
8    AAPL     NSDQ 2017-01-05   9.32
9    AAPL     NSDQ 2017-01-12   4.73
10   AAPL     NSDQ 2018-12-22   2.42
11   AAPL     NSDQ 2017-12-22   3.45

""" Reformat dataframe data """    
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')

# Round closing share price to 2 digits
df['close'] = df['close'].round(2)

#sorting dates for first date per ticker is first day in Jan and last day in Dec
df = df.sort_values('date')

#extract years from dates
df['year'] = pd.to_datetime(df['date']).dt.year

#get first rows per tickers and year
df1 = df.drop_duplicates(['ticker', 'year'])
print (df1)
  ticker exchange       date  close  year
0     BA     NYSE 2017-01-04   4.56  2017
8   AAPL   NASDAQ 2017-01-05   9.32  2017
3     BA     NYSE 2018-01-02   5.67  2018
6   AAPL   NASDAQ 2018-01-05   7.43  2018

#get last row per year and ticker
df2 = df.drop_duplicates(['ticker', 'year'], keep='last')
print (df2)
   ticker exchange       date  close  year
11   AAPL   NASDAQ 2017-12-22   3.45  2017
5      BA     NYSE 2017-12-27   4.78  2017
10   AAPL   NASDAQ 2018-12-22   2.42  2018
4      BA     NYSE 2018-12-27   5.23  2018

#join DataFrames together and sorting if necessary
df = pd.concat([df1, df2]).sort_values(['ticker','date'])
print (df)
   ticker exchange       date  close  year
8    AAPL   NASDAQ 2017-01-05   9.32  2017
11   AAPL   NASDAQ 2017-12-22   3.45  2017
6    AAPL   NASDAQ 2018-01-05   7.43  2018
10   AAPL   NASDAQ 2018-12-22   2.42  2018
0      BA     NYSE 2017-01-04   4.56  2017
5      BA     NYSE 2017-12-27   4.78  2017
3      BA     NYSE 2018-01-02   5.67  2018
4      BA     NYSE 2018-12-27   5.23  2018

Another solution with different output of data with aggregating first and last:

""" Reformat dataframe data """    
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')

# Round closing share price to 2 digits
df['close'] = df['close'].round(2)

#sorting dates for first date per ticker is first day in Jan and last day in Dec
df = df.sort_values('date')

#extract years from dates
df['year'] = pd.to_datetime(df['date']).dt.year

df = (df.groupby(['ticker','year'])['close']
       .agg(['first','last'])
       .reset_index())
print (df)
  ticker  year  first  last
0   AAPL  2017   9.32  3.45
1   AAPL  2018   7.43  2.42
2     BA  2017   4.56  4.78
3     BA  2018   5.67  5.23
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • OK, this was different to what I expected but I like the output generated. I have two followup questions: 1. How can I filter the years? I want to include all years either greater than a variable have defined OR include all years that match in array with 5 years in it. – user3709511 Jan 29 '18 at 15:16
  • 1. You can filter by `df['year'] = pd.to_datetime(df['date']).dt.year` and then `df[df['year'] > 2016]` - it is called [`boolean indexing`](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing). – jezrael Jan 29 '18 at 15:20
  • 2. I'd like to have a column for the first and last in each year, named something like 2017-first, 2017-last, 2018-first, 2018-last, what method would I use to achieve this? – user3709511 Jan 29 '18 at 15:22
  • The year is defined by the following: #find current year now = datetime.datetime.now() current_year = str(now.year) – user3709511 Jan 29 '18 at 15:25
  • 2. for output from solution 1 `df1 = df.drop_duplicates(['ticker', 'year'])` add `df1['year'] = df1['year'].astype(str) + 'first'` and similar for `df2` – jezrael Jan 29 '18 at 15:25
  • need `current_year = now.year` - not convert to `str`, because `df1['year']` is numeric - integers. – jezrael Jan 29 '18 at 15:26
0

You want to df.groupby('ticker'), then group-by month, filter for month=='Dec' and take the tail(), filter for month=='Jan' and take the head(), then ungroup().

(If you post reproducible data I will post code that does this.)

Read the pandas doc about Group By: split-apply-combine paradigm, one of the key paradigms in data science. For examples on SO, see tag .

smci
  • 32,567
  • 20
  • 113
  • 146