
I have a bash script that I'm trying to run with a cron job on my Ubuntu server. I want it to run every day at 08:00 UTC. The bash script activates a conda Python virtual environment and runs a Python script that pulls data and loads it into a MySQL database. I also have logging throughout the Python script. No new data showed up in the database last night and no new logs were created. Below I've shown what's in the crontab and what's in the stocks_etl.sh script. Does anyone see what the issue might be, and how to fix it?

sudo crontab -e

The crontab shows:

0 8 * * * /mnt/data/sda/user_storage/stocks_etl.sh

stocks_etl.sh

#!/bin/bash
source activate py36
python /mnt/data/sda/user_storage/stocks_etl.py

Update #3:

When I run this command from the command line on my Ubuntu server, it works fine:

bash ~/etl_scripts/stocks_etl.bashrc

When I run it from crontab as the same user, it throws the error below.

Error:

Started stocks_etl.bash
Thu Feb 25 05:20:01 UTC 2021
/home/user/etl_scripts/stocks_etl.bashrc: line 5: activate: No such file or directory
Traceback (most recent call last):
  File "/home/user/etl_scripts/stocks_etl.py", line 4, in <module>
    import numpy as np
ImportError: No module named numpy

Here's the stocks_etl.bashrc file:

#!/bin/bash -l
echo 'Started stocks_etl.bash'
date +'%a %b %e %H:%M:%S %Z %Y'


source activate py36
python ~/etl_scripts/stocks_etl.py

It's as if, when run from crontab, it can't find conda and falls back to the base Python installation, which doesn't have numpy installed. Does anyone see what the issue might be, and can you suggest how to solve it?

Update #2: Now that I've run chmod 777 on the files, the crontab execution produces the error below. It's as if the conda virtual env isn't being activated and the script is being run with the base Python installation.

Error:

/mnt/data/sda/user_storage/etl_scripts/stocks_etl.sh: line 2: activate: No such file or directory
Traceback (most recent call last):
  File "/mnt/data/sda/user_storage/etl_scripts/stocks_etl.py", line 1, in <module>
    import numpy as np
ImportError: No module named numpy

Update:

stocks_etl.py:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from yahoofinancials import YahooFinancials

import pymysql

import datetime
import logging
import time
import glob
import os

from sqlalchemy import create_engine


# helper functions



# function for creating error logs
# returns the logger so callers can actually use it

def error_logger(path):

    # adding a timestamp to logname
    ts = str(datetime.datetime.now().isoformat())

    logging.basicConfig(filename=path + ts + '.log', level=logging.DEBUG,
                        format='%(asctime)s %(levelname)s %(name)s %(message)s')

    return logging.getLogger(__name__)


# function to query mysql db and return dataframe of results
def mysql_query(user, password, database, host, query):

    connection = pymysql.connect(user=user, password=password, database=database, host=host)
    df = None

    try:
        df = pd.read_sql(query, connection)
        logging.info('query succeeded: ' + query)

    except Exception as err:
        logger.error('query failed: ' + query + ' got error: ' + str(err))

    finally:
        connection.close()
        logging.info('closed mysql connection')

    return df


# function to download OHLC stock data

def download_stocks(Ticker_list, start_date, end_date, time_interval, path):

    # get data for each stock in Ticker_list and save as csv
    failed_list = []
    passed_list = []

    for ticker in Ticker_list:

        try:
            yahoo_financials = YahooFinancials(ticker)
            data = yahoo_financials.get_historical_price_data(start_date, end_date, time_interval=time_interval)

            prices_df = pd.DataFrame(data[ticker]['prices'])

            prices_df = prices_df[['adjclose', 'close', 'formatted_date', 'high', 'low', 'open',
                                   'volume']]
            prices_df['date'] = prices_df['formatted_date']
            prices_df = prices_df[['date', 'adjclose', 'close', 'high', 'low', 'open',
                                   'volume']]
            prices_df['Ticker'] = ticker

            prices_df.to_csv(path + ticker + '.csv')

            passed_list.append(ticker)
            logging.info('downloaded: ' + ticker)

            time.sleep(1)

        except Exception as err:
            failed_list.append(ticker)
            logger.error('tried download: ' + ticker + ' got error: ' + str(err))

# function to read the csvs in and append them to one dataframe

def stock_dataframe(path):

    try:
        all_files = glob.glob(path + "/*.csv")

        li = []
        for filename in all_files:
            df = pd.read_csv(filename, index_col=None, header=0)
            li.append(df)

        frame = pd.concat(li, axis=0, ignore_index=True)
        frame = frame[['date', 'adjclose', 'close', 'high', 'low', 'open',
                       'volume', 'Ticker']]

        logging.info('created stock dataframe')
        return frame

    except Exception as err:
        logger.error('stock dataframe create failed got error: ' + str(err))


# write dataframe to mysql db

def write_dataframe(username, password, host, schema,dataframe,table,if_exists,index):
    
    try:
        
        
        # connection = pymysql.connect(user='user', password='psswd', database='sandbox', host='xxxxx')

        engine = create_engine("mysql+pymysql://"+str(username)+":"+str(password)+"@"+str(host)+"/"+str(schema))
        # engine = create_engine("mysql+mysqldb://user:"+'psswd'+"@xxxxx/sandbox")
        dataframe.to_sql(con=engine, name=table, if_exists=if_exists, index=index)
        
        logging.info('write_dataframe succeeded')
        
    except Exception as err:

            logger.error('write_dataframe failed got error: '+str(err))
            
    pass




# to do

# - create directory with datetime prefix as part of path
# - add step that checks max date in current table
# - only pull data later than max date in current table
# - check max date in current derived table
# - only pull data later than current date from source table


def etl_pipeline(table_var):


    i=table_var

    max_date_query = "select max(date) as max_date from " + i

    try:
        
        max_date_df=mysql_query(user='user',
                            password='psswd',
                            database='stocks',
                            host='xxxxx',
                            query=max_date_query)
            
        logging.info('max_date succeeded: '+i)
            
    except Exception as err:

            logger.error('max_date failed: '+i)

    pass
        



    try:
        # get max date
        max_date=max_date_df.astype(str)['max_date'][0]


        # create directory

        base_path='/mnt/data/sda/user_storage/stock_data_downloads/'

        # get current_date
        current_date=datetime.datetime.today().strftime('%Y-%m-%d')

        directory_path=base_path+i+'/'+current_date

        # create directory for downloading new stocks in to
        os.mkdir(directory_path)

        logging.info('create directory succeeded: '+i)

    except Exception as err:

            logger.error('create directory failed: '+i)

    pass




    # getting ticker symbols

    ticker_query = "select distinct ticker as ticker from " + i

    try:
        
        tickers_df=mysql_query(user='user',
                            password='psswd',
                            database='stocks',
                            host='xxxxx',
                            query=ticker_query)
            
        logging.info('get tickers succeeded: '+i)
            
    except Exception as err:

            logger.error('get tickers failed: '+i)

    pass




    # get ticker symbols 
    stocks=tickers_df.ticker.tolist()


    # download stocks
    # Note: must add '/' to end of path
    # '2019-01-01', '2021-01-01', time_interval='daily'
    download_stocks(Ticker_list=stocks,
                    start_date=max_date,
                    end_date=current_date,
                    time_interval='daily',
                    path=directory_path+'/')




    # create dataframe
    stocks_df=stock_dataframe(path=directory_path)

    # trav_stocks_df.head()







    # create mysql table
    write_dataframe(username='user', 
                    password='psswd', 
                    host='xxxxx', 
                    schema='stocks',
                    dataframe=stocks_df,
                    table=i,
                    if_exists='append',
                    index=False)




    # creating additional avg annual returns

    try:
        
        query="""select ticker, avg(annual_returns) as avg_annual_returns from (
        select ticker,date, ( -1 +
                a.adjclose / max(a.adjclose) over (partition by ticker 
                                             order by date
                                             range between interval 365 day preceding and interval 365 day preceding
                                            ) 
               ) as annual_returns              
        from """+i+""" a
        ) b where annual_returns is not null
        group by ticker"""

        df=mysql_query(user='user',password='psswd',database='stocks',host='xxxxx',query=query)

        logging.info('etl succeeded: '+i+'_returns')

    except Exception as err:

            logger.error('etl failed: '+i+'_returns')

    pass




    # adding additional avg annual returns to table

    # create mysql table
    write_dataframe(username='user', 
                    password='psswd', 
                    host='xxxxx', 
                    schema='stocks',
                    dataframe=df,
                    table=i+'_returns',
                    if_exists='replace',
                    index=False)
    
    
# start logging

# adding a timestamp to logname
ts=str(datetime.datetime.now().isoformat())  

# logging.basicConfig(filename='example.log',level=logging.DEBUG)
logging.basicConfig(filename='/mnt/data/sda/user_storage/logs/etl_scripts/'+ts+'.log', level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)s %(name)s %(message)s')

logger=logging.getLogger(__name__)


    
table_list=['trav_stocks','s_and_p','american_mutual_funds']

for j in table_list:
    
    try:
        
        etl_pipeline(j)
        
        logging.info('etl_pipeline succeeded: '+j)
        
    except Exception as err:

            logger.error('etl_pipeline failed: '+j)

    pass

Update:

I changed my file to a .bash file and the code inside to:

#!/bin/bash -l
echo ''
'Started stocks_etl.bash'
date +'%a %b %e %H:%M:%S %Z %Y'


source /home/user/anaconda3/envs/py36/bin/activate 
conda activate py36
python ~/etl_scripts/stocks_etl.py

Now I'm getting the error below when run from crontab.

Error:

/home/user/etl_scripts/stocks_etl.bash: line 3: Started stocks_etl.bash: command not found
Fri Feb 26 16:28:01 UTC 2021
/home/user/etl_scripts/stocks_etl.bash: line 7: /home/user/anaconda3/envs/py36/bin/activate: No such file or directory
/home/user/etl_scripts/stocks_etl.bash: line 8: conda: command not found
Traceback (most recent call last):
  File "/home/user/etl_scripts/stocks_etl.py", line 4, in <module>
    import numpy as np
ImportError: No module named numpy

Update:

Code:

#!/bin/bash
echo ''
'Started stocks_etl.bash'
date +'%a %b %e %H:%M:%S %Z %Y'


/home/user/anaconda3 run -n py36 python ~/user/etl_scripts/stocks_etl.py

Error:

/home/user/etl_scripts/stocks_etl.bash: line 3: Started stocks_etl.bash: command not found
Fri Feb 26 16:43:01 UTC 2021
/home/user/etl_scripts/stocks_etl.bash: line 7: /home/user/anaconda3: Is a directory
user3476463
  • Could you post the contents of stocks_etl.py as well? – rhymefororange Feb 23 '21 at 16:43
  • @rhymefororange thank you for getting back to me so quickly. I added the stocks_etl.py script to the original post as an update. – user3476463 Feb 23 '21 at 17:01
  • @rhymefororange I added some logging to the initial crontab call and it looks like I may have been lacking permissions. ran chmod 777 on the files, we'll see if that solves it shortly. – user3476463 Feb 23 '21 at 17:04
  • I strongly encourage a reopen vote. The issue OP presents concerns activating a Conda environment, which falls under the scope of [tag:conda]. This is not a general computing problem. – merv Feb 26 '21 at 17:05

3 Answers


First, the `source activate` syntax was deprecated years ago (how old is your Conda instance?) - you should be using `conda activate`. Second, Conda shell commands are loaded into the shell as part of sourcing .bashrc or .bash_profile. So at the very least, you need to include the -l in the shebang and use:

#!/bin/bash -l
conda activate py36
python /mnt/data/sda/user_storage/stocks_etl.py

You may need to do something extra to ensure the .bashrc it sources is correct (e.g., what user does it run as?).
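One way to see what environment cron actually gives your script is to dump it to a file and compare it with your login shell. A temporary, illustrative crontab entry (the output path is arbitrary):

```shell
# temporary crontab entry: capture cron's environment once per minute
* * * * * env > /tmp/cron_env.txt 2>&1
```

Compare that file with the output of `env` in your interactive shell; under cron, PATH is typically minimal and none of the Conda initialization from .bashrc has run. Remove the entry once you've diagnosed the difference.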

Note that Conda also has the conda run command for executing commands inside envs, which I think should be preferred:

#!/bin/bash -l
conda run -n py36 python /mnt/data/sda/user_storage/stocks_etl.py

The latter form should also work without the Conda initialization, provided you give the full path to the conda entry point:

#!/bin/bash

# change to match your `conda` location
/home/user/anaconda3/condabin/conda run -n py36 python /mnt/data/sda/user_storage/stocks_etl.py
merv
  • thanks for the suggestion. I tried the first one and got a message that no such file or directory conda. I also added an update. when I run it in the command line it works fine, but when I run it with the same user using crontab I get the can't find conda error message. does that give you any better idea what the issue might be? – user3476463 Feb 26 '21 at 06:13
  • thanks for getting back to me. I've added my last two attempts as updates to my original post. including one of your suggestions. and I changed it back to a .bash file. I've also posted the errors I'm still getting. – user3476463 Feb 26 '21 at 16:50
  • The equivalent in your system to what I wrote is `/home/user/anaconda3/condabin/conda` – merv Feb 26 '21 at 17:03
  • 1
    thank you! your last comment did the trick. – user3476463 Feb 27 '21 at 01:23

Did you check if your bash file is executable?

If not you should either change its mode:

chmod 755 /mnt/data/sda/user_storage/stocks_etl.sh

Or explicitly execute it with bash:

0 8 * * * bash /mnt/data/sda/user_storage/stocks_etl.sh
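To check quickly whether the executable bit is the problem, you can reproduce the situation with a throwaway file. A minimal sketch (the mktemp path is arbitrary):

```shell
#!/bin/bash
# files created by mktemp start out mode 600, i.e. not executable
tmp=$(mktemp)
printf '#!/bin/bash\necho hello\n' > "$tmp"

[ -x "$tmp" ] || echo "not executable yet"

# after chmod 755, the owner (and everyone else) can execute it
chmod 755 "$tmp"
[ -x "$tmp" ] && echo "now executable"

rm -f "$tmp"
```

Running this prints `not executable yet` followed by `now executable`; the same `-x` test against your real script path tells you which case you're in.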
Jonas
  • That works as well, of course. 755 just means that other people can't overwrite the script. I tend to use it by default, but it doesn't really matter on your own machine. – Jonas Feb 26 '21 at 08:14

For me it's just:

crontab -e

enter my execution line:

0 8 * * * python3 script.py&

and save.

Putting the '&' at the end tells it to run in the background. I'm using AWS Ubuntu servers, so everything needs to be python3.
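Whichever form you use, note that cron silently discards stdout/stderr unless mail delivery is configured, which is why errors like those in the question never show up anywhere. Redirecting output to a log file makes them visible (the log path below is illustrative):

```shell
# append both stdout and stderr to a log file for later inspection
0 8 * * * python3 script.py >> /tmp/etl_cron.log 2>&1
```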

Kyle Hurst