
I want to set up an RDBMS for structured time-series data of limited size (about 6,000 series, 50 MB of data) at various frequencies (daily, monthly, quarterly, annual CY and annual FY), and I want to run SQL queries on the database (mostly joining various tables by time). The database is updated once a month. The variable names of the tables in this database are rather technical and not very informative. The raw data is labeled as shown in the table below (an example of a monthly table).

[image: example of a monthly table]

I started setting this up in MySQL and found that simply equipping the tables with appropriate temporal identifiers gives me the join functionality I want. However, I could not figure out how to store the variable labels appropriately. Is it possible to somehow add attributes to the columns? Or can I link a table to a table mapping labels to column names, such that the labels are carried along in joins? Or should I set this up using a different kind of database? (The database must be easy to set up and host, though, and SQL is strongly preferred.) I am grateful for any advice.

Update: I figured out that you can add comments to MySQL columns and tables, but it seems these cannot be queried in a standard way or carried along in joins. Is it possible to retrieve the information in the comments along with the queried data from a standard database connector (like this one for the R language: https://github.com/r-dbi/RMySQL)? Below is a DDL example for tables with variable labels stored as comments.


-- Annual FY Table
CREATE TABLE IF NOT EXISTS BOU_MMI_AF (
   FY VARCHAR(7) COMMENT "Fiscal Year (July - June)",
   NFA DOUBLE COMMENT "Net Foreign Assets (NFA) (Shs billion)",
   NDA DOUBLE COMMENT "Net Domestic Assets (NDA) (Shs billion)",
   PRIMARY KEY (FY)
) COMMENT = "Annual FY";

-- Quarterly Table
CREATE TABLE IF NOT EXISTS BOU_FS (
   Year INT CHECK (Year >= 1800 AND Year < 2100) COMMENT "Year",
   Quarter VARCHAR(2) CHECK (Quarter IN ('Q1', 'Q2', 'Q3', 'Q4')) COMMENT "Quarter",
   FY VARCHAR(7) COMMENT "Fiscal Year (July - June)",
   QFY VARCHAR(2) CHECK (QFY IN ('Q1', 'Q2', 'Q3', 'Q4')) COMMENT "Quarter of Fiscal Year",
   KA_RC_RWA DOUBLE COMMENT "Capital Adequacy (%): Regulatory capital to risk-weighted assets",
   AQ_NPL_GL DOUBLE COMMENT "Asset quality (%): NPLs to total gross loans",
   EP_RA DOUBLE COMMENT "Earnings & profitability (%): Return on assets",
   L_BFA_TD DOUBLE COMMENT "Liquidity (%): Bank-funded advances to total deposits",
   MS_FX_T1CA DOUBLE COMMENT "Market Sensitivity (%): Forex exposure to regulatory tier 1 capital",
   PRIMARY KEY (Year, Quarter)
) COMMENT = "Quarterly";

-- Daily Table
CREATE TABLE IF NOT EXISTS BOU_I (
   Date DATE CHECK (Date >= '1800-01-01' AND Date < '2100-01-01') COMMENT "Date",
   Year INT CHECK (Year >= 1800 AND Year < 2100) COMMENT "Year",
   Quarter VARCHAR(2) CHECK (Quarter IN ('Q1', 'Q2', 'Q3', 'Q4')) COMMENT "Quarter",
   FY VARCHAR(7) COMMENT "Fiscal Year (July - June)",
   QFY VARCHAR(2) CHECK (QFY IN ('Q1', 'Q2', 'Q3', 'Q4')) COMMENT "Quarter of Fiscal Year",
   Month VARCHAR(9) CHECK (Month IN ('January' , 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')) COMMENT "Month",
   Day INT CHECK (Day > 0 AND Day < 32) COMMENT "Day",
   I_Overnight DOUBLE COMMENT "Daily Interbank Money-Market Rates: Overnight  (%)",
   I_7day DOUBLE COMMENT "Daily Interbank Money-Market Rates: 7-day (%)",
   I_Overall DOUBLE COMMENT "Daily Interbank Money-Market Rates: Overall (%)",
   PRIMARY KEY (Date)
) COMMENT = "Daily";


So if I execute a query like

SELECT * FROM BOU_I NATURAL JOIN BOU_FS NATURAL JOIN BOU_MMI_AF; 

from a statistical software environment like R or Stata connected to the database through a MySQL connector, I'd like to see a table similar to the one shown in the figure, where I can retrieve both the names of the variables and the labels stored as comments in the DDL.
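One way to get at the labels from a client (this is an assumption on my part, not something RMySQL does automatically) is to query `information_schema` separately and merge the comments with the result set on the client side, e.g. via `DBI::dbGetQuery()` in R:

```sql
-- Retrieve the column labels stored as COMMENTs for one of the tables
-- defined in the DDL above (BOU_I); run from any standard connector
SELECT COLUMN_NAME, COLUMN_COMMENT
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'BOU_I'
ORDER BY ORDINAL_POSITION;
```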

Sebastian
  • Hi - please can you clearly describe with examples (minimum reproducible example) what it is that you have and what are you trying to achieve as it is really unclear (to me) what you are doing. What do you mean by "variable" names/labels; why are you setting up tables with names that are "rather technical not very informative"; what join functionality are you trying to achieve; is data at a lower frequency (year) just an aggregate of data at a higher frequency (quarter)? – NickW Nov 26 '20 at 11:04
  • Thanks, I hope my example DDL makes things clearer. So the task is to set up a database for economic time-series at different frequencies (Daily, Monthly, Quarterly, Annual CY and Annual FY). Series at different frequencies need to be queried together (by joining the appropriate tables). This as far as I have figured works fine by just adding all lower-frequency ID variables to higher-frequency tables so that any two tables can be joined. The main complication seems to be how to retrieve the label describing the data. The database should be queried directly from statistical software (like R). – Sebastian Nov 26 '20 at 20:18
  • Also different frequencies are there simply because different economic variables are recorded at different frequencies. We can get daily exchange rates, monthly balance of trade, but only quarterly GDP etc. These variables still need to be analyzed together so the database should supply joins of datasets at different frequencies. – Sebastian Nov 26 '20 at 20:34

2 Answers


I would structure your data differently. I would put all your measures in a single table and have a single measure per row. I would then add a DATE table (so that you have the week/month/quarter/year values for each metric date) and a METRIC_TYPE table that holds the labels for each metric code.

By normalising the data like this I think you have a more flexible design and it'll allow you to do what you want.

This is only for illustration of what I mean - it is not meant to be a definitive design:

[image: illustrative design with DATE, METRIC_TYPE and measures tables]
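As a rough sketch of the design described above (table and column names here are my own placeholders, not a definitive implementation):

```sql
-- One row per metric code, holding the descriptive label
CREATE TABLE METRIC_TYPE (
   Metric_Code VARCHAR(30) PRIMARY KEY,
   Label       VARCHAR(120) NOT NULL
);

-- Calendar attributes for each date, so any frequency can be derived
CREATE TABLE DATE_DIM (
   Date    DATE PRIMARY KEY,
   Year    INT NOT NULL,
   Quarter INT NOT NULL,
   Month   INT NOT NULL,
   FY      CHAR(7) NOT NULL
);

-- Single measures table: one value per (date, metric) pair
CREATE TABLE MEASURE (
   Date        DATE,
   Metric_Code VARCHAR(30),
   Value       DOUBLE NOT NULL,
   PRIMARY KEY (Date, Metric_Code),
   FOREIGN KEY (Date) REFERENCES DATE_DIM (Date),
   FOREIGN KEY (Metric_Code) REFERENCES METRIC_TYPE (Metric_Code)
);
```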

NickW
  • Thanks @NickW, I appreciate this suggestion, and thought about doing something like this as well. The only thing is that this is likely to result in complex queries, and also storing data in a long-format table like that is ging to result in lots of data duplication and > 1 million rows. I'll look into it. Definitely like the idea of the Date and Metric tables. – Sebastian Nov 29 '20 at 17:49
  • I wouldn't worry about data volumes – any modern DB can handle 10s of millions of rows without breaking a sweat (assuming proper indexing, etc., obviously). If you are worried about query complexity then feel free to post some examples of what you are trying to do and we can look at them – might be easier in a new question (linked to this one) – NickW Nov 30 '20 at 09:47
  • I'll experiment a bit this week and come back with what I have. – Sebastian Dec 01 '20 at 08:05

So I am pretty happy with the suggestion of @NickW. For reference I am sharing my final schema below. I still have some questions regarding it. I mostly query the DATA table directly (which has some 700,000 observations), joining in information from the TIME, SERIES and DATASET tables as needed. I noticed that retrieving larger amounts of data can take some time, so I wonder: am I indexing this optimally?

Then there are a few computed columns: the NDatasets column in DATASOURCE counts the number of DSID values per Source in the DATASET table, and the Updated column in DATASET shows when data was last added to a particular dataset. DS_From/DS_To and S_From/S_To give the full time range over which data is available for a given dataset and series. Currently I do all these computations in R and insert the results. I wonder whether these computations could be done in MySQL instead, so as to have self-updating columns?
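A sketch of how these columns might be recomputed inside MySQL after each monthly load (my own guess at one approach, using correlated-subquery and multi-table UPDATEs; a scheduled EVENT or triggers could run this automatically):

```sql
-- Recount the datasets per source
UPDATE DATASOURCE src
SET NDatasets = (SELECT COUNT(*)
                 FROM DATASET ds
                 WHERE ds.Source = src.Source);

-- Refresh each dataset's available date range from the DATA table
UPDATE DATASET ds
JOIN (SELECT DSID, MIN(Date) AS d_min, MAX(Date) AS d_max
      FROM DATA
      GROUP BY DSID) agg ON agg.DSID = ds.DSID
SET ds.DS_From = agg.d_min,
    ds.DS_To   = agg.d_max;
```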

Grateful for any further comment on this.

DDL:

DROP SCHEMA IF EXISTS TSDB;
CREATE SCHEMA IF NOT EXISTS TSDB;
USE TSDB;

CREATE TABLE IF NOT EXISTS DATASOURCE (
   Source VARCHAR(120),
   Source_Url VARCHAR(200),
   NDatasets INT NOT NULL, 
   Description VARCHAR(3000) NOT NULL,
   Access VARCHAR(3000) NOT NULL,
   PRIMARY KEY (Source)
);

CREATE TABLE IF NOT EXISTS DATASET (
   DSID VARCHAR(30), -- INT
   Dataset VARCHAR(120) NOT NULL,
   Frequency VARCHAR(9) NOT NULL CHECK (Frequency IN ('Daily' , 'Monthly', 'Quarterly', 'Annual CY', 'Annual FY')),
   DS_From DATE CHECK (DS_From >= '1800-01-01' AND DS_From < '2100-01-01'), 
   DS_To DATE CHECK (DS_To >= '1800-01-01' AND DS_To < '2100-01-01'), 
   Updated DATE CHECK (Updated >= '1800-01-01' AND Updated < '2100-01-01'), 
   Description VARCHAR(3000) NOT NULL,
   Source VARCHAR(120), -- NOT NULL
   DS_Url VARCHAR(200),
   PRIMARY KEY (DSID),
   FOREIGN KEY (Source) REFERENCES DATASOURCE (Source) ON DELETE CASCADE ON UPDATE CASCADE
);

CREATE INDEX idx_dataset_source ON DATASET (Source); 

CREATE TABLE IF NOT EXISTS SERIES (
   DSID VARCHAR(30), -- INT
   Series VARCHAR(30) NOT NULL,
   Label VARCHAR(120) NOT NULL,
   S_From DATE CHECK (S_From >= '1800-01-01' AND S_From < '2100-01-01'), 
   S_To DATE CHECK (S_To >= '1800-01-01' AND S_To < '2100-01-01'), 
   S_Source VARCHAR(120),
   S_Url VARCHAR(200),
   PRIMARY KEY (DSID, Series),
   FOREIGN KEY (DSID) REFERENCES DATASET (DSID) ON DELETE CASCADE ON UPDATE CASCADE
);

CREATE INDEX idx_series_DSID ON SERIES (DSID); 

CREATE TABLE IF NOT EXISTS TIME (
    Date DATE UNIQUE CHECK (Date >= '1800-01-01' AND Date < '2100-01-01'),
    Year INT NOT NULL CHECK (Year >= 1800 AND Year < 2100),
    Quarter INT NOT NULL CHECK (Quarter >= 1 AND  Quarter <= 4), 
    FY CHAR(7) NOT NULL,
    QFY INT NOT NULL CHECK (QFY >= 1 AND  QFY <= 4), 
    Month INT NOT NULL CHECK (Month >= 1 AND  Month <= 12), 
    Day INT NOT NULL CHECK (Day > 0 AND Day < 32),
    PRIMARY KEY (Date)
);

CREATE TABLE IF NOT EXISTS DATA (
   Date DATE, 
   DSID VARCHAR(30), 
   Series VARCHAR(30),
   Value DOUBLE NOT NULL,
   PRIMARY KEY (Date, DSID, Series),
   FOREIGN KEY (DSID) REFERENCES DATASET (DSID) ON DELETE CASCADE ON UPDATE CASCADE,
   FOREIGN KEY (DSID, Series) REFERENCES SERIES (DSID, Series) ON DELETE CASCADE ON UPDATE CASCADE,
   FOREIGN KEY (Date) REFERENCES TIME (Date) ON DELETE CASCADE ON UPDATE CASCADE
);

CREATE INDEX idx_data_DSID ON DATA (DSID); 
CREATE INDEX idx_data_series ON DATA (DSID, Series); 
CREATE INDEX idx_data_date ON DATA (Date); 
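
A typical query against this schema, returning values together with their labels and calendar fields, might look like this (the DSID/Series values are placeholders):

```sql
-- Values for one series with its label and calendar attributes
-- ('BOU_I' / 'I_Overnight' are placeholder identifiers)
SELECT t.Date, t.Year, t.Quarter, s.Label, d.Value
FROM DATA d
JOIN SERIES s ON s.DSID = d.DSID AND s.Series = d.Series
JOIN TIME t ON t.Date = d.Date
WHERE d.DSID = 'BOU_I' AND d.Series = 'I_Overnight'
ORDER BY t.Date;
```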

EER Diagram: [image]

Sebastian