# Forecasting Inflation: Romanian Case Study Using SARIMA Models

In this article, we look at how one may choose an optimal SARIMA model by selecting the one with lowest in-sample errors. We also look into how one may use both Eikon and Datastream data together, as well as statistical concepts of stationarity and differencing among others. We also investigate and compare models only using comparative months (e.g.: Jan. with Jan., Feb. with Feb., etc.). You may see the Webinar recording at the end.

We find that the optimal model is a SARIMA(5,0,4)(4,0,0)$_{12}$.

To do so we look at (i) how one may check for stationarity graphically and via (ADF, KPSS & PP) test statistics, (ii) differencing, (iii) ex-ante (before modelling) parameter identification via autocorrelation & partial autocorrelation functions, (iv) the difference between Autoregressive Moving Average (ARMA) & Seasonal Integrated ARMA (SARIMA) models, (v) ex-post (after modelling) parameter identification via information criteria and mean of absolute & squared errors (as well as the use of Python's 'pickle' library), (vi) how one may choose an optimal model specification - reducing errors, (vii) recursive one-step-ahead out-of-sample forecasts, and finally (iix) corresponding month models before concluding.

As a precursor to ARMA modelling, the reader is advised to read 'Information Demand and Stock Return Predictability (Coded in R): Part 2: Econometric Modelling of Non-Linear Variances' - an article without coding that looks into our methods of modelling.

## What are SARIMA Models?

Seasonal AutoRegressive Integrated Moving Average (SARIMA) models are best explained in a separate article. "Time Series Forecasting with SARIMA in Python" is great at showing both the mathematics and fundamental Python basics at play below. I would suggest reading this article 1st that goes through the slightly simpler ARMA model. The only difference is the inclusion of the seasonal factor in:

SARIMA(5,0,4)(4,0,0)$_{12}$:

\begin{align} y_t &= c + \sum_{j=1}^{5} \phi_j y_{t-j} + \sum_{i=1}^{4} \psi_i \epsilon_{t-i} + \sum_{J=1}^{4} \Phi_i y_{t-(J*12)} \\ &= c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \phi_4 y_{t-4} + \phi_5 y_{t-5} + \psi_1 \epsilon_{t-1} + \psi_2 \epsilon_{t-2} + \psi_3 \epsilon_{t-3} + \psi_4 \epsilon_{t-4} + \Phi_1 y_{t-(1*12)} + \Phi_2 y_{t-(2*12)} + \Phi_3 y_{t-(3*12)} + \Phi_4 y_{t-(4*12)} \\ &= c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \phi_4 y_{t-4} + \phi_5 y_{t-5} + \psi_1 \epsilon_{t-1} + \psi_2 \epsilon_{t-2} + \psi_3 \epsilon_{t-3} + \psi_4 \epsilon_{t-4} + \Phi_1 y_{t-12} + \Phi_2 y_{t-24} + \Phi_3 y_{t-36} + \Phi_4 y_{t-48} \end{align}

In our study, we will also use exogenous variables ($x_t$'s) too such that - for a model with $m$ exogenous variables with potentially different $r_k$ lags each:

$y_t = c + \sum_{j=1}^{5} \phi_j y_{t-j} + \sum_{i=1}^{4} \psi_i \epsilon_{t-i} + \sum_{J=1}^{4} \Phi_i y_{t-(J*12)} + \sum_{k=1}^{m} \left( \sum_{n= 1}^{r_k} \beta_{k,n} x_{k , t-{r_k}} \right)$

## Get to Coding ### Development Tools & Resources The example code demonstrating the use case is based on the following development tools and resources:

• Refinitiv's DataStream Web Services (DSWS): Access to DataStream data. A DataStream or Refinitiv Workspace IDentification (ID) will be needed to run the code below.
• Refinitiv's Eikon API: To access Eikon data. An Eikon API key will be needed to use this API.

### Import Libraries We need to gather our data. Since Refinitiv's DataStream Web Services (DSWS) allows for access to ESG data covering nearly 70% of global market cap and over 400 metrics, naturally it is more than appropriate. We can access DSWS via the Python library "DatastreamDSWS" that can be installed simply by using pip install.

    	
import DatastreamDSWS as dsws

# We can use our Refinitiv's Datastream Web Socket (DSWS) API keys that allows us to be identified by Refinitiv's back-end services and enables us to request (and fetch) data: Credentials are placed in a text file so that it may be used in this code without showing it itself.

# It is best to close the files we opened in order to make sure that we don't stop any other services/programs from accessing them if they need to.

# # Alternatively one can use the following:
# import getpass





We use Refinitiv's Eikon Python Application Programming Interface (API) to access financial data such as Clc1. We can access it via the Python library "eikon" that can be installed simply by using pip install.

    	
import eikon as ek

# The key is placed in a text file so that it may be used in this code without showing it itself:
eikon_key = open("eikon.txt","r")
# It is best to close the files we opened in order to make sure that we don't stop any other services/programs from accessing them if they need to:
eikon_key.close()




    	
import numpy as np
import pandas as pd
import pickle # need to ' pip install pickle-mixin '. This library is native to Python and therefore doesn't have a version of its own.
from datetime import datetime, timedelta # ' datetime ' is native to Python and therefore doesn't have a version of its own.




    	
# Import the relevant plotting libraries:

import plotly # Needed to show plots in line in Notebook.
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected = False) # Initialize plotly.js in the browser if it hasn't been loaded into the Document Object Model (DOM) yet.

import cufflinks, seaborn # Needed to show plots in line in Notebook.
cufflinks.go_offline() # plotly & cufflinks are online platforms but we want to stay offline.




    	
for i,j in zip([ek, np, pd, plotly, cufflinks, seaborn],
["eikon", "numpy", "pandas", "plotly", "cufflinks", "seaborn"]):
print("The library '" + j + "' in this notebook is version " + i.__version__)





The library 'eikon' in this notebook is version 1.1.8
The library 'numpy' in this notebook is version 1.18.2
The library 'pandas' in this notebook is version 1.2.4
The library 'plotly' in this notebook is version 4.14.3
The library 'cufflinks' in this notebook is version 0.17.3
The library 'seaborn' in this notebook is version 0.11.1

## Data-sets

• Dependent variables (Y):

• Annual inflation rate – the increase (in percentage term) in consumer prices in one month of the current year compared to the same month of the previous year. This rate is calculated as a ratio, expressed as a percentage, between the price index of one month of the current year and that of the same month of the previous year, calculated against the same base, from which 100 is subtracted.

This data-set will be differenced once to be stationary.

• RMCONPRCF

Due to popular demand, we decided not to use Consumer Prices (Romania, Consumer Prices, by Commodity, All Items, Total, Index, Corresponding Period of the Previous Year = 100) as a dependent variable, but RMCPANNL instead.

### Collecting Datastream Data

We can go ahead and collect our data from Datastream:

    	
df = ds.get_data(tickers = "RMCPANNL,RMUNPTOTP,RMXRUSD.,RMXREUR.,RMGOVBALA", # be carefulnot to put spaces in between elements here, or else these spaces will be included in column names.
fields = "X",
start = '2000-01-01',
end = '2021-04-06',
freq = 'M')





Note that RMCPANNL's values are percentages, we thus ought to divide it by 100 to get proper fractions:

    	
df["RMCPANNL", "X"] = df["RMCPANNL"]["X"]/100




Let's have a look at our Pandas data-frame ' df '

    	
df



Instrument RMCPANNL RMUNPTOTP RMXRUSD. RMXREUR. RMGOVBALA
Field X X X X X
Dates
2000-01-15 0.5680 1175.0 1.84 1.8636 -162.8
2000-02-15 0.5570 1196.6 1.87 1.8421 -461.9
2000-03-15 0.4900 1166.7 1.92 1.8538 -826.0
2000-04-15 0.4890 1139.2 1.98 1.8713 -1210.9
2000-05-15 0.4400 1097.4 2.04 1.8507 -1367.8
... ... ... ... ... ...
2020-12-15 0.0206 296.1 4.00 4.8707 -105906.6
2021-01-15 0.0299 292.2 4.00 4.8728 -112.3
2021-02-15 0.0316 293.5 4.03 4.8741 -8430.8
2021-03-15 0.0305 NaN 4.11 4.8878 NaN
2021-04-15 NaN NaN NaN NaN NaN

256 rows × 5 columns

Note on timing: I aim to make this article as realistic as possible, putting myself in the shoes of a professional using this workflow to predict inflation. I do not expect our SARIMA models to perform particularly well, mainly because we did not do extensive research on the best exogenous variables to use - I leave that to you to decide. But while I do not expect this model to perform well, the workflow ought to be at its best. To use it optimally, please try to use exogenous variables (such as RMXRUSD.) as soon as their monthly values are released. It is possible for the values as of the - 15th of the month - of one variable to be published at another time to another (exempli gratia (e.g.): values for RMXRUSD. and RMGOVBALA could be published on the 20th and 25th of the month respectively, even though they are both representing values for the 15th of the month).

The 'X' is the default value for each of the items we requested. Let's change those with relevant value names:

    	
df.columns = pd.MultiIndex.from_tuples(
[('RMCPANNL', 'yoy'), # yoy will be the increase in one month compared to the same month of the previous year
('RMUNPTOTP', 'raw'),
('RMXRUSD.', 'raw'),
('RMXREUR.', 'raw'),
('RMGOVBALA', 'raw')],
names=['Instrument', 'Field'])





### Collecting Eikon data

Now let's add our CLc1 data from Eikon. Note that we want monthly data, so let's collect monthly (30 day) moving average data (MAVG(TR.CLOSEPRICE,-29)). 1st let's collect that data:

    	
CLc1, err = ek.get_data(
instruments = "CLc1",
fields = ["TR.CLOSEPRICE.timestamp",
"TR.CLOSEPRICE",
"MAVG(TR.CLOSEPRICE,-29)"],
parameters = {'SDate': '1999-01-01', # We want to start 1 year earlier because we're about to take moving averages which makes us loose degrees of freedom.
'EDate': str(datetime.today())[:10], # ' df.index[-1] ' picks the last date in our previously defined data-frame ' df '. You may want to use 'str(datetime.today())[:10]'.
'FRQ': 'D'})
CLc1




 Instrument Timestamp Close Price MAVG(TR.CLOSEPRICE,-29) 0 CLc1 1999-01-01T00:00:00Z 11.5477 1 CLc1 1999-01-04T00:00:00Z 12.36 11.5667 2 CLc1 1999-01-05T00:00:00Z 12 11.611 3 CLc1 1999-01-06T00:00:00Z 12.85 11.646 4 CLc1 1999-01-07T00:00:00Z 13.05 11.6833 ... ... ... ... ... 5592 CLc1 2021-04-13T00:00:00Z 60.45 61.6431 5593 CLc1 2021-04-14T00:00:00Z 62.75 5594 CLc1 2021-04-15T00:00:00Z 63.32 5595 CLc1 2021-04-16T00:00:00Z 63.07 5596 CLc1 2021-04-19T00:00:00Z

5597 rows × 4 columns

But note that the last few moving average data points are NaN. Just in case these last few days are useful, let's make our own moving average:

    	
# This takes the moving average of the last 30 datapoints, not the last 30 days (since not all of the last 30 days might have been trading days with price data).
CLc1["CLc1 30D Moving Average"] = CLc1["Close Price"].dropna().rolling(30).mean()
# Let's remove the ' MAVG(TR.CLOSEPRICE,-29) ' column, it was only there to display the issue outlined.
CLc1.drop('MAVG(TR.CLOSEPRICE,-29)', axis = 1, inplace = True)
CLc1 # Let's have a look at our data-frame now.




 Instrument Timestamp Close Price CLc1 30D Moving Average 0 CLc1 1999-01-01T00:00:00Z NaN 1 CLc1 1999-01-04T00:00:00Z 12.36 NaN 2 CLc1 1999-01-05T00:00:00Z 12 NaN 3 CLc1 1999-01-06T00:00:00Z 12.85 NaN 4 CLc1 1999-01-07T00:00:00Z 13.05 NaN ... ... ... ... ... 5592 CLc1 2021-04-13T00:00:00Z 60.45 61.645 5593 CLc1 2021-04-14T00:00:00Z 62.75 61.7543 5594 CLc1 2021-04-15T00:00:00Z 63.32 61.8297 5595 CLc1 2021-04-16T00:00:00Z 63.07 61.7977 5596 CLc1 2021-04-19T00:00:00Z NaN

5597 rows × 4 columns

### Joining the Datastream and Eikon data-frames together

Now, if we want to combine the data from df and CLc1, we may want to do so joining the two data-frames on their dates; but one may notice that CLc1's Timestamp and df's index (Dates) portray time in different ways:

    	
CLc1.Timestamp # Note that ' CLc1.Timestamp ' is the same as CLc1["Timestamp"]




'1999-01-01T00:00:00Z'

    	
df.index




'2000-01-15'

To harmonize the two, I will use df's way:

    	
'2000-01-31T00:00:00Z'[0:10]




'2000-01-31'

    	
CLc1.index = [i[0:10] for i in CLc1.Timestamp]
CLc1




 Instrument Timestamp Close Price CLc1 30D Moving Average 01/01/1999 CLc1 1999-01-01T00:00:00Z NaN 04/01/1999 CLc1 1999-01-04T00:00:00Z 12.36 NaN 05/01/1999 CLc1 1999-01-05T00:00:00Z 12 NaN 06/01/1999 CLc1 1999-01-06T00:00:00Z 12.85 NaN 07/01/1999 CLc1 1999-01-07T00:00:00Z 13.05 NaN ... ... ... ... ... 13/04/2021 CLc1 2021-04-13T00:00:00Z 60.45 61.645 14/04/2021 CLc1 2021-04-14T00:00:00Z 62.75 61.7543 15/04/2021 CLc1 2021-04-15T00:00:00Z 63.32 61.8297 16/04/2021 CLc1 2021-04-16T00:00:00Z 63.07 61.7977 19/04/2021 CLc1 2021-04-19T00:00:00Z NaN

5597 rows × 4 columns

Now we are encountering the issue whereby the 15th of the month every month is not necessarily a trading day. But we want to normalise the two data-frames to have monthly data on the 15th of every month (since that's how Datastream returns data). What we can do is use CLc1 30D Moving Average on the 15th of every month and (if it's not available) we'll use the 14th's; if that's unavailable we'll use the 13th; and so on until a set limit - say the 12th of every month (repeating this process 3 times). This is what the function below does:

    	
def Match_previous_days(df1, df2, common_day = 15, back = 3):
"""Match_previous_days Version 1.0:
This function compares two Pandas data-frames (df1 and df2, with monthly and daily
data respectively and dates in its index), and checks if any index from df1 is
missing in df2. If one for a day set in 'common_day' is missing, it changes it to
the previous day. It does so as many times as is set in 'back'.
It is created to ease the joining of Datastream API (DSWS) and Eikon API
retrieved Pandas data-frames (df1 and df2 respectively). E.g.:
Setting ' common_day ' to ' 15 ' and back to ' 3 ', our function will scan for
any row-name (called an index) in df1 for the 15th of the month that does not
appear in df2 and change any such index name to the 14th of the month (essentially
showing df2 data for the 14th of the month for those rows specifically as 15th of
the month). It can do it again for the 13th of the month, and then 12th; 3 times
in total (as specified in ' back ').

Parameters:
----------

df1: Pandas data-frame
A monthly data-frame from DSWS - or formatted similarly
(i.e.: with Multi-Index'ed columns, etc.).

df2: Pandas data-frame
A daily data-frame from eikon's API - or formatted
similarly - with one exception that its index is made of strings for dates in the
"yyyy-mm-dd" format.

common_day: int
The day of the month to start from. For an example, see 1st
paragraph above.
Default: common_day = 15

common_day: int
The number of times we are ready to run the procedure for. For an
example, see 1st paragraph above.
Default: back = 3

Dependencies:
----------

pandas 1.2.3
numpy 1.19.5

Examples:
--------

>>> import DatastreamDSWS as dsws
>>>
>>> import eikon as ek
>>> ek.set_app_key("insert eikon key here")
>>>
>>> df_1 = ds.get_data(tickers = "RMCPANNL,RMUNPTOTP,RMXRUSD.,RMXREUR.,RMGOVBALA", # be carefulnot to put spaces in between elements here, or else these spaces will be included in column names.
>>>                    fields = "X", start = '2000-01-01', freq = 'M')
>>>
>>> df_2, err = ek.get_data(instruments = "CLc1", fields = ["TR.CLOSEPRICE.timestamp", "TR.CLOSEPRICE"],
>>>                         parameters = {'SDate': '1999-01-01', # We want to start 1 year earlier because we're about to take moving averages which makes us loose degrees of freedom.
>>>                                       'EDate': str(datetime.today())[:10], # ' df_1.index[-1] ' picks the last date in our previously defined data-frame ' df_1 '. You may want to use 'str(datetime.today())[:10]'.
>>>                                       'FRQ': 'D'})
>>> df_2["CLc1 30D Moving Average"] = df_2["Close Price"].rolling(30).mean() # This takes the moving average of the last 30 datapoints, not the last 30 days (since not all of the last 30 days might have been trading days with price data).
>>> df_2.index = [i[0:10] for i in CLc1.Timestamp]
>>>
>>> Match_previous_days(df1 = df_1, df2 = df_2, common_day = 15, back = 3) # Note that this is the same as ' Match_previous_days(df1 = df_1, df2 = df_2) '.
"""

_common_day = "-" + "{:02d}".format(common_day)
df2_list = []
df2_df1_diff_index1 = []
df2_df1_diff_index2 = []
index_dictionary = []

df2_list.append(df2.copy())

for k in range(back):

if k > 0:
df2_list.append(df2_list[-1])

df2_df1_diff_index1.append(np.setdiff1d(
df1.index.tolist(),
df2_list[k][df2_list[k].index.str.contains(_common_day)].index.tolist()))

df2_df1_diff_index2.append([
i.replace("-15",  "-" + "{:02d}".format(common_day - 1 - k)) # ' -1 ' because k starts at 0.
for i in df2_df1_diff_index1[k]])

index_dictionary.append({})
#         display(index_dictionary[k])
for i,j in zip(df2_df1_diff_index2[k], df2_df1_diff_index1[k]):
index_dictionary[k][i] = j
#         display(index_dictionary[k])

df2_list[k].rename(index = index_dictionary[k], inplace = True)

return df2_list[-1]




    	
CLc1_of_interest0 = Match_previous_days(df1 = df, df2 = CLc1, common_day = 15, back = 4) # Applying the previously defined function.



    	
CLc1_of_interest0.iloc[258:261]



 Instrument Timestamp Close Price CLc1 30D Moving Average 13/01/2000 CLc1 2000-01-13T00:00:00Z 26.69 25.8317 15/01/2000 CLc1 2000-01-14T00:00:00Z 28 25.9453 18/01/2000 CLc1 2000-01-18T00:00:00Z 28.85 26.0753

As per the above, we can see the difference between the index and Timestamp in row 259, indicating that our function above worked.

    	
CLc1_of_interest1 = CLc1_of_interest0[CLc1_of_interest0.index.str.contains("-15")] # We're only interested in the 15th of each month.

CLc1_of_interest = pd.DataFrame(
data = CLc1_of_interest1[
["Close Price", "CLc1 30D Moving Average"] # We're only interested in those two columns.
].values,
index = CLc1_of_interest1.index,
columns = pd.MultiIndex.from_tuples(
[('CLc1', 'Close Price'),
('CLc1', 'Close Price 30D Moving Average')]))

CLc1_of_interest




 CLc1 Close Price Close Price 30D Moving Average 15/01/1999 12.18 NaN 15/03/1999 14.46 12.6067 15/04/1999 16.88 15.5177 15/06/1999 18.54 17.667 15/07/1999 20.2 18.655 ... ... ... 15/12/2020 47.59 43.203 15/01/2021 52.04 48.7013 15/02/2021 59.73 53.606 15/03/2021 65.28 60.8053 15/04/2021 63.32 61.8297

265 rows × 2 columns

    	
df = pd.merge(df, CLc1_of_interest,
left_index = True,
right_index = True)
df




Instrument RMCPANNL RMUNPTOTP RMXRUSD. RMXREUR. RMGOVBALA CLc1
Field yoy raw raw raw raw Close Price
2000-01-15 0.5680 1175.0 1.84 1.8636 -162.8 28.0
2000-02-15 0.5570 1196.6 1.87 1.8421 -461.9 30.03
2000-03-15 0.4900 1166.7 1.92 1.8538 -826.0 30.65
2000-04-15 0.4890 1139.2 1.98 1.8713 -1210.9 25.4
2000-05-15 0.4400 1097.4 2.04 1.8507 -1367.8 29.97
... ... ... ... ... ... ...
2020-12-15 0.0206 296.1 4.00 4.8707 -105906.6 47.59
2021-01-15 0.0299 292.2 4.00 4.8728 -112.3 52.04
2021-02-15 0.0316 293.5 4.03 4.8741 -8430.8 59.73
2021-03-15 0.0305 NaN 4.11 4.8878 NaN 65.28
2021-04-15 NaN NaN NaN NaN NaN 63.32

256 rows × 7 columns

Now, we want comparable data, and RMCPANNL is the only series depicting the increase in one month compared to the same month of the previous year. So let's transform our other series to depict the data similarly:

    	
df["RMUNPTOTP", "yoy"] = df["RMUNPTOTP"]["raw"].pct_change(periods = 12)
df["RMXRUSD.", "yoy"] = df["RMXRUSD."]["raw"].pct_change(periods = 12)
df["RMXRUSD.", "yoy"] = df["RMXRUSD."]["raw"].pct_change(periods = 12)
df["RMXREUR.", "yoy"] = df["RMXREUR."]["raw"].pct_change(periods = 12)
df["RMGOVBALA", "yoy"] = df["RMGOVBALA"]["raw"].pct_change(periods = 12)
df["CLc1", "yoy"] = df["CLc1"][
"Close Price 30D Moving Average"].pct_change(periods = 12)
df = df.sort_index(axis=1)
df = df.dropna()
df




 Instrument CLc1 RMCPANNL RMGOVBALA RMUNPTOTP RMXREUR. RMXRUSD. Field Close Price Close Price 30D Moving Average yoy yoy raw yoy raw yoy raw yoy raw yoy 15/01/2001 30.05 28.535 0.09981 0.399 -306.1 0.88022 1032.9 -0.1209 2.4646 0.32249 2.62 0.42391 15/02/2001 28.8 29.8113 0.07816 0.4 -601.2 0.30158 1032.3 -0.1373 2.4729 0.34244 2.68 0.43316 15/03/2001 26.58 28.976 -0.0423 0.403 -865.2 0.04746 992.8 -0.1491 2.4849 0.34044 2.73 0.42188 15/04/2001 28.2 27.25 -0.0318 0.375 -1087.5 -0.1019 948.4 -0.1675 2.488 0.32956 2.79 0.40909 15/05/2001 28.98 27.9327 0.0575 0.374 -1404.5 0.02683 890.8 -0.1883 2.491 0.34598 2.85 0.39706 ... ... ... ... ... ... ... ... ... ... ... ... ... 15/10/2020 40.92 39.4793 -0.2909 0.0224 -76512 1.81579 285.7 0.10437 4.8733 0.02514 4.14 -0.0372 15/11/2020 40.12 39.5773 -0.2836 0.0214 -85982 1.47431 290.7 0.12066 4.8699 0.02131 4.12 -0.0441 15/12/2020 47.59 43.203 -0.2489 0.0206 -105907 1.20275 296.1 0.14812 4.8707 0.01955 4 -0.0698 15/01/2021 52.04 48.7013 -0.1892 0.0299 -112.3 -0.8426 292.2 0.12862 4.8728 0.01973 4 -0.0719 15/02/2021 59.73 53.606 -0.0281 0.0316 -8430.8 -0.199 293.5 0.14336 4.8741 0.01909 4.03 -0.0799

242 rows × 12 columns

## Stationarity

You may wonder why we spoke of differencing our data above. This is because our data may not be stationary. We need to verify that our data is stationary (for more information as to why, please see here - I suggest reading this article part (2) from the start - it is very much relevant to our current investigation). differencing it once may resolve that problem. In our case, for a time series $\{z\}_1^T$ (i.e.: going from $z_1, z_2, ...$ till $z_T$), differencing it once would render $\{z_2 - z_1, z_3 - z_2, ..., z_T - z_{T-1}\}$. Let's check if our independent variable - RMCPANNL's y/y - is stationary. We can check for stationarity (i) via graphical analysis, via the (ii) Augmented Dickey-Fuller (ADF), (iii) Phillips-Perron (PP) or the (iv) Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test statistics.

### Checking For Stationarity Via Graphical Analysis

We must admit that our RMCPANNL dataset does not look stationary

    	
df["RMCPANNL"]["yoy"].iplot(theme = "solar", title = "RMCPANNL", xTitle = "date",
yTitle = "% increase from same month in previous year (yoy)")





RMCPANNL's 1st difference (1d) does look rather stationary:

    	
df["RMCPANNL"]["yoy"].diff(1).iplot(theme = "solar",
title = "RMCPANNL's 1st difference",
xTitle = "date",
yTitle = "1st difference in yoy (yoy1d)")





Upon quick graphical visualisation, it seems as though we ought to choose RMCPANNL's 1st difference

### Checking For Stationarity Via Test Statistics

We can create a Python function for the three aforementioned test-statistics for stationarity: Augmented Dickey-Fuller (ADF), Phillips-Perron (PP) and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) ones.

For the ADF test, one may use the adfuller function from the statsmodels library (imported via the line from statsmodels.tsa.stattools import adfuller), but that library doesn't have functions for all the tests in question. the arch library does - however. So I will define a Python function Stationarity_Table returning all relevant arch stationarity results of interest:

    	
import arch
from arch.unitroot import ADF, PhillipsPerron, KPSS
print(arch.__version__)





4.19

    	
def Stationarity_Table(data = [df["RMCPANNL"]["yoy"].dropna(),
df["RMUNPTOTP"]["raw"].dropna()],
dataset_names = ["RMCPANNL", "RMUNPTOTP"],
enumerate_data = False):
"""Stationarity_Table Version 1.0:
This function returns a Pandas data-frame with the results of the
Augmented Dickey-Fuller (ADF), Phillips-Perron (PP) and the
Kwiatkowski–Phillips–Schmidt–Shin (KPSS) stationarity tests from
the arch library.

Parameters:
----------

data: list
list of nan-less Pandas data-frames of number-series on which the test will be performed
Default: data = [df["RMCPANNL"]["yoy"].dropna(), df["RMUNPTOTP"]["raw"].dropna()]

dataset_names: list
List of strings of the names to show as columns in our resulted table
Default: dataset_names = ["RMCPANNL", "RMUNPTOTP"]

tests: list
List of strings of the stationarity tests wished.
You may choose between "ADF", "PP", and/or "KPSS".
Default: tests = ["ADF", "PP", "KPSS"]

Dependencies:
----------

pandas 1.2.3
numpy 1.19.5
arch 4.15 as ' from arch.unitroot import ADF, PhillipsPerron, KPSS '

Examples:
--------

>>> import DatastreamDSWS as dsws
>>>
>>> df_1 = ds.get_data(tickers = "RMCPANNL,RMUNPTOTP", # be carefulnot to put spaces in between elements here, or else these spaces will be included in column names.
>>>                    fields = "X", start = '2000-01-01', freq = 'M')
>>>
>>> Stationarity_Table(data = [df_1["RMCPANNL"]["X"].dropna(), df_1["RMUNPTOTP"]["X"].dropna()], dataset_names = ["RMCPANNL", "RMUNPTOTP"], tests = ["ADF", "PP", "KPSS"], enumerate_data = False)
"""

_columns, stationarity_test = [], []
for i in tests:
_columns.append((i, "Test-statistic", "")) # The ', ""' is there for the Pandas MultiIndex
_columns.append((i, "p-value", ""))
_columns.append((i, "Number of lags used", ""))
_columns.append(
(i,
"Number of observations used for the regression and critical values'calculation",
""))
_columns.append((i, "Critical values", '1%'))
_columns.append((i, "Critical values", '5%'))
_columns.append((i, "Critical values", '10%'))
_columns.append((i, "Null hypothesis", ""))

for d, _d in zip([D.astype(np.float64) for D in data], enumerate(data)):

if enumerate_data == True:
print(_d)

_tests, _data = [], []
if "PP" in tests: _tests.append(PhillipsPerron(d))
if "KPSS" in tests: _tests.append(KPSS(d))

for i in _tests:
_data.append(i.stat)
_data.append(i.pvalue)
_data.append(i.lags)
_data.append(i.nobs)
for k in [i.critical_values[j] for j in i.critical_values]:
_data.append(k)
_data.append(i.null_hypothesis)

stationarity_test.append(
pd.DataFrame(
data = np.array(_data)[np.newaxis],
columns = pd.MultiIndex.from_tuples(_columns)).T)

stationarity_test_table = pd.concat(
stationarity_test, axis = 1, join = "inner")

stationarity_test_table = pd.DataFrame(
data = stationarity_test_table.values,
index = stationarity_test_table.index,
columns = dataset_names)

return stationarity_test_table




    	
Stationarity_Table(
data =  [df["RMCPANNL"]["yoy"].dropna(),
df["RMUNPTOTP"]["raw"].dropna(),
df["RMUNPTOTP"]["yoy"].dropna(),
df["RMXRUSD."]["raw"].dropna(),
df["RMXRUSD."]["yoy"].dropna(),
df["RMXREUR."]["raw"].dropna(),
df["RMXREUR."]["yoy"].dropna(),
df["RMGOVBALA"]["raw"].dropna(),
df["RMGOVBALA"]["yoy"].dropna(),
df["CLc1"]["Close Price 30D Moving Average"].dropna(),
df["CLc1"]["yoy"].dropna()],
dataset_names = ["RMCPANNL yoy",
"RMUNPTOTP raw","RMUNPTOTP yoy",
"RMXRUSD. raw", "RMXRUSD. yoy",
"RMXREUR. raw", "RMXREUR. yoy",
"RMGOVBALA raw", "RMGOVBALA yoy",
"CLc1 Close Price 30D Moving Average",
"CLc1 Close Price yoy"],
enumerate_data = False)
# # for ' data ', one could use:
# [df[i][j].dropna() for i,j in zip(
#     ["RMCPANNL", "RMUNPTOTP", "RMUNPTOTP",
#      "RMXRUSD.", "RMXRUSD.", "RMXREUR.", "RMXREUR.",
#      "RMGOVBALA", "RMGOVBALA", "CLc1", "CLc1"],
#     ["yoy", "raw", "yoy", "raw", "yoy", "raw", "yoy",
#      "raw", "yoy", "Close Price 30D Moving Average", "yoy"])]




 RMCPANNL yoy RMUNPTOTP raw RMUNPTOTP yoy RMXRUSD. raw RMXRUSD. yoy RMXREUR. raw RMXREUR. yoy RMGOVBALA raw RMGOVBALA yoy CLc1 Close Price 30D Moving Average CLc1 Close Price yoy ADF Test-statistic -4.0516 -2.2359 -3.6895 -1.5542 -3.439 -2.4889 -2.642 -0.0676 -15.433 -2.5897 -2.9546 p-value 0.00116 0.19348 0.00426 0.50657 0.0097 0.1182 0.08463 0.95257 2.93E-28 0.09515 0.03935 Number of lags used 15 15 15 1 13 1 14 15 0 2 13 Number of observations used for the regression and critical values'calculation 226 226 226 240 228 240 227 226 241 239 228 Critical values 1% -3.4596 -3.4596 -3.4596 -3.4579 -3.4594 -3.4579 -3.4595 -3.4596 -3.4578 -3.458 -3.4594 5% -2.8744 -2.8744 -2.8744 -2.8737 -2.8743 -2.8737 -2.8744 -2.8744 -2.8736 -2.8737 -2.8743 10% -2.5736 -2.5736 -2.5736 -2.5732 -2.5736 -2.5732 -2.5736 -2.5736 -2.5732 -2.5733 -2.5736 Null hypothesis The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. PP Test-statistic -6.2226 -2.6368 -2.9359 -1.4358 -3.3385 -2.4428 -2.8822 -6.6524 -15.445 -2.0948 -3.1094 p-value 5.18E-08 0.08563 0.04133 0.56501 0.01324 0.13002 0.04746 5.08E-09 2.84E-28 0.2466 0.02587 Number of lags used 15 15 15 15 15 15 15 15 15 15 15 Number of observations used for the regression and critical values'calculation 241 241 241 241 241 241 241 241 241 241 241 Critical values 1% -3.4578 -3.4578 -3.4578 -3.4578 -3.4578 -3.4578 -3.4578 -3.4578 -3.4578 -3.4578 -3.4578 5% -2.8736 -2.8736 -2.8736 -2.8736 -2.8736 -2.8736 -2.8736 -2.8736 -2.8736 -2.8736 -2.8736 10% -2.5732 -2.5732 -2.5732 -2.5732 -2.5732 -2.5732 -2.5732 -2.5732 -2.5732 -2.5732 -2.5732 Null hypothesis The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root.

Interpreting the stationarity table: As per the above, the null hypothesis of the ADF is that there is a unit root, with the alternative that there is no unit root. If the P-value is above a critical size, then we cannot reject that there is a unit root. If the process has a unit root, then it is a non-stationary time series. If the p-value is close to significant, then the critical values should be used to judge whether to reject the null.

Since ${ADF}_{{Test-statistic}_{RMCPANNL yoy}} \approx -6.025 < {ADF}_{{1\% Critical value}_{RMCPANNL yoy}} \approx -3.471$, we can reject the hypothesis that there is a unit root and presume our series to be stationary. Same conclusions can be taken from the PP and KPSS tests of our RMCPANNL data. This is an interesting finding since our graphical analysis suggested otherwise.

Nonetheless, analysing the table's figures for RMUNPTOTP, RMXRUSD, RMXREUR, and RMGOVBALA shows us that they're non-stationary (for the most part - especially with the ADF test). We ought to difference them once and see if they are stationary then.

### Differencing

As aforementioned, differencing a time series $\{z\}_1^T$ once would render $\{z_2 - z_1, z_3 - z_2, ..., z_T - z_{T-1}\}$. Let's add the 1st difference of our data to our Pandas data-frame df:

    	
df["RMCPANNL", "yoy1d"] = df["RMCPANNL"]["yoy"].diff(1)
df["RMUNPTOTP", "yoy1d"] = df["RMUNPTOTP"]["yoy"].diff(1)
df["RMXRUSD.", "yoy1d"] = df["RMXRUSD."]["yoy"].diff(1)
df["RMXRUSD.", "yoy1d"] = df["RMXRUSD."]["yoy"].diff(1)
df["RMXREUR.", "yoy1d"] = df["RMXREUR."]["yoy"].diff(1)
df["RMGOVBALA", "yoy1d"] = df["RMGOVBALA"]["yoy"].diff(1)
df["CLc1", "yoy1d"] = df["CLc1"]["yoy"].diff(1)
df = df.sort_index(axis = 1)

df.dropna()




 Instrument CLc1 RMCPANNL RMGOVBALA RMUNPTOTP RMXREUR. RMXRUSD. Field Close Price Close Price 30D Moving Average yoy yoy1d yoy yoy1d raw yoy yoy1d raw yoy yoy1d raw yoy yoy1d raw yoy yoy1d 15/02/2001 28.8 29.8113 0.07816 -0.0217 0.4 0.001 -601.2 0.30158 -0.5786 1032.3 -0.1373 -0.0164 2.4729 0.34244 0.01994 2.68 0.43316 0.00924 15/03/2001 26.58 28.976 -0.0423 -0.1205 0.403 0.003 -865.2 0.04746 -0.2541 992.8 -0.1491 -0.0117 2.4849 0.34044 -0.002 2.73 0.42188 -0.0113 15/04/2001 28.2 27.25 -0.0318 0.01048 0.375 -0.028 -1087.5 -0.1019 -0.1494 948.4 -0.1675 -0.0184 2.488 0.32956 -0.0109 2.79 0.40909 -0.0128 15/05/2001 28.98 27.9327 0.0575 0.08932 0.374 -0.001 -1404.5 0.02683 0.12874 890.8 -0.1883 -0.0208 2.491 0.34598 0.01642 2.85 0.39706 -0.012 15/06/2001 28.53 28.6177 -0.0403 -0.0978 0.357 -0.017 -2268.9 0.25973 0.2329 840.3 -0.2125 -0.0242 2.4732 0.23846 -0.1075 2.9 0.38095 -0.0161 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 15/10/2020 40.92 39.4793 -0.2909 -0.0379 0.0224 -0.0021 -76512 1.81579 -0.0034 285.7 0.10437 0.00959 4.8733 0.02514 -0.0004 4.14 -0.0372 0.00465 15/11/2020 40.12 39.5773 -0.2836 0.00733 0.0214 -0.001 -85982 1.47431 -0.3415 290.7 0.12066 0.0163 4.8699 0.02131 -0.0038 4.12 -0.0441 -0.0069 15/12/2020 47.59 43.203 -0.2489 0.03465 0.0206 -0.0008 -105907 1.20275 -0.2716 296.1 0.14812 0.02746 4.8707 0.01955 -0.0018 4 -0.0698 -0.0257 15/01/2021 52.04 48.7013 -0.1892 0.05969 0.0299 0.0093 -112.3 -0.8426 -2.0454 292.2 0.12862 -0.0195 4.8728 0.01973 0.00018 4 -0.0719 -0.0022 15/02/2021 59.73 53.606 -0.0281 0.16113 0.0316 0.0017 -8430.8 -0.199 0.64362 293.5 0.14336 0.01474 4.8741 0.01909 -0.0006 4.03 -0.0799 -0.008

241 rows × 18 columns

    	
Stationarity_Table(
data = [
df[i]["yoy1d"].dropna()
for i in [
"RMCPANNL", "RMUNPTOTP", "RMXRUSD.",
"RMXREUR.", "RMGOVBALA", "CLc1"]],
dataset_names = [
"RMCPANNL yoy1d", "RMUNPTOTP yoy1d",
"RMXRUSD. yoy1d", "RMXREUR. yoy1d",
"RMGOVBALA yoy1d", "CLc1 Close Price yoy1d"],




 RMCPANNL yoy1d RMUNPTOTP yoy1d RMXRUSD. yoy1d RMXREUR. yoy1d RMGOVBALA yoy1d CLc1 Close Price yoy1d ADF Test-statistic -3.9779 -5.1251 -5.8272 -3.691 -8.2178 -8.5853 p-value 0.00153 1.25E-05 4.05E-07 0.00424 6.55E-13 7.53E-14 Number of lags used 14 12 12 15 9 11 Number of observations used for the regression and critical values'calculation 226 228 228 225 231 229 Critical values 1% -3.4596 -3.4594 -3.4594 -3.4598 -3.459 -3.4592 5% -2.8744 -2.8743 -2.8743 -2.8745 -2.8741 -2.8742 10% -2.5736 -2.5736 -2.5736 -2.5737 -2.5735 -2.5735 Null hypothesis The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. PP Test-statistic -13.791 -9.5569 -9.553 -10.36 -57.826 -7.7634 p-value 8.94E-26 2.49E-16 2.54E-16 2.41E-18 0 9.33E-12 Number of lags used 15 15 15 15 15 15 Number of observations used for the regression and critical values'calculation 240 240 240 240 240 240 Critical values 1% -3.4579 -3.4579 -3.4579 -3.4579 -3.4579 -3.4579 5% -2.8737 -2.8737 -2.8737 -2.8737 -2.8737 -2.8737 10% -2.5732 -2.5732 -2.5732 -2.5732 -2.5732 -2.5732 Null hypothesis The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root. The process contains a unit root.

Now they're all stationary. We ought to use datasets treated similarly for consistency (e.g.: compare 'RMCPANNL yoy1d' with 'RMUNPTOTP yoy1d' rather than with 'RMUNPTOTP yoy'), so we will use the ones shown in the last table (i.e.: 'yoy1d' datasets).

N.B.: The RMGOVBALA yoy1d dataset is causing issues with the KPSS test, but that one is only for weak stationarity, and is not of as much interest to us in our study - I just wanted to include it to portray the extent of the Stationarity_Table function. The RMGOVBALA yoy1d dataset is causing issues with the KPSS test because of its large variance, as can be seen in this graph:

    	
df["RMGOVBALA"]["yoy1d"].iplot(
theme = "solar", xTitle = "date", yTitle = "yoy1d",
title = "RMGOVBALA's year on year % increase's 1st difference")





## Modelling

Now we will look at which ARMA-family model to implement. We'll look at the best model orders to choose before (ex-ante) and after (ex-post) going through modelling procedures.

### Ex-Ante Parameter Identification: Autocorrelation and Partial Autocorrelation Functions

To figure out our parameter identification (i.e.: what lags to use in our model), we may want to plot ACF and PACF diagrams:

    	
import matplotlib.pyplot as plt
import statsmodels.api as sm





N.B.: the semicolon at the end of the code for our plots below is there to avoid duplicates.

    	
sm.graphics.tsa.plot_acf(x = df["RMCPANNL"]["yoy1d"].dropna(), lags = 25);
sm.graphics.tsa.plot_pacf(x = df["RMCPANNL"]["yoy1d"].dropna(), lags = 25);





From these graphs (also known as (a.k.a.): correlograms), we could identify an AR(1,4,5) (non-consecutive) model; but the MA part is harder to decipher. Let's have a look at how those models would fair:

• 1st: a simple ARMA(5,2) model (i.e.: a SARIMA(5,0,2) model), then
• 2nd: a SARIMA(5,0,2)(1, 0, 0)12 model:

#### ARMA(5,2)

    	
from statsmodels.tsa.arima.model import ARIMA



    	
arima502c = ARIMA(endog = np.array(df["RMCPANNL"]["yoy1d"].dropna()),
order = (5, 0, 2), # We don't need to difference our data again, so 'd' is set to 0.
trend = "c")
arima502c_fit = arima502c.fit()




    	
arima502c_fit.summary()



 SARIMAX Results Dep. Variable: y No. Observations: 241 Model: ARIMA(5, 0, 2) Log Likelihood 847.684 Date: Mon, 19 Apr 2021 AIC -1677.4 Time: 14:33:02 BIC -1646 Sample: 0 HQIC -1664.7 -241 Covariance Type: opg coef std err z P>|z| [0.025 0.975] const -0.0015 0.001 -1.913 0.056 -0.003 3.66E-05 ar.L1 0.0858 17.28 0.005 0.996 -33.783 33.955 ar.L2 0.0603 8.153 0.007 0.994 -15.919 16.039 ar.L3 0.1161 1.11 0.105 0.917 -2.059 2.292 ar.L4 -0.0028 1.219 -0.002 0.998 -2.393 2.387 ar.L5 -0.0027 0.782 -0.003 0.997 -1.535 1.53 ma.L1 0.0995 17.281 0.006 0.995 -33.77 33.969 ma.L2 0.0615 5.295 0.012 0.991 -10.317 10.44 sigma2 5.15E-05 3.30E-06 15.595 0 4.51E-05 5.80E-05 Ljung-Box (L1) (Q): 0 Jarque-Bera (JB): 274.28 Prob(Q): 0.96 Prob(JB): 0 Heteroskedasticity (H): 0.53 Skew: -0.6 Prob(H) (two-sided): 0.01 Kurtosis: 8.09

Warnings:

 Covariance matrix calculated using the outer product of gradients (complex-step).

#### SARIMA(5,0,2)(1, 0, 0)12

    	
s1arima502c = ARIMA(endog = np.array(df["RMCPANNL"]["yoy1d"].dropna()),
order = (5, 0, 2),
trend = "c",
seasonal_order = (1, 0, 0, 12))
s1arima502c_fit = s1arima502c.fit()




    	
s1arima502c_fit.summary()



 SARIMAX Results Dep. Variable: y No. Observations: 241 Model: ARIMA(5, 0, 2)x(1, 0, [], 12) Log Likelihood 868.387 Date: Mon, 19 Apr 2021 AIC -1716.8 Time: 14:33:04 BIC -1681.9 Sample: 0 HQIC -1702.7 -241 Covariance Type: opg coef std err z P>|z| [0.025 0.975] const -0.0015 0.001 -1.856 0.063 -0.003 8.27E-05 ar.L1 0.0761 2.209 0.034 0.973 -4.253 4.405 ar.L2 0.0641 1.92 0.033 0.973 -3.698 3.827 ar.L3 0.2522 0.31 0.815 0.415 -0.355 0.859 ar.L4 0.052 0.474 0.11 0.913 -0.877 0.981 ar.L5 0.0496 0.366 0.136 0.892 -0.667 0.766 ma.L1 0.1203 2.203 0.055 0.956 -4.198 4.439 ma.L2 0.0607 1.679 0.036 0.971 -3.231 3.352 ar.S.L12 -0.4396 0.042 -10.401 0 -0.522 -0.357 sigma2 4.29E-05 2.85E-06 15.025 0 3.73E-05 4.85E-05 Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 142.78 Prob(Q): 0.92 Prob(JB): 0 Heteroskedasticity (H): 0.53 Skew: -0.56 Prob(H) (two-sided): 0 Kurtosis: 6.6

Warnings:

 Covariance matrix calculated using the outer product of gradients (complex-step).

### Ex-Post Parameter Identification: Information Criteria and Mean of Absolute & squared Errors

#### Graphical Analysis

After having fit a model, one may try to look at the residuals to compare models:

    	
models_resid = pd.DataFrame(
data = {'arima502c residuals' : arima502c_fit.resid,
's1arima502c residuals' : s1arima502c_fit.resid})
models_resid.iplot(
theme = "solar", xTitle = "date",
yTitle = "residuals (%)",
title = "arima502c and s1arima502c models' residuals")





Looking at such graphical information is not simple. (i) Not only is it arbitrary to chose a model over another from graphical information (ii) it quickly becomes difficult when comparing many models together. That's why we may compare them instead with performance measurements such as Information Criteria:

#### Information Criteria

    	
def IC_Table(models_fit, models_name):
return pd.DataFrame(
data = [[i.aic, i.aicc, i.bic, i.hqic]
for i in models_fit],
columns = ['AIC', 'AICC', 'BIC', 'HQIC'],
index = [str(i) for i in models_name])

# ' ic ' standing for 'information criteria':
ic0 = IC_Table(models_fit = [arima502c_fit, s1arima502c_fit],
models_name = ['arima502c', 's1arima502c'])

ic0




 AIC AICC BIC HQIC arima502c -1677.4 -1676.6 -1646 -1664.7 s1arima502c -1716.8 -1715.8 -1681.9 -1702.7

We can similarly go through each permutation within limits of our SARIMA model's parameters, and go beyond just recording IC information - including mean squared error (MSE) and sum of squared error (SSE, a.k.a.: RSS) information too.

#### Mean Squared Errors and Sum of squared Errors

We can define the below function to return a table of

• the Akaike information criterion (AIC),
• the AICc (AIC with a correction for small sample sizes),
• the Bayesian information criterion (BIC) (aka Schwarz criterion (also SBC, SBIC)) and
• the Hannan-Quinn information criterion (HQC) as well as
• the Mean Squared Errors (MSE) and
• the Sum of Squared Errors (SSE) (aka: residual sum of squares (RSS))

of any specified ARIMAX model, and then choose the one providing the lowest (of any one chosen error measure above) as the optimal model of choice:

    	
import itertools # We need this library to iteratethrough each permutation of exogenous variable possible




N.B.: Note that we need to lag our exogenous variables since we are putting ourselves in the shoes of a programmer who is using past values to predict tomorrow's.

    	
def Reg_Perm_Table(endog = np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist()), # the ' [1:] ' is here to get same number of figures as in our exogenous variables.
exogenous = [df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist() # ' .shift(1) ' is here to get lagged values; the ' [1:] ' is there to remove the 0th value which is nan due to the lagging.
for i in ["CLc1", "RMGOVBALA", "RMUNPTOTP", "RMXREUR.", "RMXRUSD."]],
exogenous_names = ["CLc1", "RMGOVBALA", "RMUNPTOTP", "RMXREUR.", "RMXRUSD."],
d = , D = , s = ,
p = range(5), q = range(5),
P = range(5), Q = range(5),
trend = "c",
print_progress = False,
try_loop = False,
save_model_fit = False,
lowest_no = 3):
""" Reg_Perm_Table Version 1.0:
This Python function ' Reg_Perm_Table ' goes through each permutation of possible ARIMA orders
(chosen in parameters d, D, s, p, P, q, and Q) and exogenous variables and returns two Pandas
data-frames (table and lowest) to help decide (i) the optimal exogenous variables to use,
(ii) the optimal order of our ARIMA model to use.
' table ' consists of the Akaike information criterion (AIC),
the AICc (AIC with a correction for small sample sizes),
the Bayesian information criterion (BIC) (aka Schwarz criterion (also SBC, SBIC)) and
the Hannan-Quinn information criterion (HQC)
as well as the Mean Squared Errors (MSE) and
the Sum of Squared Errors (SSE) (aka: residual sum of squares (RSS)) of any specified ARIMAX model.
' lowest ' consists of the name of all models with lowest values of all the metrics in ' table '.

For more information on ARIMA orders (both (i) orders p, d, and q and (ii) seasonal orders P, D,
Q and s), look at the statsmodels website:
https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html

Parameters:
----------

endog: Numpy array
Numpy array of a list of values used as endogenous variables in our ARIMAX model.
Defaulted to: np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist()) # The ' [1:] ' is here to get same number of figures as in our exogenous variables.

exogenous: list
List of lists of each of the exogenous variables wished. can be set to ' None ' BUT
IT CANNOT CONSIST OF ONLY ONE ELEMENT.
Defaulted to: exogenous = [df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist() for i in ["CLc1", "RMGOVBALA", "RMUNPTOTP", "RMXREUR.", "RMXRUSD."]] # ' .shift(1) ' is here to get lagged values; the ' [1:] ' is there to remove the 0th value which is nan due to the lagging.

exogenous_names: list
list of strings of the names of the exogenous variables wished to be seen in
the outputed data-frames. They have to be in the same order as in the ' exogenous ' variables.
Defaulted to: exogenous_names = ["CLc1", "RMGOVBALA", "RMUNPTOTP", "RMXREUR.", "RMXRUSD."]

p: list
list of the ARIMA order 'p' that the user wishes to iteratethrough. This can be a range.
Defaulted to: p = 

d: list
list of the ARIMA order 'd' that the user wishes to iteratethrough. This can be a range.
Defaulted to: d = 

q: list
list of the ARIMA order 'q' that the user wishes to iteratethrough. This can be a range.
Defaulted to: q = 

P: list
list of the ARIMA seasonal order 'P' that the user wishes to iteratethrough. This can be a range.
Defaulted to: P = 

D: list
list of the ARIMA seasonal order 'D' that the user wishes to iteratethrough. This can be a range.
Defaulted to: D = 

Q: list
list of the ARIMA seasonal order 'Q' that the user wishes to iteratethrough. This can be a range.
Defaulted to: Q = 

s: list
list of the ARIMA seasonal order 's' that the user wishes to iteratethrough. This can be a range.
Defaulted to: s = 

trend: str
str{‘n’,’c’,’t’,’ct’} or 'iterate-able', optional as per statsmodels.tsa.arima.model.ARIMA.
Defaulted to: trend = "c"

print_progress: Boolean
If True, then a summary of each fitted model will appear in order of execution.
Defaulted to: print_progress = False

try_loop: Boolean
If true, a try and except (then pass) loop is used over the model specification and
fitting. If that part of the code fails, it will just ignore the model that caused an error from the output.
Defaulted to: try_loop = False

save_model_fit: Boolean
ARIMA model fit's Python objects take up a lot of memory, so they are not saved. If
this argument is st to True, our function then returns a new list full of each fit objects in the order they
were executed called ' model_fit_list '; the function returns: table, model_fit_list, lowest
Defaulted to: save_model_fit = False

lowest_no: int
the returned table ' lowest ' provides the name of each model with lowest AIC, AICC, BIC,
HQIC, MSE and SSE up to the 3rd lowest value by default, but you can change that here.
Defaulted to: lowest_no = 3

Dependencies:
----------

itertools
numpy as np: version 1.19.5
pandas as pd: version 1.2.3
from statsmodels.tsa.arima.model import ARIMA: version 0.11.1

Examples:
--------

>>> import DatastreamDSWS as dsws # Using LSEG Refinitiv's Datastream Python API named DataStreem Web Service (DSWS).
>>>
>>> df_1 = ds.get_data(tickers = "RMCPANNL,RMGOVBALA,RMXRUSD.", # be carefulnot to put spaces in between elements here, or else these spaces will be included in column names.
>>>                    fields = "X", start = '2000-01-01', freq = 'M')
>>>
>>> Reg_Perm_Table(endog = np.array(df_1.dropna()["RMCPANNL"]["X"][1:].values.tolist()), # the ' [1:] ' is here to get same number of figures as in our exogenous variables.
>>>                exogenous = [df_1.dropna()[i]["X"].shift(1)[1:].values.tolist() # ' .shift(1) ' is here to get lagged values; the ' [1:] ' is there to remove the 0th value which is nan due to the lagging.
>>>                             for i in ["RMXRUSD.", "RMGOVBALA"]],
>>>                exogenous_names = ["RMXRUSD.", "RMGOVBALA"])
"""

model_table_data, model_fit_list, models_name_list = [], [], [] # Lists to be appended.

for _p in p: # Going through each permutation of possible arguments.
for _d in d:
for _q in q:
for _P in P:
for _D in D:
for _Q in Q:
for _s in s:

def Modelling(exog1, endog1 = endog,
order1 = (_p, _d, _q), trend1 = trend,
seasonal_order1 = (_P, _D, _Q, _s)):
# global?
return ARIMA(endog = endog1, exog = exog1,
order = order1, trend = trend1,
seasonal_order = (_P, _D, _Q, _s)).fit()
if exogenous != None:
for j in range(1,len(exogenous_names)+1):
for i,i_name in zip(itertools.combinations(exogenous, r = j),
itertools.combinations(exogenous_names, r = j)):

if try_loop == True:
try: # statsmodels' ARIMA function does not work 100% of the time.
model = Modelling(exog1 = np.array(i).T)
model_table_data.append(
[model.aic, model.aicc, model.bic,
model.hqic, model.mse, model.sse])
if save_model_fit == True: model_fit_list.append(model)
models_name_list.append(
f"sarima{_p},{_d},{_q},{_P},{_D},{_Q},{_s},{trend},exo={i_name}")
if print_progress == True: # For debugging.
print(models_name_list[-1])
display(model_fit_list[-1].summary())
except:
pass
else: # i.e.: if try_loop != True
model = Modelling(exog1 = np.array(i).T)
model_table_data.append(
[model.aic, model.aicc, model.bic,
model.hqic, model.mse, model.sse])
if save_model_fit == True: model_fit_list.append(model)
models_name_list.append(
f"sarima{_p},{_d},{_q},{_P},{_D},{_Q},{_s},{trend},exo={i_name}")
if print_progress == True: # For debugging.
print(models_name_list[-1])
display(model_fit_list[-1].summary())

elif exogenous == None:
if try_loop == True:
try: # statsmodels' ARIMA function does not work 100% of the time.
model = Modelling(exog1 = None)
model_table_data.append(
[model.aic, model.aicc, model.bic,
model.hqic, model.mse, model.sse])
if save_model_fit == True: model_fit_list.append(model)
models_name_list.append(
f"sarima{_p},{_d},{_q},{_P},{_D},{_Q},{_s},{trend},exo={i_name}")
if print_progress == True: # For debugging.
print(models_name_list[-1])
display(model_fit_list[-1].summary())
except:
pass

else: # i.e.: if try_loop != True
model = Modelling(exog1 = None)
model_table_data.append(
[model.aic, model.aicc, model.bic,
model.hqic, model.mse, model.sse])
if save_model_fit == True: model_fit_list.append(model)
models_name_list.append(
f"sarima{_p},{_d},{_q},{_P},{_D},{_Q},{_s},{trend},exo={i_name}")
if print_progress == True: # For debugging.
print(models_name_list[-1])
display(model_fit_list[-1].summary())

table = pd.DataFrame(
data = model_table_data,
columns = ['AIC', 'AICC', 'BIC', 'HQIC', "MSE", "SSE"],
index = [str(i) for i in models_name_list]) # Just like the previously defined 'IC_Table' function.

if lowest_no == None or lowest_no == False:
if save_model_fit == True:
return table, model_fit_list
else:
return table
else:
if lowest_no == 1:
lowest_index = ["1st lowest"]
elif lowest_no == 2:
lowest_index = [f"{i} lowest" for i in ['1st', '2nd']]
elif lowest_no == 3:
lowest_index = [f"{i} lowest" for i in ['1st', '2nd', '3rd']]
elif lowest_no > 3:
lowest_index = [f"{i} lowest"
for i in list(['1st', '2nd', '3rd'] +
[f"{j}th" for j in range(4, lowest_no + 1)])]
else:
print(f"{lowest_no} should be a positive integer")
lowest = pd.DataFrame({'AIC': table.nsmallest(lowest_no, 'AIC').index,
'AICC': table.nsmallest(lowest_no, 'AICC').index,
'BIC': table.nsmallest(lowest_no, 'BIC').index,
'HQIC': table.nsmallest(lowest_no, 'HQIC').index,
'MSE': table.nsmallest(lowest_no, 'MSE').index,
'SSE': table.nsmallest(lowest_no, 'SSE').index},
index = lowest_index)
if save_model_fit == True:
return table, model_fit_list, lowest
else:
return table, lowest





Now let's put this Python function into practice:

    	
# We get a lot of 'Maximum Likelihood optimization failed to converge.
# Check mle_retvals' error messages, this cell will remove them.
# You don't need to remove them, it's just neater to.
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)




    	
# This step takes a long while
test1_df, test1_lowest = Reg_Perm_Table(p = )





Let's have a look at that table returned:

 AIC AICC BIC HQIC MSE SSE sarima1,0,0,0,0,0,12,c,exo=('CLc1',) -1067.6 -1067.4 -1055.4 -1062.7 6.8E-05 0.01073 sarima1,0,0,0,0,0,12,c,exo=('RMGOVBALA',) -1067 -1066.7 -1054.7 -1062 6.8E-05 0.01078 sarima1,0,0,0,0,0,12,c,exo=('RMUNPTOTP',) -1067.7 -1067.4 -1055.4 -1062.7 6.7E-05 0.01073 sarima1,0,0,0,0,0,12,c,exo=('RMXREUR.',) -1067 -1066.8 -1054.8 -1062 6.8E-05 0.01078 sarima1,0,0,0,0,0,12,c,exo=('RMXRUSD.',) -1069.2 -1068.9 -1056.9 -1064.2 6.7E-05 0.01063 ... ... ... ... ... ... ... sarima1,0,4,4,0,4,12,c,exo=('CLc1', 'RMGOVBALA', 'RMUNPTOTP', 'RMXRUSD.') -1047.2 -1041.8 -988.92 -1023.6 6.3E-05 0.01005 sarima1,0,4,4,0,4,12,c,exo=('CLc1', 'RMGOVBALA', 'RMXREUR.', 'RMXRUSD.') -1047.9 -1042.4 -989.59 -1024.2 6.3E-05 0.01002 sarima1,0,4,4,0,4,12,c,exo=('CLc1', 'RMUNPTOTP', 'RMXREUR.', 'RMXRUSD.') -1057.4 -1051.9 -999.09 -1033.7 0.00006 0.00951 sarima1,0,4,4,0,4,12,c,exo=('RMGOVBALA', 'RMUNPTOTP', 'RMXREUR.', 'RMXRUSD.') -1043.1 -1037.6 -984.75 -1019.4 6.5E-05 0.01029 sarima1,0,4,4,0,4,12,c,exo=('CLc1', 'RMGOVBALA', 'RMUNPTOTP', 'RMXREUR.', 'RMXRUSD.') -1048.2 -1042.1 -986.82 -1023.3 6.2E-05 0.00994

3875 rows × 6 columns

Let's have a look at that table lowest returned:

    	
test1_lowest



 AIC AICC BIC HQIC MSE SSE 1st lowest sarima1,0,1,0,0,0,12,c,exo=('RMUNPTOTP', 'RMXR... sarima1,0,1,0,0,0,12,c,exo=('RMUNPTOTP', 'RMXR... sarima1,0,0,0,0,0,12,c,exo=('RMXRUSD.',) sarima1,0,1,0,0,0,12,c,exo=('RMUNPTOTP',) sarima1,0,4,4,0,0,12,c,exo=('RMUNPTOTP', 'RMXR... sarima1,0,4,4,0,0,12,c,exo=('RMUNPTOTP', 'RMXR... 2nd lowest sarima1,0,4,0,0,2,12,c,exo=('RMUNPTOTP', 'RMXR... sarima1,0,1,0,0,0,12,c,exo=('RMUNPTOTP',) sarima1,0,1,0,0,0,12,c,exo=('RMUNPTOTP',) sarima1,0,1,0,0,0,12,c,exo=('RMXRUSD.',) sarima1,0,4,4,0,0,12,c,exo=('CLc1', 'RMUNPTOTP... sarima1,0,4,4,0,0,12,c,exo=('CLc1', 'RMUNPTOTP... 3rd lowest sarima1,0,4,0,0,0,12,c,exo=('RMUNPTOTP', 'RMXR... sarima1,0,4,0,0,0,12,c,exo=('RMUNPTOTP',) sarima1,0,1,0,0,0,12,c,exo=('RMXRUSD.',) sarima1,0,1,0,0,0,12,c,exo=('CLc1',) sarima1,0,4,4,0,3,12,c,exo=('CLc1', 'RMUNPTOTP... sarima1,0,4,4,0,3,12,c,exo=('CLc1', 'RMUNPTOTP...
##### Pickle

It is quite time consuming to request DSWS for all these codes. Furthermore, the table test1_df is rather large, and we only accounted for p, q, P, Q values up to (but not including) 3. What we could do is save our table in a file outside our code, then delete our table in our code, continue our analysis with test2_df and compare the two afterwards. Let's save our progress this far using Pickle. Since this is a step that can overwrite existing files, I commented them out. Feel free to uncomment them back in if you would like to save your data:

    	
# need to ' pip install pickle-mixin '
import pickle

# To save data out:
pickle_out = open("Regression_Permutation_Table1.pickle","wb") # This creates the '.pickle' file where our data of choice will be saved. ' wb ' stand for 'write bytes'.
pickle.dump((test1_df, test1_lowest), # ' (test1_df, test1_lowest) ' are the two data-frames we choose to save.
pickle_out) # ' pickle_out ' specifies the '.pickle' file in which we want to write (actually: overwrite - everything previously in that file will be overwritten)
pickle_out.close() # We need to close this '.pickle' file; leaving it open could corrupt it.

# # To load data in:
# pickle_in = open("Regression_Permutation_Table2.pickle","rb") # ' rb ' stand for 'read bytes'.
# pickle_in.close() # We ought to close the file we opened to allow any other programs access if they need it.





As aforementioned, let's delete our tables in our code for now - after having saved them outside our code:

    	
del test1_df
del test1_lowest





Now let's continue with our examination:

    	
test2_df, test2_lowest = Reg_Perm_Table(p = )
pickle_out = open("Regression_Permutation_Table2.pickle","wb")
pickle.dump((test2_df, test2_lowest), pickle_out)
pickle_out.close()
del test2_df

test3_df, test3_lowest = Reg_Perm_Table(p = )
pickle_out = open("Regression_Permutation_Table3.pickle","wb")
pickle.dump((test3_df, test3_lowest), pickle_out)
pickle_out.close()
del test3_df

test4_df, test4_lowest = Reg_Perm_Table(p = , try_loop = True) # We tripped on some problems with this one, so used the ' try_loop '.
pickle_out = open("Regression_Permutation_Table4.pickle","wb")
pickle.dump((test4_df, test4_lowest), pickle_out)
pickle_out.close()
del test4_df

test5_df, test5_lowest = Reg_Perm_Table(p = )
pickle_out = open("Regression_Permutation_Table5.pickle","wb")
pickle.dump((test5_df, test5_lowest), pickle_out)
pickle_out.close()
del test5_df




##### pmdarima

If we knew which exogenous variables to use, we could have used pmdarima to find the best model specification too:

    	
import pmdarima # Install via ' pip install pmdarima '



    	
pmdarima.arima.auto_arima(
y = np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:]),
X = None,
d = 1, # The order of first-differencing. If None (by default), the value will automatically be selected based on the results of the test (i.e., either the Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller or the Phillips–Perron test will be conducted to find the most probable value). Must be a positive integer or None. Note that if d is None, the runtime could be significantly longer.
start_p = 0, max_p = 7,
start_q = 0, max_q = 7,
D = None, max_D = 1,
start_P = 0, max_P = 2,
start_Q = 0, max_Q = 2,
m = 12, # The period for seasonal differencing, m refers to the number of periods in each season. For example, m is 4 for quarterly data, 12 for monthly data, or 1 for annual (non-seasonal) data. Default is 1. Note that if m == 1 (i.e., is non-seasonal), seasonal will be set to False. For more information on setting this parameter, see Setting m.
seasonal = True,
stationary = False,
information_criterion = 'aic',
test = 'kpss', # Type of unit root test to use in order to detect stationarity if stationary is False and d is None. Default is ‘kpss’ (Kwiatkowski–Phillips–Schmidt–Shin).
seasonal_test = 'ocsb', # This determines which seasonal unit root test is used if seasonal is True and D is None. Default is ‘OCSB’.
stepwise = True, # Whether to use the stepwise algorithm outlined in Hyndman and Khandakar (2008) to identify the optimal model parameters. The stepwise algorithm can be significantly faster than fitting all (or a random subset of) hyper-parameter combinations and is less likely to over-fit the model.
method = 'lbfgs', # The method determines which solver from scipy.optimize is used
maxiter = 40)





ARIMA(maxiter=40, order=(0, 1, 1), scoring_args={},
seasonal_order=(0, 0, 1, 12), suppress_warnings=True)

Unfortunately, we don't know what permutation of exogenous variables is best, and Information-Criteria-optimisation methods are thus not sufficient.

### Optimal model

Finally, let's compare our two tables to check the lowest MSE and SSE:

    	
for j in range(1,6):
exec(f"pickle_in = open('Regression_Permutation_Table{j}.pickle','rb')") # ' rb ' stand for 'read bytes'.
exec(f"pickle_in.close()") # We ought to close the file we opened to allow any other programs access if they need it.

for j in range(1,6):
# ' list(set( ' is there to return unique elements in case the same model has both the lowest MSE and SSE
#   if you think about it, they should always be the same.
exec(f"display(test{j}_df.loc[list(set([i for i in test{j}_lowest[['MSE', 'SSE']].loc['1st lowest']]))])")




 AIC AICC BIC HQIC MSE SSE sarima1,0,4,4,0,0,12,c,exo=('RMUNPTOTP', 'RMXRUSD.') -1070.6 -1068.1 -1030.7 -1054.4 5.9E-05 0.00946 AIC AICC BIC HQIC MSE SSE sarima2,0,4,1,0,4,12,c,exo=('CLc1', 'RMUNPTOTP', 'RMXRUSD.') -1064.8 -1060.9 -1015.7 -1044.8 5.9E-05 0.00944 AIC AICC BIC HQIC MSE SSE sarima3,0,4,4,0,0,12,c,exo=('CLc1', 'RMUNPTOTP', 'RMXRUSD.') -1073.9 -1070.1 -1024.8 -1054 5.6E-05 0.00896 AIC AICC BIC HQIC MSE SSE sarima4,0,4,4,0,0,12,c,exo=('CLc1', 'RMUNPTOTP', 'RMXRUSD.') -1070 -1065.7 -1017.8 -1048.8 5.7E-05 0.00904 AIC AICC BIC HQIC MSE SSE sarima5,0,4,4,0,0,12,c,exo=('RMUNPTOTP', 'RMXRUSD.') -1072.2 -1067.9 -1020.1 -1051.1 5.6E-05 0.00892

Thus, the best model out of our choices is the 'sarima5,0,4,4,0,0,12,c,exo=('RMUNPTOTP', 'RMXRUSD.')' model (best as in SSE and MSE reducing):

    	
s4arima504c = ARIMA(endog = np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist()),
exog = np.array([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()
for i in ["RMUNPTOTP", "RMXRUSD."]]).T,
order = (5, 0, 4),
trend = "c",
seasonal_order = (4, 0, 0, 12))
s4arima504c_fit = s4arima504c.fit()
s4arima504c_fit.summary()




 SARIMAX Results Dep. Variable: y No. Observations: 240 Model: ARIMA(5, 0, 4)x(4, 0, [], 12) Log Likelihood 892.165 Date: Mon, 19 Apr 2021 AIC -1750.3 Time: 14:33:44 BIC -1691.2 Sample: 0 HQIC -1726.5 -240 Covariance Type: opg coef std err z P>|z| [0.025 0.975] const -0.0013 0.001 -1.947 0.052 -0.003 8.87E-06 x1 -0.0066 0.014 -0.484 0.629 -0.033 0.02 x2 0.0167 0.014 1.188 0.235 -0.011 0.044 ar.L1 0.4023 0.445 0.905 0.366 -0.469 1.274 ar.L2 0.264 0.596 0.443 0.658 -0.904 1.432 ar.L3 0.1468 0.31 0.473 0.636 -0.461 0.755 ar.L4 -0.1802 0.435 -0.415 0.678 -1.032 0.672 ar.L5 0.2342 0.103 2.266 0.023 0.032 0.437 ma.L1 -0.2364 0.446 -0.53 0.596 -1.11 0.637 ma.L2 -0.2758 0.555 -0.497 0.619 -1.363 0.812 ma.L3 -0.0235 0.327 -0.072 0.943 -0.664 0.617 ma.L4 0.1263 0.429 0.294 0.768 -0.714 0.967 ar.S.L12 -0.8138 0.08 -10.118 0 -0.971 -0.656 ar.S.L24 -0.6277 0.101 -6.204 0 -0.826 -0.429 ar.S.L36 -0.3844 0.103 -3.746 0 -0.586 -0.183 ar.S.L48 -0.1285 0.085 -1.517 0.129 -0.295 0.038 sigma2 3.25E-05 2.55E-06 12.775 0 2.75E-05 3.75E-05 Ljung-Box (L1) (Q): 0.07 Jarque-Bera (JB): 166.83 Prob(Q): 0.8 Prob(JB): 0 Heteroskedasticity (H): 0.54 Skew: -0.63 Prob(H) (two-sided): 0.01 Kurtosis: 6.88

Warnings:
 Covariance matrix calculated using the outer product of gradients (complex-step).

Looking at P-Values, we can see that the only coefficients with statistical significance are ar.L5 and ar.S.L12 to ar.S.L36.

#### Fitted values of our 'Optimal Model'

    	
s4arima504c_fit_plot_dic = {
'residuals' : s4arima504c_fit.resid,
'fitted values' : s4arima504c_fit.fittedvalues,
'endogenous variable (RMCPANNL)' : np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist())}

exogenous_variables = ["RMUNPTOTP", "RMXRUSD."]
for j,k in zip([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()
for i in exogenous_variables],
exogenous_variables):
s4arima504c_fit_plot_dic[k] = np.array(j).T

s4arima504c_fit_plot_df = pd.DataFrame(data = s4arima504c_fit_plot_dic,
index = df.dropna()["RMCPANNL"]["yoy1d"][1:].index)

s4arima504c_fit_plot_df.iplot(theme = "solar",
title = "s4arima504c model",
xTitle = "date")





Now we can simply apply our model but train it on a smaller dataset (say 80% of our data, i.e.: approx. 200 first observations) to simulate an out-of-sample test:

    	
s4arima504c_osaoosf = []

for t in range(200, len(df.dropna()["RMCPANNL"]["yoy1d"][1:])):
if t%10 == 0 : print(str(t)) # This line is to see the progression of our code.
s4arima504c = ARIMA(
endog = np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist())[0:t-1],
exog = np.array([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()[0:t-1]
for i in ["RMUNPTOTP", "RMXRUSD."]]).T,
order = (5, 0, 4),
trend = "c",
seasonal_order = (4, 0, 0, 12))

s4arima504c_isf = s4arima504c.fit() # 'isf' for in-sample-fit

s4arima504c_osaoosf.append( # 'osaoosf' as in One-Step-Ahead Out-of-Sample Forecasts
s4arima504c_isf.forecast(
exog = np.array([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()[t]
for i in ["RMUNPTOTP", "RMXRUSD."]]).T))





200
210
220
230

Let's save our data in s4arima504c_osaoosf:

    	
# saving data into a pickle file::
pickle_out = open("s4arima504c_osaoosf.pickle","wb")
pickle.dump(s4arima504c_osaoosf, pickle_out)
pickle_out.close()

# pickle_in = open('s4arima504c_osaoosf.pickle','rb') # ' rb ' stand for 'read bytes'.
# pickle_in.close() # We ought to close the file we opened to allow any other programs access if they need it.




    	
RMCPANNL_array_wna = np.append( # 'wna' as in 'with na' value added at the end.
np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist())[200 + 1:],
[np.nan])




    	
from datetime import datetime



    	
# add a month to the index:
index_in_qu = df.dropna()["RMCPANNL"]["yoy1d"][1:][200:].index
index = [i for i in index_in_qu][1:] + [
str(datetime.strptime(str(pd.to_datetime(
[str(i) for i in  index_in_qu])[-1].replace(
month = pd.to_datetime(str(df.index[-1])).month+1)),
'%Y-%m-%d %H:%M:%S'))[:10]]




    	
s4arima504c_osaoosf_plot_dic = {
'Recursive One-Step-Ahead Out-of-Sample Forecasts' : np.array(s4arima504c_osaoosf),
'endogenous variable (RMCPANNL)' : RMCPANNL_array_wna}

exogenous_variables = ["RMUNPTOTP", "RMXRUSD."]
for j,k in zip([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()[200:]
for i in exogenous_variables],
exogenous_variables):
s4arima504c_osaoosf_plot_dic[k] = np.array(j).T

s4arima504c_osaoosf_plot_df = pd.DataFrame(
data = s4arima504c_osaoosf_plot_dic,
index = index)




    	
s4arima504c_osaoosf_plot_df.iplot(
theme = "solar", xTitle = "date",
title = "s4arima504c Recursive One-Step-Ahead Out-of-Sample Forecasts")




    	
df2 = df.copy()



    	
s4arima504c_osaoosf_plot_df.tail()



 Recursive One-Step-Ahead Out-of-Sample Forecasts endogenous variable (RMCPANNL) RMUNPTOTP RMXRUSD. 15/11/2020 0.00258 -0.001 0.01044 -0.0042 15/12/2020 0.00089 -0.0008 0.00959 0.00465 15/01/2021 -0.0003 0.0093 0.0163 -0.0069 15/02/2021 -0.0039 0.0017 0.02746 -0.0257 15/03/2021 -0.0025 NaN -0.0195 -0.0022
    	
df2["RMCPANNL", "s4arima504c Recursive OSAOOSF"] = s4arima504c_osaoosf_plot_df["Recursive One-Step-Ahead Out-of-Sample Forecasts"]

df2 = df2.sort_index(axis=1) # This rearranges the columns and merges Fields under the same Instruments.

df2




 Instrument CLc1 RMCPANNL RMGOVBALA RMUNPTOTP RMXREUR. RMXRUSD. Field Close Price Close Price 30D Moving Average yoy yoy1d s4arima504c Recursive OSAOOSF yoy yoy1d raw yoy yoy1d raw yoy yoy1d raw yoy yoy1d raw yoy yoy1d 15/01/2001 30.05 28.535 0.09981 NaN NaN 0.399 NaN -306.1 0.88022 NaN 1032.9 -0.1209 NaN 2.4646 0.32249 NaN 2.62 0.42391 NaN 15/02/2001 28.8 29.8113 0.07816 -0.0217 NaN 0.4 0.001 -601.2 0.30158 -0.5786 1032.3 -0.1373 -0.0164 2.4729 0.34244 0.01994 2.68 0.43316 0.00924 15/03/2001 26.58 28.976 -0.0423 -0.1205 NaN 0.403 0.003 -865.2 0.04746 -0.2541 992.8 -0.1491 -0.0117 2.4849 0.34044 -0.002 2.73 0.42188 -0.0113 15/04/2001 28.2 27.25 -0.0318 0.01048 NaN 0.375 -0.028 -1087.5 -0.1019 -0.1494 948.4 -0.1675 -0.0184 2.488 0.32956 -0.0109 2.79 0.40909 -0.0128 15/05/2001 28.98 27.9327 0.0575 0.08932 NaN 0.374 -0.001 -1404.5 0.02683 0.12874 890.8 -0.1883 -0.0208 2.491 0.34598 0.01642 2.85 0.39706 -0.012 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 15/10/2020 40.92 39.4793 -0.2909 -0.0379 -0.0019 0.0224 -0.0021 -76512 1.81579 -0.0034 285.7 0.10437 0.00959 4.8733 0.02514 -0.0004 4.14 -0.0372 0.00465 15/11/2020 40.12 39.5773 -0.2836 0.00733 0.00258 0.0214 -0.001 -85982 1.47431 -0.3415 290.7 0.12066 0.0163 4.8699 0.02131 -0.0038 4.12 -0.0441 -0.0069 15/12/2020 47.59 43.203 -0.2489 0.03465 0.00089 0.0206 -0.0008 -105907 1.20275 -0.2716 296.1 0.14812 0.02746 4.8707 0.01955 -0.0018 4 -0.0698 -0.0257 15/01/2021 52.04 48.7013 -0.1892 0.05969 -0.0003 0.0299 0.0093 -112.3 -0.8426 -2.0454 292.2 0.12862 -0.0195 4.8728 0.01973 0.00018 4 -0.0719 -0.0022 15/02/2021 59.73 53.606 -0.0281 0.16113 -0.0039 0.0316 0.0017 -8430.8 -0.199 0.64362 293.5 0.14336 0.01474 4.8741 0.01909 -0.0006 4.03 -0.0799 -0.008

242 rows × 19 columns

As you can see, we are missing the last row of data in df2. We can remedy this by appending it with the last row of s4arima504c_osaoosf_plot_df:

    	
df3 = df2.copy() # just to delimitate from before and after this step.

index_of_interest = s4arima504c_osaoosf_plot_df.tail(1).index
df3 = df3.append(pd.Series(name = index_of_interest))

df3['RMCPANNL', 's4arima504c Recursive OSAOOSF'].loc[index_of_interest] = s4arima504c_osaoosf_plot_df["Recursive One-Step-Ahead Out-of-Sample Forecasts"][-1]

df3




 Instrument CLc1 RMCPANNL RMGOVBALA RMUNPTOTP RMXREUR. RMXRUSD. Field Close Price Close Price 30D Moving Average yoy yoy1d s4arima504c Recursive OSAOOSF yoy yoy1d raw yoy yoy1d raw yoy yoy1d raw yoy yoy1d raw yoy yoy1d 15/01/2001 30.05 28.535 0.09981 NaN NaN 0.399 NaN -306.1 0.88022 NaN 1032.9 -0.1209 NaN 2.4646 0.32249 NaN 2.62 0.42391 NaN 15/02/2001 28.8 29.8113 0.07816 -0.0217 NaN 0.4 0.001 -601.2 0.30158 -0.5786 1032.3 -0.1373 -0.0164 2.4729 0.34244 0.01994 2.68 0.43316 0.00924 15/03/2001 26.58 28.976 -0.0423 -0.1205 NaN 0.403 0.003 -865.2 0.04746 -0.2541 992.8 -0.1491 -0.0117 2.4849 0.34044 -0.002 2.73 0.42188 -0.0113 15/04/2001 28.2 27.25 -0.0318 0.01048 NaN 0.375 -0.028 -1087.5 -0.1019 -0.1494 948.4 -0.1675 -0.0184 2.488 0.32956 -0.0109 2.79 0.40909 -0.0128 15/05/2001 28.98 27.9327 0.0575 0.08932 NaN 0.374 -0.001 -1404.5 0.02683 0.12874 890.8 -0.1883 -0.0208 2.491 0.34598 0.01642 2.85 0.39706 -0.012 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 15/11/2020 40.12 39.5773 -0.2836 0.00733 0.00258 0.0214 -0.001 -85982 1.47431 -0.3415 290.7 0.12066 0.0163 4.8699 0.02131 -0.0038 4.12 -0.0441 -0.0069 15/12/2020 47.59 43.203 -0.2489 0.03465 0.00089 0.0206 -0.0008 -105907 1.20275 -0.2716 296.1 0.14812 0.02746 4.8707 0.01955 -0.0018 4 -0.0698 -0.0257 15/01/2021 52.04 48.7013 -0.1892 0.05969 -0.0003 0.0299 0.0093 -112.3 -0.8426 -2.0454 292.2 0.12862 -0.0195 4.8728 0.01973 0.00018 4 -0.0719 -0.0022 15/02/2021 59.73 53.606 -0.0281 0.16113 -0.0039 0.0316 0.0017 -8430.8 -0.199 0.64362 293.5 0.14336 0.01474 4.8741 0.01909 -0.0006 4.03 -0.0799 -0.008 15/03/2021 NaN NaN NaN NaN -0.0025 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

243 rows × 19 columns

### Corresponding Month Model

Many in the industry only focus on the same month of previous years in their predicting models (e.g.: only predict this year's March inflation value with previous March values, similarly so with each month of the year). There is some merit to the thought, as other datapoints may sway our data with irrelevant variables. The reason is that demand/supply profiles might be different each month of the day and it would be incorrect to compare months in a continuous series (i.e.: January to February et cetera (etc.)). To do so, we can exploit the SARIMA model and only use Seasonal AutoRegressive variables (in addition to our exogenous variables):

1st: Let's check the best SAMIMA model with ARMA(0,0) imposed:

    	
test0_df, test0_lowest = Reg_Perm_Table(p = )

# To save data out:
pickle_out = open("Regression_Permutation_Table0.pickle","wb")
pickle.dump((test0_df, test0_lowest), pickle_out)
pickle_out.close()

# pickle_in = open("Regression_Permutation_Table0.pickle","rb") # ' rb ' stand for 'read bytes'.
# pickle_in.close() # We ought to close the file we opened to allow any other programs access if they need it.

test0_df[test0_df.index.str.contains("sarima0,0,0")]




 AIC AICC BIC HQIC MSE SSE sarima0,0,0,0,0,0,12,c,exo=('CLc1',) -1656.6 -1656.5 -1646.2 -1652.4 5.6E-05 0.01333 sarima0,0,0,0,0,0,12,c,exo=('RMGOVBALA',) -1656.3 -1656.2 -1645.8 -1652.1 5.6E-05 0.01335 sarima0,0,0,0,0,0,12,c,exo=('RMUNPTOTP',) -1657.4 -1657.2 -1646.9 -1653.1 5.6E-05 0.01329 sarima0,0,0,0,0,0,12,c,exo=('RMXREUR.',) -1665 -1664.9 -1654.6 -1660.8 5.4E-05 0.01287 sarima0,0,0,0,0,0,12,c,exo=('RMXRUSD.',) -1660.1 -1660 -1649.7 -1655.9 5.5E-05 0.01313 ... ... ... ... ... ... ... sarima0,0,0,4,0,4,12,c,exo=('CLc1', 'RMGOVBALA', 'RMUNPTOTP', 'RMXRUSD.') -1625.7 -1623.8 -1577 -1606.1 5.4E-05 0.01283 sarima0,0,0,4,0,4,12,c,exo=('CLc1', 'RMGOVBALA', 'RMXREUR.', 'RMXRUSD.') -1632.4 -1630.5 -1583.7 -1612.7 5.2E-05 0.01249 sarima0,0,0,4,0,4,12,c,exo=('CLc1', 'RMUNPTOTP', 'RMXREUR.', 'RMXRUSD.') -1669.6 -1667.7 -1620.9 -1650 0.00005 0.01196 sarima0,0,0,4,0,4,12,c,exo=('RMGOVBALA', 'RMUNPTOTP', 'RMXREUR.', 'RMXRUSD.') -1636.1 -1634.2 -1587.4 -1616.5 5.2E-05 0.0124 sarima0,0,0,4,0,4,12,c,exo=('CLc1', 'RMGOVBALA', 'RMUNPTOTP', 'RMXREUR.', 'RMXRUSD.') -1633.8 -1631.6 -1581.6 -1612.8 5.2E-05 0.01241

775 rows × 6 columns

    	
test0_df[test0_df.index.str.contains("sarima0,0,0")].nsmallest(1, 'SSE').index




Index(['sarima0,0,0,3,0,0,12,c,exo=('RMUNPTOTP', 'RMXREUR.', 'RMXRUSD.')'], dtype='object')

Note that our Reg_Perm_Table function's default Seasonal parameters (P and Q) range from 0 to 5 and that the optimum (i.e.: lowest SSE producing one) is lower than 5; this would indicate that we may very well have our optimal SAMIMA model with ARMA(0,0) imposed

2nd: Fit the sarima0,0,0,3,0,0,12 model:

    	
exogenous_variables = ['RMUNPTOTP', 'RMXREUR.', 'RMXRUSD.']

s3arima000c = ARIMA(
endog = np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist())[0:200],
exog = np.array([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()[0:200]
for i in exogenous_variables]).T,
order = (0, 0, 0),
trend = "c",
seasonal_order = (3, 0, 0, 12))
s3arima000c_isf = s3arima000c.fit() # 'isf' for in-sample-fit

s3arima000c_isf_plot_dic = {
'residuals' : s3arima000c_isf.resid,
'fitted values' : s3arima000c_isf.fittedvalues,
'endogenous variable (RMCPANNL)' : np.array(
df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist())[0:200]}

for j,k in zip([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()[0:200]
for i in exogenous_variables],
exogenous_variables):
s3arima000c_isf_plot_dic[k] = np.array(j).T

s3arima000c_isf_plot_df = pd.DataFrame(data = s3arima000c_isf_plot_dic,
index = df.dropna()["RMCPANNL"]["yoy1d"][1:][0:200].index)

s3arima000c_isf_plot_df.iplot(theme = "solar",
title = "s3arima000c in-sample-forecasts model fit",
xTitle = "date")





    	
s3arima000c_osaoosf = []

for t in range(200, len(df.dropna()["RMCPANNL"]["yoy1d"][1:])):
if t%10 == 0 : print("Our code has created one-step-ahead out-of-sample forecasts from 200 to " str(t)) # This line is to see the progression of our code.
s3arima000c = ARIMA(
endog = np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist())[0:t-1],
exog = np.array([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()[0:t-1]
for i in exogenous_variables]).T,
order = (0, 0, 0),
trend = "c",
seasonal_order = (3, 0, 0, 12))

s3arima000c_isf = s3arima000c.fit() # 'isf' for in-sample-fit

s3arima000c_osaoosf.append( # 'osaoosf' as in One-Step-Ahead Out-of-Sample Forecasts
s3arima000c_isf.forecast(
exog = np.array([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()[t]
for i in exogenous_variables]).T))





200
210
220
230

Let's save our data in s3arima000c_osaoosf:

    	
# saving data into a pickle file::
pickle_out = open("s3arima000c_osaoosf.pickle","wb")
pickle.dump(s3arima000c_osaoosf, pickle_out)
pickle_out.close()

# pickle_in = open('s3arima000c_osaoosf.pickle','rb') # ' rb ' stand for 'read bytes'.
# pickle_in.close() # We ought to close the file we opened to allow any other programs access if they need it.




    	
# add a month to the index:
index_in_qu = df.dropna()["RMCPANNL"]["yoy1d"][1:][200:].index
index = [i for i in index_in_qu][1:] + [
str(datetime.strptime(str(pd.to_datetime(
[str(i) for i in  index_in_qu])[-1].replace(
month = pd.to_datetime(str(df.index[-1])).month+1)),
'%Y-%m-%d %H:%M:%S'))[:10]]




    	
s3arima000c_osaoosf_plot_dic = {
'Recursive One-Step-Ahead Out-of-Sample Forecasts' : np.array(s3arima000c_osaoosf),
'endogenous variable (RMCPANNL)' : np.append(
np.array(df.dropna()["RMCPANNL"]["yoy1d"][1:].values.tolist())[200 + 1:],
[np.nan])}

for j,k in zip([df.dropna()[i]["yoy1d"].shift(1)[1:].values.tolist()[200:]
for i in exogenous_variables],
exogenous_variables):
s3arima000c_osaoosf_plot_dic[k] = np.array(j).T

s3arima000c_osaoosf_plot_df = pd.DataFrame(
data = s3arima000c_osaoosf_plot_dic,
index = index)




    	
s3arima000c_osaoosf_plot_df.iplot(
theme = "solar", xTitle = "date",
title = "s3arima000c Recursive One-Step-Ahead Out-of-Sample Forecasts")





### Comparing Models

    	
for i,j in zip([s3arima000c_osaoosf_plot_df], [s4arima504c_osaoosf_plot_df]):
s3arima000c_resid = i["endogenous variable (RMCPANNL)"] - i["Recursive One-Step-Ahead Out-of-Sample Forecasts"][:-1]
s4arima504c_resid = j["endogenous variable (RMCPANNL)"] - j["Recursive One-Step-Ahead Out-of-Sample Forecasts"][:-1]
s3arima000c_sse = (s3arima000c_resid**2).sum()
s4arima504c_sse = (s4arima504c_resid**2).sum()




    	
print(s3arima000c_sse, s4arima504c_sse)




0.0013175028279920704 0.0012139805753142835

From the two SSE values above, we can determine that our s4arima504c model is the better of the two.

## Conclusion

The fact that a model with seasonal components (SARIMA(5,0,4)(4,0,0)1212) was the optimal one (even before we investigate Corresponding Month Models) suggests that there was a lot of information in corresponding months in previous years. As a matter of fact, even the directly previous months' coefficients seemed of little relevance. Indeed, s4arima504c_fit P-values of these cofactors from lag 1 to 4 (ar.L1 to ar.L4) never went below 0.1; i.e.: they were never statistically significant. The 5th lag however (ar.L5, with value 0.2342) had a P-value lower than 0.05, suggesting it to be statistically significant at the 5% Confidence Level (CL). Moving Average elements (ma.L1 to ma.L4) never showed statistical significance (as per their P-values). (Statistical significance is only really considered at the 10% CL, when the P-value is lower than 0.1.)

Seasonal coefficients 1 to 3 (ar.S.L12 to ar.S.L36) proved significant at the 5% CL. The 4th (ar.S.L48) did not.

It would be interesting to look at a non-consecutive lags model with only ar.L5 & ar.S.L12 to ar.S.L36 and investigate it's out-of-sample one-step-ahead forecasts as a back-test.

Additionally, what make this model interesting is that one can use it for more than one-step-ahead forecasts depending on the frequency of exogenous data provision. Say that the exogenous RMXRUSD variable is only published on the 30th of the month (showing data for the 15th of that month); than one can use data published a month prior to forecast two-steps-ahead, alleviating this issue - although not solving it per se (by itself).

Romanian inflation is a particular case study. High inflation levels a the start of our data-set might have skewed results significantly. A look at models such as the ones above in predicting other countries' inflation levels could be more fruitful. Alternatively, looking into more differenced data could be equally interesting. (Indeed, using Stationarity_Table with more differenced data seem to yield more stationary data.)