Overview
This is a quick guide for those interested in using the large and varied set of economic timeseries offered within Eikon via the Data API. In this article I will show what economic data is available in Eikon, how to navigate and download it, and finally how to put it to use in a simple machine learning example using an XGBoost model.
Sections
Eikon Search Workflow
Real-time Economic Indicators
Non Real-time Economic Indicators
Putting it all together - a basic machine learning sample
Adding Target Variables or Labels
Define and fit XGBoost model
Conclusion
Pre-requisites:
Refinitiv Eikon / Workspace with access to Eikon Data APIs (Free Trial Available)
Python 2.x/3.x
Required Python Packages: eikon, pandas, numpy, xgboost, scikit-learn
First of all, Eikon carries around 1,800 realtime economic indicators from around the globe. These are economic indicators where we track the releases in realtime over our realtime network and provide additional value-added information such as Reuters Polls, as well as other useful details such as the significance of each indicator. Additionally, Eikon carries around 400,000 non-realtime economic indicators globally. In the near future, the number of non-realtime indicators will more than double to 1 million timeseries. This is a great deal of economic timeseries for you to potentially work with to generate insight and models. This guide is to help developers and quant analysts navigate this datascape.
A popular way to navigate economic data used to be the speed guide in Eikon. However, this has been removed in favour of an Eikon Search based workflow. Further, support for the realtime economic chain RICs such as <0#ECONALLUS>, <0#ECONALLJP> etc has been withdrawn - so whilst these chain RICs still work, they are no longer being maintained.
Eikon Search Workflow
I mentioned above that the Eikon Search workflow is now the main way to discover what economic indicator data is available. So let's see how this works. Try typing 'UK Economic Indicators' into the Eikon search bar and hitting 'more search results' at the bottom. Please see the result below.
From this, if we look at the Economic Indicators section, we can see that over 4,000 items are returned. So let's click on that to discover more.
We are now at the main filter page, where we can filter on the 10 columns available. The most useful will probably be source, periodicity and, perhaps most importantly, Indicator Type. Please see the types listed below:
Multiple Economic Indicator Data Sources
Types of Economic Indicator | Types of Economic Indicator | Types of Economic Indicator |
---|---|---|
Business Surveys | Industrial Production & Utilisation | Population |
Imports & Exports | Energy & Environment | Consumer Prices / Inflation |
GDP/GVA by Industry | Workforce & Unemployment | Domestic Finance |
Automobiles & Transport | Macro Forecasts | Balance of Payments |
GDP by Expenditure | Sales, Orders & Inventories | Stocks, Bonds, Futures |
Banking | Agriculture | Consumer Surveys |
Money Supply | Exchange Rates and Operations | Other Commodities |
Housing & Real Estate Prices | Housing & Construction | Consumption |
International Reserves | Producer Prices | Employment & Hours |
External Debt | Investment / Capital Formation | Government Accounts |
Retail Sales | Financial & Flow of Funds Accounts | Other Industries |
Interest Rates | Foreign Transactions | Incomes & Savings |
Import & Export Price Indices | Wages & Earnings | Government Debt & Borrowing |
GDP Deflators | Other National Accounts | International Investment Position |
Cyclical & Activity Indices | Commodity Fundamentals | Productivity & Labour Costs |
Consumer Finance | Central Banking | Corporate Accounts & Actions |
Sector Accounts | Other Prices | Personal Expenditures |
Other Surveys | Commodity Prices & Indices | Industry Balance Sheets |
Regional Accounts | Credit Risk & Ratings |
Once we narrow down what we are targeting, we can go to the page view and download a list of appropriate RICs to Excel. From there we can copy and paste these as an instrument list for use in our API call.
import eikon as ek
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error
ek.set_app_key('YOUR APP KEY HERE')  # replace with your own Eikon Data API app key
Real-time Economic Indicators
As mentioned earlier, Eikon carries about 1,800 real-time economic indicators. Let's see what we are talking about by looking at the Economic Monitor app (type EM into Eikon Search) for, say, the US (filter on US). Notice here we can see the recent and upcoming economic indicators for the next week or so with a date and time, the importance of the release, and polls to see what expectations are, together with poll details such as min, max and median.
For realtime indicators, let's make a simple call to the Data API get_data service. Here we are using the no longer maintained economic chain RIC '0#ECONALLUS'. These still work for backward compatibility but are no longer maintained from a content perspective. This gives us a programmatic version of the Economic Monitor app above. We are using the realtime quote model for economic indicators here - you can see this by typing a realtime RIC, say 'USADP=ECI' (US ADP payrolls), and hitting quote. If you hover the mouse over a particular number you will get the field name.
data, err = ek.get_data(['0#ECONALLUS'],['DSPLY_NMLL','RELEVANCE','ACT_REL_ST','ECI_ACT_DT','ACT_VAL_NS','UNIT_PREFX','FCAST_PRD','RPT_UNITS','ECON_ACT','ECON_PRIOR','RTR_POLL', 'FCAST_SEST','FCAST_ACRY','NDOR_1'])
data = data[data['ECI_ACT_DT']>='2020-06-09'] #from today forward (note DateTimes are in GMT!)
data.sort_values(['ECI_ACT_DT','ACT_VAL_NS'],ascending=True,inplace=True) # Sort on release date & Time
data
 | Instrument | DSPLY_NMLL | RELEVANCE | ACT_REL_ST | ECI_ACT_DT | ACT_VAL_NS | UNIT_PREFX | FCAST_PRD | RPT_UNITS | ECON_ACT | ECON_PRIOR | RTR_POLL | FCAST_SEST | FCAST_ACRY | NDOR_1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58 | USFFP1=ECI | FFR Projection-1st Yr* | 0.0 | CONF | 2020-06-10 | 18:00:00 | | Q2 2020 | Percent | 0.100 | 1.600 | NaN | NaN | NaN | |
59 | USFFP2=ECI | FFR Projection-2nd Yr* | 0.0 | CONF | 2020-06-10 | 18:00:00 | | Q2 2020 | Percent | 0.100 | 1.900 | NaN | NaN | NaN | |
61 | USFFPL=ECI | FFR Projection-Longer* | 0.0 | CONF | 2020-06-10 | 18:00:00 | | Q2 2020 | Percent | 2.500 | 2.500 | NaN | NaN | NaN | |
62 | USFFPQ=ECI | FFR Projection-Current* | 0.0 | CONF | 2020-06-10 | 18:00:00 | | Q2 2020 | Percent | 0.100 | 1.600 | NaN | NaN | NaN | |
63 | USFOMC=ECI | Fed Funds Target Rate | 3.0 | CONF | 2020-06-10 | 18:00:00 | None | 11 Jun | Percent | 0.125 | 0.125 | NaN | NaN | NaN | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
186 | USUM5P=ECI | U Mich 5-Yr Inf Prelim* | 1.0 | CONF | 2020-07-17 | 14:00:00 | | Jul 2020 | Percent | NaN | 2.500 | NaN | NaN | NaN | |
188 | USUMCP=ECI | U Mich Conditions Prelim | 1.0 | CONF | 2020-07-17 | 14:00:00 | | Jul 2020 | Index | NaN | 87.100 | NaN | NaN | NaN | |
189 | USUMEP=ECI | U Mich Expectations Prelim | 1.0 | CONF | 2020-07-17 | 14:00:00 | | Jul 2020 | Index | NaN | 72.300 | NaN | NaN | NaN | |
192 | USUMSP=ECI | U Mich Sentiment Prelim | 3.0 | CONF | 2020-07-17 | 14:00:00 | | Jul 2020 | Index | NaN | 78.100 | NaN | NaN | NaN | |
14 | USCFNA=ECI | National Activity Index | 1.0 | CONF | 2020-07-21 | 12:30:00 | None | Jun 2020 | Index | NaN | 2.610 | NaN | NaN | NaN | |
Real-time indicator retrieval from the timeseries service is different from other RIC instruments in that we DO NOT request field types (e.g. Open, High, Low, Close) and we set the interval to monthly. Let's see how it works using the ADP payroll number for the last 10 years of monthly releases. Note that I have also added 3 other RICs that are derived from the same underlying RIC. These are the poll values associated with the monthly release (hence the lowercase p preceding the RIC) and then we have Median, Low & High values (the =M, =L and =H respectively succeeding the RIC).
df = ek.get_timeseries(['USADP=ECI','pUSADP=M','pUSADP=L','pUSADP=H'],start_date='2010-03-01',end_date='2020-04-26',interval='monthly')
df
VALUE | USADP=ECI | pUSADP=M | pUSADP=L | pUSADP=H |
---|---|---|---|---|
Date | ||||
2010-01-31 | -30000.0 | -30000.0 | -110000.0 | 50000.0 |
2010-02-28 | -38000.0 | -20000.0 | -100000.0 | 25000.0 |
2010-03-31 | -11000.0 | 40000.0 | -20000.0 | 180000.0 |
2010-04-30 | 132000.0 | 30000.0 | 0.0 | 80000.0 |
2010-05-31 | 211000.0 | 60000.0 | 32000.0 | 180000.0 |
... | ... | ... | ... | ... |
2019-11-30 | 161000.0 | 140000.0 | 120000.0 | 188000.0 |
2019-12-31 | 167000.0 | 160000.0 | 77000.0 | 225000.0 |
2020-01-31 | 205000.0 | 156000.0 | 140000.0 | 215000.0 |
2020-02-29 | 147000.0 | 170000.0 | 140000.0 | 250000.0 |
2020-03-31 | -302000.0 | -150000.0 | -1500000.0 | 19000.0 |
123 rows × 4 columns
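As a quick aside, one common use of this combination of actual and poll data is to compute the 'economic surprise' - the released figure minus the consensus median. Here is a minimal sketch using the df we just retrieved (an illustration, not part of the core workflow):
surprise = df['USADP=ECI'] - df['pUSADP=M']  # actual release minus the poll median
beat_rate = (surprise > 0).mean()  # fraction of months where ADP beat the median poll
print(surprise.tail(), beat_rate)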
Non Real-time Economic Indicators
As mentioned earlier, Eikon carries around 400,000 non-realtime economic indicators covering the whole host of indicator types listed above. Once you have found the list that you wish to download, just copy and paste those RICs into a historical timeseries call - this time for monthly German new car production and registrations:
df=ek.get_timeseries(['aDEPROCARP','aDEACECARP','aDECOMPLIDX/A'],start_date='2010-03-01',end_date='2020-04-26',interval='monthly')
df
VALUE | aDEPROCARP | aDEACECARP | aDECOMPLIDX/A |
---|---|---|---|
Date | |||
2010-01-31 | 376937.0 | 181189.0 | 100.40760 |
2010-02-28 | 450355.0 | 194846.0 | 100.70590 |
2010-03-31 | 560055.0 | 294374.0 | 100.99160 |
2010-04-30 | 464591.0 | 259414.0 | 101.25310 |
2010-05-31 | 470130.0 | 249708.0 | 101.47920 |
... | ... | ... | ... |
2019-11-30 | 423400.0 | 299127.0 | 99.12325 |
2019-12-31 | 276400.0 | 283380.0 | 99.23273 |
2020-01-31 | 341600.0 | 246300.0 | 99.34171 |
2020-02-29 | 388900.0 | 239943.0 | 99.42706 |
2020-03-31 | 287900.0 | 215119.0 | 96.90305 |
123 rows × 3 columns
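Note that indicators like these arrive as raw levels. Depending on your model, you may prefer growth rates - the sketch below shows a simple year-on-year transform (an optional step, not required for what follows):
yoy = df.pct_change(periods=12) * 100  # year-on-year % change for monthly data
yoy.tail()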
Putting it all together - a basic machine learning sample
So given our newfound economic indicator discovery workflow, let's try to put this data to work and see if we can come up with something of interest. In this case, let's try to download all available auto industry indicators for all countries in the world and see if they can help explain movements in some global automakers. To do this we will use an XGBoost model, which is a gradient-boosted decision-tree class of algorithm.
From the Economic Search app I have identified a list of global economic indicators from the automobile industry and have downloaded these RICs to an Excel file - which I can now read into pandas and use in my get_timeseries call.
riclist = pd.read_excel('/Users/jasonram/Downloads/autoric1.xlsx') #or the location of your saved Excel file
allRics = riclist['RIClist'].astype(str).values.tolist()
allRics
['IDMBKY=ECI',
'IDCARY=ECI',
'USCARS=ECI',
'ESCARM=ECI',
'ESCARY=ECI',
'ITCARM=ECI',
...
'aJPPNISS',
'aCATRKSBSP',
'aUSAUIMMXP',
'aGBACECARP',
'aLUNCAR',
'aISNVEHREG/A',
'aROACECARP']
start = '2010-03-01'
end = '2020-04-26'
df = pd.DataFrame()
for r in allRics:
    try:
        ts = ek.get_timeseries(r, start_date=start, end_date=end, interval='monthly')
        ts.rename(columns={'VALUE': r}, inplace=True)
        if len(ts):  # only concatenate RICs that actually returned data
            df = pd.concat([df, ts], axis=1)
    except Exception:
        pass  # skip RICs for which no timeseries history is available
df
IDMBKY=ECI | IDCARY=ECI | USCARS=ECI | ESCARM=ECI | ESCARY=ECI | ITCARM=ECI | DECARM=ECI | DECARY=ECI | GBCARY=ECI | ZAVEHY=ECI | ... | aLUNHTREG | aLTNHTREG | aSGMANPOP | aJPPNISS | aCATRKSBSP | aUSAUIMMXP | aGBACECARP | aLUNCAR | aISNVEHREG/A | aROACECARP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date | |||||||||||||||||||||
2010-03-31 | 39.6 | 92.1 | 5630000.0 | 36.7 | 63.1 | 28.5 | 51.1 | -26.6 | 26.6 | 24.7 | ... | 77.0 | 43.0 | 933225.0 | 99903.0 | 77800.0 | 93816.0 | 397383.0 | 3231.0 | 224.010250 | |
2010-04-30 | 70.0 | 88.5 | 5480000.0 | -24.9 | 39.3 | -37.9 | -11.9 | -31.7 | 11.5 | 38.8 | ... | 54.0 | 86.0 | 935024.0 | 86180.0 | 81675.0 | 75715.0 | 148793.0 | 3307.0 | 126.812220 | |
2010-05-31 | 39.9 | 69.0 | 5720000.0 | 9.9 | 44.6 | 2.4 | -3.7 | -35.1 | 13.5 | 40.5 | ... | 42.0 | 85.0 | 936311.0 | 79430.0 | 82402.0 | 73661.0 | 153095.0 | 2947.0 | 238.749611 | |
2010-06-30 | 34.7 | 77.9 | 5640000.0 | 18.1 | 25.6 | 4.3 | 15.8 | -32.3 | 10.8 | 19.7 | ... | 61.0 | 74.0 | 937860.0 | 100471.0 | 84009.0 | 79809.0 | 195226.0 | 2792.0 | 525.231951 | |
2010-07-31 | 28.4 | 72.0 | 5740000.0 | -32.4 | -24.1 | -10.4 | -17.9 | -30.2 | -13.2 | 23.4 | ... | 76.0 | 139.0 | 938922.0 | 96956.0 | 83218.0 | 76143.0 | 136446.0 | 2312.0 | 472.853610 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2019-11-30 | -8.3 | -9.4 | 4380000.0 | -0.9 | 2.3 | -4.0 | 5.1 | 9.7 | -1.3 | -5.8 | ... | NaN | NaN | 972224.0 | 65092.0 | 114988.0 | 75933.0 | 156621.0 | 1511.0 | 1604.147896 | |
2019-12-31 | -6.8 | -0.2 | 4490000.0 | 13.6 | 6.6 | -7.0 | -5.3 | 19.5 | 3.4 | 4.2 | ... | NaN | NaN | 973101.0 | 61201.0 | 91368.0 | 96974.0 | 148997.0 | 1248.0 | 1386.820577 | |
2020-01-31 | -18.7 | -2.1 | 4370000.0 | -18.3 | -7.6 | 11.0 | -13.1 | -7.3 | -7.3 | -8.1 | ... | NaN | NaN | 973534.0 | 65812.0 | 90008.0 | 56975.0 | 149279.0 | 1640.0 | NaN | |
2020-02-29 | 2.6 | -2.6 | 4250000.0 | 9.5 | -6.0 | 4.7 | -2.6 | -10.8 | -2.9 | -0.7 | ... | NaN | NaN | 974205.0 | 51673.0 | 100464.0 | 94819.0 | 79594.0 | 1766.0 | NaN | |
2020-03-31 | -3.2 | -15.0 | 2880000.0 | -60.2 | -69.3 | -82.6 | -10.3 | -37.7 | -44.4 | -29.7 | ... | NaN | NaN | 973761.0 | 55233.0 | 75791.0 | 114056.0 | 254684.0 | 1122.0 | NaN |
df.describe()
IDMBKY=ECI | IDCARY=ECI | USCARS=ECI | ESCARM=ECI | ESCARY=ECI | ITCARM=ECI | DECARM=ECI | DECARY=ECI | GBCARY=ECI | ZAVEHY=ECI | ... | aLUNHTREG | aLTNHTREG | aSGMANPOP | aJPPNISS | aCATRKSBSP | aUSAUIMMXP | aGBACECARP | aLUNCAR | aISNVEHREG/A | aROACECARP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 121.000000 | 121.000000 | 1.210000e+02 | 121.000000 | 121.000000 | 121.000000 | 121.000000 | 121.000000 | 121.000000 | 121.000000 | ... | 70.000000 | 70.000000 | 121.000000 | 121.000000 | 121.000000 | 121.000000 | 121.000000 | 121.000000 | 118.000000 | |
mean | 1.989256 | 8.971901 | 6.429669e+06 | 1.719835 | 3.452893 | 4.186777 | 1.361157 | 0.227273 | 1.167769 | 2.982645 | ... | 77.742857 | 219.414286 | 960183.859504 | 81079.975207 | 99080.884298 | 100680.198347 | 194813.760331 | 1760.975207 | 1297.845786 | |
std | 18.529849 | 23.429430 | 1.102691e+06 | 20.667017 | 20.016614 | 35.981592 | 16.818400 | 11.737554 | 10.016605 | 11.891570 | ... | 20.483776 | 92.862606 | 9862.880360 | 16003.024972 | 23965.353378 | 17986.371146 | 112509.688567 | 451.704466 | 723.767425 | |
min | -36.800000 | -39.100000 | 2.880000e+06 | -60.200000 | -69.300000 | -82.600000 | -36.700000 | -37.700000 | -44.400000 | -29.700000 | ... | 36.000000 | 43.000000 | 933225.000000 | 44193.000000 | 54137.000000 | 56975.000000 | 55305.000000 | 933.000000 | 126.812220 | |
25% | -10.300000 | -5.300000 | 5.660000e+06 | -10.600000 | -8.100000 | -12.700000 | -9.600000 | -4.700000 | -4.100000 | -4.500000 | ... | 62.500000 | 148.750000 | 955156.000000 | 70698.000000 | 81675.000000 | 89349.000000 | 137746.000000 | 1418.000000 | 734.942203 | |
50% | 1.800000 | 4.700000 | 6.550000e+06 | 5.400000 | 5.200000 | -2.300000 | -2.300000 | 2.600000 | 2.400000 | 0.300000 | ... | 76.500000 | 215.500000 | 959314.000000 | 79267.000000 | 93633.000000 | 100294.000000 | 163357.000000 | 1756.000000 | 1128.116730 | |
75% | 10.500000 | 17.300000 | 7.480000e+06 | 15.000000 | 15.200000 | 13.500000 | 7.900000 | 7.400000 | 8.000000 | 8.200000 | ... | 93.000000 | 265.500000 | 969728.000000 | 93607.000000 | 116544.000000 | 115554.000000 | 195226.000000 | 2038.000000 | 1969.269372 | |
max | 76.400000 | 92.100000 | 8.170000e+06 | 36.700000 | 63.100000 | 124.100000 | 51.200000 | 24.700000 | 26.600000 | 40.500000 | ... | 127.000000 | 574.000000 | 974495.000000 | 120217.000000 | 151810.000000 | 135565.000000 | 562337.000000 | 3307.000000 | 2695.983608 |
Importantly, we now want to create some lagged versions of our features, to see if there is any lagged effect in the data.
dfs = df.copy()
lags = 6
for r in dfs.columns.values:
    for lag in range(1, lags + 1):
        dfs[f'{r}_lag_{lag}'] = dfs[r].shift(lag)  # add a copy of each column lagged by `lag` months
dfs
IDMBKY=ECI | IDCARY=ECI | USCARS=ECI | ESCARM=ECI | ESCARY=ECI | ITCARM=ECI | DECARM=ECI | DECARY=ECI | GBCARY=ECI | ZAVEHY=ECI | ... | aISNVEHREG/A_lag_3 | aISNVEHREG/A_lag_4 | aISNVEHREG/A_lag_5 | aISNVEHREG/A_lag_6 | aROACECARP_lag_1 | aROACECARP_lag_2 | aROACECARP_lag_3 | aROACECARP_lag_4 | aROACECARP_lag_5 | aROACECARP_lag_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date | |||||||||||||||||||||
2010-03-31 | 39.6 | 92.1 | 5630000.0 | 36.7 | 63.1 | 28.5 | 51.1 | -26.6 | 26.6 | 24.7 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2010-04-30 | 70.0 | 88.5 | 5480000.0 | -24.9 | 39.3 | -37.9 | -11.9 | -31.7 | 11.5 | 38.8 | ... | NaN | NaN | NaN | NaN | 9923.0 | NaN | NaN | NaN | NaN | |
2010-05-31 | 39.9 | 69.0 | 5720000.0 | 9.9 | 44.6 | 2.4 | -3.7 | -35.1 | 13.5 | 40.5 | ... | NaN | NaN | NaN | NaN | 10376.0 | 9923.0 | NaN | NaN | NaN | |
2010-06-30 | 34.7 | 77.9 | 5640000.0 | 18.1 | 25.6 | 4.3 | 15.8 | -32.3 | 10.8 | 19.7 | ... | 224.010250 | NaN | NaN | NaN | 10758.0 | 10376.0 | 9923.0 | NaN | NaN | |
2010-07-31 | 28.4 | 72.0 | 5740000.0 | -32.4 | -24.1 | -10.4 | -17.9 | -30.2 | -13.2 | 23.4 | ... | 126.812220 | 224.010250 | NaN | NaN | 13892.0 | 10758.0 | 10376.0 | 9923.0 | NaN | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2019-11-30 | -8.3 | -9.4 | 4380000.0 | -0.9 | 2.3 | -4.0 | 5.1 | 9.7 | -1.3 | -5.8 | ... | 1426.873182 | 1214.162249 | 1058.299246 | 1219.266184 | 10649.0 | 6046.0 | 23177.0 | 23206.0 | 14082.0 | |
2019-12-31 | -6.8 | -0.2 | 4490000.0 | 13.6 | 6.6 | -7.0 | -5.3 | 19.5 | 3.4 | 4.2 | ... | 1444.425137 | 1426.873182 | 1214.162249 | 1058.299246 | 13091.0 | 10649.0 | 6046.0 | 23177.0 | 23206.0 | |
2020-01-31 | -18.7 | -2.1 | 4370000.0 | -18.3 | -7.6 | 11.0 | -13.1 | -7.3 | -7.3 | -8.1 | ... | 1497.699721 | 1444.425137 | 1426.873182 | 1214.162249 | 13773.0 | 13091.0 | 10649.0 | 6046.0 | 23177.0 | |
2020-02-29 | 2.6 | -2.6 | 4250000.0 | 9.5 | -6.0 | 4.7 | -2.6 | -10.8 | -2.9 | -0.7 | ... | 1604.147896 | 1497.699721 | 1444.425137 | 1426.873182 | 12489.0 | 13773.0 | 13091.0 | 10649.0 | 6046.0 | |
2020-03-31 | -3.2 | -15.0 | 2880000.0 | -60.2 | -69.3 | -82.6 | -10.3 | -37.7 | -44.4 | -29.7 | ... | 1386.820577 | 1604.147896 | 1497.699721 | 1444.425137 | 8836.0 | 12489.0 | 13773.0 | 13091.0 | 10649.0 |
Adding Target Variables or Labels
So we now have our feature matrix with all our raw and lagged data. Now let's add some target variables, or labels. We will add two types of label that I think encompass many of the use cases: firstly, a timeseries of monthly closing prices; and secondly, a company fundamental monthly timeseries - in our case the IBES consensus revenue estimate for FY1. You will see it is very straightforward to download and then concatenate these types of label data onto your feature matrix. Here we use BMW as an example as it is a global automaker.
df1 = ek.get_timeseries(['BMWG.DE'],['CLOSE'],start_date='2010-03-01',end_date='2020-04-26',interval='monthly')
df1
BMWG.DE | CLOSE |
---|---|
Date | |
2010-03-31 | 34.180 |
2010-04-30 | 37.130 |
2010-05-31 | 37.965 |
2010-06-30 | 40.065 |
2010-07-31 | 41.310 |
... | ... |
2019-11-30 | 73.300 |
2019-12-31 | 73.140 |
2020-01-31 | 64.340 |
2020-02-29 | 58.930 |
2020-03-31 | 47.115 |
df2, err = ek.get_data('BMWG.DE',['TR.RevenueMean(SDate=-3,EDate=-123,Period=FY1,Frq=CM).calcdate','TR.RevenueMean(SDate=-3,EDate=-123,Period=FY1,Frq=M)'])
df2['Calc Date'] = pd.to_datetime(df2['Calc Date'].str[:10])
df2.set_index('Calc Date', inplace=True)
df2.sort_index(ascending=True, inplace=True)
df2
Instrument | Revenue - Mean | |
---|---|---|
Calc Date | ||
2010-03-31 | BMWG.DE | 52395861610 |
2010-04-30 | BMWG.DE | 52566096720 |
2010-05-31 | BMWG.DE | 53593869130 |
2010-06-30 | BMWG.DE | 53757373350 |
2010-07-31 | BMWG.DE | 55526963730 |
... | ... | ... |
2019-11-30 | BMWG.DE | 100616380300 |
2019-12-31 | BMWG.DE | 100905726390 |
2020-01-31 | BMWG.DE | 101207227710 |
2020-02-29 | BMWG.DE | 101460894380 |
2020-03-31 | BMWG.DE | 92875810320 |
121 rows × 2 columns
dfs = pd.concat([dfs, df1], axis=1)
dfs = pd.concat([dfs, df2['Revenue - Mean']], axis=1)
dfs
IDMBKY=ECI | IDCARY=ECI | USCARS=ECI | ESCARM=ECI | ESCARY=ECI | ITCARM=ECI | DECARM=ECI | DECARY=ECI | GBCARY=ECI | ZAVEHY=ECI | ... | aISNVEHREG/A_lag_5 | aISNVEHREG/A_lag_6 | aROACECARP_lag_1 | aROACECARP_lag_2 | aROACECARP_lag_3 | aROACECARP_lag_4 | aROACECARP_lag_5 | aROACECARP_lag_6 | CLOSE | Revenue - Mean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010-03-31 | 39.6 | 92.1 | 5630000.0 | 36.7 | 63.1 | 28.5 | 51.1 | -26.6 | 26.6 | 24.7 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 34.180 | |
2010-04-30 | 70.0 | 88.5 | 5480000.0 | -24.9 | 39.3 | -37.9 | -11.9 | -31.7 | 11.5 | 38.8 | ... | NaN | NaN | 9923.0 | NaN | NaN | NaN | NaN | NaN | 37.130 | |
2010-05-31 | 39.9 | 69.0 | 5720000.0 | 9.9 | 44.6 | 2.4 | -3.7 | -35.1 | 13.5 | 40.5 | ... | NaN | NaN | 10376.0 | 9923.0 | NaN | NaN | NaN | NaN | 37.965 | |
2010-06-30 | 34.7 | 77.9 | 5640000.0 | 18.1 | 25.6 | 4.3 | 15.8 | -32.3 | 10.8 | 19.7 | ... | NaN | NaN | 10758.0 | 10376.0 | 9923.0 | NaN | NaN | NaN | 40.065 | |
2010-07-31 | 28.4 | 72.0 | 5740000.0 | -32.4 | -24.1 | -10.4 | -17.9 | -30.2 | -13.2 | 23.4 | ... | NaN | NaN | 13892.0 | 10758.0 | 10376.0 | 9923.0 | NaN | NaN | 41.310 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2019-11-30 | -8.3 | -9.4 | 4380000.0 | -0.9 | 2.3 | -4.0 | 5.1 | 9.7 | -1.3 | -5.8 | ... | 1058.299246 | 1219.266184 | 10649.0 | 6046.0 | 23177.0 | 23206.0 | 14082.0 | 13011.0 | 73.300 | |
2019-12-31 | -6.8 | -0.2 | 4490000.0 | 13.6 | 6.6 | -7.0 | -5.3 | 19.5 | 3.4 | 4.2 | ... | 1214.162249 | 1058.299246 | 13091.0 | 10649.0 | 6046.0 | 23177.0 | 23206.0 | 14082.0 | 73.140 | |
2020-01-31 | -18.7 | -2.1 | 4370000.0 | -18.3 | -7.6 | 11.0 | -13.1 | -7.3 | -7.3 | -8.1 | ... | 1426.873182 | 1214.162249 | 13773.0 | 13091.0 | 10649.0 | 6046.0 | 23177.0 | 23206.0 | 64.340 | |
2020-02-29 | 2.6 | -2.6 | 4250000.0 | 9.5 | -6.0 | 4.7 | -2.6 | -10.8 | -2.9 | -0.7 | ... | 1444.425137 | 1426.873182 | 12489.0 | 13773.0 | 13091.0 | 10649.0 | 6046.0 | 23177.0 | 58.930 | |
2020-03-31 | -3.2 | -15.0 | 2880000.0 | -60.2 | -69.3 | -82.6 | -10.3 | -37.7 | -44.4 | -29.7 | ... | 1497.699721 | 1444.425137 | 8836.0 | 12489.0 | 13773.0 | 13091.0 | 10649.0 | 6046.0 | 47.115 | |
Next we need to standardise our data before we feed it to the XGBoost routine - that means both the feature matrix and the labels. In our case we will use a simple-to-calculate z-score. As you will see, this is applied to every column of our concatenated feature and label frame.
b = list(dfs.columns.values)
for col in b:
    col_zscore = col + '_zscore'
    dfs[col_zscore] = (dfs[col] - dfs[col].mean()) / dfs[col].std(ddof=0)  # population z-score per column
dfs
IDMBKY=ECI | IDCARY=ECI | USCARS=ECI | ESCARM=ECI | ESCARY=ECI | ITCARM=ECI | DECARM=ECI | DECARY=ECI | GBCARY=ECI | ZAVEHY=ECI | ... | aISNVEHREG/A_lag_5_zscore | aISNVEHREG/A_lag_6_zscore | aROACECARP_lag_1_zscore | aROACECARP_lag_2_zscore | aROACECARP_lag_3_zscore | aROACECARP_lag_4_zscore | aROACECARP_lag_5_zscore | aROACECARP_lag_6_zscore | CLOSE_zscore | Revenue - Mean_zscore | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010-03-31 | 39.6 | 92.1 | 5630000.0 | 36.7 | 63.1 | 28.5 | 51.1 | -26.6 | 26.6 | 24.7 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -2.546434 | |
2010-04-30 | 70.0 | 88.5 | 5480000.0 | -24.9 | 39.3 | -37.9 | -11.9 | -31.7 | 11.5 | 38.8 | ... | NaN | NaN | 0.440206 | NaN | NaN | NaN | NaN | NaN | -2.359657 | |
2010-05-31 | 39.9 | 69.0 | 5720000.0 | 9.9 | 44.6 | 2.4 | -3.7 | -35.1 | 13.5 | 40.5 | ... | NaN | NaN | 0.558189 | 0.439728 | NaN | NaN | NaN | NaN | -2.306790 | |
2010-06-30 | 34.7 | 77.9 | 5640000.0 | 18.1 | 25.6 | 4.3 | 15.8 | -32.3 | 10.8 | 19.7 | ... | NaN | NaN | 0.657681 | 0.557231 | 0.449537 | NaN | NaN | NaN | -2.173830 | |
2010-07-31 | 28.4 | 72.0 | 5740000.0 | -32.4 | -24.1 | -10.4 | -17.9 | -30.2 | -13.2 | 23.4 | ... | NaN | NaN | 1.473930 | 0.656318 | 0.567156 | 0.464146 | NaN | NaN | -2.095004 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2019-11-30 | -8.3 | -9.4 | 4380000.0 | -0.9 | 2.3 | -4.0 | 5.1 | 9.7 | -1.3 | -5.8 | ... | -0.325140 | -0.100668 | 0.629292 | -0.565924 | 3.890867 | 3.929603 | 1.564926 | 1.287398 | -0.069580 | |
2019-12-31 | -6.8 | -0.2 | 4490000.0 | 13.6 | 6.6 | -7.0 | -5.3 | 19.5 | 3.4 | 4.2 | ... | -0.110532 | -0.321422 | 1.265310 | 0.628044 | -0.557105 | 3.922037 | 3.952333 | 1.566968 | -0.079711 | |
2020-01-31 | -18.7 | -2.1 | 4370000.0 | -18.3 | -7.6 | 11.0 | -13.1 | -7.3 | -7.3 | -8.1 | ... | 0.182350 | -0.107668 | 1.442937 | 1.261473 | 0.638039 | -0.547340 | 3.944745 | 3.948669 | -0.636876 | |
2020-02-29 | 2.6 | -2.6 | 4250000.0 | 9.5 | -6.0 | 4.7 | -2.6 | -10.8 | -2.9 | -0.7 | ... | 0.206517 | 0.184048 | 1.108520 | 1.438376 | 1.272091 | 0.653555 | -0.537793 | 3.941099 | -0.979406 | |
2020-03-31 | -3.2 | -15.0 | 2880000.0 | -60.2 | -69.3 | -82.6 | -10.3 | -37.7 | -44.4 | -29.7 | ... | 0.279871 | 0.208119 | 0.157097 | 1.105320 | 1.449169 | 1.290659 | 0.666639 | -0.530724 | -1.727464 |
Next we need to make sure we identify the right features to pass to the XGBoost model - we only wish to pass the z-score features.
mycols = dfs.columns
mycols[1815:-2]  # the z-score features start at position 1815; the final two columns are the label z-scores
Index(['IDMBKY=ECI_zscore', 'IDCARY=ECI_zscore', 'USCARS=ECI_zscore',
'ESCARM=ECI_zscore', 'ESCARY=ECI_zscore', 'ITCARM=ECI_zscore',
'DECARM=ECI_zscore', 'DECARY=ECI_zscore', 'GBCARY=ECI_zscore',
'ZAVEHY=ECI_zscore',
...
'aISNVEHREG/A_lag_3_zscore', 'aISNVEHREG/A_lag_4_zscore',
'aISNVEHREG/A_lag_5_zscore', 'aISNVEHREG/A_lag_6_zscore',
'aROACECARP_lag_1_zscore', 'aROACECARP_lag_2_zscore',
'aROACECARP_lag_3_zscore', 'aROACECARP_lag_4_zscore',
'aROACECARP_lag_5_zscore', 'aROACECARP_lag_6_zscore'],
dtype='object', length=1813)
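The hard-coded 1815 offset works because our frame holds 1815 raw columns (1813 features plus the two labels) followed by their z-scores. If you would rather not rely on column positions, a more robust (optional) alternative is to select by suffix:
feature_cols = [c for c in dfs.columns
                if c.endswith('_zscore') and c not in ('CLOSE_zscore', 'Revenue - Mean_zscore')]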
We are now getting into the machine learning section, so we need to split our data into train and test sets.
split_date = '01-01-2015'
pri_train = dfs.loc[dfs.index <= split_date].copy()
pri_test = dfs.loc[dfs.index > split_date].copy()
We also need a function to create our feature and label sets, and then we go ahead and create X_train (training feature set), y_train (training label set), X_test (test feature set) and y_test (test label set).
def create_features(df, label=None):
    collist = df.columns.tolist()
    relcols = collist[1815:-2]  # z-score feature columns only, excluding the two label z-scores
    X = df[relcols]
    if label:
        y = df[label].to_frame()
        return X, y
    return X
X_train, y_train = create_features(pri_train, label='CLOSE_zscore')
X_test, y_test = create_features(pri_test, label='CLOSE_zscore')
Finally, I usually do a quick date check to see that all dates are properly aligned.
print(X_train.index.min(),y_train.index.min(),X_train.index.max(),y_train.index.max())
2010-03-31 00:00:00 2010-03-31 00:00:00 2014-12-31 00:00:00 2014-12-31 00:00:00
print(X_test.index.min(),y_test.index.min(),X_test.index.max(),y_test.index.max())
2015-01-31 00:00:00 2015-01-31 00:00:00 2020-03-31 00:00:00 2020-03-31 00:00:00
Define and fit XGBoost model
Finally we define our XGBoost model (in our case we use a simple vanilla setup), then we pass our training and test sets to fit the model. Once we have done that we can visualise our feature importance as a plot.
reg = xgb.XGBRegressor(objective ='reg:squarederror',n_estimators=1000)
reg.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
early_stopping_rounds=50,
verbose=False)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
importance_type='gain', learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
n_jobs=1, nthread=None, objective='reg:squarederror',
random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=None, silent=None, subsample=1, verbosity=1)
_ = plot_importance(reg, height=0.9)
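If you prefer raw numbers to the plot, the fitted model also exposes its importances directly - a short optional sketch:
imp = pd.Series(reg.feature_importances_, index=X_train.columns)  # one importance score per feature
imp.sort_values(ascending=False).head(10)  # the 10 most influential features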
Finally, we can use our model to create an ML_Prediction for our label given our X_test set, and then compare that to the actual observations of the label.
pri_test1 = pri_test.copy()
pri_test1['ML_Prediction'] = reg.predict(X_test)
pri_all = pd.concat([pri_test1, pri_train], sort=False)
_ = pri_all[['CLOSE_zscore','ML_Prediction']].plot(figsize=(15, 5))
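We imported mean_squared_error and mean_absolute_error earlier but have not yet used them - as a final optional check, we can score the out-of-sample fit numerically:
mse = mean_squared_error(pri_test1['CLOSE_zscore'], pri_test1['ML_Prediction'])
mae = mean_absolute_error(pri_test1['CLOSE_zscore'], pri_test1['ML_Prediction'])
print(mse, mae)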
Conclusion
In this article we discussed the economic content available in Eikon. We talked about real-time and non-realtime economic indicators and how you can discover and use this data in your workflows using our search discovery tools. We also showed an example of real-time indicator monitoring, similar to the Economic Monitor (EM) app in Eikon.
Then we moved on to an example which puts much of this into practice using a simple machine learning workflow. We found a group of automobile industry indicators using our search discovery tools and saved those down as a spreadsheet. Then we used those RICs to download the timeseries for these economic indicators. We created some new (derived) features by generating 6 months' worth of lagged data. Once we had completed our raw feature matrix, we showed how to get two types of typical label data: a price series (in our case monthly BMW closing prices) and some fundamental data for BMW (in our case monthly IBES consensus revenue estimates). We then standardised our data using z-scores, a very common method, before splitting our data into training and test sets and creating our feature and label sets for both. We then defined a simple XGBoost model and handed our data to it to fit. Once that was complete, we created a model prediction for our label set (in our case BMW monthly closing prices) and visualised this in a chart.
The model was able to generate some kind of prediction - which doesn't seem that compelling, but is not entirely without merit. The model broadly predicts lower prices for the stock based on our economic indicator factors, and this has broadly been borne out in the test set.
Whilst a simple example, I hope this article has demystified this type of workflow and provided you with a practical codebase to explore further.
Further Resources for Eikon Data API
For Content Navigation in Eikon - please use the Data Item Browser Application: Type 'DIB' into Eikon Search Bar.