SFI Snowflake Data Warehouse
| | |
| --- | --- |
| Example Source Code | |
| Github files referenced | |
| Last Update | Dec. 2024 |
| Interpreter | Python 3.11.x or greater |
| Prerequisites | Familiarity with Python and a basic understanding of Jupyter Notebook. The majority of the examples are built to work within Jupyter Notebook; ensure this package is installed. |
In this Quick Start article, we will investigate public companies' climate data on Snowflake. If you are interested in a more in-depth use case, please read the "LSEG's SFI Snowflake Datawarehouse: A use-case evaluating public companies' Climate track record using Python and SQL" article, which investigates companies' climate targets and whether they are on track to meet them, all while using data on LSEG's Sustainable Finance Investing (SFI) Snowflake.
LSEG collects data from a range of sources; for climate data, we rely on our analysts to retrieve the information directly from companies' disclosures. Recently, the most exciting data onboarded has related to ESG and climate topics. Due to popular demand, we distribute Sustainable Finance and Investing (SFI) data via the Snowflake data warehouse. SFI Snowflake can be accessed via Snowflake's online platform, or simply from Python via Snowflake's API wrapper named Snowpark. Datasets available include ESG, Climate, EU Taxonomy and Green Revenues data. Snowflake's API is a REST API, just like LSEG's Data Platform; and just as we wrapped LSEG's Data Platform REST API into LSEG's Data Library for Python, Snowflake wrapped their API into the snowpark module (in their Python library snowflake, with Scala and Java equivalents). In this Quick Start, we will look into the Snowflake workflow up to 'Python Worksheets'; in total, there will be three topics: (i) Snowflake's Snowpark Python library, (ii) Snowflake SQL Worksheets and (iii) Snowflake Python Worksheets.
For more information on the data available on LSEG's SFI Snowflake, please look into myaccount.lseg.com/en/product/quantitative-analytics. There you will find all the documentation necessary to understand the data model we are tapping into. Please be aware, while reading this documentation, that the SFI datasets are surfaced via the Quantitative Analytics (QA) product. LSEG collates several sets of data in QA tables; some are packaged commercially under the name 'Datastream', others under the name 'Worldscope'. Among these many packages, we are only interested in 'SFI', which is easy to access using Snowflake. When logging into myaccount.lseg.com/en/product/quantitative-analytics, you can access PDF documentation files:
Personally, I'm a fan of accessing such information programmatically. In this article, I'll take you through how one can do just that.
Snowflake's Snowpark Python library
We first need to import the `snowflake-snowpark-python` library. This library allows us to connect to Snowflake's services without having to deal with things like authentication tokens ourselves:
## pip install snowflake-snowpark-python
## pip install "snowflake-connector-python[pandas]"
from snowflake import snowpark
import pandas as pd
import os # This is a native library that will allow us to collect authentication details from the machine where we're running our code.
import plotly.graph_objects as go # `plotly` and its modules are useful for us to plot graphs
import numpy as np # `numpy` in this case will be useful for us to plot graphs
In the cell below, we create the global variable g_db_session, in which we will store our authenticated session. Once populated, we can invoke it like an object through which we access SFI data. To insert our authentication details, we create the get_session function:
# Global variable
g_db_session = None

def get_session(user, password):
    global g_db_session
    # Establish a connection and assign it to the global session variable
    connection_parameters = {
        "account": "rddpreprod1.east-us-2.azure",
        "user": user,
        "password": password,
        "warehouse": "SFI_READER_S_WH",
        "database": "SHAREDB_SFI_MVP",
        "role": "SNO_SNOREFDBD_SFI_READER_ROLE",
        "schema": "DBO",
        "authenticator": "externalbrowser"
    }
    g_db_session = snowpark.Session.builder.configs(connection_parameters).create()
Let's pick an Organisation's Permanent IDentifier (OrgPermID) at random for now:
OrgPermID = "4295875633"
Note that we do not want to show our password, or enter it anywhere visible to anyone reading our code. Therefore, instead of writing our password in a string, we read it from an environment variable, and the "externalbrowser" authenticator opens a browser window where you can enter your credentials and log into Snowflake:
get_session("john.doe@domain.com", os.getenv('SSO_PASSWD'))
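If the SSO_PASSWD environment variable is not set on your machine, an alternative (a minimal sketch; getpass is a native library that prompts for the password at run time without echoing or storing it) is:

import getpass  # Native library that prompts for a password without echoing it

# Prompt for the password interactively instead of reading an environment variable
get_session("john.doe@domain.com", getpass.getpass("Snowflake password: "))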
Now we can send our SQL query:
# Retrieve the data from Snowflake
query_esg = f"""
SELECT S.FY, S.SCORECALCDT, S.SCORE, I.TITLE
FROM ESG2SCORES S
JOIN ESG2ITEM I ON S.ITEM = I.ITEM
WHERE S.ORGPERMID = {OrgPermID} and I.ITEM IN (2,4,5,6,7,8,9,10,11,12,13,14,15,16,65)
"""
# Execute the query and get the DataFrame
df_esg = g_db_session.sql(query_esg).toPandas()
display(df_esg)
| | FY | SCORECALCDT | SCORE | TITLE |
| --- | --- | --- | --- | --- |
| 0 | 2008 | 12:43.4 | 0.950658 | Workforce Score |
| 1 | 2008 | 12:43.4 | 0.900794 | Emissions Score |
| ... | ... | ... | ... | ... |
| 306 | 2017 | 10:38.0 | 0.898148 | CSR Strategy Score |
| 307 | 2017 | 10:38.0 | 0.422414 | Shareholders Score |

308 rows × 4 columns
# Pivot the DataFrame to get the desired format
pivot_df_esg = df_esg.pivot(index='FY', columns='TITLE', values='SCORE')
# Return most recent FY
pivot_df_esg = pivot_df_esg.iloc[[-1]].reset_index(drop=True)
pivot_df_esg.columns.name = None
display(pivot_df_esg)
| | CSR Strategy Score | Community Score | ESG Combined Score | Emissions Score | Environment Pillar Score | Environmental Innovation Score | Governance Pillar Score | Human Rights Score | Management Score | Product Responsibility Score | Resource Use Score | Shareholders Score | Social Pillar Score | Workforce Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.95 | 0.968165 | 0.481272 | 0.806513 | 0.827386 | 0.761364 | 0.84 | 0.909524 | 0.956 | 0.787698 | 0.895062 | 0.38 | 0.925768 | 0.990637 |
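We imported plotly earlier for graphing; as a quick illustration (a minimal sketch reusing the objects created above, not part of the original workflow), we can chart this most recent fiscal year's scores:

# Bar chart of the most recent FY's scores, using the plotly import from earlier
fig = go.Figure(
    go.Bar(
        x=list(pivot_df_esg.columns),
        y=pivot_df_esg.iloc[0].values))
fig.update_layout(
    title=f"ESG scores for OrgPermID {OrgPermID} (most recent FY)",
    yaxis_title="Score")
fig.show()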
Snowflake SQL Worksheets
Above, we picked a random OrgPermID, but how can we find the company it relates to? As it turns out, answering this question with SFI data and SQL code takes us quite far.
When you open your Snowflake homepage, you ought to be greeted with something like this:
The main difference should be that you have no ongoing projects.
What we're trying to do here is find the place where company information lives; then maybe we can find out which company has OrgPermID '4295875633'. To do this, let's first select our Role (for more info on Roles, please read Snowflake's documentation):
Now that we have selected our role, let's dig through the databases we have access to:
In this article, we focus on SFI databases available on Snowflake, specifically "SHAREDB_SFI_MVP" & "SFI_MASTER_MVP":
I would advise starting with the SHAREDB_SFI_MVP.DBO.QASchTableInfo table (i.e.: the QASchTableInfo table, in the DBO schema, in the SHAREDB_SFI_MVP database):
We can see what data lives in this table with the 'Open in Worksheet' option:
We unfortunately have to pick our role again:
I would advise changing the default code too, since we don't have permission to change any of the data in these tables. I go for:
SELECT TOP 10 * FROM SHAREDB_SFI_MVP.DBO.QASCHFIELDINFO
Snowflake Python Worksheets
Let's see what company is linked with this OrgPermID '4295875633'. To do so, we'll need to venture into the world of Python and Snowflake Worksheets. Open a new Python Worksheet:
Let's run the code we had above, but in Snowflake's Worksheets:
# The Snowpark package is required for Python Worksheets.
# You can add more packages by selecting them using the Packages control and then importing them.
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col
import pandas as pd

def main(session: snowpark.Session):
    OrgPermID = "4295875633"
    query = f"""
        SELECT TABLE_NAME
        FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES
        WHERE TABLE_NAME ILIKE '%Perm%'
        """
    df = session.sql(query).toPandas()
    print(df)
- `OrgPermID = "4295875633"` specifies the company we are searching for.
- The `query` string holds the SQL query we send to Snowflake to run, exactly as if we were in a SQL Worksheet as before. It selects the names of all the tables in the SHAREDB_SFI_MVP database whose name contains 'Perm'.
- `session.sql(query).toPandas()` uses the `session` object that Snowflake passes into `main` to send the `query` and transform the result into a pandas dataframe, which we store in the object `df`. Pandas dataframes are easy to manipulate in Python, although you can choose other types from which to work.
- Finally, we simply `print` this dataframe so we can see it in the Output window below the code window in Snowflake's browser window.
You may wonder: "why did we write Python code just to send SQL to Snowflake, when we could do this directly in Snowflake SQL Worksheets?". Great question; the reason is that loops are not readily available in SQL Worksheets, whereas in Python Worksheets we can loop over the table names to search each table for the ORGPERMID '4295875633', using the below in the `main` function:
query = f"""
SELECT TABLE_NAME
FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME ILIKE '%Perm%'
"""
df = session.sql(query).toPandas()
for table in df.TABLE_NAME:
try:
query = f"""
SELECT * FROM SHAREDB_SFI_MVP.DBO.{table}
WHERE ORGPERMID = '4295875633'
"""
_df = session.sql(query).toPandas()
if len(_df) > 0:
print("\n")
print(f"{table}")
print(_df.T)
except:
print("\n")
print(f"{table} didn't have a 'ORGPERMID' column")
- First, we recreate our SQL query selecting the table names as before, and collect the resulting list of tables in the dataframe `df`.
- We then start a loop over each table in `df` and, for each one, create and run a query collecting that table's rows where ORGPERMID is '4295875633'.
- Note that each query runs inside a try block. This is due to the fact that some tables do not have an ORGPERMID column, in which case the query fails; the except statement catches and reports these cases.
- Finally, whenever we receive data back from our SQL query, we display it with `print` statements.
We can now see from the `PERMINSTRREF` table that the company probably is "ENI" (as highlighted in red boxes in the image above); but let's check. With the code below, I only output the table and column names for which "ENI" is found, fully or partially:
query = f"""
SELECT TABLE_NAME
FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME ILIKE '%Perm%'
"""
df = session.sql(query).toPandas()
for table in df.TABLE_NAME:
try:
query = f"""
SELECT * FROM SHAREDB_SFI_MVP.DBO.{table}
WHERE ORGPERMID = '4295875633'
"""
_df = session.sql(query).toPandas()
columns_with_name = [col for col in _df.columns if _df[col].astype(str).str.contains('ENI', case=False, na=False).any()]
if len(_df) > 0 and len(columns_with_name) > 0:
print("\n")
print(f"Table {table}'s columns where 'ENI' appears:", columns_with_name)
except:
print("\n")
print(f"{table} didn't have a 'ORGPERMID' column")
I like the sound of 'LEGALNAME'; let's look into it. First let's return to our SQL worksheet:
and let's use
SELECT *
FROM SHAREDB_SFI_MVP.DBO.PERMORGINFO
LIMIT 10;
LEGALNAME is indeed a much better fit; we were right to check all the table and column names in which "ENI" is found, fully or partially, since it looks like we should focus on the table SHAREDB_SFI_MVP.DBO.PERMORGINFO.
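Back in our Jupyter notebook, we can run the same check through the g_db_session we created earlier (a minimal sketch reusing the table and column names found above):

# Look up the legal name for our OrgPermID via the session from earlier
query_name = f"""
SELECT ORGPERMID, LEGALNAME
FROM SHAREDB_SFI_MVP.DBO.PERMORGINFO
WHERE ORGPERMID = {OrgPermID}
"""
display(g_db_session.sql(query_name).toPandas())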
While we looked for this OrgPermID, you saw that data is split between all the different databases. Finding the correct database can be a little difficult, but once you get the hang of Python Worksheets, it becomes easier.
Now, going forward, we are looking for climate data. For this article, I was after greenhouse gas emissions data, often abbreviated to GHG (i.e.: GHGEmission). Looking into the myaccount.lseg.com/en/product/quantitative-analytics files related to Climate Data, I found promising information about the SFI_MASTER_MVP.DBO.CLMDataPoint table:
Let's look into SFI_MASTER_MVP.DBO.CLMDataPoint:
Here's ORGPERMID again; let's use it. If you look through the documentation as we did to find SFI_MASTER_MVP.DBO.CLMDataPoint, you'll see that there is a table called SFI_MASTER_MVP.DBO.CLMITEM with item names in full; there, I found 'ShortTermSet1GHGEmissionPercentageReductionTargeted', which was a good candidate for the item I'm interested in. We can pick out the rows in CLMDataPoint that relate only to this item:
SELECT *
FROM SFI_MASTER_MVP.DBO.CLMDataPoint
WHERE ORGPERMID = 4295875633
AND ITEM IN (
SELECT ITEM
FROM SFI_MASTER_MVP.DBO.CLMITEM
WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted'
);
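If you'd rather discover candidate items programmatically, here is a minimal sketch from the notebook session (the FEEDFIELDNAME column is the one used in the query above; the ILIKE pattern is an illustrative guess):

# Search CLMITEM for item names mentioning GHG emission targets
query_items = """
SELECT ITEM, FEEDFIELDNAME
FROM SFI_MASTER_MVP.DBO.CLMITEM
WHERE FEEDFIELDNAME ILIKE '%GHGEmission%Target%'
"""
display(g_db_session.sql(query_items).toPandas())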
Let's join our tables to return only the data we're after:
SELECT
p.ORGPERMID,
p.LEGALNAME,
c.FY,
c.VALUEPRCNT
FROM
SHAREDB_SFI_MVP.DBO.PERMORGINFO p
JOIN
SFI_MASTER_MVP.DBO.CLMDataPoint c
ON p.ORGPERMID = c.ORGPERMID
WHERE
c.ORGPERMID = 4295875633
AND c.ITEM IN (
SELECT ITEM
FROM SFI_MASTER_MVP.DBO.CLMITEM
WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted'
);
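To close the loop with our first topic, the same join can also be run from the Jupyter notebook through our Snowpark session (a sketch reusing the g_db_session and OrgPermID objects created earlier):

# Run the final join through the Snowpark session and collect the results
query_ghg = f"""
SELECT p.ORGPERMID, p.LEGALNAME, c.FY, c.VALUEPRCNT
FROM SHAREDB_SFI_MVP.DBO.PERMORGINFO p
JOIN SFI_MASTER_MVP.DBO.CLMDataPoint c
    ON p.ORGPERMID = c.ORGPERMID
WHERE c.ORGPERMID = {OrgPermID}
    AND c.ITEM IN (
        SELECT ITEM
        FROM SFI_MASTER_MVP.DBO.CLMITEM
        WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted')
"""
df_ghg = g_db_session.sql(query_ghg).toPandas()
display(df_ghg)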
Conclusion
In this Quick Start, we saw how we can retrieve data from LSEG's SFI Snowflake Datawarehouse. As you could see, there are several ways to do this: via (i) Snowflake's Snowpark Python library, (ii) Snowflake SQL Worksheets and (iii) Snowflake Python Worksheets. Some may find the Snowpark Python library easier to use than Snowflake SQL Worksheets, or vice versa; this really speaks volumes about the versatility of Snowflake, and the number of ways you can retrieve data in the manner that is most efficient for your workflow.
For a longer and more complete use-case example, please read the "LSEG's SFI Snowflake Datawarehouse: A use-case evaluating public companies' Climate track record using Python and SQL" article.