SFI Snowflake Data Warehouse

Getting Started with Python
Example Source Code

Python

GitHub files referenced:

LSEG_SFI_Snowflake_Warehouse_Quick_Start.ipynb

Last Update Dec. 2024
Interpreter Python 3.11.x or greater
Prerequisites

Familiarity with Python and a basic understanding of Jupyter Notebook.

The majority of the examples are built to work within Jupyter Notebook. Ensure this package is installed.

 

In this Quick Start article, we will investigate Public Companies' Climate Data on Snowflake. If you are interested in a more in-depth use-case, please read the "LSEG's SFI Snowflake Datawarehouse: A use-case evaluating public companies' Climate track record using Python and SQL" article, which investigates companies' climate targets and whether they are on track to meet them, all while using data on LSEG's Sustainable Finance Investing (SFI) Snowflake.

LSEG collects data from a range of sources; for climate data, we rely on our analysts to retrieve the information directly from companies' disclosures. Recently, the most exciting data onboarded has related to ESG and Climate topics. Due to popular demand, we distribute Sustainable Finance and Investing (SFI) data via the Snowflake data warehouse. SFI Snowflake can be accessed via Snowflake's online platform, or from Python via Snowflake's API wrapper named Snowpark. Datasets available include ESG, Climate, EU Taxonomy and Green Revenues data. Snowflake's API is a REST API, just like LSEG's Data Platform; and just as we wrapped LSEG's Data Platform REST API in LSEG's Data Library for Python, Snowflake wrapped their API in the `snowpark` module (of their Python library `snowflake`), with Scala and Java equivalents. In this Quick Start, we will look into the Snowflake workflow up to 'Python Worksheets'; in total, there will be three topics: (i) Snowflake's Snowpark Python library, (ii) Snowflake SQL Worksheets and (iii) Snowflake Python Worksheets.

For more information on the data available on LSEG's SFI Snowflake, please look into myaccount.lseg.com/en/product/quantitative-analytics, where you will find all the documentation necessary to understand the data model we are tapping. Please be aware, while reading this documentation, that the SFI datasets are surfaced via the Quantitative Analytics (QA) product: LSEG collates several sets of data in QA tables, some packaged commercially under the name 'Datastream', others under 'Worldscope'; among these many packages, we are only interested in 'SFI', which is easy to access using Snowflake. When logging into myaccount.lseg.com/en/product/quantitative-analytics, you can access PDF documentation files:

Personally, I'm a fan of accessing such information programmatically. In this article, I'll take you through how one can do just that.

Snowflake's Snowpark Python library

We first need to install and import the `snowflake-snowpark-python` library. This library allows us to connect to Snowflake's services without having to deal with things like authentication tokens ourselves:


## pip install snowflake-snowpark-python

## pip install "snowflake-connector-python[pandas]"

from snowflake import snowpark

import pandas as pd

import os # Native library that lets us collect authentication details from the machine where we're running our code.

import plotly.graph_objects as go # `plotly` and its modules are useful for us to plot graphs

import numpy as np # `numpy` in this case will be useful for us to plot graphs

In the cell below, we create the global variable g_db_session, which will hold our authenticated session; we can then use it like an object through which we access SFI data. To populate it with our authentication details, we create the get_session function:


# Global variable

g_db_session = None

 

def get_session(user, password):

    global g_db_session

    

    # Establish a connection and assign it to the global session variable

    connection_parameters = { 

        "account": "rddpreprod1.east-us-2.azure",

        "user": user,

        "password": password,

        "warehouse": "SFI_READER_S_WH", 

        "database": "SHAREDB_SFI_MVP",

        "role": "SNO_SNOREFDBD_SFI_READER_ROLE",

        "schema": "DBO",

        "authenticator": "externalbrowser"

    }

    

    g_db_session = snowpark.Session.builder.configs(connection_parameters).create() 

Let's pick an Organisation's Permanent IDentifier (OrgPermID) at random for now:

OrgPermID = "4295875633"

Note that we do not want to show our password, or enter it anywhere visible to anyone reading our code. Therefore, instead of hard-coding the password in a string, we read it from an environment variable; and with the `externalbrowser` authenticator, the session created in the global g_db_session object opens a browser window where you can enter your credentials and log into Snowflake:

get_session("john.doe@domain.com", os.getenv('SSO_PASSWD'))

Now we can send our SQL query:


# Retrieve the data from Snowflake

query_esg = f"""

    SELECT S.FY, S.SCORECALCDT, S.SCORE, I.TITLE

    FROM ESG2SCORES S 

    JOIN ESG2ITEM I ON S.ITEM = I.ITEM 

    WHERE S.ORGPERMID = {OrgPermID} and I.ITEM IN (2,4,5,6,7,8,9,10,11,12,13,14,15,16,65)

    """

 

# Execute the query and get the DataFrame

df_esg = g_db_session.sql(query_esg).toPandas()

 

display(df_esg)

      FY SCORECALCDT     SCORE               TITLE
0   2008     12:43.4  0.950658     Workforce Score
1   2008     12:43.4  0.900794     Emissions Score
..   ...         ...       ...                 ...
306 2017     10:38.0  0.898148  CSR Strategy Score
307 2017     10:38.0  0.422414  Shareholders Score

308 rows × 4 columns

# Pivot the DataFrame to get the desired format

pivot_df_esg = df_esg.pivot(index='FY', columns='TITLE', values='SCORE')

 

# Return most recent FY

pivot_df_esg = pivot_df_esg.iloc[[-1]].reset_index(drop=True)

pivot_df_esg.columns.name = None

 

display(pivot_df_esg)

(one row; shown transposed here for readability)

CSR Strategy Score              0.95
Community Score                 0.968165
ESG Combined Score              0.481272
Emissions Score                 0.806513
Environment Pillar Score        0.827386
Environmental Innovation Score  0.761364
Governance Pillar Score         0.84
Human Rights Score              0.909524
Management Score                0.956
Product Responsibility Score    0.787698
Resource Use Score              0.895062
Shareholders Score              0.38
Social Pillar Score             0.925768
Workforce Score                 0.990637

Snowflake SQL Worksheets

Above, we picked a random OrgPermID, but how can we find the company it relates to? As it turns out, answering this question with SFI data and a little SQL takes us quite far.

When you open your Snowflake homepage, you ought to be greeted with something looking like:

The main difference should be that you have no ongoing projects.

What we're trying to do here is find the place where company information lives; then maybe we can find out which company has OrgPermID '4295875633'. To do this, let's first select our Role (for more info on Roles, please read Snowflake's documentation):

Now that we have selected our role, let's dig through the databases we have access to:

In this article, we focus on SFI databases available on Snowflake, specifically "SHAREDB_SFI_MVP" & "SFI_MASTER_MVP":

I would advise starting with the SHAREDB_SFI_MVP.DBO.QASchTableInfo table (i.e.: the SHAREDB_SFI_MVP parent database, the DBO schema, and the QASchTableInfo table):

We can see what data lives in this table with the 'Open in Worksheet' option:

Unfortunately, we have to pick our role again:

I would advise changing the autogenerated code too, since we don't have permission to change any of the data in the tables. I go for:

SELECT TOP 10 * FROM SHAREDB_SFI_MVP.DBO.QASCHFIELDINFO

Snowflake Python Worksheets

Let's see what company is linked with this OrgPermID '4295875633'. To do so, we'll need to venture in the world of Python and Snowflake Worksheets. Open a new Python Worksheet:

Let's run the code we had above, but in Snowflake's Worksheets:


# The Snowpark package is required for Python Worksheets. 

# You can add more packages by selecting them using the Packages control and then importing them.

import snowflake.snowpark as snowpark

from snowflake.snowpark.functions import col

import pandas as pd

 

def main(session: snowpark.Session): 

 

    OrgPermID = "4295875633"

    

    query = f"""

    SELECT TABLE_NAME

    FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES

    WHERE TABLE_NAME ILIKE '%Perm%'

    """

    df = session.sql(query).toPandas()

    

    print(df)

- `OrgPermID = "4295875633"` specifies the company we were searching for.

- The `query` string holds the SQL we send to Snowflake to run, just as if we were simply in a SQL Worksheet as before. It selects the names of all the tables that can be found in the SHAREDB_SFI_MVP database.

- `session.sql(query).toPandas()` uses the `session` passed into `main` to send the `query` and transforms the result into a pandas dataframe. Pandas dataframes are easy to manipulate in Python, although you can choose other types from which to work. We output this dataframe in the object `df`.

- Finally, we simply `print` this dataframe so we can see it in the Output window below the code window in Snowflake's browser window.

You may wonder: "why did we write Python code just to send SQL to Snowflake, when we could do this directly in a SQL Worksheet?". Great question; the reason is that we can't use loops in SQL Worksheets! In a Python Worksheet, however, we can use the table names to search for the ORGPERMID '4295875633' with the below in the `main` function:


    query = f"""

        SELECT TABLE_NAME

        FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES

        WHERE TABLE_NAME ILIKE '%Perm%'

        """

    df = session.sql(query).toPandas()

    

    for table in df.TABLE_NAME:

        try:

            query = f"""

                SELECT * FROM SHAREDB_SFI_MVP.DBO.{table}

                WHERE ORGPERMID = '4295875633'

                """

            _df = session.sql(query).toPandas()

            if len(_df) > 0:

                print("\n")

                print(f"{table}")

                print(_df.T)

        except Exception:

            print("\n")

            print(f"{table} didn't have a 'ORGPERMID' column")

- We first create our SQL query selecting, as before, the names of the tables containing 'Perm', and store this list of tables in the dataframe `df`.

- We then start a loop over each table name in `df.TABLE_NAME` and, for each one, create & run a query collecting every row where ORGPERMID is '4295875633'.

- Note that each query runs inside a `try` block. This is due to the fact that some tables will not have an ORGPERMID column, and the query would fail; the matching `except` clause reports those tables instead of stopping the loop.

- Finally, whenever a query returns data, we make sure to display it with `print` statements.
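The loop-and-probe pattern above doesn't need a live Snowflake connection to understand. Here is a self-contained sketch in which a hypothetical `StubSession` (with made-up tables) stands in for the Snowpark session:

```python
import pandas as pd

class _Result:
    """Wraps a DataFrame the way Snowpark wraps query results (sketch)."""
    def __init__(self, df):
        self._df = df
    def toPandas(self):
        return self._df

class StubSession:
    """Hypothetical stand-in for `snowpark.Session`, backed by in-memory tables."""
    def __init__(self, tables):
        self._tables = tables  # table name -> DataFrame

    def sql(self, query):
        # Crude parse: find which stub table the query names.
        for name, df in self._tables.items():
            if name in query:
                if "ORGPERMID" in query and "ORGPERMID" not in df.columns:
                    raise ValueError("invalid identifier 'ORGPERMID'")
                return _Result(df)  # note: the stub ignores the WHERE filter
        raise ValueError("unknown table")

tables = {
    "PERMINSTRREF": pd.DataFrame({"ORGPERMID": ["4295875633"], "NAME": ["ENI"]}),
    "SOMEOTHERTABLE": pd.DataFrame({"OTHERCOL": [1]}),
}
session = StubSession(tables)

# Same probe loop as in the worksheet, minus the live connection.
for table in tables:
    try:
        _df = session.sql(
            f"SELECT * FROM {table} WHERE ORGPERMID = '4295875633'").toPandas()
        if len(_df) > 0:
            print(table)  # prints: PERMINSTRREF
    except Exception:
        print(f"{table} didn't have a 'ORGPERMID' column")
```

Swapping `StubSession` for the real `session` argument gives exactly the worksheet behaviour: tables without an ORGPERMID column raise, and everything else is probed.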

We can now see from the `PERMINSTRREF` table that the company is probably "ENI" (as highlighted in red boxes in the image above); but let's check. With the code below, I only output the table and column names where "ENI" is found, fully or partially:


    query = f"""

        SELECT TABLE_NAME

        FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES

        WHERE TABLE_NAME ILIKE '%Perm%'

        """

    df = session.sql(query).toPandas()

    

    for table in df.TABLE_NAME:

        try:

            query = f"""

                SELECT * FROM SHAREDB_SFI_MVP.DBO.{table}

                WHERE ORGPERMID = '4295875633'

                """

            _df = session.sql(query).toPandas()

            columns_with_name = [col for col in _df.columns if _df[col].astype(str).str.contains('ENI', case=False, na=False).any()]

            if len(_df) > 0 and len(columns_with_name) > 0:

                print("\n")

                print(f"Table {table}'s columns where 'ENI' appears:", columns_with_name)

        except Exception:

            print("\n")

            print(f"{table} didn't have a 'ORGPERMID' column")
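The column scan in `columns_with_name` is plain pandas; on a toy dataframe with made-up values, it works like this:

```python
import pandas as pd

# Toy dataframe with made-up values; only LEGALNAME mentions 'ENI'.
_df = pd.DataFrame({
    "LEGALNAME": ["ENI S.p.A."],
    "TICKER": ["E"],
    "COUNTRY": ["Italy"],
})

# Columns where any cell contains 'ENI', case-insensitively.
columns_with_name = [
    col for col in _df.columns
    if _df[col].astype(str).str.contains("ENI", case=False, na=False).any()
]
print(columns_with_name)  # → ['LEGALNAME']
```

The `astype(str)` cast keeps the scan from failing on numeric columns, and `na=False` treats missing cells as non-matches.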

I like the sound of 'LEGALNAME'; let's look into it. First let's return to our SQL worksheet:

and let's use


SELECT *

FROM SHAREDB_SFI_MVP.DBO.PERMORGINFO

LIMIT 10;

LEGALNAME is indeed a much better match; we were right to check all the table and column names where "ENI" is found, fully or partially, since it looks like we should focus on the table SHAREDB_SFI_MVP.DBO.PERMORGINFO.

While we looked for this OrgPermID, you saw that the data is split across several different databases. Finding the correct database is a little difficult at first, but once you get the hang of Python Worksheets, it becomes easier.

Now, going forward, we are looking for climate data. For this article, I was after Greenhouse Gas emissions specifically, often abbreviated to GHG (i.e.: GHGEmission). Looking into the myaccount.lseg.com/en/product/quantitative-analytics files related to Climate Data, I found promising information about the SFI_MASTER_MVP.DBO.CLMDataPoint table:

Let's look into SFI_MASTER_MVP.DBO.CLMDataPoint

Here's ORGPERMID again; let's use it. If you look through the documentation as we did to find SFI_MASTER_MVP.DBO.CLMDataPoint, you'll see that there is a table called SFI_MASTER_MVP.DBO.CLMITEM with item names in full; there I found 'ShortTermSet1GHGEmissionPercentageReductionTargeted', a good candidate for the item I'm interested in. We can pick the rows in CLMDataPoint that relate only to this item:


SELECT *

FROM SFI_MASTER_MVP.DBO.CLMDataPoint

WHERE ORGPERMID = 4295875633

AND ITEM IN (

    SELECT ITEM

    FROM SFI_MASTER_MVP.DBO.CLMITEM

    WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted'

);
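The `IN (subquery)` filter above maps neatly onto pandas `isin`; a self-contained sketch with made-up item codes (not the real CLMITEM values) shows the same selection:

```python
import pandas as pd

# Made-up stand-ins for CLMITEM and CLMDataPoint (item codes are invented).
clm_item = pd.DataFrame({
    "ITEM": [101, 102],
    "FEEDFIELDNAME": [
        "ShortTermSet1GHGEmissionPercentageReductionTargeted",
        "SomeOtherItem",
    ],
})
clm_datapoint = pd.DataFrame({
    "ORGPERMID": [4295875633] * 3,
    "ITEM": [101, 102, 101],
    "VALUEPRCNT": [20.0, 5.0, 25.0],
})

# Equivalent of: WHERE ITEM IN (SELECT ITEM FROM CLMITEM WHERE FEEDFIELDNAME = ...)
wanted = clm_item.loc[
    clm_item["FEEDFIELDNAME"]
    == "ShortTermSet1GHGEmissionPercentageReductionTargeted", "ITEM"]
subset = clm_datapoint[clm_datapoint["ITEM"].isin(wanted)]
print(subset)  # keeps only the two rows whose ITEM is 101
```

Pushing the filter to Snowflake with SQL is, of course, far more efficient on real tables; the pandas version is only here to make the subquery's logic concrete.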

Let's join our tables to return only the data we're after:


SELECT 

    p.ORGPERMID,

    p.LEGALNAME,

    c.FY,

    c.VALUEPRCNT

FROM 

    SHAREDB_SFI_MVP.DBO.PERMORGINFO p

JOIN 

    SFI_MASTER_MVP.DBO.CLMDataPoint c

    ON p.ORGPERMID = c.ORGPERMID

WHERE 

    c.ORGPERMID = 4295875633

    AND c.ITEM IN (

        SELECT ITEM

        FROM SFI_MASTER_MVP.DBO.CLMITEM

        WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted'

    );
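The SQL join above behaves like a pandas inner merge on ORGPERMID; a toy sketch with hypothetical rows (not real SFI data) illustrates the shape of the result:

```python
import pandas as pd

# Hypothetical stand-ins for PERMORGINFO and the filtered CLMDataPoint rows.
org = pd.DataFrame({"ORGPERMID": [4295875633], "LEGALNAME": ["ENI SpA"]})
clm = pd.DataFrame({
    "ORGPERMID": [4295875633, 4295875633],
    "FY": [2022, 2023],
    "VALUEPRCNT": [20.0, 25.0],
})

# Equivalent of the SQL JOIN ... ON p.ORGPERMID = c.ORGPERMID
result = org.merge(clm, on="ORGPERMID", how="inner")
print(result[["ORGPERMID", "LEGALNAME", "FY", "VALUEPRCNT"]])
```

Each CLMDataPoint row picks up the matching LEGALNAME, so one company with several fiscal years yields one output row per year, exactly as the SQL join does.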

 

 

Conclusion

 

In this Quick Start, we saw how to retrieve data from LSEG's SFI Snowflake Datawarehouse. As you could see, there are several ways to do this: via (i) Snowflake's Snowpark Python library, (ii) Snowflake SQL Worksheets and (iii) Snowflake Python Worksheets. Some will find the Snowpark Python library easier than SQL Worksheets, others the opposite; that choice speaks to the versatility of Snowflake and the number of ways you can retrieve data in the manner most efficient for your workflow.

 

For a fuller, end-to-end use-case example, please read the "LSEG's SFI Snowflake Datawarehouse: A use-case evaluating public companies' Climate track record using Python and SQL" article.