LSEG's SFI Snowflake Datawarehouse: A use-case evaluating public companies' Climate track record using Python and SQL

Authors:

Jonathan Legrand
Developer Advocate

In this article, we will investigate Public Companies' Climate targets and whether they are on track to meet them. The data needed for such an investigation is not onboarded on LSEG's Data Platform yet; but, thankfully, it is on LSEG's Sustainable Finance Investing (SFI) Snowflake.

LSEG collects data from a range of sources; for climate data, we rely on our analysts to retrieve the information directly from companies' disclosures. The most exciting recently onboarded data relates to ESG and Climate topics. By popular demand, we distribute Sustainable Finance and Investing (SFI) data via the Snowflake data warehouse. SFI Snowflake can be accessed via Snowflake's online platform or simply from Python via Snowflake's API wrapper named Snowpark. Datasets available include ESG, Climate, EU Taxonomy and Green Revenues data. Snowflake's API is a REST API, just like LSEG's Data Platform; and just as we wrapped LSEG's Data Platform REST API into LSEG's Data Library for Python, Snowflake wrapped their API into the snowpark module (in their Python library snowflake, with Scala and Java equivalents).

For more information on the data available on LSEG's SFI Snowflake, please look into myaccount.lseg.com/en/product/quantitative-analytics. There you will find all the documentation necessary to understand the data model we are leveraging. Please be aware that the SFI datasets are surfaced via the Quantitative Analytics (QA) product: LSEG collates several sets of data in tables in QA; some are packaged commercially under the name 'Datastream', others under 'Worldscope'. Among these many packages, we are only interested in 'SFI', which is easy to access using Snowflake. When logging into myaccount.lseg.com/en/product/quantitative-analytics, you can access PDF documentation files:

Personally, I'm a fan of accessing such information programmatically. In this article, I'll take you through how one can do just that.

Beginner's steps: SFI Snowflake using Snowpark to collect ESG data

We first need to import the `snowflake-snowpark-python` library. This library allows us to connect to Snowflake's services without having to deal with things like authentication tokens ourselves:

## pip install snowflake-snowpark-python

## pip install "snowflake-connector-python[pandas]"

from snowflake import snowpark

import pandas as pd

import os # This native library will allow us to collect authentication details from the machine where we're running our code.

import plotly.graph_objects as go # `plotly` and its modules are useful for us to plot graphs

import numpy as np # `numpy` in this case will be useful for us to plot graphs

In the cell below, we create the global variable g_db_session that will hold our authenticated session, so we can then use it as an object to access SFI data. To populate it with our authentication details, we create the get_session function:

# Global variable

g_db_session = None

 

def get_session(user, password):

    global g_db_session

    

    # Establish a connection and assign it to the global session variable

    connection_parameters = { 

        "account": "rddpreprod1.east-us-2.azure",

        "user": user,

        "password": password,

        "warehouse": "SFI_READER_S_WH", 

        "database": "SHAREDB_SFI_MVP",

        "role": "SNO_SNOREFDBD_SFI_READER_ROLE",

        "schema": "DBO",

        "authenticator": "externalbrowser"

    }

    

    g_db_session = snowpark.Session.builder.configs(connection_parameters).create() 

Let's pick an Organisation's Permanent IDentifier (OrgPermID) at random for now:

OrgPermID = "4295875633"

Note that we do not want to show our password or enter it anywhere visible to anyone reading our code. Therefore, instead of hard-coding our password in a string, we read it from an environment variable; and with the "externalbrowser" authenticator, the session created in the global g_db_session object opens a browser window where you can enter your credentials and log into Snowflake:

get_session("john.doe@domain.com", os.getenv('SSO_PASSWD'))

Now we can send our SQL query:

# Retrieve the data from Snowflake

query_esg = f"""

    SELECT S.FY, S.SCORECALCDT, S.SCORE, I.TITLE

    FROM ESG2SCORES S 

    JOIN ESG2ITEM I ON S.ITEM = I.ITEM 

    WHERE S.ORGPERMID = {OrgPermID} and I.ITEM IN (2,4,5,6,7,8,9,10,11,12,13,14,15,16,65)

    """

 

# Execute the query and get the DataFrame

df_esg = g_db_session.sql(query_esg).toPandas()

 

display(df_esg)

  FY SCORECALCDT SCORE TITLE
0 2008 12:43.4 0.950658 Workforce Score
1 2008 12:43.4 0.900794 Emissions Score
... ... ... ... ...
306 2017 10:38.0 0.898148 CSR Strategy Score
307 2017 10:38.0 0.422414 Shareholders Score
         
308 rows × 4 columns

# Pivot the DataFrame to get the desired format

pivot_df_esg = df_esg.pivot(index='FY', columns='TITLE', values='SCORE')

 

# Return most recent FY

pivot_df_esg = pivot_df_esg.iloc[[-1]].reset_index(drop=True)

pivot_df_esg.columns.name = None

 

display(pivot_df_esg)

  CSR Strategy Score Community Score ESG Combined Score Emissions Score Environment Pillar Score Environmental Innovation Score Governance Pillar Score Human Rights Score Management Score Product Responsibility Score Resource Use Score Shareholders Score Social Pillar Score Workforce Score
0 0.95 0.968165 0.481272 0.806513 0.827386 0.761364 0.84 0.909524 0.956 0.787698 0.895062 0.38 0.925768 0.990637

Intermediary steps: SFI Snowflake using the browser to collect Company data

Above, we picked a random OrgPermID, but how can we find which company it relates to? As it turns out, answering this question with SFI and SQL code takes us quite far.

When you open your Snowflake homepage, you ought to be greeted with something looking like:

The main difference should be that you have no ongoing projects.

What we're trying to do here is to find where company information lives, so that we can find out which company has OrgPermID '4295875633'. To do this, let's first select our Role (for more info on Roles, please read Snowflake's documentation):

Now that we have selected our role, let's dig through the databases we have access to:

In this article, we focus on SFI databases available on Snowflake, specifically "SHAREDB_SFI_MVP" & "SFI_MASTER_MVP":

I would advise starting with the SHAREDB_SFI_MVP.DBO.QASchTableInfo table (i.e.: the QASchTableInfo table in the DBO schema of the SHAREDB_SFI_MVP database):

We can see what data lives in this table with the 'Open in Worksheet' option:

We, unfortunately, have to pick our role again:

I would advise changing the default code too, since we don't have permission to modify any of the data in these tables. I go for:

SELECT TOP 10 * FROM SHAREDB_SFI_MVP.DBO.QASCHFIELDINFO

Snowflake Python Worksheets

Let's see what company is linked with this OrgPermID '4295875633'. To do so, we'll need to venture into the world of Python and Snowflake Worksheets. Open a new Python Worksheet:

Let's run the code we had above, but in Snowflake's Worksheets:

# The Snowpark package is required for Python Worksheets. 

# You can add more packages by selecting them using the Packages control and then importing them.

import snowflake.snowpark as snowpark

from snowflake.snowpark.functions import col

import pandas as pd

 

def main(session: snowpark.Session): 

 

    OrgPermID = "4295875633"

    

    query = f"""

    SELECT TABLE_NAME

    FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES

    WHERE TABLE_NAME ILIKE '%Perm%'

    """

    df = session.sql(query).toPandas()

    

    print(df)

- In line 9, `OrgPermID = "4295875633"` specifies the company we were searching for.

- Lines 11 through 15 contain our SQL query, which we send to Snowflake to run just as if we were in a SQL Worksheet as before. This code selects the names of all the tables that can be found in the SHAREDB_SFI_MVP database.

- Line 16's `session.sql(query).toPandas()` uses the session created in lines 1 to 8 to send the `query` and transform the result into a pandas dataframe. Pandas dataframes are easy to manipulate in Python, although you can choose other types to work with. We store this dataframe in the object `df`.

- In line 18, we simply `print` this dataframe so we can see it in the Output window below the code window in Snowflake's browser window.

You may wonder "why did we create Python code simply to send SQL code to Snowflake, when we could do this directly in Snowflake SQL Worksheets?". Great question; the reason is that we can't use loops in SQL Worksheets! In Python Worksheets, however, we can use the table names to search for the ORGPERMID '4295875633' by using the below in the `main` function:

    query = f"""

        SELECT TABLE_NAME

        FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES

        WHERE TABLE_NAME ILIKE '%Perm%'

        """

    df = session.sql(query).toPandas()

    

    for table in df.TABLE_NAME:

        try:

            query = f"""

                SELECT * FROM SHAREDB_SFI_MVP.DBO.{table}

                WHERE ORGPERMID = '4295875633'

                """

            _df = session.sql(query).toPandas()

            if len(_df) > 0:

                print("\n")

                print(f"{table}")

                print(_df.T)

        except:

            print("\n")

            print(f"{table} didn't have a 'ORGPERMID' column")

- In lines 9 to 13, we create our SQL query selecting the table names that were output before, all so as to have this list of tables in the dataframe `df` on line 14.

- In line 16, we start a loop over each table in our `df` and, in lines 18 to 22, create and run queries to collect data from these tables.

- Note that, in line 17, we start a try block. This is because some tables do not have an ORGPERMID column, and the query would fail on them; to handle this, we add an except clause in lines 27 to 29.

- Finally, in lines 23 to 26, we make sure that if we receive data from our SQL query, we display this data with `print` statements.

We can now see from the `PERMINSTRREF` table that the company is probably "ENI" (as highlighted in red boxes in the image above); but let's check. With the code below, I only output the table and column names where "ENI" is found, fully or partially:

    query = f"""

        SELECT TABLE_NAME

        FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES

        WHERE TABLE_NAME ILIKE '%Perm%'

        """

    df = session.sql(query).toPandas()

    

    for table in df.TABLE_NAME:

        try:

            query = f"""

                SELECT * FROM SHAREDB_SFI_MVP.DBO.{table}

                WHERE ORGPERMID = '4295875633'

                """

            _df = session.sql(query).toPandas()

            columns_with_name = [col for col in _df.columns if _df[col].astype(str).str.contains('ENI', case=False, na=False).any()]

            if len(_df) > 0 and len(columns_with_name) > 0:

                print("\n")

                print(f"Table {table}'s columns where 'ENI' appears:", columns_with_name)

        except:

            print("\n")

            print(f"{table} didn't have a 'ORGPERMID' column")

I like the sound of 'LEGALNAME'; let's look into it. First let's return to our SQL worksheet:

and let's use

SELECT *

FROM SHAREDB_SFI_MVP.DBO.PERMORGINFO

LIMIT 10;

LEGALNAME is indeed a much better match; we were right to check all the table and column names where "ENI" appears, fully or partially, since it looks like we should focus on the table SHAREDB_SFI_MVP.DBO.PERMORGINFO.

While we looked for this OrgPermID, you saw that the data is split between several different databases. Finding the correct database is a little difficult, but once you get the hang of Python Worksheets, it becomes easier.
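As an alternative to looping over tables with try/except, Snowflake exposes a standard INFORMATION_SCHEMA.COLUMNS view per database, which can tell us directly which tables contain an ORGPERMID column. Below is a minimal sketch under that assumption; `tables_with_column_query` is a hypothetical helper of mine, split out so the generated SQL can be inspected without a live session:

```python
def tables_with_column_query(database: str, column: str) -> str:
    """Build SQL listing every table in `database` that has a `column` column."""
    return f"""
        SELECT TABLE_NAME
        FROM {database}.INFORMATION_SCHEMA.COLUMNS
        WHERE COLUMN_NAME = '{column}'
        ORDER BY TABLE_NAME
        """

query = tables_with_column_query("SHAREDB_SFI_MVP", "ORGPERMID")
print(query)

# With a live Worksheet session, this would replace the whole try/except loop:
# df = session.sql(query).toPandas()
```

This avoids running a failing query against every table just to discover which ones carry the column.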

Now, going forward, we are looking for climate data, specifically Greenhouse Gas (GHG) emissions (i.e.: GHGEmission). Looking into the myaccount.lseg.com/en/product/quantitative-analytics files related to Climate data, I found promising information about the SFI_MASTER_MVP.DBO.CLMDataPoint table:

Let's look into SFI_MASTER_MVP.DBO.CLMDataPoint:

Here's ORGPERMID again; let's use it. If you look through the documentation as we did to find SFI_MASTER_MVP.DBO.CLMDataPoint, you'll see that there is a table called SFI_MASTER_MVP.DBO.CLMITEM with item names in full; there I found 'ShortTermSet1GHGEmissionPercentageReductionTargeted', a good candidate for the item I'm interested in. We can pick only the rows in CLMDataPoint that relate to this item:

SELECT *

FROM SFI_MASTER_MVP.DBO.CLMDataPoint

WHERE ORGPERMID = 4295875633

AND ITEM IN (

    SELECT ITEM

    FROM SFI_MASTER_MVP.DBO.CLMITEM

    WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted'

);

Let's join our tables to return only the data we're after:

SELECT 

    p.ORGPERMID,

    p.LEGALNAME,

    c.FY,

    c.VALUEPRCNT

FROM 

    SHAREDB_SFI_MVP.DBO.PERMORGINFO p

JOIN 

    SFI_MASTER_MVP.DBO.CLMDataPoint c

    ON p.ORGPERMID = c.ORGPERMID

WHERE 

    c.ORGPERMID = 4295875633

    AND c.ITEM IN (

        SELECT ITEM

        FROM SFI_MASTER_MVP.DBO.CLMITEM

        WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted'

    );

Advanced steps: SFI Snowflake using Python and Snowpark

We can now get into the meat of the subject and use Python to fetch the data we're after from SFI Snowflake, then create the output we're after: an investigation of public companies' Climate targets and whether they are on track to meet them. To do this, we'll try to create a graph exemplifying our investigation results.
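Before fetching anything, it is worth pinning down what "on track" could mean. One simple heuristic, sketched below on made-up numbers (the real base-year emissions and target percentages would come from the SFI queries that follow): draw a straight line from base-year emissions down to the targeted level in the target year, and flag a year as on track when reported emissions sit at or below that line:

```python
import pandas as pd

# Hypothetical inputs, standing in for values we will pull from SFI Snowflake
base_year, target_year = 2015, 2030
base_emissions = 100.0           # CO2e in the base year (arbitrary units)
target_reduction_pct = 30.0      # the company targets a 30% cut by the target year

# Straight-line trajectory from base-year emissions down to the targeted level
years = list(range(base_year, target_year + 1))
fraction_elapsed = [(y - base_year) / (target_year - base_year) for y in years]
trajectory = pd.Series(
    [base_emissions * (1 - target_reduction_pct / 100 * f) for f in fraction_elapsed],
    index=years,
)

# Toy "reported" emissions for a few fiscal years
reported = pd.Series({2015: 100.0, 2018: 95.0, 2021: 85.0})

# A year counts as on track when reported emissions sit at or below the trajectory
on_track = reported <= trajectory[reported.index]
print(on_track)
```

Linear interpolation is only one possible yardstick; companies may back-load their reductions, so treat this as a first-pass screen rather than a verdict.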

We can do everything we did above, but using Python, e.g.: get the tables of SHAREDB_SFI_MVP:

query = f"""

    SELECT TABLE_NAME

    FROM SHAREDB_SFI_MVP.INFORMATION_SCHEMA.TABLES

    WHERE TABLE_NAME ILIKE '%Perm%'

    """

 

df = g_db_session.sql(query).toPandas()

 

display(df)

  TABLE_NAME
0 CLMORGPERMAP
1 ESG2ORGPERMAP
2 ESG2ORGPERMAP_CHANGES
3 PERMINSTRINFO
4 PERMINSTRREF
5 PERMORGINFO
6 PERMORGREF
7 PERMQUOTEREF
8 PERMSECMAPX
9 PERMCODE
10 PERMQUOTEINFO

Look through these tables to find the table we're after:

dfs = {}

for table in df.TABLE_NAME:

    try:

        query = f"""

            SELECT * FROM SHAREDB_SFI_MVP.DBO.{table}

            WHERE ORGPERMID = '4295875633'

            """

        _df = g_db_session.sql(query).toPandas()

        dfs[f"{table}"] = _df

        if len(_df) > 0:

            print(f"{table}")

            display(_df)

    except:

        print(f"{table} didn't have a 'ORGPERMID' column")

CLMORGPERMAP                                  
  ORGPERMID PERID PERENDDT                            
0 4295875633 94 31-12-11                            
1 4295875633 438 31-12-20                            
                           
20 4295875633 456 31-12-21                            
21 4295875633 123 31-12-12                            
                                   
ESG2ORGPERMAP                                  
                                   
  ORGPERMID PERID PERENDDT                            
0 4295875633 84 31-12-09                            
1 4295875633 36 31-12-04                            
                           
20 4295875633 382 31-12-17                            
21 4295875633 438 31-12-20                            
                                   
PERMINSTRINFO didn't have a 'ORGPERMID' column                                  
PERMINSTRREF                                  
                                   
  INSTRPERMID ISPRIMARY STATUS TYPECODE COMNAME CURRPERMID ORGPERMID PRITRDQUOTEPERMID VALQUOTEPERMID WORLDSCOPECMPID PRIRPTENTITYCODE PRIESTPERMID LIPPERID EJVASSETID ADRRATIO    
0 21475235587 FALSE DC EQUITY ENI Spa Common Stock 500110 4295875633 NaN 2.15E+10 None A49DF 30064794631 NaN None None    
1 8590925021 FALSE AC EQUITY ENI ADR 500110 4295875633 5.58E+10 5.58E+10 26874R108 A12C4 30064836885 NaN   2    
   
9 21475351836 FALSE DC EQUITY ENI Preferred Stock 500110 4295875633 NaN 2.15E+10 None A49DF 30064794631 NaN None None    
10 8590568485 FALSE AC EQUITY ENI CEDEAR 500114 4295875633 NaN 5.58E+10   A49DF 30064794631 NaN   0.25    
                                   
PERMORGINFO                                  
                                   
  ORGPERMID ADDRTYPECODE ISPUBLIC IPODATE ISO2CNTRYCODE LEGALNAME ADDRLINE1 ADDRLINE2 CITY STATEPROV POSTALCODE PRITRBCINDPERMID CNTRYGEOID        
0 4295875633 2 TRUE 28-11-95 IT ENI S.P.A. Piazzale Enrico Mattei 1   ROMA ROMA 144 4294951998 100145        
1 4295875633 1 TRUE 28-11-95 IT ENI S.P.A. Piazza Ezio Vanoni, 1   SAN DONATO MILANESE MILANO 20097 4294951998 100145        
                                   
PERMORGREF                                  
                                   
  ORGPERMID STATUS TYPECODE DOMCNTRYPERMID INCCNTRYPERMID PRIINSTRPERMID VALQUOTEPERMID COMNAME IMMEDIATEPARENTORGPERMID ULTIMATEPARENTORGPERMID PRIESTPERMID WORLDSCOPECMPID PRIRPTENTITYCODE LEI CIK LIPPERID ORGANIZATIONPI
0 4295875633 Act COM 100145 100145 8590929140 55837437646 Eni SpA 4295875633 4295875633 30064794631 C380C0390 A49DF BUCRF72VH5RBN7X3VL35   NaN 31644
                                   
PERMQUOTEREF didn't have a 'ORGPERMID' column                                  
PERMSECMAPX didn't have a 'ORGPERMID' column                                  
PERMCODE didn't have a 'ORGPERMID' column                                  
PERMQUOTEINFO didn't have a 'ORGPERMID' column

dfs = {}

for table in df.TABLE_NAME:

    try:

        query = f"""

            SELECT * FROM SHAREDB_SFI_MVP.DBO.{table}

            WHERE ORGPERMID = '4295875633'

            """

        _df = g_db_session.sql(query).toPandas()

        dfs[f"{table}"] = _df

        columns_with_name = [col for col in _df.columns if _df[col].astype(str).str.contains('ENI', case=False, na=False).any()]

        if len(_df) > 0 and len(columns_with_name) > 0:

            print(f"Table {table}'s columns where 'ENI' appears:", columns_with_name)

    except:

        print(f"{table} didn't have a 'ORGPERMID' column")

PERMINSTRINFO didn't have a 'ORGPERMID' column
Table PERMINSTRREF's columns where 'ENI' appears: ['COMNAME']
Table PERMORGINFO's columns where 'ENI' appears: ['LEGALNAME']
Table PERMORGREF's columns where 'ENI' appears: ['COMNAME']
PERMQUOTEREF didn't have a 'ORGPERMID' column
PERMSECMAPX didn't have a 'ORGPERMID' column
PERMCODE didn't have a 'ORGPERMID' column
PERMQUOTEINFO didn't have a 'ORGPERMID' column

 

 

We can look for Climate data as we did before too:

query_climate = f"""

    SELECT *

    FROM SFI_MASTER_MVP.DBO.CLMDataPoint

    WHERE ORGPERMID = {OrgPermID}

    AND ITEM IN (

        SELECT ITEM

        FROM SFI_MASTER_MVP.DBO.CLMITEM

        WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted'

    );

    """

 

# Execute the query and get the DataFrame

df_climate1 = g_db_session.sql(query_climate).toPandas()

 

display(df_climate1)

 

  ORGPERMID ITEM FY VALUERECDT UNITPCT UNITPERMID ISOCURRCODE VALUE_ VALUEDATE VALUEFLG VALUETEXT VALUEPRCNT
0 4.3E+09 212 2022 28:33.6 NaN NaN None NaN NaT None None 5000
1 4.3E+09 212 2021 11:48.1 NaN NaN None NaN NaT None None 5000
2 4.3E+09 212 2020 42:59.0 NaN NaN None NaN NaT None None 5000

Let's collect only the data we're after:

df_climate2 = df_climate1[['ORGPERMID', 'FY', 'VALUERECDT', 'VALUEPRCNT']]

df_climate2

  ORGPERMID FY VALUERECDT VALUEPRCNT
0 4.3E+09 2022 28:33.6 5000
1 4.3E+09 2021 11:48.1 5000
2 4.3E+09 2020 42:59.0 5000

Now let's collect company info too, as we did above, in Snowflake's SQL Worksheets:

query_climate = f"""

    SELECT

        c.FY,

        p.ORGPERMID,

        p.LEGALNAME,

        c.VALUEPRCNT

    FROM 

        SHAREDB_SFI_MVP.DBO.PERMORGINFO p

    JOIN 

        SFI_MASTER_MVP.DBO.CLMDataPoint c

        ON p.ORGPERMID = c.ORGPERMID

    WHERE 

        c.ORGPERMID = 4295875633

        AND c.ITEM IN (

            SELECT ITEM

            FROM SFI_MASTER_MVP.DBO.CLMITEM

            WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted'

        );

    """

 

# Execute the query and get the DataFrame

df_climate1 = g_db_session.sql(query_climate).toPandas()

 

display(df_climate1)

  FY ORGPERMID LEGALNAME VALUEPRCNT
0 2021 4.3E+09 ENI S.P.A. 5000
1 2021 4.3E+09 ENI S.P.A. 5000
2 2022 4.3E+09 ENI S.P.A. 5000
3 2022 4.3E+09 ENI S.P.A. 5000
4 2020 4.3E+09 ENI S.P.A. 5000
5 2020 4.3E+09 ENI S.P.A. 5000

Note that now we look for companies using their LEGALNAME with the below (note how, in line 13, we're looking for the name of a company, no longer the OrgPermID):

query_climate = f"""

    SELECT

        c.FY,

        p.ORGPERMID,

        p.LEGALNAME,

        c.VALUEPRCNT

    FROM 

        SHAREDB_SFI_MVP.DBO.PERMORGINFO p

    JOIN 

        SFI_MASTER_MVP.DBO.CLMDataPoint c

        ON p.ORGPERMID = c.ORGPERMID

    WHERE 

        p.LEGALNAME ILIKE '%Total%'

        AND c.ITEM IN (

            SELECT ITEM

            FROM SFI_MASTER_MVP.DBO.CLMITEM

            WHERE FEEDFIELDNAME = 'ShortTermSet1GHGEmissionPercentageReductionTargeted'

        );

    """

 

# Execute the query and get the DataFrame

df_climate1 = g_db_session.sql(query_climate).toPandas()

# df_climate1.drop_duplicates(subset=['FY']).sort_values(by=['FY'])

display(df_climate1)

  FY ORGPERMID LEGALNAME VALUEPRCNT
0 2022 5E+09 TotalEnergies SE 1500
1 2022 5E+09 TotalEnergies SE 1500

Now I'm after another field of interest that I think ought to be in the same table; let's JOIN with SFI_MASTER_MVP.DBO.CLMITEM for all relevant FEEDFIELDNAMEs:

query_climate = f"""

    SELECT *

    FROM 

        SHAREDB_SFI_MVP.DBO.PERMORGINFO p

    JOIN 

        SFI_MASTER_MVP.DBO.CLMDataPoint c

        ON p.ORGPERMID = c.ORGPERMID

    JOIN

        SFI_MASTER_MVP.DBO.CLMITEM k

        ON c.ITEM = k.ITEM

    WHERE 

        p.LEGALNAME ILIKE '%TotalEnergies%'

        AND c.ITEM IN (

            SELECT ITEM

            FROM SFI_MASTER_MVP.DBO.CLMITEM

        );

    """

 

# Execute the query and get the DataFrame

df_climate2 = g_db_session.sql(query_climate).toPandas()

# df_climate2.drop_duplicates(subset=['FY']).sort_values(by=['FY'])

display(df_climate2)

  ORGPERMID ADDRTYPECODE ISPUBLIC IPODATE ISO2CNTRYCODE LEGALNAME ADDRLINE1 ADDRLINE2 CITY STATEPROV ... TITLE DESCRIPTION DATATYPECODE TABLECODE HEIRARCHYCODE UNITS POLARITY INCLUDEDINSCORES PILLARITEM CATITEM
0 5E+09 2 TRUE 29-05-15 MA TotalEnergies Marketing Maroc SA 146, boulevard Zerktouni   CASABLANCA GRAND CASABLANCA ... Short Term Set 1 Upstream Fuel and Energy Rela... Short-Term Targets Set 1: Does the company rep... 1 3 1 Y/N Not applicable None 4 1
1 5E+09 1 TRUE 26-09-73 FR TotalEnergies SE 2 Place Jean Millier Paris la Defense cedex PARIS ILE-DE-FRANCE ... Long Term Set 2 Downstream Investments Long-Term Targets Set 2: Does the company repo... 1 3 1 Y/N Not applicable None 4 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2724 5E+09 1 TRUE 26-09-73 FR TotalEnergies SE 2 Place Jean Millier Paris la Defense cedex PARIS ILE-DE-FRANCE ... Short Term Set 2 Upstream Other Short-Term Targets Set 2: Does the company rep... 1 3 1 Y/N Not applicable None 4 1
2725 5E+09 1 TRUE 26-09-73 FR TotalEnergies SE 2 Place Jean Millier Paris la Defense cedex PARIS ILE-DE-FRANCE ... Long Term Set 3 GHG Emission Target Type Long-Term Targets Set 3: What is the long-term... 6 3 1 Name Not applicable None 4 1
                                           
2726 rows × 37 columns                                    

I can't see all the column names like this; let's look at the transpose:

df_climate2.T

 

 

 

ORGPERMID ADDRTYPECODE ISPUBLIC IPODATE ISO2CNTRYCODE LEGALNAME ADDRLINE1 ADDRLINE2 CITY STATEPROV POSTALCODE PRITRBCINDPERMID CNTRYGEOID ORGPERMID ITEM FY VALUERECDT UNITPCT UNITPERMID ISOCURRCODE VALUE_ VALUEDATE VALUEFLG VALUETEXT VALUEPRCNT ITEM FEEDFIELDNAME TITLE DESCRIPTION DATATYPECODE TABLECODE HEIRARCHYCODE UNITS POLARITY INCLUDEDINSCORES PILLARITEM CATITEM
0 5000180598 2 TRUE 29-05-15 00:00 MA TotalEnergies Marketing Maroc SA 146, boulevard Zerktouni   CASABLANCA GRAND CASABLANCA 20000 4294951991 100197 5000180598 190 2022 NaT NaN NaN None NaN NaT FALSE None NaN 190 ShortTermSet1UpstreamFuelandEnergyRelatedActiv... Short Term Set 1 Upstream Fuel and Energy Rela... Short-Term Targets Set 1: Does the company rep... 1 3 1 Y/N Not applicable None 4 1
1 5001170594 1 TRUE 26-09-73 00:00 FR TotalEnergies SE 2 Place Jean Millier Paris la Defense cedex PARIS ILE-DE-FRANCE 92078 4294951998 100105 5001170594 287 2022 NaT NaN NaN None NaN NaT FALSE None NaN 287 LongTermSet2DownstreamInvestments Long Term Set 2 Downstream Investments Long-Term Targets Set 2: Does the company repo... 1 3 1 Y/N Not applicable None 4 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2724 5001170594 1 TRUE 26-09-73 00:00 FR TotalEnergies SE 2 Place Jean Millier Paris la Defense cedex PARIS ILE-DE-FRANCE 92078 4294951998 100105 5001170594 231 2022 NaT NaN NaN None NaN NaT FALSE None NaN 231 ShortTermSet2UpstreamOther Short Term Set 2 Upstream Other Short-Term Targets Set 2: Does the company rep... 1 3 1 Y/N Not applicable None 4 1
2725 5001170594 1 TRUE 26-09-73 00:00 FR TotalEnergies SE 2 Place Jean Millier Paris la Defense cedex PARIS ILE-DE-FRANCE 92078 4294951998 100105 5001170594 327 2022 32:16.5 NaN NaN None NaN NaT None Intensity NaN 327 LongTermSet3GHGEmissionTargetType Long Term Set 3 GHG Emission Target Type Long-Term Targets Set 3: What is the long-term... 6 3 1 Name Not applicable None 4 1

Looks like there are many duplicates; let's remove them, along with rows where ITEM and VALUEPRCNT are empty:

df_climate2.drop_duplicates().sort_values(by=['FY']).dropna(subset=['ITEM', 'VALUEPRCNT']).T

 

 

 

ORGPERMID ADDRTYPECODE ISPUBLIC IPODATE ISO2CNTRYCODE LEGALNAME ADDRLINE1 ADDRLINE2 CITY STATEPROV POSTALCODE PRITRBCINDPERMID CNTRYGEOID ORGPERMID ITEM FY VALUERECDT UNITPCT UNITPERMID ISOCURRCODE VALUE_ VALUEDATE VALUEFLG VALUETEXT VALUEPRCNT ITEM FEEDFIELDNAME TITLE DESCRIPTION DATATYPECODE TABLECODE HEIRARCHYCODE UNITS POLARITY INCLUDEDINSCORES PILLARITEM CATITEM
2237 5001170594 2 TRUE 26-09-73 00:00 FR TotalEnergies SE La Defense 6 2 Pl Jean Millier COURBEVOIE ILE-DE-FRANCE 92400 4294951998 100105 5001170594 129 2017 06:29.7 NaN NaN None NaN NaT None None 80 129 EmissionReductionTargetPercentage Emission Reduction Target Percentage Percentage of emission reduction target set by... 3 3 1 Percentage Positive None 4 1
8 5001170594 1 TRUE 26-09-73 00:00 FR TotalEnergies SE 2 Place Jean Millier Paris la Defense cedex PARIS ILE-DE-FRANCE 92078 4294951998 100105 5001170594 129 2017 06:29.7 NaN NaN None NaN NaT None None 80 129 EmissionReductionTargetPercentage Emission Reduction Target Percentage Percentage of emission reduction target set by... 3 3 1 Percentage Positive None 4 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
743 5000180598 2 TRUE 29-05-15 00:00 MA TotalEnergies Marketing Maroc SA 146, boulevard Zerktouni   CASABLANCA GRAND CASABLANCA 20000 4294951991 100197 5000180598 269 2023 40:02.7 NaN NaN None NaN NaT None None 10000 269 LongTermSet1GHGEmissionPercentageReductionTarg... Long Term Set 1 GHG Emission Percentage Reduct... Long-Term Targets Set 1: Long-term GHG emissio... 3 3 1 Percentage Positive None 4 1
742 5000180598 1 TRUE 29-05-15 00:00 MA TotalEnergies Marketing Maroc SA 146, boulevard Zerktouni   CASABLANCA GRAND CASABLANCA 20000 4294951991 100197 5000180598 269 2023 40:02.7 NaN NaN None NaN NaT None None 10000 269 LongTermSet1GHGEmissionPercentageReductionTarg... Long Term Set 1 GHG Emission Percentage Reduct... Long-Term Targets Set 1: Long-term GHG emissio... 3 3 1 Percentage Positive None 4 1

'CO2e' and 'CO2e per Tonne of Revenue'

FEEDFIELDNAMEs ending with 'GHGEmissionCoveredbyTarget' show the share of emissions that the reduction target applies to, while those ending with 'PercentageReductionTargeted' give the reduction promised on that covered share. I.e.: if a company targets a 50% reduction on the 30% of its emissions covered by the target, it reduces total emissions by 15%; here 'GHGEmissionCoveredbyTarget' is 30% and 'PercentageReductionTargeted' is 50%.

This is all relative to the Scope; i.e.: in the example above, the company reduced its 'Scope x' emissions by 15% in total, for x in {1, 2, 3, 1&2, 1&2&3}.
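To make the arithmetic concrete, here is the example above in code (the numbers are the illustrative ones from the text, not real data):

```python
# Share of emissions the target applies to ('GHGEmissionCoveredbyTarget')
coverage_pct = 30.0
# Reduction promised on that covered share ('PercentageReductionTargeted')
reduction_pct = 50.0

# Overall reduction across ALL emissions in the relevant scope
overall_pct = coverage_pct * reduction_pct / 100
print(f"{overall_pct}% overall reduction")  # 15.0% overall reduction
```

Keeping the two percentages straight matters: a 50% cut on 30% coverage is a 15% cut overall, not 50%.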

To make things harder, this is also relative to the measure a company decides to target. In this article, we focus on only two: 'CO2e' (i.e.: tonnes of carbon dioxide equivalent emissions) and 'CO2e per Tonne of Revenue' (i.e.: tonnes of carbon dioxide equivalent emissions per million of revenue, say per 1,000,000 $ or 1,000,000 € of revenue).

We are only interested in two VALUETEXT values, 'CO2' and 'CO2 Per Tonne of Revenue'. I happen to know that Microsoft reports all types, so it makes a good use case that covers all permutations, both in terms of VALUETEXT and Scope (we already know which columns are available in these tables from the dataframe above, so let's only pick the ones we're after in lines 3 to 8):

query_climate = f"""

    SELECT

        c.FY, p.LEGALNAME, p.ORGPERMID, c.VALUEPRCNT,

        k.ITEM, c.VALUE_, k.INCLUDEDINSCORES,

        k.POLARITY, k.UNITS, k.TABLECODE,

        k.DATATYPECODE, k.DESCRIPTION,

        k.TITLE, k.FEEDFIELDNAME, c.VALUEPRCNT,

        c.VALUETEXT, c.VALUEFLG, c.VALUEDATE

    FROM 

        SHAREDB_SFI_MVP.DBO.PERMORGINFO p

    JOIN 

        SFI_MASTER_MVP.DBO.CLMDataPoint c

        ON p.ORGPERMID = c.ORGPERMID

    JOIN

        SFI_MASTER_MVP.DBO.CLMITEM k

        ON c.ITEM = k.ITEM

    WHERE 

        p.LEGALNAME ILIKE '%Microsoft%'

        AND c.ITEM IN (

            SELECT ITEM

            FROM SFI_MASTER_MVP.DBO.CLMITEM

            WHERE (VALUETEXT = 'CO2' OR VALUETEXT = 'CO2 Per Tonne of Revenue') -- This line makes sure we only have 'CO2' OR 'CO2 Per Tonne of Revenue' publishing companies

        );

    """

df_climate3 = g_db_session.sql(query_climate).toPandas()

display(df_climate3)

 

FY LEGALNAME ORGPERMID VALUEPRCNT ITEM VALUE_ INCLUDEDINSCORES POLARITY UNITS TABLECODE DATATYPECODE DESCRIPTION TITLE FEEDFIELDNAME VALUEPRCNT VALUETEXT VALUEFLG VALUEDATE
0 2020 MICROSOFT CORPORATION 4295907168 NaN 354 NaN None Not applicable Name 3 6 Long-Term Targets Set 4: What is the long-term... Long Term Set 4 GHG Emission Target Unit LongTermSet4GHGEmissionTargetUnit NaN CO2 None NaT
1 2020 MICROSOFT CORPORATION 4295907168 NaN 354 NaN None Not applicable Name 3 6 Long-Term Targets Set 4: What is the long-term... Long Term Set 4 GHG Emission Target Unit LongTermSet4GHGEmissionTargetUnit NaN CO2 None NaT
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
30 2022 MICROSOFT CORPORATION 4295907168 NaN 382 NaN None Not applicable Name 3 6 Long-Term Targets Set 5: What is the long-term... Long Term Set 5 GHG Emission Target Unit LongTermSet5GHGEmissionTargetUnit NaN CO2 None NaT
31 2022 MICROSOFT CORPORATION 4295907168 NaN 382 NaN None Not applicable Name 3 6 Long-Term Targets Set 5: What is the long-term... Long Term Set 5 GHG Emission Target Unit LongTermSet5GHGEmissionTargetUnit NaN CO2 None NaT
df_climate3.VALUETEXT.unique()

array(['CO2', 'CO2 Per Tonne of Revenue'], dtype=object)


We can access different datasets related to companies' GHG emissions:

  • LSEG D&A: Detailed climate data collected, quality controlled and normalized by LSEG from companies' public disclosures, including the auditability back to source documents
  • CDP: Climate data as provided by the companies in response to CDP annual survey
  • Combined: the “Combined” dataset gathers LSEG data, CDP data and estimated data, based on our robust proprietary multi-model estimation approach, to provide the best available data for each company. Reported data can be discarded following additional quality checks (including a materiality filter for Scope 3).

For the present use case, as we’re relying on targets disclosed by the companies and collected by our analysts, let’s focus on LSEG D&A reported data.

Depending on the scope of the target (1, 2, 3, 1+2 or 1+2+3) and the unit (CO2e or CO2e per tonne of revenues), we won’t rely on the same datapoint to compute our analysis. The table below shows the direct datapoints to be used (LSEG Workspace code as well as LSEG Data Platform Bulk Field Names):

| Scope | CO2 Workspace Code | CO2 Data Platform Bulk Field Name | CO2 per Tonne of Revenue Workspace Code | CO2 per Tonne of Revenue Data Platform Bulk Field Name |
| --- | --- | --- | --- | --- |
| Scope 1 | TR.CO2DirectScope1 | CO2EquivalentsEmissionDirectScope1 | TR.AnalyticGHGEmissionsDirectScope1 | GHGEmissionsDirectScope1toRevenue |
| Scope 2 | TR.CO2IndirectScope2 | CO2EquivalentsEmissionIndirectScope2 | TR.AnalyticGHGEmissionsIndirectScope2 | GHGEmissionsIndirectScope2toRevenue |
| Scope 3 | TR.CO2IndirectScope3 | CO2EquivalentsEmissionIndirectScope3 | TR.AnalyticCO2IndirectScope3 | Scope3ToRevenues |
| Scope 1 and 2 | TR.CO2EmissionTotal | CO2EquivalentsEmissionTotal | TR.AnalyticCO2 | TotalCO2EquivalentsEmissionsToRevenues |
| Scope 1 and 2 and 3 | TR.TotalCO2EquivalentEmissionsScope1and2and3 | TotalCO2EquivalentEmissionsScope1andScope2andScope3 | TR.AnalyticGHGEmissionsScope1and2and3 | GHGEmissionsScope1andScope2andScope3toRevenue |
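For convenience in code, these mappings can be kept in a small Python dictionary. This is only a suggested layout (the `EMISSION_FIELDS` name is mine); the field names themselves come from the table above:

```python
# Map each target scope to its LSEG Data Platform Bulk Field Names:
# (absolute CO2e emissions, CO2e emissions per tonne of revenue).
EMISSION_FIELDS = {
    "Scope 1": ("CO2EquivalentsEmissionDirectScope1",
                "GHGEmissionsDirectScope1toRevenue"),
    "Scope 2": ("CO2EquivalentsEmissionIndirectScope2",
                "GHGEmissionsIndirectScope2toRevenue"),
    "Scope 3": ("CO2EquivalentsEmissionIndirectScope3",
                "Scope3ToRevenues"),
    "Scope 1 and 2": ("CO2EquivalentsEmissionTotal",
                      "TotalCO2EquivalentsEmissionsToRevenues"),
    "Scope 1 and 2 and 3": ("TotalCO2EquivalentEmissionsScope1andScope2andScope3",
                            "GHGEmissionsScope1andScope2andScope3toRevenue"),
}

# E.g.: pick the right fields for a 'Scope 1 and 2' target
absolute_field, intensity_field = EMISSION_FIELDS["Scope 1 and 2"]
```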

Let's focus on one use case for this article: the companies that publish Reported Direct Long Term Set 1 GHG (greenhouse gas) emissions. Let's also only get the data points we're interested in, listed in `FEEDFIELDNAME_of_interest`:

FEEDFIELDNAME_of_interest = ['CO2EquivalentsEmissionDirectScope1', 'CO2EquivalentsEmissionIndirectScope2', 'CO2EquivalentsEmissionIndirectScope3', 'CO2EquivalentsEmissionTotal', 'TotalCO2EquivalentEmissionsScope1andScope2andScope3']
FEEDFIELDNAME_of_interest_SQL_str = ' OR '.join([f"(FEEDFIELDNAME = '{i}')" for i in FEEDFIELDNAME_of_interest])
FEEDFIELDNAME_of_interest_SQL_str

"(FEEDFIELDNAME = 'CO2EquivalentsEmissionDirectScope1') OR (FEEDFIELDNAME = 'CO2EquivalentsEmissionIndirectScope2') OR (FEEDFIELDNAME = 'CO2EquivalentsEmissionIndirectScope3') OR (FEEDFIELDNAME = 'CO2EquivalentsEmissionTotal') OR (FEEDFIELDNAME = 'TotalCO2EquivalentEmissionsScope1andScope2andScope3')"
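As an aside, the same filter can be expressed as a single SQL `IN (...)` list instead of chained `OR`s; the two forms are equivalent here, so this is purely a matter of taste:

```python
FEEDFIELDNAME_of_interest = [
    'CO2EquivalentsEmissionDirectScope1',
    'CO2EquivalentsEmissionIndirectScope2',
    'CO2EquivalentsEmissionIndirectScope3',
    'CO2EquivalentsEmissionTotal',
    'TotalCO2EquivalentEmissionsScope1andScope2andScope3']

# Quote each name and join them into a single IN (...) predicate
in_list = ', '.join(f"'{i}'" for i in FEEDFIELDNAME_of_interest)
FEEDFIELDNAME_in_clause = f"FEEDFIELDNAME IN ({in_list})"
print(FEEDFIELDNAME_in_clause)
```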

query_climate = f"""
    SELECT
        c.FY, p.LEGALNAME, p.ORGPERMID,
        k.ITEM, k.INCLUDEDINSCORES,
        k.POLARITY, k.UNITS, k.TABLECODE,
        k.DATATYPECODE, k.DESCRIPTION,
        k.TITLE, k.FEEDFIELDNAME, c.VALUE_,
        c.VALUEPRCNT, c.VALUETEXT,
        c.VALUEFLG, c.VALUEDATE
    FROM
        SHAREDB_SFI_MVP.DBO.PERMORGINFO p
    JOIN
        SFI_MASTER_MVP.DBO.CLMDataPoint c
        ON p.ORGPERMID = c.ORGPERMID
    JOIN
        SFI_MASTER_MVP.DBO.CLMITEM k
        ON c.ITEM = k.ITEM
    WHERE
        p.LEGALNAME ILIKE '%Microsoft%'
        AND c.ITEM IN (
            SELECT ITEM
            FROM SFI_MASTER_MVP.DBO.CLMITEM
            WHERE {FEEDFIELDNAME_of_interest_SQL_str}
        );
    """

df_climate4 = g_db_session.sql(query_climate).toPandas()
df_climate4.head(3)

  FY LEGALNAME ORGPERMID ITEM INCLUDEDINSCORES POLARITY UNITS TABLECODE DATATYPECODE DESCRIPTION TITLE FEEDFIELDNAME VALUE_ VALUEPRCNT VALUETEXT VALUEFLG VALUEDATE
0 2010 MICROSOFT CORPORATION 4295907168 12 None Negative Tonnes 3 3 Indirect of CO2 and CO2 equivalents emission i... CO2 Equivalent Emissions Indirect, Scope 2 CO2EquivalentsEmissionIndirectScope2 1035385 NaN None None NaT
1 2010 MICROSOFT CORPORATION 4295907168 12 None Negative Tonnes 3 3 Indirect of CO2 and CO2 equivalents emission i... CO2 Equivalent Emissions Indirect, Scope 2 CO2EquivalentsEmissionIndirectScope2 1035385 NaN None None NaT
2 2013 MICROSOFT CORPORATION 4295907168 11 None Negative Tonnes 3 3 Direct of CO2 and CO2 equivalents emission in ... CO2 Equivalent Emissions Direct, Scope 1 CO2EquivalentsEmissionDirectScope1 78116 NaN None None NaT
df_climate4.pivot_table(
    index='FY',
    columns='FEEDFIELDNAME',
    values=['VALUE_', 'VALUEPRCNT', 'VALUEFLG', 'VALUEDATE', 'VALUETEXT'],
    aggfunc={
        'VALUE_': 'sum',  # for numeric columns, we can use sum, mean, etc.
        'VALUEPRCNT': 'mean',
        'VALUEFLG': 'sum',
        'VALUEDATE': 'first',  # assuming we want the first value in each group
        'VALUETEXT': 'first'
    }
)

  VALUEFLG VALUE_
FEEDFIELDNAME CO2EquivalentsEmissionDirectScope1 CO2EquivalentsEmissionIndirectScope2 CO2EquivalentsEmissionIndirectScope3 CO2EquivalentsEmissionTotal TotalCO2EquivalentEmissionsScope1andScope2andScope3 CO2EquivalentsEmissionDirectScope1 CO2EquivalentsEmissionIndirectScope2 CO2EquivalentsEmissionIndirectScope3 CO2EquivalentsEmissionTotal TotalCO2EquivalentEmissionsScope1andScope2andScope3
FY                    
2005 0 0 NaN 0 NaN 26370 483460 NaN 509830 NaN
2006 0 0 NaN 0 NaN 29460 275740 NaN 305200 NaN
2021 0 0 0 0 0 247408 10021340 25022000 10268740 36311144
2022 0 0 0 0 0 278826 12762500 25142000 13041320 39665326

Clearly, the item of interest here is 'VALUE_':

_df_climate = df_climate4.pivot_table(
    index='FY',
    columns='FEEDFIELDNAME',
    values=['VALUE_'])['VALUE_']
_df_climate

FEEDFIELDNAME CO2EquivalentsEmissionDirectScope1 CO2EquivalentsEmissionIndirectScope2 CO2EquivalentsEmissionIndirectScope3 CO2EquivalentsEmissionTotal TotalCO2EquivalentEmissionsScope1andScope2andScope3
FY          
2005 13185 241730 NaN 254915 NaN
2006 14730 137870 NaN 152600 NaN
2021 123704 5010670 12511000 5134370 18155572
2022 139413 6381250 12571000 6520660 19832663

Let's add 'GHGEmissionCoveredbyTarget' and 'PercentageReductionTargeted', as well as LongTermSet1GHGEmissionScope, LongTermSet2GHGEmissionScope, etc.:

query_climate = f"""
    SELECT
        c.FY, p.LEGALNAME, p.ORGPERMID, c.VALUEPRCNT,
        k.ITEM, c.VALUE_, k.INCLUDEDINSCORES,
        k.POLARITY, k.UNITS, k.TABLECODE,
        k.DATATYPECODE, k.DESCRIPTION,
        k.TITLE, k.FEEDFIELDNAME, c.VALUEPRCNT,
        c.VALUETEXT, c.VALUEFLG, c.VALUEDATE
    FROM
        SHAREDB_SFI_MVP.DBO.PERMORGINFO p
    JOIN
        SFI_MASTER_MVP.DBO.CLMDataPoint c
        ON p.ORGPERMID = c.ORGPERMID
    JOIN
        SFI_MASTER_MVP.DBO.CLMITEM k
        ON c.ITEM = k.ITEM
    WHERE
        p.LEGALNAME ILIKE '%Microsoft%'
        AND c.ITEM IN (
            SELECT ITEM
            FROM SFI_MASTER_MVP.DBO.CLMITEM
            WHERE (
                FEEDFIELDNAME ILIKE '%GHGEmissionCoveredbyTarget%')
                OR (FEEDFIELDNAME ILIKE '%PercentageReductionTargeted%')
                OR (FEEDFIELDNAME ILIKE '%LongTermSet%GHGEmissionScope%')
                OR (FEEDFIELDNAME ILIKE '%LongTermSet%GHGEmissionBaseYear%')
                OR (FEEDFIELDNAME ILIKE '%LongTermSet%GHGEmissionTargetYear%')
                OR (FEEDFIELDNAME ILIKE '%CO2EquivalentEmissionsTotalMarketbased%')
        );
    """

df_climate5 = g_db_session.sql(query_climate).toPandas()

Let's keep track of the LEGALNAME

legalName = df_climate5.LEGALNAME.iloc[0]
legalName

'MICROSOFT CORPORATION'

We are interested in the VALUETEXT values because they tell us which Scopes each emission data point focuses on.

We also need the base year from which we do our calculation, i.e.: when is the emission reduction calculated from? `FEEDFIELDNAME ILIKE '%LongTermSet%GHGEmissionBaseYear%'` fetches just this info. We can collect it via the VALUE_ values.

We also need the target year towards which we do our calculation, i.e.: when is the emission reduction set to be achieved by? `FEEDFIELDNAME ILIKE '%LongTermSet%GHGEmissionTargetYear%'` fetches just this info. We can also collect it via the VALUE_ values.

The VALUEPRCNT values are the other figures we're interested in.

Let's concatenate all the data in an easy-to-use dataframe, `df_climate6`:

df_climate6 = pd.concat(
    [
        df_climate5.pivot_table(
            index='FY',
            columns='FEEDFIELDNAME',
            values=['VALUEPRCNT'],
        ),
        df_climate5.pivot_table(
            index='FY',
            columns='FEEDFIELDNAME',
            values=['VALUETEXT'],
            aggfunc={'VALUETEXT': 'first'}
        ),
        df_climate5.pivot_table(
            index='FY',
            columns='FEEDFIELDNAME',
            values=['VALUE_'],
        )
    ],
    axis=1)
df_climate6

  VALUEPRCNT                   ... VALUE_                  
FEEDFIELDNAME LongTermSet1GHGEmissionPercentageReductionTargeted LongTermSet1PercentageofGHGEmissionCoveredbyTarget LongTermSet2GHGEmissionPercentageReductionTargeted LongTermSet2PercentageofGHGEmissionCoveredbyTarget LongTermSet3GHGEmissionPercentageReductionTargeted LongTermSet3PercentageofGHGEmissionCoveredbyTarget LongTermSet4GHGEmissionPercentageReductionTargeted LongTermSet4PercentageofGHGEmissionCoveredbyTarget LongTermSet5GHGEmissionPercentageReductionTargeted LongTermSet5PercentageofGHGEmissionCoveredbyTarget ... LongTermSet1GHGEmissionBaseYear LongTermSet1GHGEmissionTargetYear LongTermSet2GHGEmissionBaseYear LongTermSet2GHGEmissionTargetYear LongTermSet3GHGEmissionBaseYear LongTermSet3GHGEmissionTargetYear LongTermSet4GHGEmissionBaseYear LongTermSet4GHGEmissionTargetYear LongTermSet5GHGEmissionBaseYear LongTermSet5GHGEmissionTargetYear
FY                                          
2019 7500 10000 7500 10000 NaN NaN NaN NaN NaN NaN ... 2013 2045 2013 2030 NaN NaN NaN NaN NaN NaN
2020 7500 10000 7500 10000 3000 10000 5000 10000 NaN NaN ... 2013 2045 2013 2030 2017 2030 2020 2030 NaN NaN
2021 7500 10000 7500 10000 3000 10000 5000 10000 NaN NaN ... 2013 2045 2013 2030 2017 2030 2020 2030 NaN NaN
2022 7500 10000 7500 10000 3000 10000 10000 NaN 5000 10000 ... 2013 2045 2013 2030 2017 2030 2020 2030 2020 2030
for col in df_climate6.VALUEPRCNT.columns:
    # VALUEPRCNT was selected twice in the query, so `df_climate6.VALUEPRCNT[col]`
    # can return two identical columns here; `.iloc[:, 0]` keeps the first one.
    _df_climate[col] = df_climate6.VALUEPRCNT[col].iloc[:, 0]

for col in df_climate6.VALUETEXT.columns:
    _df_climate[col] = df_climate6.VALUETEXT[col]

for col in df_climate6.VALUE_.columns:
    _df_climate[col] = df_climate6.VALUE_[col]

_df_climate
FEEDFIELDNAME CO2EquivalentsEmissionDirectScope1 CO2EquivalentsEmissionIndirectScope2 CO2EquivalentsEmissionIndirectScope3 CO2EquivalentsEmissionTotal TotalCO2EquivalentEmissionsScope1andScope2andScope3 LongTermSet1GHGEmissionPercentageReductionTargeted LongTermSet1PercentageofGHGEmissionCoveredbyTarget LongTermSet2GHGEmissionPercentageReductionTargeted LongTermSet2PercentageofGHGEmissionCoveredbyTarget LongTermSet3GHGEmissionPercentageReductionTargeted ... LongTermSet1GHGEmissionBaseYear LongTermSet1GHGEmissionTargetYear LongTermSet2GHGEmissionBaseYear LongTermSet2GHGEmissionTargetYear LongTermSet3GHGEmissionBaseYear LongTermSet3GHGEmissionTargetYear LongTermSet4GHGEmissionBaseYear LongTermSet4GHGEmissionTargetYear LongTermSet5GHGEmissionBaseYear LongTermSet5GHGEmissionTargetYear
FY                                          
2005 13185 241730 NaN 254915 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2006 14730 137870 NaN 152600 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021 123704 5010670 12511000 5134370 18155572 7500 10000 7500 10000 3000 ... 2013 2045 2013 2030 2017 2030 2020 2030 NaN NaN
2022 139413 6381250 12571000 6520660 19832663 7500 10000 7500 10000 3000 ... 2013 2045 2013 2030 2017 2030 2020 2030 2020 2030

Some of the values are pre-multiplied by 100 when they are percentages. I'm personally not a fan of that, so let's scale them back to a maximum of 100:

for col in _df_climate.columns:
    # Check if the column name contains the target substrings
    if ('GHGEmissionPercentageReductionTargeted' in col or 'GHGEmissionCoveredbyTarget' in col):
        # Check whether all numeric values (ignoring NaNs) are greater than 100
        if _df_climate[col].apply(pd.to_numeric, errors='coerce').dropna().gt(100).all():
            # If so, divide the values by 100
            _df_climate[col] = _df_climate[col] / 100

_df_climate

FEEDFIELDNAME CO2EquivalentsEmissionDirectScope1 CO2EquivalentsEmissionIndirectScope2 CO2EquivalentsEmissionIndirectScope3 CO2EquivalentsEmissionTotal TotalCO2EquivalentEmissionsScope1andScope2andScope3 LongTermSet1GHGEmissionPercentageReductionTargeted LongTermSet1PercentageofGHGEmissionCoveredbyTarget LongTermSet2GHGEmissionPercentageReductionTargeted LongTermSet2PercentageofGHGEmissionCoveredbyTarget LongTermSet3GHGEmissionPercentageReductionTargeted ... LongTermSet1GHGEmissionBaseYear LongTermSet1GHGEmissionTargetYear LongTermSet2GHGEmissionBaseYear LongTermSet2GHGEmissionTargetYear LongTermSet3GHGEmissionBaseYear LongTermSet3GHGEmissionTargetYear LongTermSet4GHGEmissionBaseYear LongTermSet4GHGEmissionTargetYear LongTermSet5GHGEmissionBaseYear LongTermSet5GHGEmissionTargetYear
FY                                          
2005 13185 241730 NaN 254915 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2006 14730 137870 NaN 152600 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021 123704 5010670 12511000 5134370 18155572 75 100 75 100 30 ... 2013 2045 2013 2030 2017 2030 2020 2030 NaN NaN
2022 139413 6381250 12571000 6520660 19832663 75 100 75 100 30 ... 2013 2045 2013 2030 2017 2030 2020 2030 2020 2030

For your information (FYI): CO2EquivalentsEmissionTotal stands for TotalCO2EquivalentEmissionsScope1andScope2. As we can see on the far right of `_df_climate`, Scope 1 & 2 seems to be the most popular target scope, surely because that's where companies have operational control; let's stick with that. Let's also choose 'Set 1' for this article, but you can choose whichever set suits your needs.
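Since `CO2EquivalentsEmissionTotal` is the Scope 1 + Scope 2 total, we can sanity-check the data with a small helper. Note that `check_scope12_total` is a hypothetical function of mine, not part of the SFI data model; the relative tolerance allows for rounding in the disclosed figures:

```python
import numpy as np
import pandas as pd

def check_scope12_total(df: pd.DataFrame, rtol: float = 1e-4) -> bool:
    """Sanity check: Scope 1 + Scope 2 should (approximately) equal the
    'Scope 1 and 2' total, allowing for rounding in disclosed figures."""
    s = (df["CO2EquivalentsEmissionDirectScope1"]
         + df["CO2EquivalentsEmissionIndirectScope2"])
    total = df["CO2EquivalentsEmissionTotal"]
    mask = s.notna() & total.notna()  # only compare years where both exist
    return bool(np.allclose(s[mask], total[mask], rtol=rtol))

# Example with the 2021/2022 Microsoft figures shown above:
sample = pd.DataFrame({
    "CO2EquivalentsEmissionDirectScope1": [123704, 139413],
    "CO2EquivalentsEmissionIndirectScope2": [5010670, 6381250],
    "CO2EquivalentsEmissionTotal": [5134370, 6520660],
})
check_scope12_total(sample)  # → True
```

Running it on `_df_climate` itself (rather than the toy `sample`) should confirm the Microsoft figures line up.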

df_climate = _df_climate[[
    "CO2EquivalentsEmissionDirectScope1",
    "CO2EquivalentsEmissionIndirectScope2",
    "CO2EquivalentsEmissionTotal"]].copy()  # .copy() avoids pandas' SettingWithCopyWarning when we add columns below

for col in ["CO2EquivalentEmissionsTotalMarketbased",
            "LongTermSet1GHGEmissionPercentageReductionTargeted",
            "LongTermSet1PercentageofGHGEmissionCoveredbyTarget",
            "LongTermSet1GHGEmissionBaseYear",
            "LongTermSet1GHGEmissionTargetYear"]:
    df_climate[col] = _df_climate[_df_climate['LongTermSet1GHGEmissionScope'] == "Scope 1 and 2"][col]

df_climate

FEEDFIELDNAME CO2EquivalentsEmissionDirectScope1 CO2EquivalentsEmissionIndirectScope2 CO2EquivalentsEmissionTotal CO2EquivalentEmissionsTotalMarketbased LongTermSet1GHGEmissionPercentageReductionTargeted LongTermSet1PercentageofGHGEmissionCoveredbyTarget LongTermSet1GHGEmissionBaseYear LongTermSet1GHGEmissionTargetYear
FY                
2005 13185 241730 254915 NaN NaN NaN NaN NaN
2006 14730 137870 152600 NaN NaN NaN NaN NaN
2021 123704 5010670 5134370 553109 75 100 2013 2045
2022 139413 6381250 6520660 427442 75 100 2013 2045
if not df_climate['LongTermSet1PercentageofGHGEmissionCoveredbyTarget'].apply(pd.to_numeric, errors='coerce').dropna().gt(80).all():
    print("Be wary of the fact that 'GHGEmissionCoveredbyTarget' is below 80%. The lower this figure, the less reliable the results.")
else:
    print("'GHGEmissionCoveredbyTarget' figures are above 80%. The higher this figure, the more reliable the results.")
    print("In this case, we consider them to be high enough to be considered reliable.")

'GHGEmissionCoveredbyTarget' figures are above 80%. The higher this figure, the more reliable the results.
In this case, we consider them to be high enough to be considered reliable.

So, let's recap: the base emission is `baseEmission`, and the target is `targetEmission`, to be achieved by 2045:

baseEmissionYear = int(df_climate.LongTermSet1GHGEmissionBaseYear.iloc[-1])
targetEmissionReductionYear = int(df_climate.LongTermSet1GHGEmissionTargetYear.iloc[-1])
baseEmission = _df_climate.loc[baseEmissionYear].CO2EquivalentsEmissionTotal
targetReductionEmission = baseEmission * (df_climate['LongTermSet1GHGEmissionPercentageReductionTargeted'].iloc[-1] / 100) * (df_climate['LongTermSet1PercentageofGHGEmissionCoveredbyTarget'].iloc[-1] / 100)
targetEmission = baseEmission - targetReductionEmission
print(f"baseEmissionYear = {baseEmissionYear} & targetEmissionReductionYear = {targetEmissionReductionYear}")
print(f"baseEmission = {baseEmission} & targetReductionEmission = {targetReductionEmission} & targetEmission = {targetEmission}")

baseEmissionYear = 2013 & targetEmissionReductionYear = 2045
baseEmission = 1355480.0 & targetReductionEmission = 1016610.0 & targetEmission = 338870.0
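As a quick sanity check on that arithmetic, in plain Python with the figures printed above (a 75% targeted reduction covering 100% of emissions):

```python
baseEmission = 1355480.0  # Scope 1 and 2 total in the 2013 base year
reduction = 75 / 100      # LongTermSet1GHGEmissionPercentageReductionTargeted
coverage = 100 / 100      # LongTermSet1PercentageofGHGEmissionCoveredbyTarget

targetReductionEmission = baseEmission * reduction * coverage
targetEmission = baseEmission - targetReductionEmission
print(targetReductionEmission, targetEmission)  # 1016610.0 338870.0
```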

It's always best to see it all in a pretty graph:

Exponential Decay Line Plot Explanation

To create an exponential decay plot, we use the following exponential decay function:

y(t) = y₀ · exp(−k · (t − t₀))

Steps:

Calculate the Decay Constant 𝑘: We first calculate the decay constant 𝑘 based on the start and end values (1,355,480.0 at 2013 and 338,870.0 at 2045). We use the following formula for exponential decay:

k = −ln(y_end / y₀) / (t_end − t₀)

This formula is derived by solving for 𝑘 in the exponential decay equation, where 𝑦_end is the value at the end year (2045), 𝑦₀ is the starting value (in 2013), and 𝑡_end is the end year.

Generate the Exponential Values: With 𝑘 calculated, we use the exponential decay equation to generate a value for each year 𝑡 between 2013 and 2045:

y(t) = y₀ · exp(−k · (t − t₀)),  t = 2013, …, 2045

This generates values that decrease exponentially from 1,355,480.0 in 2013 to 338,870.0 in 2045.

start_year = baseEmissionYear
end_year = targetEmissionReductionYear
start_value = baseEmission
end_value = targetEmission

# Calculate the decay constant 'k' for exponential decay.
# We use the formula for exponential decay: y(t) = y0 * exp(-k * (t - t0))
# and solve for 'k' using y(t) at the end point.
k = -np.log(end_value / start_value) / (end_year - start_year)

# Generate years and corresponding values using exponential decay
years = np.arange(start_year, end_year + 1)
values = start_value * np.exp(-k * (years - start_year))

# Create the plot
fig = go.Figure()

# Add the line trace
fig.add_trace(
    go.Scatter(
        x=years, y=values, mode='lines',
        name=f'Exponential Decay to target of {int(targetEmission)}'))

# Set plot title and labels
fig.update_layout(
    title=f"{legalName}'s GHG Emissions (Base in {baseEmissionYear} and Target in {targetEmissionReductionYear}) in CO2 Tons",
    xaxis_title="Year",
    yaxis_title="Value",
    template="plotly_dark"  # Optional: other templates include 'plotly', 'ggplot2', etc.
)

# Add the dots representing the actual CO2 data from df_climate
fig.add_trace(go.Scatter(
    x=df_climate.index,
    y=df_climate["CO2EquivalentsEmissionTotal"],
    mode='lines+markers',
    name='CO2 Equivalents Emission Total',
    marker=dict(color='red', size=8)  # Customize the dot colour and size
))

fig.add_trace(go.Scatter(
    x=df_climate.index,
    y=df_climate["CO2EquivalentEmissionsTotalMarketbased"],
    mode='lines+markers',
    name='CO2 Equivalent Emissions Total Marketbased',
    marker=dict(color='grey', size=8)
))

# Show the plot
fig.show()

All together now

You can find the whole code on GitHub via the link at the top of this article.
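To give a flavour of how such a wrapper can be structured, here is a sketch of one small building block: a hypothetical helper (not necessarily what the GitHub code does) that turns the `companyLegalName` and `exactCompanyLegalName` arguments into the LEGALNAME predicate used in the queries above:

```python
def build_legalname_filter(company_legal_name: str, exact: bool = False) -> str:
    """Build the WHERE predicate matching a company's LEGALNAME.

    ILIKE gives a case-insensitive match; when `exact` is False, we wrap the
    name in '%' wildcards for a substring match, as done throughout this article.
    """
    if exact:
        return f"p.LEGALNAME ILIKE '{company_legal_name}'"
    return f"p.LEGALNAME ILIKE '%{company_legal_name}%'"

print(build_legalname_filter("Shell"))                  # p.LEGALNAME ILIKE '%Shell%'
print(build_legalname_filter("Shell plc", exact=True))  # p.LEGALNAME ILIKE 'Shell plc'
```

(In a real wrapper, you would rather bind the name as a query parameter than interpolate it into the SQL string, to avoid injection issues.)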

dfDict = GHGEmissionsBaseVsTargetUsingLSEGSFISnowflake(
    g_db_session=g_db_session,
    companyLegalName="Shell",
    exactCompanyLegalName=False)

'GHGEmissionCoveredbyTarget' figures are above 80%. The higher this figure, the more reliable the results.
In this case, we consider them to be high enough to be considered reliable.
baseEmissionYear = 2016 & targetEmissionReductionYear = 2050
baseEmission = 83000000.0 & targetReductionEmission = 83000000.0 & targetEmission = 0.0

 

 

Conclusion

As you can see, once you know the names of the fields you are after, it becomes a simple matter of mixing and matching, which SQL, along with Snowflake, can do for you! The code above is not fool-proof and is not meant for use in a live environment, but it is a good example of how one can create automated processes that allow anyone to gather insights from SFI data and output useful graphics.