Context

Filtering ESG (Environmental, Social, and Governance) controversies from a company’s news feed is a highly labor-intensive task. ESG analysts must sift through a large number of articles, assess the nature of any controversies, and tag the relevant company (or companies) involved. This process is further complicated by the fact that companies may be directly or indirectly linked to the controversy. The purpose of this task is to maintain an up-to-date database of companies to inform investors and other stakeholders.

Project Scope

I selected this use case as the capstone project for my graduation from the MIT Design and Build AI Products and Services course. Additionally, I sought to align it with the ongoing work of my colleagues at LSEG Innovation, who were developing a model that can:

  • Ingest news articles,
  • Map the severity of controversies over time, and
  • Categorize controversies by type, scale of impact, and other relevant factors.

My goal is to automate the attribution of ESG controversies, reducing the human burden of determining which companies are involved and whether their involvement is direct or indirect.

Operational Description

Currently, the entire process of handling ESG controversies is manual, involving several steps:

  1. Monitoring news sources,
  2. Performing searches using relevant keywords,
  3. Collecting and reading the returned articles,
  4. Flagging articles and entering them into an Excel sheet, categorizing them by severity level,
  5. Performing a quality check on the data, and
  6. Finalizing which controversies will be approved for publication and included in the news feed consumed by stakeholders.

Specifically, I want to focus on inserting an additional model just before step 4. This model would tag the company(ies) involved in the news controversies and, since multiple companies may be involved in a single controversy, the algorithm should also decide the “degree” of responsibility of each company. This would significantly streamline the analyst’s workflow.

ESG analysts would still play a significant role in gathering and cleaning data and in approving controversies in bulk via the interface (wireframes follow below).

Performance Metrics

Currently, the entire process is manual. On average, it takes approximately 1.8 minutes for an analyst just to read an article, not including the additional steps. While I am unable to provide more detailed figures at this time, my goal is to improve analyst productivity by at least 70%.

My Approach To The Problem

The whole idea is to build a POC and give it to ESG analysts to test. In fact, humans will still be involved in curating and preparing training data and in reviewing the model output. In a nutshell, my approach involves:

  1. Processing the news to extract company names using spaCy (a Python library with built-in entity recognition).
  2. Composing a prompt for OpenAI adopting the definition of involvement described in the MSCI ESG Controversies and Global Norms Methodology.
  3. Using the OpenAI API to classify the extracted companies as directly or indirectly involved, using the prompt composed in the previous step.
  4. Calculating a confidence score for that classification.

Company Names Extraction

Extracting company names from a corpus of text is a very common use case nowadays. NER (named entity recognition) is a natural language processing (NLP) technique used to identify and classify named entities within a body of text. Named entities are specific pieces of information such as:

  • People (e.g., “Barack Obama”),
  • Organizations (e.g., “Google”),
  • Locations (e.g., “New York”),
  • Dates (e.g., “October 22, 2024”), and
  • Monetary values (e.g., “$100”).

Discussing how to optimize NER in general is out of scope for this project. Here it is sufficient to say that I installed the spaCy library, available as a Python package, via the conda install command line and used it as is.
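As a minimal illustration (assuming the small English model has been downloaded with python -m spacy download en_core_web_sm), pulling organization names out of a sentence takes only a few lines:

```python
import spacy

# Load the small English pipeline (must be downloaded beforehand).
nlp = spacy.load("en_core_web_sm")

def extract_companies(text):
    """Return the unique ORG entities spaCy finds in the text."""
    doc = nlp(text)
    return sorted({ent.text for ent in doc.ents if ent.label_ == "ORG"})

print(extract_companies("Apple and Foxconn faced criticism over factory conditions."))
```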

OpenAI Prompt Composition for Direct/Indirect Involvement Classification

Once the names of the company(ies) involved in the news have been extracted, it’s time to dig deeper and establish the level of involvement in an ESG controversy. To do this, I spent some time reviewing the OpenAI prompt engineering guide to craft my long prompt. Given the complex nature of ESG news, it didn’t take long to realise that I needed a clear methodology to assess the level of involvement in any given dispute.
This brings me to the next step, where I delved into the MSCI ESG Controversies and Global Norms Methodology paper.

This document does a superb job of laying out a series of guidelines to follow for any ESG controversy assessment.
In particular, chapter 1.3 illustrates how to discriminate between direct and indirect involvement of any company in a controversy. I took inspiration from those definitions to complete my OpenAI prompt and see how my code would perform. In one of my early iterations it looked as rough as this:
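In spirit, it read something like the sketch below (the wording is an illustrative reconstruction of the gist, not the verbatim prompt):

```python
# Illustrative reconstruction of an early prompt -- not the verbatim original.
PROMPT_TEMPLATE = """Based on the MSCI ESG Controversies and Global Norms
Methodology, a company is DIRECTLY involved in a controversy when its own
operations, products or employees caused it, and INDIRECTLY involved when it
is linked to it through its supply chain, business partners or investments.

Article: {article}
Company: {company}

Is {company} directly or indirectly involved in this controversy?
"""
```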

Code Sections

After playing around with Python and getting some help from Gemini, I put together the essential sections of my code:
1 – Importing the libraries needed for basic text processing
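A sketch of that cell (reconstructed from the steps described in this post, not the verbatim original; it assumes pandas and the v1 OpenAI Python SDK are installed):

```python
import pandas as pd        # reading and writing the Excel sheets
import spacy               # named entity recognition
from openai import OpenAI  # calls to the classification model
```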

2 – Loading spaCy and defining the involvement-categorization function
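A sketch under a few assumptions of my own: the function name, model choice and temperature parameter are placeholders, and PROMPT_TEMPLATE is the prompt from the previous section:

```python
nlp = spacy.load("en_core_web_sm")
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def categorize_involvement(article, company, temperature=0.0):
    """Ask the model whether `company` is directly or indirectly involved."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(article=article,
                                                     company=company)}],
        temperature=temperature,
    )
    return response.choices[0].message.content.strip()
```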

3 – Defining the news-processing function
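In essence this combines the two previous pieces: run NER, then classify each extracted company (again a sketch, not the exact cell):

```python
def process_news(article):
    """Extract ORG entities and classify each one's involvement."""
    doc = nlp(article)
    companies = sorted({ent.text for ent in doc.ents if ent.label_ == "ORG"})
    return [{"company": c, "involvement": categorize_involvement(article, c)}
            for c in companies]
```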

4 – Reading data from Excel, processing the articles and creating a new data frame
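Something like the following, with pandas doing the Excel work (file and column names are placeholders):

```python
df = pd.read_excel("esg_news.xlsx")  # placeholder file name

results = []
for _, row in df.iterrows():
    for tag in process_news(row["article"]):  # placeholder column name
        results.append({"article": row["article"], **tag})

results_df = pd.DataFrame(results)
```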

Code Iterations

At this point, it was the right time to switch to a Jupyter notebook and bang my head against the problem while experimenting. The model was pretty much returning what I needed: a company name and its involvement, direct or indirect. However, it was very imprecise. I therefore decided to prepare some training data. I put my ESG analyst hat on and built an Excel sheet with around 80 rows of news. I read through all of them and labelled the companies myself as direct or indirect, as in the screenshot below.

To feed this data to the model, I modified my code and added this cell:
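A sketch of one way to do this, assuming the labels are turned into few-shot examples that get prepended to the prompt (file and column names are placeholders):

```python
labelled = pd.read_excel("labelled_news.xlsx")  # placeholder file name

# Turn a few labelled rows into few-shot examples for the prompt.
EXAMPLES = "\n\n".join(
    f"Article: {row.article}\nCompany: {row.company}\nInvolvement: {row.label}"
    for row in labelled.head(5).itertuples()
)

def build_prompt(article, company):
    """Few-shot examples first, then the question about the new article."""
    # categorize_involvement would call this instead of formatting
    # PROMPT_TEMPLATE directly.
    return EXAMPLES + "\n\n" + PROMPT_TEMPLATE.format(article=article,
                                                      company=company)
```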

After that, I modified the news-processing function to include some sort of confidence score (and added the statistics library to the imports at the top).
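A minimal sketch of one such scheme, assuming the score is the agreement rate across a few repeated, higher-temperature samples:

```python
import statistics

def classify_with_confidence(article, company, n=3):
    """Sample the model a few times; agreement between runs is the confidence."""
    answers = [categorize_involvement(article, company, temperature=0.7)
               for _ in range(n)]
    label = statistics.mode(answers)        # most frequent answer
    confidence = answers.count(label) / n   # share of runs that agree
    return label, confidence
```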

I also thought it was a good idea to break the flow into batches for optimal performance monitoring (and to keep an eye on the OpenAI billing! 😱)
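A sketch of the batching loop (the batch size is a placeholder):

```python
BATCH_SIZE = 10  # placeholder batch size

for start in range(0, len(df), BATCH_SIZE):
    batch = df.iloc[start:start + BATCH_SIZE]
    for _, row in batch.iterrows():
        # Print as we go, to eyeball results (and costs) batch by batch.
        print(start, row["article"][:60], process_news(row["article"]))
```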

Finally, I needed to write the new results of the training run into a new Excel file.
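With pandas this is a one-liner (the file name is a placeholder):

```python
results_df.to_excel("esg_results.xlsx", index=False)  # placeholder file name
```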

At this point I spent several days running the model through the data and checking how the final Excel sheet looked. Every now and then I made some tweaks to the prompt to make the answers more precise and concise. After several iterations I eventually ended up with something noticeably better, in particular:

  • The model would now handle the toy problem in the first 3 rows correctly (“Apple is a company that makes computers” is not a controversy), and
  • The involvement function would return “Direct” or “Indirect Involvement” ONLY, instead of a long free-text answer.

For the sake of brevity, I’ll show only how the first rough prompt evolved into the final version below (added parts are highlighted in yellow).

Wireframes

To complement this solution, we need a UI that lets ESG analysts review what the model returned and approve ESG controversies in bulk (or not). From here the analyst can send the news items classified as ESG controversies to the database, where they will be further processed to assess severity and scale of impact.
All of this can essentially be handled in one critical screen plus a few contextual modals:

To speed things up, the analyst has the option to send on or flag multiple news items at once, using the confidence level returned by the model.

The quick video below shows this simple interaction:

https://vimeo.com/1023465113?share=copy

For the sake of completeness I also included a quick heuristic evaluation against the 10 Usability Heuristics by the Nielsen Norman Group, which is by now an industry standard. I think this is always a good strategy when it comes to putting the most skeptical stakeholders at ease… (have you ever heard: “yes, but why did you do THIS instead of THAT?!”)

Conclusions and next steps

After training the model through the first batches, the results are encouraging, but there is still a lot to be done to increase accuracy:

Named entity recognition improvements: spaCy’s performance could be improved fairly easily, though that would perhaps be a separate project. A few ideas (see the sketch after this list):

  • Providing lists of known companies and tweaking the parameters a bit.
  • Swapping en_core_web_sm for the medium or large English model.
  • Training a custom model: for a specific domain or language, a custom model can significantly improve performance and accuracy.
  • Considering alternative libraries: exploring other NLP libraries like Hugging Face Transformers or AllenNLP.
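The first two ideas combine into just a few lines using spaCy’s EntityRuler (the company list is illustrative, and en_core_web_lg must be downloaded first):

```python
import spacy

# Larger model: better NER accuracy at the cost of size and speed.
nlp = spacy.load("en_core_web_lg")

# Seed the pipeline with known company names so they are always tagged as ORG.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "ORG", "pattern": name}
                    for name in ["LSEG", "Foxconn", "MSCI"]])
```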

The confidence score methodology needs to be revised: it can currently return a high score even when company names are missed altogether. This whole project being a proof-of-concept exercise, I didn’t have the ambition of getting everything right by myself. Despite my will to keep going, I had to accept that this was the point to hand the whole thing over to a data scientist for proper development. I did learn a lot though!