Filtering Realtime News with RDP Filters

April 01, 2022

Zoya Farberov

Introduction to News Filtering

In this article we would like to discuss how to design and implement filtering of realtime Machine Readable News (MRN) using Refinitiv Data Platform (RDP) news filtering expressions.

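Let us first look at an example of an RDP filtering expression, taken from the test expressions listed at the end of this article:

    NP:DNP,RNP,MF,J AND ((G:41 AND NS:RTRS)  OR  (RR:1561,1707,1716,2810,2796,1263,2543,2796,2803)) AND L:JA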

Let us note how:

Each stanza of filtering can be either present or absent in the expression
The filtering stanzas are combined with logical ANDs and ORs, as well as parentheses

And we are ready to put together a simple map:
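Judging by the stanza prefixes and story fields used in the code later in this article, the map can be sketched as a small Python dictionary. The name STANZA_TO_FIELD is our own and is introduced for illustration only.

```python
# Filter stanza prefix -> field of the decompressed news story.
# The pairs are read off the code in the sections that follow;
# the dictionary name itself is hypothetical.
STANZA_TO_FIELD = {
    "NP:": "audiences",    # audience stanzas
    "G:": "subjects",      # subject stanzas
    "NS:": "provider",     # provider stanzas
    "RR:": "instancesOf",  # named item stanzas
    "L:": "language",      # language stanzas (also written as &language=)
}

for prefix, field in STANZA_TO_FIELD.items():
    print(prefix, "->", field)
```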

Consequently, our design approach on receipt of a new tagged realtime story will be:

1. Processing of filtering stanzas against the story:

a) Find each filtering stanza in the filtering expression

b) Identify the corresponding field to be filtered in realtime news

c) Evaluate the filtering stanza against the contents of that field and resolve it into a True or False conclusion

2. Substitute the results back into the original filtering expression

3. Use Python's built-in eval function to resolve the expression into a single True or False conclusion

4. Present the result as the output. Consequently, the results of the filtering can be processed per application requirement at hand, for example, only stories that pass the filtering can be propagated to the business tier for presentation.
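The four steps above can be sketched end to end on a simplified grammar. This is not the implementation discussed in the rest of the article. It is a minimal illustration that assumes every stanza has the form PREFIX:code[,code...], that list fields carry prefixed codes, and that the language field holds a bare lower-case code. All names and the sample story are hypothetical.

```python
import re

# Hypothetical names throughout; the stories used with this are made-up samples.
FIELD_FOR_PREFIX = {
    "NP": "audiences", "G": "subjects", "NS": "provider",
    "RR": "instancesOf", "L": "language",
}
# One stanza: a prefix, a colon, then one or more comma-separated codes.
STANZA = re.compile(r"\b(NP|NS|RR|G|L):(\w+(?:,\w+)*)")


def stanza_passes(prefix, codes, story):
    """Step 1: resolve one stanza against the matching story field."""
    field = story.get(FIELD_FOR_PREFIX[prefix], [])
    present = field if isinstance(field, list) else [field]
    if prefix == "L":
        wanted = [code.lower() for code in codes]         # assumed stored without prefix
    else:
        wanted = [prefix + ":" + code for code in codes]  # a comma means OR
    return any(code in present for code in wanted)


def passes_filter(expression, story):
    # Step 2: substitute each stanza with its True/False result.
    resolved = STANZA.sub(
        lambda m: str(stanza_passes(m.group(1), m.group(2).split(","), story)),
        expression,
    )
    # Step 3: switch to Python operators, check nothing else is left, evaluate.
    resolved = resolved.replace("AND", "and").replace("OR", "or")
    if not re.fullmatch(r"(?:True|False|and|or|[()\s])+", resolved):
        raise ValueError("unresolved filter expression: " + resolved)
    return eval(resolved)


sample_story = {
    "audiences": ["NP:RNP"], "subjects": ["G:B1"], "provider": "NS:RTRS",
    "instancesOf": [], "language": "en",
}
sample_filter = "NP:DNP,RNP AND ((G:B1 AND NS:RTRS) OR (RR:1300,1200)) AND L:EN,ZH"
# Step 4: present the result.
print(passes_filter(sample_filter, sample_story))
```

Checking that only True, False, and, or and parentheses remain before calling eval keeps an unexpected token in the filter from being executed as Python code.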

Now let us take a more detailed look at the implementation of this approach:

Building up from the Existing RDP Examples

When working on this example, we started with the RDP Market Price service discovery example https://github.com/Refinitiv/websocket-api/tree/master/Applications/Examples/RDP/python and

the example of outputting Machine Readable News JSON data using WebSockets https://github.com/Refinitiv-API-Samples/Example.WebSocketAPI.Python.MRN/tree/ERT-in-Cloud,

then built upon https://github.com/Refinitiv-API-Samples/Example.WebsocketAPI.RealtimeNewsStreamingOptimizedWealth.Python and added the filtering option in:

https://github.com/Refinitiv-API-Samples/Example.WebsocketAPI.MRNRDPGwServiceDiscoveryFilter.Python

Please note that Refinitiv Data Platform is an actively growing, evolving and improving product, and so is RDP news. Consequently, the specifics of filtering that are discussed next may also change and improve over time.

In the next sections, we are going to focus on news filtering.

Applying Filtering

Once a story has been assembled from its fragments and decompressed, we apply the filtering criteria to its content:

    	
                news_decompressed = json.loads(decompressed_data)
                print("News = " + json.dumps(news_decompressed, indent=2, separators=(',', ':')))
                sys.stdout.flush()
                print("<<<verify RDP filter result>>>= " +
                      str(self.verifyNewsAgainstFilter(news_decompressed, filter_expression)))
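For context, the assembly and decompression step that precedes this call can be sketched as follows. The sketch assumes, as in the MRN examples linked above, that each fragment arrives base64-encoded and that the assembled payload is gzip-compressed JSON. The function name assemble_story and the sample fragments are hypothetical.

```python
import base64
import gzip
import json
import zlib


def assemble_story(fragments_b64):
    """Join base64-encoded fragments, decompress them, and parse the JSON."""
    compressed = b"".join(base64.b64decode(f) for f in fragments_b64)
    # MAX_WBITS | 32 lets zlib detect a gzip or zlib header automatically.
    decompressed = zlib.decompress(compressed, zlib.MAX_WBITS | 32)
    return json.loads(decompressed)


# Build two made-up fragments from a made-up story, then reassemble them.
sample_story = {"language": "en", "audiences": ["NP:RNP"]}
payload = gzip.compress(json.dumps(sample_story).encode("utf-8"))
half = len(payload) // 2
sample_fragments = [
    base64.b64encode(payload[:half]).decode("ascii"),
    base64.b64encode(payload[half:]).decode("ascii"),
]
print(assemble_story(sample_fragments))
```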

Filtering Audiences

The audiences filtering stanza may or may not be parenthesized, and when it is, the parentheses can be nested. We evaluate the audiences filtering stanza against the audiences field in the news as a whole, from the first to the last parenthesis if required:

    	
        # Audience: a single NP, comma-separated NPs (= OR), explicit OR or AND, or a combination
        audience_begin = filter_expr_.find('NP:')
        if audience_begin != -1:
            print('\t~~~Audience filtering:~~~')
            # include a leading parenthesis in the audience stanza
            if filter_expr_.find('(') != -1 and filter_expr_.find('(') < audience_begin:
                audience_begin = filter_expr_.find('(')
            # extend the stanza for as long as further NP: entries follow an AND
            audience_end = filter_expr_.find('AND', audience_begin + 2)
            if audience_end != -1:
                additional_audience_begin = filter_expr_.find('NP:', audience_end + 3)
                while additional_audience_begin != -1 and audience_end != -1:
                    audience_end = filter_expr_.find('AND', additional_audience_begin + 2)
                    additional_audience_begin = filter_expr_.find('NP:', audience_end + 3)
                if audience_end == -1:
                    audience_end = len(filter_expr_)
            else:
                audience_end = len(filter_expr_)
            filter_expr_audience = filter_expr_[audience_begin:audience_end]
            audience_res = self.audienceCheck(str(news_['audiences']), filter_expr_audience)
            # the audience stanza is expected to lead the expression, so replace from position 0
            filter_expr_ = filter_expr_.replace(filter_expr_[0:audience_end], str(audience_res) + ' ')
            current_pos = audience_end + 1

We define a helper function, audienceCheck, to handle the audience verification. It can be called recursively:

    	
    def audienceCheck(self, _audiences_stanza, filter_expr_audience):
        filter_expr_audience = filter_expr_audience.replace(",", " OR NP:")
        filter_expr_audience = filter_expr_audience.replace("&", " AND ")
        if filter_expr_audience.find('AND') == -1:   # only ORs or a single NP - quick shortcut
            # strip the whitespace left around each alternative before matching
            orAudiences = [x.strip() for x in filter_expr_audience.split(' OR ') if x.strip()]
            if 'True' in orAudiences:
                return True
            elif any(x in _audiences_stanza for x in orAudiences):
                print('<TRUE> for audiences')
                return True
            else:
                # print('<FALSE> for audiences')
                return False
        elif filter_expr_audience.find('OR') == -1:   # only ANDs - quick shortcut
            andAudiences = [x.strip() for x in filter_expr_audience.split(' AND ') if x.strip()]
            if 'False' in andAudiences:
                return False
            elif all(x in _audiences_stanza for x in andAudiences):
                print('<TRUE> for audiences')
                return True
            else:
                # print('<FALSE> for audiences')
                return False
        else:  # OR and AND mixed, for example (NP:RNP OR NP:DNP) AND NP:M
            parent_end = filter_expr_audience.find(')')
            parent_begin = filter_expr_audience.rfind('(', 0, parent_end)
            while parent_begin != -1 and parent_end != -1:   # evaluate parentheses, innermost first
                # slice up to, but not including, the closing parenthesis
                filter_sub_expr_audience = filter_expr_audience[parent_begin + 1:parent_end]
                parentheses_result = str(self.audienceCheck(_audiences_stanza, filter_sub_expr_audience))
                filter_expr_audience = (filter_expr_audience[:parent_begin] + parentheses_result
                                        + filter_expr_audience[parent_end + 1:])
                parent_end = filter_expr_audience.find(')')
                parent_begin = filter_expr_audience.rfind('(', 0, parent_end)
            NP_begin = filter_expr_audience.find('NP:')
            while NP_begin != -1:   # evaluate individual NPs
                NP_end = filter_expr_audience.find(' ', NP_begin)
                if NP_end == -1:   # must be resolved before slicing
                    NP_end = len(filter_expr_audience)
                filter_sub_expr_audience = filter_expr_audience[NP_begin:NP_end]
                NP_result = str(self.audienceCheck(_audiences_stanza, filter_sub_expr_audience))
                # splice by position so that NP:M does not also rewrite part of NP:MF
                filter_expr_audience = (filter_expr_audience[:NP_begin] + NP_result
                                        + filter_expr_audience[NP_end:])
                NP_begin = filter_expr_audience.find('NP:')
            filter_expr_audience = filter_expr_audience.replace('OR', 'or').replace('AND', 'and')
            return eval(filter_expr_audience)  # Python evaluates the mix of logical and/or
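To trace what audienceCheck works with, the snippet below isolates its first two lines, which expand the shorthand forms into explicit OR and AND operators. The standalone function name normalize_audience is hypothetical.

```python
def normalize_audience(stanza):
    """Expand the comma and & shorthands the same way audienceCheck does."""
    stanza = stanza.replace(",", " OR NP:")   # a comma-separated list means OR
    stanza = stanza.replace("&", " AND ")     # & is shorthand for AND
    return stanza


print(normalize_audience("NP:DNP,RNP,MF"))
# NP:DNP OR NP:RNP OR NP:MF
print(normalize_audience("(NP:RNP OR NP:DNP)&NP:M"))
# (NP:RNP OR NP:DNP) AND NP:M
```

After this step, every audience code carries its own NP: prefix and can be matched against the audiences field individually.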

Filtering Subjects

Filtering subjects is much more straightforward, and we take a uniform approach:

    	
        subjects_begin = filter_expr_.find('G:')
        if subjects_begin != -1:
            print('\t^^^Subjects filtering:^^^')
            subjects_end = self.findStanzaEndPos("AND|OR", filter_expr_, subjects_begin)
            filter_expr_subjects = filter_expr_[subjects_begin:subjects_end]
            print('\t\tFound in news: ' + str(news_['subjects']))
            print('\t\tRequired in filter: ' + filter_expr_subjects)
            subjects_res = self.simpleCheck(str(news_['subjects']), filter_expr_subjects, ' OR G:')
            print('\t^^^Subjects result = ' + str(subjects_res) + '^^^')
            # replace subjects in the expression
            filter_expr_ = filter_expr_.replace(filter_expr_[subjects_begin:subjects_end], str(subjects_res) + ' ')
            print('\t^^^Filtering expression progressed = ' + filter_expr_ + '^^^')
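The helper findStanzaEndPos is called above but not listed in this article. A minimal sketch of what it might look like, based only on how it is called (a terminator pattern, the expression, and a start position), is shown below. The body and the snake_case name are assumptions, not the original implementation.

```python
import re


def find_stanza_end_pos(terminators, filter_expr, begin):
    """Return the index of the first terminator at or after begin,
    or the length of the expression if no terminator follows."""
    match = re.compile(terminators).search(filter_expr, begin)
    return match.start() if match else len(filter_expr)


sample_expr = "NP:RNP AND (G:B1 AND NS:RTRS)"
subjects_begin = sample_expr.find("G:")
subjects_end = find_stanza_end_pos("AND|OR", sample_expr, subjects_begin)
print(repr(sample_expr[subjects_begin:subjects_end]))   # 'G:B1 '

provider_begin = sample_expr.find("NS:")
provider_end = find_stanza_end_pos(r"AND|OR|\)", sample_expr, provider_begin)
print(repr(sample_expr[provider_begin:provider_end]))   # 'NS:RTRS'
```

Note that a stanza cut at the next AND or OR keeps the space that precedes the operator, so the caller may want to strip it before matching.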

The helper function simpleCheck is used to verify the subjects, provider, instanceOf and language filtering stanzas:

    	
    def simpleCheck(self, _subjects_stanza, filter_expr_subjects, replaceForComma):
        # used for the subjects, provider, instanceOf and language checks
        filter_expr_subjects = filter_expr_subjects.replace(",", replaceForComma)
        # alternatives are always ORed; strip the whitespace left around each one
        orSubjects = [x.strip() for x in filter_expr_subjects.split(' OR ') if x.strip()]
        return any(x in _subjects_stanza for x in orSubjects)
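One property of simpleCheck is worth illustrating: it looks for each code as a substring of the field converted to a string, so a shorter code such as G:4 also matches a story tagged G:41. If exact matching is required, the codes can be compared against the list elements instead. The helper exact_check below is a hypothetical alternative, not part of the original example, and the sample subjects are made up.

```python
def exact_check(field_values, stanza, prefix):
    """Return True if any code in the stanza equals an element of field_values.

    stanza is a filter stanza such as 'G:4,1F'; prefix is 'G:' in that case.
    """
    codes = stanza.strip()[len(prefix):].split(",")
    wanted = {prefix + code.strip() for code in codes if code.strip()}
    return any(value in wanted for value in field_values)


sample_subjects = ["G:41", "M:1QD"]
print("G:4" in str(sample_subjects))                  # substring test: True
print(exact_check(sample_subjects, "G:4,1F", "G:"))   # exact test: False
print(exact_check(sample_subjects, "G:41", "G:"))     # exact test: True
```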

Filtering Provider

Provider filtering is done in a similar way:

    	
        provider_begin = filter_expr_.find('NS:')
        if provider_begin != -1:
            print('\t***Provider filtering:***')
            # raw string, so that \) reaches the regular expression engine unchanged
            provider_end = self.findStanzaEndPos(r"AND|OR|\)", filter_expr_, provider_begin)
            filter_expr_provider = filter_expr_[provider_begin:provider_end]
            print('\t\tFound in news: ' + str(news_['provider']))
            print('\t\tRequired in filter: ' + filter_expr_provider)
            provider_res = self.simpleCheck(str(news_['provider']), filter_expr_provider, ' OR NS:')
            print('\t***Provider result = ' + str(provider_res) + '***')
            # replace provider in the expression
            filter_expr_ = filter_expr_.replace(filter_expr_[provider_begin:provider_end], str(provider_res) + ' ')
            print('\t***Filtering expression progressed = ' + filter_expr_ + '***')

Filtering Named Items

We look for the beginning of the filtering stanza and the end of it the same way:

    	
        instanceOf_begin = filter_expr_.find('RR:')
        if instanceOf_begin != -1:
            print('\t+++instanceOf filtering:+++')
            # raw string, so that \) reaches the regular expression engine unchanged
            instanceOf_end = self.findStanzaEndPos(r"AND|OR|\)", filter_expr_, instanceOf_begin)
            filter_expr_instanceOf = filter_expr_[instanceOf_begin:instanceOf_end]
            print('\t\tFound in news: ' + str(news_['instancesOf']))
            print('\t\tRequired in filter: ' + filter_expr_instanceOf)
            instanceOf_res = self.simpleCheck(str(news_['instancesOf']), filter_expr_instanceOf, ' OR RR:')
            print('\t+++instanceOf result = ' + str(instanceOf_res) + '+++')
            # replace instanceOf in the expression
            filter_expr_ = filter_expr_.replace(filter_expr_[instanceOf_begin:instanceOf_end], str(instanceOf_res) + ' ')
            print('\t+++Filtering expression progressed = ' + filter_expr_ + '+++')

Language Filtering

Language filtering uses the same approach, with a slight caveat. A language filter written in the &language= form is first rewritten into the uniform L: form used by the other stanzas:

    	
        language_begin = filter_expr_.find('&language=')
        if language_begin != -1:
            filter_expr_ = filter_expr_.replace('&language=', 'AND L:')
            language_begin = language_begin + 4   # skip 'AND ' so that we point at 'L:'
        if language_begin == -1:
            language_begin = filter_expr_.find('L:')
        if language_begin != -1:
            print('\t===Language filtering:===')
            # the language stanza is expected to close the expression
            filter_expr_language = filter_expr_[language_begin:len(filter_expr_)]
            filter_expr_language = filter_expr_language.lower()
            print('\t\tFound in news: ' + str(news_['language']))
            print('\t\tRequired in filter: ' + filter_expr_language)
            language_res = self.simpleCheck(str(news_['language']), filter_expr_language[2:], ' OR ')
            print('\t===language result = ' + str(language_res) + '===')
            # replace language in the expression
            filter_expr_ = filter_expr_.replace(filter_expr_[language_begin:len(filter_expr_)], str(language_res) + ' ')
        filter_expr_ = filter_expr_.replace('OR', 'or').replace('AND', 'and')
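The offset of 4 in the code above is easiest to see on a concrete string. The sketch below isolates the rewriting of the &language= form. The function name normalize_language and the sample expression are hypothetical.

```python
def normalize_language(filter_expr):
    """Rewrite '&language=' as 'AND L:' and return the new expression
    together with the position of the 'L:' stanza (-1 if there is none)."""
    begin = filter_expr.find("&language=")
    if begin != -1:
        filter_expr = filter_expr.replace("&language=", "AND L:")
        begin += 4   # 'AND ' is four characters long, so this lands on 'L:'
    else:
        begin = filter_expr.find("L:")
    return filter_expr, begin


sample = "NP:RNP AND (G:4 AND NS:RTRS)&language=en"
rewritten, language_begin = normalize_language(sample)
print(rewritten)                    # NP:RNP AND (G:4 AND NS:RTRS)AND L:en
print(rewritten[language_begin:])   # L:en
```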

Putting It All Together

This part is very simple and, I find, elegant. We use Python's capability to evaluate the resulting boolean expression:

    	
            return eval(filter_expr_)

This gives us the answer of whether the received news story has fully passed the filtering requirements specified by the RDP filter expression.
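Since eval executes whatever Python it is given, an application that accepts filter expressions from outside may prefer a stricter evaluator. The sketch below is an alternative technique, not what the example above uses. It parses the expression with the standard ast module and accepts only True, False, and, or and parentheses. The function name evaluate_boolean is hypothetical.

```python
import ast


def evaluate_boolean(expression):
    """Evaluate an expression made only of True, False, and, or, parentheses."""
    tree = ast.parse(expression.strip(), mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BoolOp):
            results = [walk(value) for value in node.values]
            return all(results) if isinstance(node.op, ast.And) else any(results)
        if isinstance(node, ast.Constant) and isinstance(node.value, bool):
            return node.value
        raise ValueError("unsupported element in filter expression: " + ast.dump(node))

    return walk(tree)


print(evaluate_boolean("True and ((False and True) or (True)) and True"))  # True
```

An expression that still contains an unresolved stanza, or anything other than boolean logic, raises ValueError instead of being executed.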

I would like to include here examples of RDP filtering expressions that can be used for testing:


                "--filter_expression",
                "NP:DNP,RNP,MF,CDS AND ((G:B1 AND NS:RTRS)  OR (RR:1300,1200,1105,850,321,347,529,266,1554,1536,1697,2096,1255,1728,1680,2123,2224,2266,1908,846)) AND L:EN,ZH",
                "--filter_expression",
                "NP:RNP OR NP:DNP OR NP:MF",
                "--filter_expression",
                "NP:U AND NP:MNI",
                "--filter_expression",
                "(NP:RNP OR NP:DNP)&NP:M  ",
                "--filter_expression",
                "NP:U",
                "--filter_expression",
                "NP:DNP,RNP,MF,OMXN,NBD,OE,BL,BXB,CZS,DA,FN,FA,GER,GNG,HX,I,N,NW,PX,P,RDN,SP,SW,SWF,SWI,TS,UKI,RX AND (G:4,1F,4R,8B,7N,85,8R,8M,9Y,1W,8T,2E,19,90,5M,3D,6A,46,6I,5J,6X,7M,4G,3N,5Y,A3,2Z,1C,55,6V,30,8Z,7J,4H,38,71 AND NS:RTRS)&language=en",
                "--filter_expression",
                "NP:DNP,RNP,MF,J AND ((G:41 AND NS:RTRS)  OR  (RR:1561,1707,1716,2810,2796,1263,2543,2796,2803)) AND L:JA",
                "--filter_expression",
                "NP:DNP,RNP,MF AND ((G:2V AND NS:RTRS)  OR  (RR:1200,1300,1105,850,321,347,529,266,1558,654,1538,647,1547,646)) AND L:EN,ES"