A Practical Approach to Understanding and Ingesting Intelligent Tagging Output for Your Use Case
In this article we will go over some general best practices for working with Intelligent Tagging output for the first time. Start by passing a small sample of text into the Intelligent Tagging engine to learn about the types and quality of the metadata returned. Once you have learned from this sample, you can apply what you learned to larger volumes of text and, finally, to production. I call this out because, depending on your use case and your ability to stitch everything together with a well-refined or well-trained algorithm, you may be able to handle noise without it affecting your precision and recall or your users' experience. Before you can get to that stage, you must take iterative steps toward understanding the Intelligent Tagging output. We will now take you through some of the more salient points.
First up are some General Rules that I feel apply to all metadata/entities recognized by Intelligent Tagging:
- All entities should have independent configurable Blacklists and Whitelists.
- Make sure to extract the relevance, confidence and score with every entity.
- In whatever system calls Intelligent Tagging and receives its data, put configurable variables in place so that, as you learn and refine your process, you can easily adjust the Intelligent Tagging API's levers.
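One way to keep these levers adjustable is to hold them in a small per-entity-type settings object. The sketch below is only an illustration: the class name `TaggingConfig`, its field names, and the rule that a whitelist entry overrides a blacklist entry are all assumptions, not part of Intelligent Tagging.

```python
from dataclasses import dataclass, field


@dataclass
class TaggingConfig:
    """Hypothetical per-entity-type levers; names and defaults are illustrative."""
    min_confidence: float = 0.5   # discard below this confidence
    high_relevance: float = 0.8   # at or above: High Value field
    low_relevance: float = 0.2    # at or below: Low Value field
    blacklist: set[str] = field(default_factory=set)
    whitelist: set[str] = field(default_factory=set)

    def allows(self, name: str) -> bool:
        """Case-insensitive list check. Assumption: whitelist wins over blacklist."""
        key = name.casefold()
        if key in {w.casefold() for w in self.whitelist}:
            return True
        return key not in {b.casefold() for b in self.blacklist}


# Independent settings per entity type, adjustable without code changes elsewhere.
CONFIG = {
    "Company": TaggingConfig(blacklist={"Example Newswire"}),
    "Person": TaggingConfig(min_confidence=0.5),
}
```

Keeping one object per entity type means each threshold and list can be tuned separately as you learn from your sample.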
Intelligent Tagging will return output for any text input - this obviously varies per document - and we need to handle flexible output for each document input.
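Because the set of returned items differs per document, it can help to group whatever comes back before applying any rules. This is a minimal sketch that assumes the full response is a JSON object keyed by URI, with each tagged item carrying `_typeGroup` (and, for entities, `_type`) as in the samples in this article; the function name `group_by_type` is hypothetical.

```python
from collections import defaultdict


def group_by_type(response: dict) -> dict:
    """Group tagged items by _type, falling back to _typeGroup."""
    groups = defaultdict(list)
    for item in response.values():
        # Skip anything that is not a tagged item (assumed: no _typeGroup key).
        if not isinstance(item, dict) or "_typeGroup" not in item:
            continue
        label = item.get("_type", item["_typeGroup"])
        groups[label].append(item)
    return dict(groups)


# Made-up miniature response for illustration only.
response = {
    "doc": {"info": {}},
    "uri-1": {"_typeGroup": "entities", "_type": "Company", "name": "Volkswagen"},
    "uri-2": {"_typeGroup": "topics", "name": "Environment"},
}
groups = group_by_type(response)
```

A document with no companies simply produces no "Company" key, so downstream code should use `groups.get("Company", [])` rather than assume a type is present.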
Intelligent Tagging provides quite a few entities and metadata types, but for today's conversation we are covering the following:
- Resolved Companies
- Unresolved Companies
- Social Tags
- Topics
- Resolved People
- Unresolved People
- Industries
- Technologies
- Industry Terms
- Country
- Organization
- Entity Relationships will be discussed in a later article.
You will see a lot of duplicate recommendations below so feel free to skim the next part as needed.
Resolved Companies
- Discard Companies based on a configurable Blacklist.
- Also extract the relevance, confidence and score.
- Discard any Companies with a confidence score below .5.
- Any Companies with a relevance score of .0 should be categorized as the publisher of the document, or as the provider of a rating quoted within the document.
- Additionally add the Company to the High Value Company field when the relevance score is >= .8.
- Additionally add the Company to the Medium Value Company field when the relevance score is greater than .2 and less than .8.
- Additionally add the Company to the Low Value Company field when the relevance score is <= .2.
- Use the common name from the resolution.
- Grab all resolutions associated with the company entity.
- Additionally add the topmost parent company to the filter fields (All and (High or Medium or Low)), following the same confidence and relevance rules as above.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/comphash-1/8bda654b-0dc9-31e9-8fe9-5fd6a99cc329":{
"_typeGroup":"entities",
"_type":"Company",
"forenduserdisplay":"false",
"name":"Volkswagen",
"nationality":"N/A",
"confidencelevel":"0.882",
"resolutions":[
{
"name":"Porsche Automobil Holding SE",
"permid":"4295869130",
"primaryric":"PSHG_p.DE",
"ispublic":"true",
"commonname":"Porsche Hldg",
"topmostPublicParent":true,
"id":"https://permid.org/1-4295869130"
},
{
"name":"Volkswagen AG",
"permid":"4295869244",
"primaryric":"VOWG_p.DE",
"ispublic":"true",
"commonname":"Volkswagen",
"score":0.9871257,
"id":"https://permid.org/1-4295869244"
}
],
"_typeReference":"http://s.opencalais.com/1/type/em/e/Company",
"instances":[
{
"detection":"[]Volkswagen[ found itself embroiled in controversy after it]",
"exact":"Volkswagen",
"suffix":" found itself embroiled in controversy after it",
"offset":0,
"length":10
}
],
"relevance":0.8,
"confidence":{
"statisticalfeature":"0.905",
"dblookup":"0.0",
"resolution":"0.9871257",
"aggregate":"0.882"
}
}
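The Resolved Companies rules can be sketched against an entity shaped like the sample above. Several things here are assumptions read off that one sample rather than documented behavior: that the resolution carrying a `score` is the company itself, that the one flagged `topmostPublicParent` is the parent, and that `confidence.aggregate` is the confidence to threshold. The function and tier names are hypothetical.

```python
def relevance_tier(relevance, high=0.8, low=0.2):
    """Map a relevance score to the High / Medium / Low value fields."""
    if relevance >= high:
        return "high"
    if relevance <= low:
        return "low"
    return "medium"


def process_company(entity, min_confidence=0.5, blacklist=frozenset()):
    """Return a small record for a resolved company, or None if discarded."""
    # Confidence values arrive as strings in the sample, so convert first.
    if float(entity["confidence"]["aggregate"]) < min_confidence:
        return None
    resolutions = entity.get("resolutions", [])
    # Assumption: the scored resolution is the company; fall back to the first.
    primary = next((r for r in resolutions if "score" in r),
                   resolutions[0] if resolutions else None)
    # Assumption: the flagged resolution is the topmost parent.
    parent = next((r for r in resolutions if r.get("topmostPublicParent")), None)
    name = primary["commonname"] if primary else entity["name"]
    if name in blacklist:
        return None
    relevance = float(entity.get("relevance", 0.0))
    # Relevance of exactly .0 marks a publisher or rating source.
    tier = "publisher_or_rater" if relevance == 0.0 else relevance_tier(relevance)
    return {
        "name": name,
        "parent": parent["commonname"] if parent else None,
        "tier": tier,
    }


# Trimmed copy of the sample entity above.
volkswagen = {
    "name": "Volkswagen",
    "relevance": 0.8,
    "confidence": {"aggregate": "0.882"},
    "resolutions": [
        {"commonname": "Porsche Hldg", "topmostPublicParent": True},
        {"commonname": "Volkswagen", "score": 0.9871257},
    ],
}
record = process_company(volkswagen)
```

The same `relevance_tier` helper applies to the other entity types that use the .8 and .2 cut-offs.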
Unresolved Companies
- Use the markup name.
- Additionally add the Company to the High Value Unresolved Company field when the relevance score is >= .8.
- Additionally add the Company to the Medium Value Unresolved Company field when the relevance score is greater than .2 and less than .8.
- Additionally add the Company to the Low Value Unresolved Company field when the relevance score is <= .2.
- Discard Companies based on a configurable Blacklist.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/comphash-1/78076b5f-9c68-397b-83fd-207daf70ee1d":{
"_typeGroup":"entities",
"_type":"Company",
"forenduserdisplay":"true",
"name":"Blocker Excavation Corp.",
"nationality":"N/A",
"confidencelevel":"1.0",
"_typeReference":"http://s.opencalais.com/1/type/em/e/Company",
"instances":[
{
"detection":"[Smith is President and CEO of ]Blocker Excavation Corp.[ in Oswego IL.]",
"prefix":"Smith is President and CEO of ",
"exact":"Blocker Excavation Corp.",
"suffix":" in Oswego IL.",
"offset":35,
"length":24
} ],
"relevance":0.8,
"confidence":{
"statisticalfeature":"0.834",
"dblookup":"0.0",
"resolution":"0.0",
"aggregate":"1.0"
}
}
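Comparing the two company samples, the unresolved one has no `resolutions` array. A minimal sketch of splitting the two kinds on that basis follows; treating a missing or empty `resolutions` array as "unresolved" is an assumption drawn from these samples, and `split_companies` is a hypothetical name.

```python
def split_companies(entities):
    """Return (resolved, unresolved) lists of company markup names."""
    resolved, unresolved = [], []
    for entity in entities:
        if entity.get("_type") != "Company":
            continue
        # Assumption: no resolutions array (or an empty one) means unresolved.
        target = resolved if entity.get("resolutions") else unresolved
        target.append(entity["name"])
    return resolved, unresolved


# Trimmed entities modeled on the two company samples in this article.
entities = [
    {"_type": "Company", "name": "Volkswagen",
     "resolutions": [{"commonname": "Volkswagen"}]},
    {"_type": "Company", "name": "Blocker Excavation Corp."},
    {"_type": "Person", "name": "John Smith"},
]
resolved, unresolved = split_companies(entities)
```

Keeping the two lists apart lets you feed them into separate fields and measure their quality separately.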
Social Tags
- Do not add Social Tags that appear on a configurable Blacklist.
- Additionally add Social Tags to the High Value Social Tags field when the importance score (the importance field) is 1 star.
- Additionally add Social Tags to the Medium Value Social Tags field when the importance score is 2 stars.
- Additionally add Social Tags to the Low Value Social Tags field when the importance score is 3 stars.
- Only add Social Tags whose ForEndUserDisplay attribute is true.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/dochash-1/c178fd67-5c21-3259-9aad-8b42821635a8/SocialTag/7":{
"_typeGroup":"socialTag",
"id":"http://d.opencalais.com/dochash-1/c178fd67-5c21-3259-9aad-8b42821635a8/SocialTag/7",
"socialTag":"http://d.opencalais.com/genericHasher-1/7d05dccf-3345-37b6-9d1c-304c78678bcd",
"forenduserdisplay":"true",
"name":"Station wagons",
"importance":"2",
"originalValue":"Station wagons"
}
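The Social Tags rules reduce to a lookup on the `importance` field plus two filters. The sketch below assumes, as in the sample, that `importance` and `forenduserdisplay` arrive as strings; the mapping and function names are hypothetical.

```python
# Importance 1 is the strongest signal, 3 the weakest, per the rules above.
IMPORTANCE_TO_TIER = {"1": "high", "2": "medium", "3": "low"}


def social_tag_tier(tag, blacklist=frozenset()):
    """Return 'high', 'medium' or 'low' for a social tag, or None if skipped."""
    if tag.get("forenduserdisplay") != "true":
        return None
    if tag["name"] in blacklist:
        return None
    # str() tolerates the value arriving as a number instead of a string.
    return IMPORTANCE_TO_TIER.get(str(tag.get("importance")))


# Trimmed copy of the sample social tag above.
station_wagons = {
    "_typeGroup": "socialTag",
    "forenduserdisplay": "true",
    "name": "Station wagons",
    "importance": "2",
}
tier = social_tag_tier(station_wagons)
```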
Topics
- Only add Topics whose ForEndUserDisplay attribute is true.
- Discard Topics with a score < .4.
- Additionally add Topics to the High Value Topics field when the score is >= .8.
- Additionally add Topics to the Medium Value Topics field when the score is greater than .4 and less than .8.
- Additionally add Topics to the Low Value Topics field when the score is <= .4 (given the discard rule above, this means a score of exactly .4).
- Do not add Topics that appear on a configurable Blacklist.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/dochash-1/c178fd67-5c21-3259-9aad-8b42821635a8/cat/1":{
"_typeGroup":"topics",
"forenduserdisplay":"false",
"score":1,
"name":"Environment"
}
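A short sketch of the Topics rules shows how the discard threshold and the tiers interact. It assumes the `score` field shown in the sample is the value being thresholded; `topic_tier` is a hypothetical name.

```python
def topic_tier(topic, discard_below=0.4, high=0.8):
    """Return the value tier for a topic, or None if it should be dropped."""
    if topic.get("forenduserdisplay") != "true":
        return None
    score = float(topic["score"])
    if score < discard_below:
        return None
    if score >= high:
        return "high"
    if score > discard_below:
        return "medium"
    return "low"  # only reached when score equals the discard threshold


# Trimmed copy of the sample topic above; it is not flagged for display.
environment = {"_typeGroup": "topics", "forenduserdisplay": "false",
               "score": 1, "name": "Environment"}
sample_result = topic_tier(environment)
```

Note that the sample topic would be skipped under these rules despite its score of 1, because its `forenduserdisplay` value is "false".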
Resolved People
- Discard People based on a configurable Blacklist.
- Discard any People with a confidence score below .5.
- Discard any People with a relevance score of .0.
- Additionally add People to the High Value People field when the relevance score is >= .8.
- Additionally add People to the Medium Value People field when the relevance score is greater than .2 and less than .8.
- Additionally add People to the Low Value People field when the relevance score is <= .2.
- Use the commonname field from the resolution.
- Discard resolutions that have a score under .1.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/pershash-1\/a35b1bff-5635-33b1-910f-1f3369db8e44":{
"_typeGroup":"entities",
"_type":"Person",
"forenduserdisplay":"true",
"name":"Jeff Bezos",
"persontype":"N\/A",
"nationality":"N\/A",
"confidencelevel":"0.999",
"firstname":"Jeff",
"lastname":"Bezos",
"commonname":"Jeff Bezos",
"_typeReference":"http://s.opencalais.com/1/type/em/e/Person",
"instances":[
{
"detection":"[/Author><TableCount>0</TableCount><Body>\n Column: ]Jeff Bezos[ is an owner who knows how to deliver\n By Jack]",
"prefix":"/Author><TableCount>0</TableCount><Body>\n Column: ",
"exact":"Jeff Bezos",
"suffix":" is an owner who knows how to deliver\n By Jack",
"offset":77,
"length":10
}
],
"relevance":0.8,
"confidence":{
"statisticalfeature":"0.999",
"dblookup":"0.95",
"resolution":"0.80842197",
"aggregate":"0.999"
},
"resolutions":[
{
"name":"Jeffrey P. Bezos",
"personid":"90915",
"paid":"34415676965",
"officerid":"35834",
"commonname":"Jeff Bezos",
"score":0.80842197
}
]
}
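For Resolved People, the extra step is filtering the resolutions themselves by score before taking the common name. The sketch below assumes `confidence.aggregate` is the confidence to threshold and picks the highest-scoring surviving resolution, which is a design choice of this example rather than something the rules above specify.

```python
def person_display_name(entity, min_confidence=0.5, min_resolution_score=0.1,
                        blacklist=frozenset()):
    """Return the commonname of the best resolution, or None if discarded."""
    if float(entity["confidence"]["aggregate"]) < min_confidence:
        return None
    if float(entity.get("relevance", 0.0)) == 0.0:
        return None
    # Drop weak resolutions; entries without a score count as 0.0.
    kept = [r for r in entity.get("resolutions", [])
            if float(r.get("score", 0.0)) >= min_resolution_score]
    if not kept:
        return None  # nothing usable left; handle as unresolved elsewhere
    best = max(kept, key=lambda r: float(r["score"]))
    name = best["commonname"]
    return None if name in blacklist else name


# Trimmed copy of the sample person above.
bezos = {
    "name": "Jeff Bezos",
    "relevance": 0.8,
    "confidence": {"aggregate": "0.999"},
    "resolutions": [{"name": "Jeffrey P. Bezos", "commonname": "Jeff Bezos",
                     "score": 0.80842197}],
}
display_name = person_display_name(bezos)
```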
Unresolved People
- Discard People without resolutions whose confidence is below .8.
- Additionally add People to the High Value Unresolved People field when the relevance score is >= .8.
- Additionally add People to the Medium Value Unresolved People field when the relevance score is greater than .2 and less than .8.
- Additionally add People to the Low Value Unresolved People field when the relevance score is <= .2.
- Discard People based on a configurable Blacklist.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/pershash-1/b4c193fc-8b76-3932-984d-f1c190b9b1f0":{
"_typeGroup":"entities",
"_type":"Person",
"forenduserdisplay":"true",
"name":"John Smith",
"persontype":"economic",
"nationality":"N\/A",
"confidencelevel":"0.999",
"firstname":"John",
"lastname":"Smith",
"commonname":"John Smith",
"_typeReference":"http://s.opencalais.com/1/type/em/e/Person",
"instances":[
{
"detection":"[]John Smith[ is President and CEO of Blocker Excavation Corp.]",
"exact":"John Smith",
"suffix":" is President and CEO of Blocker Excavation Corp.",
"offset":0,
"length":10
}
],
"relevance":0.8,
"confidence":{
"statisticalfeature":"0.999",
"dblookup":"0.97",
"resolution":"0.0",
"aggregate":"0.999"
}
}
Industries
- Discard Industries with a relevance below .5.
- Additionally add the Industry to the High Value Industries field when the relevance score is >= .8.
- Additionally add the Industry to the Medium Value Industries field when the relevance score is greater than .2 and less than .8.
- Additionally add the Industry to the Low Value Industries field when the relevance score is <= .2.
- Discard Industries based on a configurable Blacklist.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/dochash-1/c178fd67-5c21-3259-9aad-8b42821635a8/Industry/1":{
"_typeGroup":"industry",
"forenduserdisplay":"false",
"name":"Auto & Truck Manufacturers - NEC",
"rcscode":"B:1292",
"trbccode":"5310101010",
"permid":"4294951709",
"relevance":0.8
}
Technologies
- Additionally add Technologies to the High Value Technologies field when the relevance score is >= .8.
- Additionally add Technologies to the Medium Value Technologies field when the relevance score is greater than .2 and less than .8.
- Additionally add Technologies to the Low Value Technologies field when the relevance score is <= .2.
- Do not add Technologies that appear on a configurable Blacklist.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/genericHasher-1/d2529e6a-4e4b-3f5e-8253-fc7471b6e904":{
"_typeGroup":"entities",
"_type":"Technology",
"forenduserdisplay":"false",
"name":"smartphones",
"_typeReference":"http://s.opencalais.com/1/type/em/e/Technology",
"instances":[
{
"detection":"[As ]smartphones[ get larger the increased demands from the]",
"prefix":"As ",
"exact":"smartphones",
"suffix":" get larger the increased demands from the",
"offset":3,
"length":11
},
{
"detection":"[so it was to be expected that the South Korean ]smartphones[ would pick up some extra power. n nMophie Juice]",
"prefix":"so it was to be expected that the South Korean ",
"exact":"smartphones",
"suffix":" would pick up some extra power. n nMophie Juice",
"offset":1108,
"length":11
}
],
"relevance":0.2
}
Industry Terms
- Additionally add IndustryTerms to the High Value IndustryTerms field when the relevance score is >= .8.
- Additionally add IndustryTerms to the Medium Value IndustryTerms field when the relevance score is greater than .2 and less than .8.
- Additionally add IndustryTerms to the Low Value IndustryTerms field when the relevance score is <= .2.
- Do not add IndustryTerms that appear on a configurable Blacklist.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/genericHasher-1/d79374b4-3408-3a42-b74a-1e460d36bba0":{
"_typeGroup":"entities",
"_type":"IndustryTerm",
"forenduserdisplay":"false",
"name":"battery energy",
"_typeReference":"http://s.opencalais.com/1/type/em/e/IndustryTerm",
"instances":[
{
"detection":"[capacities together because some of the Mophie ]battery energy[ will be lost as it charges up the battery in the]",
"prefix":"capacities together because some of the Mophie ",
"exact":"battery energy",
"suffix":" will be lost as it charges up the battery in the",
"offset":1580,
"length":14
}
],
"relevance":0.2
}
Organizations
- Discard Organizations based on a configurable Blacklist.
- Discard Organizations with a relevance lower than .5.
Sample JSON response from Intelligent Tagging:
"http://d.opencalais.com/genericHasher-1/4a91acf8-33e3-3c95-b678-377770820978":{
"_typeGroup":"entities",
"_type":"Organization",
"forenduserdisplay":"false",
"name":"Charge Force",
"organizationtype":"N/A",
"nationality":"N/A",
"_typeReference":"http://s.opencalais.com/1/type/em/e/Organization",
"instances":[
{
"detection":"[Mophie’s Juice Pack cases now come with its ‘]Charge Force[’ wireless power system. Not only can you]",
"prefix":"Mophie’s Juice Pack cases now come with its ‘",
"exact":"Charge Force",
"suffix":"’ wireless power system. Not only can you",
"offset":2425,
"length":12
},
{
"detection":"[without taking the Galaxy out of the case, the ]Charge Force[ system uses magnets to help guide your case into]",
"prefix":"without taking the Galaxy out of the case, the ",
"exact":"Charge Force",
"suffix":" system uses magnets to help guide your case into",
"offset":2586,
"length":12
}
],
"relevance":0.2
}
Country
- Discard Countries based on a configurable Blacklist.
- Use the short name field from the resolution.
- Discard Countries without a resolution.
- Discard Countries with a relevance lower than .5.
Sample JSON response from Intelligent Tagging (note that this sample is a City entity, shown here for its resolution fields):
"http://d.opencalais.com/genericHasher-1/1d1529b7-da5f-3884-8de0-c765b3b7d3a3":{
"_typeGroup":"entities",
"_type":"City",
"forenduserdisplay":"true",
"name":"Washington",
"resolutions":[
{
"name":"Washington,United States",
"shortname":"Washington",
"latitude":"38.89",
"longitude":"-77.03",
"containedbycountry":"United States",
"rcscode":"G:11",
"permid":"100082"
}
],
"_typeReference":"http://s.opencalais.com/1/type/em/e/City",
"instances":[
{
"detection":"[on Republican U.S. lawmakers practicing near ]Washington[ for a charity baseball game, wounding senior]",
"prefix":"on Republican U.S. lawmakers practicing near ",
"exact":"Washington",
"suffix":" for a charity baseball game, wounding senior",
"offset":171,
"length":10
}
],
"relevance":0.8
}
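The Country rules can be sketched as follows. The sample above is a City, so this example assumes that Country resolutions carry the same `shortname` field; the test entity is made up for illustration and `country_short_name` is a hypothetical name.

```python
def country_short_name(entity, min_relevance=0.5, blacklist=frozenset()):
    """Return the resolution's short name, or None if the country is discarded."""
    resolutions = entity.get("resolutions") or []
    if not resolutions:
        return None  # no resolution: discard
    if float(entity.get("relevance", 0.0)) < min_relevance:
        return None
    # Assumption: Country resolutions expose "shortname" like the City sample.
    short = resolutions[0].get("shortname")
    if short is None or short in blacklist:
        return None
    return short


# Made-up Country entity, modeled on the structure of the City sample.
usa = {
    "_type": "Country",
    "name": "U.S.",
    "resolutions": [{"name": "United States", "shortname": "United States"}],
    "relevance": 0.8,
}
short_name = country_short_name(usa)
```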
Final thoughts
Now that you have the proper levers in place, you will want to iterate over the data a couple of times (2-3) with a set of documents that is reasonably sized relative to your overall data set. Understand how precision and recall change with confidence, with high, medium and low relevance, with score, and between unresolved and resolved entities. In later blog posts we can cover what to do with Intelligent Tagging derived relationships and how to use the Intelligent Tagging output to connect to the Knowledge Graph.
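To make those iterations concrete, you can compare the entities kept at each threshold against a hand-labeled set for the same documents. This is a minimal sketch: the labeled set, the confidence given to "Charge Force", and the thresholds tried are all illustrative assumptions.

```python
def precision_recall(predicted: set, expected: set):
    """Precision and recall of predicted entity names against a labeled set."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Entity name -> confidence, as extracted from a document (values illustrative).
tagged = {"Volkswagen": 0.882, "Blocker Excavation Corp.": 1.0, "Charge Force": 0.3}
# Hypothetical hand-labeled companies for the same document.
expected = {"Volkswagen", "Blocker Excavation Corp."}

results = {}
for threshold in (0.2, 0.5, 0.9):
    predicted = {name for name, conf in tagged.items() if conf >= threshold}
    results[threshold] = precision_recall(predicted, expected)
    p, r = results[threshold]
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Running the same loop over relevance tiers, or separately for resolved and unresolved entities, shows which lever moves precision and which moves recall for your data.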