

Text Annotation

You've obtained a gleaming new token through Okta or RSA, you've grabbed some of your favorite documents, and you're ready to start extracting entities with Kensho NERD! Follow along to make your first requests through the NERD API. Or, jump straight to the full API Reference.

Annotation Quickstart

The NERD API supports two workflows for obtaining annotations for a text document:

  • A synchronous workflow through the /annotations-sync endpoint that is useful for real-time annotations
  • An asynchronous workflow through the /annotations-async endpoint that is useful for bulk processing of documents

The two endpoints expect input of the same shape.

Keeping things simple, let's make our first request to the synchronous (real-time) endpoint. We're going to extract Wikimedia entities from the following text:

"The Supreme Court in Nairobi rendered its decision to the AU on Wednesday."

This example is trickier than it seems! Both "Supreme Court" and "AU" can only be linked properly if the model takes into account the surrounding entity of "Nairobi" and the joint context of the three entities together. Standard entity extraction solutions might simply return the most "salient" Supreme Court, i.e., the Supreme Court of the United States, and any of the many meanings of the acronym AU, for example Australia and American University. NERD's context awareness enables it to link even the toughest of entities consistently and accurately.

Let's hit the /annotations-sync endpoint and get our results. In Python, this could look like:

import json
import requests

NERD_API_URL = 'https://nerd.kensho.com/api/v1/annotations-sync'

data = {
    "knowledge_bases": ["wikimedia"],
    "text": "The Supreme Court in Nairobi rendered its decision to the AU on Wednesday."
}

# Send the document to the NERD API
response = requests.post(
    NERD_API_URL,
    data=json.dumps(data),
    headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer <token obtained from login>'
    }
)
annotations_results = response.json()

The results represent a list of entity annotations. Each annotation includes a start and end location in the text and the ID, name (label), and type of the entity in the selected knowledge base. For this example, we'd see results like:

{
  "results": [
    {
      "annotations": [
        {
          "start_index": 4,
          "end_index": 17,
          "text": "Supreme Court",
          "entity_kb_id": "2368297",
          "entity_label": "Supreme Court of Kenya",
          "entity_type": "GOVERNMENT",
          "ned_score": 0.1703,
          "ner_score": 1.0
        },
        {
          "start_index": 21,
          "end_index": 28,
          "text": "Nairobi",
          "entity_kb_id": "3870",
          "entity_label": "Nairobi",
          "entity_type": "CITY",
          "ned_score": 0.2128,
          "ner_score": 1.0
        },
        {
          "start_index": 58,
          "end_index": 60,
          "text": "AU",
          "entity_kb_id": "7159",
          "entity_label": "African Union",
          "entity_type": "NGO",
          "ned_score": 0.0919,
          "ner_score": 1.0
        }
      ],
      "knowledge_base": "wikimedia"
    }
  ]
}

NERD has successfully disambiguated all three of our entities: the Supreme Court of Kenya, the city of Nairobi, and the African Union!
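
The start_index and end_index fields are character offsets into the submitted text (with end_index exclusive, as the offsets above show), so you can slice the original string to recover each mention. Here is a minimal sketch, reusing the data and annotations_results variables from the request above:

# Print each linked entity alongside the span of text it was found in
for result in annotations_results["results"]:
    for annotation in result["annotations"]:
        mention = data["text"][annotation["start_index"]:annotation["end_index"]]
        print(f'{mention} -> {annotation["entity_label"]} ({annotation["entity_type"]})')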

Capital IQ Entities

To extract Capital IQ entities, simply replace "wikimedia" with "capiq" and add an extra parameter, "originating_entity_id". Optimized for the financial domain, the Capital IQ variant of NERD allows a user to specify an "originating entity", i.e., the entity that issued the document in question. For example, the company whose earnings call transcript or 10-K filing is passed through NERD would be the originating entity for that document. The value of this parameter should be the originating entity's Capital IQ ID. For example:

import json
import requests

NERD_API_URL = 'https://nerd.kensho.com/api/v1/annotations-sync'

data = {
    "knowledge_bases": ["capiq"],
    "text": "The LEGO Group today reported first half earnings for the six months ending June 30, 2020.",
    # Capital IQ ID of LEGO A/S. Profile page: https://www.capitaliq.com/CIQDotNet/company.aspx?companyid=701221
    "originating_entity_id": "701221"
}

# Send the document to the NERD API
response = requests.post(
    NERD_API_URL,
    data=json.dumps(data),
    headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer <token obtained from login>'
    }
)
annotations_results = response.json()

with the results:

{
  "results": [
    {
      "annotations": [
        {
          "end_index": 14,
          "entity_kb_id": "701221",
          "entity_label": "LEGO A/S",
          "entity_type": "ORG",
          "ned_score": 0.9993,
          "ner_score": 0.9973,
          "start_index": 0,
          "text": "The LEGO Group"
        }
      ],
      "knowledge_base": "capiq"
    }
  ]
}

Providing an originating entity ID allows NERD to take even more context into account and therefore produce more precise annotations. If there isn't an appropriate originating entity for a document, such as in the case of a news article, simply enter "0" or omit the field.
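
For instance, a request for a news article might look like the sketch below (the article text here is just an illustrative placeholder):

data = {
    "knowledge_bases": ["capiq"],
    # No appropriate originating entity for a news article, so the field is omitted;
    # passing "originating_entity_id": "0" would be equivalent.
    "text": "Shares of several carmakers rose on Tuesday after strong quarterly results."
}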

Asynchronous Workflows

The asynchronous endpoint allows for bulk processing. Each document uploaded to this endpoint is immediately assigned a job_id, which can then be used to request annotations at a later time. We recommend using this endpoint when processing a large batch of documents, e.g., in a backfill. Here is an example of using the async endpoint to upload a document, receive a job_id, then poll the server until the results are ready, or until 5 minutes have elapsed:

import json
import time

import requests

NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-async"
TIMEOUT = 300  # 5 minute timeout, in seconds

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer <token obtained from login>",
}
data = ...  # your document, choice of knowledge base, and possible originating entity, as per above

response = requests.post(NERD_API_URL, data=json.dumps(data), headers=headers)
# The POST method, if successful, returns a response including a `job_id` key and value.
if response.status_code != 202:
    raise RuntimeError("Error submitting document to the NERD API")
job_id = response.json()["job_id"]

start_time = time.time()
# Poll until results are ready or the timeout is reached
while time.time() <= start_time + TIMEOUT:
    response = requests.get(NERD_API_URL, params={"job_id": job_id}, headers=headers)
    if response.status_code != 200:
        raise RuntimeError(f"Error retrieving job results for {job_id}")
    if response.json().get("status") == "success":
        break
    time.sleep(1)

if not (
    response.json().get("results")
    and response.json().get("results")[0]
    and response.json().get("results")[0].get("annotations")
):
    raise TimeoutError(f"Job {job_id} timed out")
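
Once the job has succeeded, the response body carries the annotations, which (per the results/annotations check above) follow the same structure returned by the synchronous endpoint, so they can be parsed the same way. A brief sketch:

annotations_results = response.json()
for result in annotations_results["results"]:
    for annotation in result["annotations"]:
        print(f'{annotation["text"]}: {annotation["entity_label"]} ({annotation["entity_type"]})')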