
Text Annotation

v1 (Latest)

You've obtained a gleaming new token through Okta or RSA, you've grabbed some of your favorite documents, and you're ready to start extracting entities with Kensho NERD! Follow along to make your first requests through the NERD API. Or, jump straight to the full API Reference.

Annotation Quickstart

The NERD API supports two workflows for obtaining annotations for a text document:

  • A synchronous workflow through the /annotations-sync endpoint that is useful for real-time annotations
  • An asynchronous workflow through the /annotations-async endpoint that is useful for bulk processing of documents

The two endpoints expect input of the same shape.
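As a sketch, only the endpoint path differs between the two workflows; the request body (field names taken from the examples below) is the same JSON object either way:

```python
import json

BASE_URL = "https://nerd.kensho.com/api/v1"
SYNC_URL = BASE_URL + "/annotations-sync"    # real-time
ASYNC_URL = BASE_URL + "/annotations-async"  # bulk processing

# Minimal request body, identical in shape for both endpoints
payload = {
    "knowledge_bases": ["wikimedia"],
    "text": "The Supreme Court in Nairobi rendered its decision to the AU on Wednesday.",
}
body = json.dumps(payload)  # both endpoints expect a JSON-encoded body
```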

Keeping things simple, let's make our first request to the synchronous (real-time) endpoint. We're going to extract Wikimedia entities from the following text:

"The Supreme Court in Nairobi rendered its decision to the AU on Wednesday."

This example is trickier than it seems! Both "Supreme Court" and "AU" can only be linked properly if the model takes into account the surrounding entity of "Nairobi" and the joint context of the three entities together. Standard entity extraction solutions might simply return the most "salient" Supreme Court, i.e., the Supreme Court of the United States, and any of the many meanings of the acronym AU, for example Australia and American University. NERD's context awareness enables it to link even the toughest of entities consistently and accurately.

Let's hit the /annotations-sync endpoint and get our results. In Python, this could look like:

import json

import requests

NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-sync"
my_access_token = ""  # paste your Access Token string obtained from login between the quotation marks

data = {
    "knowledge_bases": ["wikimedia"],
    "text": "The Supreme Court in Nairobi rendered its decision to the AU on Wednesday.",
}

# Send the document to the NERD API
response = requests.post(
    NERD_API_URL,
    data=json.dumps(data),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + my_access_token,  # "Bearer " must be included
    },
)
annotations_results = response.json()

The results represent a list of entity annotations. Each annotation includes a start and end location in the text and the ID, name (label), and type of the entity in the selected knowledge base. For this example, we'd see results like:

{
    "results": [
        {
            "annotations": [
                {
                    "start_index": 4,
                    "end_index": 17,
                    "text": "Supreme Court",
                    "entity_kb_id": "2368297",
                    "entity_label": "Supreme Court of Kenya",
                    "entity_type": "GOVERNMENT",
                    "ned_score": 0.1703,
                    "ner_score": 1.0
                },
                {
                    "start_index": 21,
                    "end_index": 28,
                    "text": "Nairobi",
                    "entity_kb_id": "3870",
                    "entity_label": "Nairobi",
                    "entity_type": "CITY",
                    "ned_score": 0.2128,
                    "ner_score": 1.0
                },
                {
                    "start_index": 58,
                    "end_index": 60,
                    "text": "AU",
                    "entity_kb_id": "7159",
                    "entity_label": "African Union",
                    "entity_type": "NGO",
                    "ned_score": 0.0919,
                    "ner_score": 1.0
                }
            ],
            "knowledge_base": "wikimedia"
        }
    ]
}

NERD has successfully disambiguated all three of our entities: the Supreme Court of Kenya, the city of Nairobi, and the African Union!
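Because each annotation carries character offsets into the submitted text, you can slice the source string to recover each mention. A minimal sketch, using an abridged copy of the response above (the offsets behave as half-open Python slice bounds):

```python
text = "The Supreme Court in Nairobi rendered its decision to the AU on Wednesday."

# Abridged version of the response shown above
annotations_results = {
    "results": [
        {
            "knowledge_base": "wikimedia",
            "annotations": [
                {"start_index": 4, "end_index": 17, "entity_label": "Supreme Court of Kenya", "entity_type": "GOVERNMENT"},
                {"start_index": 21, "end_index": 28, "entity_label": "Nairobi", "entity_type": "CITY"},
                {"start_index": 58, "end_index": 60, "entity_label": "African Union", "entity_type": "NGO"},
            ],
        }
    ]
}

for result in annotations_results["results"]:
    for ann in result["annotations"]:
        # Slice the original text with the half-open [start_index, end_index) span
        mention = text[ann["start_index"]:ann["end_index"]]
        print(f"{mention!r} -> {ann['entity_label']} ({ann['entity_type']})")
```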

Capital IQ Entities

To extract Capital IQ entities, simply replace "wikimedia" with "capiq" and add an extra parameter, "originating_entity_id". Optimized for the financial domain, the Capital IQ variant of NERD lets you specify an "originating entity": the entity that issued the document in question. For example, the company whose earnings call transcript or 10-K filing is passed through NERD would be the originating entity for that document. The value of this parameter should be the originating entity's Capital IQ ID. For example:

import json

import requests

NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-sync"
my_access_token = ""  # paste your Access Token string obtained from login between the quotation marks

data = {
    "knowledge_bases": ["capiq"],
    "text": "The LEGO Group today reported first half earnings for the six months ending June 30, 2020.",
    # Capital IQ ID of LEGO A/S. Profile page: https://www.capitaliq.com/CIQDotNet/company.aspx?companyid=701221
    "originating_entity_id": "701221",
}

# Send the document to the NERD API
response = requests.post(
    NERD_API_URL,
    data=json.dumps(data),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + my_access_token,  # "Bearer " must be included
    },
)
annotations_results = response.json()

with the results:

{
    "results": [
        {
            "annotations": [
                {
                    "end_index": 14,
                    "entity_kb_id": "701221",
                    "entity_label": "LEGO A/S",
                    "entity_type": "ORG",
                    "ned_score": 0.9993,
                    "ner_score": 0.9973,
                    "start_index": 0,
                    "text": "The LEGO Group"
                }
            ],
            "knowledge_base": "capiq"
        }
    ]
}

Providing an originating entity ID allows NERD to take even more context into account and therefore produce more precise annotations. If there isn't an appropriate originating entity for a document, such as in the case of a news article, simply enter "0" or omit the field.
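For instance, a news-article payload could either drop the field or pass "0" explicitly. Both payloads below are illustrative (the article text is made up):

```python
# News article with no issuing company: simply omit the field...
payload_omitted = {
    "knowledge_bases": ["capiq"],
    "text": "Markets rallied on Tuesday after the central bank's announcement.",
}

# ...or pass "0" explicitly to indicate there is no originating entity
payload_zero = dict(payload_omitted, originating_entity_id="0")
```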

Asynchronous Workflows

The asynchronous endpoint allows for bulk processing. Each document uploaded to this endpoint is immediately assigned a job_id, which can then be used to request annotations at a later time. We recommend using this endpoint when processing a large batch of documents, e.g., in a backfill. Here is an example of using the async endpoint to upload a document, receive a job_id, then poll the server until the results are ready, or until 5 minutes have elapsed:

import json
import time

import requests

NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-async"
my_access_token = ""  # paste your Access Token string obtained from login between the quotation marks
TIMEOUT = 300  # 5 minute timeout in seconds

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + my_access_token,  # "Bearer " must be included
}
data = ...  # your document, choice of knowledge base, and possible originating entity, as per above

# Submit the document; a successful POST returns a response that includes a `job_id` key
response = requests.post(NERD_API_URL, data=json.dumps(data), headers=headers)
if response.status_code != 202:
    raise RuntimeError("Error submitting document to the NERD API")
job_id = response.json()["job_id"]

# Poll until the results are ready or the timeout is reached
start_time = time.time()
while time.time() <= start_time + TIMEOUT:
    response = requests.get(NERD_API_URL, params={"job_id": job_id}, headers=headers)
    if response.status_code != 200:
        raise RuntimeError(f"Error retrieving job results for {job_id}")
    if response.json().get("status") == "success":
        break
    time.sleep(1)

if response.json().get("status") != "success":
    raise TimeoutError(f"Job {job_id} timed out")

Since an Access Token expires one hour after it is provisioned, users who expect their code to run for an hour or longer should use their Refresh Token, which expires after one week, to generate new Access Tokens as needed. (Production services that run continually should use keypair authentication instead.) Here's an example:

import json
import os
import time

import requests

NERD_API_URL = "https://nerd.kensho.com/api/v1/annotations-async"
my_refresh_token = ""  # paste your Refresh Token string obtained from login between the quotation marks


def get_access_token_from_refresh_token(refresh_token):
    """Exchange a Refresh Token for a new Access Token."""
    response = requests.get(f"https://nerd.kensho.com/oauth2/refresh?refresh_token={refresh_token}")
    return response.json()["access_token"]


class NerdClient:
    def __init__(self, refresh_token):
        self.refresh_token = refresh_token
        self.access_token = None

    def update_access_token(self):
        self.access_token = get_access_token_from_refresh_token(self.refresh_token)

    def call_api(self, verb, *args, headers=None, **kwargs):
        """Call the NERD API, refreshing the Access Token as needed."""
        if self.access_token is None:
            self.update_access_token()
        headers = dict(headers or {})
        method = getattr(requests, verb)

        def call_with_updated_headers():
            headers["Authorization"] = f"Bearer {self.access_token}"
            return method(*args, headers=headers, **kwargs)

        response = call_with_updated_headers()
        if response.status_code == 401:  # Access Token expired; refresh and retry once
            self.update_access_token()
            response = call_with_updated_headers()
        return response

    def make_async_annotations_request(self, data):
        """POST a document to the NERD async endpoint and return its job_id."""
        response = self.call_api(
            "post",
            NERD_API_URL,
            data=json.dumps(data),
            headers={"Content-Type": "application/json"},
        )
        return response.json()["job_id"]

    def get_async_annotations_results(self, job_id):
        """Poll the NERD async endpoint until the job leaves the pending state."""
        while True:
            response = self.call_api("get", NERD_API_URL + "?job_id=" + job_id)
            result = response.json()
            if result["status"] != "pending":
                return result
            time.sleep(10)


# Data preparation
file_dir = ""  # path to the directory containing the documents you want NERD to process
files = os.listdir(file_dir)
job_dict = {}  # maps file_name -> job_id
data = {"knowledge_bases": ["capiq"]}
nerd_client = NerdClient(my_refresh_token)

# Submit requests to the async endpoint
for file_name in files:
    file_name = os.path.join(file_dir, file_name)
    with open(file_name, "r") as f:
        text = f.read()
    data.update({"text": text})
    job_id = nerd_client.make_async_annotations_request(data)
    job_dict[file_name] = job_id
    print(f"Submitted {file_name} as {job_id}")
    time.sleep(0.1)

# Retrieve results from the async endpoint
for file_name, job_id in job_dict.items():
    file_name += ".nerd.json"
    result = nerd_client.get_async_annotations_results(job_id)
    with open(file_name, "w") as result_file:
        json.dump(result, result_file, indent=4)
    print(f"Wrote result for {job_id} to {file_name}")
    time.sleep(0.1)

Copyright © 2022 Kensho Technologies, LLC. Kensho and Visallo marks are the property of Kensho Technologies, LLC. All rights reserved.