Below we answer your most burning questions about NERD. If you have a question not covered here, please don't hesitate to reach out to us at email@example.com.
NERD (Named Entity Recognition and Disambiguation) is Kensho’s breakthrough natural language processing capability that identifies the entities (e.g., companies, people, places, and events) appearing in any textual data and links, or disambiguates, them to two knowledge bases: S&P Capital IQ and Wikimedia.
For a product overview please visit our profile page on the S&P Global Marketplace.
Under the hood, NERD is a collection of advanced machine learning models that together extract entities from text documents and link them, in context, to a knowledge base. Kensho combines the latest advances in machine learning with S&P Global’s unparalleled data universe to train the models that underlie NERD.
NERD has learned from the patterns in millions of documents, with a special focus on financial and business-related text, from news articles to earnings call transcripts. Many of these documents are hand-labeled by Kensho’s domain experts in the financial and business worlds, so as to teach NERD to extract accurate information from such text.
NERD is designed to extract entities out of standard English text documents comprising complete sentences. While NERD is by no means limited to formal, lengthy texts, it will function optimally on documents written with conventional grammatical structure.
One of NERD’s most powerful features is its context-awareness. Documents whose entities make sense given the surrounding text are ideally suited to contextual extraction. Conversely, entities in non-prose text, such as HTML tables, sentence fragments, or lists, give NERD limited opportunities to make proper predictions. Hence, you can expect degraded performance if you pass NERD text that it was not designed to process. That said, NERD achieves excellent performance on a diverse array of document types, from emails to research reports.
NERD is not a keyword or fuzzy matching product. Such products suffer from low accuracy and highly unstable results, even on simple examples. For instance, a text matching engine would struggle to determine whether "TSLA" refers to a Task Service Level Agreement or to the ticker symbol for Tesla, Inc.
Instead, NERD uses a bespoke combination of fine-tuned, domain-specific neural networks and ensembled tree-based learning algorithms, along with granular heuristics informed by human subject-matter experts, to extract and link the entities in a document consistently and accurately.
NERD is the only technology of its kind, and its performance on financial documents approaches human labeling accuracy.
NERD is accessed via REST API.
Simply input your text, specify to which knowledge base(s) you would like to link, and get your results.
After running, NERD will return a list of JSON annotations, each corresponding to a mention of an entity in your document. Each annotation will include:
- Location of the entity in the text (start_index, end_index)
- Entity Name (entity_label)
- Entity ID in either Capital IQ or Wikidata (entity_kb_id)
- Entity Type (entity_type)
- NED Score (ned_score)
- NER Score (ner_score)
For more detailed information, please visit the API reference.
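As a rough illustration of working with the annotation fields listed above, the following Python sketch parses a JSON list of annotations and recovers the mentioned text from the document. The field names come from the list above; the sample values are made up, and the sketch assumes that `end_index` is exclusive (as in Python slicing), which should be confirmed against the API reference.

```python
import json
from dataclasses import dataclass

# Field names taken from the annotation description above.
FIELDS = (
    "start_index", "end_index", "entity_label",
    "entity_kb_id", "entity_type", "ned_score", "ner_score",
)


@dataclass
class Annotation:
    start_index: int
    end_index: int
    entity_label: str
    entity_kb_id: str
    entity_type: str
    ned_score: float
    ner_score: float


def parse_annotations(raw_json: str) -> list:
    """Parse a JSON array of annotation objects, keeping only the known fields."""
    return [
        Annotation(**{name: item[name] for name in FIELDS})
        for item in json.loads(raw_json)
    ]


def mention_text(document: str, annotation: Annotation) -> str:
    # Assumption: end_index is exclusive, so a plain slice returns the mention.
    return document[annotation.start_index:annotation.end_index]


# Made-up sample data for illustration only; not real NERD output.
document = "Apple reported strong earnings."
raw = json.dumps([{
    "start_index": 0, "end_index": 5, "entity_label": "Apple Inc.",
    "entity_kb_id": "example-id", "entity_type": "ORG",
    "ned_score": 0.97, "ner_score": 0.99,
}])

annotations = parse_annotations(raw)
print(mention_text(document, annotations[0]), "->", annotations[0].entity_label)
```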
The S&P Capital IQ knowledge base imbues your textual data with connections to one of the most comprehensive financial databases in the world, covering 25 million public and private companies with thousands of interconnected datasets, including corporate financials, key events, and professional profiles. NERD turns your text documents into a financial data goldmine by linking them to an extraordinarily comprehensive business data platform.
The Wikimedia knowledge base is a broad, free, and open database consisting of more than 92 million data items. Using NERD to link your text documents to Wikimedia provides you with broad insight across domains, linking all manner of entities, from people and organizations to events and geographies.
Yes! Simply specify both knowledge bases in your REST request.
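A request body that names both knowledge bases might be assembled as in the sketch below. The JSON keys (`text`, `knowledge_bases`) and the identifiers `capiq` and `wikimedia` are hypothetical placeholders chosen for illustration; the actual parameter names and endpoint are defined in the API reference.

```python
import json

# Hypothetical identifiers; the real values are defined in the API reference.
KNOWLEDGE_BASES = ("capiq", "wikimedia")


def build_request_body(text: str, knowledge_bases=KNOWLEDGE_BASES) -> str:
    """Serialize the text and the target knowledge bases as a JSON string."""
    if not text:
        raise ValueError("text must not be empty")
    if not knowledge_bases:
        raise ValueError("at least one knowledge base is required")
    # Hypothetical key names, used here only to show the shape of a request.
    return json.dumps({"text": text, "knowledge_bases": list(knowledge_bases)})


body = build_request_body("Apple and Microsoft both reported earnings this week.")
print(body)
```

Linking to a single knowledge base would then amount to passing a one-element sequence instead of the default pair.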
One of NERD’s most powerful features is its “context-awareness,” meaning that it can use the information contained in the text surrounding an entity to inform its disambiguation decision. NERD does what many AI technologies claim to do but few achieve: it approximates human comprehension and judgment.
For example, NERD understands that if a company referred to simply as “Apple” appears in an earnings call transcript for Microsoft, then it should link the entity to Apple Inc. as opposed to Apple Bank of Savings. This is because NERD takes into account the important context of the technology industry, the competitive relationship between Microsoft and Apple, and the other entities mentioned in the document.
The NER score represents the model’s confidence that the extracted text span is an entity, whereas the NED score represents the model’s confidence in its link (or disambiguation) to Capital IQ or Wikimedia.
It is important to note that these scores are not probabilities or percentages, nor are they linearly related. The model provides these scores so that you can threshold the returned results for your individual use case. For example, a 0.9 NED score is more confident than a 0.6 NED score, but it is not 50% more confident.
NERD returns annotations at all confidence levels. Typical use cases will “threshold” annotations such that only annotations over a certain confidence are considered.
The Capital IQ NERD and Wikimedia NERD systems return annotations with differently scaled NED scores! Generally speaking, we suggest considering the following as starting points:
- ned_score > 0.9 for “recall-weighted” use cases (i.e., you favor “completeness” in terms of capturing entities and are more concerned with false negatives than you are with false positives)
- ned_score > 0.98 for “precision-weighted” use cases (i.e., you favor “accuracy” in terms of linking entities and are more concerned with false positives than you are with false negatives)
- ned_score > 0.95 for balanced use cases (i.e., somewhere in the middle)

- ned_score > 0.0 for “recall-weighted” use cases (i.e., you favor “completeness” in terms of capturing entities and are more concerned with false negatives than you are with false positives)
- ned_score > 0.25 for “precision-weighted” use cases (i.e., you favor “accuracy” in terms of linking entities and are more concerned with false positives than you are with false negatives)
- ned_score > 0.15 for balanced use cases (i.e., somewhere in the middle)
However, we suggest exploring various thresholds and adjusting to your specific needs.
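The thresholding described above can be sketched in a few lines of Python. The annotations here are plain dictionaries with made-up scores, and `filter_by_ned_score` is a hypothetical helper rather than part of NERD. Because the two systems scale their NED scores differently, a threshold chosen for one should not be reused for the other.

```python
def filter_by_ned_score(annotations, threshold):
    """Keep only annotations whose ned_score is strictly above the threshold."""
    return [a for a in annotations if a["ned_score"] > threshold]


# Made-up annotations for illustration only; not real NERD output.
sample = [
    {"entity_label": "Entity A", "ned_score": 0.99},
    {"entity_label": "Entity B", "ned_score": 0.96},
    {"entity_label": "Entity C", "ned_score": 0.92},
    {"entity_label": "Entity D", "ned_score": 0.40},
]

# Compare how many annotations survive at each suggested starting point.
for threshold in (0.9, 0.95, 0.98):
    kept = filter_by_ned_score(sample, threshold)
    print(threshold, [a["entity_label"] for a in kept])
```

Running the same comparison on a hand-checked sample of your own documents is one way to see how raising the threshold trades recall for precision.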