
Semantic Analysis Method (SAM): A Tool for Identifying Potential Access Points in Unstructured Text

This presentation describes a tool developed to help convert unstructured textual descriptions of cultural heritage material, specifically archival descriptive data, into linked data. The Semantic Analysis Method (SAM) tool serves as a bridge between unstructured narrative description and semantically defined access points. SAM identifies named entities and topics using a semantic analysis engine, OpenCalais, which produces a JSON data file as its initial output. SAM then parses the results of the OpenCalais analysis and saves them as a comma-separated values (CSV) file. Once in this tabular form, the list of potential access points may be imported into a data cleanup application such as OpenRefine for further editing and removal of any misidentified entities.
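The JSON-to-CSV step described above can be sketched as follows. This is a minimal illustration, not the SAM tool's actual code. It assumes the OpenCalais response has already been fetched and parsed into a dictionary keyed by URI, and that each item carries `_typeGroup`, `_type`, `name`, and `relevance` (or `score`) fields. The function names, the column layout, and the sample record are hypothetical.

```python
import csv
import io

# Column layout for the CSV output (an assumption made for this sketch).
FIELDS = ["uri", "group", "type", "name", "relevance"]

# Hypothetical sample shaped like an OpenCalais response; not real output.
SAMPLE_RESPONSE = {
    "doc": {"info": {"docTitle": "Finding aid excerpt"}},
    "http://example.org/entity/1": {
        "_typeGroup": "entities",
        "_type": "Person",
        "name": "Jane Example",
        "relevance": 0.8,
    },
    "http://example.org/topic/1": {
        "_typeGroup": "topics",
        "name": "Human Interest",
        "score": 0.9,
    },
}


def calais_json_to_rows(response):
    """Flatten entity and topic items from a parsed response into row dicts."""
    rows = []
    for uri, item in response.items():
        if not isinstance(item, dict):
            continue
        group = item.get("_typeGroup")
        # Keep only entities and topics; skip document metadata and other groups.
        if group not in ("entities", "topics"):
            continue
        rows.append({
            "uri": uri,
            "group": group,
            "type": item.get("_type", ""),
            "name": item.get("name", ""),
            # Assumed: entities carry "relevance", topics carry "score".
            "relevance": item.get("relevance", item.get("score", "")),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(calais_json_to_rows(SAMPLE_RESPONSE)))
```

Writing the resulting text to a `.csv` file gives a table that a cleanup application such as OpenRefine can import, with one candidate access point per row.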
 
While the SAM tool successfully provides text retrieval, semantic analysis for entity extraction, and conversion of results to a more manageable format for later data cleanup, much work remains to streamline these processes and improve accuracy in data extraction and characterization. Two challenges stand out: entity extraction and name resolution for historical names and places, which may not be available in authority files, and misidentification of certain phrases as entities. Because many names cannot be found in well-established national and international data sources, establishing a local name authority should be incorporated into the SAM tool as a further task, in order to improve accuracy in identifying and correctly categorizing entities.
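One way a local name authority check might work is sketched below. This is an illustrative assumption, not a feature of the SAM tool as described. The authority is modeled as a plain dictionary mapping a preferred heading to its known variant forms, matching is exact after simple normalization, and all names and function names are hypothetical.

```python
import unicodedata


def normalize_name(name):
    """Reduce a name to a comparable key: no accents, punctuation, or case."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in stripped)
    return " ".join(cleaned.casefold().split())


def build_authority_index(authority):
    """Map every normalized preferred or variant form to its preferred heading."""
    index = {}
    for preferred, variants in authority.items():
        for form in [preferred, *variants]:
            index[normalize_name(form)] = preferred
    return index


def resolve(names, index):
    """Split extracted names into those matched locally and those left for review."""
    matched, unmatched = {}, []
    for name in names:
        preferred = index.get(normalize_name(name))
        if preferred is None:
            unmatched.append(name)
        else:
            matched[name] = preferred
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical local authority records, for illustration only.
    local_authority = {
        "Example, Jane, 1820-1895": ["Jane Example", "J. Example"],
    }
    index = build_authority_index(local_authority)
    print(resolve(["jane example", "Unknown Person"], index))
```

Names left unmatched would still need manual review, and exact matching after normalization will miss spelling variants that are not already recorded in the local list.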
Presentation Type: Talk
Language: English