This repository has been archived by the owner on Sep 5, 2020. It is now read-only.

Service Implementation Details

Ed Armstrong edited this page Jul 9, 2019 · 30 revisions

Service Base (Parent Class)

source

The ServiceBase class prepares the input for the services. All services accept a document, a context, and a schema. If the document is missing, an exception is thrown. If the context or the schema is missing, a default is used instead. The default schema and context are based on the schema attribute in the document; if that attribute is also missing, an empty context/schema is used.

ServiceBase flowchart

All markup services (ner, dict-link, dict-all) use the ServiceBase class as an entry point. When an HTTP request is made, the 'processRequest' method is called.

Process request method

javadoc

Create a new manager object from JSON information. This method verifies the contents of the incoming JSON object and parses out the relevant context and schema information. It is a utility method that the service endpoints call when they are tagging documents. The document field is required in the JSON input object; the context and schemaURL fields are optional. If the document is not provided, an exception is thrown. If the 'context' is not provided, the document type is determined from the 'schema', if available. If no context is available, a default empty schema is used. If a schema URL is provided, it is used; otherwise the URL is parsed from the document. If the schema URL returns a 302 (redirect), then that URL is used. All other non-200 return codes throw an exception.
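The validation rules above can be sketched in a few lines of Java. This is a minimal illustration, not the NERVE implementation: the class and method names are hypothetical, the JSON object is assumed to be already parsed into a map of strings, and reading "that URL" as the redirect target of the 302 response is an assumption. Only the field names (document, context, schemaURL) come from the description.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names come from the NERVE source.
final class RequestSketch {
    final String document;
    final Optional<String> context;
    final Optional<String> schemaUrl;

    private RequestSketch(String document, Optional<String> context, Optional<String> schemaUrl) {
        this.document = document;
        this.context = context;
        this.schemaUrl = schemaUrl;
    }

    /** Checks the already-parsed JSON fields: 'document' is required, the rest are optional. */
    static RequestSketch fromFields(Map<String, String> fields) {
        String document = fields.get("document");
        if (document == null || document.isEmpty()) {
            throw new IllegalArgumentException("the 'document' field is required");
        }
        return new RequestSketch(
                document,
                Optional.ofNullable(fields.get("context")),
                Optional.ofNullable(fields.get("schemaURL")));
    }

    /**
     * Picks the URL to load the schema from, given the HTTP status of the first request.
     * Assumption: on a 302 the redirect target is used; any other non-200 status fails.
     */
    static String effectiveSchemaUrl(String requestedUrl, int status, String redirectLocation) {
        if (status == 200) {
            return requestedUrl;
        }
        if (status == 302 && redirectLocation != null) {
            return redirectLocation;
        }
        throw new IllegalStateException("schema request failed with HTTP " + status);
    }
}
```

A caller would first build the request with fromFields and, when no schemaURL is present, fall back to whatever URL the document itself declares before calling effectiveSchemaUrl.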

Services

module
There are three tagging services: ner, dict-link, and dict-all. They accept a document, insert named entity tags, and then return the marked-up document. An abstract overview and description (API) of the services can be found here.

The three service calls (ner, dict-link, dict-all) forward the data from ServiceBase.processRequest to the appropriate service module. Each service module class inherits from the ServiceModuleBase class. The parent class provides the interface to the document, context, and schema values. The logic is contained in the child classes, which are described here.

NER Service

links

source
api
javadoc

description

Add entity tags to a document using NLP entity recognition. Entities are not added inside already tagged entities, nor where they would violate the document schema.

logic

NER service flowchart

  1. The data is passed into the service from ServiceBase.processRequest.
  2. Each inner text portion of an xml element is processed individually.
  3. If the text is empty, continue with next element.
  4. If any ancestor element is a valid tag in the schema, continue with next element. This prevents nested tagged elements.
  5. Apply the Stanford NER to the text; this returns a list of new nodes. The tag names in this list will not necessarily match the schema.
  6. Examine each new element.
  7. Convert the element tag name so that it matches the schema. This information is found in the context file.
  8. Check if the element is valid in the schema.
  9. If it is not valid, untag it.
  10. If it is valid, set a default lemma value to be the same as the text value.
  11. Replace the text node 'n' with all the new node elements in the list.
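
The steps above can be sketched with the JDK's DOM API. This is a rough illustration under stated assumptions, not the NERVE code: the recognizer is a plain function standing in for Stanford NER, labelToTag stands in for the mapping held in the context file, validInSchema stands in for the schema check, and the attribute name "lemma" is assumed. All class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: every name here is hypothetical, not taken from NERVE.
final class NerSketch {
    /** One piece of recognizer output; a null label means plain text. */
    static final class Span {
        final String text;
        final String label;
        Span(String text, String label) { this.text = text; this.label = label; }
    }

    /** Parses an XML string with the JDK's built-in DOM parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    static void tag(Document doc,
                    Function<String, List<Span>> recognizer,   // stand-in for Stanford NER
                    Map<String, String> labelToTag,            // stand-in for the context mapping
                    Set<String> entityTags,                    // tag names that mark an entity
                    BiPredicate<Node, String> validInSchema) { // (parent, tag name) -> allowed?
        List<Node> textNodes = new ArrayList<>();
        collectText(doc.getDocumentElement(), textNodes);      // step 2
        for (Node text : textNodes) {
            String value = text.getNodeValue();
            if (value.trim().isEmpty()) continue;              // step 3
            if (insideEntity(text, entityTags)) continue;      // step 4
            Node parent = text.getParentNode();
            for (Span span : recognizer.apply(value)) {        // steps 5 and 6
                String tagName = span.label == null ? null : labelToTag.get(span.label); // step 7
                if (tagName == null || !validInSchema.test(parent, tagName)) {
                    // steps 8 and 9: keep the text but drop the tag
                    parent.insertBefore(doc.createTextNode(span.text), text);
                } else {
                    Element entity = doc.createElement(tagName);
                    entity.setAttribute("lemma", span.text);   // step 10
                    entity.setTextContent(span.text);
                    parent.insertBefore(entity, text);
                }
            }
            parent.removeChild(text);                          // step 11
        }
    }

    /** Gathers text nodes up front so the tree can be edited safely afterwards. */
    private static void collectText(Node node, List<Node> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.add(node);
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    private static boolean insideEntity(Node node, Set<String> entityTags) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p.getNodeType() == Node.ELEMENT_NODE && entityTags.contains(p.getNodeName())) {
                return true;
            }
        }
        return false;
    }
}
```

Collecting the text nodes before modifying the tree avoids editing a node list while iterating over it, which is why the sketch makes two passes.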

dict-link

source
javadoc

description

Match entities that do not have a link attribute with entities in the database. The link attribute will be added if a match is found.

logic

dict-link service flowchart

  1. The data is passed into the service from ServiceBase.processRequest.
  2. For each tag field 't' in the context, usually {organization, location, person, title}.
  3. For each element in the document of type 't'.
  4. If the node is already linked, continue at (2).
  5. Look up the node in the dictionary using the Dictionary.lookup method (https://cwrc.github.io/NERVE/ca/sharcnet/dh/scriber/dictionary/Dictionary.html#lookup-java.lang.String-java.lang.String-java.lang.String-java.lang.String-). The lookup includes the node text and tag name; the lemma is included if present.
  6. If any result is found.
  7. Copy link information from the first result returned into the node.
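
As a rough sketch of the loop above, the following Java uses a simplified in-memory entity instead of real document nodes. The Lookup interface is a hypothetical stand-in and does not mirror the signature of the real Dictionary.lookup method; the attribute names "link" and "lemma" are also assumptions. For simplicity, an already linked entity is skipped and the loop moves on to the next entity.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: every name here is hypothetical, not taken from NERVE.
final class DictLinkSketch {
    /** Simplified stand-in for a tagged element: tag name, text, and attributes. */
    static final class Entity {
        final String tag;
        final String text;
        final Map<String, String> attrs = new HashMap<>();
        Entity(String tag, String text) { this.tag = tag; this.text = text; }
    }

    /** Hypothetical lookup; NOT the signature of the real Dictionary.lookup. */
    interface Lookup {
        List<String> links(String text, String lemma, String tag);
    }

    static void link(List<Entity> entities, List<String> contextTags, Lookup dictionary) {
        for (String tag : contextTags) {                   // step 2
            for (Entity entity : entities) {               // step 3
                if (!entity.tag.equals(tag)) continue;
                if (entity.attrs.containsKey("link")) continue; // step 4: already linked
                // step 5: text and tag name are always sent; lemma may be null
                List<String> found = dictionary.links(entity.text, entity.attrs.get("lemma"), tag);
                if (!found.isEmpty()) {                    // step 6
                    entity.attrs.put("link", found.get(0)); // step 7: first result wins
                }
            }
        }
    }
}
```

Taking the first result keeps the service deterministic but means the ordering of dictionary results decides which link is chosen.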

dict-all

source
javadoc

description

Perform a brute-force search of all text in the document that is not currently within a tagged entity. Add tag, lemma, and link values from the database to the document if doing so doesn't violate the schema.

logic

dict-all service flowchart

  1. Examine each inner text node in the document.
  2. If the text node is a descendant of a tagged entity, continue at (1).
  3. Search every contiguous sequence of words for an entity in the database. For example, "Toronto Ontario Canada" will search for "Toronto", "Ontario", "Canada", "Toronto Ontario", "Ontario Canada", and "Toronto Ontario Canada". The longest match is kept.
  4. If the entity is not valid in the schema, continue at (1).
  5. Replace text with an element created from database result.
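
The word-sequence search in step 3 can be sketched as follows. This is a minimal illustration, assuming whitespace-separated words and a plain set of strings standing in for the database; the class and method names are hypothetical, and the real service also carries tag, lemma, and link values with each match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Illustrative only: every name here is hypothetical, not taken from NERVE.
final class LongestMatchSketch {
    /** Lists every contiguous word sequence in the text, longest first. */
    static List<String> candidates(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int length = words.length; length >= 1; length--) {
            for (int start = 0; start + length <= words.length; start++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, start, start + length)));
            }
        }
        return out;
    }

    /** Returns the first (and therefore longest) candidate found in the set. */
    static Optional<String> longestMatch(String text, Set<String> knownEntities) {
        return candidates(text).stream().filter(knownEntities::contains).findFirst();
    }
}
```

Generating candidates longest-first means the search can stop at the first hit. For a text of n words there are n(n+1)/2 candidates, which is the cost of the brute-force approach described above.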