Semantic Enrichment Component

The Semantic Enrichment Component (SEC) provides semantic enrichment services, for example text annotation (annotate), entity search (get_entities) and listing of all types and attributes in the KB (get_entity_types_and_attributes).

The service is publicly available at http://sec.fit.vutbr.cz/ on port 8082 (Protocol documentation).

Prerequisites

The current version is available on git in branch D114-SEC_API.

 git clone http://sec.fit.vutbr.cz/sec/secapi.git secapi && cd secapi
 git checkout -b D114-SEC_API origin/D114-SEC_API

This creates the directory secapi and changes into it. Next, download the KB in directory ./NER and then build the necessary programs in directory ./SEC_API with make:

 (cd ./NER && ./deleteKB.sh && ./downloadKB.sh)
 (cd ./SEC_API && make)

Note that downloadKB.sh requires that no KB and no automata (*.fsa) are present in the directories secapi/NER and secapi/NER/figa; delete them beforehand with deleteKB.sh. Beware: downloadKB.sh and deleteKB.sh must be launched only from the directory in which they are located (that is, ./NER)!

Description of parts

SEC can be found in the directory … Its scripts are written so that they can be called from any directory. SEC is currently divided into the scripts sec_daemon.py, sec.py and sec_api.py.

Script sec_daemon.py

Script sec_daemon.py is the core of SEC. It was created to reduce memory demands when several instances of sec.py run in parallel. Launching this script creates a Unix domain socket (UDS) on which the daemon waits for connections from instances of sec.py or sec_api.py. Instances of these two scripts communicate with sec_daemon.py using the internal communication protocol described below.

Usage

 ./sec_daemon.py [-h] [-p PATH] [--own_kb_daemon]
 Optional arguments:
   -h, --help            shows help and then terminates.
   -p PATH, --uds_path PATH
                         Sets path to the Unix domain socket where the daemon
                         is waiting for clients. Default value is ./daemon_uds
                         relative to the script's directory.
   --own_kb_daemon       Launches its own KB daemon even if any other is already running.

Script sec.py

Script sec.py is a client of the daemon sec_daemon.py. It presents the SEC services provided by the daemon to the user. A request in JSON is expected on standard input, and the answer is written to standard output. Descriptions of the services and requests, with examples, can be found in ./doc/sec_api.pdf after building with make.

Usage

 ./sec.py [-h] [-t [DIRECTORY]] [-p PATH] [-c CONFIG.json] [--plaintext]
          [-f FILENAME]
  Optional arguments:
   -h, --help            shows help and then terminates.
   -t [DIRECTORY], --testing_mode [DIRECTORY]
                         Switches to test mode, which makes it possible to test
                         work with structured annotations that NER does not
                         know. Service "annotate" looks in the URI query of
                         "DOCUMENT_URI" for the value of key "tid" and uses it
                         to look in directory DIRECTORY for a file with the
                         answer for "annotation_format". If such a file is
                         found, its content is returned instead of the results
                         from NER. The URI query of "DOCUMENT_URI" can also
                         contain key "aid". Unlike with "tid", the content of
                         the file found for this value only enriches (is
                         appended to) the result of NER. Default value is
                         ./testing_mode relative to the script's directory.
   -p PATH, --uds_path PATH
                         Sets path to the Unix domain socket where the daemon
                         is waiting for clients. Default value is ./daemon_uds
                         relative to the script's directory.
   -c CONFIG.json, --config_file CONFIG.json
                         Sets the service and its parameters from JSON file, 
                         instead of standard input. In this case just a text to 
                         be processed or nothing is expected on standard input.
   --plaintext           Output of services "annotate", "annotate_vertical"
                         and "get_raw_annotations" is in plain text. If exception 
                         occurs, it remains in JSON.
   -f FILENAME, --filename FILENAME
                         Sets filename for service "annotate_vertical".
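
The following is a minimal sketch of driving sec.py from another Python program through standard input and output. It assumes sec_daemon.py is already running and that sec.py is executable at the given path. The helper names build_sec_command and run_sec are illustrative and not part of SEC; only the flags -p and --plaintext come from the usage text above.

```python
import json
import subprocess


def build_sec_command(sec_path="./sec.py", uds_path=None, plaintext=False):
    """Assemble the argument vector for sec.py from the documented flags."""
    argv = [sec_path]
    if uds_path is not None:
        argv += ["-p", uds_path]
    if plaintext:
        argv.append("--plaintext")
    return argv


def run_sec(request, **options):
    """Send one JSON request on stdin and return what sec.py prints on stdout.

    Assumes sec_daemon.py is already listening on the socket sec.py will use.
    """
    completed = subprocess.run(
        build_sec_command(**options),
        input=json.dumps(request).encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return completed.stdout.decode("utf-8")


# Building the command does not need a running daemon; run_sec() does.
command = build_sec_command(uds_path="./daemon_uds", plaintext=True)
print(command)
```

With a running daemon, a call such as run_sec({"get_kb_version": {}}) would return the raw text that sec.py writes to standard output.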

Script sec_api.py

Script sec_api.py is very similar to sec.py and reuses part of it. Unlike sec.py, it accepts multiple requests on standard input per instance; after each request, the answer is printed to standard output. On launch it also starts an HTTP server, which listens on port 8082 by default. Any HTTP client can send a SEC request for a specific service through this script with an HTTP POST request and receive a response.

Usage

 ./sec_api.py [-h] [-t [DIRECTORY]] [-p PATH] [-n PORT]
  Optional arguments:
   -h, --help            Shows help and then terminates.
   -t [DIRECTORY], --testing_mode [DIRECTORY]
                         Switches to test mode, which makes it possible to test
                         work with structured annotations that NER does not
                         know. Service "annotate" looks in the URI query of
                         "DOCUMENT_URI" for the value of key "tid" and uses it
                         to look in directory DIRECTORY for a file with the
                         answer for "annotation_format". If such a file is
                         found, its content is returned instead of the results
                         from NER. The URI query of "DOCUMENT_URI" can also
                         contain key "aid". Unlike with "tid", the content of
                         the file found for this value only enriches (is
                         appended to) the result of NER. Default value is
                         ./testing_mode relative to the script's directory.
   -p PATH, --uds_path PATH
                         Sets path to the Unix domain socket where the daemon
                         is waiting for clients. Default value is ./daemon_uds
                         relative to the script's directory.
   -n PORT, --net_port PORT
                         Sets port, where SEC is waiting for clients. Default value
                         is 8082.
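
As a sketch of the HTTP route, the snippet below builds a POST request for a locally running sec_api.py using only the Python standard library. It is written under assumptions the text above does not confirm: that the JSON request is sent as the raw POST body, that the root path "/" is used, and that the answer is JSON. The function names are illustrative; consult the protocol documentation for the exact form.

```python
import json
import urllib.request


def build_sec_http_request(request, host="localhost", port=8082):
    """Build a POST request carrying the SEC request as a JSON body.

    Assumption: the server accepts the JSON document as the raw POST body
    at the root path; this is not confirmed by the documentation above.
    """
    body = json.dumps(request).encode("utf-8")
    return urllib.request.Request(
        url="http://{}:{}/".format(host, port),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_sec_http_request(request, **kwargs):
    """Send the request and decode the answer (needs a running sec_api.py)."""
    with urllib.request.urlopen(build_sec_http_request(request, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network.
http_request = build_sec_http_request({"get_enrichment_engines": {}})
print(http_request.get_method(), http_request.full_url)
```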

Internal communication protocol

The internal communication protocol is based on the client-server model and uses a Unix domain socket (UDS) in stream mode.

Key points include:

Procedure

  1. The server waits for clients.
  2. A client connects.
  3. The client sends the settings to the server (the test mode directory and the JSON with the settings of the required service).
  4. The server receives the settings and sends the client a confirmation.
  5. The client sends the data to be processed to the server (if the required service does not need any, the data may be empty).
  6. In test mode, the server can ask the client to open several files and send their file descriptors (this also proves that the client has permission to open the files).
  7. The server sends the processed data to the client.
  8. The client closes the connection or continues from step 5 or from step 3.

If the server detects incorrect settings or an error occurs during processing, it sends the client information about the error and terminates the associated connection.

Commands and packet structure

The commands are listed in file daemon_lib.py. Each has a dynamically generated two-digit Opcode.

Commands use two packet structures. For errors it is:

   2 bytes        String      2 bytes
  -----------------------------------
 | Opcode |  Error message  |  CRLF  |
  -----------------------------------

For the rest (except for file descriptors), it is a structure that is repeated until the number of bytes equals zero:

  2 bytes        Number (decimal)        2 bytes    N bytes    2 bytes
  ---------------------------------------------------------------------
 | Opcode |  Number of bytes of data N  |  CRLF  |  Raw data  |  CRLF  |
  ---------------------------------------------------------------------

The python-fdsend library is used to send file descriptors.

Support of multiple NERs

In development; the documentation will be completed later (it currently contains only the essential facts):

 ner_manager.appendNER("default", module_annotate.NER())

Add a similar line with a different NER name and an instance of another wrapper.

Output specification from NERs

The specification follows our NER and other requirements. The output from NERs is expected to have this syntax (BNF):

<output from NERs> ::= <origin_base>
    | <origin_base> "\t" <id>
    | <origin_base> "\t" <id> "\t" <direct_attributes>
<origin_base> ::= <start_offset> "\t" <end_offset> "\t" <data_type> "\t" <string_between_offsets> "\t" <data>
<data_type> ::= "kb"
    | "activity"
    | "date"
    | "interval"
    | "coref"
    | "uri"
<data> ::= <data-kb>
    | <data-activity>
    | <data-date>
    | <data-interval>
    | <data-coref>
    | <data-uri>
<data-kb> ::= <KB_row> | <KB_row> ";" <data-kb>
<data-date> ::= <year> "-" <month> "-" <day>
<data-interval> ::= <data-date> " -- " <data-date>
<data-coref> ::= <data-kb>

<direct_attributes> ::= <attribute> | <attribute> "|" <direct_attributes>
<attribute> ::= <attribute_name> "[" <attribute_type> "]=" <attribute_value>
<attribute_type> ::= "string" | "decimal" | "date" | "image" | "integer" | "uri" | <other_attribute_type>

<year> ::= <digit> <digit> <digit> <digit>
<month> ::= <digit> <digit>
<day> ::= <digit> <digit>
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

where:

  • <attribute_type> - based on XSD, but it can also contain other types whose description is not known (e.g. type "AnnotationLink")
  • In case that <data_type> is:
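
A minimal parser for one line of NER output, following the BNF above, might look like the sketch below. It assumes that the offsets are decimal integers (the grammar does not define them) and that attribute values contain no tab or "|" characters. The names NerRecord, parse_ner_line and kb_rows, as well as the sample line, are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# <attribute> ::= <attribute_name> "[" <attribute_type> "]=" <attribute_value>
ATTRIBUTE_RE = re.compile(r"^(?P<name>[^\[]+)\[(?P<type>[^\]]+)\]=(?P<value>.*)$", re.S)


class NerRecord(NamedTuple):
    start_offset: int
    end_offset: int
    data_type: str
    text: str
    data: str
    id: Optional[str]
    attributes: list


def parse_attribute(raw):
    """Split one direct attribute into (name, type, value)."""
    match = ATTRIBUTE_RE.match(raw)
    if match is None:
        raise ValueError("malformed attribute: {!r}".format(raw))
    return match.group("name"), match.group("type"), match.group("value")


def parse_ner_line(line):
    """Parse <origin_base> with the optional <id> and <direct_attributes>."""
    fields = line.rstrip("\n").split("\t")
    if not 5 <= len(fields) <= 7:
        raise ValueError("expected 5 to 7 tab-separated fields")
    start, end, data_type, text, data = fields[:5]
    record_id = fields[5] if len(fields) > 5 else None
    attributes = []
    if len(fields) == 7:
        attributes = [parse_attribute(a) for a in fields[6].split("|")]
    return NerRecord(int(start), int(end), data_type, text, data, record_id, attributes)


def kb_rows(record):
    """Return the KB rows of a "kb" or "coref" record (separated by ";")."""
    if record.data_type in ("kb", "coref"):
        return record.data.split(";")
    return []


# Hypothetical sample line, made up for illustration.
sample = "0\t10\tdate\t1 May 2010\t2010-05-01\t42\tsource[string]=example|rank[integer]=3"
record = parse_ner_line(sample)
print(record)
```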

    Calling SEC as a library

    To avoid creating a subprocess with sec.py when SEC is used from another program, class Sec was created in sec.py. Its methods are described in the source code. As with sec.py, script sec_daemon.py must be running, and the class must be initialized with the path to the Unix domain socket on which the daemon is waiting. Configuration is defined as for sec.py, except that Python equivalents are used instead of JSON (see table).

    Launching on grid

    For launching of SEC on grid (SGE) script ./sge/sec.sh has been created.

    Several requirements were placed:

    The resulting ./sge/sec.sh accepts the same arguments as sec.py. Although it was designed for launching on a grid, it can also be used on ordinary machines.

    For this purpose, switch --own_kb_daemon was added to sec_daemon.py and --plaintext to sec.py. The KB daemon also gained the ability to change the name of its shared memory through a program argument.

    Usage in the manner of NER

    To use SEC through stdin/stdout in the same way as NER, use service "get_raw_annotations". Create a configuration file (for example "get_raw_annotations.cfg") containing, e.g.:

     {
         "get_raw_annotations": {}
     }
    

    Then NER can be called via SEC like this:

     ./sge/sec.sh -c get_raw_annotations.cfg --plaintext
    

    Launching on Salomon

    Scripts in directory ./salomon were created for launching on the supercomputer Salomon (IT4I).

    SEC depends on several libraries that are not installed on Salomon. They must be copied from knot09:/mnt/minerva1/nlp/projects/corpproc/dependencies_for_salomon/opt. This can be done e.g. as follows:

     $ mkdir -p ~/mnt/ssh-knot-knot09
     $ sshfs xlogin01@knot09.fit.vutbr.cz:/ ~/mnt/ssh-knot-knot09/
     $ cp -r ~/mnt/ssh-knot-knot09/mnt/minerva1/nlp/projects/corpproc/dependencies_for_salomon/opt ~/
     $ fusermount -u ~/mnt/ssh-knot-knot09
    

    The dependencies are already built. If a new compilation is necessary, launch ./salomon/prepare.sh.

    To launch, use one of the variants of script ./salomon/start.sh. Each variant expects:

     $ ls ~/parsed | sed 's/\.vert.*//g' > ~/namelist
    

    after launch it creates:

    Variants

    1. Variant ./salomon/start.sh launches a separate instance of ./sge/sec.sh on one node for each file from ~/namelist.
    2. Variant ./salomon/v2/start.sh requires an argument defining the number of jobs per node. Based on it and on the number of files in ~/namelist, the necessary number of jobs is created; these jobs occupy all nodes available to the user per job according to the limits.

    Provided services

    Service annotate

    This service returns annotations for the specified document, using the enrichment engine chosen by the user. Annotations include their location in the document (start and end offset), length and the annotated text itself. They also contain information obtained from the KB, e.g. type, name and Wikipedia URL.

    The enrichment engine is chosen with parameter "enrichment_engine". The user can also set its maximum processing time with parameter "enrichment_engine_timeout". Service "get_enrichment_engines" lists all enrichment engines.

    Text in a document is usually ambiguous, so the enrichment engine may find several candidate entities for a particular text. If parameter "disambiguate" is set, the enrichment engine selects the most probable meaning of the annotated text.

    The output format is chosen with parameter "annotation_format". Multiple output formats can be requested for one input. Note: this may change later so that the output is always valid JSON when "plaintext": false. Parameter "annotation_format" can have these values:

    Parameter "types_and_attributes" specifies which information from the KB is included in the output. It is possible to allow specific types with all of their attributes (syntax { str(type): "all" }) or with only some of them (syntax { str(type): [ str(attribute), ... ] }). The default value is "all", which allows all annotation types and all of their attributes. Service "get_entity_types_and_attributes" lists all available types and their attributes.
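
The three accepted shapes of "types_and_attributes" can be checked on the client side before a request is sent. The sketch below is only an illustration of the syntax described above; the function name and the type and attribute names in the examples ("person", "name", "date_of_birth") are hypothetical, and whether an empty mapping is acceptable is not stated in the source, so it is not rejected here.

```python
def is_valid_types_and_attributes(value):
    """Return True if value is "all" or a mapping of type -> "all" | [attribute, ...]."""
    if value == "all":
        return True
    if not isinstance(value, dict):
        return False
    for entity_type, attributes in value.items():
        if not isinstance(entity_type, str):
            return False
        if attributes == "all":
            continue
        if isinstance(attributes, list) and all(isinstance(a, str) for a in attributes):
            continue
        return False
    return True


# Hypothetical type and attribute names, for illustration only.
print(is_valid_types_and_attributes("all"))
print(is_valid_types_and_attributes({"person": ["name", "date_of_birth"]}))
print(is_valid_types_and_attributes({"person": "some"}))
```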

    Parameter "document_uri" gives the URL from which the document was taken. It is required when the NIF output format is selected.

    If parameter "plaintext" is set to true, the output is not wrapped in JSON. In this case the individual output formats are separated by character '\0'.
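
A plaintext answer with several output formats can be split on the '\0' separator. The sketch below assumes that the parts arrive in the same order as the formats were requested and that there is no trailing separator; neither is stated above. The function name and the format names used in the example are hypothetical.

```python
def split_plaintext_output(raw, requested_formats):
    """Pair each requested format with its part of a plaintext answer.

    Assumption: parts are in request order with no trailing separator.
    """
    parts = raw.split("\0")
    if len(parts) != len(requested_formats):
        raise ValueError(
            "expected {} parts, got {}".format(len(requested_formats), len(parts))
        )
    return dict(zip(requested_formats, parts))


# Made-up output and format names, for illustration only.
raw_output = "<suggestion/>\0<html></html>"
by_format = split_plaintext_output(raw_output, ["format_a", "format_b"])
print(by_format)
```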

    Samples and more information can be found here.

    Request format

     {
         "annotate": {
             "input_text": str,
             "annotation_format": [ str, ... ],
             "disambiguate": int,
             "document_uri": str,
             "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
             "enrichment_engine": str,
             "enrichment_engine_timeout": int,
             "plaintext": bool
         }
     }
    

    Answer format

     {
         "annotation": str
     }
    

    Output formats

    Output format of this service can be chosen by parameter "annotation_format". These formats are described below.

    SXML

    XML document including annotated text only. It is designed mainly for further processing.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Generated using: trang -I xml -O rng *.sxml SXML.rng -->
    <grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
     <start>
       <element name="suggestion">
         <zeroOrMore>
           <element name="text">
             <attribute name="e_offset">
               <data type="integer"/>
             </attribute>
             <attribute name="s_offset">
               <data type="integer"/>
             </attribute>
             <attribute name="string"/>
             <zeroOrMore>
               <element name="annotation">
                 <optional>
                   <attribute name="id">
                     <data type="anyURI"/>
                   </attribute>
                 </optional>
                 <attribute name="type">
                   <data type="NCName"/>
                 </attribute>
                 <zeroOrMore>
                   <element name="attribute">
                     <optional>
                       <attribute name="annotType">
                         <data type="NCName"/>
                       </attribute>
                     </optional>
                     <attribute name="name">
                       <data type="NCName"/>
                     </attribute>
                     <attribute name="type">
                       <data type="NCName"/>
                     </attribute>
                     <text/>
                   </element>
                 </zeroOrMore>
               </element>
             </zeroOrMore>
           </element>
         </zeroOrMore>
       </element>
     </start>
    </grammar>
    
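
A document that conforms to the schema above can be read with the Python standard library, as in the sketch below. The sample document, including the entity URI, type name and attribute, is made up for illustration, and the function name parse_sxml is not part of SEC.

```python
import xml.etree.ElementTree as ET

# Hypothetical SXML document shaped after the schema above.
SAMPLE_SXML = """<?xml version="1.0" encoding="UTF-8"?>
<suggestion>
  <text s_offset="0" e_offset="6" string="Prague">
    <annotation type="location" id="http://example.org/entity/1">
      <attribute name="name" type="string">Prague</attribute>
    </annotation>
  </text>
</suggestion>
"""


def parse_sxml(document):
    """Return a list of annotated spans from an SXML document."""
    root = ET.fromstring(document.encode("utf-8"))
    if root.tag != "suggestion":
        raise ValueError("root element must be <suggestion>")
    spans = []
    for text in root.findall("text"):
        annotations = []
        for annotation in text.findall("annotation"):
            annotations.append({
                "id": annotation.get("id"),  # optional in the schema
                "type": annotation.get("type"),
                "attributes": [
                    {
                        "name": attribute.get("name"),
                        "type": attribute.get("type"),
                        "annotType": attribute.get("annotType"),  # optional
                        "value": attribute.text or "",
                    }
                    for attribute in annotation.findall("attribute")
                ],
            })
        spans.append({
            "s_offset": int(text.get("s_offset")),
            "e_offset": int(text.get("e_offset")),
            "string": text.get("string"),
            "annotations": annotations,
        })
    return spans


spans = parse_sxml(SAMPLE_SXML)
print(spans)
```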
    XML
    HTML
    Text
    Index
    Index2
    RDF
    NIF

    Service annotate_vertical

    A special variant of service "annotate" for annotating text in vertical format; only the differences are described here. It includes service "deverticalize", which extracts the individual documents one by one from the input in vertical format. The output format must be specified in the request.

    Request format

     {
         "annotate_vertical": {
             "input_text": str,
             "annotation_format": str,
             "vert_cols": [ str, ... ],
             "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
             "enrichment_engine": str,
             "enrichment_engine_timeout": int,
             "filename": str,
             "num_workers": int,
             "plaintext": bool,
             "max_values_per_col": int | null,
             "wiki_mode": bool,
             "enable_figa": bool
         }
     }
    

    Answer format

     {
         "annotation": str | [
             {
                 "title": str,
                 "uri": str,
                 "article": str
             },
             ...
         ]
     }
    

    Service deverticalize

    Deverticalize text in vertical format (see http://www.sketchengine.co.uk/documentation/wiki/SkE/PrepareText or http://nlp.fi.muni.cz/cs/PopisVertikalu).

    Request format

     {
         "deverticalize": {
             "input_text": str,
             "vert_cols": [ str, ... ]
         }
     }
    

    Answer format

     {
         "deverticalized": [
             {
                 "id": str,
                 "document": str
             },
             ...
         ]
     }
    

    Errors

    Service get_enrichment_engines

    Lists all available enrichment engines, that is, the values that parameter "enrichment_engine" can take.

    Request format

     {
         "get_enrichment_engines": {}
     }
    

    Answer format

     {
         "enrichment_engines": [ str, ... ]
     }
    

    Service get_entities

    Prints every entity from the KB whose name is the same as or similar to the name given in parameter "input_string". The output is ordered by the value of entity attribute "confidence". As in service "annotate", the results can be filtered with parameter "types_and_attributes".

    Samples and more information can be found here.

    Request format

     {
         "get_entities": {
             "input_string": str,
             "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] },
             "max_results": int
         }
     }
    

    Answer format

     {
         "data": [
             {
                 str(type): {
                     str(attribute): str,
                     ...
                 }
             },
             ...
         ]
     }
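
The answer format above nests each entity's attributes under its type. A small helper for walking such an answer could look like this sketch; the function name, the type names and the attribute values in the sample are made up for illustration.

```python
def iter_entities(answer):
    """Yield (type, attributes) pairs from an answer with a "data" list."""
    for entry in answer.get("data", []):
        for entity_type, attributes in entry.items():
            yield entity_type, attributes


# Hypothetical answer shaped after the format above.
sample_answer = {
    "data": [
        {"person": {"name": "Ada Lovelace", "confidence": "98"}},
        {"location": {"name": "Ada", "confidence": "40"}},
    ]
}
names = [(entity_type, attrs.get("name")) for entity_type, attrs in iter_entities(sample_answer)]
print(names)
```

The iteration keeps the order of the "data" list, so the ordering by "confidence" supplied by the service is preserved.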
    

    Service get_entity_by_uri

    It finds entities by URI.

    Request format

     {
         "get_entity_by_uri": {
             "input_string": str,
             "types_and_attributes": "all" | { str(type): "all" } | { str(type): [ str(attribute), ... ] }
         }
     }
    

    Answer format

     {
         "data": [
             {
                 str(type): {
                     str(attribute): str,
                     ...
                 }
             },
             ...
         ]
     }
    

    Service get_entity_types_and_attributes

    Lists all available types and their attributes. This information can be used in parameter "types_and_attributes", which serves as a filter in certain services.

    Request format

     {
         "get_entity_types_and_attributes": {}
     }
    

    Answer format

     {
         "data": [
             {
                 "type": str(type),
                 "attributes": [
                     str(attribute),
                     ...
                 ]
             },
             ...
         ]
     }
    

    Service get_kb_version

    Returns the version number of the loaded KB.

    Request format

     {
         "get_kb_version": {}
     }
    

    Answer format

     {
         "version": int
     }
    

    Service get_raw_annotations

    Returns string obtained by NER.

    Request format

     {
         "get_raw_annotations": {
             "input_text": str,
             "disambiguate": int,
             "enrichment_engine": str,
             "enrichment_engine_timeout": int,
             "plaintext": bool
         }
     }
    

    Answer format

     {
         "annotation": str
     }
    

    Some of the generated attributes for types

    "confidence"

    "identifier"

    "disambiguation"

    <disambiguation> ::= <text in URI between brackets> "," <description> "(" <interval of living> ")"
                       | <text in URI between brackets> "(" <interval of living> ")"
                       | <text in URI between brackets>
                       | <description> "(" <interval of living> ")"
                       | <description>
                       | <interval of living>
                       | ""
    
    <interval of living> ::= <YYYY-MM-DD date of birth> -- <YYYY-MM-DD date of death>
                      | "born " <YYYY-MM-DD date of birth>
    
