Interface NCToken

  • All Superinterfaces:
    NCMetadata

    public interface NCToken
    extends NCMetadata
    Detected model element. A token is a detected model element and is a part of the parsed user input. A sequence of tokens represents a fully parsed (see NCContext.getVariants() method) user input. A single token corresponds to one or more words, sequential or not, in the user sentence.

    Note that tokens can be used to define other tokens (i.e. tokens are composable). Because of that, tokens naturally form a tree hierarchy - see methods findPartTokens(String...), getAliases(), isOfAlias(String) and getPartTokens(). Note also that the detected model elements that tokens represent form another hierarchy, namely a model element hierarchy that the user can access via the getAncestors() and getParentId() methods. These two hierarchies should not be confused.

    Configuring Token Providers
    Token providers (built-in or 3rd party) have to be enabled in the REST server configuration. Data models also have to specify the tokens they expect the REST server and probe to detect. This limits unnecessary processing, since implicitly enabling all token providers and all tokens can lead to a significant slowdown. The REST server configuration property nlpcraft.server.tokenProviders provides the list of enabled token providers. Data models provide their required tokens in the NCModelView.getEnabledBuiltInTokens() method.

    Read full documentation in Data Model section and review examples.

    See Also:
    NCElement, NCContext.getVariants()
    • Method Summary

      Modifier and Type Method Description
      default List<NCToken> findPartTokens​(String... idOrAlias)
      Gets the list of all part tokens with given IDs or aliases traversing entire part token graph.
      Set<String> getAliases()
      Gets optional set of aliases this token is known by.
      List<String> getAncestors()
      Gets the list of all parent IDs from this token up to the root.
      int getEndCharIndex()
      Gets end character index of this token in the original text.
      List<String> getGroups()
      Gets the list of groups this token belongs to.
      String getId()
      If this token represents user defined model element this method returns the ID of that element.
      default int getIndex()
      A shortcut method that gets index of this token in the sentence.
      default String getLemma()
      A shortcut method to get lemma of this token, i.e. a canonical form of this word.
      NCModelView getModel()
      Gets reference to the model this token belongs to.
      default String getNormalizedText()
      A shortcut method that gets normalized user input text for this token.
      default String getOriginalText()
      A shortcut method that gets original user input text for this token.
      String getParentId()
      Gets the optional parent ID of the model element this token represents.
      List<NCToken> getPartTokens()
      Gets the list of tokens this token is composed of.
      default String getPos()
      A shortcut method to get Penn Treebank POS tag for this token.
      String getServerRequestId()
      Gets ID of the server request this token is part of.
      default int getSparsity()
      A shortcut method to get numeric value of how sparse the token is.
      int getStartCharIndex()
      Gets start character index of this token in the original text.
      default String getStem()
      A shortcut method to get stem of this token.
      default String getUnid()
      A shortcut method that gets internal globally unique system ID of the token.
      String getValue()
      Gets the value if this token was detected via element's value (or its synonyms).
      boolean isAbstract()
      Whether or not this token is abstract.
      default boolean isBracketed()
      A shortcut method on whether or not this token is surrounded by any of '[', ']', '{', '}', '(', ')' brackets.
      default boolean isChildOf​(String tokId)
      Tests whether this token is a child of given token ID.
      default boolean isDirect()
      A shortcut method on whether or not this token was matched on direct (not permutated) synonym.
      default boolean isEnglish()
      A shortcut method on whether this token represents an English word.
      default boolean isFreeWord()
      A shortcut method checking whether or not this token represents a free word.
      default boolean isMemberOf​(String grp)
      Tests whether or not this token belongs to the given group.
      default boolean isOfAlias​(String alias)
      Tests whether or not this token has given alias.
      default boolean isPermutated()
      A shortcut method on whether or not this token was matched on permutated (not direct) synonym.
      default boolean isQuoted()
      A shortcut method on whether or not this token is surrounded by single or double quotes.
      default boolean isStopWord()
      A shortcut method checking whether or not this token is a stopword.
      default boolean isSwear()
      A shortcut method on whether or not this token is a swear word.
      default boolean isSystemDefined()
      Tests whether or not this token is for a system-defined (i.e. not user-defined) model element.
      default boolean isUserDefined()
      Tests whether or not this token is for a user-defined model element.
      default boolean isWordnet()
      A shortcut method on whether or not this token is found in Princeton WordNet database.
    • Method Detail

      • getModel

        NCModelView getModel()
        Gets reference to the model this token belongs to.
        Returns:
        Model reference.
      • getServerRequestId

        String getServerRequestId()
        Gets ID of the server request this token is part of.
        Returns:
        ID of the server request this token is part of.
      • getId

        String getId()
        If this token represents user defined model element this method returns the ID of that element. Otherwise, it returns ID of the built-in system token. Note that a sentence can have multiple tokens with the same element ID.
        Returns:
        ID of the element (system or user defined).
        See Also:
        NCElement.getId()
      • getParentId

        String getParentId()
        Gets the optional parent ID of the model element this token represents. This is only available for user-defined model elements - built-in tokens do not have parents and this method will return null for them.
        Returns:
        ID of the token's element immediate parent or null if not available.
        See Also:
        NCElement.getParentId(), getAncestors()
      • getAncestors

        List<String> getAncestors()
        Gets the list of all parent IDs from this token up to the root. This is only available for user-defined model elements - built-in tokens do not have parents and will return an empty list.
        Returns:
        List, potentially empty but never null, of all parent IDs from this token up to the root.
        See Also:
        getParentId()
      • isChildOf

        default boolean isChildOf​(String tokId)
        Tests whether this token is a child of given token ID. It is equivalent to:
             return getAncestors().contains(tokId);
         
        Parameters:
        tokId - Ancestor token ID.
        Returns:
        true if this token is a child of given token ID, false otherwise.
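
        The relationship between getParentId(), getAncestors() and isChildOf(String) can be sketched with a small stand-in hierarchy. The class ElementHierarchySketch, its methods and the element IDs below are hypothetical and are not part of the NLPCraft API; the sketch also assumes the parent chain has no cycles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a model element hierarchy; not part of the NLPCraft API.
class ElementHierarchySketch {
    // Maps an element ID to its immediate parent ID (no entry for root elements).
    private final Map<String, String> parentIds = new HashMap<>();

    void addElement(String id, String parentId) {
        if (parentId != null)
            parentIds.put(id, parentId);
    }

    // Mirrors getParentId(): immediate parent ID or null.
    String parentId(String id) {
        return parentIds.get(id);
    }

    // Mirrors getAncestors(): all parent IDs from the element up to the root.
    List<String> ancestors(String id) {
        List<String> res = new ArrayList<>();
        for (String p = parentIds.get(id); p != null; p = parentIds.get(p))
            res.add(p);
        return res;
    }

    // Mirrors isChildOf(tokId): true for any ancestor, not only the immediate parent.
    boolean isChildOf(String id, String ancestorId) {
        return ancestors(id).contains(ancestorId);
    }

    // Builds the chain sedan -> car -> vehicle (made-up element IDs).
    static ElementHierarchySketch demo() {
        ElementHierarchySketch h = new ElementHierarchySketch();
        h.addElement("vehicle", null);
        h.addElement("car", "vehicle");
        h.addElement("sedan", "car");
        return h;
    }

    public static void main(String[] args) {
        ElementHierarchySketch h = demo();
        System.out.println(h.ancestors("sedan"));
        System.out.println(h.isChildOf("sedan", "vehicle"));
        System.out.println(h.isChildOf("vehicle", "sedan"));
    }
}
```

        In this sketch "sedan" is a child of both "car" and "vehicle", because the test is made against the full ancestor list rather than the immediate parent only.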
      • getPartTokens

        List<NCToken> getPartTokens()
        Gets the list of tokens this token is composed of. This method returns only immediate part tokens.
        Returns:
        List of constituent tokens, potentially empty but never null, that this token is composed of.
        See Also:
        findPartTokens(String...)
      • findPartTokens

        default List<NCToken> findPartTokens​(String... idOrAlias)
        Gets the list of all part tokens with given IDs or aliases traversing entire part token graph.
        Parameters:
        idOrAlias - List of token IDs or aliases, potentially empty. If empty, the entire tree of part tokens is returned as a list.
        Returns:
        List of all part tokens with given IDs or aliases. Potentially empty but never null.
        See Also:
        getPartTokens()
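
        The difference between immediate part tokens and a search over the entire part token tree can be sketched as follows. PartTokenSketch is a hypothetical stand-in, not the NCToken type, and the token IDs and aliases are made up. The depth-first, pre-order traversal is an assumption; the order used by the actual implementation is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a composed token; not the NLPCraft NCToken type.
class PartTokenSketch {
    final String id;
    final Set<String> aliases;
    final List<PartTokenSketch> parts; // Immediate part tokens only.

    PartTokenSketch(String id, List<String> aliases, PartTokenSketch... parts) {
        this.id = id;
        this.aliases = new HashSet<>(aliases);
        this.parts = Arrays.asList(parts);
    }

    // Collects matching part tokens at any depth (assumed order: depth-first, pre-order).
    // With no IDs or aliases given, every part token in the tree is returned.
    List<PartTokenSketch> findPartTokens(String... idOrAlias) {
        Set<String> wanted = new HashSet<>(Arrays.asList(idOrAlias));
        List<PartTokenSketch> res = new ArrayList<>();
        collect(this, wanted, res);
        return res;
    }

    private static void collect(PartTokenSketch tok, Set<String> wanted, List<PartTokenSketch> res) {
        for (PartTokenSketch part : tok.parts) {
            boolean matches = wanted.isEmpty() || wanted.contains(part.id);
            for (String alias : part.aliases)
                matches = matches || wanted.contains(alias);
            if (matches)
                res.add(part);
            collect(part, wanted, res); // Descend into nested part tokens.
        }
    }

    // Builds x:order -> x:line (alias "first") -> [x:num (alias "qty"), x:item].
    static PartTokenSketch demo() {
        PartTokenSketch num = new PartTokenSketch("x:num", Arrays.asList("qty"));
        PartTokenSketch item = new PartTokenSketch("x:item", Arrays.<String>asList());
        PartTokenSketch line = new PartTokenSketch("x:line", Arrays.asList("first"), num, item);
        return new PartTokenSketch("x:order", Arrays.<String>asList(), line);
    }

    public static void main(String[] args) {
        PartTokenSketch order = demo();
        System.out.println(order.parts.size());                             // Immediate parts.
        System.out.println(order.findPartTokens().size());                  // Whole tree.
        System.out.println(order.findPartTokens("qty", "x:item").size());   // By alias and by ID.
    }
}
```

        Here the top-level token has one immediate part but three part tokens in its whole tree, two of which are found when searching by the alias "qty" and the ID "x:item".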
      • getAliases

        Set<String> getAliases()
        Gets optional set of aliases this token is known by. Token can get an alias if it is a part of other composed token and IDL expression that was used to match it specified an alias. Note that token can have zero, one or more aliases.
        Returns:
        Set of aliases this token is known by. Can be empty, but never null.
      • isOfAlias

        default boolean isOfAlias​(String alias)
        Tests whether or not this token has given alias. It is equivalent to:
              return getAliases().contains(alias);
         
        Parameters:
        alias - Alias to test.
        Returns:
        True if this token has alias alias, false otherwise.
      • getValue

        String getValue()
        Gets the value if this token was detected via element's value (or its synonyms). Otherwise returns null. Only applicable for user-defined model elements - built-in tokens do not have values and it will return null.
        Returns:
        Value for the user-defined model element or null, if not available.
        See Also:
        NCElement.getValues()
      • getGroups

        List<String> getGroups()
        Gets the list of groups this token belongs to. Note that, by default, if not specified explicitly, token always belongs to one group with ID equal to token ID.
        Returns:
        Token groups list. Never null - but can be empty.
        See Also:
        NCElement.getGroups()
      • isMemberOf

        default boolean isMemberOf​(String grp)
        Tests whether or not this token belongs to the given group. It is equivalent to:
              return getGroups().contains(grp);
         
        Parameters:
        grp - Group to test.
        Returns:
        True if this token belongs to the group grp, false otherwise.
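
        The default group rule described for getGroups() can be sketched as below. GroupSketch and its methods are hypothetical helpers, not NLPCraft API. The sketch assumes that explicitly declared groups replace the default group rather than extend it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper illustrating the default-group rule; not part of the NLPCraft API.
class GroupSketch {
    // Returns the declared groups if any, otherwise a single group equal to the element ID.
    static List<String> effectiveGroups(String elementId, List<String> declaredGroups) {
        if (declaredGroups == null || declaredGroups.isEmpty())
            return Collections.singletonList(elementId);
        return declaredGroups;
    }

    // Mirrors isMemberOf(grp): a plain membership test on the group list.
    static boolean isMemberOf(String elementId, List<String> declaredGroups, String grp) {
        return effectiveGroups(elementId, declaredGroups).contains(grp);
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(effectiveGroups("x:city", none));
        System.out.println(isMemberOf("x:city", Arrays.asList("geo"), "geo"));
    }
}
```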
      • getStartCharIndex

        int getStartCharIndex()
        Gets start character index of this token in the original text.
        Returns:
        Start character index of this token.
      • getEndCharIndex

        int getEndCharIndex()
        Gets end character index of this token in the original text.
        Returns:
        End character index of this token.
      • isStopWord

        default boolean isStopWord()
        A shortcut method checking whether or not this token is a stopword. Stopwords are extremely common words that add little value in helping to understand user input and are excluded from the processing entirely. For example, words like a, the, can, of, about, over, etc. are typical stopwords in English. NLPCraft has a built-in set of stopwords.

        This method is equivalent to:

             return meta("nlpcraft:nlp:stopword");
         
        See more information on token metadata here.
        Returns:
        Whether or not this token is a stopword.
        See Also:
        NCModelView.getAdditionalStopWords(), NCModelView.getExcludedStopWords()
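
        Stopword exclusion can be sketched with a tiny word set. StopWordSketch is a hypothetical helper; its set contains only the example words listed above and is not NLPCraft's built-in set, and the whitespace tokenization is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; the word set is illustrative, not NLPCraft's built-in stopword set.
class StopWordSketch {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "the", "can", "of", "about", "over"));

    // Case-insensitive stopword test.
    static boolean isStopWord(String word) {
        return STOP_WORDS.contains(word.toLowerCase(Locale.ROOT));
    }

    // Splits on whitespace and keeps only the words that are not stopwords.
    static List<String> withoutStopWords(String sentence) {
        List<String> res = new ArrayList<>();
        for (String w : sentence.trim().split("\\s+"))
            if (!w.isEmpty() && !isStopWord(w))
                res.add(w);
        return res;
    }

    public static void main(String[] args) {
        System.out.println(withoutStopWords("The price of a ticket"));
    }
}
```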
      • isFreeWord

        default boolean isFreeWord()
        A shortcut method checking whether or not this token represents a free word. A free word is a token that was detected neither as a part of a user-defined token nor as a part of a system token.

        This method is equivalent to:

             return meta("nlpcraft:nlp:freeword");
         
        See more information on token metadata here.
        Returns:
        Whether or not this token is a freeword.
      • getOriginalText

        default String getOriginalText()
        A shortcut method that gets original user input text for this token.

        This method is equivalent to:

             return meta("nlpcraft:nlp:origtext");
         
        See more information on token metadata here.
        Returns:
        Original user input text for this token.
      • getIndex

        default int getIndex()
        A shortcut method that gets index of this token in the sentence.

        This method is equivalent to:

             return meta("nlpcraft:nlp:index");
         
        See more information on token metadata here.
        Returns:
        Index of this token in the sentence.
      • getNormalizedText

        default String getNormalizedText()
        A shortcut method that gets normalized user input text for this token.

        This method is equivalent to:

             return meta("nlpcraft:nlp:normtext");
         
        See more information on token metadata here.
        Returns:
        Normalized user input text for this token.
      • isDirect

        default boolean isDirect()
        A shortcut method on whether or not this token was matched on direct (not permutated) synonym.

        This method is equivalent to:

             return meta("nlpcraft:nlp:direct");
         
        See more information on token metadata here.
        Returns:
        Whether or not this token was matched on direct (not permutated) synonym.
      • isPermutated

        default boolean isPermutated()
        A shortcut method on whether or not this token was matched on permutated (not direct) synonym.

        This method is equivalent to:

             return !isDirect();
         
        See more information on token metadata here.
        Returns:
        Whether or not this token was matched on permutated (not direct) synonym.
      • isEnglish

        default boolean isEnglish()
        A shortcut method on whether this token represents an English word. Note that this only checks that the token's text consists of characters of the English alphabet, i.e. the text doesn't necessarily have to be a known valid English word.

        This method is equivalent to:

             return meta("nlpcraft:nlp:english");
         
        See more information on token metadata here.
        Returns:
        Whether this token represents an English word.
      • isSwear

        default boolean isSwear()
        A shortcut method on whether or not this token is a swear word. NLPCraft has a built-in list of common English swear words.

        This method is equivalent to:

             return meta("nlpcraft:nlp:swear");
         
        See more information on token metadata here.
        Returns:
        Whether or not this token is a swear word.
      • isQuoted

        default boolean isQuoted()
        A shortcut method on whether or not this token is surrounded by single or double quotes.

        This method is equivalent to:

             return meta("nlpcraft:nlp:quoted");
         
        See more information on token metadata here.
        Returns:
        Whether or not this token is surrounded by single or double quotes.
      • isBracketed

        default boolean isBracketed()
        A shortcut method on whether or not this token is surrounded by any of '[', ']', '{', '}', '(', ')' brackets.

        This method is equivalent to:

             return meta("nlpcraft:nlp:bracketed");
         
        See more information on token metadata here.
        Returns:
        Whether or not this token is surrounded by any of '[', ']', '{', '}', '(', ')' brackets.
      • isWordnet

        default boolean isWordnet()
        A shortcut method on whether or not this token is found in Princeton WordNet database.

        This method is equivalent to:

             return meta("nlpcraft:nlp:dict");
         
        See more information on token metadata here.
        Returns:
        Whether or not this token is found in Princeton WordNet database.
      • getLemma

        default String getLemma()
        A shortcut method to get lemma of this token, i.e. a canonical form of this word. Note that stemming and lemmatization both reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Lemmatization uses a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

        This method is equivalent to:

             return meta("nlpcraft:nlp:lemma");
         
        See more information on token metadata here.
        Returns:
        Lemma of this token, i.e. a canonical form of this word.
      • getStem

        default String getStem()
        A shortcut method to get stem of this token. Note that stemming and lemmatization both reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Unlike lemmatization, stemming is a basic heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

        This method is equivalent to:

             return meta("nlpcraft:nlp:stem");
         
        See more information on token metadata here.
        Returns:
        Stem of this token.
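
        The contrast between a stem and a lemma can be illustrated with a toy example. StemLemmaSketch, its suffix list and its three-word dictionary are hypothetical and deliberately simplistic; they are not the algorithms or data NLPCraft uses.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only; these are not the algorithms NLPCraft uses.
class StemLemmaSketch {
    // Suffixes tried in order by the toy stemmer.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    // Tiny hand-written dictionary standing in for a real vocabulary.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("studies", "study");
        LEMMAS.put("better", "good");
        LEMMAS.put("running", "run");
    }

    // Heuristic stemming: chop the first matching suffix, keeping at least three characters.
    static String stem(String word) {
        for (String suffix : SUFFIXES)
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3)
                return word.substring(0, word.length() - suffix.length());
        return word;
    }

    // Dictionary lemmatization: look the word up, fall back to the word itself.
    static String lemma(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"studies", "better", "running"})
            System.out.println(w + " -> stem=" + stem(w) + ", lemma=" + lemma(w));
    }
}
```

        With these toy rules "studies" stems to "studi" but lemmatizes to "study", and "better" is left unchanged by the stemmer while the dictionary maps it to "good". A stem need not be a real word, whereas a lemma is a dictionary form.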
      • getSparsity

        default int getSparsity()
        A shortcut method to get numeric value of how sparse the token is. This makes sense only for multi-word tokens. Sparsity zero means that all individual words in the token follow each other (regardless of the order).

        This method is equivalent to:

             return meta("nlpcraft:nlp:sparsity");
         
        See more information on token metadata here.
        Returns:
        Numeric value of how sparse the token is. Zero means no gaps between words. The bigger the sparsity value, the bigger the average gap between words.
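
        One plausible way to quantify sparsity is to count the words skipped between the token's own words. The exact formula NLPCraft uses is not given here, so the SparsitySketch class below is a hypothetical illustration that only reproduces the stated properties: zero for adjacent words regardless of their order, and larger values for larger gaps. It assumes distinct word indexes.

```java
import java.util.Arrays;

// Hypothetical illustration; not necessarily the formula NLPCraft uses.
class SparsitySketch {
    // Sums the number of skipped positions between consecutive word indexes.
    // Indexes are sorted first, so the original word order does not matter.
    static int sparsity(int... wordIndexes) {
        int[] sorted = wordIndexes.clone();
        Arrays.sort(sorted);
        int gaps = 0;
        for (int i = 1; i < sorted.length; i++)
            gaps += sorted[i] - sorted[i - 1] - 1;
        return gaps;
    }

    public static void main(String[] args) {
        System.out.println(sparsity(2, 3, 4)); // Adjacent words.
        System.out.println(sparsity(4, 2, 3)); // Adjacent words, permuted.
        System.out.println(sparsity(1, 3, 6)); // One and two skipped words.
    }
}
```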
      • getPos

        default String getPos()
        A shortcut method to get Penn Treebank POS tag for this token. Note that in addition to the standard Penn Treebank POS tags NLPCraft introduces a synthetic '---' tag to indicate a POS tag for multiword tokens.

        This method is equivalent to:

             return meta("nlpcraft:nlp:pos");
         
        See more information on token metadata here.
        Returns:
        Penn Treebank POS tag for this token.
      • getUnid

        default String getUnid()
        A shortcut method that gets internal globally unique system ID of the token.

        This method is equivalent to:

             return meta("nlpcraft:nlp:unid");
         
        Returns:
        Internal globally unique system ID of the token.
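
        The shortcut methods above all follow one pattern: a typed lookup of a well-known metadata property. The sketch below imitates that pattern with a hypothetical MetaSketch interface backed by a plain map. It is a stand-in written for illustration and is not the NCMetadata or NCToken implementation; only the property names are taken from the text above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata container; it only imitates the shape of meta(String).
interface MetaSketch {
    Map<String, Object> metaMap();

    // Generic accessor: the type expected by the caller decides the cast.
    @SuppressWarnings("unchecked")
    default <T> T meta(String prop) {
        return (T) metaMap().get(prop);
    }

    // Shortcut methods in the style described above.
    default String getLemma() { return meta("nlpcraft:nlp:lemma"); }
    default int getIndex() { return meta("nlpcraft:nlp:index"); }
    default boolean isDirect() { return meta("nlpcraft:nlp:direct"); }
    default boolean isPermutated() { return !isDirect(); }

    // Builds a sample instance with made-up property values.
    static MetaSketch demo() {
        Map<String, Object> m = new HashMap<>();
        m.put("nlpcraft:nlp:lemma", "run");
        m.put("nlpcraft:nlp:index", 2);
        m.put("nlpcraft:nlp:direct", Boolean.TRUE);
        return () -> m;
    }
}

class MetaSketchDemo {
    public static void main(String[] args) {
        MetaSketch tok = MetaSketch.demo();
        System.out.println(tok.getLemma());
        System.out.println(tok.getIndex());
        System.out.println(tok.isPermutated());
    }
}
```

        In this sketch a missing property yields null, so a shortcut with a primitive return type would fail with a NullPointerException on unboxing. Whether the real metadata container guarantees these properties are always present is not stated here.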
      • isUserDefined

        default boolean isUserDefined()
        Tests whether or not this token is for a user-defined model element.
        Returns:
        True if this token is defined by the user model element, false otherwise.
      • isSystemDefined

        default boolean isSystemDefined()
        Tests whether or not this token is for a system-defined (i.e. not user-defined) model element.
        Returns:
        True if this token is not defined by the user model element, false otherwise.
      • isAbstract

        boolean isAbstract()
        Whether or not this token is abstract.

        An abstract token is only detected when it is either a constituent part of some other non-abstract token or referenced by built-in tokens. In other words, an abstract token will not be detected in a standalone unreferenced position. By default (unless this method returns true), all named entities are considered to be non-abstract.

        Declaring tokens as abstract is important to minimize the number of parsing variants automatically generated as permutations of all possible parsing compositions. For example, if it is known that a particular named entity will only be used as a constituent part of some other token, declaring such a named entity as abstract can significantly reduce the number of parsing variants, leading to better performance and often to a simpler corresponding intent definition and callback logic.

        Returns:
        Whether or not this token is abstract.