org.apache.nlpcraft.nlp.enrichers

Type members

Classlikes

class NCBracketsTokenEnricher extends NCTokenEnricher with LazyLogging

Brackets token enricher.

This enricher adds the brackets boolean metadata property to the token instance, indicating whether the word it represents is enclosed in brackets. Supported brackets are: (), {}, [] and <>.

NOTE: invalidly enclosed brackets are ignored, and the brackets property is set to false for all input tokens.
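
The behavior described above can be illustrated with a minimal standalone sketch. It is not the library's implementation: BracketsSketch and flags are hypothetical names, tokens are modeled as plain strings, and the choice to flag a bracket token only when it is nested inside another pair is an assumption of this sketch.

```scala
object BracketsSketch {
  // Closing bracket -> the opening bracket it must match.
  private val openFor = Map(")" -> "(", "}" -> "{", "]" -> "[", ">" -> "<")
  private val opening = openFor.values.toSet

  // Returns one flag per token: true if the token sits inside a bracket pair.
  def flags(tokens: Seq[String]): Seq[Boolean] = {
    var stack = List.empty[String] // Currently open brackets, innermost first.
    var valid = true
    val out = tokens.toVector.map { t =>
      if (opening.contains(t)) {
        val inside = stack.nonEmpty // Flag computed before this bracket opens.
        stack = t :: stack
        inside
      } else if (openFor.contains(t)) {
        stack match {
          case head :: tail if head == openFor(t) => stack = tail
          case _ => valid = false // Mismatched or unexpected closing bracket.
        }
        stack.nonEmpty // Flag computed after this bracket closes.
      } else stack.nonEmpty
    }
    // Invalid or unbalanced input: every token gets false.
    if (valid && stack.isEmpty) out else out.map(_ => false)
  }
}
```

For example, BracketsSketch.flags(Seq("a", "(", "b", ")", "c")) flags only "b", while an unbalanced input such as Seq("(", "a") yields false for every token.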

Source:
NCBracketsTokenEnricher.scala
class NCDictionaryTokenEnricher(dictRes: String) extends NCTokenEnricher with LazyLogging

Dictionary-based "known-word" token enricher.

This enricher adds the dict boolean metadata property to the token instance, indicating whether the word it represents is a known dictionary word, i.e. whether the configured dictionary contains this word's lemma. The value true means that the lemma is found in the dictionary; false means otherwise.

NOTE: this implementation requires the lemma string metadata property that contains the token's lemma. To provide it, configure NCOpenNLPTokenEnricher for the required language before this enricher in your pipeline.

Value parameters:
dictRes

Relative path, absolute path, classpath resource or URL to the dictionary. The dictionary should be in a simple plain text format with one lemma per line. Empty lines are skipped, duplicates are ignored, and lines starting with the # symbol are treated as comments and ignored. Note that the dictionary search uses the word's lemma and is case-insensitive.
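
The dictionary format described above can be sketched as follows. This is a standalone illustration, not the library's code: DictionarySketch, parse and isKnown are hypothetical names, and trimming whitespace around each line is an assumption of this sketch.

```scala
object DictionarySketch {
  // Builds a lookup set from raw dictionary lines.
  def parse(lines: Seq[String]): Set[String] =
    lines
      .map(_.trim)                                    // Assumption: surrounding whitespace is not significant.
      .filter(l => l.nonEmpty && !l.startsWith("#")) // Skip empty lines and comments.
      .map(_.toLowerCase)                             // Case is ignored.
      .toSet                                          // Duplicates collapse here.

  // Case-insensitive lookup by lemma.
  def isKnown(dict: Set[String], lemma: String): Boolean =
    dict.contains(lemma.toLowerCase)
}
```

With the lines "# fruits", "", "Apple", "apple" and "pear", parse produces a two-entry set, and isKnown(dict, "APPLE") is true.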

Source:
NCDictionaryTokenEnricher.scala
class NCEnStopWordsTokenEnricher(addSet: Set[String], exclSet: Set[String], stemmer: NCStemmer) extends NCTokenEnricher with LazyLogging

Stopword token enricher for English (EN) language. Stopwords are the words which are filtered out (i.e. stopped) before processing of natural language text because they are insignificant.

This enricher adds the stopword boolean metadata property to the token instance. The value true indicates that the word the token represents is detected as an English stopword; false indicates otherwise. The implementation uses an internal list of English stopwords together with procedural logic to determine the stopword status of a token, which should work well for most general use cases. You can also add stopwords or define exceptions to the existing ones using the corresponding parameters of the NCEnStopWordsTokenEnricher constructor.

More information about stopwords can be found at https://en.wikipedia.org/wiki/Stop_word.

NOTE: this implementation requires the lemma and pos string metadata properties that contain the token's lemma and part of speech, respectively. To provide them, configure NCOpenNLPTokenEnricher with a model for the English language before this enricher in your pipeline.

Value parameters:
addSet

User-defined collection of additional stopwords. These words are stemmed by the given stemmer before matching. Default value is an empty set.

exclSet

User-defined collection of exceptions, i.e. the words which should not be marked as stopwords during processing. These words are stemmed by the given stemmer before matching. Default value is an empty set.

stemmer

English stemmer implementation. Default value is an instance of NCEnStemmer.
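
The role of the user-defined sets and the stemmer can be sketched in isolation. This is a simplified illustration only: it omits the lemma, part-of-speech and procedural logic of the real enricher. StopWordsSketch, toyStem and builtIn are hypothetical names, toyStem is a toy stand-in for a real stemmer, and giving exclusions priority over additions is an assumption of this sketch.

```scala
object StopWordsSketch {
  // Toy stemmer (assumption): lowercases and strips a trailing "s" from longer words.
  def toyStem(w: String): String = {
    val s = w.toLowerCase
    if (s.length > 3 && s.endsWith("s")) s.dropRight(1) else s
  }

  // builtIn stands in for the internal stopword list.
  def isStopword(
    word: String,
    builtIn: Set[String],
    addSet: Set[String],
    exclSet: Set[String],
    stem: String => String
  ): Boolean = {
    val s = stem(word)
    // Both user-defined sets are stemmed before matching.
    val excluded = exclSet.map(stem).contains(s)
    val added = addSet.map(stem).contains(s)
    !excluded && (added || builtIn.map(stem).contains(s))
  }
}
```

Here an addSet of Set("table") also matches the word "tables" because both reduce to the same stem, while an exclSet of Set("the") prevents "the" from being marked even though it is in the built-in list.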

Source:
NCEnStopWordsTokenEnricher.scala
class NCOpenNLPTokenEnricher(posMdlRes: String, lemmaDicRes: String) extends NCTokenEnricher with LazyLogging

OpenNLP-based language-independent token enricher. This enricher adds the lemma and pos (part-of-speech) string metadata properties to the token instance.

This OpenNLP enricher uses PoS and lemma models, at least one of which must be defined. Some free community-maintained OpenNLP models are available online.

Value parameters:
lemmaDicRes

Relative path, absolute path, classpath resource or URL to the DictionaryLemmatizer model. Can be null if the lemmatizer model is not configured, in which case the lemma property will not be set. Note that at least one of the models must be provided.

posMdlRes

Relative path, absolute path, classpath resource or URL to the POSTaggerME model. Can be null if the part-of-speech model is not configured, in which case the pos property will not be set. Note that at least one of the models must be provided.
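
The relationship between the two parameters and the resulting metadata properties can be summarized in a small sketch. OpenNlpConfigSketch and providedProperties are hypothetical names, and rejecting two null arguments with require is an assumption: this page does not say how the real class reports that case.

```scala
object OpenNlpConfigSketch {
  // Returns the names of the metadata properties that would be set
  // for the given combination of model resources.
  def providedProperties(posMdlRes: String, lemmaDicRes: String): Set[String] = {
    // Assumption: a missing pair of models is rejected up front.
    require(posMdlRes != null || lemmaDicRes != null, "At least one model must be provided.")
    val pos = if (posMdlRes != null) Set("pos") else Set.empty[String]
    val lemma = if (lemmaDicRes != null) Set("lemma") else Set.empty[String]
    pos ++ lemma
  }
}
```

Enrichers that depend on both lemma and pos, such as the English stopword enricher, need both models configured; the dictionary enricher needs only the lemma model.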

Source:
NCOpenNLPTokenEnricher.scala
class NCQuotesTokenEnricher extends NCTokenEnricher with LazyLogging

Quotes token enricher.

This enricher adds the quoted boolean metadata property to the token instance, indicating whether the word it represents is in quotes. The value true means that the word is in quotes; false means otherwise.

Supported quotes are: «, », ", ', `.

NOTE: invalidly enclosed quotes are ignored.
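
A minimal standalone sketch of quote detection over the supported characters follows. It is not the library's implementation: QuotesSketch and flags are hypothetical names, tokens are modeled as plain strings, and three choices are assumptions of this sketch: nesting is not supported, the quote tokens themselves are not flagged, and invalid input yields false for every token.

```scala
object QuotesSketch {
  // Opening quote -> the closing quote it expects.
  private val closeFor = Map("«" -> "»", "\"" -> "\"", "'" -> "'", "`" -> "`")
  private val allQuotes = closeFor.keySet ++ closeFor.values

  // Returns one flag per token: true if the token sits between a quote pair.
  def flags(tokens: Seq[String]): Seq[Boolean] = {
    var expected = Option.empty[String] // Closing quote we are waiting for, if any.
    var valid = true
    val out = tokens.toVector.map { t =>
      if (!allQuotes.contains(t)) expected.nonEmpty
      else {
        expected match {
          case Some(close) =>
            if (t == close) expected = None else valid = false
          case None =>
            closeFor.get(t) match {
              case Some(close) => expected = Some(close)
              case None => valid = false // Closing quote without an opening one.
            }
        }
        false // Quote tokens themselves are not flagged in this sketch.
      }
    }
    if (valid && expected.isEmpty) out else out.map(_ => false)
  }
}
```

For example, QuotesSketch.flags(Seq("«", "a", "»", "b")) flags only "a", and an unterminated quote such as Seq("'", "a") yields false for every token.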

Source:
NCQuotesTokenEnricher.scala
class NCSwearWordsTokenEnricher(dictRes: String, stemmer: NCStemmer) extends NCTokenEnricher with LazyLogging

Swear-word token enricher.

This enricher adds the swear boolean metadata property to the token instance, indicating whether the word it represents is in a swear-word dictionary, i.e. whether the swear dictionary contains this word's stem. The value true means that the stem is found in the dictionary; false means otherwise.

Value parameters:
dictRes

Relative path, absolute path, classpath resource or URL to the dictionary. The dictionary should be in a simple plain text format with one lemma per line. Empty lines are skipped, duplicates are ignored, and lines starting with the # symbol are treated as comments and ignored. Note that the dictionary search uses the word's stem and is case-insensitive.

stemmer

Stemmer implementation for the language used in the supplied swear-word dictionary.

Source:
NCSwearWordsTokenEnricher.scala