org.apache.nlpcraft.nlp.enrichers
Type members
Classlikes
Brackets token enricher.
This enricher adds the brackets boolean metadata property to the token instance if the word it represents is enclosed in brackets. Supported brackets are: (), {}, [] and <>.
NOTE: if the brackets in the input are invalid, they are ignored and the brackets property is set to false for all input tokens.
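The rule above can be illustrated with a minimal, standalone sketch. It does not use the NLPCraft API: BracketsSketch and mark are hypothetical names, tokens are plain strings, and flagging the bracket tokens themselves by the current nesting level is an assumption made for the example.

```scala
object BracketsSketch {
  // Closing bracket -> matching opening bracket.
  private val pairs = Map(")" -> "(", "}" -> "{", "]" -> "[", ">" -> "<")
  private val openers = pairs.values.toSet

  /** Returns one flag per token: true if the token sits inside a bracket pair. */
  def mark(tokens: List[String]): List[Boolean] = {
    var stack = List.empty[String] // currently open brackets, innermost first
    var valid = true
    val flags = tokens.map { t =>
      if (openers.contains(t)) {
        val inside = stack.nonEmpty
        stack = t :: stack
        inside
      }
      else if (pairs.contains(t)) {
        if (stack.headOption.contains(pairs(t))) stack = stack.tail
        else valid = false // closer without a matching opener
        stack.nonEmpty
      }
      else stack.nonEmpty
    }
    // Mismatched or unclosed brackets: every token is flagged false.
    if (valid && stack.isEmpty) flags else tokens.map(_ => false)
  }
}
```

For example, BracketsSketch.mark(List("a", "(", "b", ")", "c")) flags only "b", while an input such as List("(", "a", "]") yields false for every token.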
Dictionary-based "known-word" token enricher.
This enricher adds the dict boolean metadata property to the token instance, indicating whether the word it represents is a known dictionary word, i.e. whether the configured dictionary contains this word's lemma. The value true of the metadata property indicates that this word's lemma is found in the dictionary; false indicates otherwise.
NOTE: this implementation requires the lemma string metadata property that contains the token's lemma. To provide this metadata property, you can place an NCOpenNLPTokenEnricher configured for the required language before this enricher in your pipeline.
- Value parameters:
- dictRes
Relative path, absolute path, classpath resource or URL to the dictionary. The dictionary should be in a simple plain-text format with one lemma per line. Empty lines are skipped, duplicates are ignored, and lines starting with the # symbol are treated as comments and ignored. Note that the dictionary search uses the word's lemma and is case-insensitive.
- Source:
- NCDictionaryTokenEnricher.scala
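The dictionary format described for dictRes can be sketched as follows. This is a standalone illustration, not the enricher's actual loading code: DictionarySketch, parse and isKnown are hypothetical names, and trimming surrounding whitespace is an assumption.

```scala
object DictionarySketch {
  /** Parses dictionary lines: one lemma per line, empty lines and '#' comments skipped. */
  def parse(lines: Iterable[String]): Set[String] =
    lines
      .map(_.trim) // assumption: surrounding whitespace is not significant
      .filter(l => l.nonEmpty && !l.startsWith("#"))
      .map(_.toLowerCase) // case is ignored
      .toSet // a Set drops duplicates

  /** Case-insensitive lookup by lemma, mirroring the meaning of the dict property. */
  def isKnown(dict: Set[String], lemma: String): Boolean =
    dict.contains(lemma.toLowerCase)
}
```

With the lines "# comment", "", "Run", "run" and "walk", parse returns a set containing only run and walk, and isKnown(dict, "RUN") is true.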
Stopword token enricher for English (EN) language. Stopwords are the words which are filtered out (i.e. stopped) before processing of natural language text because they are insignificant.
This enricher adds the stopword boolean metadata property to the token instance, indicating whether the word it represents is an English stopword. The value true of this metadata property indicates that this word is detected as a stopword; false indicates otherwise. This implementation uses an algorithm that combines an internal list of English stopwords with procedural logic to determine the stopword status of the token. This algorithm should work well for most general use cases. You can also add stopwords, or exceptions to the existing ones, using the corresponding parameters of the NCEnStopWordsTokenEnricher constructor.
More information about stopwords can be found at https://en.wikipedia.org/wiki/Stop_word.
NOTE: this implementation requires the lemma and pos string metadata properties that contain the token's lemma and part of speech, respectively. To provide these metadata properties, you can place an NCOpenNLPTokenEnricher configured with the models for the English language before this enricher in your pipeline.
- Value parameters:
- addSet
User-defined collection of additional stopwords. These words will be stemmed by the given stemmer before attempting to find a match. Default value is an empty set.
- exclSet
User-defined collection of exceptions, i.e. the words which should not be marked as stopwords during processing. These words will be stemmed by the given stemmer before attempting to find a match. Default value is an empty set.
- stemmer
English stemmer implementation. Default value is an instance of NCEnStemmer.
- Source:
- NCEnStopWordsTokenEnricher.scala
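How addSet and exclSet interact with stemming can be shown with a rough standalone sketch. Everything here is hypothetical and independent of NLPCraft: the toy stemmer, the three-word base list, and the choice to let exclusions take precedence over additions are assumptions made for illustration, and the procedural logic based on lemma and pos is left out entirely.

```scala
object StopwordSketch {
  // Toy stand-in for a real stemmer (assumption): lowercase, drop a trailing "s".
  val toyStem: String => String = w => w.toLowerCase.stripSuffix("s")

  // Tiny illustrative base list; the real enricher uses its own internal list.
  private val base = Set("the", "a", "of")

  def isStopword(
    word: String,
    addSet: Set[String] = Set.empty,
    exclSet: Set[String] = Set.empty,
    stem: String => String = toyStem
  ): Boolean = {
    val s = stem(word)
    val add = addSet.map(stem)   // user additions are stemmed before matching
    val excl = exclSet.map(stem) // user exceptions are stemmed before matching
    // Assumption: an exception wins over both the base list and the additions.
    !excl.contains(s) && (add.contains(s) || base.map(stem).contains(s))
  }
}
```

For example, isStopword("Cats", addSet = Set("cat")) is true because both sides reduce to the same stem, while isStopword("the", exclSet = Set("The")) is false.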
OpenNLP-based language-independent token enricher. This enricher adds the lemma and pos (part-of-speech) string metadata properties to the token instance. Learn more about lemmas here and about part-of-speech here.
This OpenNLP enricher requires PoS and lemma models. Some free, community-maintained OpenNLP models can be found here. Note that at least one of the models must be defined.
- Value parameters:
- lemmaDicRes
Relative path, absolute path, classpath resource or URL to the DictionaryLemmatizer model. Can be null if the lemmatizer model is not configured, in which case the lemma property will not be set. Note that at least one of the models must be provided.
- posMdlRes
Relative path, absolute path, classpath resource or URL to the POSTaggerME model. Can be null if the part-of-speech model is not configured, in which case the pos property will not be set. Note that at least one of the models must be provided.
- Source:
- NCOpenNLPTokenEnricher.scala
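The relationship between the two model parameters and the resulting properties can be summarized in a small standalone sketch. It loads no OpenNLP models and is not the enricher's code: OpenNlpConfigSketch and enabledProps are hypothetical names, and throwing IllegalArgumentException when both resources are null is an assumption about how the "at least one model" rule might be enforced.

```scala
object OpenNlpConfigSketch {
  /** Returns the names of the metadata properties that would be set for a given configuration. */
  def enabledProps(posMdlRes: String, lemmaDicRes: String): Set[String] = {
    // At least one of the two model resources must be provided.
    require(posMdlRes != null || lemmaDicRes != null, "At least one model must be provided.")
    Set(
      Option(posMdlRes).map(_ => "pos"),     // null PoS model: pos is not set
      Option(lemmaDicRes).map(_ => "lemma")  // null lemmatizer model: lemma is not set
    ).flatten
  }
}
```

With a hypothetical resource name such as "en-pos.bin" and a null lemmatizer resource, only pos is reported; passing null for both fails the require check.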
Quotes token enricher.
This enricher adds the quoted boolean metadata property to the token instance, indicating whether the word it represents is in quotes. The value true of the metadata property indicates that this word is in quotes; false indicates otherwise.
Supported quotes are: «, », ", ', `.
NOTE: invalid enclosed quotes are ignored.
- Source:
- NCQuotesTokenEnricher.scala
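One way to picture the quoting rule is the standalone sketch below. It is not the enricher's implementation: QuotesSketch and mark are hypothetical names, tokens are plain strings, and two things are assumptions made for the example: quote tokens are paired strictly in order of appearance, and an odd number of quote tokens counts as invalid, leaving every token unquoted.

```scala
object QuotesSketch {
  private val quoteChars = Set("«", "»", "\"", "'", "`")

  /** Returns one flag per token: true if the token lies strictly between a pair of quotes. */
  def mark(tokens: List[String]): List[Boolean] = {
    val quoteIdxs = tokens.zipWithIndex.collect { case (t, i) if quoteChars.contains(t) => i }
    if (quoteIdxs.size % 2 != 0) tokens.map(_ => false) // unbalanced: ignore all quotes
    else {
      // Pair quote positions in order: (1st, 2nd), (3rd, 4th), ...
      val ranges = quoteIdxs.grouped(2).map(p => (p(0), p(1))).toList
      tokens.indices.map(i => ranges.exists { case (from, to) => from < i && i < to }).toList
    }
  }
}
```

For the tokens say, ", hello, ", now, only hello is flagged; the quote tokens themselves are not.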
Swear-word token enricher.
This enricher adds the swear boolean metadata property to the token instance, indicating whether the word it represents is in the swear-word dictionary, i.e. whether the dictionary contains this word's stem. The value true of the metadata property indicates that this word's stem is found in the dictionary; false indicates otherwise.
- Value parameters:
- dictRes
Relative path, absolute path, classpath resource or URL to the dictionary. The dictionary should be in a simple plain-text format with one lemma per line. Empty lines are skipped, duplicates are ignored, and lines starting with the # symbol are treated as comments and ignored. Note that the dictionary search uses the word's stem and is case-insensitive.
- stemmer
Stemmer implementation for the language used in the supplied swear-word dictionary.
- Source:
- NCSwearWordsTokenEnricher.scala
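Stem-based matching, as opposed to the lemma-based matching of the dictionary enricher, can be sketched like this. The code is standalone and hypothetical: SwearSketch, isSwear and the toy stemmer are not part of NLPCraft, stemming the dictionary entries as well as the word is an assumption, and mild placeholder words stand in for a real swear-word dictionary.

```scala
object SwearSketch {
  // Toy stemmer (assumption): lowercase, then drop a trailing "ed" or "s".
  def toyStem(w: String): String = {
    val l = w.toLowerCase
    if (l.endsWith("ed")) l.dropRight(2) else l.stripSuffix("s")
  }

  /** True if the stem of the word matches the stem of any dictionary entry. */
  def isSwear(word: String, dictEntries: Set[String], stem: String => String = toyStem): Boolean =
    dictEntries.map(stem).contains(stem(word))
}
```

Because both sides are stemmed, isSwear("Darned", Set("darn")) is true even though the inflected form is not listed in the dictionary.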