Package weka.core.tokenizers
Class NGramTokenizer
- java.lang.Object
  - weka.core.tokenizers.Tokenizer
    - weka.core.tokenizers.CharacterDelimitedTokenizer
      - weka.core.tokenizers.NGramTokenizer
- All Implemented Interfaces:
- java.io.Serializable, java.util.Enumeration, OptionHandler, RevisionHandler
public class NGramTokenizer extends CharacterDelimitedTokenizer
Splits a string into n-grams, with configurable minimum and maximum n-gram sizes. Valid options are:
-delimiters <value> The delimiters to use (default ' \r\n\t.,;:'"()?!').
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
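The splitting described above can be illustrated with a minimal standalone sketch. It does not call Weka: the class NGramSketch and its ngrams method are hypothetical names. Joining the words of an n-gram with a single space and emitting the largest n-grams first are assumptions, chosen to agree with the nextElement description in the Method Detail section, not confirmed details of the implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration only; not part of weka.core.tokenizers.
final class NGramSketch {

    // Splits text on any character in delimiters, then builds word n-grams
    // for every size from max down to min (assumed ordering).
    static List<String> ngrams(String text, String delimiters, int min, int max) {
        if (min < 1 || max < min) {
            throw new IllegalArgumentException("require 1 <= min <= max");
        }
        // StringTokenizer treats each character of delimiters as a separator
        // and never returns empty tokens.
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }
        List<String> result = new ArrayList<>();
        for (int n = max; n >= min; n--) {
            // Sizes larger than the number of words simply produce nothing.
            for (int start = 0; start + n <= words.size(); start++) {
                // Assumption: the words of one n-gram are joined by a single space.
                result.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [the quick, quick fox, the, quick, fox]
        System.out.println(ngrams("the quick fox", " ", 1, 2));
    }
}
```

With the documented defaults (-min 1, -max 3), a two-word input yields no 3-grams in this sketch, only the single 2-gram and the two 1-grams.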
- Version:
- $Revision: 1.4 $
- Author:
- Sebastian Germesin (sebastian.germesin@dfki.de), FracPete (fracpete at waikato dot ac dot nz)
- See Also:
- Serialized Form
-
Constructor Summary
Constructor         Description
NGramTokenizer()
-
Method Summary
Modifier and Type       Method                                    Description
int                     getNGramMaxSize()                         Gets the max N of the NGram.
int                     getNGramMinSize()                         Gets the min N of the NGram.
java.lang.String[]      getOptions()                              Gets the current option settings for the OptionHandler.
java.lang.String        getRevision()                             Returns the revision string.
java.lang.String        globalInfo()                              Returns a string describing the tokenizer.
boolean                 hasMoreElements()                         Returns true if there are more elements available.
java.util.Enumeration   listOptions()                             Returns an enumeration of all the available options.
static void             main(java.lang.String[] args)             Runs the tokenizer with the given options and strings to tokenize.
java.lang.Object        nextElement()                             Returns N-grams and also (N-1)-grams and so on, down to 1-grams.
java.lang.String        NGramMaxSizeTipText()                     Returns the tip text for this property.
java.lang.String        NGramMinSizeTipText()                     Returns the tip text for this property.
void                    setNGramMaxSize(int value)                Sets the max size of the Ngram.
void                    setNGramMinSize(int value)                Sets the min size of the Ngram.
void                    setOptions(java.lang.String[] options)    Parses a given list of options.
void                    tokenize(java.lang.String s)              Sets the string to tokenize.
Methods inherited from class weka.core.tokenizers.CharacterDelimitedTokenizer
delimitersTipText, getDelimiters, setDelimiters
-
Methods inherited from class weka.core.tokenizers.Tokenizer
runTokenizer, tokenize
-
Method Detail
-
globalInfo
public java.lang.String globalInfo()
Returns a string describing the tokenizer.
- Specified by:
- globalInfo in class Tokenizer
- Returns:
- a description suitable for displaying in the explorer/experimenter gui
-
listOptions
public java.util.Enumeration listOptions()
Returns an enumeration of all the available options.
- Specified by:
- listOptions in interface OptionHandler
- Overrides:
- listOptions in class CharacterDelimitedTokenizer
- Returns:
- an enumeration of all available options.
-
getOptions
public java.lang.String[] getOptions()
Gets the current option settings for the OptionHandler.
- Specified by:
- getOptions in interface OptionHandler
- Overrides:
- getOptions in class CharacterDelimitedTokenizer
- Returns:
- the list of current option settings as an array of strings
-
setOptions
public void setOptions(java.lang.String[] options) throws java.lang.Exception
Parses a given list of options. Valid options are:
-delimiters <value> The delimiters to use (default ' \r\n\t.,;:'"()?!').
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
- Specified by:
- setOptions in interface OptionHandler
- Overrides:
- setOptions in class CharacterDelimitedTokenizer
- Parameters:
- options - the list of options as an array of strings
- Throws:
- java.lang.Exception - if an option is not supported
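As a rough sketch of what an options array for setOptions might look like, the snippet below builds one from the three documented flags. The helper class NGramOptionsSketch and its buildOptions method are hypothetical and do not exist in Weka; the block compiles without Weka, and the call to setOptions appears only as a comment.

```java
import java.util.Arrays;

// Hypothetical helper; only the flag names come from the option list above.
final class NGramOptionsSketch {

    // Each flag is followed by its value as a separate array element,
    // mirroring how command-line arguments arrive in main(String[] args).
    static String[] buildOptions(int min, int max, String delimiters) {
        return new String[] {
            "-min", Integer.toString(min),
            "-max", Integer.toString(max),
            "-delimiters", delimiters
        };
    }

    public static void main(String[] args) {
        String[] options = buildOptions(1, 2, " ");
        System.out.println(Arrays.toString(options));
        // With Weka on the classpath, the array could then be handed over:
        //   NGramTokenizer tokenizer = new NGramTokenizer();
        //   tokenizer.setOptions(options);   // declared to throw java.lang.Exception
    }
}
```

Because setOptions is declared with throws java.lang.Exception, a caller has to catch or propagate that exception.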
-
getNGramMaxSize
public int getNGramMaxSize()
Gets the max N of the NGram.
- Returns:
- the size (N) of the NGram.
-
setNGramMaxSize
public void setNGramMaxSize(int value)
Sets the max size of the Ngram.
- Parameters:
- value - the size of the NGram.
-
NGramMaxSizeTipText
public java.lang.String NGramMaxSizeTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setNGramMinSize
public void setNGramMinSize(int value)
Sets the min size of the Ngram.
- Parameters:
- value - the size of the NGram.
-
getNGramMinSize
public int getNGramMinSize()
Gets the min N of the NGram.
- Returns:
- the size (N) of the NGram.
-
NGramMinSizeTipText
public java.lang.String NGramMinSizeTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
hasMoreElements
public boolean hasMoreElements()
Returns true if there are more elements available.
- Specified by:
- hasMoreElements in interface java.util.Enumeration
- Specified by:
- hasMoreElements in class Tokenizer
- Returns:
- true if there are more elements available
-
nextElement
public java.lang.Object nextElement()
Returns N-grams and also (N-1)-grams and so on, down to 1-grams.
- Specified by:
- nextElement in interface java.util.Enumeration
- Specified by:
- nextElement in class Tokenizer
- Returns:
- the next element
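Since hasMoreElements and nextElement follow the java.util.Enumeration contract, tokens can be consumed with the usual enumeration loop. The sketch below shows that loop against a stand-in enumeration built with Collections.enumeration, so it runs without Weka; the class EnumerationDrainSketch and its drain method are hypothetical names, and the stand-in token values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper showing the hasMoreElements/nextElement loop.
final class EnumerationDrainSketch {

    // Collects every remaining element as a String. nextElement() is
    // documented to return java.lang.Object, hence String.valueOf.
    static List<String> drain(Enumeration<?> tokens) {
        List<String> out = new ArrayList<>();
        while (tokens.hasMoreElements()) {
            out.add(String.valueOf(tokens.nextElement()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for a tokenizer on which tokenize(s) has already been called.
        Enumeration<String> standIn =
            Collections.enumeration(List.of("a b", "a", "b"));
        // prints [a b, a, b]
        System.out.println(drain(standIn));
    }
}
```

An NGramTokenizer on which tokenize has been called should be usable in place of the stand-in, since the class is listed as implementing java.util.Enumeration.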
-
tokenize
public void tokenize(java.lang.String s)
Sets the string to tokenize. Tokenization happens immediately.
-
getRevision
public java.lang.String getRevision()
Returns the revision string.
- Returns:
- the revision
-
main
public static void main(java.lang.String[] args)
Runs the tokenizer with the given options and strings to tokenize. The tokens are printed to stdout.
- Parameters:
- args - the command-line options and strings to tokenize
-