Class NexusTokenizer
- java.lang.Object
-
- pal.io.NexusTokenizer
-
public final class NexusTokenizer extends java.lang.Object
Comments
A simple token pull-parser for the NEXUS file format as specified in:
Maddison, D. R., Swofford, D. L., & Maddison, W. P., Systematic Biology, 46(4), pp. 590 - 621.
The parser is designed to break a NEXUS file into tokens which are read individually. Tokens come in four different types:
- Punctuation: any of the punctuation characters (see constants)
- Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
- Word: any string of characters delimited by whitespace or punctuation
- Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user specified new line character
The parser has a set of options allowing tokens to be modified before they are returned (such as case modification or newline substitution).
Each read by the parser moves forward in the stream. At present there is no support for unreading tokens or for moving bi-directionally through the stream.
NB: in this implementation, the token #NEXUS is considered special. When the parser reads it, it returns one token, '#NEXUS', not two ('#' and 'NEXUS'). This special meaning is reflected in the token having its own token type (HEADER_TOKEN).
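The token categories described above can be illustrated with a small stand-alone sketch. This is not PAL's implementation: the class `TokenKindSketch`, its `tokenize` method, the string labels it emits, and the exact punctuation set (inferred from the names of the char constants) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in, not PAL's code: splits text into the token kinds
// described above and treats "#NEXUS" as a single header token.
class TokenKindSketch {
    // Assumed punctuation set, inferred from the names of the char constants.
    static final String PUNCTUATION = "()[]{}/\\,;:=*'\"`+-<>#.";

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (text.startsWith("#NEXUS", i)) {
                // The header is returned whole, not as '#' followed by 'NEXUS'.
                out.add("HEADER:#NEXUS");
                i += 6;
            } else if (c == '\r' || c == '\n') {
                // "\r\n" counts as one newline token.
                boolean crlf = c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n';
                out.add("NEWLINE");
                i += crlf ? 2 : 1;
            } else if (c == ' ' || c == '\t') {
                int start = i;
                while (i < text.length() && (text.charAt(i) == ' ' || text.charAt(i) == '\t')) i++;
                out.add("WHITESPACE:" + (i - start));
            } else if (PUNCTUATION.indexOf(c) >= 0) {
                out.add("PUNCTUATION:" + c);
                i++;
            } else {
                // A word runs until the next whitespace, newline or punctuation.
                int start = i;
                while (i < text.length() && !isDelimiter(text.charAt(i))) i++;
                out.add("WORD:" + text.substring(start, i));
            }
        }
        return out;
    }

    static boolean isDelimiter(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n' || PUNCTUATION.indexOf(c) >= 0;
    }
}
```

Calling `TokenKindSketch.tokenize("#NEXUS\r\nbegin  taxa;")` yields a header entry, a newline entry, two words separated by a whitespace run, and a final punctuation entry.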
Usage
NexusTokenizer ntp = new NexusTokenizer(new PushbackReader(new FileReader("afile")));
ntp.setReadWhiteSpace(false); // ignore whitespace
ntp.setIgnoreComments(true); // ignore comments
ntp.setWordModification(NexusTokenizer.WORD_UPPERCASE); // all tokens in uppercase
String nToken = ntp.readToken();
while(nToken != null) {
System.out.println("Token: " + nToken);
System.out.println("Col: " + ntp.getCol());
System.out.println("Row: " + ntp.getRow());
nToken = ntp.readToken(); // advance to the next token; null at EOF ends the loop
}
- Version:
- $Id$, $Name$
- Author:
- $Author$
-
-
Field Summary
Fields
Modifier and Type  Field              Description
static char        ADDITION
static char        ASTERIX
static char        B_SLASH
static char        B_TICK
static char        C_RETURN
static char        COLON
static char        COMMA
static char        D_QUOTE
static char        DASH
static char        EQUALS
static char        F_SLASH
static char        G_THAN
static char        HASH
static int         HEADER_TOKEN       Flag indicating last token read was the header token #NEXUS
static char        L_BRACE
static char        L_BRACKET
static char        L_FEED
static char        L_PARENTHESIS
static char        L_THAN
static int         NEWLINE_TOKEN      Flag indicating last token read was a newline symbol/word
static char        PERIOD
static int         PUNCTUATION_TOKEN  Flag indicating last token read was a punctuation symbol
static char        R_BRACE
static char        R_BRACKET
static char        R_PARENTHESIS
static char        S_QUOTE
static char        SEMI_COLON
static char        SPACE
static char        TAB
static int         UNDEFINED_TOKEN    Flag indicating last token read was undefined
static int         WHITESPACE_TOKEN   Flag indicating last token read was whitespace
static int         WORD_LOWERCASE     Flag indicating words should be converted to lowercase
static int         WORD_TOKEN         Flag indicating last token read was a word
static int         WORD_UNMODIFIED    Flag indicating words should be untouched
static int         WORD_UPPERCASE     Flag indicating words should be converted to uppercase
-
Constructor Summary
Constructors
Constructor                                Description
NexusTokenizer(java.io.PushbackReader pr)  Constructor for a NexusTokenizer
NexusTokenizer(java.lang.String file)      Constructor for a NexusTokenizer
-
Method Summary
Modifier and Type  Method                         Description
boolean            convertNewLine()               Gets the flag indicating whether this parser instance should convert newline characters.
int                getCol()                       Gets the current column position of the cursor.
java.lang.String   getLastReadToken()             Returns the last read token.
int                getLastTokenType()             Determine the type of the last read token.
int                getRow()                       Gets the current row position of the cursor.
int                getWordModification()          Gets the word modification flag currently in use.
java.lang.String   readToken()                    Reads a token in from the underlying stream.
boolean            readWhiteSpace()               Get the flag indicating whether or not this parser object is reading (and returning) whitespace.
java.lang.String   seek(int tokenType)            Seeks through the stream to find the next token of the specified type.
java.lang.String   seek(java.lang.String token)   Seeks through the stream to find the token argument.
void               setConvertNewLine(boolean b)   Sets the convertNL flag.
void               setIgnoreComments(boolean b)   Sets the ignoreComments flag.
void               setNewLineChar(char nl)        Sets the character that newline characters are converted into.
void               setReadWhiteSpace(boolean b)   Sets the readWS flag.
void               setWordModification(int flag)  Sets the flag value for word modification.
-
-
-
Field Detail
-
L_PARENTHESIS
public static final char L_PARENTHESIS
- See Also:
- Constant Field Values
-
R_PARENTHESIS
public static final char R_PARENTHESIS
- See Also:
- Constant Field Values
-
L_BRACKET
public static final char L_BRACKET
- See Also:
- Constant Field Values
-
R_BRACKET
public static final char R_BRACKET
- See Also:
- Constant Field Values
-
L_BRACE
public static final char L_BRACE
- See Also:
- Constant Field Values
-
R_BRACE
public static final char R_BRACE
- See Also:
- Constant Field Values
-
F_SLASH
public static final char F_SLASH
- See Also:
- Constant Field Values
-
B_SLASH
public static final char B_SLASH
- See Also:
- Constant Field Values
-
COMMA
public static final char COMMA
- See Also:
- Constant Field Values
-
SEMI_COLON
public static final char SEMI_COLON
- See Also:
- Constant Field Values
-
COLON
public static final char COLON
- See Also:
- Constant Field Values
-
EQUALS
public static final char EQUALS
- See Also:
- Constant Field Values
-
ASTERIX
public static final char ASTERIX
- See Also:
- Constant Field Values
-
S_QUOTE
public static final char S_QUOTE
- See Also:
- Constant Field Values
-
D_QUOTE
public static final char D_QUOTE
- See Also:
- Constant Field Values
-
B_TICK
public static final char B_TICK
- See Also:
- Constant Field Values
-
ADDITION
public static final char ADDITION
- See Also:
- Constant Field Values
-
DASH
public static final char DASH
- See Also:
- Constant Field Values
-
L_THAN
public static final char L_THAN
- See Also:
- Constant Field Values
-
G_THAN
public static final char G_THAN
- See Also:
- Constant Field Values
-
HASH
public static final char HASH
- See Also:
- Constant Field Values
-
PERIOD
public static final char PERIOD
- See Also:
- Constant Field Values
-
L_FEED
public static final char L_FEED
- See Also:
- Constant Field Values
-
C_RETURN
public static final char C_RETURN
- See Also:
- Constant Field Values
-
TAB
public static final char TAB
- See Also:
- Constant Field Values
-
SPACE
public static final char SPACE
- See Also:
- Constant Field Values
-
WORD_UPPERCASE
public static final int WORD_UPPERCASE
Flag indicating words should be converted to uppercase
- See Also:
- Constant Field Values
-
WORD_LOWERCASE
public static final int WORD_LOWERCASE
Flag indicating words should be converted to lowercase
- See Also:
- Constant Field Values
-
WORD_UNMODIFIED
public static final int WORD_UNMODIFIED
Flag indicating words should be untouched
- See Also:
- Constant Field Values
-
UNDEFINED_TOKEN
public static final int UNDEFINED_TOKEN
Flag indicating last token read was undefined
- See Also:
- Constant Field Values
-
WORD_TOKEN
public static final int WORD_TOKEN
Flag indicating last token read was a word
- See Also:
- Constant Field Values
-
PUNCTUATION_TOKEN
public static final int PUNCTUATION_TOKEN
Flag indicating last token read was a punctuation symbol
- See Also:
- Constant Field Values
-
NEWLINE_TOKEN
public static final int NEWLINE_TOKEN
Flag indicating last token read was a newline symbol/word
- See Also:
- Constant Field Values
-
WHITESPACE_TOKEN
public static final int WHITESPACE_TOKEN
Flag indicating last token read was whitespace
- See Also:
- Constant Field Values
-
HEADER_TOKEN
public static final int HEADER_TOKEN
Flag indicating last token read was the header token #NEXUS
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
NexusTokenizer
public NexusTokenizer(java.lang.String file) throws java.io.IOException
Constructor for a NexusTokenizer
- Parameters:
- file - File name for the NEXUS file
- Throws:
- java.io.IOException - I/O errors
-
NexusTokenizer
public NexusTokenizer(java.io.PushbackReader pr) throws java.io.IOException
Constructor for a NexusTokenizer
- Parameters:
- pr - PushbackReader
- Throws:
- java.io.IOException - I/O errors
-
-
Method Detail
-
readWhiteSpace
public boolean readWhiteSpace()
Get the flag indicating whether or not this parser object is reading (and returning) whitespace
- Returns:
- returns the readWS flag
-
convertNewLine
public boolean convertNewLine()
Gets the flag indicating whether this parser instance should convert newline characters. As the specification says (see the reference in the class description above), newline characters may be '\r', '\n' or '\r\n'. To provide some uniformity, the parser can convert these symbols into a single specified character. By default, this feature is off.
- Returns:
- returns the convertNL flag
-
setReadWhiteSpace
public void setReadWhiteSpace(boolean b)
Sets the readWS flag. True means that the parser will return whitespace characters as a token (where whitespace = ' ' or '\t').
- Parameters:
- b - flag value for readWS
-
setConvertNewLine
public void setConvertNewLine(boolean b)
Sets the convertNL flag. True means that the parser will convert newline characters ('\r', '\n' or '\r\n') into either the default ('\n' if setNewLineChar() is not called) or a user specified newline char
- Parameters:
- b - flag value for convertNL
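The effect of newline conversion can be sketched outside the parser. The helper below is a hypothetical illustration (the class `NewlineConversionSketch` and its `convert` method are not part of PAL), assuming that a "\r\n" pair is replaced by one character, as a single newline token would be.

```java
// Illustrative sketch, not PAL's code: maps "\r\n", a lone '\r' and a lone '\n'
// to one caller-chosen newline character.
class NewlineConversionSketch {
    static String convert(String text, char newLineChar) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Consume the '\n' of a "\r\n" pair so it is converted only once.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') i++;
                sb.append(newLineChar);
            } else if (c == '\n') {
                sb.append(newLineChar);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With '\n' as the replacement (the documented default), mixed line endings in the input all come out as '\n'.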
-
setIgnoreComments
public void setIgnoreComments(boolean b)
Sets the ignoreComments flag. True means that the tokenizer will ignore comments (i.e. sections of a NEXUS file delimited by '[...]'). When set to true, the tokenizer will return the first token available after a comment.
- Parameters:
- b - flag value for ignoreComments
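As a rough illustration of what ignoring comments means for the text that reaches the caller, the following sketch removes '[...]' sections from a string. It is not PAL's implementation: `CommentSkipSketch` and `stripComments` are hypothetical names, and the sketch assumes comments are not nested and that an unterminated comment runs to the end of the input.

```java
// Illustrative sketch, not PAL's code: drops every '[...]' section.
// Assumptions: no nested comments; an unclosed '[' swallows the rest of the input.
class CommentSkipSketch {
    static String stripComments(String text) {
        StringBuilder sb = new StringBuilder();
        boolean inComment = false;
        for (char c : text.toCharArray()) {
            if (inComment) {
                if (c == ']') inComment = false; // comment ends, resume copying
            } else if (c == '[') {
                inComment = true;                // comment starts, stop copying
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

A real tokenizer would skip the comment while reading rather than pre-processing a string; the string form is used here only to keep the example short.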
-
setNewLineChar
public void setNewLineChar(char nl)
Sets the character that newline characters are converted into
- Parameters:
- nl - Replacement newline character
-
getCol
public int getCol()
Gets the current column position of the cursor. Changed after each read.
- Returns:
- Column number (zero indexed)
-
getRow
public int getRow()
Gets the current row position of the cursor. Changed after each read.
- Returns:
- Row number (zero indexed)
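A zero-indexed row and column cursor of the kind described for getCol() and getRow() could be maintained as in the sketch below. This is only an assumption about the bookkeeping: the class `CursorSketch`, its fields and its `advance` method are hypothetical, and the sketch takes the cursor to sit just past the last consumed character, which the documentation does not state.

```java
// Illustrative sketch, not PAL's code: tracks a zero-indexed cursor that sits
// just past the last consumed character (an assumption).
class CursorSketch {
    int row = 0;
    int col = 0;

    // Moves the cursor over the characters of one consumed token.
    void advance(String consumed) {
        for (int i = 0; i < consumed.length(); i++) {
            char c = consumed.charAt(i);
            // Count a "\r\n" pair once, when its '\n' is reached.
            if (c == '\r' && i + 1 < consumed.length() && consumed.charAt(i + 1) == '\n') continue;
            if (c == '\r' || c == '\n') {
                row++;
                col = 0;
            } else {
                col++;
            }
        }
    }
}
```

The sketch does not handle a "\r\n" pair split across two calls to `advance`; a stream-based tokenizer would need to carry that state between reads.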
-
getWordModification
public int getWordModification()
Gets the word modification flag currently in use
- Returns:
- Flag value for word modification
-
setWordModification
public void setWordModification(int flag)
Sets the flag value for word modification. The token case can be changed to lowercase or uppercase once it has been read from the stream (depending on the set flag). WORD_UNMODIFIED indicates that the tokens should be returned in the case that they are read from the stream. This value can be set at any time between token reads, and the next token read will be altered depending on this value. The default is WORD_UNMODIFIED.
- Parameters:
- flag - Flag value, one of WORD_LOWERCASE, WORD_UPPERCASE or WORD_UNMODIFIED
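The case handling described for the word modification flag amounts to a three-way switch. The sketch below is hypothetical: `WordModificationSketch`, `apply` and the numeric flag values are invented for illustration, and PAL's actual constant values are not given in this document.

```java
import java.util.Locale;

// Illustrative sketch, not PAL's code. The flag values below are hypothetical.
class WordModificationSketch {
    static final int WORD_UNMODIFIED = 0;
    static final int WORD_UPPERCASE = 1;
    static final int WORD_LOWERCASE = 2;

    // Returns the word in the case selected by the flag.
    static String apply(String word, int flag) {
        switch (flag) {
            case WORD_UPPERCASE:
                return word.toUpperCase(Locale.ROOT);
            case WORD_LOWERCASE:
                return word.toLowerCase(Locale.ROOT);
            default:
                return word; // WORD_UNMODIFIED: keep the case as read
        }
    }
}
```

`Locale.ROOT` is used so that the result does not depend on the default locale of the machine; whether PAL does the same is not stated.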
-
readToken
public java.lang.String readToken() throws java.io.IOException, NexusParseException
Reads a token in from the underlying stream. Tokens are individual chunks read from the underlying stream. Each token is one of the four basic types:
- Word: any string of characters delimited by whitespace or punctuation
- Punctuation: any of the punctuation characters (see constants)
- Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
- Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user specified new line character
- Returns:
- returns a String token or null if EOF is reached (i.e. no more tokens to read)
- Throws:
- java.io.IOException - I/O errors
- NexusParseException - Parsing errors
-
getLastTokenType
public int getLastTokenType()
Determine the type of the last read token. After readToken() has been called, the type of token returned can be determined by calling getLastTokenType(). This returns one of six different constants:
- UNDEFINED_TOKEN: default before anything is read from the stream
- WORD_TOKEN: word token was read
- PUNCTUATION_TOKEN: punctuation token was read
- NEWLINE_TOKEN: newline token was read
- WHITESPACE_TOKEN: whitespace token was read (never returned unless whitespace is being returned)
- HEADER_TOKEN: last token was the special word #NEXUS
- Returns:
- Type of the last token read.
-
seek
public java.lang.String seek(int tokenType) throws java.io.IOException, NexusParseException
Seeks through the stream to find the next token of the specified type. The type value can be one of:
- WORD_TOKEN
- PUNCTUATION_TOKEN
- NEWLINE_TOKEN
- WHITESPACE_TOKEN
- HEADER_TOKEN
- Returns:
- returns a String token or null if EOF is reached (i.e. no more tokens to read)
- Throws:
- java.io.IOException - I/O errors
- NexusParseException - Thrown by parsing errors or if tokenType == WHITESPACE_TOKEN && readWhiteSpace() == false
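Seeking by type can be pictured as repeated forward reads that stop at the first match. The sketch below runs over an in-memory iterator instead of a stream. `SeekSketch`, its `Token` class and the type codes are hypothetical, and the assumption that skipped tokens are consumed follows from the forward-only reading described in the class comments rather than from the method documentation itself.

```java
import java.util.Iterator;

// Illustrative sketch, not PAL's code: forward-only seek over typed tokens.
class SeekSketch {
    // Hypothetical type codes; PAL's actual constant values are not given here.
    static final int WORD = 1;
    static final int PUNCTUATION = 2;

    static final class Token {
        final int type;
        final String text;

        Token(int type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Reads tokens until one of the wanted type appears; returns null at end of input.
    static String seek(Iterator<Token> tokens, int wantedType) {
        while (tokens.hasNext()) {
            Token t = tokens.next();
            if (t.type == wantedType) {
                return t.text;
            }
        }
        return null;
    }
}
```

Because the iterator only moves forward, tokens skipped during one seek are not available to a later one.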
-
seek
public java.lang.String seek(java.lang.String token) throws java.io.IOException, NexusParseException
Seeks through the stream to find the token argument.
- Returns:
- returns a String token or null if token is not found (i.e. EOF is reached)
- Throws:
- java.io.IOException - I/O errors
- NexusParseException - Thrown by parsing errors or if token is whitespace && readWhiteSpace() == false
-
getLastReadToken
public java.lang.String getLastReadToken()
Returns the last read token. Each call to readToken() stores the returned token so that it can be retrieved again. However, each subsequent readToken() call replaces the stored token with the new one.
- Returns:
- return the last read token
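The one-slot buffering behind getLastReadToken() can be sketched as follows. The class `LastTokenBufferSketch` is hypothetical and reads from an in-memory iterator. That the stored value is null before the first read and after end of input is an assumption, not something the documentation states.

```java
import java.util.Iterator;

// Illustrative sketch, not PAL's code: remembers only the most recent token.
class LastTokenBufferSketch {
    private final Iterator<String> source;
    private String last; // assumed to be null until the first read

    LastTokenBufferSketch(Iterator<String> source) {
        this.source = source;
    }

    // Reads the next token and overwrites the stored one.
    String readToken() {
        last = source.hasNext() ? source.next() : null;
        return last;
    }

    // Returns the stored token without consuming anything.
    String getLastReadToken() {
        return last;
    }
}
```

Repeated calls to `getLastReadToken()` return the same value until the next `readToken()` call.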
-
-