Package pal.io

Class NexusTokenizer


  • public final class NexusTokenizer
    extends java.lang.Object

    Comments

    A simple token pull-parser for the NEXUS file format as specified in:

    Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). NEXUS: An extensible file format for systematic information. Systematic Biology, 46(4), pp. 590-621.

    The parser is designed to break a NEXUS file into tokens which are read individually. Tokens come in four different types:

    • Punctuation: any of the punctuation characters (see constants)
    • Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
    • Word: any string of characters delimited by whitespace or punctuation
    • Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user specified new line character
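
The four token types above can be illustrated with a small standalone sketch. It is not the PAL implementation: the class `TokenKindSketch`, its `Kind` enum and the punctuation set are assumptions made for illustration (the real punctuation characters are defined by the class constants).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of pal.io.
class TokenKindSketch {
    enum Kind { PUNCTUATION, WHITESPACE, WORD, NEWLINE }

    // Assumed punctuation set; the real set is defined by the NexusTokenizer constants.
    static final String PUNCTUATION = "()[]{}/\\,;:=*'\"`+-<>";

    // Classifies a single character into one of the four token types.
    static Kind kindOf(char c) {
        if (c == '\r' || c == '\n') return Kind.NEWLINE;
        if (c == ' ' || c == '\t') return Kind.WHITESPACE;
        if (PUNCTUATION.indexOf(c) >= 0) return Kind.PUNCTUATION;
        return Kind.WORD;
    }

    // Splits the input into tokens: runs of word or whitespace characters are
    // grouped, each punctuation character is its own token, and "\r\n" is
    // returned as a single newline token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            Kind kind = kindOf(c);
            int end = i + 1;
            if (kind == Kind.NEWLINE) {
                if (c == '\r' && end < input.length() && input.charAt(end) == '\n') end++;
            } else if (kind != Kind.PUNCTUATION) {
                while (end < input.length() && kindOf(input.charAt(end)) == kind) end++;
            }
            tokens.add(input.substring(i, end));
            i = end;
        }
        return tokens;
    }
}
```

Unlike the parser described here, this sketch always returns whitespace tokens; a caller that does not want them would filter on `kindOf`.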

    The parser has a set of options allowing tokens to be modified before they are returned (such as case modification or newline substitution).

    Each read moves the parser forward in the stream. At present there is no support for unreading tokens or for moving bi-directionally through the stream.

    NB: in this implementation, the token #NEXUS is treated specially: the parser returns it as a single token, '#NEXUS', rather than as two tokens, '#' and 'NEXUS'. Because it has special meaning, it has its own token type.

    Usage

    NexusTokenizer ntp = new NexusTokenizer(new PushbackReader(new FileReader("afile")));
    ntp.setReadWhiteSpace(false);                             // ignore whitespace
    ntp.setIgnoreComments(true);                              // ignore comments
    ntp.setWordModification(NexusTokenizer.WORD_UPPERCASE);   // all tokens in uppercase
    String nToken = ntp.readToken();

    while (nToken != null) {
        System.out.println("Token: " + nToken);
        System.out.println("Col: " + ntp.getCol());
        System.out.println("Row: " + ntp.getRow());
        nToken = ntp.readToken();   // read the next token; without this the loop never ends
    }
    • Constructor Summary

      Constructors 
      Constructor Description
      NexusTokenizer​(java.io.PushbackReader pr)
      Constructs a NexusTokenizer that reads from the given PushbackReader.
      NexusTokenizer​(java.lang.String file)
      Constructs a NexusTokenizer that reads from the named NEXUS file.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      boolean convertNewLine()
      Gets the flag indicating whether this parser instance should convert newline characters.
      int getCol()
      Gets the current column position of the cursor.
      java.lang.String getLastReadToken()
      Returns the last read token.
      int getLastTokenType()
      Determine the type of the last read token.
      int getRow()
      Gets the current row position of the cursor.
      int getWordModification()
      Gets the word modification flag currently in use
      java.lang.String readToken()
      Reads a token in from the underlying stream.
      boolean readWhiteSpace()
      Get the flag indicating whether or not this parser object is reading (and returning) whitespace
      java.lang.String seek​(int tokenType)
      Seeks through the stream to find the next token of the specified type.
      java.lang.String seek​(java.lang.String token)
      Seeks through the stream to find the token argument.
      void setConvertNewLine​(boolean b)
      Sets the convertNL flag.
      void setIgnoreComments​(boolean b)
      Sets the ignoreComments flag.
      void setNewLineChar​(char nl)
      Sets the character that newline characters are converted into.
      void setReadWhiteSpace​(boolean b)
      Sets the readWS flag.
      void setWordModification​(int flag)
      Sets the flag value for word modification.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • NexusTokenizer

        public NexusTokenizer​(java.lang.String file)
                       throws java.io.IOException
        Constructs a NexusTokenizer that reads from the named NEXUS file.
        Parameters:
        file - File name for the NEXUS file
        Throws:
        java.io.IOException - I/O errors
      • NexusTokenizer

        public NexusTokenizer​(java.io.PushbackReader pr)
                       throws java.io.IOException
        Constructs a NexusTokenizer that reads from the given PushbackReader.
        Parameters:
        pr - PushbackReader
        Throws:
        java.io.IOException - I/O errors
    • Method Detail

      • readWhiteSpace

        public boolean readWhiteSpace()
        Get the flag indicating whether or not this parser object is reading (and returning) whitespace
        Returns:
        returns the readWS flag
      • convertNewLine

        public boolean convertNewLine()
        Gets the flag indicating whether this parser instance should convert newline characters. As the specification says (see the reference in the class description above), a newline may be '\r', '\n' or '\r\n'. For uniformity, the parser can convert all of these into a single specified character. This feature is off by default.
        Returns:
        returns the convertNL flag
      • setReadWhiteSpace

        public void setReadWhiteSpace​(boolean b)
        Sets the readWS flag. True means that the parser will return whitespace characters as a token (where whitespace = ' ' or '\t').
        Parameters:
        b - flag value for readWS
      • setConvertNewLine

        public void setConvertNewLine​(boolean b)
        Sets the convertNL flag. True means that the parser will convert newline characters ('\r', '\n' or '\r\n') into either the default ('\n', if setNewLineChar() has not been called) or a user-specified newline character.
        Parameters:
        b - flag value for convertNL
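
The newline conversion described above can be sketched with a `java.io.PushbackReader`, the same reader type the constructor accepts. This is only an illustration of the idea, not the PAL source: `NewlineSketch`, `readNewline` and `normalize` are hypothetical names.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch; not part of pal.io.
class NewlineSketch {
    // Consumes one newline sequence ('\r', '\n' or '\r\n') at the current
    // position and returns the replacement character, or -1 if the next
    // character is not a newline (in which case nothing is consumed).
    static int readNewline(PushbackReader in, char replacement) throws IOException {
        int c = in.read();
        if (c == '\n') return replacement;
        if (c == '\r') {
            int next = in.read();
            if (next != '\n' && next != -1) in.unread(next); // lone '\r'
            return replacement;
        }
        if (c != -1) in.unread(c); // not a newline, push it back
        return -1;
    }

    // Replaces every newline sequence in the text with the replacement character.
    static String normalize(String text, char replacement) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            StringBuilder out = new StringBuilder();
            while (true) {
                int nl = readNewline(in, replacement);
                if (nl != -1) {
                    out.append((char) nl);
                    continue;
                }
                int c = in.read();
                if (c == -1) break;
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A single character of pushback is enough here, because the sketch never needs to look more than one character ahead.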
      • setIgnoreComments

        public void setIgnoreComments​(boolean b)
        Sets the ignoreComments flag. True means that the tokenizer will ignore comments (i.e. sections of a NEXUS file delimited by '[...]'). When set to true, the tokenizer returns the first token available after a comment.
        Parameters:
        b - flag value for ignoreComments
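
As a rough illustration of comment skipping, the sketch below removes '[...]' sections from a string. The class name `CommentSkipSketch` is hypothetical, and tracking nested brackets with a depth counter is an assumption; the text above does not say how the tokenizer treats nested comments.

```java
// Hypothetical sketch; not part of pal.io.
class CommentSkipSketch {
    // Returns the text with every '[...]' section removed.
    // Nested brackets are tracked with a depth counter (an assumption).
    static String stripComments(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '[') {
                depth++;               // entering a comment
            } else if (c == ']' && depth > 0) {
                depth--;               // leaving a comment
            } else if (depth == 0) {
                out.append(c);         // outside any comment: keep the character
            }
        }
        return out.toString();
    }
}
```

The whitespace on either side of a removed comment is kept, so a tokenizer running over the result would still see the surrounding words as separate tokens.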
      • setNewLineChar

        public void setNewLineChar​(char nl)
        Sets the character that newline characters are converted into.
        Parameters:
        nl - Replacement newline character
      • getCol

        public int getCol()
        Gets the current column position of the cursor. Changed after each read.
        Returns:
        Column number (zero indexed)
      • getRow

        public int getRow()
        Gets the current row position of the cursor. Changed after each read.
        Returns:
        Row number (zero indexed)
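
One way to picture the zero-indexed row and column values is a cursor that advances over each token as it is read. The sketch below is an assumption about the bookkeeping, not the PAL source; `CursorSketch` and `advance` are hypothetical names.

```java
// Hypothetical sketch; not part of pal.io.
class CursorSketch {
    private int row = 0; // zero-indexed line number
    private int col = 0; // zero-indexed column within the current line

    // Advances the cursor over one token that has just been read.
    void advance(String token) {
        if (token.equals("\n") || token.equals("\r") || token.equals("\r\n")) {
            row++;      // a newline token starts a new row
            col = 0;
        } else {
            col += token.length();
        }
    }

    int getRow() { return row; }

    int getCol() { return col; }
}
```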
      • getWordModification

        public int getWordModification()
        Gets the word modification flag currently in use
        Returns:
        Flag value for word modification
      • setWordModification

        public void setWordModification​(int flag)
        Sets the flag value for word modification. The token case can be changed to lowercase or uppercase once it has been read from the stream (depending on the flag). WORD_UNMODIFIED indicates that tokens should be returned in the case in which they were read from the stream. This value can be set at any time between token reads; the next token read is altered according to the value in effect. The default is WORD_UNMODIFIED.
        Parameters:
        flag - Flag value, one of WORD_LOWERCASE, WORD_UPPERCASE or WORD_UNMODIFIED
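
The effect of the three flags can be sketched as a simple switch. The numeric constant values below are placeholders chosen for illustration; the real values are those of the constants on NexusTokenizer, and `WordModificationSketch` is a hypothetical class.

```java
import java.util.Locale;

// Hypothetical sketch; not part of pal.io.
class WordModificationSketch {
    // Placeholder values; the real constants are defined on NexusTokenizer.
    static final int WORD_UNMODIFIED = 0;
    static final int WORD_UPPERCASE = 1;
    static final int WORD_LOWERCASE = 2;

    // Applies the case modification selected by the flag to a word token.
    static String apply(String word, int flag) {
        switch (flag) {
            case WORD_UPPERCASE:
                return word.toUpperCase(Locale.ROOT);
            case WORD_LOWERCASE:
                return word.toLowerCase(Locale.ROOT);
            default:
                return word; // WORD_UNMODIFIED: return the token as read
        }
    }
}
```

`Locale.ROOT` is used so that the result does not depend on the default locale of the machine running the code.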
      • readToken

        public java.lang.String readToken()
                                   throws java.io.IOException,
                                          NexusParseException
        Reads a token in from the underlying stream. Tokens are individual chunks read from the underlying stream. Each token is one of the four basic types:
        • Word: any string of characters delimited by whitespace or punctuation
        • Punctuation: any of the punctuation characters (see constants)
        • Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
        • Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user specified new line character
        Returns:
        returns a String token or null if EOF is reached (i.e. no more tokens to read)
        Throws:
        java.io.IOException - I/O errors
        NexusParseException - Parsing errors
      • getLastTokenType

        public int getLastTokenType()
        Determine the type of the last read token. After readToken() has been called, the type of token returned can be determined by calling getLastTokenType(). This returns one of six different constants:
        • UNDEFINED_TOKEN : default before anything is read from the stream
        • WORD_TOKEN : word token was read
        • PUNCTUATION_TOKEN : punctuation token was read
        • NEWLINE_TOKEN : newline token was read
        • WHITESPACE_TOKEN : whitespace token was read (never returned unless whitespace is being returned)
        • HEADER_TOKEN : last token was the special word #NEXUS
        Returns:
        Type of the last token read.
      • seek

        public java.lang.String seek​(int tokenType)
                              throws java.io.IOException,
                                     NexusParseException
        Seeks through the stream to find the next token of the specified type. The type value can be one of:
        • WORD_TOKEN
        • PUNCTUATION_TOKEN
        • NEWLINE_TOKEN
        • WHITESPACE_TOKEN
        • HEADER_TOKEN
        Returns:
        returns a String token or null if EOF is reached (i.e. no more tokens to read)
        Throws:
        java.io.IOException - I/O errors
        NexusParseException - Thrown by parsing errors or if tokenType == WHITESPACE_TOKEN && readWhiteSpace() == false
      • seek

        public java.lang.String seek​(java.lang.String token)
                              throws java.io.IOException,
                                     NexusParseException
        Seeks through the stream to find the token argument.
        Returns:
        returns a String token or null if token is not found (i.e. EOF is reached)
        Throws:
        java.io.IOException - I/O errors
        NexusParseException - Thrown by parsing errors or if token is whitespace && readWhiteSpace() == false
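
Conceptually, seeking for a token amounts to reading forward until a match is found or the input runs out. The sketch below shows that loop over a plain `Iterator<String>` standing in for the token stream; `SeekSketch` is a hypothetical class and the real method may differ in detail (for example in how it compares tokens).

```java
import java.util.Iterator;

// Hypothetical sketch; not part of pal.io.
class SeekSketch {
    // Pulls tokens from the source until one equals the target.
    // Returns the matching token, or null if the source is exhausted first.
    static String seek(Iterator<String> source, String target) {
        while (source.hasNext()) {
            String token = source.next();
            if (token.equals(target)) {
                return token;
            }
        }
        return null; // end of input reached without a match
    }
}
```

Because the source only moves forward, every token skipped during a seek is consumed and cannot be read again, which matches the forward-only behaviour described in the class comments.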
      • getLastReadToken

        public java.lang.String getLastReadToken()
        Returns the last read token. Each call to readToken() stores the returned token so that it can be retrieved again. Each subsequent call to readToken() overwrites the stored token with the new one.
        Returns:
        return the last read token