Class URI


  • public class URI
    extends Object
    Represents a URI (Uniform Resource Identifier) as defined by RFC 2396. This class supports parsing a URI reference so that it can be extended for specific protocols, taking into account the character encoding of the transporting protocol and the charset of the document.

    A URI is always in an "escaped" form, since escaping or unescaping a completed URI might change its semantics.

    Implementers should be careful not to escape or unescape the same string more than once, since unescaping an already unescaped string might lead to misinterpreting a percent data character as another escaped character, or vice versa in the case of escaping an already escaped string.

    In order to avoid these problems, data types are used as follows:

       URI character sequence: char
       octet sequence: byte
       original character sequence: String
     

    So, a URI is a sequence of characters, held as an array of char, which is not always represented as a sequence of octets, held as an array of byte.
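
    The double-unescaping hazard described above can be reproduced with a short sketch. It does not use this class: it assumes the JDK's java.net.URLDecoder as a stand-in unescaper (which additionally maps '+' to a space, irrelevant for this input), and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class DoubleUnescapeDemo {

    // Hypothetical helper: unescapes every "%XX" triplet exactly once (UTF-8).
    static String unescapeOnce(String escaped) {
        try {
            return URLDecoder.decode(escaped, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String escaped = "100%2541";          // escaped form of the data "100%41"
        String once = unescapeOnce(escaped);  // the intended data
        String twice = unescapeOnce(once);    // "%41" is now misread as an escape for 'A'
        System.out.println(once);
        System.out.println(twice);
    }
}
```

    Unescaping once yields 100%41; unescaping that result a second time yields 100A, which is no longer the original data.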

    URI Syntactic Components

     - In general, written as follows:
       Absolute URI = <scheme>:<scheme-specific-part>
       Generic URI = <scheme>://<authority><path>?<query>
     

     - Syntax
       absoluteURI   = scheme ":" ( hier_part | opaque_part )
       hier_part     = ( net_path | abs_path ) [ "?" query ]
       net_path      = "//" authority [ abs_path ]
       abs_path      = "/"  path_segments

    The following examples illustrate URIs that are in common use.

     ftp://ftp.is.co.za/rfc/rfc1808.txt
        -- ftp scheme for File Transfer Protocol services
     gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
        -- gopher scheme for Gopher and Gopher+ Protocol services
     http://www.math.uio.no/faq/compression-faq/part1.html
        -- http scheme for Hypertext Transfer Protocol services
     mailto:mduerst@ifi.unizh.ch
        -- mailto scheme for electronic mail addresses
     news:comp.infosystems.www.servers.unix
        -- news scheme for USENET news groups and articles
     telnet://melvyl.ucop.edu/
        -- telnet scheme for interactive services via the TELNET Protocol
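
    As a rough illustration of how these examples split into the syntactic components listed earlier, the sketch below uses the JDK's own java.net.URI (also an RFC 2396 parser) rather than the class documented here. The class and method names ComponentsDemo and split are hypothetical.

```java
import java.net.URI;

final class ComponentsDemo {

    // Returns { scheme, authority, escaped path, unescaped path } for a hierarchical URI.
    static String[] split(String text) {
        URI uri = URI.create(text);
        return new String[] {
            uri.getScheme(), uri.getAuthority(), uri.getRawPath(), uri.getPath()
        };
    }

    public static void main(String[] args) {
        // Generic URI = <scheme>://<authority><path>?<query>
        String gopher = "gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles";
        for (String part : split(gopher)) {
            System.out.println(part);
        }

        // Absolute URI = <scheme>:<scheme-specific-part> (opaque, no "//" authority)
        URI mailto = URI.create("mailto:mduerst@ifi.unizh.ch");
        System.out.println(mailto.isOpaque());
        System.out.println(mailto.getSchemeSpecificPart());
    }
}
```

    The escaped path keeps "Los%20Angeles" while the unescaped path reads "Los Angeles", which mirrors the escaped and unescaped forms this class distinguishes.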
     
    Note that this differs in many respects from URL (RFC 1738) and relative URL (RFC 1808).

    The expressions for a URI

     For escaped URI forms
      - URI(char[]) // constructor
      - char[] getRawXxx() // method
      - String getEscapedXxx() // method
      - String toString() // method
     

     For unescaped URI forms
      - URI(String) // constructor
      - String getXXX() // method

    • Field Detail

      • within_userinfo

        public static final BitSet within_userinfo
        BitSet for within the userinfo component like user and password.
      • control

        public static final BitSet control
        BitSet for control.
      • space

        public static final BitSet space
        BitSet for space.
      • delims

        public static final BitSet delims
        BitSet for delims.
      • unwise

        public static final BitSet unwise
        BitSet for unwise.
      • disallowed_rel_path

        public static final BitSet disallowed_rel_path
        Disallowed rel_path before escaping.
      • disallowed_opaque_part

        public static final BitSet disallowed_opaque_part
        Disallowed opaque_part before escaping.
      • allowed_authority

        public static final BitSet allowed_authority
        Those characters that are allowed for the authority component.
      • allowed_opaque_part

        public static final BitSet allowed_opaque_part
        Those characters that are allowed for the opaque_part.
      • allowed_reg_name

        public static final BitSet allowed_reg_name
        Those characters that are allowed for the reg_name.
      • allowed_userinfo

        public static final BitSet allowed_userinfo
        Those characters that are allowed for the userinfo component.
      • allowed_within_userinfo

        public static final BitSet allowed_within_userinfo
        Those characters that are allowed for within the userinfo component.
      • allowed_IPv6reference

        public static final BitSet allowed_IPv6reference
        Those characters that are allowed for the IPv6reference component. The characters '[', ']' in IPv6reference should be excluded.
      • allowed_host

        public static final BitSet allowed_host
        Those characters that are allowed for the host component. The characters '[', ']' in IPv6reference should be excluded.
      • allowed_within_authority

        public static final BitSet allowed_within_authority
        Those characters that are allowed for the authority component.
      • allowed_abs_path

        public static final BitSet allowed_abs_path
        Those characters that are allowed for the abs_path.
      • allowed_rel_path

        public static final BitSet allowed_rel_path
        Those characters that are allowed for the rel_path.
      • allowed_within_path

        public static final BitSet allowed_within_path
        Those characters that are allowed within the path.
      • allowed_query

        public static final BitSet allowed_query
        Those characters that are allowed for the query component.
      • allowed_within_query

        public static final BitSet allowed_within_query
        Those characters that are allowed within the query component.
      • allowed_fragment

        public static final BitSet allowed_fragment
        Those characters that are allowed for the fragment component.
      • percent

        protected static final BitSet percent
        The percent "%" character always has the reserved purpose of being the escape indicator; it must be escaped as "%25" in order to be used as data within a URI.
      • digit

        protected static final BitSet digit
        BitSet for digit.

         digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                    "8" | "9"
         

      • alpha

        protected static final BitSet alpha
        BitSet for alpha.

         alpha         = lowalpha | upalpha
         

      • alphanum

        protected static final BitSet alphanum
        BitSet for alphanum (join of alpha & digit).

          alphanum      = alpha | digit
         

      • hex

        protected static final BitSet hex
        BitSet for hex.

         hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                                 "a" | "b" | "c" | "d" | "e" | "f"
         

      • escaped

        protected static final BitSet escaped
        BitSet for escaped.

         escaped       = "%" hex hex
         

      • mark

        protected static final BitSet mark
        BitSet for mark.

         mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
                         "(" | ")"
         

      • unreserved

        protected static final BitSet unreserved
        Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved.

         unreserved    = alphanum | mark
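
        The grammar rules up to this point compose naturally as unions of java.util.BitSet instances indexed by character code. The following is a minimal sketch of that idea, not the initialization code of this class; the class name CharClassDemo and its constants are hypothetical.

```java
import java.util.BitSet;

final class CharClassDemo {
    static final BitSet DIGIT = new BitSet(256);
    static final BitSet ALPHA = new BitSet(256);
    static final BitSet ALPHANUM = new BitSet(256);
    static final BitSet MARK = new BitSet(256);
    static final BitSet UNRESERVED = new BitSet(256);

    static {
        DIGIT.set('0', '9' + 1);      // set(from, toExclusive)
        ALPHA.set('a', 'z' + 1);      // lowalpha
        ALPHA.set('A', 'Z' + 1);      // upalpha
        ALPHANUM.or(ALPHA);           // alphanum = alpha | digit
        ALPHANUM.or(DIGIT);
        for (char c : "-_.!~*'()".toCharArray()) {
            MARK.set(c);              // mark characters, one bit each
        }
        UNRESERVED.or(ALPHANUM);      // unreserved = alphanum | mark
        UNRESERVED.or(MARK);
    }

    static boolean isUnreserved(char c) {
        return UNRESERVED.get(c);
    }
}
```

        A membership test is then a single bit lookup, for example CharClassDemo.isUnreserved('~').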
         

      • reserved

        protected static final BitSet reserved
        BitSet for reserved.

         reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                         "$" | ","
         

      • uric

        protected static final BitSet uric
        BitSet for uric.

         uric          = reserved | unreserved | escaped
         

      • fragment

        protected static final BitSet fragment
        BitSet for fragment (alias for uric).

         fragment      = *uric
         

      • query

        protected static final BitSet query
        BitSet for query (alias for uric).

         query         = *uric
         

      • pchar

        protected static final BitSet pchar
        BitSet for pchar.

         pchar         = unreserved | escaped |
                         ":" | "@" | "&" | "=" | "+" | "$" | ","
         

      • param

        protected static final BitSet param
        BitSet for param (alias for pchar).

         param         = *pchar
         

      • segment

        protected static final BitSet segment
        BitSet for segment.

         segment       = *pchar *( ";" param )
         

      • path_segments

        protected static final BitSet path_segments
        BitSet for path segments.

         path_segments = segment *( "/" segment )
         

      • abs_path

        protected static final BitSet abs_path
        URI absolute path.

         abs_path      = "/"  path_segments
         

      • uric_no_slash

        protected static final BitSet uric_no_slash
        URI bitset for encoding typical non-slash characters.

         uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
                         "&" | "=" | "+" | "$" | ","
         

      • opaque_part

        protected static final BitSet opaque_part
        URI bitset that combines uric_no_slash and uric.

         opaque_part   = uric_no_slash *uric
         

      • path

        protected static final BitSet path
        URI bitset that combines absolute path and opaque part.

         path          = [ abs_path | opaque_part ]
         

      • port

        protected static final BitSet port
        Port, a logical alias for digit.
      • IPv4address

        protected static final BitSet IPv4address
        BitSet that combines digit and dot for IPv4address.

         IPv4address   = 1*digit "." 1*digit "." 1*digit "." 1*digit
         

      • IPv6address

        protected static final BitSet IPv6address
        RFC 2373.

         IPv6address = hexpart [ ":" IPv4address ]
         

      • IPv6reference

        protected static final BitSet IPv6reference
        RFC 2732, 2373.

         IPv6reference   = "[" IPv6address "]"
         

      • toplabel

        protected static final BitSet toplabel
        BitSet for toplabel.

         toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
         

      • hostname

        protected static final BitSet hostname
        BitSet for hostname.

         hostname      = *( domainlabel "." ) toplabel [ "." ]
         

      • host

        protected static final BitSet host
        BitSet for host.

         host          = hostname | IPv4address | IPv6reference
         

      • hostport

        protected static final BitSet hostport
        BitSet for hostport.

         hostport      = host [ ":" port ]
         

      • userinfo

        protected static final BitSet userinfo
        BitSet for userinfo.

         userinfo      = *( unreserved | escaped |
                            ";" | ":" | "&" | "=" | "+" | "$" | "," )
         

      • server

        protected static final BitSet server
        BitSet for server.

         server        = [ [ userinfo "@" ] hostport ]
         

      • reg_name

        protected static final BitSet reg_name
        BitSet for reg_name.

         reg_name      = 1*( unreserved | escaped | "$" | "," |
                             ";" | ":" | "@" | "&" | "=" | "+" )
         

      • authority

        protected static final BitSet authority
        BitSet for authority.

         authority     = server | reg_name
         

      • scheme

        protected static final BitSet scheme
        BitSet for scheme.

         scheme        = alpha *( alpha | digit | "+" | "-" | "." )
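
        Because a BitSet only records which characters may appear, the positional constraint in this rule (the first character must be alpha) has to be checked separately. One way to express the whole rule is a regular expression; this is an illustrative transcription of the grammar, not how this class validates schemes, and SchemeDemo is a hypothetical name.

```java
import java.util.regex.Pattern;

final class SchemeDemo {
    // scheme = alpha *( alpha | digit | "+" | "-" | "." )
    static final Pattern SCHEME = Pattern.compile("[A-Za-z][A-Za-z0-9+.-]*");

    static boolean isScheme(String candidate) {
        return SCHEME.matcher(candidate).matches();
    }
}
```

        Under this pattern "http" and "svn+ssh" are accepted, while "1http" and the empty string are rejected.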
         

      • rel_segment

        protected static final BitSet rel_segment
        BitSet for rel_segment.

         rel_segment   = 1*( unreserved | escaped |
                             ";" | "@" | "&" | "=" | "+" | "$" | "," )
         

      • rel_path

        protected static final BitSet rel_path
        BitSet for rel_path.

         rel_path      = rel_segment [ abs_path ]
         

      • net_path

        protected static final BitSet net_path
        BitSet for net_path.

         net_path      = "//" authority [ abs_path ]
         

      • hier_part

        protected static final BitSet hier_part
        BitSet for hier_part.

         hier_part     = ( net_path | abs_path ) [ "?" query ]
         

      • relativeURI

        protected static final BitSet relativeURI
        BitSet for relativeURI.

         relativeURI   = ( net_path | abs_path | rel_path ) [ "?" query ]
         

      • absoluteURI

        protected static final BitSet absoluteURI
        BitSet for absoluteURI.

         absoluteURI   = scheme ":" ( hier_part | opaque_part )
         

      • URI_reference

        protected static final BitSet URI_reference
        BitSet for URI-reference.

         URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
         

    • Constructor Detail

      • URI

        public URI()
    • Method Detail

      • encode

        protected static char[] encode​(String original,
                                       BitSet allowed,
                                       String charset)
                                throws org.apache.http.HttpException
        Encodes URI string.

        This is a two-step mapping: first from original characters to octets, and then from octets to URI characters:

           original character sequence->octet sequence->URI character sequence
         

        An escaped octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing the octet code. For example, "%20" is the escaped encoding for the US-ASCII space character.
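
        The two-step mapping can be sketched as follows. This is a simplified stand-in and not the implementation of this method: the class name EncodeSketch and the helper alphanum() are hypothetical, the allowed set is assumed to contain only US-ASCII characters other than "%", and an unsupported charset surfaces as an unchecked exception from Charset.forName instead of an HttpException.

```java
import java.nio.charset.Charset;
import java.util.BitSet;

final class EncodeSketch {

    // original character sequence -> octet sequence -> URI character sequence
    static char[] encode(String original, BitSet allowed, String charset) {
        byte[] octets = original.getBytes(Charset.forName(charset));
        StringBuilder uri = new StringBuilder();
        for (byte b : octets) {
            int octet = b & 0xFF;                 // treat the byte as unsigned
            if (allowed.get(octet)) {
                uri.append((char) octet);         // allowed: copied through as-is
            } else {                              // otherwise: "%" hex hex
                uri.append('%');
                uri.append(Character.toUpperCase(Character.forDigit(octet >> 4, 16)));
                uri.append(Character.toUpperCase(Character.forDigit(octet & 0xF, 16)));
            }
        }
        return uri.toString().toCharArray();
    }

    // Hypothetical allowed set for the demo: ASCII letters and digits only.
    static BitSet alphanum() {
        BitSet set = new BitSet(256);
        set.set('0', '9' + 1);
        set.set('a', 'z' + 1);
        set.set('A', 'Z' + 1);
        return set;
    }

    public static void main(String[] args) {
        System.out.println(new String(encode("a b%", alphanum(), "UTF-8")));
        System.out.println(new String(encode("\u00e9", alphanum(), "UTF-8")));
    }
}
```

        With this allowed set, the space becomes "%20", the percent sign becomes "%25", and U+00E9 becomes the two UTF-8 octets "%C3%A9".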

        Conversion from the local filesystem character set to UTF-8 will normally involve a two-step process. First convert the local character set to the UCS; then convert the UCS to UTF-8. The first step in the process can be performed by maintaining a mapping table that includes the local character set code and the corresponding UCS code. The next step is to convert the UCS character code to the UTF-8 encoding.

        Mapping between vendor codepages can be done in a very similar manner as described above.

        The only time escape encodings can safely be made is when a URI is being created from its component parts. Escaping and validation are performed internally by this method.

        Parameters:
        original - the original character sequence
        allowed - those characters that are allowed within a component
        charset - the protocol charset
        Returns:
        URI character sequence
        Throws:
        org.apache.http.HttpException - null component or unsupported character encoding
      • decode

        protected static String decode​(char[] component,
                                       String charset)
                                throws org.apache.http.HttpException
        Decodes URI encoded string.

        This is a two-step mapping: first from URI characters to octets, and then from octets to original characters:

           URI character sequence->octet sequence->original character sequence
         

        A URI must be separated into its components before the escaped characters within those components can be safely decoded.

        Note that URI characters that are not UTF-8 may nevertheless be parsed as valid UTF-8. A recent non-scientific analysis found that EUC-encoded Japanese words had a 2.7% false reading; SJIS had a 0.0005% false reading; other encodings such as ASCII or KOI-8 have a 0% false reading.

        The percent "%" character always has the reserved purpose of being the escape indicator; it must be escaped as "%25" in order to be used as data within a URI.

        The unescape method is internally performed within this method.
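
        The reverse mapping can be sketched in the same spirit. This is an assumption-laden illustration rather than the code of this method: DecodeSketch is a hypothetical name, the component is assumed to hold only US-ASCII URI characters, and malformed input is reported with IllegalArgumentException instead of HttpException.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

final class DecodeSketch {

    // URI character sequence -> octet sequence -> original character sequence
    static String decode(char[] component, String charset) {
        ByteArrayOutputStream octets = new ByteArrayOutputStream();
        for (int i = 0; i < component.length; i++) {
            char c = component[i];
            if (c != '%') {
                octets.write(c);                  // plain URI character: one octet
                continue;
            }
            if (i + 2 >= component.length) {
                throw new IllegalArgumentException("incomplete trailing escape pattern");
            }
            int hi = Character.digit(component[i + 1], 16);
            int lo = Character.digit(component[i + 2], 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("invalid escape pattern");
            }
            octets.write((hi << 4) | lo);         // "%" hex hex -> one octet
            i += 2;                               // skip the two hex digits
        }
        return new String(octets.toByteArray(), Charset.forName(charset));
    }

    public static void main(String[] args) {
        System.out.println(decode("Los%20Angeles".toCharArray(), "UTF-8"));
    }
}
```

        Collecting all octets first and converting them to characters in one step matters for multi-octet encodings: "%C3%A9" only becomes a single character when both octets are handed to the charset decoder together.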

        Parameters:
        component - the URI character sequence
        charset - the protocol charset
        Returns:
        original character sequence
        Throws:
        org.apache.http.HttpException - incomplete trailing escape pattern or unsupported character encoding
      • decode

        protected static String decode​(String component,
                                       String charset)
                                throws org.apache.http.HttpException
        Decodes URI encoded string.

        This is a two-step mapping: first from URI characters to octets, and then from octets to original characters:

           URI character sequence->octet sequence->original character sequence
         

        A URI must be separated into its components before the escaped characters within those components can be safely decoded.

        Note that URI characters that are not UTF-8 may nevertheless be parsed as valid UTF-8. A recent non-scientific analysis found that EUC-encoded Japanese words had a 2.7% false reading; SJIS had a 0.0005% false reading; other encodings such as ASCII or KOI-8 have a 0% false reading.

        The percent "%" character always has the reserved purpose of being the escape indicator; it must be escaped as "%25" in order to be used as data within a URI.

        The unescape method is internally performed within this method.
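
        The note about false readings comes down to the fact that the same octets can be valid in more than one charset. A small sketch, using only JDK classes and a hypothetical class name, shows the two octets carried by "%C3%A9" read under two different protocol charsets:

```java
import java.nio.charset.StandardCharsets;

final class CharsetAmbiguityDemo {
    // The two octets carried by the escaped sequence "%C3%A9".
    static final byte[] OCTETS = { (byte) 0xC3, (byte) 0xA9 };

    static String asUtf8() {
        return new String(OCTETS, StandardCharsets.UTF_8);
    }

    static String asLatin1() {
        return new String(OCTETS, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(asUtf8().length());    // one character: U+00E9
        System.out.println(asLatin1().length());  // two characters: U+00C3 U+00A9
    }
}
```

        Both readings succeed without error, so the charset argument has to match what the sender actually used; the decoder cannot detect the mismatch on its own.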

        Parameters:
        component - the URI character sequence
        charset - the protocol charset
        Returns:
        original character sequence
        Throws:
        org.apache.http.HttpException - incomplete trailing escape pattern or unsupported character encoding
        Since:
        3.0