Class NonSequentialPDFParser


  • public class NonSequentialPDFParser
    extends PDFParser
    A PDFParser that first reads startxref and the xref tables to determine which objects are valid, and then parses only those objects. This makes it closer to a conforming parser than the sequential reading done by PDFParser. This class can be used as a replacement for PDFParser. parse() must be called before the document or its page objects can be retrieved, e.g. via getPDDocument(). This class is a much enhanced version of the QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.
    • Field Detail

      • SYSPROP_PARSEMINIMAL

        public static final java.lang.String SYSPROP_PARSEMINIMAL
        See Also:
        Constant Field Values
      • SYSPROP_EOFLOOKUPRANGE

        public static final java.lang.String SYSPROP_EOFLOOKUPRANGE
        See Also:
        Constant Field Values
      • DEFAULT_TRAIL_BYTECOUNT

        protected static final int DEFAULT_TRAIL_BYTECOUNT
        See Also:
        Constant Field Values
      • EOF_MARKER

        protected static final char[] EOF_MARKER
        EOF-marker.
      • STARTXREF_MARKER

        protected static final char[] STARTXREF_MARKER
        StartXRef-marker.
      • OBJ_MARKER

        protected static final char[] OBJ_MARKER
        obj-marker.
      • securityHandler

        protected SecurityHandler securityHandler
        The security handler.
    • Constructor Detail

      • NonSequentialPDFParser

        public NonSequentialPDFParser​(java.lang.String filename)
                               throws java.io.IOException
        Constructs a parser for the given file, using a memory buffer.
        Parameters:
        filename - the filename of the pdf to be parsed
        Throws:
        java.io.IOException - If something went wrong.
      • NonSequentialPDFParser

        public NonSequentialPDFParser​(java.io.File file,
                                      RandomAccess raBuf)
                               throws java.io.IOException
        Constructs a parser for the given file, using the given buffer for temporary storage.
        Parameters:
        file - the pdf to be parsed
        raBuf - the buffer to be used for parsing
        Throws:
        java.io.IOException - If something went wrong.
      • NonSequentialPDFParser

        public NonSequentialPDFParser​(java.io.File file,
                                      RandomAccess raBuf,
                                      java.lang.String decryptionPassword)
                               throws java.io.IOException
        Constructs a parser for the given file, using the given buffer for temporary storage and the given password for decryption.
        Parameters:
        file - the pdf to be parsed
        raBuf - the buffer to be used for parsing
        decryptionPassword - password to be used for decryption
        Throws:
        java.io.IOException - If something went wrong.
      • NonSequentialPDFParser

        public NonSequentialPDFParser​(java.io.InputStream input)
                               throws java.io.IOException
        Constructs a parser for the PDF read from the given input stream.
        Parameters:
        input - input stream representing the pdf.
        Throws:
        java.io.IOException - If something went wrong.
      • NonSequentialPDFParser

        public NonSequentialPDFParser​(java.io.InputStream input,
                                      RandomAccess raBuf,
                                      java.lang.String decryptionPassword)
                               throws java.io.IOException
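        Constructs a parser for the PDF read from the given input stream, using the given buffer for parsing and the given password for decryption.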
        Parameters:
        input - input stream representing the pdf.
        raBuf - the buffer to be used for parsing
        decryptionPassword - password to be used for decryption.
        Throws:
        java.io.IOException - If something went wrong.
    • Method Detail

      • setEOFLookupRange

        public void setEOFLookupRange​(int byteCount)
        Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.

        If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overwritten later by calling this method.

        Parameters:
        byteCount - number of trailing bytes
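
        The precedence described above (default, then system property, then explicit setter) can be sketched as follows. This is a minimal standalone illustration, not the parser's actual code: the property name, the default of 2048 and the getter are placeholders chosen for the example, and ignoring malformed or non-positive values is an assumption.

```java
final class EofLookupRangeSketch {

    // Placeholder property name; the real value of SYSPROP_EOFLOOKUPRANGE is not shown on this page.
    static final String SYSPROP_EOFLOOKUPRANGE = "example.parser.eofLookupRange";

    // Assumed default; the real DEFAULT_TRAIL_BYTECOUNT value is not shown on this page.
    static final int DEFAULT_TRAIL_BYTECOUNT = 2048;

    private int readTrailBytes = DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // On initialization, a defined system property overrides the default.
        String value = System.getProperty(SYSPROP_EOFLOOKUPRANGE);
        if (value != null) {
            try {
                setEOFLookupRange(Integer.parseInt(value.trim()));
            } catch (NumberFormatException e) {
                // Assumption: a malformed property value is ignored and the default is kept.
            }
        }
    }

    // A later explicit call overwrites whatever was set on initialization.
    void setEOFLookupRange(int byteCount) {
        if (byteCount > 0) { // assumption: non-positive values are ignored
            readTrailBytes = byteCount;
        }
    }

    // Hypothetical accessor, added only so the effect can be observed.
    int getEOFLookupRange() {
        return readTrailBytes;
    }
}
```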
      • initialParse

        protected void initialParse()
                             throws java.io.IOException
        The initial parse first reads only the trailer, the startxref entry and all xref tables, so that a pointer (offset) is known for each of the PDF's objects. It can handle linearized PDFs, which have an xref at the end pointing to an xref at the beginning of the file. Finally, the root object is parsed.
        Throws:
        java.io.IOException - If something went wrong.
      • setPdfSource

        protected final void setPdfSource​(long fileOffset)
                                   throws java.io.IOException
        Sets BaseParser.pdfSource to start next parsing at given file offset.
        Parameters:
        fileOffset - file offset
        Throws:
        java.io.IOException - If something went wrong.
      • releasePdfSourceInputStream

        protected final void releasePdfSourceInputStream()
                                                  throws java.io.IOException
        Enable handling of alternative pdfSource implementation.
        Throws:
        java.io.IOException - If something went wrong.
      • getStartxrefOffset

        protected final long getStartxrefOffset()
                                         throws java.io.IOException
        Looks for and parses startxref. The last '%%EOF' marker is searched for first, within the last DEFAULT_TRAIL_BYTECOUNT bytes (or within the range set via setEOFLookupRange(int)); the search then goes back from there to find startxref.
        Returns:
        the offset of StartXref
        Throws:
        java.io.IOException - If something went wrong.
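
        The lookup described above can be sketched on an in-memory byte array. This is a simplified illustration under stated assumptions, not the method's actual implementation: the class and method names are hypothetical, the whole file is assumed to fit in memory, and the error messages are made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class StartXrefSketch {

    /**
     * Finds the last '%%EOF' within the last trailByteCount bytes, goes back
     * to the preceding 'startxref' and parses the number between the two.
     */
    static long findStartXref(byte[] pdf, int trailByteCount) throws IOException {
        int from = Math.max(0, pdf.length - trailByteCount);
        // ISO-8859-1 maps every byte to exactly one char, so offsets stay aligned.
        String tail = new String(pdf, from, pdf.length - from, StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IOException("Missing '%%EOF' marker in the last " + trailByteCount + " bytes");
        }
        // Search backwards from the EOF marker.
        int marker = tail.lastIndexOf("startxref", eof);
        if (marker < 0) {
            throw new IOException("Missing 'startxref' marker before '%%EOF'");
        }
        String number = tail.substring(marker + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            throw new IOException("Invalid startxref value: '" + number + "'");
        }
    }
}
```

        If the lookup range is too small to contain both markers, this sketch fails with an IOException, which is the situation setEOFLookupRange(int) is meant to address for files with a lot of trailing data.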
      • lastIndexOf

        protected int lastIndexOf​(char[] pattern,
                                  byte[] buf,
                                  int endOff)
        Searches for the last occurrence of pattern within the buffer. The lookup starts before endOff and goes back to offset 0.
        Parameters:
        pattern - pattern to search for
        buf - buffer to search pattern in
        endOff - offset (exclusive) where lookup starts at
        Returns:
        start offset of pattern within buffer or -1 if pattern could not be found
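
        A backward search with these semantics could look like the following standalone sketch. It follows the documented contract (exclusive end offset, start offset or -1 as result); returning -1 for an empty pattern or an out-of-range endOff is an assumption, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class LastIndexOfSketch {

    /**
     * Returns the start offset of the last occurrence of pattern that ends
     * at or before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        if (pattern.length == 0 || endOff > buf.length) {
            return -1; // assumption: degenerate input is reported as "not found"
        }
        // Try every candidate start offset, from the latest possible one down to 0.
        for (int start = endOff - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == (byte) pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] buf = "startxref\n116\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        int eof = lastIndexOf("%%EOF".toCharArray(), buf, buf.length);
        System.out.println(eof);                                          // prints 14
        System.out.println(lastIndexOf("startxref".toCharArray(), buf, eof)); // prints 0
    }
}
```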
      • readPattern

        protected final void readPattern​(char[] pattern)
                                  throws java.io.IOException
        Reads the given pattern from BaseParser.pdfSource, skipping whitespace before and after it.
        Parameters:
        pattern - pattern to be skipped
        Throws:
        java.io.IOException - if pattern could not be read
      • parse

        public void parse()
                   throws java.io.IOException
        This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.
        Overrides:
        parse in class PDFParser
        Throws:
        java.io.IOException - If there is an error reading from the stream or corrupt data is found.
      • getPdfFile

        protected java.io.File getPdfFile()
        Return the pdf file.
        Returns:
        the pdf file
      • isLenient

        public boolean isLenient()
        Returns true if the parser is lenient, meaning that the auto-healing capabilities of the parser are used.
        Returns:
        true if parser is lenient
      • setLenient

        public void setLenient​(boolean lenient)
                        throws java.lang.IllegalArgumentException
        Changes the parser leniency flag. This method can only be called before the file is parsed.
        Parameters:
        lenient - true to make the parser lenient, false otherwise
        Throws:
        java.lang.IllegalArgumentException - if the method is called after parsing.
      • deleteTempFile

        protected void deleteTempFile()
        Removes the temporary file. A temporary file is created if this class is instantiated with an InputStream.
      • getSecurityHandler

        public SecurityHandler getSecurityHandler()
        Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.
        Returns:
        the security handler.
      • getPDDocument

        public PDDocument getPDDocument()
                                 throws java.io.IOException
        This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources. Overriding the super method was necessary in order to set the security handler.
        Overrides:
        getPDDocument in class PDFParser
        Returns:
        The document at the PD layer.
        Throws:
        java.io.IOException - If there is an error getting the document.
      • getPageNumber

        public int getPageNumber()
                          throws java.io.IOException
        Returns the number of pages in a document.
        Returns:
        the number of pages.
        Throws:
        java.io.IOException - if PAGES or other needed object is missing
      • getPage

        public PDPage getPage​(int pageNr)
                       throws java.io.IOException
        Returns the page requested with all the objects loaded into it.
        Parameters:
        pageNr - zero-based page number; the first page is 0
        Returns:
        the page with the given page number.
        Throws:
        java.io.IOException - If something went wrong.
      • parseObjectDynamically

        protected final COSBase parseObjectDynamically​(COSObject obj,
                                                       boolean requireExistingNotCompressedObj)
                                                throws java.io.IOException
        This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
        Parameters:
        obj - object to be parsed (we only take object number and generation number for lookup start offset)
        requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
        Returns:
        the parsed object (which is also added to document object)
        Throws:
        java.io.IOException - If an IO error occurs.
      • parseObjectDynamically

        protected COSBase parseObjectDynamically​(int objNr,
                                                 int objGenNr,
                                                 boolean requireExistingNotCompressedObj)
                                          throws java.io.IOException
        This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
        Parameters:
        objNr - object number of object to be parsed
        objGenNr - object generation number of object to be parsed
        requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (note: null objects may be missing from xref) and it must not be a compressed object within an object stream (this is used to avoid getting stuck in a loop in a malicious PDF)
        Returns:
        the parsed object (which is also added to document object)
        Throws:
        java.io.IOException - If an IO error occurs.
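
        The effect of requireExistingNotCompressedObj can be illustrated with a small standalone sketch of an xref lookup. Everything here is hypothetical (the Entry type, the map layout, the method names and the messages); it only mirrors the documented rule that, when the flag is set, the object must be present in the xref and must not live inside an object stream.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

final class XrefLookupSketch {

    /** Hypothetical xref entry: where an object lives and whether it sits inside an object stream. */
    static final class Entry {
        final long offset;
        final boolean compressed;

        Entry(long offset, boolean compressed) {
            this.offset = offset;
            this.compressed = compressed;
        }
    }

    private final Map<Long, Entry> xref = new HashMap<>();

    // Packs object number and generation number into one map key.
    private static long key(int objNr, int objGenNr) {
        return ((long) objNr << 32) | (objGenNr & 0xFFFFFFFFL);
    }

    void add(int objNr, int objGenNr, long offset, boolean compressed) {
        xref.put(key(objNr, objGenNr), new Entry(offset, compressed));
    }

    /**
     * Returns the xref entry for the object, or null if the object is missing
     * from the xref and the caller tolerates that (it is then a null object).
     */
    Entry lookup(int objNr, int objGenNr, boolean requireExistingNotCompressedObj)
            throws IOException {
        Entry entry = xref.get(key(objNr, objGenNr));
        if (requireExistingNotCompressedObj) {
            if (entry == null) {
                throw new IOException("Object " + objNr + " " + objGenNr + " must be defined in xref");
            }
            if (entry.compressed) {
                throw new IOException("Object " + objNr + " " + objGenNr + " must not be compressed");
            }
        }
        return entry;
    }
}
```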
      • decryptDictionary

        protected final void decryptDictionary​(COSDictionary dict,
                                               long objNr,
                                               long objGenNr)
                                        throws java.io.IOException
        Decrypts given COSDictionary.
        Parameters:
        dict - the dictionary to be decrypted
        objNr - the object number
        objGenNr - the object generation number
        Throws:
        java.io.IOException - if something went wrong
      • decryptString

        protected final void decryptString​(COSString str,
                                           long objNr,
                                           long objGenNr)
                                    throws java.io.IOException
        Decrypts given COSString.
        Parameters:
        str - the string to be decrypted
        objNr - the object number
        objGenNr - the object generation number
        Throws:
        java.io.IOException - if something went wrong
      • decrypt

        protected final void decrypt​(COSBase pb,
                                     int objNr,
                                     int objGenNr)
                              throws java.io.IOException
        Decrypts given object.
        Parameters:
        pb - the object to be decrypted
        objNr - the object number
        objGenNr - the object generation number
        Throws:
        java.io.IOException - if something went wrong
      • parseCOSStream

        protected COSStream parseCOSStream​(COSDictionary dic,
                                           RandomAccess file)
                                    throws java.io.IOException
        This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means stream data is copied without testing for 'endstream' or 'endobj', so it is no problem if these keywords occur within the stream. 'endstream' is required to be found after the stream data has been read.
        Overrides:
        parseCOSStream in class BaseParser
        Parameters:
        dic - dictionary that goes with this stream.
        file - file to write the stream to when reading.
        Returns:
        parsed pdf stream.
        Throws:
        java.io.IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
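
        The length-based reading described for parseCOSStream can be sketched without any PDFBox types. This is a simplified illustration, not the method's actual code: the class and method names are hypothetical, the length is assumed to be already resolved to an int, and the set of whitespace skipped before 'endstream' is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class LengthBasedStreamSketch {

    /**
     * Copies exactly length bytes of stream data without inspecting them,
     * then requires the 'endstream' keyword to follow.
     */
    static byte[] readStreamBody(InputStream in, int length) throws IOException {
        byte[] data = new byte[length];
        int read = 0;
        while (read < length) {
            int n = in.read(data, read, length - read);
            if (n < 0) {
                throw new IOException("Stream too short: expected " + length + " bytes, got " + read);
            }
            read += n;
        }

        // Skip the end-of-line (or spaces) that may separate the data from the keyword.
        int c = in.read();
        while (c == '\r' || c == '\n' || c == ' ') {
            c = in.read();
        }
        // Note: one byte past the keyword is consumed here; a real parser would push it back.
        for (byte expected : "endstream".getBytes(StandardCharsets.ISO_8859_1)) {
            if (c != expected) {
                throw new IOException("Expected 'endstream' after " + length + " bytes of stream data");
            }
            c = in.read();
        }
        return data;
    }
}
```

        Because the data is copied by length, a body such as "abc endstream def" is read intact even though it contains the keyword; a wrong length is detected afterwards, when 'endstream' does not follow the data.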