The HTML Parser Libraries

These Java libraries provide programmatic access to the contents of local or remote HTML resources.

Components

The HTML Parser distribution is composed of:
  • a low-level lexer that converts the characters of an HTML page into a linear sequence of nodes
  • a high-level parser that provides a hierarchical document model of an HTML page
  • source code in the src.zip file
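
The difference between the two levels can be illustrated with a toy sketch. This is not the library's code: ToyHtmlSketch, ToyNode, lex and parse are hypothetical names made up for the illustration, and the sketch ignores attributes, comments, void tags such as <br>, and malformed markup. It only shows a flat node sequence on one side and a nested model built from that sequence on the other.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy illustration only: these types are hypothetical and are NOT the
// org.htmlparser classes.
final class ToyHtmlSketch {

    /** A node is either a tag ("<p>", "</p>") or a run of text. */
    static final class ToyNode {
        final String text;                               // raw source text of the node
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String text) { this.text = text; }

        boolean isTag()    { return text.startsWith("<"); }
        boolean isEndTag() { return text.startsWith("</"); }
    }

    /** Low level: characters in, flat list of nodes out. */
    static List<ToyNode> lex(String html) {
        List<ToyNode> nodes = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int end;
            if (html.charAt(i) == '<') {
                // A tag runs up to and including the next '>'.
                int close = html.indexOf('>', i);
                end = (close < 0) ? html.length() : close + 1;
            } else {
                // Text runs up to, but not including, the next '<'.
                int open = html.indexOf('<', i);
                end = (open < 0) ? html.length() : open;
            }
            nodes.add(new ToyNode(html.substring(i, end)));
            i = end;
        }
        return nodes;
    }

    /** High level: nest the flat nodes under a synthetic root node. */
    static ToyNode parse(String html) {
        ToyNode root = new ToyNode("#root");
        Deque<ToyNode> open = new ArrayDeque<>();
        open.push(root);
        for (ToyNode node : lex(html)) {
            if (node.isEndTag()) {
                if (open.size() > 1) open.pop();         // close the innermost open tag
            } else {
                open.peek().children.add(node);
                if (node.isTag()) open.push(node);       // later nodes become its children
            }
        }
        return root;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        // Flat view: <p>, "Hello ", <b>, "world", </b>, </p>
        System.out.println(lex(html).size());                    // prints 6
        // Nested view: <p> holds the text "Hello " and the <b> tag
        ToyNode p = parse(html).children.get(0);
        System.out.println(p.children.size());                   // prints 2
    }
}
```

In the flat view every tag and text run is a sibling of the others. In the nested view the end tags are consumed to decide where each element closes, so "world" ends up as a child of <b>, which is itself a child of <p>.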

Getting Started

For novice users, HTML Parser for Dummies provides an introductory guide to setting up your environment to use the HTML Parser.

Building

To build the HTML Parser, get the sources from the HTML Parser project on SourceForge if you haven't already, then follow the build instructions.

Outstanding Issues

Bugs are by far the highest-priority issues. Bug reports for the HTML Parser are available from the Bug Tracker on SourceForge. Issues involving incorrect behaviour of the current parser should be logged and tracked there. Please use the task lists and enhancement requests for issues that would not be considered bugs.

Several task lists track items that are not perceived as bugs but that developers consider in need of attention. The following list summarizes the purpose and target issues of each task list.

  • Applications - Work associated with the sample applications included with the HTML Parser download is tracked by this list. This would also include proposals for other example applications.
  • Release - Work to be done before a major release is tracked by this list. Items included here must be resolved before the major release is considered complete. This can include refactoring, code clean-up, out-of-the-box experience work, build process fixes, platform (JDK) issues, performance or scalability enhancements, memory usage issues and other 'quality' issues that are not associated with a specific bug.
  • API - Work needed to enhance or fix the parser API is tracked by this list. Standards compliance, additional classes, method signatures, changes to parameter types, refactoring, deprecation, new or enhanced constructors, and other programmatic interface issues would fall into this category. This list should be limited to those changes that could impact the developer community that relies on existing behaviour from the parser.
  • Documentation - Work associated with documenting the parser and its example code and sample applications is tracked by this list. Javadocs, the web site and Wiki, SourceForge site maintenance, mailing lists, forums, project documentation and other developer-visible reference material would all fall under this category.

The Request For Enhancement list contains items that are proposed for future versions of the parser. Users may add items to this list that they feel go beyond simple bug fixing. Some user-entered bugs are also transferred to this list if the fix would be too significant a change for the current version, or would involve API changes that need to be vetted against the current user community.

Mailing Lists

If you want to be notified when new releases of HTML Parser are available, join the HTML Parser Announcement List.
If you have questions about the usage of the parser, join the HTML Parser User List.
If you want to join as a developer, please sign up on the HTML Parser Developer List.

Packages

  • org.htmlparser - The basic API classes that most developers will use when working with the HTML Parser.
  • org.htmlparser.beans - The beans package contains Java Beans that use the HTML Parser.
  • org.htmlparser.filters - The filters package contains example filters to select only desired nodes.
  • org.htmlparser.http - The http package is responsible for HTTP connections to servers.
  • org.htmlparser.lexer - The lexer package is the base level I/O subsystem.
  • org.htmlparser.lexerapplications.tabby - The Tabby program is a demonstration of how to use the underlying Lexer classes to perform file I/O.
  • org.htmlparser.lexerapplications.thumbelina - Extracts the images behind thumbnail images.
  • org.htmlparser.nodes - The nodes package has the concrete node implementations.
  • org.htmlparser.parserapplications - Example applications.
  • org.htmlparser.parserapplications.filterbuilder
  • org.htmlparser.parserapplications.filterbuilder.layouts
  • org.htmlparser.parserapplications.filterbuilder.wrappers
  • org.htmlparser.sax - The sax package implements a SAX (Simple API for XML) parser for HTML.
  • org.htmlparser.scanners - The scanners package contains classes responsible for the tertiary identification of tags.
  • org.htmlparser.tags - The tags package contains specific tags.
  • org.htmlparser.util - Code that can be reused by many classes is located in this package.
  • org.htmlparser.util.sort - Provides generic sorting and searching.
  • org.htmlparser.visitors - The visitors package contains classes that use the Visitor pattern.