Rogue Wave banner
Previous fileTop of DocumentContentsIndex pageNext file
Objective Toolkit User's Guide
Rogue Wave web site:  Home Page  |  Main Documentation Page

17.2 Data Extraction Framework

Since the Web has become a huge virtual database of sorts, more and more often these days we find ourselves wanting to scrape data from Web sites (although this tactic may not be legal or allowed on some sites).

However, most data commonly available on the Web is in HTML format, which cannot be easily processed by software systems that do not understand HTML presentation. Most systems that extract and process data from the Web (including shopping agents, personal news bots, etc.) fall into this category.

When you extract only the required data from HTML streams, software systems designed to process that data can easily consume it — without regard to its formatting. With simple HTML streams (such as the one listed above) this is not a very difficult process. Any regular expression scanner will do nicely. With more complex files though, this quickly becomes quite difficult. The following issues are the biggest stumbling blocks.

Stingray developers considered these two issues carefully when designing the Objective Toolkit library. With reference to the first issue, the system that we have in place does not provide much of an advantage over lex (or other such systems). We offer the same functionality in terms of states. However, we believe that we have a more object oriented, easily extensible solution for the following reasons:



Previous fileTop of DocumentContentsNo linkNext file

Copyright © Rogue Wave Software, Inc. All Rights Reserved.

The Rogue Wave name and logo, and Stingray, are registered trademarks of Rogue Wave Software. All other trademarks are the property of their respective owners.
Provide feedback to Rogue Wave about its documentation.