Rogue Wave banner
Previous fileTop of DocumentContentsIndex pageNext file
Objective Toolkit User's Guide
Rogue Wave web site:  Home Page  |  Main Documentation Page

17.3 Example: Setting Up a Scanner

With this bit of background information under our belt, let us now look at how we setup a scanner.


It would help to compile the Deals sample at this point. This sample does not ship with the product. If you need this sample, please submit a request to Stingray customer support.

17.3.1 Introduction to the Deals Sample

The Deals sample retrieves the main page from http://www.thedailydeals.com (a Web site that has information on Web deals). It parses the information available at the site and populates a list control. Clicking on a link in the list control will take you directly to that link. This sample can automatically run every few minutes to gather this data, parse it, and populate the list control.

17.3.2 Declaring the Processor

When using the data extraction framework, you will usually need to concern yourself with only two major classes—the data processor class (CRgProcessor), which contains and drives the data extraction process, and the listener class (CRgDefListener), which acts as a stub for any feedback generated from the scanning process.

In the Deals sample, Class CDealProcessor (derived from CRgProcessor) is declared as shown below:


Typically, this is how we recommend that you declare your own processors. There should be an enum declaration for tokens that will be recognized by the scanner. There should be an enum declaration for states that the scanner will traverse.

The steps for writing code to scan a text stream are:

  1. Identify the tokens that you want to read from the stream.

  2. Define an enum and add a value for each of these tokens.

  3. Consider how you will extract the information, thinking ahead about the states and sub states to use to facilitate implementation.

For example, if you are looking at stock information from www.Yahoo.com you may be looking at data much like the listing below. (A series of three dots indicates where code has been deleted for brevity.)

In the HTML above, we have highlighted (in bold) the portions of interest—the stock quote area and the news area. To scan these areas you need different regular expressions. This usually implies that you should have these two areas denoted by different states.

Within each state, you may need to extract each needed token in a different manner. On the Yahoo! page, for example, you have to extract the title heading of the news link in a different manner than you will extract the article reference. Each of these would be sub states within the encompassing state. Sub states are not important when you define the token enum values; however, thinking about the sub states ahead of time makes implementing the scanner easier.

17.3.3 Implementation Details

Now that you understand the rationale for the two enum definitions, consider the data that we scan when interacting with http://www.thedailydeals.com. If you make a request to http://www.thedailydeals.com and click View|Source (or the equivalent command in your browser) you will see the entire HTML stream that returns from the server. The sample makes an HTTP request for the default page using the Microsoft Wininet library.

17.3.3.1 Token Definitions

Based on the data available at the Web site, we determined the information to extract (shown in Table 42). This should lead us to the token definitions. Therefore, the following tokens correspond to the three pieces of information we want.

Table 42: Information and token definitions

Information Required Token Definition
The description of the deal. etokenDealName
The link to it on the Daily Deals Web site. etokenRef
Whether the deal has expired or not. etokenExpired

To make the interpretation of the feedback data a little easier in the sample, we have a separate token for the expired state and the not expired state. We could have passed the token as it is and then looked at the string to see if it was expired or not expired. However, because we have two different tokens, the regular expression engine will do the work for us.

This leads us to define the following four tokens:

17.3.3.2 States

The next part is kind of heuristic in nature; each person may come up with a different solution. Looking at the chunk of data returned from the site, we came up with the following states:

estateNormal is the state that we start with. When we see the start of the deal listings we go into estateDeals. If we find another deal, we continue in estateDeals. If not, we exit the scanner.

The next step is to decide how we go about shifting between the states and what data we collect in each state. The conclusions of this exercise are translated into code using macros provided in the body of the Initialize() function (in the processor).

Each individual state is started with the macro rg_start_state(). This macro takes two parameters—the name of the state variable and the name of the state identifier token (one of the possible values of the state enum that we defined earlier). Inside the state declaration we can add sub states as desired. Each sub state is defined by an rg_sub_state()macro. This takes a token ID (from the enum that we defined earlier for this purpose) and the regular expression that extracts the data.

17.3.3.3 Sub Expressions

Be sure to place any data that you wish to collect inside parentheses so that a sub expression is defined. See Table 43 for an example.

Table 43: Defining a sub expression

General sample form... Regular expression should be...
My name is John. My name is (.*)

Any token that is scanned will have the sub expressions available. These can be used to initialize our data easily.

You can add any number of sub states within each state. Each of these sub states will be OR'd (combined) with the others to create a composite expression for that state. This implies that you should not try to gather data using separate regular expressions (defining two sub states) when there is overlap. Instead, combine these into the same expression. For example, consider:

Table 44: Using separate regular expressions

For example... Don't do this... Do this...
My name is happyface; my e-mail address is happyface@roguewave.com. My name is ([a-z]*); my e-mail address is happyface@roguewave.com.andmy e-mail address is (.*) My name is ([a-z]*); my e-mail address is (.*)

Each sub state also takes a third parameter that specifies the action that should be taken when a scan is complete. This can be one of three possible values.

End the state with rg_end_state(). You can have any number of states within the processor.

The Deals sample implements the required Initialize() function as shown below:

We should mention two other macros used in the Deals sample:

rg_start_processor()
rg_end_processor()

17.3.3.4 Defining the Listener

The next step is to define the listener. The listener class defines the feedback interface through which we get information from the processor. For the Deals sample the listener is declared in the dealscan.h header file. The declaration is shown below:

17.3.3.5 Optional Overrides

The only override that is directly tied to the scanner is the OnMatch() override. After the listener has been added to the scanner, the OnMatch() function will be called when a match takes place. Other optional overrides that you can use, if necessary, are listed below (with a brief description of each one):

17.3.3.6 OnMatch() Implementation

Let us now look at the OnMatch() implementation for the CDealListener class in a little more detail. It looks like this:

You start with the rg_start_token() and rg_end_token() macros. Use the rg_assign_string() macro to assign the scanned sub expressions to strings. This macro takes a token ID and a string that is to be used for the assignment. It also takes a number that specifies the 0 based index of the sub string that you want to assign. Pass 0 if you want the entire match to be assigned.

You can also check the token ID directly and write any additional code as seen in the code shown below.

Additional code in the listener is application specific and simply declares and allocates structures to deal with the data that we collect.

Figure 141 shows the Deals sample in action.

Figure 141: Latest deals on the Web sample



Previous fileTop of DocumentContentsNo linkNext file

Copyright © Rogue Wave Software, Inc. All Rights Reserved.

The Rogue Wave name and logo, and Stingray, are registered trademarks of Rogue Wave Software. All other trademarks are the property of their respective owners.
Provide feedback to Rogue Wave about its documentation.