Tik-76.115 Functional Specification

Adaptive WWW Proxy Server for Mobile Communication

http://www.hut.fi/~aschantz/moxy/documents/to.html

Project team: Moxy
Document: Functional Specification Version: $Revision: 1.116 $ State:
Last update: $Date: 2000/02/16 02:50:31 $ Author: Anette Bernas,  Tancred Lindholm, Mikko Nuutinen, Ilkka Pirttimaa
Review date: Nov 2 1999 Reviewed by Anette Bernas,  Tancred Lindholm, Mikko Nuutinen, Ilkka Pirttimaa
Approval date: Approved by

Table of Contents

Summary
1. Introduction
2. General Description
    2.1 Users
        2.1.1 Core: QoS-enhanced web surfing
        2.1.2 Core: A research team developing a new mobile WWW application
        2.1.3 Base: Corporation publishing a QoS-aware WWW service
        2.1.4 Optional: ISP providing a QoS-aware gateway to the Internet
    2.2 Technology - Moxy server
    2.3 Technology - Client
    2.4 Scenarios
        2.4.1 Core: Scenario 1 - Moxy as a proxy
        2.4.2 Optional: Scenario 2 - Running Moxy on a client machine
        2.4.3 Core: Scenario 3 - Running Moxy as a Web server 
    2.5 Data Flow
    2.6 Tools
3. Data Structures 
    3.1 Page Content Tree (PCT)
    3.2 Rule and Filter Store (RFS) 
    3.3 Moxy Parameter List (MPL)
    3.4 Collection of Moxy Parameter Lists (CMPL)
    3.5 Parameter Store (PS)
4. Functions
    4.1 Core
        4.1.1 Get Document
        4.1.2 Redirector
        4.1.3 Generate Page Content Tree
        4.1.4 HTML Parser
        4.1.5 Parse Moxy Parameters
        4.1.6 Filter Planner and Executor (FPE)
        4.1.7 Filter and Rule Manager (FRM)
        4.1.8 Moxy Internal Get Document
        4.1.9 HTML Unparse
        4.1.10 Output Page Content Tree 
        4.1.11 Core Filters
            4.1.11.1 Image Filter
            4.1.11.2 Frame Removal Filter
            4.1.11.3 Comment Tag Removal Filter
            4.1.11.4 JavaScript Removal Filter
            4.1.11.5 Junk Filter
            4.1.11.6 Font Tag Filter
            4.1.11.7 Page Splitting Filter
            4.1.11.8 Table Filter
    4.2 Base
        4.2.1 A monitor for rule execution
        4.2.2 QoS control through custom HTML tags
        4.2.3 Caching
        4.2.4 User Accounts
5. External Interfaces and Connections
    5.1 User Interfaces
        5.1.1 Core: Web Surfer Graphical User Interface
            5.1.1.1 Transparent web surfing
            5.1.1.2 Personal Settings
            5.1.1.3 Low-level GUI
        5.1.2 Core: Administrator Text Based User Interface
        5.1.3 Base: Administrator Graphical User Interface
        5.1.4 Base: Debug Tool GUI
    5.2 Data Communications
6. Additional Features
7. References


Summary

Mobile devices are gaining a footing in the Internet world. Because of the widely varying capabilities of mobile devices (display size, bandwidth etc.), the multimedia content of the Internet needs to be modified in order to increase the usability.

The mobile WWW proxy Moxy offers QoS control to people surfing the Web with mobile devices, for example personal digital assistants or hand-held computers. Other potential users are enterprises offering mobile WWW applications. With Moxy, different versions of the web pages do not have to be designed for different devices.

The main functionalities of Moxy are implementing QoS control for mobile clients, as well as server-side configurable QoS control. The web surfer can select from different levels of QoS using his web browser. The levels, defined on the server side, could for example be "no frames" or "low quality images". The usability of the web pages is increased due to the simplification of the pages. The server-side configurable QoS control is implemented with rules that define a mapping from client parameters (bandwidth, screen capabilities etc.) to what filters should be used.

Other important functionalities are filtering HTML in ways that make the essential data of an HTML page accessible, and providing an interface for plug-in custom filters, which allows users to write filters of their own and use them with Moxy. At least the core filters specified in section 4.1.11 will be implemented.

The Moxy software is integrated into a proxy server. The Moxy server acts as an intermediary between the user of a mobile device and a Web server, receiving and forwarding traffic between them. When a client downloads a Web page from a WWW server, the request for the page goes to the WWW server through the Moxy server, and Moxy filters the page automatically as it is downloaded. To the user, the Moxy server is transparent; all Internet requests and responses appear to be exchanged directly with the Web site.

The Moxy Content Filter Engine handles the dynamic simplification of the HTML pages. The HTML document is parsed and the Moxy parameters are analysed in order to determine which properties the filtered page shall have. Using this information, the Filter Engine determines which filters need to be applied to a page, as well as the order of application.

1. Introduction

This document is the Functional Specification for the Mobile WWW Proxy Moxy, which is to be implemented by the Moxy team.

The popularity of Internet services and the widespread use of mobile devices have created a demand for access to the Internet from mobile devices. The widely varying capabilities of mobile devices, in terms of display size, storage, processing power and bandwidth, have resulted in a need to convert Internet content into a format suitable for mobile equipment.

The mobile WWW proxy Moxy offers a guaranteed QoS to people surfing the Internet with a mobile device. Other potential users are Internet service providers offering a QoS-sensitive gateway to the Internet and enterprises offering mobile WWW applications. With the mobile WWW proxy, different versions of the web pages do not have to be designed for different devices.

The mobile WWW proxy developed in this project will be used in the GO-PROD research project as a prototype tool for controlling QoS of mobile Internet services.[1]

The main goals of the project are to:

The volatility of the project has strong implications on the requirements. As a basic risk managing strategy the requirements have been divided into three levels:

  1. Core requirements: These requirements are considered to be primary and should be met upon completion of the project.
  2. Base requirements: Secondary requirements, of which as many as possible, within reasonable effort, should be met upon project completion.
  3. Optional features: Things that may be implemented if there is time and/or they require little effort compared to their usefulness. This list includes "wild ideas", some of which will be totally beyond the scope of the project, thought up during customer meetings. Since the project is very much charting unknown territory it entails a great deal of prototyping and research.

Chapter 2 describes the mobile WWW proxy on a general level. The users of the proxy and the user environments are presented. Three different usage scenarios are explained without going into technical details. Furthermore, this chapter contains an illustration of the data flow between the different components of the proxy, as well as the tools and equipment needed. In chapter 3 the data structures related to the project are presented. Chapter 4 specifies the functions of the mobile WWW proxy. The presentation of a function includes the purpose of the function and the procedure, as well as the input and output parameters. In chapter 5 user interfaces and data communications are presented. Chapter 6 contains a few non-functional features of the proxy.

A note about the term "filter": The proxy should allow the user to define the changes made to HTML pages with rules. Rules may be both declarative ("what") and procedural ("how"). The meaning of the term filter in this document is "procedural rule".

2. General Description

2.1 Users

There are two main types of users: WWW client users and technically skilled server administrators. A base requirement of the system is to also support server administrators with moderate technical knowledge. Another type of user is the HTML content producer. The following usage scenarios describe the characteristics of each user type. The user interfaces are presented in chapter 5.

2.1.1 Core: QoS-enhanced web surfing

The mobile WWW proxy is used by a person who is familiar with basic Internet usage. The person notices that some pages may look different from what he is used to: sometimes there are no pictures, frames may have been removed and the font is large enough to read even on small screens. The filtering made by the proxy increases the usability of the pages. The use of the proxy is practically invisible to the user.

2.1.2 Core: A research team developing a new mobile WWW application

A research team has developed a novel WWW application, which they want to be able to use over connections with varying bandwidth and on devices with different screen capabilities. Instead of having to design different versions of the service, they configure the proxy to filter the WWW pages of the service. It is possible to extend the proxy with custom-made filters if the standard filters are not sufficient. The rules of the proxy are flexible, yet simple enough that the researchers feel comfortable using them.

2.1.3 Base: Corporation publishing a QoS-aware WWW service

A corporation publishing a WWW service offers QoS guaranteed mobile access to the users. By configuring the proxy to filter the web pages, different versions of the pages do not have to be designed and updated. This scenario is similar to the research team. However, the corporate user wants the software to be easy to install and maintain, preferably on a common platform such as Windows NT or generic Un*x. The configuration is performed through a GUI. Advanced features may still require special technical knowledge.

2.1.4 Optional: ISP providing a QoS-aware gateway to the Internet

The scenario is similar to the corporate WWW service. The proxy must, however, be able to successfully filter most of the popular WWW pages available on the Internet (i.e. render usable output); otherwise it is merely an annoyance to its users.

2.2 Technology - Moxy server

The server on which the proxy software runs should be continuously connected to the network its clients are attached to, as well as to the network from which unfiltered HTML is fetched. Typically both networks are the same: the Internet. The networks must support TCP/IP. The Moxy server communicates with the Internet server using the HTTP protocol. HTTP is also used for the internal communication between the Moxy modules and the Squid server.

The proxy server will run on Linux, which is the target OS of the proxy, but could also be developed to run on other Unices and Windows NT.

The Moxy software will be integrated into an existing proxy server. Using an existing proxy server that provides proxy and cache functionality allows us to focus on the main functionality of Moxy. The Squid proxy server has been chosen to work together with the Moxy software. The Web cache server Squid is free, open-source software. Squid runs on Linux, Windows NT and other operating systems. [2] The Moxy software will be kept as independent as possible from the Squid proxy in order to minimize the changes needed in the Moxy software if the Squid proxy is upgraded or exchanged for some other proxy server. A redirector will be implemented in order to integrate the Moxy software and Squid.

2.3 Technology - Client

On the client side, a web browser supporting the use of a proxy is needed. The clients are mobile devices, such as personal digital assistants (PDAs), hand-held computers, laptops and wearable computers, as well as TV browsers. The proxy server communicates with the client using a proxy protocol.

2.4 Scenarios

2.4.1 Core: Scenario 1 - Moxy as a proxy

The Moxy software is integrated into a proxy server. The proxy server Moxy acts as an intermediary between the user of a mobile device and a Web server, receiving and forwarding traffic between these.

The Moxy server receives a request from a user to download a Web page. Moxy, acting as a client on behalf of the user, uses its own IP address to request the page from the WWW site. When the page is returned, the Moxy server relates it to the original request, streams the web page through a filter and forwards the streamed page on to the user.

The Moxy server keeps a copy of the returned web page in its cache. When Moxy receives a request from a user, it looks in its cache of previously downloaded pages. If the page is in the cache, Moxy returns the page to the user without forwarding the request to the Internet server.

To the user, the Moxy server is invisible; all Internet requests and responses appear to be exchanged directly with the Web server. The Moxy server is not truly invisible, however: its IP address and port have to be specified in the configuration of the browser. The client connects to the Moxy server using a proxy protocol.

2.4.2 Optional: Scenario 2 - Running Moxy on a client machine

In this scenario the Moxy software is run locally, that is, on the same computer as the browser. This could be useful for example in product development. The user can make personal filters for local use only.

2.4.3 Core: Scenario 3 - Running Moxy as a Web server

In this scenario the Moxy server is closely related to a Web server and maintained by the Internet Service Provider in charge of the Web server. Instead of designing different versions of a Web page located at the WWW server, Moxy is configured to use different filters for different types of terminals.

To the user, Moxy has the appearance of a Web server. When the user makes a request to download a Web page, Moxy forwards the request to the Web server. When Moxy receives the returned Web page, the page is streamed through a filter before it is forwarded to the client.

2.5 Data Flow

The following drawing illustrates the data flow between the functions of the system. Chapter 4 contains detailed descriptions of each function. The Moxy software is integrated into an existing proxy server.

When a user wants to download a web page, the client makes an HTTP-request, sending the URL of the requested page to the Moxy server. The client communicates with the server using a proxy protocol or the HTTP protocol, depending on whether the Moxy server appears to the client as a proxy or as a web server. The FrontDoor module handles the URL redirection of the HTTP-request. When FrontDoor receives an HTTP-request, it decides whether the requested URL shall be routed through the Moxy Content Filter Engine (MCFE) or sent directly to the client. A URL ending with "/", ".htm" or ".html" is routed to the MCFE through the proxy. Any other URL (.pdf, .ps etc.) is sent directly to the client and not redirected to the MCFE.

FrontDoor uses the Moxy Naming Service (MNS) module to figure out the address of the target machine from the request. MNS changes the URL to a Moxy-specific URL on the host where Moxy is running. The Filter Planner of the MNS determines which filters to use, and in which order to apply them to the page. Filtering parameters can be set by administrators and users. For example, in order to filter all the pages of a site with the parameter frames=remove, an administrator would specify in the MNS database that the parameter frames=remove should be appended to every URL from that site. Users can set the parameters either on the Settings Pages or by manually entering parameters after the URL in the Location bar of the browser. Filtering rules are read from configuration files.

The Moxy Naming Service also gives shortened alternatives for long URLs and translates a shortened URL back to the corresponding long one. Shortened URLs end with ".mns". The way Moxy handles URLs is described in a separate document, Link Processing in Moxy. MNS and FrontDoor communicate using Java Remote Method Invocation (RMI).
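The routing decision made by FrontDoor can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the FrontDoor implementation: the class FrontDoorRouting, the enum Route and the method routeFor are hypothetical names, and ignoring a query string or fragment when testing the URL ending is an assumption. Shortened ".mns" URLs are not covered by the sketch.

```java
/** Hypothetical sketch of the FrontDoor routing decision; all names are illustrative. */
final class FrontDoorRouting {

    /** Where a request goes: through the filter engine or straight through. */
    enum Route { MCFE, DIRECT }

    /**
     * URLs ending in "/", ".htm" or ".html" are routed to the MCFE,
     * every other URL is passed on directly.
     * Assumption: a query string or fragment is ignored when testing the ending.
     */
    static Route routeFor(String url) {
        String path = url;
        int cut = indexOfFirst(path, '?', '#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        if (path.endsWith("/") || path.endsWith(".htm") || path.endsWith(".html")) {
            return Route.MCFE;
        }
        return Route.DIRECT;
    }

    /** Returns the smallest non-negative index of either character, or -1 if neither occurs. */
    private static int indexOfFirst(String s, char a, char b) {
        int ia = s.indexOf(a);
        int ib = s.indexOf(b);
        if (ia < 0) return ib;
        if (ib < 0) return ia;
        return Math.min(ia, ib);
    }

    public static void main(String[] args) {
        System.out.println(routeFor("http://www.hut.fi/"));            // MCFE
        System.out.println(routeFor("http://www.hut.fi/index.html"));  // MCFE
        System.out.println(routeFor("http://www.hut.fi/paper.pdf"));   // DIRECT
    }
}
```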

The Get Document function of the proxy server connects to the remote WWW site on the Internet and gets the requested document from the WWW server. The HTML page is saved to disk in the proxy cache. The next time someone requests that page, the proxy server reads it from disk and can forward the page to the Moxy Content Filter Engine without connecting to the WWW server. HTML pages that have been filtered by the Moxy Content Filter Engine are also saved in the proxy cache. The next time a user requests the same URL filtered with the same parameters, the proxy gets the page from the cache and can transfer it to the user without filtering it through the MCFE.
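The idea that a filtered page is reusable only for the same URL and the same parameters can be illustrated with a small in-memory stand-in for the proxy cache. The sketch below is hypothetical: the class FilteredPageCache and its methods are invented for illustration, the real caching is left to the proxy server, and sorting the parameters by name to build the key is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the proxy cache of filtered pages. */
final class FilteredPageCache {
    private final Map<String, String> pages = new HashMap<>();

    /**
     * Builds a cache key from the URL and the filtering parameters.
     * Assumption: parameters are sorted by name, so their order does not matter.
     */
    static String key(String url, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(url);
        for (Map.Entry<String, String> e : new TreeMap<>(params).entrySet()) {
            sb.append('|').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    /** Stores a filtered page under its URL and parameters. */
    void put(String url, Map<String, String> params, String filteredHtml) {
        pages.put(key(url, params), filteredHtml);
    }

    /** Returns the cached filtered page, or null if the page has to be filtered again. */
    String get(String url, Map<String, String> params) {
        return pages.get(key(url, params));
    }
}
```

A request with the same URL but different parameters misses the cache and has to go through the MCFE again.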

The Moxy Content Filter Engine (MCFE) carries out the dynamic simplification of the HTML page. The Parse Document function converts the HTML document to a Page Content Tree (PCT), which is an internal data structure representing HTML pages during processing. The Filter Executor finds out which properties the filtered page shall have (no frames, no pictures etc.) from the parameters appended to the URL by the MNS module. The Filter Executor streams the parsed page through the filters in order to obtain the desired adapted page, and outputs a PCT page filtered according to the parameters and the current rules.

The difference between Moxy Internal Filters and External Filters is the format of the page being filtered. Moxy Internal Filters process PCT pages held in memory, whereas the pages processed by External Filters are in HTML format. Filters not written in Java, for example awk scripts, can be used with the MCFE through a wrapper, the Interface for External Filters. The page must be translated from the internal PCT data structure to HTML format before an External Filter can be applied to it.

The document being filtered may contain images that need to be resampled. After the internal Image Filter has resampled an image, the link specifying the image in the document points to the resampled image, and no longer to the original image on the remote WWW server. When the filtering is done, the Generate HTML-file function converts the page from PCT format back to an HTML document. The generated Moxy document is returned to the proxy server, which sends the client an HTTP-response.

[Figure: Moxy architecture and data flow (moxyarchitecture.gif)]

2.6 Tools

Moxy will be implemented in Java due to the wide acceptance and portability of the language. The main modules of Moxy will be implemented as servlets, using the Java Servlet Development Kit available from Sun. Servlets provide excellent infrastructure for web-enabled applications, allowing us to focus on the main functionality of Moxy.[3]

The Moxy software will be integrated into an existing proxy server, providing proxy and cache functionality. For this purpose the Squid proxy server has been chosen.[2] C++ will be used to extend Squid to work together with the Moxy software.

The Rule and Filter Store will consist of Prolog statements.

The Moxy Naming Service uses a MySQL database to keep track of Moxy parameters and URL redirections.

ERWIN from Logic Works will be used for the database definitions.

The UML diagrams for the documentation will be done with WithClass99 from MicroGold Software.[4]

The Image Filter will use the program ImageMagick to perform filtering operations on GIF and JPG images.[5]

In general, however, the choice of tools is free as long as the system will run on Linux. As an important characteristic of the project is prototyping and research, it is likely that some less conventional programming tools and languages will be used.

3. Data Structures

This section explains the data structures used in Moxy.

3.1 Page Content Tree (PCT)

The Page Content Tree (PCT) is a tree data structure used to store WWW objects. A WWW object is any form of data stream retrievable by an HTTP request, such as an HTML page or an image in GIF or JPG format. The PCT is used by Moxy to store WWW objects during the filtering process.

The level 1 nodes of the PCT correspond to WWW objects. One of the level 1 nodes may be the primary node, which is the node sent as the response when filtering is done. The other level 1 nodes can be either named or unnamed. Unnamed nodes are simply discarded after the filtering process, whereas named nodes are stored for future retrieval. Naming is handled through the MNS, which guarantees unique names for the objects.

The naming facility would typically be used by the page splitting filter, which splits long pages into a table of contents page and several subpages. As input the filter takes a PCT with one level 1 node, containing the long HTML page we want to split. The output of the filter is a PCT containing several level 1 nodes: the primary node is the new table of contents page that is returned, and each of the other level 1 nodes represents a subpage. The other nodes are named and automatically stored on the server where Moxy is running, which enables the filter to create links from the primary page to the subpages.

The PCT is also used to represent an HTML document during the filtering of the page. The nodes in the PCT are storage places for HTML elements. Each node has a type tag, any number of attribute=value pairs and child nodes. If the PCT represents an HTML document, the type tag of a node matches the HTML element the node represents, such as HTML, LI, IMG etc. Nodes can also have Moxy-specific type tags and attribute=value pairs.
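As a rough illustration of the node structure described above, a PCT node could be modelled in Java as follows. This is a minimal sketch, not the actual Moxy data structure: the class PctNode and its members are hypothetical, and primary and named level 1 nodes are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a Page Content Tree node; not the actual Moxy class. */
final class PctNode {
    final String typeTag;                                          // e.g. "HTML", "LI", "IMG"
    final Map<String, String> attributes = new LinkedHashMap<>();  // attribute=value pairs
    final List<PctNode> children = new ArrayList<>();              // child nodes in document order

    PctNode(String typeTag) {
        this.typeTag = typeTag;
    }

    /** Sets an attribute=value pair and returns this node for chaining. */
    PctNode attr(String name, String value) {
        attributes.put(name, value);
        return this;
    }

    /** Appends a child node and returns this node for chaining. */
    PctNode add(PctNode child) {
        children.add(child);
        return this;
    }

    /** Counts this node and all of its descendants. */
    int size() {
        int n = 1;
        for (PctNode c : children) {
            n += c.size();
        }
        return n;
    }

    public static void main(String[] args) {
        PctNode page = new PctNode("HTML")
                .add(new PctNode("BODY")
                        .add(new PctNode("IMG").attr("SRC", "logo.gif"))
                        .add(new PctNode("P")));
        System.out.println(page.size()); // 4
    }
}
```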

The figure below shows the general structure of the Page Content Tree. For specific examples, see sections 4.1.3, 4.1.4 and 4.1.9.

3.2 Rule and Filter Store (RFS)

The Rule and Filter Store (RFS) is a data structure containing the filtering rules and filters used by Moxy. The rules will probably be stated using some kind of logic notation, for instance (in Prolog):

stripped_page(X) :- frames(X,none), images(X,none), tables(X,none).

The list of available filters could look like this:

frames(X): code=FrameFilter.class, preconditions={Content-type=PARSED_HTML}, postconditions={Frames=none}

The RFS should be designed in such a way that it is possible to modify it "online", i.e. without restarting Moxy. The actual implementation structure of the RFS is dictated by the needs of the FPE (see 4.1.6).
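An entry in the list of available filters, with its preconditions and postconditions, could be represented in Java along the following lines. The sketch is an assumption about one possible representation, since the RFS structure is not yet designed: the class FilterDescriptor and its methods are hypothetical, and page properties are modelled as plain attribute=value strings.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical Java view of one entry in the list of available filters. */
final class FilterDescriptor {
    final String name;                          // e.g. "frames"
    final String code;                          // e.g. "FrameFilter.class"
    final Map<String, String> preconditions;    // what the page must look like beforehand
    final Map<String, String> postconditions;   // what the filter establishes afterwards

    FilterDescriptor(String name, String code,
                     Map<String, String> preconditions,
                     Map<String, String> postconditions) {
        this.name = name;
        this.code = code;
        this.preconditions = new HashMap<>(preconditions);
        this.postconditions = new HashMap<>(postconditions);
    }

    /** True if every precondition is matched by the given page properties. */
    boolean applicableTo(Map<String, String> pageProperties) {
        for (Map.Entry<String, String> e : preconditions.entrySet()) {
            if (!e.getValue().equals(pageProperties.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    /** Returns a copy of the page properties with the postconditions applied. */
    Map<String, String> propertiesAfter(Map<String, String> pageProperties) {
        Map<String, String> next = new HashMap<>(pageProperties);
        next.putAll(postconditions);
        return next;
    }
}
```

With the frames(X) entry shown above, a page with Content-type=PARSED_HTML satisfies the precondition and has Frames=none afterwards.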

3.3 Moxy Parameter List (MPL)

The Moxy Parameter List (MPL) is a list data structure used to store a list of parameters. The MPL is used by the Filter Planner and Executor (FPE) when selecting the filters to be used. The filters can also use these parameters.

The MPL is a list of entries, each of which has one attribute=value pair and a priority. There can be many instances of the MPL. They can be static (for example Moxy default settings) or dynamic.

3.4 Collection of Moxy Parameter Lists (CMPL)

The Collection of Moxy Parameter Lists (CMPL) is a list data structure used to store a list of Moxy Parameter Lists. The list could contain, for example: Moxy General Settings, Settings for specific URL, Personal settings for user A, Personal settings for user B, Personal settings for user C, Settings for specific User-Agent.

The CMPL is a list of entries, each consisting of an ownertype, an ownername and an MPL. The list is ordered by ownertype, which makes it easy to handle parameter overriding rules. There will be a mechanism to free entries that have not been used for a long time (for example the personal settings of a user who has logged out).
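The relationship between the CMPL, the MPLs and parameter overriding can be sketched in Java as below. Everything in the sketch is hypothetical: the class and field names, the owner types and their order, and in particular the overriding policy (a higher priority wins, and on equal priority the later owner type wins) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the CMPL and MPL structures; the override policy is an assumption. */
final class ParameterLists {

    /** Assumed owner types, in the order in which they are applied. */
    enum OwnerType { GENERAL, URL, USER, USER_AGENT }

    /** One MPL entry: an attribute=value pair with a priority. */
    static final class MplEntry {
        final String attribute;
        final String value;
        final int priority;

        MplEntry(String attribute, String value, int priority) {
            this.attribute = attribute;
            this.value = value;
            this.priority = priority;
        }
    }

    /** One CMPL entry: ownertype, ownername and the MPL of that owner. */
    static final class CmplEntry {
        final OwnerType ownerType;
        final String ownerName;
        final List<MplEntry> mpl;

        CmplEntry(OwnerType ownerType, String ownerName, List<MplEntry> mpl) {
            this.ownerType = ownerType;
            this.ownerName = ownerName;
            this.mpl = mpl;
        }
    }

    /** Merges the CMPL into one attribute -> value map using the assumed policy. */
    static Map<String, String> effective(List<CmplEntry> cmpl) {
        List<CmplEntry> ordered = new ArrayList<>(cmpl);
        ordered.sort(Comparator.comparing((CmplEntry e) -> e.ownerType));
        Map<String, MplEntry> winners = new HashMap<>();
        for (CmplEntry owner : ordered) {
            for (MplEntry e : owner.mpl) {
                MplEntry current = winners.get(e.attribute);
                if (current == null || e.priority >= current.priority) {
                    winners.put(e.attribute, e);
                }
            }
        }
        Map<String, String> result = new HashMap<>();
        for (MplEntry e : winners.values()) {
            result.put(e.attribute, e.value);
        }
        return result;
    }
}
```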

The figure below shows an example of the CMPL and MPLs.

3.5 Parameter Store (PS)

The Parameter Store (PS) is a data structure containing the MPL parameters. It is basically a configuration file format, from which Moxy can read content into the MPL data structure.

4. Functions

This section lists the functional specifications. The functions are divided into core, base and optional sections. 

4.1 Core

4.1.1 Get Document

Function     Get Document
Description  The proxy server gets the requested document from an Internet server
Input data   HTTP-request (URL) from client
Output data  Document file (Raw data)
             URL requested
             An MPL structure, containing Proxy Additional information
             An MPL structure, containing Client parameters from the HTTP-request

The document is retrieved from a web service and saved to disk. The file consists of raw data. There will be another file containing additional data returned by the web server. The files generated will probably look like this:

First file (Raw Data)

GIF87aÔq...
...
... <RAW DATA>

Second file (Proxy Additional information)

HTTP/1.1 200 OK
Date: Sun, 24 Oct 1999 18:59:36 GMT
Server: Apache/1.3.9 (Unix) mod_perl/1.21
Cache-Control: max-age=-130189
Expires: Sat, 23 Oct 1999 06:49:47 GMT
Last-Modified: Fri, 22 Oct 1999 07:49:47 GMT
ETag: "bd36-10353-3810171b"
Accept-Ranges: bytes
Content-Length: 66387
Connection: close
Content-Type: image/gif

This additional data can be used for caching purposes. It also tells the Moxy Content Filter Engine the type of the file.
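Header data of this kind consists of "Name: value" lines, which map naturally onto attribute=value pairs. The following Java sketch shows one way such a block could be split into pairs; the class HeaderParser and the method parse are hypothetical, and how the pairs are then stored in an MPL (for example their priorities) is not covered.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that splits a block of HTTP header lines into attribute=value pairs. */
final class HeaderParser {

    /**
     * Parses "Name: value" lines into an ordered map.
     * Lines without a colon, such as the status line "HTTP/1.1 200 OK", are skipped.
     */
    static Map<String, String> parse(String headerBlock) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : headerBlock.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue;
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            result.put(name, value);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> headers = parse(
                "HTTP/1.1 200 OK\nContent-Length: 66387\nContent-Type: image/gif\n");
        System.out.println(headers.get("Content-Type")); // image/gif
    }
}
```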

A MPL structure of this data would look like

The third output (Client parameters) of this function will look like this:

Referer: http://www.msn.com/
Proxy-Connection: Keep-Alive
Accept-Encoding: gzip, deflate
Cookie: MC1=V=2&GUID=040BA0C1C07411D289C60008C7D9E3DA; mh=MSFT; HMP1=1; F=1; HMCMISC=MINDIF=27; MC1=V=2&GUID=442FDEAECD2C40DD903FA623D097CD22
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)
Accept-Language: fi

A MPL structure of this data would look like

4.1.2 Redirector

Function     Redirector
Description  The Redirector is called for every URL requested by the user. It determines whether the data has to be fetched from the original source or via the Moxy Content Filter Engine
Input data   URL requested
Output data  Modified URL

The Redirector is implemented in the MNS and called from FrontDoor.

If the content type is HTML, the Redirector contacts the Moxy Content Filter Engine and sends all the parameters needed for content filtering.
If the content type is anything other than HTML (gif, jpg, ...), the document is fetched from the original source.

4.1.3 Generate Page Content Tree

Function     Generate Page Content Tree
Description  To generate the Page Content Tree from an HTML page
Input data   A raw HTML page in text format, HTTP info from the Get Document function
Output data  A basic Page Content Tree data structure for the page

The purpose of this function is to convert the raw HTML page to a PCT, which is the data structure used by Moxy to represent HTML pages during processing. The PCT generated will probably look approximately like this:

4.1.4 HTML Parser

Function     To parse an HTML page
Description  The HTML parser parses an HTML page, producing a PCT corresponding to the page structure
Input data   A PCT node, with Content-type=UNPARSED_HTML and as Content the HTML page as a string
Output data  A PCT corresponding to the hierarchical tagging structure of the HTML page

Special care needs to be taken to make the parser as insensitive as possible to HTML syntax errors: since most browsers display defective HTML without complaint, many errors in pages go undetected. Below is a sample PCT for a parsed page

4.1.5 Parse Moxy Parameters

Function     Parse Moxy Parameters
Description  The Moxy parameters (bandwidth, screen capabilities, debug options etc.) are collected from various sources
Input data   A PCT tree containing a pageset
             Client parameters and URL from the function Get Document
             Moxy configuration files (on CMPL)
Output data  List of the Moxy Parameters (MPL)

The main goal of this function is to collect all parameters set on different levels: Moxy default settings, Moxy general site-related settings, Moxy parameters found in an HTML page and parameters specified by the client user.

The CMPL is a collection of all parameters found on the different levels. If the list is empty, the Moxy default settings and general site-related settings are read from the Parameter Store into the CMPL. This means that there will be a few MPLs that are always sorted at the head of the CMPL.

If Moxy parameters are defined on a fetched web page, one MPL entry is inserted for them into the CMPL.

If the client that made the URL request is known, its personal parameters are already in the CMPL. If the client is unknown, its personal Moxy parameters are read (from a cookie) into the CMPL.

If the client has personal settings for the site that the URL points to, these settings are also read into the CMPL.

When the CMPL contains all static and dynamic MPL entries, we start generating the MPL entry that is used by the Moxy Content Filter Engine. We go through the CMPL entries one by one and do the following things

4.1.6 Filter Planner and Executor (FPE)

Function     To apply filtering rules to the page.
Description  The FPE determines which filters to apply to the page, and in which order. It also applies the filters to the page
Input data   A PCT tree containing a pageset.
             Current set of filtering rules
             An MPL containing the current set of parameters for the page
Output data  A PCT tree containing a pageset filtered according to the parameters and the current rules.

The parameters supplied tell the FPE which properties the filtered page shall have (e.g. no frames, no tables, size preferably less than 10 kbytes).

The FPE also has a current set of rules, which describe what filters are available and their effects on the page (postconditions), as well as what tags and attributes are required for the filter to work (preconditions).

Using this information, the FPE determines which filters need to be applied to a page, as well as the order of application. The filtering procedure selected should be optimal (in some sense) with respect to the desired result. A filtering sequence leading to an ideal result may not always be available.

Possibly, some filters will be nondeterministic, i.e. their exact outcome is not always known in advance, which adds interaction between the planning and executing stages. In such cases the filtering plan must be updated after a filter has run.

For instance, when trying to reduce the size of a page below a certain limit, one may run the "Remove JavaScript" filter without knowing in advance how much the page will shrink. If the page meets the size goal after the "Remove JavaScript" filter, filtering is done; otherwise another potentially size-reducing filter has to be run.
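The size-goal example can be sketched in Java as a simple loop that checks the goal again after every filter run. This is not the FPE planning algorithm, which is not yet designed, but only an illustration of the interaction between executing a filter and deciding what to do next. The class SizeGoalExecutor and its members are hypothetical, the page is modelled as a plain string, and its size is measured in characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of re-checking a size goal after each filter run. */
final class SizeGoalExecutor {

    /** A named filter that transforms the page text. */
    static final class Filter {
        final String name;
        final UnaryOperator<String> action;

        Filter(String name, UnaryOperator<String> action) {
            this.name = name;
            this.action = action;
        }
    }

    /** The filtered page and the names of the filters that were actually run. */
    static final class Result {
        final String page;
        final List<String> applied;

        Result(String page, List<String> applied) {
            this.page = page;
            this.applied = applied;
        }
    }

    /**
     * Runs the candidate filters in the given order, but only for as long as
     * the page is still longer than maxLength. The outcome of each filter is
     * not known in advance, so the goal is checked before every run.
     */
    static Result run(String page, int maxLength, List<Filter> candidates) {
        List<String> applied = new ArrayList<>();
        String current = page;
        for (Filter f : candidates) {
            if (current.length() <= maxLength) {
                break; // size goal met, no further filtering needed
            }
            current = f.action.apply(current);
            applied.add(f.name);
        }
        return new Result(current, applied);
    }
}
```

If the goal is still not met after the last candidate, the result simply contains the smallest page obtained, which corresponds to the case where an ideal filtering sequence is not available.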

4.1.7 Filter and Rule Manager (FRM)

Function     To specify and modify the rules and filters in the Rule and Filter Store (RFS)
Description  The Filter and Rule Manager allows the user to specify and modify the rules and filters used by Moxy through some suitable interface (e.g. text configuration file or GUI)
Input data   User commands specifying the rules and filters to be used. These commands may originate from e.g. a configuration file or a GUI.
Output data  An RFS (or a modification of the current RFS), to be used by the FPE

If time and resources are limited, the FRM will probably be implemented using configuration files. However, in terms of ease of use and usability a GUI, like the one sketched below, is desirable. Necessarily, the sketches are quite crude as the rule and filter architecture is not yet designed.

GUI sketch for specifying available filters

GUI sketch for specifying rules

4.1.8 Moxy Internal Get Document

Function     Moxy Internal Get Document
Description  The Moxy Content Filter Engine gets the requested document from an Internet server
Input data   URL (HTTP-request)
Output data  A PCT node, with the retrieved data as Content
             Proxy Additional information
             Client parameters from the HTTP-request

This function acts just like the Get Document function described in section 4.1.1. The only difference is that this function is implemented inside the Moxy Content Filter Engine and the raw data is loaded directly into memory. This function is used, for example, if the page contains images that need to be converted.

4.1.9 HTML Unparse

Function     HTML Unparse
Description  HTML page is regenerated from the internal data structure PCT
Input data   A PCT corresponding to the hierarchical tagging structure of the HTML page
             Pretty Print options
Output data  A PCT node, with Content-type=UNPARSED_HTML and as Content the HTML page as a string

When the Filter Planner and Executor (FPE) has applied all the filters needed, the PCT containing a pageset has to be converted back to HTML format before the page is returned to the client.

The page is regenerated from the PCT by traversing the tree and writing all the information of the nodes to the output stream. If the Pretty Print option is set, the output is written in a layout that is easy to read (for example when monitoring the Moxy Content Filter Engine). This function is the inverse of the HTML Parser.
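
A minimal sketch of such a traversal is given below. The `Element` class is a hypothetical stand-in for a PCT node (the real PCT is defined in section 3.1), and the sketch ignores details such as character escaping and elements without end tags.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """Hypothetical stand-in for a parsed HTML node in the PCT."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def unparse(node: Union[Element, str], pretty: bool = False, depth: int = 0) -> str:
    """Write a node and its subtree back to HTML text, depth first."""
    indent = "  " * depth if pretty else ""
    newline = "\n" if pretty else ""
    if isinstance(node, str):  # text node
        return f"{indent}{node}{newline}"
    attrs = "".join(f' {k}="{v}"' for k, v in node.attrs.items())
    body = "".join(unparse(c, pretty, depth + 1) for c in node.children)
    return f"{indent}<{node.tag}{attrs}>{newline}{body}{indent}</{node.tag}>{newline}"
```

With `pretty=True` each node is written on its own line, indented by its depth in the tree; with the default setting the output is compact, which suits the goal of keeping the page small.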

4.1.10 Output Page Content Tree

Function Output Page Content Tree
Description The regenerated HTML page held in the PCT node is written to an HTTP response
Input data A PCT node, with Content-type=UNPARSED_HTML and as Content the HTML page as a string
Output data An HTTP response

This function returns the Content of the node as an HTML file in an HTTP response. If there is more than one page in a pageset (for example after frame removal), only the main page is sent.

4.1.11 Core Filters

4.1.11.1 Image Filter

Function A filter for image removal and resampling
Description The filter can adjust the size (pixels) of the image, its compression ratio, and color depth
Input data Image in GIF or JPG image formats 
A PCT containing a parsed HTML page
Parameters specifying desired image processing 
Output data PCT with images modified

The image filter changes the links in the filtered page to point to the filtered images. When an image has been resampled, the link specifying the image in the document points to the resampled image, and no longer to the original image on the remote web server.
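
The link rewriting could look roughly like the sketch below. It works on an HTML string instead of a PCT, and the proxy URL prefix passed in is a made-up example; how Moxy actually addresses resampled images is not specified here.

```python
import re
from urllib.parse import quote

# Matches the src attribute of an <img> tag (double-quoted values only).
_IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)


def rewrite_image_links(html: str, proxy_prefix: str) -> str:
    """Point every image link at a (hypothetical) proxy URL for the resampled copy."""
    def repl(match: "re.Match[str]") -> str:
        original = match.group(2)
        # The original URL is percent-encoded and carried as a parameter.
        return match.group(1) + proxy_prefix + quote(original, safe="") + match.group(3)
    return _IMG_SRC.sub(repl, html)
```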

4.1.11.2 Frame Removal Filter

Function Removes frames from an HTML page
Description Frames are removed from HTML pages by creating a page for each frame or concatenating several frames into a single page. Links between the original frames are converted appropriately.
Input data A PCT containing parsed HTML. Possibly parameters acting as arguments for the frame removal process.
Output data A PCT containing the filtered frameset (e.g. a page for each frame or a single page containing many frames). 

4.1.11.3 Comment Tag Removal Filter

Function Comment Tag removal.
Description Removes comment tags (<!-- cmnt -->) from an HTML page. This is done to optimize page size.
Input data A PCT containing parsed HTML (comment tags easily available). Possibly some parameters (e.g. to avoid removing JavaScript, which is inside comment tags!).
Output data A PCT containing a parsed HTML page without comments.
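
The caveat about JavaScript hidden inside comment tags can be illustrated with the sketch below. It operates on an HTML string with regular expressions purely for illustration (the real filter would work on the PCT), and the function name is hypothetical.

```python
import re

# Either a whole <script> block (captured, to be kept) or a comment (dropped).
_SCRIPT_OR_COMMENT = re.compile(
    r"(<script\b.*?</script>)|<!--.*?-->", re.IGNORECASE | re.DOTALL
)


def strip_comments(html: str) -> str:
    """Remove HTML comments but leave the contents of <script> blocks untouched."""
    return _SCRIPT_OR_COMMENT.sub(lambda m: m.group(1) or "", html)
```

Because script blocks are matched as a whole before any comment inside them is seen, script code wrapped in comment tags survives the filter.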

4.1.11.4 JavaScript Removal Filter

Function Removes JavaScript from a page.
Description Removes embedded JavaScript from pages in order to optimize page size. Especially useful for browsers that do not support JavaScript anyway (page size reduction). An advanced filter may try to find embedded links in the JavaScript code and include them as standard links in the processed page.
Input data A PCT containing parsed HTML. Possibly parameters controlling the operation of the filter.
Output data A PCT containing a page without JavaScript.

4.1.11.5 Junk Filter

Function To extract the most important elements from a page.
Description When using slow links and web browsers with limited display capability, a filter that picks the elements currently of interest to the user is useful. The user could, for instance, decide not to read advertisements on a web page.
Input data A PCT containing a parsed HTML page and perhaps some extra tagging, e.g. Moxy-specific tags that indicate prime content.
Output data A PCT containing the "essential" page data

The definition of this filter is by necessity rather loose at this stage. The general idea is that, using some suitable heuristics, and possibly with aid of extra HTML tagging (present in the original page, or generated by some other filter), the filter should be able to pick the "prime content" of the HTML page.

4.1.11.6 Font Filter

Function Modifies and removes font tags
Description The Font Filter can be used for the following purposes:
  • To modify the 'size' attribute of <font> elements
  • To modify the font size with Style Sheets
  • To remove font elements such as <font>, <basefont>, <small> and <big>
  • To remove Style Sheets
  • To convert colors to black and white on grayscale devices
Input data A PCT containing parsed HTML
Output data A PCT containing parsed HTML with font tags modified

It might be desirable to remove font tags from pages even though the browser supports them. This includes cases where the page contains an excessive number of tags, resulting in a large size overhead (MS Word is known for putting font tags everywhere), and cases where the colors used make the page unreadable on grayscale devices.

Another function is limiting font size variation. Pages containing everything from 72 pt headings to 8 pt text are likely to be more readable on a limited screen if the size range is limited to e.g. 8-14 pt. Still another use is an overall upscaling of the font size, for example for disabled persons, or for screens on which an 8 pt font is simply too small.
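
Limiting the size range amounts to clamping each font size into an interval. The sketch below applies this to `font-size` declarations given in points in a style sheet string; the 8-14 pt default comes from the example above, while the function name and the restriction to `pt` units are assumptions made for brevity.

```python
import re

# Matches declarations such as "font-size: 72pt" (point units only).
_FONT_SIZE_PT = re.compile(r"(font-size\s*:\s*)(\d+(?:\.\d+)?)pt", re.IGNORECASE)


def limit_font_sizes(css: str, min_pt: float = 8, max_pt: float = 14) -> str:
    """Clamp every point-based font size into the range [min_pt, max_pt]."""
    def repl(match: "re.Match[str]") -> str:
        size = float(match.group(2))
        clamped = max(min_pt, min(max_pt, size))
        return f"{match.group(1)}{clamped:g}pt"
    return _FONT_SIZE_PT.sub(repl, css)
```

An overall upscaling could be sketched the same way by multiplying `size` by a factor before clamping.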

4.1.11.7 Page Splitting Filter

Function Splits a page into a table of contents and subpages
Description Splits a page into a table of contents and subpages, using internal link anchors from the table of contents
Input data A PCT containing parsed HTML 
Output data A PCT containing a parsed HTML main page and subpages

The filter splits a page into a page collection, consisting of a main page (table of contents) and subpages. The filter marks which page in the collection is the main page. The main page contains links pointing to the subpages. The page may be split into chapters based on header tags or internal anchor links.
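
As a rough sketch of splitting by header tags, the function below cuts an HTML body at each `<h2>` heading and builds a main page with a link to every subpage. The choice of `<h2>`, the subpage naming scheme and the `"main"` key marking the main page are all illustrative assumptions; the real filter would work on the PCT.

```python
import re
from typing import Dict

_HEADING = re.compile(r"<h2[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)


def split_page(body: str, base_name: str = "page") -> Dict[str, str]:
    """Split a page at <h2> headings into a main page (key "main") and subpages."""
    matches = list(_HEADING.finditer(body))
    if not matches:
        return {"main": body}  # nothing to split on
    pages: Dict[str, str] = {}
    toc = []
    for i, m in enumerate(matches):
        # A chapter runs from its heading up to the next heading (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        name = f"{base_name}-{i + 1}.html"
        pages[name] = body[m.start():end]
        toc.append(f'<li><a href="{name}">{m.group(1).strip()}</a></li>')
    # Main page: any text before the first heading, then the table of contents.
    pages["main"] = body[:matches[0].start()] + "<ul>" + "".join(toc) + "</ul>"
    return pages
```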

4.1.11.8 Table Filter

Function Table filter 
Description A filter for handling tables on non-table capable browsers 
Input data A PCT containing parsed HTML 
Output data A PCT containing a parsed HTML page without tables

The table filter shall extract the content from a table in logical order or in a content provider specific manner. The filter arranges the content into a suitable form, such as lists and subtitles. The content provider may also give information about the content order with custom HTML tags (see 4.2.2), "flow tags", which specify the order of the content.
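
One simple reading of "logical order" is row by row, left to right. The sketch below linearizes the cells of a table into a list in that order using Python's standard `html.parser`; it ignores nested tables, markup inside cells and any "flow tags", and the names used are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class _CellCollector(HTMLParser):
    """Collects the text of <td> and <th> cells in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def table_to_list(table_html: str) -> str:
    """Replace a table by an unordered list of its non-empty cells."""
    collector = _CellCollector()
    collector.feed(table_html)
    collector.close()
    items = "".join(f"<li>{text.strip()}</li>" for text in collector.cells if text.strip())
    return f"<ul>{items}</ul>"
```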

4.2 Base

4.2.1 A monitor for rule execution

Function Monitor for rule execution 
Description Function for monitoring intermediate stages and filters used. 
Input data Intermediate stage PCTs, MPL
Output data List of filters used, intermediate stages (PCTs). 

With the rule execution monitor, administrators and developers may follow rule execution, get the results after each stage, and use it, for example, as a debug tool. The monitor may produce as its result a main HTML page with links to subpages containing the intermediate stages. All intermediate stages could be stored in WPS. The following screenshot is an example of what the monitor could look like.

4.2.2 QoS control through custom HTML tags

Function QoS control through custom HTML tags
Description Custom HTML tags may be used to give the proxy instructions on what elements of the page to include/exclude, or information about the order of the elements
Input data Custom tags
Output data  

The content provider may use custom HTML tags in order to give the proxy instructions on what elements of the page to include/exclude at different QoS levels.

The META element of HTML lets authors specify meta data, which is information about a document as opposed to document content. The META element can be used to identify properties of a document and assign values to those properties. The NAME attribute identifies the property and the CONTENT attribute specifies the property's value.

For example, the following declaration sets a value for the quality property:
<META name="moxy" content="quality=low">

Custom HTML tags may also be used to inform filters about the order of the elements. For example, "flow tags" may be used for tables to specify the order in which the content must be filtered.
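
Reading such a META declaration could be sketched as below, using Python's standard `html.parser`. Only the `name="moxy"` convention and the `quality=low` example come from this section; allowing several `key=value` pairs separated by `;` in the CONTENT attribute is an assumption borrowed from the URL syntax in 5.1.1.3.

```python
from html.parser import HTMLParser
from typing import Dict


class _MoxyMeta(HTMLParser):
    """Collects key=value pairs from <META name="moxy" content="..."> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.params: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "moxy":
            return
        # Assumed syntax: pairs separated by ';', e.g. "quality=low;frames=remove".
        for pair in (attributes.get("content") or "").split(";"):
            key, sep, value = pair.partition("=")
            if sep:
                self.params[key.strip()] = value.strip()


def read_moxy_meta(html: str) -> Dict[str, str]:
    """Return the Moxy properties declared in the META elements of a page."""
    parser = _MoxyMeta()
    parser.feed(html)
    parser.close()
    return parser.params
```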

4.2.3 Caching

Function Caching elements 
Description Moxy engine may use caching for filtered elements. 
Input data Filtered element to store in the cache
Output data Filtered element from cache

Caching is used to improve performance. By caching filtered elements, the Moxy engine may reuse these elements when needed. For example, a filtered picture with lower resolution may be cached for later reuse.
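
A minimal in-memory sketch of this idea follows. It assumes that a filtered element is identified by its URL together with the filtering parameters used; the class and method names are hypothetical, and a real cache would also need size limits and expiry.

```python
from typing import Callable, Dict, Tuple


class FilteredElementCache:
    """Keeps filtered elements keyed by source URL and filter parameters."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, Tuple[Tuple[str, str], ...]], bytes] = {}
        self.hits = 0

    def get_or_create(self, url: str, params: Dict[str, str],
                      produce: Callable[[], bytes]) -> bytes:
        # The same URL filtered with different parameters is a different entry.
        key = (url, tuple(sorted(params.items())))
        if key in self._store:
            self.hits += 1
            return self._store[key]
        value = produce()  # run the (expensive) filter only on a cache miss
        self._store[key] = value
        return value
```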

4.2.4 User accounts

Function User account recognition.
Description Proxy can keep information about user settings and recognize the user by address and terminal type.
Input data Client parameters and URL from the function Get Document
Output data User's MPL 

The Moxy engine should have the ability to keep separate client user accounts with user-specific configuration information. If the client user is unknown, the user's personal Moxy parameters are read (from a cookie) into the CMPL. Users may also set the configuration information through the GUI (see 5.1). If the user has personal settings for the site that the URL points to, these settings are also read into the CMPL.

5. External Interfaces and Connections

5.1 User Interfaces

There are two main types of users: web surfers and technically skilled server administrators. A base requirement of the proxy is that administrators with only moderate technical knowledge should also be able to maintain the Moxy server. Another type of user is HTML content producers. The following scenarios describe the interfaces of each user type. The illustrations are quite crude since the interfaces have not been designed yet.

5.1.1 Core: Web Surfer Graphical User Interface

5.1.1.1 Transparent web surfing

On the client side, transparency and a high level of usability are important.

To the web surfer, the Moxy server is transparent. When the web surfer wants to download a web page he specifies the URL of the page, for example http://www.inet.fi/index.htm, just as he does when the proxy is not used. The request for a web page goes to the WWW server through the Moxy server, which filters the web page automatically. What should be filtered is determined by the administrators of the Moxy server. The Moxy server is not quite invisible, however: the web surfer must specify the IP address of the Moxy server in the configuration of the browser.

The usability of the web pages is increased due to the "simplification" of the pages. Web pages look different on mobile devices compared to workstations; for example, pictures and JavaScript are removed. Hyperlinks work the way the web surfer is used to. On each page there are additional control hyperlinks.

An example of a simplified web page. There are Moxy control hyperlinks at the top and the bottom of the page.

5.1.1.2 Personal Moxy Settings

The Moxy control hyperlinks on every web page lead to personal settings pages, where the web surfer can increase or decrease the level of QoS. On the settings page, the web surfer can select different filtering options, such as getting the unfiltered page and decreasing/increasing the amount of filtering performed.

There are two settings pages: a Personal Moxy Settings page and a Personal Moxy Settings for Site page. When the web surfer changes the settings on the Personal Moxy Settings page, all the web pages he browses from then on are filtered with the same filtering options. The web surfer must return to the Personal Moxy Settings page to change the settings. On the Personal Moxy Settings for Site page, the web surfer can change the settings only for web pages of a specified site. The settings pages rely on the Moxy server.

An example of a Personal Moxy Settings page. All the web pages browsed are filtered according to the selected settings. The user is currently selecting the bandwidth parameter.

An example of a Personal Moxy Settings for Site page. Only the web pages of the site http://www.inet.fi are filtered according to the selected settings.

5.1.1.3 Low-level GUI

The web surfer manually enters a URL of the form http://www.iltalehti.fi/$moxy=bandwidth=low;frames=remove, where $moxy=bandwidth=low;frames=remove is interpreted by Moxy as parameters specified by the client user.
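
Separating the parameters from the URL could be sketched as follows. The `$moxy=` marker and the `;`-separated `key=value` pairs follow the example above; the function name and the handling of malformed pairs (they are skipped) are assumptions.

```python
from typing import Dict, Tuple

MARKER = "$moxy="


def split_moxy_url(url: str) -> Tuple[str, Dict[str, str]]:
    """Return the plain URL and the Moxy parameters appended to it, if any."""
    base, marker, raw = url.partition(MARKER)
    params: Dict[str, str] = {}
    if marker:
        for pair in raw.split(";"):
            key, eq, value = pair.partition("=")
            if eq and key:  # skip malformed pairs without a key or '='
                params[key] = value
    return base, params
```

A URL without the marker is returned unchanged with an empty parameter set, so ordinary requests are not affected.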

5.1.2 Core: Administrator Text Based User Interface

The text based user interface of the Moxy server will be designed for technically skilled server administrators, but must still be clear and straightforward. The core administrator user interface is a text configuration file. With user commands, the administrators can specify and modify the rules and filters used by Moxy. The rules of the proxy are flexible, yet simple enough for the administrators to feel comfortable using them.

5.1.3 Base: Administrator Graphical User Interface

The graphical user interface of the Moxy server will be designed for administrators with moderate technical knowledge. The GUI should be clear and user-friendly. The configuration is performed through the GUI, which provides access to the most important filtering features.

An example of an Administrator GUI for specifying available filters

An example of an Administrator GUI for specifying rules.

5.1.4 Base: Debug Tool GUI

Implementing a monitor for rule execution that can be used as a debug tool by administrators and developers is a base requirement. Administrators can see what filters were used on a certain page, and what the page looks like after each filter has been applied.

An example of a rule execution monitor GUI. The user is selecting which filter to apply to the page. With the buttons at the top of the page the user can go through the subpages containing the intermediate stages.

5.2 Data Communications

The server on which the Moxy software runs must be continuously connected both to the network its clients are attached to and to the network hosting the Internet server from which unfiltered HTML is fetched. Typically both networks are the Internet. The network must support TCP/IP. The Moxy server communicates with the Internet server using the HTTP protocol. In the communication with the client, a proxy protocol or the HTTP protocol is used, depending on whether Moxy is running as a proxy or as a web server.

6. Additional Features

The main requirements on the software architecture are that it is simple and clear, easily extensible and well documented in order to facilitate future support. Another main requirement of the system is that the proxy can be used by persons with varying technical skills.

The proxy must be able to handle at least 10 users at the same time. This restriction should not be due to the architecture, but to the slowness of the tools or languages used. The architecture shall be designed as if the system would be used by hundreds of users at the same time. Testing should be performed that indicates what the actual maximum number of simultaneous users of the system is. There are no other requirements on speed and user load.

The proxy will start cleanly after a crash, and any persistent information (such as the proxy cache) must not be corrupted in the event of a crash.
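
One common way to keep a persistent file from being left half-written after a crash is to write the new contents to a temporary file and then rename it over the old one. The sketch below shows this general technique; it is not taken from the Moxy design, and the function name is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` so readers see either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file is created in the same directory so the rename stays
    # on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data has reached the disk
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```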

Secure connections, such as HTTPS or SSL, are not supported.

The Moxy software will be developed to run on Linux. The software could be modified to run on Unices (Base requirement) and Windows NT (Optional feature).

7. References

[1] Mäntylä M,  Tik-76.115 Aihe-esite: Adaptive WWW Proxy Server for Mobile Communication
http://mordor.cs.hut.fi/tik-76.115/99-00/aiheet/go-www-proxy.htm

[2] Squid Web Proxy Cache
http://squid.nlanr.net/

[3] Sun Microsystems, JAVA TM SERVLET API
http://java.sun.com/products/servlet/index.html

[4] WithClass99
http://www.microgold.com

[5] ImageMagick
http://www.wizards.dupont.com/cristy/ImageMagick.html


{$Id: to.html,v 1.116 2000/02/16 02:50:31 abernas Exp $}