Mauro Marinilli's FilterGus Project


Hi! The FilterGus Project

To recap from the previous page, FilterGus is a simple Java program that filters any kind of textual document against profiles of the user's actual interests.

A deeper look at the System

Theoretically..

It is an Information Filtering (IF) program that takes a somewhat innovative approach to the task. It is written as a Java applet and, by the way, it should be the very first fully working applet for this task publicly released over the Internet.
This little program was motivated by curiosity: I wanted to see whether some ideas I had during my thesis work (on a much more elaborate IF system) were a practical way to cover some aspects of the IF task, such as personal filtering. That said, this program and its algorithms have little or nothing in common with my thesis project; they were developed purely out of personal interest. In the following description I will try to avoid the technical jargon of the IF field.
The key idea was to explore the use of RL grammars (and their fast analyzers) for the problem of IF, particularly personal IF. What is better than a little program embedded in a browser for this task? The bet is to use this framework (revisited ad hoc anyway) to cope with the morphological variations of natural languages (don't think of English!), extending it to increase the effectiveness of the filtering process. The scanning algorithm devised for this kind of application was fancifully called the Matching Tree Algorithm, so we can boast that FilterGus uses an MTA technique. I designed it especially for this task, aiming for a mechanism that works at the lowest possible level. So far it has been implemented only halfway.

Practically..

This simple program starts up when you load (locally) the starting HTML page. Give it a URL and it filters the document and returns a score based on a profile you have previously loaded.
It's useful to people who need to search through a great number of documents. It can save a lot of time! For example, if you need to run a big search over a lot of documents, maybe from a search engine, it could be worth the effort to write a profile, i.e. a file that describes to FilterGus what you're looking for. In the current release a profile is just a list of words, with some attributes such as the score awarded on a match or the allowed suffixes. But this tiny program lets you do a lot more: with the right profiles it can filter every kind of textual document, not only HTML, although for this format (and other hypertext formats) it offers special searching abilities. Read on if you're interested in how it works and how to write your profiles.

The Profile Structure

The Profile is made up of XML 1.0 files with the .gus suffix, so you can edit them with your text editor as long as the right syntax is maintained. These files define a kind of language that can teach 'Gus many things about the documents he's going to parse for you. You can point out particularly important areas of a document, for example the title, and emphasize the words found there using the dw="" attribute. You can also express the importance of a word by giving it a score, using the cw="" attribute. (You don't need to read the following description: FilterGus can do all the work for you without any knowledge of what happens behind the scenes.)
So, an element in a .gus file looks like this:
<E WORD="word" DW="x" CW="y" >
The final score is normalised so that it can be negative but is never greater than 100. It is made up of the partial sums of all matching words whose cw is different from zero.
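To make the roles of cw and dw a little more concrete, here is a minimal Java sketch of how such a score could be accumulated. It is not FilterGus's actual code: the class ScoreSketch, the Match type, the choice of multiplying cw by dw, and the simple cap at 100 are all assumptions made for illustration, since the real normalisation is not spelled out here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the real FilterGus scoring code.
final class ScoreSketch {

    // One matched word: its cw weight and the dw weight of the document area it was found in.
    static final class Match {
        final double cw;
        final double dw;

        Match(double cw, double dw) {
            this.cw = cw;
            this.dw = dw;
        }
    }

    // Sums cw * dw over the matches whose cw is non-zero (assumed combination rule),
    // then caps the total at 100. Negative totals are returned unchanged.
    static double score(List<Match> matches) {
        double total = 0.0;
        for (Match m : matches) {
            if (m.cw != 0.0) {
                total += m.cw * m.dw;
            }
        }
        return Math.min(total, 100.0);
    }

    public static void main(String[] args) {
        List<Match> matches = Arrays.asList(
                new Match(10, 1),  // ordinary body text
                new Match(5, 2),   // word found in an emphasized area, e.g. the title
                new Match(0, 3));  // cw is zero, so it does not contribute
        System.out.println(score(matches));
    }
}
```

With the three matches in main, the sketch adds 10 and 10 and ignores the zero-weighted word. A profile with negative cw values would pull the total below zero, which this sketch leaves as it is.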
Then, you can use many word variants in one expression, obtaining more or less the same result as a stemmer. (This point was interesting as a test of whether this different approach could allow both good Precision and Recall at the same time while dealing with morphological variations; but the real test shouldn't be performed only on English, which is too coarse in this respect.)
In this early version, profiles are extracted from a document when you push the "feedback" button. So far this is only a sketched functionality: it reports all the non-matching words, with a standard score and without caring about suffixes. For now it's up to you to edit the result manually if you want very high quality.
Speaking of sophisticated profile handling, remember that profiles can be layered, so you can treat them like Java classes, for example, reusing them as much as possible. If you're going to make a profile about some particular hardware component, for example, try to write two or three profiles: one for Computers, another for Hardware and a last one for your actual need. Of course, you'd do better to look here first for such general-purpose profiles, and if you write your own, please send them here! A central profile database would allow 'Gus to enter the world of Social (or collaborative) Filtering, in the clumsiest way, him being an Octopus..
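The layering idea can be pictured with a small Java sketch. It rests on assumptions: a profile is reduced here to a plain map from word to cw score, and a more specific layer is assumed to override a more general one when both mention the same word. The class LayeredProfileSketch and its layer method are hypothetical names, not part of FilterGus.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of profile layering, not the real FilterGus mechanism.
final class LayeredProfileSketch {

    // Merges the layers in the order given, from most general to most specific.
    // When a word appears in several layers, the last (most specific) score wins.
    @SafeVarargs
    static Map<String, Integer> layer(Map<String, Integer>... layers) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (Map<String, Integer> layer : layers) {
            merged.putAll(layer);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> computers = new LinkedHashMap<>();
        computers.put("computer", 10);
        computers.put("software", 5);

        Map<String, Integer> hardware = new LinkedHashMap<>();
        hardware.put("hardware", 10);
        hardware.put("software", -5); // the Hardware layer penalizes software pages

        Map<String, Integer> actualNeed = new LinkedHashMap<>();
        actualNeed.put("scsi", 30);

        System.out.println(layer(computers, hardware, actualNeed));
    }
}
```

The general Computers and Hardware layers can then be reused unchanged under any number of specific profiles, which is the point of the comparison with Java classes.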

Default Profiles

You may have noticed that FilterGus always loads some default profiles. They're listed in the startup.gus file (where you'll also find all the system's variables): a standard stoplist (go take a look) and a short HTML profile, where you'll see an example of dw attributes set to discriminate between different parts of structured text.
Suffixes can also be set, up to nine in this version, so you can express the ways a word can match. Two suffixes are special: "+", which handles regular English plurals (casualty-casualties, etc.), and "*", which works as in regular expressions. This aspect is still to be refined, because the goal is to keep the program unspecific about languages and to rely entirely on the loaded profiles. For example, say you want to match
computer , computers , computing , computation , computations ,..

all in one; with this simple mechanism you can, without resorting to weird stemmed words such as comput-, which could come from the word "compulsory" as well.. And without any performance loss.
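The suffix mechanism can be sketched in Java as follows. This is only an illustration under assumptions, not the Matching Tree Algorithm itself: the class SuffixEntry and its matches method are hypothetical, "+" is read as "a regular English plural of the base word", "*" as "any trailing characters", and the bare base word is assumed to match only when the empty suffix is listed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: one profile entry is a base word plus its allowed suffixes.
final class SuffixEntry {
    private final String base;
    private final List<String> suffixes;

    SuffixEntry(String base, List<String> suffixes) {
        this.base = base.toLowerCase(Locale.ROOT);
        this.suffixes = suffixes;
    }

    // True if the token is the base word followed by one of the allowed suffixes.
    boolean matches(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        for (String suffix : suffixes) {
            if (suffix.equals("+")) {
                if (isRegularPlural(t)) return true;
            } else if (suffix.equals("*")) {
                if (t.startsWith(base)) return true;
            } else if (t.equals(base + suffix)) {
                return true;
            }
        }
        return false;
    }

    // Simplified plural rules: base + "s", base + "es", or a final "y" turned into "ies".
    private boolean isRegularPlural(String t) {
        if (t.equals(base + "s") || t.equals(base + "es")) return true;
        return base.endsWith("y")
                && t.equals(base.substring(0, base.length() - 1) + "ies");
    }

    public static void main(String[] args) {
        SuffixEntry entry = new SuffixEntry("comput",
                Arrays.asList("er", "ers", "ing", "ation", "ations"));
        for (String word : Arrays.asList("computing", "computations", "compulsory")) {
            System.out.println(word + " -> " + entry.matches(word));
        }
    }
}
```

Because only the listed endings are accepted, "compulsory" is rejected while all the wanted variants match, and each check is a plain string comparison.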

Known Problems

Further issues

First, a strategic decision: whether to let this project evolve into a fully developed IF system, with plenty of sophisticated features but also a big size, or to keep it as simple as possible, in keeping with the applet viewpoint. The answer depends both on technical issues (the embedded Java environment in next-generation browsers) and on user demand.

Multilanguage Support

Then, multilanguage support is a very important issue, and further work is needed to study general mechanisms that would allow any language to be handled by Gus. That may be too ambitious a target, but for most languages (Western ones, Cyrillic) it should be fairly feasible. Note that the particular architecture of FilterGus allows him to filter documents in two or more languages at the same time!

The Parsing Algorithm

Another important issue is the parsing algorithm, the real engine of this piece of software. The home-made approach to the task, the so-called Matching Tree Algorithm, has proved expensive to develop (from scratch) and still needs a lot of improvement. So the question arises of whether to switch altogether to a more proven and efficient LR parsing tool; this would be a big change from the original idea, where the matching algorithm (especially designed for this task) was a major part of the whole architecture.

Other Issues

Support for multiple document formats and a truly complete set of actions covering all the possible aspects of IF (.. to IR) are also parts of the FilterGus architecture. All of them are already developed, but they sit further down the schedule, being complementary issues. Designing them was great fun, but the coding details are not so interesting to me.

Changes History

Here is the complete list of all the main changes and added features:
First Release (Aug. 12th 1998)
The first working release.
Release 0.2 (1998)

Java, JavaBeans, Swing and the others are trademarks of Sun Microsystems, Inc.

© 1998-2008 Mauro Marinilli