The Design of the CIEE Web Database




In these pages we will look at the architecture of the CIEE Web Database.

Architecture


[Figure: CIEE WebDB Architecture]

The architecture has two major components: the Data Entry Front End and the Community Website. The two are linked by structured text files containing information about the schools surveyed.

The structure of these text files is defined formally in an XML Document Type Definition.
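As a rough illustration of what one of these structured text files might contain, and how a downstream tool could read it, here is a minimal Python sketch. The element and attribute names (`school`, `name`, `district`, `enrollment`) are hypothetical and are not taken from the project's actual DTD; the values are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical school record; the project's DTD defines the real element names.
SAMPLE_RECORD = """\
<school id="S0001">
  <name>Example Primary School</name>
  <district>Example District</district>
  <enrollment boys="120" girls="110"/>
</school>
"""

def parse_school(xml_text):
    """Turn one school record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    enrollment = root.find("enrollment")
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "district": root.findtext("district"),
        "boys": int(enrollment.get("boys")),
        "girls": int(enrollment.get("girls")),
    }

record = parse_school(SAMPLE_RECORD)
print(record["name"], record["boys"] + record["girls"])
```

Note that `xml.etree.ElementTree` only checks that the file is well-formed; it does not validate against a DTD, so a real tool would need a validating parser for that step.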

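School information thus exists in two places in this architecture: in the structured text files produced by the Data Entry Front End, and in the database behind the Community Website.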

User annotations to web pages and other such web-community features will be handled by the web community toolkit.

Let us look at the major components of the architecture more closely.

The Community Website

The Community Website comprises a database-backed web server running web community software. The static content of the web site is derived from the CIEE survey information. The web community software provides a layer of user customizability and community support over the static content. We are currently evaluating OpenACS to implement the web community infrastructure, since it supports all the functionality that we need.

The design goal of the website front end is to support the kinds of user-customized interactions mentioned in the requirements document.

The static web pages will contain, in addition to the school survey information, computed summaries that are likely to be of popular interest.

For example, we may display statistics such as the pass/fail rates of girls vs. boys in the various districts of Karnataka, in an easy-to-comprehend graphical form.
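A computed summary of this kind could be produced along the following lines. This is only a sketch: the row layout, the district names, and all numbers are invented placeholders, not survey results, and the text bar stands in for whatever graphical form the site finally uses.

```python
from collections import defaultdict

# Placeholder rows: (district, gender, students_appeared, students_passed).
# These numbers are invented for illustration and are not survey results.
ROWS = [
    ("District A", "girls", 50, 40),
    ("District A", "boys", 60, 45),
    ("District B", "girls", 30, 27),
    ("District B", "boys", 40, 30),
    ("District A", "girls", 50, 35),
]

def pass_rates(rows):
    """Return {(district, gender): pass rate as a percentage}."""
    appeared = defaultdict(int)
    passed = defaultdict(int)
    for district, gender, n_appeared, n_passed in rows:
        appeared[(district, gender)] += n_appeared
        passed[(district, gender)] += n_passed
    return {
        key: 100.0 * passed[key] / appeared[key]
        for key in appeared
        if appeared[key] > 0
    }

def text_bar(percent, width=20):
    """Render a percentage as a crude fixed-width text bar."""
    filled = round(width * percent / 100.0)
    return "#" * filled + "." * (width - filled)

for (district, gender), rate in sorted(pass_rates(ROWS).items()):
    print(f"{district:12} {gender:6} {text_bar(rate)} {rate:5.1f}%")
```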
Rationale
Downsides

Of course there are downsides :) .

The Data Entry Front End

The Data Entry Front End is designed to run standalone, on machines that may not be connected to the Internet. It will assist end users in keying in data collected during a survey and will store this information in structured text files.

The key design goals of the Data Entry Front End are:

Standalone operation
The front end will not need to run on a machine connected to the Internet. It will store data about a school in the form of text files. These files can then be transferred by various means (email, floppy disk, etc.) and later uploaded to the main database by another utility.
Platform independence
The data entry tool should be as platform independent as possible. In particular, the front end will need to run on the commonly available Windows(r)-based PCs.
Ease of data entry
The front end will make it easy to input school survey information by providing an appropriately designed user interface (for example, one with suitable keyboard accelerators and a sensible layout).
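The standalone goal above amounts to writing each school's data to its own file, with no network or database involved. A minimal Python sketch of that step follows; the function name `save_school`, the one-file-per-school naming, and the element names are assumptions made for illustration, not part of the project's design.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def save_school(record, directory):
    """Write one school record to <directory>/<id>.xml and return the path."""
    # Element names here are hypothetical; the real ones come from the DTD.
    root = ET.Element("school", id=record["id"])
    ET.SubElement(root, "name").text = record["name"]
    ET.SubElement(root, "district").text = record["district"]
    path = Path(directory) / f"{record['id']}.xml"
    ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)
    return path

# A temporary directory stands in for the folder whose files would later be
# sent by email or copied to a floppy disk.
with tempfile.TemporaryDirectory() as outbox:
    saved = save_school(
        {"id": "S0001", "name": "Example Primary School", "district": "Example District"},
        outbox,
    )
    saved_name = saved.name
    reloaded_name = ET.parse(str(saved)).getroot().findtext("name")

print(saved_name, reloaded_name)
```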

Rationale

Some kind of user interface tool is needed in any case to perform validity checks as data is entered. The issue is whether the tool will talk directly to the database or keep data in some intermediate format.
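The validity checks mentioned here do not depend on where the data ends up. As a sketch, assuming a record is held as a dictionary with the hypothetical fields `name`, `district`, `boys`, and `girls`, a check in Python might look like this:

```python
def validate_record(record):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    # Required text fields must be present and non-blank.
    for field in ("name", "district"):
        if not str(record.get(field, "")).strip():
            problems.append(f"{field} is required")
    # Counts must be present and must be non-negative integers.
    for field in ("boys", "girls"):
        value = record.get(field)
        if not isinstance(value, int) or value < 0:
            problems.append(f"{field} must be a non-negative whole number")
    return problems

good = {"name": "Example Primary School", "district": "Example District",
        "boys": 120, "girls": 110}
bad = {"name": "  ", "district": "Example District", "boys": -5}

print(validate_record(good))
print(validate_record(bad))
```

The same function could run inside the data entry tool and again in the upload utility, so that hand-edited files are checked before they reach the database.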

An alternative to the current architecture would be to keep the school database entirely in one form, namely, inside a set of database tables. This approach looks simpler, but is inferior to the current approach for a number of reasons:


[Figure: An alternate architecture for the CIEE WebDB]

  1. Having a formal XML DTD connecting the data entry and database modules allows development of the front end and the database to occur in parallel. In particular, since the format of the survey form is nearly fixed, while the structure of the web site is still under design, fixing the data interchange format allows the development of the user interface to proceed independently of the main website software.

  2. Keeping school data in text files allows for easy editing and revision. The alternative, namely entering the school information into database tables and editing it in situ, is more complicated.

  3. A number of readily available tools can edit XML documents. If you are comfortable with Emacs, for example, you won't even need a separate data entry front end tool.

  4. In the process of developing the web site we may need to drastically reorganize the way the school data is represented inside the database. Keeping the school data in a separate form allows for easy repopulation of the database tables without the need to key in the data again.

  5. Keeping school data in text files greatly eases backup procedures, compared to having the data live in a database.

  6. Keeping school data as structured text files allows standalone operation of the data entry tool. This is an important point as the idea is to allow data entry to occur at CIEE offices in the districts. These districts need not have connectivity to the Internet.

  7. As of January 2001, more than 400 schools have already been surveyed, and this information now exists as paper forms with the CIEE. These need to be converted to machine-readable form right away. Keying in some of the forms showed that approximately 1 hour is needed to input the data for one school, so the existing backlog alone represents some 400 hours of data entry. Separating the two tasks decouples data entry from the implementation of the database module and lets data entry start almost immediately.
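Point 4 above, repopulating the database from the text files after a schema change, can be sketched as follows. The example uses Python's built-in `sqlite3` module purely as a stand-in for PostgreSQL so that it runs anywhere; the table layout and element names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Two hypothetical school records, standing in for files on disk.
RECORDS = [
    '<school id="S0001"><name>School One</name><district>District A</district></school>',
    '<school id="S0002"><name>School Two</name><district>District B</district></school>',
]

def repopulate(conn, xml_records):
    """Drop and rebuild the schools table from the XML records."""
    conn.execute("DROP TABLE IF EXISTS schools")
    conn.execute(
        "CREATE TABLE schools (id TEXT PRIMARY KEY, name TEXT, district TEXT)"
    )
    for text in xml_records:
        root = ET.fromstring(text)
        conn.execute(
            "INSERT INTO schools (id, name, district) VALUES (?, ?, ?)",
            (root.get("id"), root.findtext("name"), root.findtext("district")),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
repopulate(conn, RECORDS)
repopulate(conn, RECORDS)  # can be rerun after the table layout changes
count = conn.execute("SELECT COUNT(*) FROM schools").fetchone()[0]
print(count)
```

Because the text files remain the master copy, changing the `CREATE TABLE` statement and rerunning the load is enough; nothing has to be keyed in again.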

The downsides of the architecture

This flexibility comes with a cost. Possible objections include:

Too many technologies

Clearly the proposed architecture involves a number of new and old technologies (XML, HTTP/HTML, databases, etc.). There are risks associated with every technology introduced into a project. Can't we just program everything in C or SQL?

We could write everything from scratch using a single low-level language or methodology, but we would still need to provide equivalent functionality. We would end up implementing what we need, but possibly with tools that are not well suited to the task at hand. Relying on well-known and "standard" technologies actually ends up saving development time and improving reliability.

Maintenance/Training needs

How do we get people up to speed on so many technologies at once?

You need to get the right people. Creating software is a hard problem and so far no royal road to software development has been found. That said, in my opinion anyway, there is nothing here that is so complex as to preclude someone with aptitude from learning enough to manage well.

Implementation time

We are not taking on any more tasks than absolutely needed for the efficient working of the project.

So, as the CSRG team at the University of California, Berkeley used to say about their BSD distribution: "it will be ready when it is ready" :) .

Software Components

In this section we will look at some of the major software building blocks that comprise the system.

FreeBSD

Since we want our system to be replicable and low-cost, it is essential that it run on low-end commodity computing hardware and that it not require costly proprietary software.

As far as commodity hardware goes, the x86 PC is nearly ubiquitous in India these days, so this has been selected as our hardware platform. The kind of load that we expect our website to be subject to can easily be handled by a low-end PC of today.

Next comes the choice of operating system. There are a number of "free" operating systems available for the PC platform today. Linux-based systems are one example. Several BSD-derived OSes also exist for this platform.

I personally prefer FreeBSD over the alternatives.

ArsDigita Community Server

One of the more featureful web community toolkits, the ArsDigita Community Server (ACS) is an open-source product maintained by ArsDigita Inc., which is in the business of building and maintaining very large database-backed websites.

However, the ACS toolkit uses the Oracle database engine. Since the Oracle DB software is neither free nor open-source, a spin-off volunteer project has ported the ACS toolkit to the open-source relational database PostgreSQL. This toolkit is called OpenACS and it offers nearly all of the functionality of the ACS toolkit.

PostgreSQL

PostgreSQL is a solid, stable object-relational database engine that is open-source and volunteer-developed. It is now a mature offering and is in use in a number of organizations.

We need PostgreSQL in order to run OpenACS.

AOLserver

Earlier called NaviServer, this open-source web server powers the websites of industry giants like AOL. The web server is extensible using the Tcl command language.

AOLserver is required because ACS (and OpenACS) relies heavily on its features to run.


Contact: jkoshy@FreeBSD.org
Last Modified: Sat Apr 21 22:53:24 2007