The Design of the CIEE Web Database |
| About | Articles | How to contact me | Projects | Site Map |
In these pages we will look at the architecture of the CIEE Web Database.
The architecture has two major components; the Data Entry Front End and the Community Website. The linkage between these two components uses structured text files containing information about the schools surveyed.
The structure of these text files is defined formally in an XML Document Type Definition.
School information thus exists in two places in this architecture:
User annotations to web pages and other such web-community features will be handled by the web community toolkit.
Let us look at the major components of the architecture more closely.
The Community Website comprises of a database-backed web server running web community software. The static content of the web site is derived from the CIEE survey information. The web community software provides a layer of user customizability and community support over the static content. We are currently evaluating OpenACS to implement the web community infrastructure since this supports all the necessary functionality that we need.
The design goals of the website front end are to support the kinds of user customized interactions mentioned in the requirements document.
The static web pages will contain, in addition to the school survey information, computed summaries that are likely to be of popular interest.
For example, we may display statistics like the pass/fail rates of girl children vs. boys in the various districts of Karnataka, in an easy to comprehend graphical form.
The school information that we are processing changes very infrequently. Any given school is not likely to be surveyed more frequently than twice a year. Thus, statically generated content is perfectly fine for the school information.
Given the small amount of data (approximately 55,000 schools), and a data set size of a few tens of MB of structured text, simple queries can be implemented using text tools like grep and webglimpse .
The major plus point for me personally, is that such a design is simple. Simple to design, implement, verify and maintain.
Of course there are downsides :) .
Firstly, by not keeping the school information in full-fledged relational database, we lose some generality that a database engine could provide. For example, one cannot run arbitrary queries using a declarative language like SQL , and would instead need to write out scripts to process the text files that hold the school information.
While this is true, we need to remember that:
implementing and maintaining a relational database is a lot of effort and not a task to be taken lightly.
Further, we may not want to allow arbitrary people on the Internet to run queries (using SQL or any programming language) on a live database. The potential for malicious acts, denial of service attacks etc. are too large to justify the risks.
I am not convinced that we are losing much functionality: how many social workers can comprehend a database table structure document and then formulate the relevant SQL queries on their own anyhow?
Performance in absolute terms may be slower than with a real relational database.
While this is perhaps true, it remains to be seen if this is really an issue in real life. If really necessary, we can think of moving to a real database when we reach the 100,000 school mark.
If a particular kind of query is common, we can precompute the information and serve it up as a static page.
Writing scripts is more difficult than formulating an SQL query.
Well, it depends. Using a clean language like Python with its powerful support libraries greatly eases the difficulty of writing scripts. I've seen secretaries and managers use and enjoy Python. It is not so easy to get data out of relational databases using SQL either; this is what keeps database programmers prosperous :) .
The Data Entry Front End is targeted to run stand-alone, on machines that may not be connected to the Internet. The front end will assist end users in keying in data collected during a survey and will store this information in structured text files.
The key design goals of the Data Entry Front End are:
Some kind of user interface tool is anyway needed to do the necessary validity checks when entering data. The issue is whether the tool will directly talk to the database or keep data in some intermediate format.
An alternative to the current architecture would be to keep the school database entirely in one form, namely, inside a set of database tables. This approach looks simpler, but is inferior to the current approach for a number of reasons:
Having a formal XML DTD connecting the data entry and database modules allows development of the front end and the database to occur in parallel. In particular, since the format of the survey form is nearly fixed, while the structure of the web site is still under design, fixing the data interchange format allows the development of the user interface to proceed independently of the main website software.
Keeping school data in text files allows for easy editing and revision. The alternative, namely entering the school information into database tables and editing them in-situ is more complicated.
There are a number of tools that can edit XML documents which are readily available. If you are comfortable with EMACS for example, you won't even need a separate data entry front end tool.
In the process of developing the web site we may need to drastically reorganize the way the school data is represented inside the database. Keeping the school data in a separate form allows for easy repopulation of the database tables without the need to key-in data again.
Keeping school data is text files greatly eases backup procedures, compared to having the data live in a database.
Keeping school data as structured text files allows standalone operation of the data entry tool. This is an important point as the idea is to allow data entry to occur at CIEE offices in the districts. These districts need not have connectivity to the Internet.
As of January 2001, about 400+ schools have already been surveyed and this information now exists in the form of paper forms with the CIEE. These need to be converted to machine readable form, right away. Keying in some of the forms showed that approximately 1 hour is needed to input the data for one school. Separating the two tasks allows data entry to be decoupled from the implementation of the database module, and for data entry to start nearly immediately.
This flexibility comes with a cost. Possible objections include:
Clearly the architecture proposed has a number of new and old technologies (XML, HTTP/HTML, Databases etc.). There are risks associated with every technology introduced into project. Can't we just program everything in C or SQL ?
We could write everything from scratch using any one low-level language or methodology, but we will still need to provide equivalent functionality. Thus we will end up implementing what we need but possibly with tools that are not quite suited for the task at hand. Relying on well-known and "standard" technologies actually ends up saving development time and improving reliability.
How do we get people upto speed on so many technologies at a time?
You need to get the right people. Creating software is a hard problem and so far no royal road to software development has been found. That said, in my opinion anyway, there is nothing here that is so complex as to preclude someone with aptitude from learning enough to manage well.
We are not taking on any more tasks than absolutely needed for the efficient working of the project.
So, like the CSRG team at University California, Berkeley used to say about their BSD distribution: "it will be ready when it is ready" :) .
In this section we will look at some of the major software building blocks that comprise the system.
Since we want our system to be replicable, and low cost, it is essential that it runs on low end commodity computing hardware and that it not require costly proprietary software to run.
As far as commodity hardware goes, the x86 PC is nearly ubiquituous in India these days, so this has been selected as our hardware platform. The kind of load that we envisage our website to be subject to can easily be handled by a low-end PC of today.
Next, comes the choice of operating system. There are a number of "free" operating systems available for the PC platform today. Linux based systems are one example. The many BSD derived OSes also exist for this platform.
I personally prefer FreeBSD over the alternatives.
One of the more featureful web community toolkits, the ArsDigita Community server is an open-source product being maintained by ArsDigita Inc. ArsDigita Inc. is in the business of building and maintaining very large database backed websites.
However, the ACS toolkit uses the Oracle database engine. Since the Oracle DB software is neither free nor open-source, a spin-off volunteer project has ported the ACS toolkit to the open-source relational database PostgreSQL. This toolkit is called OpenACS and it offers nearly all of the functionality of the ACS toolkit.
PostgreSQL is a solid, stable object-relational database engine that is open-source and volunteer developed. It is now a mature offering and is in use in a number of organizations.
We need PostgreSQL in order to run OpenACS.
Earlier called NaviServer, this open-source web server powers websites of industry giants like AOL. The web server is extensible using the TCL command language.
AOLServer is required because the ACS (and OpenACS) rely heavily on its features to run.
| Contact: jkoshy@FreeBSD.org |
|