2. Management Systems for Media Databases

In the previous chapter, a database was defined as a collection of related data representing some logically coherent aspect of the real world.
With this definition, no limitations are given as to the type of:

The focus of this text is on on-line - electronic and web accessible - databases containing multiple media data, thus restricting our interest/focus to multimedia databases stored on one or more computers (DB servers) and accessible from the Internet. Examples of such databases include the image collections of the Hermitage Museum, the catalog and full text materials of the ACM digital library, and the customer records for the 7 sites of Amazon.com

Electronic databases are important since they contain data recording the products and services, as well as the economic history and current status of the owner organization. They are also a source of information for the organization's employees and customers/users. However, databases can not be used effectively unless there exist efficient and secure data management systems, DMS for the data in the databases.


2.1 Data Management Functions

The primary objective for a data management system, DMS, is to provide efficient and effective management of the database. This includes providing functions for data storage, retrieval, secure modification, DB integrity and maintenance. There are two principle quality measures for a DMS; efficiency and effectiveness.

Techniques used to reach these goals include:

Figure 2.1 gives a generic DMS architecture, highlighting principal components.

  1. The storage and retrieval sub-system manages the data in the database using a storage structure suited to the type of data in the collection. Two additional data collections are maintained by this subsystem:
  2. The query processing sub-system supports one or more query languages, presented through the user interface for specification of database creation and data insertion, retrieval and update tasks for the DMS.
    This subsystem establishes the metadata framework for the application, stored in the schema, and then uses the schema data for query interpretation.

As discussed in chapter 1, the interpretation of data requires knowledge of the conventions used for the data representation. For electronic databases, these conventions are stored as metadata, defined as data about data, in either a:

The metadata types stored in a DB schema are specified using a data definition language, DDL, The storage subsystem of the DMS supplements the metadata with data describing the DB structure. The query processing subsystem uses the metadata for the interpretation of user queries (requests for data).

The storage subsystem also generates indexes for the data as it is stored into the DB. The retrieval subsystem uses both the schema and indexes as aids for location and retrieval of DB data in response to a human or program requester.

The DMS user interface, that part of the query processor that receives user queries, must support multiple user types and requirements and commonly consists of multiple query languages, each providing some specialized functions. For example:


2.1.1 Database design & creation

Prior to establishing a database, the database administrator, DBA, should create a data model to describe the intended content and structure for the DB. It is this model that is the basis for specification of the Data Definition Language, DDL statements necessary for construction of the DB schema and the structure for the DB storage areas. Figure 2.2 illustrates part of a data model for an example administrative database containing various media attributes: images, text and video.
(The data model is formed using the graphic syntax of the Structural Semantic Model, SSM, defined in (Nordbotten, 1993b) and presented in Ch.3.5).

Current data management systems for administrative applications, commonly based on relational dbms technology, support data types for the attributes indicated in the Figure 2.2 model. However, IF these data are stored within their 'parent' table as regular, fixed length values, this will require excessive storage because of the variable lengths of text and video data. There will also be a retrieval cost, since the media data will be retrieved each time the other attributes are requested.

One tactic commonly used to address this problem is to store media data in separate areas (as separate files) and store their address in their primary table location. Note that according to our DB definition, these data remain within the scope of the database. The advantage is that storage space is used efficiently at a cost of a more complex DB design (5 extra 'tables' for the Figure 2.2 DB) and thus a more complex schema for the system, as well as indirect (2 step) retrieval of the media data. A further 'result' of this strategy is the establishment of a set of media specific files/tables.

Index generation is commonly done after the initial insert of DB data, followed by index modification as new data records are added. Indexing techniques for regular, relational data can be used for the central entity-type and attribute types given in the Figure 2.2 example. However, standard relational technology does not include indexing algorithms for text, image or streamed data, making it necessary to extend relational dbms technology with media indexing functionality, assuming that content search for the media data is to be supported.


2.1.2 Information & data retrieval
The terms
information and data are often used interchangeably in the data management literature causing some confusion in interpretation of the goals of different data management system types. It is important to remember that despite the name of a data management system type, it can only manage data. These data are representations of information. However, historically (since the late 1950's) a distinction has been made between:

Both retrieval types match the query specifications to database values. However, while data retrieval only retrieves items that match the query specification exactly, information retrieval systems return items that are deemed (by the retrieval system) to be similar to the query terms. In the later case, the information requester must select the items that are actually relevant to his/her request. Quick examples include the request for the balance of a bank account vs selecting relevant links from a google.com result list.

User requests for data are typically formed as "retrieval-by-content", i.e. the user asks for data related to some desired property or information characteristic. These requests or queries must be specified using one of the query languages supported by the DMS query processing subsystem. A query language is tailored to the data type(s) of the data collection. Figure 2.3 models a multiple media database and illustrates 2 query types:

  1. A data retrieval query expressed in SQL, shown in Figure 2.3b, based on attribute-value matches to locate titles of documents authored by "Joan Nordbotten" and
  2. A document retrieval query, shown in Figures 2.3c, for those documents containing the keywords: database, management, sql3 or msql.

The query in Fig.2.3b is stated in standard SQL2, while the query in Fig.2.3c is an example of a content query (Baeza-Yates & Ribeiro-Neto, 1999; Guojun, 1999) and is typical of those used with information retrieval systems.

Note that 2 different query languages are needed to retrieve data from the structured and non structured (document) data in the DB.


2.2 The Evolution of Data Management Systems

Development of today's generalized data management systems began in the mid '60s within separate application domains. Then, as now, a given DMS type managed a particular type of media data for applications focused on the use of that media.
Examples include:

Table 2.1 shows data management system, DMS, types ordered by the primary type of data stored in the DB, giving the (approximate) start date for the first commercial systems and the name of 1 or 2 of the first commercial systems developed for each data type.

2.2.1 Administrative database management systems
Major research and development effort has been directed at developing systems that can provide efficient, stable, and safe systems for processing operational level data within large organizations, such as banks, insurance, transport, marketing, and production, among many others. Early systems developed by IT vendors were based on network and hierarchic models and were focused on preserving control over relationships between the data representing different 'real world' entities, such as the ternary relationship between employee, product and customer shown in
Figure 2.2.

These systems were criticized by some in the academic community as too complex and without an underlying theory for design and structure. This led to Ted Codd's proposal for using set theory as the basis for a relational model of data (Codd, 1970), the following development of Relational database management systems (rdbms), and to a near universal acceptance of rdbms technology for management of administrative (tabular) data. Stonebraker & Brown (1999) estimated that business database management systems built on the relational model support about 80% of commercial operational-level applications.

An early criticism of the relational model and data management systems was that the 'normalization' requirement for use of simple, atomic elements constrained the representation of complex data. In the late '80s, researchers working with data management for industrial design applications implemented with Object-Oriented, OO, technology proposed techniques for adding user-defined data types, UDTs, and user-defined functions, UDFs (methods) into the data type set provided by a DMS. Further, techniques were proposed and implemented for management of attributes represented by sets and arrays, as well as attribute and method inheritance within classification hierarchies. These ideas have been included in several commercial OO-DBM Systems, including GemstoneTM, JasmineTM, O2TM, and ObjectStoreTM. For details see any of the numerous books on OO data management, including Cattell (1994), Embley (1998), and Loomis (1995).

Problems with using OO-DMS technology for many business applications have included a lower level of standardization, scalability, and integrity control than that provided by existing relational systems. This has led to substantial effort, during the '90s, to integrate the flexibility of OO concepts and methods with the dependability and scalability of relational data management systems. The resulting systems, Object-Relational, OR, DBMSs, represent the current state-of-the-art for general-purpose data management systems (Chamberlin, 1998; Dunckley, 2003; Saracco, 1998; Stonebraker, 1999).

OR-data management systems implement the SQL3 standard (1999), which extends the functionality of both SQL2 (1992) and the query processor to provide:

OR-DMS support for complex data-types and media objects is illustrated in Figure 2.4. (Note again that the data model is formed using the graphic syntax of the Structural Semantic Model, SSM type to be presented in Ch.3.5.) In Figure 2.4 the attributes and data types requiring object-oriented implementation support include the composite attributes name and address, in which the latter contains a user-defined data type Pcode, which defines the valid set of postcodes for the application. Other O-O data-types shown in the example are the set of telephone numbers and the age function. Other user-defined data types illustrated in the figure include the geographic point, image, and text. If the data model of Figure 2.4 is implemented using an OR-DBMS with document, spatial, and image data management sub-systems, only the UDTs, Pcode and the age function, will require programmer defined UDFs for processing.

A problem with or-dbms' is that they are founded on the relational model and the notion of well defined and formatted, atomic data elements. This restricts the applicability of these systems for management of unstructured data representing media objects such as text documents, images, maps, and audio/video material. Also, sql3 and the core or-dbms do not provide indexing, retrieval and manipulation support for media data.

2.2.2 Text-based Information Retrieval Systems
As indicated in
Table 2.1, text-based information retrieval systems, or more correctly document retrieval systems, (ref. the IFIP definition of information), have as long a development history as systems for management of structured, administrative data. The basic architecture for document retrieval systems has remained relatively constant, while research and development have focused on improving the indexing and retrieval (similarity) algorithms used.

Recently, Baeze-Yates & Ribeiro-Neto, (1999) estimated that 90% of computerized data is in the form of text documents. This data is accessible using the retrieval technology developed for off-line document/information retrieval systems and adapted for the newer Digital Libraries and Web search engines.

In comparison to the structured/regular data used by administrative applications, documents are unstructured, consisting of a series of characters that represent words, sentences and paragraphs of unequal length. This requires different techniques for indexing, search and retrieval than that used for structured administrative data. Rather than indexing attribute values separately, a document retrieval system develops a term index similar to the ones found in the back of books, i.e. a list of the terms found in the documents with lists of where each term is located in the document collection. The frequency of term occurrence within a document is assumed to indicate the semantic content of the document.

Search for relevant documents is commonly based on the semantic content of the document, rather than on the descriptive attribute values connected to it. For example, if we assume that the data stored in the attribute Document.Body in Figure 2.3a is the actual text of the document, than the retrieval algorithm, when processing Q2 in Figure 2.3c, searches the term index and selects those documents that contain one or more of the query terms database, management, sql3 and msql. It then sorts the resulting document list according to the frequency of these terms in each document.

There are two principle problems in using term matching for document retrieval:

  1. Terms can be ambiguous, having meaning dependent on context, and
  2. There is frequently a mismatch between the terms used in a query and the terms used in the document collections.
Techniques and tools developed to address these problems and thus improve retrieval quality include:

Though document retrieval indexes can be generated for attribute values, the techniques of document retrieval are not generally efficient for retrieval of other media types. In addition, none of these techniques or tools is supported by the standard for relational database management systems. However, since there is a need to store text data with regular administrative data, various text management techniques are being added to or-dbm systems.

2.2.3 Image retrieval systems initial draft
In addition to text, many organizations - such as insurance companies, medical, news media, museums, and libraries - also have digital collections of images, audio, and video data. Given the availability of increased storage capacity and bandwidth for data transfer, large quantities of image material, taken by digital cameras and mobile phones, are being made available on the internet. Though image data collections have a similar architecture with text document collections, i.e. media data plus metadata, the techniques for indexing and retrieval are quite different.

As noted earlier, images are strings of pixels with no other explicit relationship to the following pixel(s) than their serial position. There is no image vocabulary that can be used to index the semantic content. Current image retrieval systems use 2 sets of tools to assist image selection:

  1. Metadata, generated manually, that describe the content, meaning/interpretation, and context for each image. and
  2. Automatically selected, low level features such as color and texture distribution and identifiable shapes.

Once collected, metadata can be used to retrieve images using either exact match on attribute values or text-search on text descriptive fields. Most image retrieval systems utilize these techniques.

In addition, an image retrieval system can generate a signature for each image based on an extraction and analysis of its low-level feature patterns. The signatures are indexed and used to match a similar signature generated for visual queries, i.e. queries based on an image example.

Unfortunately, using low-level features does not give a good 'semantic' result for image retrieval, as shown by using the example input in the VISI prototype to search for images of pairs of humpback whales in an image database with 300 images of marine animals. This search resulted in 13 images that included 1 of a pair of dolphins, 6 lone whales of different species, 3 sharks, and 3 birds. Obviously more work is needed to improve the quality of image retrieval and this is an active research field.

2.2.4 Multimedia Information Retrieval Systems, MIRS
Increasingly, organizations need to integrate their media data with their administrative data, giving rise to a requirement for multiple media systems - i.e. systems incorporating tabular/administrative data with document, image, spatial, historical, audio, and/or video data.

A Multimedia Information Retrieval System, MIRS can be defined as: In practice, an MIRS is a composite system, which can be constructed from known technology from relational and object-oriented database management systems, as well as functionality from specialized systems that manage text documents, image, spatial, historical, audio, and/or video data.

The major vendors of Object-Relational, O-R systems such as IBM's DB2TM, InformixTM, and OracleTM, have included data management subsystems for such media types as text documents, images, spatial data, audio and video. These 'new' (to sql) data types have been defined using the user defined type functionality. Management functions/methods, based on those developed for the media types in specialized systems, have been implemented as user defined functions. The result is a standard for SQL3 with system dependent implementations for the management of multimedia data.

2.2.5 Management of Web Databases, W-DBM

2.2.6 Multi, multimedia DB management
The next generation of data management systems must support transparent access to multiple, heterogeneous and autonomous database systems, called multi-database systems. A presentation of research efforts in integration of administrative database systems, begun in the 1980s, is given in Elmagarmid, Rusinkiewicz, & Sheth (1999). Major problems in realizing these systems are grounded in the autonomy of local data managers and system owners and in the fundamental heterogeneity of the component sub-systems arising from independent system development.