10. Multi- Multimedia Information Retrieval Systems - MMIRS

The techniques presented so far in this text are fundamental to the development of single, multiple media DB systems. Many of these techniques have been implemented in the media extensions that supplement or-dbms platforms, making it relatively easy to add and manage multiple media data within extended relational DB systems. The resulting system can be considered a basic form of a multiple media information retrieval system, MIRS, as envisioned by Lu,G. (1999).

However, it is a common observation that the Internet now consists of vast quantities of thematically related information that are 'scattered' geographically. These information sources include tabular and/or varying types of media data, stored on web-sites or in web-accessible databases and administered by numerous types of data management systems. The next challenge - and a current research area - is to provide single-point user access to 'all' the data from multiple data sources that are relevant to his/her information need, as expressed in a query.

There are numerous approaches to addressing this need, ranging from 'traditional' work in distributed database management (from the early 1980s) to current work on developing the semantic web. Tim Berners-Lee, commonly credited as the "father of the Semantic Web", and his co-authors envisioned easy access to "all" relevant information through a web of semantic links that link related information/data sources. (Their 2001 paper The Semantic Web is a seminal article in this field.) Realizing the vision of a Semantic Web linking Web documents and content databases is the subject of a substantial and ongoing research effort, some of which is coordinated by the W3C Semantic Web team.

10.1 Characteristics of Multi- Multiple Media Systems

Multi-database systems, MDBS, can be simply defined as systems that support access to, information retrieval from, and frequently, transactions on multiple databases. Applications that may use multi-database systems are many and include banking, airline, and e-commerce systems. Traditionally, the component databases in an MDBS have been assumed to be well structured, i.e. without complex and/or large media data components.

Multi- multiple media information retrieval systems, MMIRS, can be equally simply defined as multi-database systems in which more than one of the component databases is a multimedia database. Again, application examples are many and include digital libraries, museum consortiums and cooperating news agencies. As an example, a set of or-dbms-supported MIRSs containing thematically related data can define the scope of a multi- multimedia database system.

Given the variety of data types and data management systems on the Internet, we need to revisit our definition of databases to ensure that it covers all types of data collections: traditional structured/relational databases, media data collections, Web-sites, as well as combinations thereof. For example, news agencies and museums maintain both extensive web-sites of media rich data and content databases with both media and tabular/administrative data.

Since a multi-database system, MDBS, is viewed as a 'single' system, the component databases need to constitute a logically coherent collection of related data, i.e. there is some thematic overlap among the component databases. MDBS components are autonomous and operational database systems in which each component (or local) DBS supports current applications with local data. Other 'defining' characteristics of MDBS' include:

A multi- multimedia information retrieval system, MMIRS, will have all of the above characteristics. However, media objects are by nature seldom updated, so an MMIRS may not need to include transaction support.

10.2 Architectures for Multiple MIRS Integration

A user cannot reasonably be expected to know the physical locations, DB identities, and logon procedures of all of the data sources that might be relevant to his/her information request. Some form of front-end system, consisting of an interface, search engine, and integrated database system, therefore needs to be developed to provide access to the potentially multiple, relevant DBSs. It is equally unreasonable to require that the user access relevant data one system at a time using potentially varying local system query languages, so there is a need to develop a common user query language and let the underlying query processor do the necessary query translations. Thus, a desirable MMIRS (and MDBS) would offer a single interface and query language to the data in any number of multimedia database systems and then integrate and rank the results of the user search query.
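The front-end idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two component systems (IMAGE_ARCHIVE and NEWS_DB), their local query forms, and the scoring rule (fraction of query keywords matched) are all hypothetical and are not taken from any system discussed in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str
    title: str
    score: float  # fraction of query keywords matched, in [0, 1]


# Two hypothetical component systems, each with its own local query form.
IMAGE_ARCHIVE = {
    "Viking ship at sea": "viking ship sea",
    "Stave church": "stave church wood",
}
NEWS_DB = [
    ("Viking ship excavated", ["viking", "ship", "excavation"]),
    ("Election results", ["election", "politics"]),
]


def image_archive_search(query_string):
    """Local interface 1: expects one space-separated query string."""
    terms = query_string.split()
    return [(title, sum(t in text.split() for t in terms))
            for title, text in IMAGE_ARCHIVE.items()]


def news_db_search(term_set):
    """Local interface 2: expects a set of terms."""
    return [(title, len(term_set & set(tags))) for title, tags in NEWS_DB]


def federated_search(keywords):
    """Translate one common query for each source, then merge and rank."""
    keywords = [k.lower() for k in keywords]
    hits = []
    # Query translation: the same keywords, expressed in each local form.
    for title, matched in image_archive_search(" ".join(keywords)):
        if matched:
            hits.append(Hit("image_archive", title, matched / len(keywords)))
    for title, matched in news_db_search(set(keywords)):
        if matched:
            hits.append(Hit("news_db", title, matched / len(keywords)))
    # Result integration: one list, best score first, ties broken by title.
    return sorted(hits, key=lambda h: (-h.score, h.title))
```

Calling federated_search(["Viking", "ship"]) queries both sources through their own interfaces and returns a single ranked list. The user never sees which local query form was used. A real system would also need comparable relevance scores across sources, which this sketch simply assumes.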

In traditional distributed database literature for structured DBs, it is common to distinguish between types of distributed system architectures based on the degree to which the component schemas are/can be integrated (Elmagarmid,A., Rusinkiewicz,M. and Sheth,A. ed., 1999; Litwin,W. and Abdellatif,A., 1986; Nordbotten, J.C., 1988; Sheth,A. and Larson,J., 1990). The alternatives listed below are illustrated in Figure 10.1, which is based on the classification of (Sheth,A. and Larson,J., 1990).

Work initiated in the mid-1990s on the development of Digital Libraries has produced numerous proposals, prototypes and operational systems for management of multiple MIRSs. Unfortunately, this work has been done without close cooperation from the traditional (relational-based) database management field. One consequence is that a 'new' and varied terminology has been introduced. For example, Cruz,I. and James,K. (1999) describe (briefly) the DelaunayMM architecture for accessing multiple multimedia DBs in which a common Metadata Warehouse functions as an integrated, global schema to an underlying set of image and text data collections.

10.2.1 Homogeneous systems
Homogeneous multi-DB systems are tightly coupled in the sense that they are designed from a single DB and then geographically distributed. Each component DB has the same schema structure so that the schema metadata are the same and synonyms are avoided. Examples include VISA payment systems and the perspective on digital libraries that envisions that each local library system utilizes a common specification framework such as Dublin Core. These systems are relatively easy to query and manage, but can be difficult to extend to cover new DB components.
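
Because every component DB shares one schema, a query can be sent unchanged to each site and the answers simply concatenated. The following minimal sketch illustrates this with in-memory SQLite databases standing in for the distributed components. The site names, the item table, and the sample rows are hypothetical. The column names borrow three Dublin Core element names (identifier, title, creator) purely for illustration.

```python
import sqlite3

# One schema shared by every site (column names follow Dublin Core elements).
SCHEMA = "CREATE TABLE item (identifier TEXT, title TEXT, creator TEXT)"


def make_site(rows):
    """Create one component DB; in-memory SQLite stands in for a remote site."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO item VALUES (?, ?, ?)", rows)
    return conn


def query_all_sites(sites, sql, params=()):
    """Run the same SQL at every site and tag each row with its site name."""
    results = []
    for name, conn in sites.items():
        for row in conn.execute(sql, params):
            results.append((name,) + tuple(row))
    return results


sites = {
    "bergen": make_site([("b1", "Peer Gynt", "Ibsen"), ("b2", "Hunger", "Hamsun")]),
    "oslo": make_site([("o1", "A Doll's House", "Ibsen")]),
}
rows = query_all_sites(
    sites,
    "SELECT identifier, title FROM item WHERE creator = ? ORDER BY identifier",
    ("Ibsen",),
)
```

No query translation is needed here, which is why such systems are comparatively easy to query. The cost is that a new component can join only if it adopts the shared schema.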

10.2.2 Heterogeneous systems
Heterogeneous multi-DB systems are predominantly loosely coupled in the sense that they are 'constructed' as an integration of existing heterogeneous systems, each of which has been independently designed and implemented and is in use for a local application set. Independent design practically ensures that there will be semantic heterogeneity. A single global schema may exist; alternatively, there may be a set of federated schemas. In either case, the integration schema is constructed through the union of the schemas for the participating databases. A synonym table and a thesaurus may be constructed to support single query access to the multiple component databases. The objective of the global or federated schema is to hide the diversity of structure, location and naming conventions used in the component schemas.

Examples include the database set resulting from the merger of organizations or a "union catalog" digital library system, in which the union catalog functions as the global schema and participating libraries maintain existing on-line systems (OPACs).
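
The role of the synonym table mentioned above can be illustrated with a small sketch. It assumes two hypothetical component databases (museum_db and library_db) whose local attribute and table names differ, and it rewrites a query stated in global-schema terms into parameterized SQL for one component. All names in the tables below are invented for the example.

```python
# Synonym table: global attribute -> local attribute name in each component DB.
# All database, table, and attribute names are hypothetical.
SYNONYMS = {
    "creator": {"museum_db": "artist", "library_db": "author"},
    "title": {"museum_db": "work_title", "library_db": "title"},
}
LOCAL_TABLES = {"museum_db": "artwork", "library_db": "book"}


def rewrite_query(select_attrs, where_attr, db):
    """Rewrite a global-schema query as parameterized SQL for one component DB."""
    # Alias each local column back to its global name so results line up.
    cols = ", ".join(f"{SYNONYMS[a][db]} AS {a}" for a in select_attrs)
    return (f"SELECT {cols} FROM {LOCAL_TABLES[db]} "
            f"WHERE {SYNONYMS[where_attr][db]} = ?")


museum_sql = rewrite_query(["title", "creator"], "creator", "museum_db")
library_sql = rewrite_query(["title", "creator"], "creator", "library_db")
```

Each rewritten statement returns columns under the global names, so the results from different components can be merged directly. A thesaurus would play the corresponding role for data values, which this sketch does not cover.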

Another proposal is to use an integration of the metadata for participating media databases as the global access point, as demonstrated in the DelaunayMM system; see Figure 1 in (Cruz,I., 1999). This system supports both visual and text queries to Web documents using a combination of image and text-based information retrieval with distributed (relational) database management.

Using a similar approach, and if we consider a Web-site to be a database, then search engines also support multi-database location. In this approach, search engine crawlers retrieve and index Web pages (up to 6 billion in 2005 for Google). The resulting indexes function as a global access point (or portal) to websites containing terms matching the query search terms.
One can argue whether the source data accessed by search engines really constitute a set of databases, or whether the system 'simply' consists of a huge set of disparate Web pages for which the search engine's crawler has made a term location index that facilitates a keyword-based query processor for Web page/site location.
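
The term location index described above is commonly realized as an inverted index. The following is a minimal sketch of the idea, not a description of how any particular search engine is built. The page URLs and texts are hypothetical, tokenization is reduced to lower-cased ASCII letters and digits, and a query matches a page only if the page contains every query term.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Very simple tokenizer: lower-cased runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs of pages containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches)


# Hypothetical crawled pages.
pages = {
    "http://example.org/a": "Viking ships in Oslo",
    "http://example.org/b": "Oslo museum guide",
    "http://example.org/c": "Viking museum, Oslo",
}
index = build_index(pages)
```

With this index, search(index, "oslo viking") locates pages a and c without touching the page texts again. The index says where a term occurs, not what the page means, which is the point of the argument above.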

10.2.3 Interoperable systems
Disjoint or language-based systems are very loosely connected. They have no, or at best a very primitive, locally stored 'global' schema that defines the location and access paths to cooperating database systems. These systems have an extended query language processor that can access the local DB schema/metadata of cooperating systems and use domain ontologies to map a user query to relevant databases and documents within these. An alternative to installing a cooperative query processor at each location is to use agents for query interpretation and data retrieval (Hendler,J., 2001).
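
As a rough illustration of using a domain ontology to map a user query to relevant databases, the sketch below reduces the ontology to a table of concepts with narrower terms, and describes each cooperating database by the concepts it covers. The concepts, terms, and database names are hypothetical, and a real ontology (for example one expressed in OWL) would carry far richer relationships.

```python
# A toy domain ontology: concept -> narrower terms (all hypothetical).
ONTOLOGY = {
    "vessel": {"ship", "boat", "longship"},
    "building": {"church", "house"},
}
# Concepts that each cooperating database declares it covers (hypothetical).
DB_CONCEPTS = {
    "maritime_museum": {"vessel"},
    "architecture_archive": {"building"},
}


def route(query_terms):
    """Return the databases whose declared concepts match the query terms."""
    wanted = set()
    for term in query_terms:
        term = term.lower()
        for concept, narrower in ONTOLOGY.items():
            if term == concept or term in narrower:
                wanted.add(concept)
    return sorted(db for db, covered in DB_CONCEPTS.items() if covered & wanted)
```

For example, route(["longship"]) selects only the maritime museum database, although the word "longship" appears nowhere in that database's description. The ontology supplies the link from the query term to the concept.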

The W3C has also taken a language-based approach to data integration in their development of tools for the semantic web, which include XML, DTD, RDF schemas, and OWL for specifying Web ontologies. The latter are the key to 'understanding' Web data and to its integration. Much of W3C's focus in this area has been on developing tag-based tools to facilitate exchange of Web based data, 'on line' or extracted from underlying databases, from one application to another - so called "peer-to-peer" communication. The strategy used is to 'package' each data element within a tag set defined by XML and its DTD or RDF schema that is accessible to both the sender and receiver. The primary application area has been that of e-commerce. To date, less focus has been placed on access to multiple heterogeneous underlying databases. Note that this approach does not (yet) support search by media content.
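
The 'packaging' strategy can be sketched with Python's standard xml.etree.ElementTree module. In this illustration a plain tuple of tag names (SHARED_TAGS) stands in for the DTD or RDF schema that sender and receiver share. The tag names, the item wrapper element, and the sample record are hypothetical, and no real DTD validation is performed.

```python
import xml.etree.ElementTree as ET

# Stand-in for the tag set that a shared DTD or RDF schema would define.
SHARED_TAGS = ("title", "creator", "price")


def package(record):
    """Sender side: wrap each data element in its agreed tag."""
    root = ET.Element("item")
    for tag in SHARED_TAGS:
        ET.SubElement(root, tag).text = str(record[tag])
    return ET.tostring(root, encoding="unicode")


def unpack(xml_text):
    """Receiver side: read back the elements whose tags are in the shared set."""
    root = ET.fromstring(xml_text)
    # Element content arrives as text; typing it is left to the receiver.
    return {child.tag: child.text for child in root if child.tag in SHARED_TAGS}


message = package({"title": "Hunger", "creator": "Hamsun", "price": 120})
received = unpack(message)
```

The receiver can interpret the message only because both sides agree on what the tags mean. The tags identify the data elements but say nothing about media content, which is consistent with the note above.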

10.3 Synonym Identification and Resolution Strategies

The central problem in working with or creating multi-database systems is that of identifying and resolving the semantic heterogeneity that exists between the component databases (Elmagarmid,A., Rusinkiewicz,M. and Sheth,A. ed., 1999; Nordbotten, J.C., 1988; Sheth,A. and Larson,J., 1990). Semantic heterogeneity exists whenever databases are designed independently, over time, by different design teams and/or in different organizations. In structured (relational) databases, it is represented by differing attribute names and structures used to model the same data and/or concepts in different systems. It is formalized, and to some degree recognizable, in the individual data models and schemas used to implement the set of component databases.

Semantic heterogeneity also exists between collections of semi-structured and unstructured data, such as between different XML document collections and image or text metadata, as well as between Web services and ontologies (Halevy, 2005; Jacobsen, 2005). Thus, semantic heterogeneity exists whenever there is more than one way to structure a data collection and is a problem whenever one wants integrated access to multiple data collections.

10.3.1 Synonym identification

under construction

10.3.2 Synonym resolution

under construction

10.3.3 The auxiliary schema

under construction

10.4 Query Processing in Multiple Multimedia Systems

The tools needed to facilitate access to Web-data include those that are familiar to database management, i.e. data description (specifying metadata), indexing, search and retrieval, and presentation. Thus it should surprise no one that known tools from both traditional SQL3 and Information Retrieval Systems are being adapted for use in the multi-DB environment of the Web.

under construction

10.5 Response Merging

under construction

10.6 Current status of MMIRS

Though multi-database management has been a research area for 30+ years, only tightly coupled systems are well understood and even these still need to be hand-crafted. There have been numerous research projects and prototypes for the loosely coupled and language-based systems, but no viable general system has yet evolved. One approach, at least to study the problems, could be to define new SQL3 functions that define a multi-database set and search it. That is, we could design a new multi-database extension (Stonebraker, 1999). This approach would allow utilization of the basic SQL processor of the host or-dbms.

As we have learned, OR-DBMS technology can be a powerful tool for development of database systems for organizations that have a combination of structured (relational) and multimedia data. Current or-dbms systems also support Web applications.
The question is whether or-dbms technology can also be used as a foundation for The Semantic Web, as envisioned by Tim Berners-Lee, James Hendler and Ora Lassila (2001). For this, some form of multi-database management (or at least access support) must be provided. As far as I know, current or-dbms systems do not provide multi-DB support. However, it should be possible to use the UDT/UDF support to construct a multi-DB module.
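
To make the UDF idea a little more concrete, the sketch below registers a user-defined function with a host database and lets that function search a set of member databases. It is only an analogy under loud assumptions: SQLite (through Python's sqlite3 module) stands in for the host or-dbms, the function name mdb_hits, the media table, and the member databases are hypothetical, and the result is flattened into one string because this interface offers scalar, not table-valued, functions.

```python
import sqlite3


def make_member(titles):
    """Create one member DB; in-memory SQLite stands in for a remote system."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE media (title TEXT)")
    conn.executemany("INSERT INTO media VALUES (?)", [(t,) for t in titles])
    return conn


# The multi-database set (hypothetical members and contents).
MEMBERS = {
    "museum": make_member(["Viking ship", "Stave church"]),
    "news": make_member(["Viking hoard found", "Election night"]),
}


def mdb_hits(term):
    """UDF body: search every member DB, return 'db:title' entries as one string."""
    found = []
    for name, conn in MEMBERS.items():
        cursor = conn.execute(
            "SELECT title FROM media WHERE title LIKE ? ORDER BY title",
            (f"%{term}%",),
        )
        found.extend(f"{name}:{title}" for (title,) in cursor)
    return "; ".join(found)


# Register the function with the host DB so it can be called from SQL.
host = sqlite3.connect(":memory:")
host.create_function("mdb_hits", 1, mdb_hits)
result = host.execute("SELECT mdb_hits(?)", ("viking",)).fetchone()[0]
```

The host's own SQL processor parses and runs the statement, and only the multi-DB search lives in the extension. A real module would still have to deal with the schema integration and synonym resolution problems discussed earlier in the chapter.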