Computer Science Notes: RDBMS

Data manipulation and information processing have become the major tasks of any organization,

small or big, whether it is an educational institution, government concern, scientific, commercial

or any other. It is the plural of a Greek word datum, which means any raw facts, or figure like

numbers, events, letters, transactions, etc, based on which we cannot reach any conclusion. It

can be useful after processing, e.g. 78, it is simply a number (data) but if we say physics (78) then

it will becomes information. It means somebody got distinctional marks in physics. Information is

processed data. The user can take decision based on information.

An organization is only a mechanism for processing information and considers that the traditional

management of information can be viewed in the context of information and process. The

manager may be considered as a planning and decision center. Established routes of information

flow are used to determine the effectiveness of the organization in achieving its objectives. Thus,

information is often described as the key to success in business.

In essence a database is nothing more than a collection of information that exists over a long

period of time, often many years. In common parlance, the term database refers to a collection of

data that is managed by a DBMS.

The DBMS is expected to:

1. Allow users to create new databases and specify their schema (logical structure of the data),

using a specialized language called a data-definition language.

Data Processing Information

RDBMS

2. Give users the ability to query the data (a \query" is database lingo for a question about the

data) and modify the data, using an appropriate language, often called a query language or datamanipulation

language.

3. Support the storage of very large amounts of data many gigabytes or more | over a long period

of time, keeping it secure from accident or unauthorized use and allowing efficient access to the

data for queries and database modifications.

4. Control access to data from many users at once, without allowing the actions of one user to

accept other users and without allowing simultaneous

A database is a collection of related data or operational data extracted from any firm or

organization. For example, consider the names, telephone number, and address of people you

know. You may have recorded this data in an indexed address book, or you may have stored it on

a diskette, using a personal computer and software such as Microsoft Access of MS Office or

ORACLE, SQL SERVER etc.

A Database Management System (DBMS) is a computer program you can use to store, organize

and access a collection of interrelated data. The collection of data is usually referred to as the

database. The primary goal of a DBMS is to provide a convenient and effective way to store and

retrieve data from the database. There are several types of data models (a data model is used to

describe the structure of a database) and Empress is a Relational Database Management System

(RDBMS) with Object Oriented extensions. Empress is capable of managing data in multiple

databases. The data stored in each database is organized as tables with rows and columns. In

relational database terminology, these tables are referred to as relations, rows are referred to as

records, and columns are referred to as attributes.

A database management system, therefore, is a combination of hardware and software

that can be used to set up and monitor a database, and can manage the updating and retrieval of

database, and can manage that can be used to set up and monitor a database, and can manage

the updating and retrieval of database that has been stored in it.

Most database management systems have the following facilities:

a) Creating of a file, addition to data, modification of data; creation, addition and deletion of

entire files.

b) Retrieving data collectively or selection.

c) The Data stored or indexed at the user’s discretion and direction.

d) Various reports can be produced from the system.

e) Mathematically function can be performed and the data will be stored in the database can be

manipulated with these functions.

f) To maintain data integrity and database use.

The DBMS response to a query by involving the appropriate subgroups, each of which performs

its special functions to interpret the query, or to locate the desired data in the database and

present it in the desired order.

As already mentioned, a database consists of a group of related files of different record types, and

the database allows users to access data anywhere in the database without the knowledge of how

data are actually organized on the storage device.

1.2 TRADITIONAL FILE ORIENTED APPROACH

Queries

COBOL/PL

DBMS OS Data

Base

RDBMS

The traditional file-oriented approach to information processing has for each application a

separate master file and its own set of personal files. An organization needs flow of information

across these applications also and this requires sharing of data, which is significantly lacking in

the traditional approach. One major limitations of such a file-based approach is that the

programs become dependent on the files and the files become dependent upon the programs.

Disadvantages

· Data Redundancy: The same piece of information may be stored in two or more files. For

example, the particulars of an individual who may be a customer or client may be stored

in two or more files. Some of this information may be changing, such as the address, the

payment maid, etc. It is therefore quite possible that while the address in the master file

for one application has been updated the address in the master file for another application

may have not been. It may be not easy to even find out as to in how many files the

repeating items such as the name occur.

· Program/Data Dependency: In the traditional approach if a data field is to be added to a

master file, all such programs that access the master file would have to be changed to

allow for this new field which would have been added to the master record.

· Lack of Flexibility: In view of the strong coupling between the program and the data,

most information retrieval possibilities would be limited to well-anticipated and predetermined

requests for data, the system would normally be capable of producing

scheduled records and queries which it has been programmed to create.

1.3 MOTIVATION FOR DATABASE APPROACH

Having pointed out some difficulties that arise in a straightforward file-oriented approach towards

information system development. The work in the organization may not require significant sharing

of data or complex access. In other words the data and the way it is used in the functioning of the

organization are not appropriate to database processing. Apart from needing a more powerful

hardware platform, the software for database management systems is also quite expensive. This

means that a significant extra cost has to be incurred by an organization if it wants to adopt this

approach.

Advantages gained by the possibility of sharing of the data with others, also carries with it the

risk of unauthorized access of data. This may range from violation of office procedures to violation

of privacy rights of information to down right thefts. The organizations, therefore, have to be ready

to cope with additional managerial problems.

A database management processing system is complex and it could lead to a more inefficient

system than the equivalent file-based one.

The use of the database and its possibility of being shared will, therefore affect many departments

within the organization. If die integrity of the data is not maintained, it is possible that one

relevant piece of data could have been used by many programs in different applications by

different users without they are being aware of it. The impact of this therefore may be very

widespread. Since data can be input from a variety sources, the control over the quality of data

become very difficult to implement.

However, for most large organization, the difficulties in moving over to a database approach are

still worth getting over in view of the advantages that are gained, namely, avoidance of data

duplication, sharing of data by different programs, greater flexibility and data independence.

1.4 The Evolution of Database Systems

Early days: database applications built on top of file systems. Drawbacks of using file systems to

store data:

· Data redundancy and inconsistency

· Difficulty in accessing data

· Atomicity of updates

RDBMS

· Concurrency control

· Security

· Data isolation — multiple files and formats

· Integrity problems

RDBMS

1.5 DATABASE BASICS

Since the DBMS of an organization will in some sense reflect the nature of activities in the

organization, some familiarity with the basic concepts, principles and terms used in the field are

important.

· Data-items: The term data item is the word for what has traditionally been called the field

in data processing and is the smallest unit of data that has meaning to its users. The

phrase data element or elementary item is also sometimes used. Although the data item

may be treated as a molecule of the database, data items are grouped together to form

aggregates described by various names. For example, the data record is used to refer to a

group of data items and a program usually reads or writes the whole records. The data

items could occasionally be further broken down into what may be called an automatic

level for processing purposes.

· Entities and Attributes: The real world would consist of occasionally a tangible object such

as an employee; a component in an inventory or a space or it may be intangible such as an

event, a job description, identification numbers, or an abstract construct. All such items

about which relevant information is stored in the database are called Entities. The

qualities of the entity that we store as information are called the attributes. An attribute

may be expressed as a number or as a text. It may even be a scanned picture, a sound

sequence, and a moving picture that is now possible in some visual and multi-media

databases.

Data processing normally concerns itself with a collection of similar entities and records

information about the same attributes of each of them. In the traditional approach, a

programmer usually maintains a record about each entity and a data item in each record

relates to each attribute. Similar records are grouped into files and such a 2-dimensional

array is sometimes referred to as a flat file.

· Logical and Physical Data: One of the key features of the database approach is to bring

about a distinction between the logical and the physical structures of the data. The term

logical structure refers to the way the programmers see it and the physical structure refers

to the way data are actually recorded on the storage medium. Even in the early stages of

records stored on tape, the length of the inter-record tape requires that many logical

records be grouped into one physical record to several storage places on tape. It was the

software, which separated them when used in an application program, and combined them

again before writing back on tape. In today's system the complexities are even greater and

as will be seen when one is referring to distributed databases that some records may

physically be located at significantly remote places.

· Schema and Subschema: Having seen that the database does not focus on the logical

organization and decouples it from the physical representation of data, it is useful to have

a term to describe the logical database description. A schema is a logical database

description and is drawn as a chart of the types of data that are used. It gives the names of

the entities and attributes, and specifies the relationships between them. It is a framework

into which the values of the data item can be fitted. Like an information display system

such as that giving arrival and departure time at airports and railway stations, the schema

will remain the same though the values displayed in the system will change from time to

time. The relationships that has specified between the different entities occurring in the

schema may be a one to one, one to many, many to many, or conditional.

The term schema is used to mean an overall chart of all the data item types and recordtypes

stored in a database. The term sub schema refers to the same view but for the dataitem

types and record types which a particular user uses in a particular application or.

Therefore, many different sub schemas can be derived from one schema.

· Data Dictionary: It holds detailed information about the different structures and data

types: the details of the logical structure that are mapped into the different structure,

details of relationship between data items, details of all users privileges and access rights,

performance of resource with

RDBMS

· details.

1.6 THREE VIEWS OF DATA

DBMS is a collection of interrelated files and a set of programs that allow several users to access

and modify these files. A major purpose of a database system is to provide users with an abstract

view of the data. However, in order for the system to be usable, data must be retrieved efficiently.

The concern for efficiently leads to the design of complex data structure for the representation of

data in the database. By defining levels of abstract as which the database may be viewed, there

are logical view or external, conceptual view and internal view or physical view.

· External view: This is the highest level of abstraction as seen by a user. This level of

abstraction describes only the part of entire database.

· Conceptual view: This is the next higher level of abstraction which is the sum total of Data

Base Management System user's views. In this we consider; what data are actually stored

in the database. Conceptual level contains information about entire database in terms of a

small number of relatively simple structures.

· Internal level: The lowest level of abstraction at which one describes how the data are

physically stored. The interrelationship of any three levels of abstraction is illustrated in

figure 2.

Fig: The three views of data

To illustrate the distinction among different views of data, it can be compared with the concept of

data types in programming languages. Most high level programming language such as C, VC++,

etc. support the notion of a record or structure type. For example in the ‘C’ language we declare

structure (record) as follows:

struct Emp{

char name [30];

RDBMS

char address [100];

}

This defines a new record called Emp with two fields. Each field has a name and data type

associated with it.

In an Insurance organization, we may have several such record types, including among others:

-Customer with fields name and Salary

-Premium paid and Due amount at what date

-Insurance agent name and salary + Commission

At the internal level, a customer, Premium account, or employee (insurance agent) can be

described as a sequence of consecutive bytes. At the conceptual level each such record is

described by a type definition, illustrated above and also die interrelation among these record

types is defined and describing the rights or privileges assign to individual customer or end-users.

Finally at the external level, we define several views of the database. For example, for preparing

the Insurance checks of Customer_details’, only information about them is required; one does not

need to access information about customer accounts. Similarly, tellers can access only account

information. They cannot access information concerning about the premium paid or amount

received.

1.7 The Three Level Architecture Of DBMS

A database management system that provides these three levels of data is said to follow threelevel

architecture as shown in fig. . These three levels are the external level, the conceptual level,

and the internal level.

Fig : The three level architecture for a DBM

A schema describes the view at each of these levels. A schema as mentioned earlier is an outline

or a plan that describes the records and relationships existing in the view. The schema also

describes the way in which entities at one level of abstraction can be mapped to the next level.

The overall design of the database is called the database schema. A database schema includes

such information as:

· Characteristics of data items such as entities and attributes

· Format for storage representation

· Integrity parameters such as physically authorization and backup politics.

· Logical structure and relationship among those data items

The concept of a database schema corresponds to programming language notion of type

definition. A variable of a given type has a particular value at a given instant in time. The concept

of the value of a variable in programming languages corresponds to the concept of an instance of

a database schema.

RDBMS

Since each view is defined by a schema, there exists several schema in the database and these

exists several schema in the database and these schema are partitioned following three levels of

data abstraction or views. At the lower level we have the physical schema; at the intermediate

level we have the conceptual schema, while at the higher level we have a subschema. In general,

database system supports one physical schema, one conceptual schema, and several subschemas.

1.7.1 External Level or Subschema

The external level is at the highest level of database abstraction where only those portions of the

database of concern to a user or application program are included. Any number of user views may

exist for a given global or conceptual view.

Each external view is described by means of a schema called an external schema or subschema.

The external schema consists of the, definition of the logical records and the relationships in the

external view. The external schema also contains the method of deriving the objects in the

external view from the objects in the conceptual view. The objects include entities, attributes, and

relationships.

1.7.2 Conceptual Level or Conceptual Schema

One conceptual view represents the entire database. The conceptual schema defines this

conceptual view. It describes all the records and relationships included in the conceptual view

and, therefore, in the database. There is only one conceptual schema per database. This schema

also contains the method of deriving the objects in the conceptual view from the objects in the

internal view.

The description of data at this level is in a format independent of its physical representation. It

also includes features that specify the checks to retain data consistency and integrity.

1.7.3 Internal Level or Physical Schema

It indicates how the data will be stored and describes the data structures and access methods to

be used by the database. The internal schema, which contains the definition of the stored record,

the method of representing the data fields, expresses the internal view and the access aids used.

1.7.4 Mapping between different Levels

Two mappings are required in a database system with three different views. A mapping between

the external and conceptual level gives the correspondence among the records and the

relationships of the external and conceptual levels.

a) EXTERNAL to CONCEPTUAL: Determine how the conceptual record is viewed by the user

b) INTERNAL to CONCEPTUAL: Enable correspondence between conceptual and internal levels. It

represents how the conceptual record is represented in storage.

An internal record is a record at the internal level, not necessarily a stored record on a physical

storage device. The internal record of figure 3 may be split up into two or more physical records.

The physical database is the data that is stored on secondary storage devices. It is made up of

records with certain data structures and organized in files. Consequently, there is an additional

mapping from the internal record to one or more stored records on secondary storage devices.

1.8 Relational Database Systems

The relational model, invented by IBM researcher Ted CODD in 1970, wasn't turned into a

commercial product until almost 1980. Since then database systems based on the relational

model, called relational database management systems or RDBMS, have come to dominate the

database software market. Today few people know about any other kind of database management

system.

Few RDBMS implement the relational model completely. Although commercial RDBMS have a lot

in common, each system has quirks and non-standard extensions. You must understand

relational theory to correctly design a database -just learning a particular RDBMS won't get you

all the way there.

A good RDBMS and a well-designed relational database give you some important benefits:

RDBMS

· Data integrity and consistency maintained and/or enforced by the RDBMS.

· Redundant data eliminated or kept to a practical minimum.

· Data retrieved by unique keys.

· Relationships expressed through matching keys.

· Physical organization of data managed by RDBMS.

· Optimization of storage and database operation execution times.

1.9 Data Models

Collections of conceptual tools for describing data, data relationships, data semantics and

consistency constraints. The various data models that have been proposed fall into three different

groups. Object based logical models, record-based logical models and physical models.

Object-Based Logical Models: They are used in describing data at the logical and view levels. They

are characterized by the fact that they provide fairly flexible structuring capabilities and allow

data constraints to be specified explicitly. There are many different models and more are likely to

come. Several of the more widely known ones are:

· The E-R model

· The object-oriented model

· The semantic data model

· The functional data model

Data Modeling Using E-R Model

The (E-R) data model is based on a perception of a real worker that consists of a collection of basic

objects, called entities, and of relationships among these objects.

The overall logical structure of a database can be expressed graphically by an E-R diagram. Which

is built up by the following components:

· Rectangles, which represent entity sets

· Ellipses, which represent attributes

· Diamonds, which represent relationships among entity sets

· Lines, which link attributes to entity sets and entity sets to relationships.

E.g. suppose we have two entities like customer and account, then these two entities can be

modeled as follow:

A sample E-R diagram

The Object-Oriented Model

Like the E-R model the object-oriented model is based on a collection of objects. An object

contains values stored in instance variables within the object. An object also contains bodies of

code that operate on the object. These bodies of code are called methods.

Customer name

Customer

Customer city Balance

Account

number

Depos

it Account

CUSTOMER (Table name)

Customer–name Customer-street Customer-City Account-number

Johnsons Alma Pala Alto A-101

Smith North Ryc A-215

Hayes Main Harrison A-102

Turner Dutnam Stanford A-305

Johnson Alma PalaAlto A-201

Jones Main Harrison A-217

Lindsay Park Pittifield A-222

Smith North Rye A-201

RDBMS

Classes- It is the collection of objects which consist of the same types of values and the same

methods.

E.g. account number & balance are instance variables; pay-interest is a method that uses the

above two variables and adds interest to the balance.

Semantic Models

These include the extended relational, the semantic network and the functional models. They are

characterized by their provision of richer facilities for capturing the meaning of data objects and

hence of maintaining database integrity. Systems based on these models exist in monotype for at

the time of writing and will begin to filter through the next decade.

Record-Based Logical Models

Record based logical models are used in describing data at the logical and view levels. In contrast

to object-based data models, they are used both to specify the overall logical structures of the

database, and to provide a higher-level description of the implementation.

Record-based models are so named because the database is structured in fixed-format records of

several types. Each record type defines a fixed number of fields, or attributes, and each field is

usually of a fixed length.

The three most widely accepted record-based data models are the relational, network, and

hierarchical models.

Relational Model

The relational model uses a collection of tables to represent both data and the relationships

among those data. Each table has multiple columns, and each column has a unique name as

follows:

A sample relational database

Network Model

Data in the network model are represented by collections of records, and relationship among data

is represented by links, which can be viewed as pointers. The records in the database are

organised as collections of arbitrary graphs. Such type of database is shown in the figure given

bellow.

Johnsons Alma Pala Alto A-101 500

Smith North Rye A-215 700

Hayes Main Harrison A-102 400

A-222 700

RDBMS

Turner Dutnam Stanford A-305 350

Jones Main Harrison A-201 900

Lindsay Park Pittifield A-217 750

A sample network database

Hierarchical Model

The hierarchical model is similar to the network model in the sense that data and relationships

among data one represented by records and links, respectively. It differs from the network model

in that records are organised as collections of trees rather than arbitrary graphs.

A sample Hierachical database

Physical Data Models

Physical data models are used to describe data at the lowest level. In contrast to logical data

models, there are few physical data models in use. Two of the widely known ones are the unifying

model and the frame-memory model.

10. Database Languages

A-222 700

A-217 350

Johnson customer

street -------

Smith North -------

Hayes Main -------

Turner Putnam -------

Jones Main -------

Lindsay Park -------

A-101 500

A-201 900

A-215 700

A-201 900

A-102 400

A-305 350

CUSTOMER

RDBMS

A database system provides two different types of languages: one to specify the database schema,

and the other to express database queries and updates.

1.10 Database Languages

Data Definition Language (DDL)

A database schema is specified by a set of definition expressed by a special language called a

data-definition language (DDL). The result of compilation of DDL statement is a set of tables that

is stored in a special file called data dictionary, or data directory.

A data dictionary is a file that contain metadata-that is about data. This file is consulted before

actual data are read or modified in the database system.

The storage structure and access methods used by the database system are specified by as set of

definitions in a special type of DDL called a data storage and definition language. The result of

compilation of these definitions is a set of instructions to specify the implementation details of the

database schema - details are usually hidden from the users.

Data Manipulation Language

By data manipulation, we mean

· The retrieval of information stored in the database.

· The insertion of new information into the database.

· The deletion of information stored in the database.

A data-manipulation language (DML) is a language that enables user to access or manipulate data

as organised by the appropriate data model.

There are basically two types:

Procedural DML

It requires a user to specify what data are needed and how to get those data.

Nonprocedural DML

It requires a user to specify what data are needed without specifying how to get those data.

Normalization

About relational databases, you probably know about normalization. The process of normalization

transforms data into forms that conform to the relational model. Normalized data enables the

RDBMS to enforce integrity rules, guarantee consistency, and optimize database access. Learning

how to normalize data takes significant time and practice. Data modelers spend a lot of time

understanding the meaning of data so they can properly normalize it, but programmers frequently

downplay normalization, or dismiss it outright as an academic problem. Most databases come

from power users and programmers, not data modelers, and most databases suffer from unnormalized

data, redundancy, integrity and performance problems. Un-normalized databases

usually need a lot of application code to protect the database from corruption.

1.11 Client-Server and Multi-Tier Architectures

Client/Server Technology

Client/server technology is the computer architecture used in almost all automated library

systems now being offered to libraries. The simple definition is:

Client/server is a computer architecture that divides functions into client (requestor) and server

(provider) subsystems, with standard communication methods (such as TCP/IP and z39.50) to

facilitate the sharing of information between them. Among the characteristics of a client/server

architecture are the following:

RDBMS

· The client and server can be distinguished from one another by the differences in tasks they

perform

· The client and server usually operate on different computer platforms

· Either the client or server may be upgraded without affecting the other. Clients may connect

to one or more servers; servers may connect to multiple clients concurrently.

· Clients always initiate the dialogue by requesting a service.

Client/server is most easily differentiated from hierarchical processing, which uses a host and

slave, by the way a PC functions within a system. In client/server the PC-based client

communicates with the server as a computer; in hierarchical processing the PC emulates a

"dumb" terminal to communicate with the host. In client/server the client controls part of the

activity, but in hierarchical processing the host controls all activity. A client PC almost always

does the following in a client/server environment: screen handling, menu or command

interpretation, data entry, help processing, and error recovery.

The dividing line between the client and a server can be anywhere along a broad continuum: at

one end only the user interface has been moved onto the client; at the other, almost all

applications have been moved onto the client and the database may be distributed. There are at

least five points along the continuum:

Distributed presentation:

The presentation is handled partly by the server and partly by the client.

Remote presentation:

The presentation is controlled and handled entirely by the client.

Distributed logic:

The application logic is handled partly by the server and partly by the client.

Remote data management:

Database management is controlled and handled entirely by the server.

Distributed database:

Database management is handled partly by the server and partly by the client. There are,

therefore, two major applications for client/server in a library environment:

1) as the architecture for an automated library system, and

2) as an approach to linking heterogeneous systems.

In the first application, a vendor designs a system using client/server architecture to facilitate use

of that system to access multiple servers, to facilitate bringing together multiple product lines,

and/or to improve productivity. In the second application, a vendor designs a client to facilitate

RDBMS

transparent access to systems of other vendors, and a server to facilitate transparent access to its

system from others. While the underlying principles are the same, the vendor has considerable

latitude in the design of its own client/server system, but must strictly conform to standards

when using client/server to link its system with those of other libraries.

While it has been possible to access a wide variety of electronic resources through an automated

library system for a number of years, client/server technology has made it possible to tailor the

user interface to provide a personalized interface which meets the needs of any particular user

based on an analysis of tasks performed or on an individual's expressed preferences. An example

of this tailoring is the recent introduction of portals, common user interfaces to a wide variety of

electronic resources with the portal. [See the Tech Note on Portal Technology]. The portal can be

tailored to groups of staff or patrons, or to each individual.

Vendors with multiple product lines can build a single client to work with any of their server

products. This substantially reduces development costs. Client/server can also improve

productivity. Many vendors are now offering different clients for technical services, circulation,

and patron access catalog applications.

A GUI (graphical user interface) -- a presentation of information to the user using icons and other

graphics -- is sometimes called client/server, but unless information moves from the server to the

client in machine-readable (raw) form, and the client does the formatting to make it humanreadable,

it is not true client/server. Further, there is nothing in the client/server architecture

that requires a GUI. Nevertheless, most vendors of automated library systems use GUI for staff

applications. The GUIs are proprietary to each vendor. Web browsers are preferred for patron

applications because they are more likely to be familiar to them than a proprietary GUI.

An important computer industry development which has facilitated client/server architecture is

referred to as "open systems" C a concept which features standardized connectivity so that

components from several vendors may be combined. The trend to open systems began in the

1970s as a reaction against proprietary systems that required that all hardware and system

software come from a single source, and gained momentum in the 1980s, as networking became

common. While various parts of an organization might not hesitate to purchase proprietary

systems to meet their own needs, the desire to provide access from other parts of the organization,

or to exchange information, would be an incentive to select an open system. For client/server,

open systems are essential.

Most client/server systems offered by automated library system vendors use an open operating

system such as UNIX or one of its variations, or Windows NT or 2000 server. UNIX is the most

popular operating system for servers because of the large range of platform sizes available, but

Windows NT or 2000 server has been growing in popularity, especially for systems supporting

fewer than 100 concurrent users.

The most popular client operating systems are Windows 95/98/Me/2000 and Linux. By

supporting multiple client operating systems, a vendor of an automated library system makes it

possible for the client to conform to a staff member’s or patron's accustomed operating system

environment.

Almost all client/server systems use a relational database management system (RDBMS) for

handling the storage and retrieval of records in the database using a series of tables of values.

There is a common misconception that client/server is synonymous with networked SQL

(Structured Query Language) databases. SQL, a popular industry-standard data definition and

access language for relational databases, is only one approach -- albeit the one selected by almost

all automated library system vendors. While one can reasonably expect the use of an RDBMS and

SQL, the absence of either does not mean that a system is not client server.

A network computer is a PC without a hard disk drive. They have been little used in

libraries because most libraries use their PCs for a variety of applications in addition to accessing

the automated library system. Even when there are applications that lend themselves for use on

thin clients, most libraries have preferred to use older PCs that are no longer suitable for

applications that require robust machines. They use a two-tier PC strategy that involves the

purchase and deployment of new PCs for applications that require robust machines and

redeployment of the replaced machines for applications that they can support. For example, new

PCs are used for most staff applications and patron access to the Internet: older PCs are used as

“express catalogs,” devices that have a Web browser, but are limited to accessing the library’s

patron access catalog. The two-tier PC strategy can extend the life of a PC by as much as three

years.

RDBMS

Network computers are most widely used in large organizations that have to support thousands of

users. Almost all applications, including word processing, spreadsheets, and other office

applications, are loaded on a server. It is then not necessary to load new product releases on each

machine; only the applications on the server have to be updated. Most libraries do not have

enough users to realize significant savings by taking this approach. For libraries that have

hundreds of PCs, remote PC management is an alternative to thin clients; for libraries that have

fewer than 100 PCs, it is possible to support each individually.

A PDA is a handheld device that combines computing, telephone, fax, and networking features.

While originally pen-based (i.e., using a stylus), many models now come with a small keyboard.

The Palm Pilot is an example of a PDA. A number of libraries now encourage their users to access

the patron access catalog with a PDA. It is this application which holds the most promise for the

use of thin clients in libraries. As the bandwidth available for wireless applications increases and

the costs of PDAs drops, the use of PDAs for access to databases is expected to increase

dramatically.

The key to thin client technology is Java, a general purpose programming language with a number

of features that make it well suited for use on the Web. Small Java applications are called Java

applets and can be downloaded from a Web server and run on a device which includes a Javacompatible

Web browser such as Netscape Navigator or Microsoft Internet Explorer. This means

that the thin client does not need to be loaded with applications software.

A thin client can also be GUI-based. In that case, the client handles only data presentation and

deals with the user interaction; all applications and database management is found on the server.

Vendors of automated library systems favor proprietary GUI-based clients for staff because that

makes it possible to exploit the features of their systems.

1.12 Multimedia Data

The weblicon PIM has been built on a multi-tier architecture to support high scalability through

load-balancing and high availability through redundancy. The HTTP Client tier communicates

with the HTTP Server tier which routes incoming requests to the Application Server tier. The

Application Server tier connects to the Database Server tier consisting of RDBMS, LDAP and IMAP

Servers. All tiers can be scaled individually by building clusters of servers for load-balancing and

high redundancy embedding information into multimedia data that should be imperceptible but

unmemorable.

For model-based compression and animation of face models as defined in MPEG-4, a watermark

can be embedded into the transmitted facial animation parameters. The watermark can be

retrieved from the animation parameters or from video sequences rendered with the watermarked

animation parameters, even after subsequent low-quality video compression of the rendered

sequence. Entering the derived or generated data into the database and associating it with the

original information gives other users automatic access to the conclusions, thoughts, and

annotations of previous users of the information. This ability to modify, adjust, enhance, or add to

the global set of information and then share that information with others is a powerful and

important service. This type of service requires cooperation between the multimedia data

manipulation tools described above and the information repositories scattered across the network.

Generated or extracted information must be deposited and linked with existing information so

that future users will not only benefit from the original information but also from the careful

analysis and insight of previous users.

RDBMS

1.13 Information Integration

Information integration is an important feature of Oracle9i Database. Oracle provides many

features that allow customers to synchronously and asynchronously integrate their their data

including Oracle Streams, Oracle Advanced Queuing, replication, and distributed SQL.

Oracle also provides gateways to non-Oracle database servers, providing seamless interoperation

in heterogeneous environments.

Asynchronous Integration with Oracle Streams

Oracle Streams enables the propagation and management of data, transactions and events in a

data stream either within a database, or from one database to another. The stream routes

published information to subscribed destinations. The result is a new feature that provides

greater functionality and flexibility than traditional solutions for capturing and managing events,

and sharing the events with other databases and applications. Oracle Streams provides both

message queuing and replication functionality within the Oracle database. To simplify

deployment of these specialized information integration solutions, Oracle provides customized

APIs and tools for message queuing, replication, and data protection solutions. These include:

· Advanced Queuing for message queuing

· Replication with Oracle Streams

Asynchronous Integration with Oracle Advanced Replication

Oracle9i also includes Advanced Replication. This is a powerful replication feature first

introduced in Oracle7. Advanced Replication is useful for replicating object between Oracle9i

Database and older versions of the Oracle Server (that do not support Oracle Streams-based

Replication). Advanced Replication also provides Oracle Materialized Views; a replication solution

targeted for mass deployments and disconnected replicas.

Synchronous Integration with Distributed SQL

Oracle Distributed SQL enables synchronous access from one Oracle database, to other Oracle or

non-Oracle databases. Distributed SQL supports location independence, making remote objects

appear local, facilitating movement of objects between databases without application code

RDBMS

changes. Distributed SQL supports both queries and DML operations, and can intelligently

optimize execution plans to access data in the most efficient manner.

1.14 Data-Definition Language Commands

Empress provides a set of Structured Query Language (SQL) commands that allows users to

request information from the database. The Empress SQL language has three levels of operations:

1. A basic level at which data management commands are typed in without any prompting.

The full range of data management commands is available at this level.

2. Data Definition Language commands are concerned with the structure of the database and

its tables.

3. Data Manipulation Language commands are concerned with the maintenance and retrieval

of the data stored in the database tables.

4. Data Control Language commands provide facilities for assuring the integrity and security

of the data.

Data Definition Language Command

The overall structure of the database is defined by the Data Definition Language (DDL)

commands. The result of compiling DDL commands is a set of tables called the data dictionary

which contains information about the data in the database. Detailed descriptions of the Empress

DDL commands are provided in the Empress SQL: Reference manual. A summary of the available

DDL commands follows:

Command Description

ALTER TABLE Changes the structure of an existing table without having to dump and reload

its records. This also includes enable/disable trigger, define table type for

replication, enabling and disabling replication relations, and setting

checksum for a table.

CREATE COMMENT Attaches a comment to a table or attribute.

CREATE INDEX Sets up a search-aiding mechanism for an attribute.

CREATE MODULE Creates the definition of a persistent stored module into the data dictionary.

CREATE RANGE

CHECK Sets up data validation checks on an attribute.

CREATE

REFERENTIAL Sets up data referential constraints on attributes.

CREATE

REPLICATION

MASTER

Assings replication master entries to a replication table.

CREATE

REPLICATION

REPLICATE

Assings replication replicate entries to a replication table.

CREATE REPLICATE

TABLE Creates replicate table from a replication master table.

CREATE TABLE Creates a new table or replicate table including its name and the name and

data type of each of its attributes.

CREATE TRIGGER Sets up trigger events into data dictionary.

CREATE VIEW Creates a logical table from parts of one or more tables.

DISPLAY DATABASE Shows the tables in the database.

DISPLAY GRANT

PRIVILEGE Shows privilege grant options for a table.

DISPLAY MODULE Shows the persistent stored module definition.

RDBMS

DISPLAY PRIVILEGE Shows access privileges for a table.

DISPLAY TABLE Shows the structure of a table.

DROP COMMENT Removes a comment on a table or attribute.

DROP INDEX Removes an index on an attribute.

DROP MODULE Removes a persistent stored module definition from the data dictionary.

DROP RANGE

CHECK Removes data validation checks on an attribute.

DROP

REFERENTIAL Removes a data referential constraints from attributes.

DROP REPLICATION

MASTER Removes replication master entry from a replication table.

DROP REPLICATION

REPLICATE Removes replication replicate entry from a replication table.

DROP TABLE Removes an existing table.

DROP TRIGGER Removes a trigger event from the data dictionary.

DROP VIEW Removes a logical table.

GRANT PRIVILEGE Changes access privileges for tables or attributes.

LOCK LEVEL Sets the level of locking on a table.

RENAME Changes the name of a table or attribute.

REVOKE PRIVILEGE Removes table or attribute access privileges.

UPDATE MODULE Links a persistent stored module definition with the module shared library.

1.15 Overview of Query Processing

The techniques used by a DBMS to process, optimize, and execute high-level queries. A query

expressed in a high-level query language such as SQL must first be scanned, parsed, and

validated. The scanner identifies the language tokens—such as SQL keywords, attributes names

and relation names—in the text of the query, whereas the parser checks the query syntax to

determine whether it is formulated according to the syntax rules (rules of grammar) of the query

language. The query must also be validated, by checking that all attribute and relation names are

valid and semantically meaningful names in the schema of the particular database being queried.

An internal representation of the query is then created, usually as a tree data structure called

query tree. It is also possible to represent the query using a graph data structure called a query

graph. The DBMS must then devise an execution strategy for retrieving the result of the query

from the database files. A query typically has many possible execution strategies, and the process

of choosing a suitable one for processing a query is known as query optimization.

Query Processing Strategies

The steps involved in processing a query are illustrated in Figure 5.2. The basic steps are:

1. Parsing and translation

2. Optimization

3. Evaluation

Before query processing can begin, the system must translate the query into a usable form. A

language such as SQL is suitable for human use, but is ill suited to be the system’s internal

representation of a query. A more useful internal representation is one based on the extended

relational algebra.

RDBMS

Steps in Query Processing

Thus, the first action the system must take in query processing is to translate a given query into

its internal form. This translation process is similar to the work performed by the parser of a

compiler. In generating the internal form of the query, the parser checks the syntax of the user's

query, verifies that the relation names appearing in the query are names of relations in the

database, and so on.

Query optimization is left, for the most part, to the application programmer. That choice is made

because the data-manipulation-language statements of these two models are usually embedded in

a host programming language, and it is not easy to transform a network or hierarchical query into

an equivalent one without knowledge of the entire application program. In contrast, relationalquery

languages are either declarative or algebraic. Declarative languages permit users to specify

what a query should generate without saying how the system should do the generating. Algebraic

languages allow for algebraic transformation of users' queries. Based on the query specification, it

is relatively easy for an optimizer to generate a variety of equivalent plans for a query, and to

choose the least expensive one.

RDBMS

1.16 Storage and Buffer Management

Despite its many virtues, the relational data model is a poor fit for many types of data now

common across the enterprise. In fact, object databases owe much of their existence to the

inherent limitations of the relational model as reflected in the SQL2 standard. In recent years, a

growing chorus of demands has arisen from application developers seeking more flexibility and

functionality in the data model, as well as from system administrators asking for a common

database technology managed by a common set of administrative tools. As a result, vendors and

the SQL3 standard committees to include object capabilities are now extending the relational

model.

Object/relational (O/R) database products are still quite new, and the production databases to

which they have been applied are usually modest in size--50GB or less. As O/R technology

becomes more pervasive and memory and storage costs continue to fall, however, databases

incorporating this new technology should grow to a size comparable to that of pure relational

databases. Indeed, growth in the new technology is likely if for no other reason than that much of

this new data is inherently larger than the record/field type data of traditional relational

applications.

However, while limits to the growth of individual pure relational databases have been imposed as

much by hardware evolution as by software, the limits for O/R databases will arise primarily from

software. In this article, I'll explore the implications of the architecture approaches chosen by the

principal O/R database product designers--IBM, Informix, NCR, Oracle, Sybase, and Computer

Associates--for scalability of complex queries against very large collections of O/R data. The

powerful new data type extension mechanisms in these products limit the ability of vendors to

assume as much of the burden of VLDB complexity as they did for pure relational systems.

Instead, these mechanisms impose important additional responsibilities on the designers of new

types and methods, as well as on application designers and DBAs; these responsibilities become

more crucial and complex as the size of the databases and the complexity of the queries grow.

Finally, I'll explain how parallel execution is the key to the cost effectiveness--or even in some

cases to the feasibility--of applications that exploit the new types and methods, just as with pure

relational data. In contrast to the pure relational approach, however, achieving O/R parallelism is

much more difficult.

The term "VLDB" is overused; size is but one descriptive parameter of a database, and generally

not the most important one for the issues I'll raise here. Most very large OLTP relational

databases, for example, involve virtually none of these issues because high-volume OLTP queries

are almost always short and touch few data and metadata items. In addition, these OLTP

databases are frequently created and administered as collections of semi-independent smaller

databases, partitioned by key value. In contrast, the VLDB issues discussed here arise in very

large databases accessed by queries that individually touch large amounts of data and metadata

and involve join operations, aggregations, and other operations touching large amounts of data

and metadata. Such databases and applications today are usually found in data warehousing and

data mining applications, among other places.

The VLDB environments with these attributes are characterized by many I/O operations

within a single query involving multiple complex SQL operators and frequently generating large

intermediate result sets. Individual queries regularly cross any possible key partition boundary

and involve data items widely dispersed throughout the database. For these reasons, such a

database must normally be administered globally as a single entity. As a "stake in the ground," I'll

focus on databases that are at least 250GB in size, are commonly accessed by complex queries,

require online administration for reorganization, backup and recovery, and are regularly subject

to bulk operations (such as insert, delete, and update) in single work-unit volumes of 25GB or

more. For these databases, an MPP system, or possibly a very large SMP cluster, is required.

However, many of the issues we raise will apply to smaller databases as well.

Parallel Execution of Individual Queries

In a pure relational database, success in such an environment of 250GB or more may require the

highest degree of parallelism possible in execution of each individual query. The first key to

parallel success is the structure of the SQL query language itself. In most queries, SQL acts as a

set operator language that imposes few constraints on the order of execution; that is why

RDBMS

relational database optimizers have so much flexibility in choosing parallel execution plans. Thus,

SQL as an application language is extremely hospitable to parallel execution.

Given this characterization, one must devise execution strategies that are optimized for

parallel execution of an individual query, rather than merely providing for parallel execution of

strategies optimized for serial execution. Note that individual DDL and DML operations--as well as

other utilities such as backup and recovery--must execute in parallel. As indicated earlier, within

the context of SQL2, RDBMS vendors did most of the work necessary to provide the necessary

parallelism. Furthermore, it was possible to do so in ways that were largely transparent to the

application developer and the DBA. They were greatly aided not only by the structure of SQL, but

also by the fact that the core language and data types were mostly fixed during the '80s and early

'90s, at least as insofar as they affect parallel execution. Even so, it takes a vendor at least two or

three years to evolve a mature relational product that fully supports parallelism within a query for

VLDBs.

It’s worth noting that the complexity of adding parallelism for VLDB has two sources. The first is

replacing serial algorithms with parallel ones. The second source is subtler: There is a qualitative

change of mindset that must take place if one is to design competitive software for these very large

collections of data. For example, the first reaction of a designer building a database load program

is to make it use the INSERT function. This technique works for small- and medium-sized

databases but not for giant ones; it is simply too slow. Instead, one must design for multiple

streams of data to be loaded in parallel, capitalizing on the fact that multiple CPUs will be

available on the MPP or SMP cluster. If the data is coming from a single serial source, such as a

tape drive, the initial load task should focus only on getting each data item into memory as

quickly as possible. The power of the other CPUs should then be invoked in parallel through tasks

to get each datum to the right place, to index it properly, and so on.

These problems become more complex as O/R databases enter the mainstream; new and

potentially exotic data types must be accommodated, and new methods must be written and

optimized for parallel execution. Complex relationships far beyond the simple row and column

structure of the relational model can and will exist among instances of the new data types and

even among instances of current SQL data types--in a parts-explosion database, for example, or

when a "purchase order" object view is superimposed on an existing on-order database. New data

access methods to support the new operations and data types will also be required and must also

be optimized for parallel execution. Finally, expect giant ratios of column sizes in O/R tables with

the largest column entries being hundreds to thousands of times larger than the smallest

columns within a table. This change will create a need for new approaches to storage and buffer

management in servers as well as clients.

1.17 Transaction Processing

Is the following schedule serial? serialisable? Why or why not?

T1 T2 T3

-------------------------------------------------

1 start

2 read Z

3 start

4 read Y

5 Y = Y + 1

6 write Y

7 start

8 read Y

9 read X

10 Z = Z + 1

11 X = X + 1

12 checkpoint

13 write X

14 write Z

15 Y = Y + 1

16 write Y

17 read X

RDBMS

18 read Y

19 Y = Y + 1

20 write Y

21 commit

22 X = X + 1

23 write X

24 commit

25 read W

26 W = W + 1

27 write W

28 commit

For the above sequence of transactions, show the log file entries, assuming initially Z = 1, Y = 2, X

= 1, and W = 3. What happens to the transactions at the checkpoint?

1) The following transaction increases the total sales (A) and number of sales (B) in a store

inventory database. Write the values of A and B for each statement, assuming A is initially 100

and B is initially 2. The correctness criteria is that the average sale is $50. For each statement

determine if the transaction dies during the statement whether the database is in a correct state.

What should be done to repair the damage if the database is left in an incorrect state?

Transaction Start

1) Read A

2) A  A + 50

3) Read B

4) B  B + 1

5) Write B

6) Write A

7) Transaction ends

2) Which protocols potentially cause cascading rollbacks and why?

1.18 The Query Processor

The query optimizer module has the task of producing an execution plan, and the code generator

generates the code to execute the plan. The runtime database processor has the task of running

the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime

error results, an error message is generated by the runtime database processor.

Figure: Typical Steps of a Query processor

The term optimization is actually a misnomer because in some cases the chosen execution plan is

not the optimal (best) strategy—it is just a reasonably efficient strategy for executing the query.

Finding the optimal strategy is usually too time-consuming except for the simplest of queries and

RDBMS

may require information on how the files are implemented and even on the contents of theseinformation

that may not be fully available in the DBMS catalog. Hence, planning of an execution

strategy may be a more accurate description than query optimization.

For lower-level navigational database languages in legacy systems–such as the network DML or

the hierarchical HDML- the programmer must choose the query execution strategy while writing a

database program. If a DBMS provides only a navigational language, there is limited need or

opportunity for extensive query optimization by the DBMS; instead, the programmer is given the

capability to choose the "optimal" execution strategy. On the other hand, a high-level query

language—such as SQL for relational DBMSs (RDBMSs) or OQL for object DBMSs (ODBMSs)—is

more declarative in nature because it specifies what the intended results of the query are, rather

than the details of how the result should be obtained. Query optimization's thus necessary for

queries that are specified in a high-level query language.

Computer Science Notes

Tuesday, August 12, 2008

RDBMS

No comments:

Archives

Categories