Basics of computer science such as Operating system, Theory of Computation, Algorithms, Programming, Discete mathematics, Software Development life cycle
Thursday, December 11, 2008
Introduction to Flowcharting
for programs that appear in Chapters 1 through 6. In addition, a complete lesson
on flowcharting is included on the Student CD. The lesson is stored in a PowerPoint file
named
Flowcharting.ppt
.
A flowchart is a diagram that depicts the “flow” of a program. It contains symbols that
represent each step in the program. The figure shown here is a flowchart for Program 1-1,
the pay-calculating program in Chapter 1.
START
END
Display message
”How many
hours did
you work?”
Display message
”How much
do you get
paid per hour?”
Multiply hours
by payRate.
Store result in
grossPay.
Read hours
Read payRate
Display
grossPay
2
Appendix D: Introduction to Flowcharting
Notice there are three types of symbols in this flowchart: rounded rectangles (representing
terminal points), parallelograms (representing input/output operations), and a rectangle
(representing a process).
The rounded rectangles, or terminal points, indicate the flowchart’s starting and ending
points. The parallelograms designate input or output operations. The rectangle depicts a
process such as a mathematical computation, or a variable assignment. Notice that the
symbols are connected with arrows that indicate the direction of program flow.
Connectors
Sometimes a flowchart is broken into two or more smaller flowcharts. This is usually done
when a flowchart does not fit on a single page, or must be divided into sections. A connector
symbol, which is a small circle with a letter or number inside it, allows you to connect
two flowcharts.
In the figure below, the A connector indicates that the second flowchart segment begins
where the first flowchart segment ends.
Terminal Input/Output Processes
Operations
A
END
START A
A
Appendix D: Introduction to Flowcharting
3
Flowchart Structures
There are four general flowchart structures:
•
Sequence
•
Decision
•
Repetition
•
Case
A sequence structure is a series of actions or steps, performed in order. The flowchart for
the pay-calculating program is an example of a sequence structure. The following flowchart
is also a sequence structure. It depicts the steps performed in Program 2-20, from
Chapter 2.
Flowchart for Program 2-20
The following flowchart, which is another sequence structure, depicts the steps performed
in Program 3-15 (from Chapter 3). Notice the use of connector symbols to link the two
flowchart segments.
END
START
Display
Total Pay
Calculate Total
Wages as Regular
Wages plus
Overtime Wages.
Calculate Regular
Wages as Base
Pay times
Regular Hours.
Calculate Overtime
Wages as
Overtime Pay
times Overtime
Hours.
4
Appendix D: Introduction to Flowcharting
Flowchart for Program 3–15
The Decision Structure
In a decision structure, one of two possible actions is performed, depending on a condition.
In a decision structure, a new symbol, the diamond, represents a yes/no question. If
the answer to the question is “yes,” the program flow follows one path. If the answer to
the question is “no,” the program flow follows another path. The following figure shows
the general form of a decision structure.
START
END
Ask user to enter
beginning
inventory value
for all three stores.
Store begInv
in store1,
store2, and
store3.
Subtract sold
from store1.
Read begInv.
Read sold.
Ask user to enter
number of widgets
sold at Store 1.
A
A
Ask user to enter
number of widgets
sold at Store 2.
Subtract sold
from sold2.
Subtract sold
from store3.
Read sold.
Read sold.
Ask user to enter
number of widgets
sold at Store 3.
Display inventory
values for all
three stores.
Appendix D: Introduction to Flowcharting
5
In the following flowchart segment, the question “is x < y?” is asked. If the answer is
“no,” then process A is performed. If the answer is “yes,” then process B is performed.
The following flowchart depicts the logic of Program 4-8, from Chapter 4. The decision
structure shows that one of two possible messages is displayed on the screen, depending
on the value of the expression
number % 2
.
NO YES
NO
Process A Process B
YES
x < y?
6
Appendix D: Introduction to Flowcharting
Flowchart for Program 4–8
The Case Structure
In a case structure, one of several possible actions is taken, depending on the contents of a
variable. The following flowchart segment shows the general form of a case structure.
The following flowchart depicts the logic of Program 4–31, which is also from Chapter 4.
One of 4 possible paths is followed, depending on the value stored in the variable
feedGrade
.
NO
Display
”Number is
odd.”
number % 2
equal to 0?
Read number.
Ask user to enter
an integer.
Display
”Number is
even.”
YES
START
END
Appendix D: Introduction to Flowcharting
7
Flowchart for Program 4–31
Repetition Structures
A repetition structure represents part of the program that repeats. This type of structure
is commonly known as a loop. The following figure shows an example of a repetition
structure.
Notice the use of the diamond symbol. A repetition structure tests a condition, and if the
condition exists, it performs an action. Then it tests the condition again. If the condition
still exists, the action is repeated. This continues until the condition no longer exists. In
the flowchart segment above, the question “is x < y?” is asked. If the answer is yes, then
Start
Ask the user to
enter the desired
feed grade.
Read input into
feedGrade.
End
'a'
'A'
'b'
'B'
'c'
'C' Other
case feedGrade
Display "30 cents
per pound".
Display "20 cents
per pound".
Display "15 cents
per pound".
Display "That is an
invalid choice".
YES
x < y? Process A
8
Appendix D: Introduction to Flowcharting
Process A is performed. The question “is x < y?” is asked again. Process A is repeated as
long as x is less than y. When x is no longer less than y, the repetition stops and the structure
is exited.
There are two forms of repetition structure: pre-test and post-test. A pre-test repetition
structure tests its condition
before
it performs an action. The flowchart segment above
shows a pre-test structure. Notice that Process A does not execute at all if the condition “x
< y” is not true. The pre-test repetition structure is coded in C++ as a
while
loop.
A post-test repetition structure tests its condition
after
it performs an action. This type of
loop always performs its action at least once. The following flowchart segment shows an
example.
The post-test repetition structure is coded in C++ as a
do-while
loop.
The following flowchart depicts the logic of Program 5-6, which appears in Chapter 5.
Flowchart for Program 5-6
YES
NO
Process A
x < y?
START
number = 1.
Display Table
Headings.
Display number
and number
Squared.
number <=
10?
YES
NO
Add 1 to number.
END
Appendix D: Introduction to Flowcharting
9
Modules
A program module (such as a function in C++) is represented by the special symbol shown
below:
The position of the module symbol indicates the point the module is executed, as shown in
the following figure:
A separate flowchart can then be constructed for the module. The following figure shows
a flowchart for Program 6-19 from Chapter 6. The
getBasePay
and
getOvertimePay
modules appear as separate flowcharts.
START
END
Read Input.
Call calc_pay
function.
Display results.
10
Appendix D: Introduction to Flowcharting
Flowchart for Program 6-19
Start
Ask the user to
enter the number
of hours worked.
hours >
BASE_HOURS?
Call
getBasePay
Yes
Call
getOvertimePay
Calculate total pay.
No
Display base pay,
overtime pay, and total
pay.
End
Start of
getBasePay
No hours > Yes
BASE_HOURS?
basePay =
BASE_HOURS *
PAY_RATE
basePay =
hoursWorked *
PAY_RATE
End of
getBasePay
Start of
getOvertimePay
No hours > Yes
BASE_HOURS?
overtimePay = (hoursWorked
- BASE_HOURS) *
PAY_RATE *
OT_MULTIPLIER
overtimePay = 0.0
End of
getOvertimePay
Using UML in Class Design
UML
stands for
Unified
Modeling Language
. The UML provides a set of standard diagrams for graphically
depicting object-oriented systems. Figure E-1 shows the general layout of a UML diagram
for a class. Notice that the diagram is a box that is divided into three sections. The top
section is where you write the name of the class. The middle section holds a list of the
class’s member variables. The bottom section holds a list of the class’s member functions.
For example, in Chapter 13, Introduction to Classes, you studied a
Rectangle
class that
could be used in a program that works with rectangles. The class has the following member
variables:
•
width
•
length
The class also has the following member functions:
•
setWidth
•
setLength
•
getWidth
•
getLength
•
getArea
Figure E-1
Class name goes here
Member variables are listed here
Member functions are listed here
2
Appendix E: Using UML in Class Design
From this information alone we can construct a simple UML diagram for the class, as
shown in Figure E-2.
The UML diagram in Figure E-2 tells us the name of the class, the names of the member
variables, and the names of the member functions. Compare this diagram to the actual
C++ class declaration, which follows.
class Rectangle
{
private:
double width;
double length;
public:
void setWidth(double);
void setLength(double);
double getWidth() const;
double getLength() const;
double getArea() const;
};
The UML diagram in Figure E-2 does not convey many of the class details, such as access
specification, member variable data types, parameter data types, and function return
types. The UML provides optional notation for these types of details.
Showing Access Specification in UML Diagrams
The UML diagram in Figure E-2 lists all of the members of the
Rectangle
class but does
not indicate which members are private and which are public. In a UML diagram you may
optionally place a
-
character before a member name to indicate that it is private, or a
+
character to indicate that it is public. Figure E-3 shows the UML diagram modified to
include this notation.
Figure E-2
Rectangle
width
length
setWidth()
setLength()
getWidth()
getLength()
getArea()
Appendix E: Using UML in Class Design
3
Data Type and Parameter Notation in UML Diagrams
The Unified Modeling Language also provides notation that you may use to indicate the
data types of member variables, member functions, and parameters. To indicate the data
type of a member variable, place a colon followed by the name of the data type after the
name of the variable. For example, the
width
variable in the
Rectangle
class is a
double
.
It could be listed as follows in the UML diagram:
- width : double
The return type of a member function can be listed in the same manner: After the function’s
name, place a colon followed by the return type. The
Rectangle
class’s
getLength
function returns a
double
, so it could be listed as follows in the UML diagram:
+ getLength() : double
Parameter variables and their data types may be listed inside a member function’s parentheses.
For example, the
Rectangle
class’s
setLength
function has a
double
parameter
named
len
, so it could be listed as follows in the UML diagram:
+ setLength(len : double) : void
Figure E-4 shows a UML diagram for the
Rectangle
class with parameter and data type
notation.
Figure E-3
NOTE:
In UML notation the variable name is listed first, then the data type. This is
opposite of C++ syntax, which requires the data type to appear first.
Rectangle
- width
- length
+ setWidth()
+ setLength()
+ getWidth()
+ getLength()
+ getArea()
4
Appendix E: Using UML in Class Design
Showing Constructors and Destructors in a UML Diagram
There is more than one accepted way of showing a class’s constructor in a UML diagram.
In this text we show a constructor just as any other method, except we list no return type.
Figure E-5 shows a UML diagram for the first version of the
InventoryItem
class, which
was discussed in Chapter 13. This version of the class has one constructor, which accepts
three arguments.
Figure E-6 shows a UML diagram for the second version of the
InventoryItem
class,
which has three overloaded constructors, as well as some additional member functions.
Figure E-4
Figure E-5
Rectangle
- width : double
- length : double
+ setWidth(w : double) : void
+ setLength(len : double) : void
+ getWidth() : double
+ getLength() : double
+ getArea() : double
InventoryItem
- description : char*
- cost : double
- units : int
+ InventoryItem(desc : char*,
c : double, u : int) :
+ ~Inventory() :
+ getDescription() : char*
+ getCost() : double
+ getUnits() : int
Appendix E: Using UML in Class Design
5
Aggregation UML Diagrams
Aggregation occurs when a class contains an instance of another class as a member. You show
aggregation in a UML diagram by connecting two classes with a line that has an open diamond
at one end. The diamond is closest to the class that contains instances of other classes.
For example, suppose we have the
PersonalInfo
class shown in Figure E-7, which holds
information about a person.
Figure E-6
Figure E-7
InventoryItem
- description : char*
- cost : double
- units : int
+ InventoryItem():
+ InventoryItem(desc : char*) :
+ InventoryItem(desc : char*,
c : double, u : int) :
+ ~Inventory() :
+ setDescription(d : char*) : void
+ setCost(c : double) : void
+ setUnits(u : int) : void
+ getDescription() : char*
+ getCost() : double
+ getUnits() : int
- customerName : char*
- customerAddress : char*
- customerCity : char*
- customerState : char*
- customerZip : char*
+ PersonalInfo(name : char*,
address : char*,
city : char*,
state : char*,
zip : char*) :
+ getName() : char*
+ getAddress() : char*
+ getCity() : char*
+ getState() : char*
+ getZip() : char*
PersonalInfo
6
Appendix E: Using UML in Class Design
Suppose we also have the
BankAccount
class shown in Figure E-8, which holds the balance
of a bank account, and can perform operations such as making deposits and withdrawals.
Figure E-9 shows a UML diagram for another class,
BankCustomer
, which contains
instances of the
PersonalInfo
and
BankAccount
classes as members. The relationship
between the classes is shown by the connecting lines with the open diamond. The open
diamond is closest to the
BankCustomer
class because it contains instances of the other
classes as members.
Inheritance in UML Diagrams
You show inheritance in a UML diagram by connecting two classes with a line that has an
open arrowhead at one end. The arrowhead points to the base class. For example,
Figure E-10 shows a UML diagram depicting the relationship between the
GradedActivity
and
FinalExam
classes that you studied in Chapter 15. The arrowhead points toward the
GradedActivity
class, which is the base class.
Showing Protected Members
Protected class members may be denoted in a UML diagram with the
#
symbol. In the second
version of the
GradedActivity
class, in Chapter 15, the
score
member variable was
declared
protected
. Figure E-11 shows a UML diagram for the class.
Figure E-8
BankAccount
- balance : double
- interestRate : double
- interest : double
+ BankAccount(startBalance : double,
intRate : double) :
+ deposit(amount : double) : void
+ withdraw(amount : double) : void
+ addInterest() : void
+ getBalance() : double
+ getInterest() : double
Appendix E: Using UML in Class Design
7
Figure E-9
- info : PersonalInfo
- checking : BankAccount
- savings : BankAccount
+ BankCustomer(i : PersonalInfo,
c : BankAccount,
s : BankAccount) :
+ getCheckingBalance() : double
+ getSavingsBalance() : double
+ checkingDeposit(amt : double)
: void
+ checkingWithdrawal(amt : double)
: void
+ savingsDeposit(amt: double) : void
+ savingsWithdrawal(amt : double)
: void
BankCustomer
BankAccount
- balance : double
- interestRate : double
- interest : double
+ BankAccount(startBalance : double,
intRate : double) :
+ deposit(amount : double) : void
+ withdraw(amount : double) : void
+ addInterest() : void
+ getBalance() : double
+ getInterest() : double
- customerName : char*
- customerAddress : char*
- customerCity : char*
- customerState : char*
- customerZip : char*
+ PersonalInfo(name : char*,
address : char*,
city : char*,
state : char*,
zip : char*) :
+ getName() : char*
+ getAddress() : char*
+ getCity() : char*
+ getState() : char*
+ getZip() : char*
PersonalInfo
8
Appendix E: Using UML in Class Design
Figure E-10
Figure E-11
- score : double
+ GradedActivity() :
+ GradedActivity(s : double) :
+ setScore(s : double) : void
+ getScore() : double
+ getLetterGrade() : char
GradedActivity
- numQuestions : int
- pointsEach : double
- numMissed : int
+ FinalExam() :
+ FinalExam(questions : int,
missed : int) :
+ set(questions : int,
missed : int) : void
+ getNumQuestions() : double
+ getPointsEach() : double
+ getNumMissed() : int
FinalExam
# score : double
+ GradedActivity() :
+ GradedActivity(s : double) :
+ setScore(s : double) : void
+ getScore() : double
+ getLetterGrade() : char
GradedActivity
Tuesday, August 12, 2008
Introduction of Database management System
Data manipulation and information processing have become the major tasks of any organization,
small or big, whether it is an educational institution, government concern, scientific, commercial
or any other. It is the plural of a Greek word datum, which means any raw facts, or figure like
numbers, events, letters, transactions, etc, based on which we cannot reach any conclusion. It
can be useful after processing, e.g. 78, it is simply a number (data) but if we say physics (78) then
it will becomes information. It means somebody got distinctional marks in physics. Information is
processed data. The user can take decision based on information.
An organization is only a mechanism for processing information and considers that the traditional
management of information can be viewed in the context of information and process. The
manager may be considered as a planning and decision center. Established routes of information
flow are used to determine the effectiveness of the organization in achieving its objectives. Thus,
information is often described as the key to success in business.
In essence a database is nothing more than a collection of information that exists over a long
period of time, often many years. In common parlance, the term database refers to a collection of
data that is managed by a DBMS.
The DBMS is expected to:
1. Allow users to create new databases and specify their schema (logical structure of the data),
using a specialized language called a data-definition language.
Data Processing Information
RDBMS
2. Give users the ability to query the data (a \query" is database lingo for a question about the
data) and modify the data, using an appropriate language, often called a query language or datamanipulation
language.
3. Support the storage of very large amounts of data many gigabytes or more | over a long period
of time, keeping it secure from accident or unauthorized use and allowing efficient access to the
data for queries and database modifications.
4. Control access to data from many users at once, without allowing the actions of one user to
accept other users and without allowing simultaneous
.
A database is a collection of related data or operational data extracted from any firm or
organization. For example, consider the names, telephone number, and address of people you
know. You may have recorded this data in an indexed address book, or you may have stored it on
a diskette, using a personal computer and software such as Microsoft Access of MS Office or
ORACLE, SQL SERVER etc.
A Database Management System (DBMS) is a computer program you can use to store, organize
and access a collection of interrelated data. The collection of data is usually referred to as the
database. The primary goal of a DBMS is to provide a convenient and effective way to store and
retrieve data from the database. There are several types of data models (a data model is used to
describe the structure of a database) and Empress is a Relational Database Management System
(RDBMS) with Object Oriented extensions. Empress is capable of managing data in multiple
databases. The data stored in each database is organized as tables with rows and columns. In
relational database terminology, these tables are referred to as relations, rows are referred to as
records, and columns are referred to as attributes.
A database management system, therefore, is a combination of hardware and software
that can be used to set up and monitor a database, and can manage the updating and retrieval of
database, and can manage that can be used to set up and monitor a database, and can manage
the updating and retrieval of database that has been stored in it.
Most database management systems have the following facilities:
a) Creating of a file, addition to data, modification of data; creation, addition and deletion of
entire files.
b) Retrieving data collectively or selection.
c) The Data stored or indexed at the user’s discretion and direction.
d) Various reports can be produced from the system.
e) Mathematically function can be performed and the data will be stored in the database can be
manipulated with these functions.
f) To maintain data integrity and database use.
The DBMS response to a query by involving the appropriate subgroups, each of which performs
its special functions to interpret the query, or to locate the desired data in the database and
present it in the desired order.
As already mentioned, a database consists of a group of related files of different record types, and
the database allows users to access data anywhere in the database without the knowledge of how
data are actually organized on the storage device.
1.2 TRADITIONAL FILE ORIENTED APPROACH
8
Queries
COBOL/PL
DBMS OS Data
Base
RDBMS
The traditional file-oriented approach to information processing has for each application a
separate master file and its own set of personal files. An organization needs flow of information
across these applications also and this requires sharing of data, which is significantly lacking in
the traditional approach. One major limitations of such a file-based approach is that the
programs become dependent on the files and the files become dependent upon the programs.
Disadvantages
· Data Redundancy: The same piece of information may be stored in two or more files. For
example, the particulars of an individual who may be a customer or client may be stored
in two or more files. Some of this information may be changing, such as the address, the
payment maid, etc. It is therefore quite possible that while the address in the master file
for one application has been updated the address in the master file for another application
may have not been. It may be not easy to even find out as to in how many files the
repeating items such as the name occur.
· Program/Data Dependency: In the traditional approach if a data field is to be added to a
master file, all such programs that access the master file would have to be changed to
allow for this new field which would have been added to the master record.
· Lack of Flexibility: In view of the strong coupling between the program and the data,
most information retrieval possibilities would be limited to well-anticipated and predetermined
requests for data, the system would normally be capable of producing
scheduled records and queries which it has been programmed to create.
1.3 MOTIVATION FOR DATABASE APPROACH
Having pointed out some difficulties that arise in a straightforward file-oriented approach towards
information system development. The work in the organization may not require significant sharing
of data or complex access. In other words the data and the way it is used in the functioning of the
organization are not appropriate to database processing. Apart from needing a more powerful
hardware platform, the software for database management systems is also quite expensive. This
means that a significant extra cost has to be incurred by an organization if it wants to adopt this
approach.
Advantages gained by the possibility of sharing of the data with others, also carries with it the
risk of unauthorized access of data. This may range from violation of office procedures to violation
of privacy rights of information to down right thefts. The organizations, therefore, have to be ready
to cope with additional managerial problems.
A database management processing system is complex and it could lead to a more inefficient
system than the equivalent file-based one.
The use of the database and its possibility of being shared will, therefore affect many departments
within the organization. If die integrity of the data is not maintained, it is possible that one
relevant piece of data could have been used by many programs in different applications by
different users without they are being aware of it. The impact of this therefore may be very
widespread. Since data can be input from a variety sources, the control over the quality of data
become very difficult to implement.
However, for most large organization, the difficulties in moving over to a database approach are
still worth getting over in view of the advantages that are gained, namely, avoidance of data
duplication, sharing of data by different programs, greater flexibility and data independence.
1.4 The Evolution of Database Systems
Early days: database applications built on top of file systems. Drawbacks of using file systems to
store data:
· Data redundancy and inconsistency
· Difficulty in accessing data
· Atomicity of updates
9
RDBMS
· Concurrency control
· Security
· Data isolation — multiple files and formats
· Integrity problems
10
RDBMS
1.5 DATABASE BASICS
Since the DBMS of an organization will in some sense reflect the nature of activities in the
organization, some familiarity with the basic concepts, principles and terms used in the field are
important.
· Data-items: The term data item is the word for what has traditionally been called the field
in data processing and is the smallest unit of data that has meaning to its users. The
phrase data element or elementary item is also sometimes used. Although the data item
may be treated as a molecule of the database, data items are grouped together to form
aggregates described by various names. For example, the data record is used to refer to a
group of data items and a program usually reads or writes the whole records. The data
items could occasionally be further broken down into what may be called an automatic
level for processing purposes.
· Entities and Attributes: The real world would consist of occasionally a tangible object such
as an employee; a component in an inventory or a space or it may be intangible such as an
event, a job description, identification numbers, or an abstract construct. All such items
about which relevant information is stored in the database are called Entities. The
qualities of the entity that we store as information are called the attributes. An attribute
may be expressed as a number or as a text. It may even be a scanned picture, a sound
sequence, and a moving picture that is now possible in some visual and multi-media
databases.
Data processing normally concerns itself with a collection of similar entities and records
information about the same attributes of each of them. In the traditional approach, a
programmer usually maintains a record about each entity and a data item in each record
relates to each attribute. Similar records are grouped into files and such a 2-dimensional
array is sometimes referred to as a flat file.
· Logical and Physical Data: One of the key features of the database approach is to bring
about a distinction between the logical and the physical structures of the data. The term
logical structure refers to the way the programmers see it and the physical structure refers
to the way data are actually recorded on the storage medium. Even in the early stages of
records stored on tape, the length of the inter-record tape requires that many logical
records be grouped into one physical record to several storage places on tape. It was the
software, which separated them when used in an application program, and combined them
again before writing back on tape. In today's system the complexities are even greater and
as will be seen when one is referring to distributed databases that some records may
physically be located at significantly remote places.
· Schema and Subschema: Having seen that the database does not focus on the logical
organization and decouples it from the physical representation of data, it is useful to have
a term to describe the logical database description. A schema is a logical database
description and is drawn as a chart of the types of data that are used. It gives the names of
the entities and attributes, and specifies the relationships between them. It is a framework
into which the values of the data item can be fitted. Like an information display system
such as that giving arrival and departure time at airports and railway stations, the schema
will remain the same though the values displayed in the system will change from time to
time. The relationships that has specified between the different entities occurring in the
schema may be a one to one, one to many, many to many, or conditional.
The term schema is used to mean an overall chart of all the data item types and recordtypes
stored in a database. The term sub schema refers to the same view but for the dataitem
types and record types which a particular user uses in a particular application or.
Therefore, many different sub schemas can be derived from one schema.
· Data Dictionary: It holds detailed information about the different structures and data
types: the details of the logical structure that are mapped into the different structure,
details of relationship between data items, details of all users privileges and access rights,
performance of resource with
11
RDBMS
· details.
1.6 THREE VIEWS OF DATA
DBMS is a collection of interrelated files and a set of programs that allow several users to access
and modify these files. A major purpose of a database system is to provide users with an abstract
view of the data. However, in order for the system to be usable, data must be retrieved efficiently.
The concern for efficiently leads to the design of complex data structure for the representation of
data in the database. By defining levels of abstract as which the database may be viewed, there
are logical view or external, conceptual view and internal view or physical view.
· External view: This is the highest level of abstraction as seen by a user. This level of
abstraction describes only the part of entire database.
· Conceptual view: This is the next higher level of abstraction which is the sum total of Data
Base Management System user's views. In this we consider; what data are actually stored
in the database. Conceptual level contains information about entire database in terms of a
small number of relatively simple structures.
· Internal level: The lowest level of abstraction at which one describes how the data are
physically stored. The interrelationship of any three levels of abstraction is illustrated in
figure 2.
Fig: The three views of data
To illustrate the distinction among different views of data, it can be compared with the concept of
data types in programming languages. Most high level programming language such as C, VC++,
etc. support the notion of a record or structure type. For example in the ‘C’ language we declare
structure (record) as follows:
struct Emp{
char name [30];
12
RDBMS
char address [100];
}
This defines a new record called Emp with two fields. Each field has a name and data type
associated with it.
In an Insurance organization, we may have several such record types, including among others:
-Customer with fields name and Salary
-Premium paid and Due amount at what date
-Insurance agent name and salary + Commission
At the internal level, a customer, Premium account, or employee (insurance agent) can be
described as a sequence of consecutive bytes. At the conceptual level each such record is
described by a type definition, illustrated above and also die interrelation among these record
types is defined and describing the rights or privileges assign to individual customer or end-users.
Finally at the external level, we define several views of the database. For example, for preparing
the Insurance checks of Customer_details’, only information about them is required; one does not
need to access information about customer accounts. Similarly, tellers can access only account
information. They cannot access information concerning about the premium paid or amount
received.
1.7 The Three Level Architecture Of DBMS
A database management system that provides these three levels of data is said to follow threelevel
architecture as shown in fig. . These three levels are the external level, the conceptual level,
and the internal level.
Fig : The three level architecture for a DBM
A schema describes the view at each of these levels. A schema as mentioned earlier is an outline
or a plan that describes the records and relationships existing in the view. The schema also
describes the way in which entities at one level of abstraction can be mapped to the next level.
The overall design of the database is called the database schema. A database schema includes
such information as:
· Characteristics of data items such as entities and attributes
· Format for storage representation
· Integrity parameters such as physically authorization and backup politics.
· Logical structure and relationship among those data items
The concept of a database schema corresponds to programming language notion of type
definition. A variable of a given type has a particular value at a given instant in time. The concept
of the value of a variable in programming languages corresponds to the concept of an instance of
a database schema.
13
RDBMS
Since each view is defined by a schema, there exists several schema in the database and these
exists several schema in the database and these schema are partitioned following three levels of
data abstraction or views. At the lower level we have the physical schema; at the intermediate
level we have the conceptual schema, while at the higher level we have a subschema. In general,
database system supports one physical schema, one conceptual schema, and several subschemas.
1.7.1 External Level or Subschema
The external level is at the highest level of database abstraction where only those portions of the
database of concern to a user or application program are included. Any number of user views may
exist for a given global or conceptual view.
Each external view is described by means of a schema called an external schema or subschema.
The external schema consists of the, definition of the logical records and the relationships in the
external view. The external schema also contains the method of deriving the objects in the
external view from the objects in the conceptual view. The objects include entities, attributes, and
relationships.
1.7.2 Conceptual Level or Conceptual Schema
One conceptual view represents the entire database. The conceptual schema defines this
conceptual view. It describes all the records and relationships included in the conceptual view
and, therefore, in the database. There is only one conceptual schema per database. This schema
also contains the method of deriving the objects in the conceptual view from the objects in the
internal view.
The description of data at this level is in a format independent of its physical representation. It
also includes features that specify the checks to retain data consistency and integrity.
1.7.3 Internal Level or Physical Schema
It indicates how the data will be stored and describes the data structures and access methods to
be used by the database. The internal schema, which contains the definition of the stored record,
the method of representing the data fields, expresses the internal view and the access aids used.
1.7.4 Mapping between different Levels
Two mappings are required in a database system with three different views. A mapping between
the external and conceptual level gives the correspondence among the records and the
relationships of the external and conceptual levels.
a) EXTERNAL to CONCEPTUAL: Determine how the conceptual record is viewed by the user
b) INTERNAL to CONCEPTUAL: Enable correspondence between conceptual and internal levels. It
represents how the conceptual record is represented in storage.
An internal record is a record at the internal level, not necessarily a stored record on a physical
storage device. The internal record of figure 3 may be split up into two or more physical records.
The physical database is the data that is stored on secondary storage devices. It is made up of
records with certain data structures and organized in files. Consequently, there is an additional
mapping from the internal record to one or more stored records on secondary storage devices.
1.8 Relational Database Systems
The relational model, invented by IBM researcher Ted CODD in 1970, wasn't turned into a
commercial product until almost 1980. Since then database systems based on the relational
model, called relational database management systems or RDBMS, have come to dominate the
database software market. Today few people know about any other kind of database management
system.
Few RDBMS implement the relational model completely. Although commercial RDBMS have a lot
in common, each system has quirks and non-standard extensions. You must understand
relational theory to correctly design a database -just learning a particular RDBMS won't get you
all the way there.
A good RDBMS and a well-designed relational database give you some important benefits:
14
RDBMS
· Data integrity and consistency maintained and/or enforced by the RDBMS.
· Redundant data eliminated or kept to a practical minimum.
· Data retrieved by unique keys.
· Relationships expressed through matching keys.
· Physical organization of data managed by RDBMS.
· Optimization of storage and database operation execution times.
1.9 Data Models
Collections of conceptual tools for describing data, data relationships, data semantics and
consistency constraints. The various data models that have been proposed fall into three different
groups. Object based logical models, record-based logical models and physical models.
Object-Based Logical Models: They are used in describing data at the logical and view levels. They
are characterized by the fact that they provide fairly flexible structuring capabilities and allow
data constraints to be specified explicitly. There are many different models and more are likely to
come. Several of the more widely known ones are:
· The E-R model
· The object-oriented model
· The semantic data model
· The functional data model
Data Modeling Using E-R Model
The (E-R) data model is based on a perception of a real worker that consists of a collection of basic
objects, called entities, and of relationships among these objects.
The overall logical structure of a database can be expressed graphically by an E-R diagram. Which
is built up by the following components:
· Rectangles, which represent entity sets
· Ellipses, which represent attributes
· Diamonds, which represent relationships among entity sets
· Lines, which link attributes to entity sets and entity sets to relationships.
E.g. suppose we have two entities like customer and account, then these two entities can be
modeled as follow:
A sample E-R diagram
The Object-Oriented Model
Like the E-R model the object-oriented model is based on a collection of objects. An object
contains values stored in instance variables within the object. An object also contains bodies of
code that operate on the object. These bodies of code are called methods.
15
Customer name
Customer
Customer city Balance
Account
number
Depos
it Account
CUSTOMER (Table name)
Customer–name Customer-street Customer-City Account-number
Johnsons Alma Pala Alto A-101
Smith
Hayes Main
Turner Dutnam Stanford A-305
Johnson Alma PalaAlto A-201
Jones Main
Lindsay Park Pittifield A-222
Smith
RDBMS
Classes- It is the collection of objects which consist of the same types of values and the same
methods.
E.g. account number & balance are instance variables; pay-interest is a method that uses the
above two variables and adds interest to the balance.
Semantic Models
These include the extended relational, the semantic network and the functional models. They are
characterized by their provision of richer facilities for capturing the meaning of data objects and
hence of maintaining database integrity. Systems based on these models exist in monotype for at
the time of writing and will begin to filter through the next decade.
Record-Based Logical Models
Record based logical models are used in describing data at the logical and view levels. In contrast
to object-based data models, they are used both to specify the overall logical structures of the
database, and to provide a higher-level description of the implementation.
Record-based models are so named because the database is structured in fixed-format records of
several types. Each record type defines a fixed number of fields, or attributes, and each field is
usually of a fixed length.
The three most widely accepted record-based data models are the relational, network, and
hierarchical models.
Relational Model
The relational model uses a collection of tables to represent both data and the relationships
among those data. Each table has multiple columns, and each column has a unique name as
follows:
A sample relational database
Network Model
Data in the network model are represented by collections of records, and relationship among data
is represented by links, which can be viewed as pointers. The records in the database are
organised as collections of arbitrary graphs. Such type of database is shown in the figure given
bellow.
Johnsons Alma Pala Alto A-101 500
Smith
Hayes Main
16
A-222 700
RDBMS
Turner Dutnam Stanford A-305 350
Jones Main
Lindsay Park Pittifield A-217 750
A sample network database
Hierarchical Model
The hierarchical model is similar to the network model in the sense that data and relationships
among data one represented by records and links, respectively. It differs from the network model
in that records are organised as collections of trees rather than arbitrary graphs.
A sample Hierachical database
Physical Data Models
Physical data models are used to describe data at the lowest level. In contrast to logical data
models, there are few physical data models in use. Two of the widely known ones are the unifying
model and the frame-memory model.
10. Database Languages
17
A-222 700
A-217 350
Johnson customer
street -------
Smith North -------
Hayes
Turner Putnam -------
Jones
Lindsay Park -------
A-101 500
A-201 900
A-215 700
A-201 900
A-102 400
A-305 350
CUSTOMER
RDBMS
A database system provides two different types of languages: one to specify the database schema,
and the other to express database queries and updates.
1.10 Database Languages
Data Definition Language (DDL)
A database schema is specified by a set of definition expressed by a special language called a
data-definition language (DDL). The result of compilation of DDL statement is a set of tables that
is stored in a special file called data dictionary, or data directory.
A data dictionary is a file that contain metadata-that is about data. This file is consulted before
actual data are read or modified in the database system.
The storage structure and access methods used by the database system are specified by as set of
definitions in a special type of DDL called a data storage and definition language. The result of
compilation of these definitions is a set of instructions to specify the implementation details of the
database schema - details are usually hidden from the users.
Data Manipulation Language
By data manipulation, we mean
· The retrieval of information stored in the database.
· The insertion of new information into the database.
· The deletion of information stored in the database.
A data-manipulation language (DML) is a language that enables user to access or manipulate data
as organised by the appropriate data model.
There are basically two types:
Procedural DML
It requires a user to specify what data are needed and how to get those data.
Nonprocedural DML
It requires a user to specify what data are needed without specifying how to get those data.
Normalization
About relational databases, you probably know about normalization. The process of normalization
transforms data into forms that conform to the relational model. Normalized data enables the
RDBMS to enforce integrity rules, guarantee consistency, and optimize database access. Learning
how to normalize data takes significant time and practice. Data modelers spend a lot of time
understanding the meaning of data so they can properly normalize it, but programmers frequently
downplay normalization, or dismiss it outright as an academic problem. Most databases come
from power users and programmers, not data modelers, and most databases suffer from unnormalized
data, redundancy, integrity and performance problems. Un-normalized databases
usually need a lot of application code to protect the database from corruption.
1.11 Client-Server and Multi-Tier Architectures
Client/Server Technology
Client/server technology is the computer architecture used in almost all automated library
systems now being offered to libraries. The simple definition is:
Client/server is a computer architecture that divides functions into client (requestor) and server
(provider) subsystems, with standard communication methods (such as TCP/IP and z39.50) to
facilitate the sharing of information between them. Among the characteristics of a client/server
architecture are the following:
18
RDBMS
· The client and server can be distinguished from one another by the differences in tasks they
perform
· The client and server usually operate on different computer platforms
· Either the client or server may be upgraded without affecting the other. Clients may connect
to one or more servers; servers may connect to multiple clients concurrently.
· Clients always initiate the dialogue by requesting a service.
Client/server is most easily differentiated from hierarchical processing, which uses a host and
slave, by the way a PC functions within a system. In client/server the PC-based client
communicates with the server as a computer; in hierarchical processing the PC emulates a
"dumb" terminal to communicate with the host. In client/server the client controls part of the
activity, but in hierarchical processing the host controls all activity. A client PC almost always
does the following in a client/server environment: screen handling, menu or command
interpretation, data entry, help processing, and error recovery.
The dividing line between the client and a server can be anywhere along a broad continuum: at
one end only the user interface has been moved onto the client; at the other, almost all
applications have been moved onto the client and the database may be distributed. There are at
least five points along the continuum:
Distributed presentation:
The presentation is handled partly by the server and partly by the client.
Remote presentation:
The presentation is controlled and handled entirely by the client.
Distributed logic:
The application logic is handled partly by the server and partly by the client.
Remote data management:
Database management is controlled and handled entirely by the server.
Distributed database:
Database management is handled partly by the server and partly by the client. There are,
therefore, two major applications for client/server in a library environment:
1) as the architecture for an automated library system, and
2) as an approach to linking heterogeneous systems.
In the first application, a vendor designs a system using client/server architecture to facilitate use
of that system to access multiple servers, to facilitate bringing together multiple product lines,
and/or to improve productivity. In the second application, a vendor designs a client to facilitate
19
RDBMS
transparent access to systems of other vendors, and a server to facilitate transparent access to its
system from others. While the underlying principles are the same, the vendor has considerable
latitude in the design of its own client/server system, but must strictly conform to standards
when using client/server to link its system with those of other libraries.
While it has been possible to access a wide variety of electronic resources through an automated
library system for a number of years, client/server technology has made it possible to tailor the
user interface to provide a personalized interface which meets the needs of any particular user
based on an analysis of tasks performed or on an individual's expressed preferences. An example
of this tailoring is the recent introduction of portals, common user interfaces to a wide variety of
electronic resources with the portal. [See the Tech Note on Portal Technology]. The portal can be
tailored to groups of staff or patrons, or to each individual.
Vendors with multiple product lines can build a single client to work with any of their server
products. This substantially reduces development costs. Client/server can also improve
productivity. Many vendors are now offering different clients for technical services, circulation,
and patron access catalog applications.
A GUI (graphical user interface) -- a presentation of information to the user using icons and other
graphics -- is sometimes called client/server, but unless information moves from the server to the
client in machine-readable (raw) form, and the client does the formatting to make it humanreadable,
it is not true client/server. Further, there is nothing in the client/server architecture
that requires a GUI. Nevertheless, most vendors of automated library systems use GUI for staff
applications. The GUIs are proprietary to each vendor. Web browsers are preferred for patron
applications because they are more likely to be familiar to them than a proprietary GUI.
An important computer industry development which has facilitated client/server architecture is
referred to as "open systems" C a concept which features standardized connectivity so that
components from several vendors may be combined. The trend to open systems began in the
1970s as a reaction against proprietary systems that required that all hardware and system
software come from a single source, and gained momentum in the 1980s, as networking became
common. While various parts of an organization might not hesitate to purchase proprietary
systems to meet their own needs, the desire to provide access from other parts of the organization,
or to exchange information, would be an incentive to select an open system. For client/server,
open systems are essential.
Most client/server systems offered by automated library system vendors use an open operating
system such as UNIX or one of its variations, or Windows NT or 2000 server. UNIX is the most
popular operating system for servers because of the large range of platform sizes available, but
Windows NT or 2000 server has been growing in popularity, especially for systems supporting
fewer than 100 concurrent users.
The most popular client operating systems are Windows 95/98/Me/2000 and Linux. By
supporting multiple client operating systems, a vendor of an automated library system makes it
possible for the client to conform to a staff member’s or patron's accustomed operating system
environment.
Almost all client/server systems use a relational database management system (RDBMS) for
handling the storage and retrieval of records in the database using a series of tables of values.
There is a common misconception that client/server is synonymous with networked SQL
(Structured Query Language) databases. SQL, a popular industry-standard data definition and
access language for relational databases, is only one approach -- albeit the one selected by almost
all automated library system vendors. While one can reasonably expect the use of an RDBMS and
SQL, the absence of either does not mean that a system is not client server.
A network computer is a PC without a hard disk drive. They have been little used in
libraries because most libraries use their PCs for a variety of applications in addition to accessing
the automated library system. Even when there are applications that lend themselves for use on
thin clients, most libraries have preferred to use older PCs that are no longer suitable for
applications that require robust machines. They use a two-tier PC strategy that involves the
purchase and deployment of new PCs for applications that require robust machines and
redeployment of the replaced machines for applications that they can support. For example, new
PCs are used for most staff applications and patron access to the Internet: older PCs are used as
“express catalogs,” devices that have a Web browser, but are limited to accessing the library’s
patron access catalog. The two-tier PC strategy can extend the life of a PC by as much as three
years.
20
RDBMS
Network computers are most widely used in large organizations that have to support thousands of
users. Almost all applications, including word processing, spreadsheets, and other office
applications, are loaded on a server. It is then not necessary to load new product releases on each
machine; only the applications on the server have to be updated. Most libraries do not have
enough users to realize significant savings by taking this approach. For libraries that have
hundreds of PCs, remote PC management is an alternative to thin clients; for libraries that have
fewer than 100 PCs, it is possible to support each individually.
A PDA is a handheld device that combines computing, telephone, fax, and networking features.
While originally pen-based (i.e., using a stylus), many models now come with a small keyboard.
The Palm Pilot is an example of a PDA. A number of libraries now encourage their users to access
the patron access catalog with a PDA. It is this application which holds the most promise for the
use of thin clients in libraries. As the bandwidth available for wireless applications increases and
the costs of PDAs drops, the use of PDAs for access to databases is expected to increase
dramatically.
The key to thin client technology is Java, a general purpose programming language with a number
of features that make it well suited for use on the Web. Small Java applications are called Java
applets and can be downloaded from a Web server and run on a device which includes a Javacompatible
Web browser such as Netscape Navigator or Microsoft Internet Explorer. This means
that the thin client does not need to be loaded with applications software.
A thin client can also be GUI-based. In that case, the client handles only data presentation and
deals with the user interaction; all applications and database management is found on the server.
Vendors of automated library systems favor proprietary GUI-based clients for staff because that
makes it possible to exploit the features of their systems.
1.12 Multimedia Data
The weblicon PIM has been built on a multi-tier architecture to support high scalability through
load-balancing and high availability through redundancy. The HTTP Client tier communicates
with the HTTP Server tier which routes incoming requests to the Application Server tier. The
Application Server tier connects to the Database Server tier consisting of RDBMS, LDAP and IMAP
Servers. All tiers can be scaled individually by building clusters of servers for load-balancing and
high redundancy embedding information into multimedia data that should be imperceptible but
unmemorable.
For model-based compression and animation of face models as defined in MPEG-4, a watermark
can be embedded into the transmitted facial animation parameters. The watermark can be
retrieved from the animation parameters or from video sequences rendered with the watermarked
animation parameters, even after subsequent low-quality video compression of the rendered
sequence. Entering the derived or generated data into the database and associating it with the
original information gives other users automatic access to the conclusions, thoughts, and
annotations of previous users of the information. This ability to modify, adjust, enhance, or add to
the global set of information and then share that information with others is a powerful and
important service. This type of service requires cooperation between the multimedia data
manipulation tools described above and the information repositories scattered across the network.
Generated or extracted information must be deposited and linked with existing information so
that future users will not only benefit from the original information but also from the careful
analysis and insight of previous users.
21
RDBMS
1.13 Information Integration
Information integration is an important feature of Oracle9i Database. Oracle provides many
features that allow customers to synchronously and asynchronously integrate their their data
including Oracle Streams, Oracle Advanced Queuing, replication, and distributed SQL.
Oracle also provides gateways to non-Oracle database servers, providing seamless interoperation
in heterogeneous environments.
Asynchronous Integration with Oracle Streams
Oracle Streams enables the propagation and management of data, transactions and events in a
data stream either within a database, or from one database to another. The stream routes
published information to subscribed destinations. The result is a new feature that provides
greater functionality and flexibility than traditional solutions for capturing and managing events,
and sharing the events with other databases and applications. Oracle Streams provides both
message queuing and replication functionality within the Oracle database. To simplify
deployment of these specialized information integration solutions, Oracle provides customized
APIs and tools for message queuing, replication, and data protection solutions. These include:
· Advanced Queuing for message queuing
· Replication with Oracle Streams
Asynchronous Integration with Oracle Advanced Replication
Oracle9i also includes Advanced Replication. This is a powerful replication feature first
introduced in Oracle7. Advanced Replication is useful for replicating object between Oracle9i
Database and older versions of the Oracle Server (that do not support Oracle Streams-based
Replication). Advanced Replication also provides Oracle Materialized Views; a replication solution
targeted for mass deployments and disconnected replicas.
Synchronous Integration with Distributed SQL
Oracle Distributed SQL enables synchronous access from one Oracle database, to other Oracle or
non-Oracle databases. Distributed SQL supports location independence, making remote objects
appear local, facilitating movement of objects between databases without application code
22
RDBMS
changes. Distributed SQL supports both queries and DML operations, and can intelligently
optimize execution plans to access data in the most efficient manner.
1.14 Data-Definition Language Commands
Empress provides a set of Structured Query Language (SQL) commands that allows users to
request information from the database. The Empress SQL language has three levels of operations:
1. A basic level at which data management commands are typed in without any prompting.
The full range of data management commands is available at this level.
2. Data Definition Language commands are concerned with the structure of the database and
its tables.
3. Data Manipulation Language commands are concerned with the maintenance and retrieval
of the data stored in the database tables.
4. Data Control Language commands provide facilities for assuring the integrity and security
of the data.
Data Definition Language Command
The overall structure of the database is defined by the Data Definition Language (DDL)
commands. The result of compiling DDL commands is a set of tables called the data dictionary
which contains information about the data in the database. Detailed descriptions of the Empress
DDL commands are provided in the Empress SQL: Reference manual. A summary of the available
DDL commands follows:
Command Description
ALTER TABLE Changes the structure of an existing table without having to dump and reload
its records. This also includes enable/disable trigger, define table type for
replication, enabling and disabling replication relations, and setting
checksum for a table.
CREATE COMMENT Attaches a comment to a table or attribute.
CREATE INDEX Sets up a search-aiding mechanism for an attribute.
CREATE MODULE Creates the definition of a persistent stored module into the data dictionary.
CREATE RANGE
CHECK Sets up data validation checks on an attribute.
CREATE
REFERENTIAL Sets up data referential constraints on attributes.
CREATE
REPLICATION
MASTER
Assings replication master entries to a replication table.
CREATE
REPLICATION
REPLICATE
Assings replication replicate entries to a replication table.
CREATE REPLICATE
TABLE Creates replicate table from a replication master table.
CREATE TABLE Creates a new table or replicate table including its name and the name and
data type of each of its attributes.
CREATE TRIGGER Sets up trigger events into data dictionary.
CREATE VIEW Creates a logical table from parts of one or more tables.
DISPLAY DATABASE Shows the tables in the database.
DISPLAY GRANT
PRIVILEGE Shows privilege grant options for a table.
DISPLAY MODULE Shows the persistent stored module definition.
23
RDBMS
DISPLAY PRIVILEGE Shows access privileges for a table.
DISPLAY TABLE Shows the structure of a table.
DROP COMMENT Removes a comment on a table or attribute.
DROP INDEX Removes an index on an attribute.
DROP MODULE Removes a persistent stored module definition from the data dictionary.
CHECK Removes data validation checks on an attribute.
DROP
REFERENTIAL Removes a data referential constraints from attributes.
DROP REPLICATION
MASTER Removes replication master entry from a replication table.
DROP REPLICATION
REPLICATE Removes replication replicate entry from a replication table.
DROP TABLE Removes an existing table.
DROP TRIGGER Removes a trigger event from the data dictionary.
DROP VIEW Removes a logical table.
GRANT PRIVILEGE Changes access privileges for tables or attributes.
LOCK LEVEL Sets the level of locking on a table.
RENAME Changes the name of a table or attribute.
REVOKE PRIVILEGE Removes table or attribute access privileges.
UPDATE MODULE Links a persistent stored module definition with the module shared library.
1.15 Overview of Query Processing
The techniques used by a DBMS to process, optimize, and execute high-level queries. A query
expressed in a high-level query language such as SQL must first be scanned, parsed, and
validated. The scanner identifies the language tokens—such as SQL keywords, attributes names
and relation names—in the text of the query, whereas the parser checks the query syntax to
determine whether it is formulated according to the syntax rules (rules of grammar) of the query
language. The query must also be validated, by checking that all attribute and relation names are
valid and semantically meaningful names in the schema of the particular database being queried.
An internal representation of the query is then created, usually as a tree data structure called
query tree. It is also possible to represent the query using a graph data structure called a query
graph. The DBMS must then devise an execution strategy for retrieving the result of the query
from the database files. A query typically has many possible execution strategies, and the process
of choosing a suitable one for processing a query is known as query optimization.
Query Processing Strategies
The steps involved in processing a query are illustrated in Figure 5.2. The basic steps are:
1. Parsing and translation
2. Optimization
3. Evaluation
Before query processing can begin, the system must translate the query into a usable form. A
language such as SQL is suitable for human use, but is ill suited to be the system’s internal
representation of a query. A more useful internal representation is one based on the extended
relational algebra.
24
RDBMS
Steps in Query Processing
Thus, the first action the system must take in query processing is to translate a given query into
its internal form. This translation process is similar to the work performed by the parser of a
compiler. In generating the internal form of the query, the parser checks the syntax of the user's
query, verifies that the relation names appearing in the query are names of relations in the
database, and so on.
Query optimization is left, for the most part, to the application programmer. That choice is made
because the data-manipulation-language statements of these two models are usually embedded in
a host programming language, and it is not easy to transform a network or hierarchical query into
an equivalent one without knowledge of the entire application program. In contrast, relationalquery
languages are either declarative or algebraic. Declarative languages permit users to specify
what a query should generate without saying how the system should do the generating. Algebraic
languages allow for algebraic transformation of users' queries. Based on the query specification, it
is relatively easy for an optimizer to generate a variety of equivalent plans for a query, and to
choose the least expensive one.
25
RDBMS
1.16 Storage and Buffer Management
Despite its many virtues, the relational data model is a poor fit for many types of data now
common across the enterprise. In fact, object databases owe much of their existence to the
inherent limitations of the relational model as reflected in the SQL2 standard. In recent years, a
growing chorus of demands has arisen from application developers seeking more flexibility and
functionality in the data model, as well as from system administrators asking for a common
database technology managed by a common set of administrative tools. As a result, vendors and
the SQL3 standard committees to include object capabilities are now extending the relational
model.
Object/relational (O/R) database products are still quite new, and the production databases to
which they have been applied are usually modest in size--50GB or less. As O/R technology
becomes more pervasive and memory and storage costs continue to fall, however, databases
incorporating this new technology should grow to a size comparable to that of pure relational
databases. Indeed, growth in the new technology is likely if for no other reason than that much of
this new data is inherently larger than the record/field type data of traditional relational
applications.
However, while limits to the growth of individual pure relational databases have been imposed as
much by hardware evolution as by software, the limits for O/R databases will arise primarily from
software. In this article, I'll explore the implications of the architecture approaches chosen by the
principal O/R database product designers--IBM, Informix, NCR, Oracle, Sybase, and Computer
Associates--for scalability of complex queries against very large collections of O/R data. The
powerful new data type extension mechanisms in these products limit the ability of vendors to
assume as much of the burden of VLDB complexity as they did for pure relational systems.
Instead, these mechanisms impose important additional responsibilities on the designers of new
types and methods, as well as on application designers and DBAs; these responsibilities become
more crucial and complex as the size of the databases and the complexity of the queries grow.
Finally, I'll explain how parallel execution is the key to the cost effectiveness--or even in some
cases to the feasibility--of applications that exploit the new types and methods, just as with pure
relational data. In contrast to the pure relational approach, however, achieving O/R parallelism is
much more difficult.
The term "VLDB" is overused; size is but one descriptive parameter of a database, and generally
not the most important one for the issues I'll raise here. Most very large OLTP relational
databases, for example, involve virtually none of these issues because high-volume OLTP queries
are almost always short and touch few data and metadata items. In addition, these OLTP
databases are frequently created and administered as collections of semi-independent smaller
databases, partitioned by key value. In contrast, the VLDB issues discussed here arise in very
large databases accessed by queries that individually touch large amounts of data and metadata
and involve join operations, aggregations, and other operations touching large amounts of data
and metadata. Such databases and applications today are usually found in data warehousing and
data mining applications, among other places.
The VLDB environments with these attributes are characterized by many I/O operations
within a single query involving multiple complex SQL operators and frequently generating large
intermediate result sets. Individual queries regularly cross any possible key partition boundary
and involve data items widely dispersed throughout the database. For these reasons, such a
database must normally be administered globally as a single entity. As a "stake in the ground," I'll
focus on databases that are at least 250GB in size, are commonly accessed by complex queries,
require online administration for reorganization, backup and recovery, and are regularly subject
to bulk operations (such as insert, delete, and update) in single work-unit volumes of 25GB or
more. For these databases, an MPP system, or possibly a very large SMP cluster, is required.
However, many of the issues we raise will apply to smaller databases as well.
Parallel Execution of Individual Queries
In a pure relational database, success in such an environment of 250GB or more may require the
highest degree of parallelism possible in execution of each individual query. The first key to
parallel success is the structure of the SQL query language itself. In most queries, SQL acts as a
set operator language that imposes few constraints on the order of execution; that is why
26
RDBMS
relational database optimizers have so much flexibility in choosing parallel execution plans. Thus,
SQL as an application language is extremely hospitable to parallel execution.
Given this characterization, one must devise execution strategies that are optimized for
parallel execution of an individual query, rather than merely providing for parallel execution of
strategies optimized for serial execution. Note that individual DDL and DML operations--as well as
other utilities such as backup and recovery--must execute in parallel. As indicated earlier, within
the context of SQL2, RDBMS vendors did most of the work necessary to provide the necessary
parallelism. Furthermore, it was possible to do so in ways that were largely transparent to the
application developer and the DBA. They were greatly aided not only by the structure of SQL, but
also by the fact that the core language and data types were mostly fixed during the '80s and early
'90s, at least as insofar as they affect parallel execution. Even so, it takes a vendor at least two or
three years to evolve a mature relational product that fully supports parallelism within a query for
VLDBs.
It’s worth noting that the complexity of adding parallelism for VLDB has two sources. The first is
replacing serial algorithms with parallel ones. The second source is subtler: There is a qualitative
change of mindset that must take place if one is to design competitive software for these very large
collections of data. For example, the first reaction of a designer building a database load program
is to make it use the INSERT function. This technique works for small- and medium-sized
databases but not for giant ones; it is simply too slow. Instead, one must design for multiple
streams of data to be loaded in parallel, capitalizing on the fact that multiple CPUs will be
available on the MPP or SMP cluster. If the data is coming from a single serial source, such as a
tape drive, the initial load task should focus only on getting each data item into memory as
quickly as possible. The power of the other CPUs should then be invoked in parallel through tasks
to get each datum to the right place, to index it properly, and so on.
These problems become more complex as O/R databases enter the mainstream; new and
potentially exotic data types must be accommodated, and new methods must be written and
optimized for parallel execution. Complex relationships far beyond the simple row and column
structure of the relational model can and will exist among instances of the new data types and
even among instances of current SQL data types--in a parts-explosion database, for example, or
when a "purchase order" object view is superimposed on an existing on-order database. New data
access methods to support the new operations and data types will also be required and must also
be optimized for parallel execution. Finally, expect giant ratios of column sizes in O/R tables with
the largest column entries being hundreds to thousands of times larger than the smallest
columns within a table. This change will create a need for new approaches to storage and buffer
management in servers as well as clients.
1.17 Transaction Processing
Is the following schedule serial? serialisable? Why or why not?
T1 T2 T3
-------------------------------------------------
1 start
2 read Z
3 start
4 read Y
5 Y = Y + 1
6 write Y
7 start
8 read Y
9 read X
10 Z = Z + 1
11 X = X + 1
12 checkpoint
13 write X
14 write Z
15 Y = Y + 1
16 write Y
17 read X
27
RDBMS
18 read Y
19 Y = Y + 1
20 write Y
21 commit
22 X = X + 1
23 write X
24 commit
25 read W
26 W = W + 1
27 write W
28 commit
For the above sequence of transactions, show the log file entries, assuming initially Z = 1, Y = 2, X
= 1, and W = 3. What happens to the transactions at the checkpoint?
1) The following transaction increases the total sales (A) and number of sales (B) in a store
inventory database. Write the values of A and B for each statement, assuming A is initially 100
and B is initially 2. The correctness criteria is that the average sale is $50. For each statement
determine if the transaction dies during the statement whether the database is in a correct state.
What should be done to repair the damage if the database is left in an incorrect state?
Transaction Start
1) Read A
2) A A + 50
3) Read B
4) B B + 1
5) Write B
6) Write A
7) Transaction ends
2) Which protocols potentially cause cascading rollbacks and why?
1.18 The Query Processor
The query optimizer module has the task of producing an execution plan, and the code generator
generates the code to execute the plan. The runtime database processor has the task of running
the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime
error results, an error message is generated by the runtime database processor.
Figure: Typical Steps of a Query processor
The term optimization is actually a misnomer because in some cases the chosen execution plan is
not the optimal (best) strategy—it is just a reasonably efficient strategy for executing the query.
Finding the optimal strategy is usually too time-consuming except for the simplest of queries and
28
RDBMS
may require information on how the files are implemented and even on the contents of theseinformation
that may not be fully available in the DBMS catalog. Hence, planning of an execution
strategy may be a more accurate description than query optimization.
For lower-level navigational database languages in legacy systems–such as the network DML or
the hierarchical HDML- the programmer must choose the query execution strategy while writing a
database program. If a DBMS provides only a navigational language, there is limited need or
opportunity for extensive query optimization by the DBMS; instead, the programmer is given the
capability to choose the "optimal" execution strategy. On the other hand, a high-level query
language—such as SQL for relational DBMSs (RDBMSs) or OQL for object DBMSs (ODBMSs)—is
more declarative in nature because it specifies what the intended results of the query are, rather
than the details of how the result should be obtained. Query optimization's thus necessary for
queries that are specified in a high-level query language.