UNIT-I
1: Introduction
Topics: Purpose of Database Systems; View of Data; Database Languages; Relational Databases; Database Design; Object-based and Semistructured Databases; Data Storage and Querying; Transaction Management; Database Architecture; Database Users and Administrators; Overall Structure
Lecture Plan

S.No  Module                                      Lecture
1.    DBS Application and DBMS Vs File Systems    L1
2.    View of DATA                                L2
3.    DB Language (DML, DDL)                      L3
4.    DB Users and Administrator                  L4
5.    Data storage and Querying                   L5
6.    DBMS Architecture                           L6
Database System Applications
- A DBMS contains information about a particular enterprise:
  - Collection of interrelated data
  - Set of programs to access the data
  - An environment that is both convenient and efficient to use
- Database Applications:
  - Banking: all transactions
  - Airlines: reservations, schedules
  - Universities: registration, grades
  - Sales: customers, products, purchases
  - Online retailers: order tracking, customized recommendations
  - Manufacturing: production, inventory, orders, supply chain
  - Human resources: employee records, salaries, tax deductions
- Databases touch all aspects of our lives
What Is a DBMS?
- A very large, integrated collection of data.
- Models a real-world enterprise:
  - Entities (e.g., students, courses)
  - Relationships (e.g., Madonna is taking CS564)
- A Database Management System (DBMS) is a software package designed to store and manage databases.
Why Use a DBMS?
- Data independence and efficient access.
- Reduced application development time.
- Data integrity and security.
- Uniform data administration.
- Concurrent access, recovery from crashes.
Why Study Databases??
- Shift from computation to information:
  - at the "low end": scramble to webspace (a mess!)
  - at the "high end": scientific applications
- Datasets increasing in diversity and volume.
  - Digital libraries, interactive video, Human Genome project, EOS project
  - ... need for DBMS exploding
- DBMS encompasses most of CS:
  - OS, languages, theory, AI, multimedia, logic
Files vs. DBMS
- Application must stage large datasets between main memory and secondary storage (e.g., buffering, page-oriented access, 32-bit addressing, etc.)
- Special code for different queries
- Must protect data from inconsistency due to multiple concurrent users
- Crash recovery
- Security and access control
Purpose of Database Systems
- In the early days, database applications were built directly on top of file systems.
- Drawbacks of using file systems to store data:
  - Data redundancy and inconsistency
    - Multiple file formats, duplication of information in different files
  - Difficulty in accessing data
    - Need to write a new program to carry out each new task
  - Data isolation: multiple files and formats
  - Integrity problems
    - Integrity constraints (e.g., account balance > 0) become "buried" in program code rather than being stated explicitly
    - Hard to add new constraints or change existing ones
  - Atomicity of updates
    - Failures may leave the database in an inconsistent state with partial updates carried out
    - Example: a transfer of funds from one account to another should either complete or not happen at all
  - Concurrent access by multiple users
    - Concurrent access is needed for performance
    - Uncontrolled concurrent accesses can lead to inconsistencies
    - Example: two people reading a balance and updating it at the same time
  - Security problems
    - Hard to provide user access to some, but not all, data
- Database systems offer solutions to all of the above problems.
Levels of Abstraction
- Physical level: describes how a record (e.g., customer) is stored.
- Logical level: describes data stored in the database, and the relationships among the data.

      type customer = record
          customer_id : string;
          customer_name : string;
          customer_street : string;
          customer_city : string;
      end;

- View level: application programs hide details of data types. Views can also hide information (such as an employee's salary) for security purposes.
Summary
- A DBMS is used to maintain and query large datasets.
- Benefits include recovery from system crashes, concurrent access, quick application development, data integrity and security.
- Levels of abstraction give data independence.
- A DBMS typically has a layered architecture.
- DBAs hold responsible jobs and are well-paid!
- DBMS R&D is one of the broadest, most exciting areas in CS.
View of Data
An architecture for a database system: (figure)
Instances and Schemas
- Similar to types and variables in programming languages.
- Schema: the logical structure of the database.
  - Example: the database consists of information about a set of customers and accounts and the relationship between them.
  - Analogous to type information of a variable in a program.
  - Physical schema: database design at the physical level.
  - Logical schema: database design at the logical level.
- Instance: the actual content of the database at a particular point in time.
  - Analogous to the value of a variable.
- Physical Data Independence: the ability to modify the physical schema without changing the logical schema.
  - Applications depend on the logical schema.
  - In general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.
Data Models
- A collection of tools for describing:
  - Data
  - Data relationships
  - Data semantics
  - Data constraints
- Relational model
- Entity-Relationship data model (mainly for database design)
- Object-based data models (Object-oriented and Object-relational)
- Semi-structured data model (XML)
- Other older models:
  - Network model
  - Hierarchical model

Data Models (cont.)
- A data model is a collection of concepts for describing data.
- A schema is a description of a particular collection of data, using a given data model.
- The relational model of data is the most widely used model today.
  - Main concept: relation, basically a table with rows and columns.
  - Every relation has a schema, which describes the columns, or fields.
Example: University Database
- Conceptual schema:
  - Students(sid: string, name: string, login: string, age: integer, gpa: real)
  - Courses(cid: string, cname: string, credits: integer)
  - Enrolled(sid: string, cid: string, grade: string)
- Physical schema:
  - Relations stored as unordered files.
  - Index on first column of Students.
- External Schema (View):
  - Course_info(cid: string, enrollment: integer)
Data Independence
- Applications insulated from how data is structured and stored.
- Logical data independence: protection from changes in the logical structure of data.
- Physical data independence: protection from changes in the physical structure of data.

DATABASE LANGUAGES

Data Manipulation Language (DML)
- Language for accessing and manipulating the data organized by the appropriate data model.
  - DML is also known as query language.
- Two classes of languages:
  - Procedural: user specifies what data is required and how to get those data.
  - Declarative (nonprocedural): user specifies what data is required without specifying how to get those data.
- SQL is the most widely used query language.
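The declarative style can be illustrated end-to-end with Python's built-in sqlite3 module. SQLite stands in here for a full DBMS, and the account table with its rows is invented for the example: the query states only what is wanted, and the engine decides how to retrieve it.

```python
import sqlite3

# Illustrative in-memory database; table contents are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (account_number TEXT, branch_name TEXT, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?, ?)",
                 [("A-101", "Downtown", 500), ("A-102", "Perryridge", 400)])

# Declarative: we say WHAT we want (accounts with balance > 450);
# the DBMS decides HOW to find them (scan, index, etc.).
rows = conn.execute(
    "SELECT account_number FROM account WHERE balance > 450").fetchall()
print(rows)  # [('A-101',)]
```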
Data Definition Language (DDL)
- Specification notation for defining the database schema.
  Example:

      create table account (
          account_number char(10),
          branch_name    char(10),
          balance        integer)

- DDL compiler generates a set of tables stored in a data dictionary.
- Data dictionary contains metadata (i.e., data about data):
  - Database schema
  - Data storage and definition language
    - Specifies the storage structure and access methods used
  - Integrity constraints
    - Domain constraints
    - Referential integrity (e.g., branch_name must correspond to a valid branch in the branch table)
  - Authorization
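The idea that the DDL compiler records the schema in a data dictionary can be observed directly in SQLite (used as a stand-in here): its catalog table sqlite_master holds a metadata row for every schema object created by DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The DDL statement defines the schema of the account relation.
conn.execute("""CREATE TABLE account (
    account_number CHAR(10),
    branch_name    CHAR(10),
    balance        INTEGER)""")

# SQLite's data dictionary is the sqlite_master catalog: it stores
# metadata (data about data) for every schema object.
meta = conn.execute(
    "SELECT name, type FROM sqlite_master WHERE name = 'account'").fetchone()
print(meta)  # ('account', 'table')
```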
Relational Model
- Example of tabular data in the relational model: (figure)

A Sample Relational Database: (figure)
SQL
- SQL: widely used non-procedural language.
  - Example: find the name of the customer with customer-id 192-83-7465:

        select customer.customer_name
        from customer
        where customer.customer_id = '192-83-7465'

  - Example: find the balances of all accounts held by the customer with customer-id 192-83-7465:

        select account.balance
        from depositor, account
        where depositor.customer_id = '192-83-7465'
          and depositor.account_number = account.account_number

- Application programs generally access databases through one of:
  - Language extensions to allow embedded SQL
  - Application program interfaces (e.g., ODBC/JDBC) which allow SQL queries to be sent to a database
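Both example queries can be run as written through Python's sqlite3 module. The schema follows the notes' bank example; the customer name "Johnson" and the balances are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer  (customer_id TEXT, customer_name TEXT);
CREATE TABLE account   (account_number TEXT, balance INTEGER);
CREATE TABLE depositor (customer_id TEXT, account_number TEXT);
INSERT INTO customer  VALUES ('192-83-7465', 'Johnson');
INSERT INTO account   VALUES ('A-101', 500);
INSERT INTO account   VALUES ('A-201', 900);
INSERT INTO depositor VALUES ('192-83-7465', 'A-101');
INSERT INTO depositor VALUES ('192-83-7465', 'A-201');
""")

# Query 1: the customer's name.
name = conn.execute("""SELECT customer.customer_name FROM customer
                       WHERE customer.customer_id = '192-83-7465'""").fetchone()[0]

# Query 2: balances of all accounts held by that customer (a join).
balances = sorted(r[0] for r in conn.execute("""
    SELECT account.balance FROM depositor, account
    WHERE depositor.customer_id = '192-83-7465'
      AND depositor.account_number = account.account_number"""))
print(name, balances)  # Johnson [500, 900]
```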
Database Users
Users are differentiated by the way they expect to interact with the system:
- Application programmers: interact with the system through DML calls.
- Sophisticated users: form requests in a database query language.
- Specialized users: write specialized database applications that do not fit into the traditional data processing framework.
- Naïve users: invoke one of the permanent application programs that have been written previously.
  - Examples: people accessing a database over the web, bank tellers, clerical staff.
Database Administrator
- Coordinates all the activities of the database system.
  - Has a good understanding of the enterprise's information resources and needs.
- Database administrator's duties include:
  - Storage structure and access method definition
  - Schema and physical organization modification
  - Granting users authority to access the database
  - Backing up data
  - Monitoring performance and responding to changes
    - Database tuning
Data Storage and Querying
- Storage management
- Query processing
- Transaction processing

Storage Management
- The storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.
- The storage manager is responsible for the following tasks:
  - Interaction with the file manager
  - Efficient storing, retrieving and updating of data
- Issues:
  - Storage access
  - File organization
  - Indexing and hashing
Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation

- Alternative ways of evaluating a given query:
  - Equivalent expressions
  - Different algorithms for each operation
- The cost difference between a good and a bad way of evaluating a query can be enormous.
- Need to estimate the cost of operations:
  - Depends critically on statistical information about relations, which the database must maintain.
  - Need to estimate statistics for intermediate results to compute the cost of complex expressions.
Transaction Management
- A transaction is a collection of operations that performs a single logical function in a database application.
- The transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures.
- The concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database.
Database Architecture
The architecture of a database system is greatly influenced by the underlying computer system on which the database is running:
- Centralized
- Client-server
- Parallel (multiple processors and disks)
- Distributed

Overall System Structure: (figure)

Database Application Architectures: (figure)
UNIT-II
History of Database Systems
- 1950s and early 1960s:
  - Data processing using magnetic tapes for storage
  - Tapes provide only sequential access
  - Punched cards for input
- Late 1960s and 1970s:
  - Hard disks allow direct access to data
  - Network and hierarchical data models in widespread use
  - Ted Codd defines the relational data model
    - Would win the ACM Turing Award for this work
  - IBM Research begins System R prototype
  - UC Berkeley begins Ingres prototype
  - High-performance (for the era) transaction processing
History (cont.)
- 1980s:
  - Research relational prototypes evolve into commercial systems
    - SQL becomes industry standard
  - Parallel and distributed database systems
  - Object-oriented database systems
- 1990s:
  - Large decision support and data-mining applications
  - Large multi-terabyte data warehouses
  - Emergence of Web commerce
- 2000s:
  - XML and XQuery standards
  - Automated database administration
  - Increasing use of highly parallel database systems
  - Web-scale distributed data storage systems

Database Design
- Conceptual design (the ER Model is used at this stage):
  - What are the entities and relationships in the enterprise?
  - What information about these entities and relationships should we store in the database?
  - What are the integrity constraints or business rules that hold?
- A database schema in the ER Model can be represented pictorially (ER diagrams).
- Can map an ER diagram into a relational schema.

Modeling
- A database can be modeled as:
  - a collection of entities,
  - relationships among entities.
- An entity is an object that exists and is distinguishable from other objects.
  - Example: specific person, company, event, plant
- Entities have attributes.
  - Example: people have names and addresses
- An entity set is a set of entities of the same type that share the same properties.
  - Example: set of all persons, companies, trees, holidays

Entity Sets customer and loan: (figure)
Attributes
- An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
- Domain: the set of permitted values for each attribute.
- Attribute types:
  - Simple and composite attributes.
  - Single-valued and multi-valued attributes.
    - Example: multivalued attribute: phone_numbers
  - Derived attributes.
    - Can be computed from other attributes
    - Example: age, given date_of_birth
Mapping Cardinality Constraints
- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many

Mapping Cardinalities: (figure)
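The four mapping cardinalities can be checked mechanically. The helper below is an illustrative sketch (not from the notes): it classifies a binary relationship set, given as a set of (a, b) pairs, by counting how many partners each entity has on each side.

```python
# Classify the mapping cardinality of a binary relationship set,
# represented as a set of (a, b) pairs.
def mapping_cardinality(pairs):
    a_counts, b_counts = {}, {}
    for a, b in pairs:
        a_counts[a] = a_counts.get(a, 0) + 1
        b_counts[b] = b_counts.get(b, 0) + 1
    a_many = any(c > 1 for c in a_counts.values())  # some a relates to many b
    b_many = any(c > 1 for c in b_counts.values())  # some b relates to many a
    if a_many and b_many:
        return "many to many"
    if a_many:
        return "one to many"
    if b_many:
        return "many to one"
    return "one to one"

print(mapping_cardinality({("c1", "l1"), ("c1", "l2")}))  # one to many
print(mapping_cardinality({("c1", "l1"), ("c2", "l1")}))  # many to one
```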
ER Model Basics
- Entity: real-world object distinguishable from other objects. An entity is described (in the DB) using a set of attributes.
- Entity Set: a collection of similar entities. E.g., all employees.
  - All entities in an entity set have the same set of attributes. (Until we consider ISA hierarchies, anyway!)
  - Each entity set has a key.
  - Each attribute has a domain.

ER Model Basics (Contd.)
- Relationship: association among two or more entities. E.g., Attishoo works in the Pharmacy department.
- Relationship Set: collection of similar relationships.
- An n-ary relationship set R relates n entity sets E1 ... En; each relationship in R involves entities e1 ∈ E1, ..., en ∈ En.
- The same entity set could participate in different relationship sets, or in different "roles" in the same set.
- A relationship is an association among several entities.
  - Example: the customer entity Hayes is related to the account entity A-102 through the relationship set depositor.
- A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:

      {(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}

  where (e1, e2, ..., en) is a relationship.
  - Example: (Hayes, A-102) ∈ depositor

Relationship Set borrower: (figure)
Relationship Sets (Cont.)
- An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access_date.
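The set-of-tuples formulation of a relationship set mirrors directly into code. The entity and relationship values below follow the Hayes/A-102 example from the notes; the Jones/A-101 pair is added for illustration.

```python
# Entity sets (values are sample data).
customers = {"Hayes", "Jones"}
accounts = {"A-101", "A-102"}

# A relationship set is a set of tuples drawn from the entity sets.
depositor = {("Hayes", "A-102"), ("Jones", "A-101")}

# Every relationship must draw its components from the entity sets.
assert all(c in customers and a in accounts for c, a in depositor)

# Membership test: (Hayes, A-102) ∈ depositor
print(("Hayes", "A-102") in depositor)  # True
```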
Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets.

Additional Features of the ER Model

Participation Constraints
- Does every department have a manager? If so, this is a participation constraint: the participation of Departments in Manages is said to be total (vs. partial).
- Every Departments entity must appear in an instance of the Manages relationship.
Weak Entities
- A weak entity can be identified uniquely only by considering the primary key of another (owner) entity.
- Owner entity set and weak entity set must participate in a one-to-many relationship set (one owner, many weak entities).
- Weak entity set must have total participation in this identifying relationship set.

Weak Entity Sets (Cont.)
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment_number: discriminator of the payment entity set.
- Primary key for payment: (loan_number, payment_number).
- Note: the primary key of the strong entity set is not explicitly stored with the weak entity set, since it is implicit in the identifying relationship. If loan_number were explicitly stored, payment could be made a strong entity, but then the relationship between payment and loan would be duplicated by an implicit relationship defined by the attribute loan_number common to payment and loan.

More Weak Entity Set Examples
- In a university, a course is a strong entity and a course_offering can be modeled as a weak entity.
- The discriminator of course_offering would be semester (including year) and section_number (if there is more than one section).
- If we model course_offering as a strong entity, we would model course_number as an attribute; then the relationship with course would be implicit in the course_number attribute.
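The loan/payment weak-entity translation can be sketched in SQL and run through sqlite3 (SQLite is a stand-in; the pragma enabling foreign keys is SQLite-specific, and the loan/payment amounts are invented). The weak entity's key is the composite (loan_number, payment_number), and deleting the owner removes its weak entities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE loan (loan_number CHAR(10) PRIMARY KEY, amount REAL);
-- Weak entity: payment_number alone is only a discriminator;
-- the key borrows loan_number from the owning loan.
CREATE TABLE payment (
    loan_number    CHAR(10),
    payment_number INTEGER,
    payment_amount REAL,
    PRIMARY KEY (loan_number, payment_number),
    FOREIGN KEY (loan_number) REFERENCES loan ON DELETE CASCADE);
INSERT INTO loan VALUES ('L-17', 1000);
INSERT INTO payment VALUES ('L-17', 1, 100);
INSERT INTO payment VALUES ('L-17', 2, 100);
""")

# Deleting the owner entity deletes its weak entities.
conn.execute("DELETE FROM loan WHERE loan_number = 'L-17'")
remaining = conn.execute("SELECT COUNT(*) FROM payment").fetchone()[0]
print(remaining)  # 0
```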
UNIT-III
1. Introduction to relational model
2. Enforcing integrity constraints
3. Logical Database Design
4. Logical Database Design
5. Introduction to Views
6. Relational Algebra
7. Tuple Relational Calculus
8. Domain Relational Calculus
Relational Database: Definitions
- Relational database: a set of relations.
- Relation: made up of 2 parts:
  - Instance: a table, with rows and columns. #rows = cardinality, #fields = degree / arity.
  - Schema: specifies the name of the relation, plus the name and type of each column.
    - E.g., Students(sid: string, name: string, login: string, age: integer, gpa: real).
- Can think of a relation as a set of rows or tuples (i.e., all rows are distinct).

Example Instance of Students Relation

    sid    name   login       age  gpa
    53666  Jones  jones@cs    18   3.4
    53688  Smith  smith@eecs  18   3.2
    53650  Smith  smith@math  19   3.8

Cardinality = 3, degree = 5, all rows distinct. Do all columns in a relation instance have to be distinct?

Relational Query Languages
- A major strength of the relational model: supports simple, powerful querying of data.
- Queries can be written intuitively, and the DBMS is responsible for efficient evaluation.
  - The key: precise semantics for relational queries.
  - Allows the optimizer to extensively re-order operations, and still ensure that the answer does not change.
The SQL Query Language

    SELECT *
    FROM Students S
    WHERE S.age = 18

Result:

    sid    name   login     age  gpa
    53666  Jones  jones@cs  18   3.4
    53688  Smith  smith@ee  18   3.2

- To find just names and logins, replace the first line: SELECT S.name, S.login
Querying Multiple Relations
- What does the following query compute?

    SELECT S.name, E.cid
    FROM Students S, Enrolled E
    WHERE S.sid = E.sid AND E.grade = 'A'

Given these instances of Students and Enrolled:

    sid    name   login       age  gpa
    53666  Jones  jones@cs    18   3.4
    53688  Smith  smith@eecs  18   3.2
    53650  Smith  smith@math  19   3.8

    sid    cid          grade
    53831  Carnatic101  C
    53831  Reggae203    B
    53650  Topology112  A
    53666  History105   B

we get:

    S.name  E.cid
    Smith   Topology112
Creating Relations in SQL
- Creates the Students relation. Observe that the type of each field is specified, and enforced by the DBMS whenever tuples are added or modified.

    CREATE TABLE Students
        (sid CHAR(20), name CHAR(20), login CHAR(10),
         age INTEGER, gpa REAL)

- As another example, the Enrolled table holds information about courses that students take.

    CREATE TABLE Enrolled
        (sid CHAR(20), cid CHAR(20), grade CHAR(2))

Destroying and Altering Relations

    DROP TABLE Students

- Destroys the relation Students. The schema information and the tuples are deleted.

    ALTER TABLE Students
        ADD COLUMN firstYear INTEGER
- The schema of Students is altered by adding a new field; every tuple in the current instance is extended with a null value in the new field.

Adding and Deleting Tuples
- Can insert a single tuple using:

    INSERT INTO Students (sid, name, login, age, gpa)
    VALUES (53688, 'Smith', 'smith@ee', 18, 3.2)

- Can delete all tuples satisfying some condition (e.g., name = Smith):

    DELETE FROM Students S
    WHERE S.name = 'Smith'
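Both statements can be exercised through sqlite3 (a stand-in DBMS; note SQLite does not accept the table alias `S` in DELETE, so it is dropped below):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students "
             "(sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa REAL)")

# Insert a single tuple.
conn.execute("INSERT INTO Students (sid, name, login, age, gpa) "
             "VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)")
assert conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0] == 1

# Delete all tuples satisfying a condition.
conn.execute("DELETE FROM Students WHERE name = 'Smith'")
count = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]
print(count)  # 0
```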
Integrity Constraints (ICs)
- IC: a condition that must be true for any instance of the database; e.g., domain constraints.
  - ICs are specified when the schema is defined.
  - ICs are checked when relations are modified.
- A legal instance of a relation is one that satisfies all specified ICs.
  - DBMS should not allow illegal instances.
- If the DBMS checks ICs, stored data is more faithful to real-world meaning.
  - Avoids data entry errors, too!

Primary Key Constraints
- A set of fields is a key for a relation if:
  1. No two distinct tuples can have the same values in all key fields, and
  2. This is not true for any subset of the key.
  - Part 2 false? A superkey.
  - If there is more than one key for a relation, one of the keys is chosen (by the DBA) to be the primary key.
- E.g., sid is a key for Students. (What about name?) The set {sid, gpa} is a superkey.
Primary and Candidate Keys in SQL
- Possibly many candidate keys (specified using UNIQUE), one of which is chosen as the primary key.
- Compare: "For a given student and course, there is a single grade." vs. "Students can take only one course, and receive a single grade for that course; further, no two students in a course receive the same grade."
- Used carelessly, an IC can prevent the storage of database instances that arise in practice!

    CREATE TABLE Enrolled
        (sid CHAR(20),
         cid CHAR(20),
         grade CHAR(2),
         PRIMARY KEY (sid, cid))

    CREATE TABLE Enrolled
        (sid CHAR(20),
         cid CHAR(20),
         grade CHAR(2),
         PRIMARY KEY (sid),
         UNIQUE (cid, grade))

Foreign Keys, Referential Integrity
- Foreign key: a set of fields in one relation that is used to `refer' to a tuple in another relation. (Must correspond to the primary key of the second relation.) Like a `logical pointer'.
- E.g., sid is a foreign key referring to Students:
  - Enrolled(sid: string, cid: string, grade: string)
  - If all foreign key constraints are enforced, referential integrity is achieved, i.e., no dangling references.
  - Can you name a data model w/o referential integrity? Links in HTML!
Foreign Keys in SQL
- Only students listed in the Students relation should be allowed to enroll for courses.

    CREATE TABLE Enrolled
        (sid CHAR(20), cid CHAR(20), grade CHAR(2),
         PRIMARY KEY (sid, cid),
         FOREIGN KEY (sid) REFERENCES Students)

Enrolled:

    sid    cid          grade
    53666  Carnatic101  C
    53666  Reggae203    B
    53650  Topology112  A
    53666  History105   B

Students:

    sid    name   login       age  gpa
    53666  Jones  jones@cs    18   3.4
    53688  Smith  smith@eecs  18   3.2
    53650  Smith  smith@math  19   3.8
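The foreign-key check can be reproduced with sqlite3 (a stand-in; SQLite enforces foreign keys only after the SQLite-specific pragma below, and when only the parent table is named the reference defaults to its primary key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Students (sid CHAR(20) PRIMARY KEY, name CHAR(20));
CREATE TABLE Enrolled (
    sid CHAR(20), cid CHAR(20), grade CHAR(2),
    PRIMARY KEY (sid, cid),
    FOREIGN KEY (sid) REFERENCES Students);
INSERT INTO Students VALUES ('53666', 'Jones');
""")

# This sid exists in Students, so the insert is accepted.
conn.execute("INSERT INTO Enrolled VALUES ('53666', 'History105', 'B')")

# A non-existent sid would be a dangling reference, so it is rejected.
try:
    conn.execute("INSERT INTO Enrolled VALUES ('99999', 'Reggae203', 'B')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```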
Enforcing Referential Integrity
- Consider Students and Enrolled; sid in Enrolled is a foreign key that references Students.
- What should be done if an Enrolled tuple with a non-existent student id is inserted? (Reject it!)
- What should be done if a Students tuple is deleted? Options:
  - Also delete all Enrolled tuples that refer to it.
  - Disallow deletion of a Students tuple that is referred to.
  - Set sid in Enrolled tuples that refer to it to a default sid.
  - (In SQL, also: set sid in Enrolled tuples that refer to it to a special value null, denoting `unknown' or `inapplicable'.)
- Similar options apply if the primary key of a Students tuple is updated.

Referential Integrity in SQL
- SQL/92 and SQL:1999 support all 4 options on deletes and updates.
  - Default is NO ACTION (delete/update is rejected)
  - CASCADE (also delete all tuples that refer to deleted tuple)
  - SET NULL / SET DEFAULT (sets foreign key value of referencing tuple)

    CREATE TABLE Enrolled
        (sid CHAR(20), cid CHAR(20), grade CHAR(2),
         PRIMARY KEY (sid, cid),
         FOREIGN KEY (sid) REFERENCES Students
             ON DELETE CASCADE
             ON UPDATE SET DEFAULT)
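The CASCADE option can be observed directly with sqlite3 (stand-in DBMS, invented rows): deleting a Students tuple also deletes every Enrolled tuple that refers to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
conn.executescript("""
CREATE TABLE Students (sid CHAR(20) PRIMARY KEY, name CHAR(20));
CREATE TABLE Enrolled (
    sid CHAR(20), cid CHAR(20), grade CHAR(2),
    PRIMARY KEY (sid, cid),
    FOREIGN KEY (sid) REFERENCES Students ON DELETE CASCADE);
INSERT INTO Students VALUES ('53666', 'Jones');
INSERT INTO Enrolled VALUES ('53666', 'History105', 'B');
INSERT INTO Enrolled VALUES ('53666', 'Reggae203', 'C');
""")

# Deleting the Students tuple cascades to its Enrolled tuples.
conn.execute("DELETE FROM Students WHERE sid = '53666'")
remaining = conn.execute("SELECT COUNT(*) FROM Enrolled").fetchone()[0]
print(remaining)  # 0
```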
Where do ICs Come From?
- ICs are based upon the semantics of the real-world enterprise that is being described in the database relations.
- We can check a database instance to see if an IC is violated, but we can NEVER infer that an IC is true by looking at an instance.
  - An IC is a statement about all possible instances!
  - For example, we know name is not a key, but the assertion that sid is a key is given to us.
- Key and foreign key ICs are the most common; more general ICs are supported too.

Logical DB Design: ER to Relational
    CREATE TABLE Employees (ssn CHAR(11),
                            name CHAR(20),
                            lot INTEGER,
                            PRIMARY KEY (ssn))

Relationship Sets to Tables
In translating a relationship set to a relation, attributes of the relation must include:
- Keys for each participating entity set (as foreign keys). This set of attributes forms a superkey for the relation.
- All descriptive attributes.
    CREATE TABLE Works_In(
        ssn CHAR(11),
        did INTEGER,
        since DATE,
        PRIMARY KEY (ssn, did),
        FOREIGN KEY (ssn) REFERENCES Employees,
        FOREIGN KEY (did) REFERENCES Departments)

Review: Key Constraints
- Each dept has at most one manager, according to the key constraint on Manages.

Translating ER Diagrams with Key Constraints
- Map the relationship to a table: note that did is the key now! Separate tables for Employees and Departments.
- Since each department has a unique manager, we could instead combine Manages and Departments.

    CREATE TABLE Manages(
        ssn CHAR(11),
        did INTEGER,
        since DATE,
        PRIMARY KEY (did),
        FOREIGN KEY (ssn) REFERENCES Employees,
        FOREIGN KEY (did) REFERENCES Departments)

    CREATE TABLE Dept_Mgr(
        did INTEGER,
        dname CHAR(20),
        budget REAL,
        ssn CHAR(11),
        since DATE,
        PRIMARY KEY (did),
        FOREIGN KEY (ssn) REFERENCES Employees)

Review: Participation Constraints
- Does every department have a manager?
  - If so, this is a participation constraint: the participation of Departments in Manages is said to be total (vs. partial).
  - Every did value in the Departments table must appear in a row of the Manages table (with a non-null ssn value!).

Participation Constraints in SQL
- We can capture participation constraints involving one entity set in a binary relationship, but little else (without resorting to CHECK constraints).

    CREATE TABLE Dept_Mgr(
        did INTEGER,
        dname CHAR(20),
        budget REAL,
        ssn CHAR(11) NOT NULL,
        since DATE,
        PRIMARY KEY (did),
        FOREIGN KEY (ssn) REFERENCES Employees
            ON DELETE NO ACTION)
Review: Weak Entities
- A weak entity can be identified uniquely only by considering the primary key of another (owner) entity.
  - Owner entity set and weak entity set must participate in a one-to-many relationship set (1 owner, many weak entities).
  - Weak entity set must have total participation in this identifying relationship set.

Translating Weak Entity Sets
- Weak entity set and identifying relationship set are translated into a single table.
- When the owner entity is deleted, all owned weak entities must also be deleted.

    CREATE TABLE Dep_Policy (
        pname CHAR(20),
        age INTEGER,
        cost REAL,
        ssn CHAR(11) NOT NULL,
        PRIMARY KEY (pname, ssn),
        FOREIGN KEY (ssn) REFERENCES Employees
            ON DELETE CASCADE)

Review: ISA Hierarchies
- As in C++ or other PLs, attributes are inherited.
- If we declare A ISA B, every A entity is also considered to be a B entity.
- Overlap constraints: Can Joe be an Hourly_Emps as well as a Contract_Emps entity? (Allowed/disallowed)
- Covering constraints: Does every Employees entity also have to be an Hourly_Emps or a Contract_Emps entity? (Yes/no)

Translating ISA Hierarchies to Relations
- General approach:
  - 3 relations: Employees, Hourly_Emps and Contract_Emps.
    - Hourly_Emps: every employee is recorded in Employees. For hourly emps, extra info is recorded in Hourly_Emps (hourly_wages, hours_worked, ssn); must delete the Hourly_Emps tuple if the referenced Employees tuple is deleted.
    - Queries involving all employees are easy; those involving just Hourly_Emps require a join to get some attributes.
- Alternative: just Hourly_Emps and Contract_Emps.
  - Hourly_Emps: ssn, name, lot, hourly_wages, hours_worked.
  - Each employee must be in one of these two subclasses.

Review: Binary vs. Ternary Relationships
- What are the additional constraints in the 2nd diagram?
- The key constraints allow us to combine Purchaser with Policies and Beneficiary with Dependents.
- Participation constraints lead to NOT NULL constraints.
- What if Policies is a weak entity set?

    CREATE TABLE Policies (
        policyid INTEGER,
        cost REAL,
        ssn CHAR(11) NOT NULL,
        PRIMARY KEY (policyid),
        FOREIGN KEY (ssn) REFERENCES Employees
            ON DELETE CASCADE)

    CREATE TABLE Dependents (
        pname CHAR(20),
        age INTEGER,
        policyid INTEGER,
        PRIMARY KEY (pname, policyid),
        FOREIGN KEY (policyid) REFERENCES Policies
            ON DELETE CASCADE)

Views
- A view is just a relation, but we store a definition, rather than a set of tuples.

    CREATE VIEW YoungActiveStudents (name, grade)
    AS SELECT S.name, E.grade
       FROM Students S, Enrolled E
       WHERE S.sid = E.sid AND S.age < 21

- Views can be dropped using the DROP VIEW command.
  - How to handle DROP TABLE if there's a view on the table? The DROP TABLE command has options to let the user specify this.

Views and Security
- Views can be used to present necessary information (or a summary), while hiding details in underlying relation(s).
- Given YoungActiveStudents, but not Students or Enrolled, we can find students who are enrolled, but not the cid's of the courses they are enrolled in.
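The YoungActiveStudents view can be created and queried through sqlite3 (stand-in DBMS, invented rows); note that only the definition is stored, and the view is then queried like any relation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid TEXT, name TEXT, age INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
INSERT INTO Students VALUES ('53666', 'Jones', 18);
INSERT INTO Students VALUES ('53650', 'Smith', 25);
INSERT INTO Enrolled VALUES ('53666', 'History105', 'B');
INSERT INTO Enrolled VALUES ('53650', 'Topology112', 'A');

-- The view stores a definition, not tuples.
CREATE VIEW YoungActiveStudents (name, grade) AS
    SELECT S.name, E.grade
    FROM Students S, Enrolled E
    WHERE S.sid = E.sid AND S.age < 21;
""")

# Queried like an ordinary relation; Smith (age 25) is filtered out.
rows = conn.execute("SELECT * FROM YoungActiveStudents").fetchall()
print(rows)  # [('Jones', 'B')]
```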
View Definition
- A relation that is not part of the conceptual model but is made visible to a user as a "virtual relation" is called a view.
- A view is defined using the create view statement, which has the form

      create view v as <query expression>

  where <query expression> is any legal SQL expression. The view name is represented by v.
- Once a view is defined, the view name can be used to refer to the virtual relation that the view generates.

Example Queries
- A view consisting of branches and their customers:

      create view all_customer as
          (select branch_name, customer_name
           from depositor, account
           where depositor.account_number = account.account_number)
          union
          (select branch_name, customer_name
           from borrower, loan
           where borrower.loan_number = loan.loan_number)

- Find all customers of the Perryridge branch:

      select customer_name
      from all_customer
      where branch_name = 'Perryridge'
Uses of Views
- Hiding some information from some users.
  - Consider a user who needs to know a customer's name, loan number and branch name, but has no need to see the loan amount.
  - Define a view:

        create view cust_loan_data as
            select customer_name, borrower.loan_number, branch_name
            from borrower, loan
            where borrower.loan_number = loan.loan_number

  - Grant the user permission to read cust_loan_data, but not borrower or loan.
- Predefined queries to make writing of other queries easier.
  - Common example: aggregate queries used for statistical analysis of data.
Processing of Views
- When a view is created:
  - the query expression is stored in the database along with the view name;
  - the expression is substituted into any query using the view.
- View definitions containing views:
  - One view may be used in the expression defining another view.
  - A view relation v1 is said to depend directly on a view relation v2 if v2 is used in the expression defining v1.
  - A view relation v1 is said to depend on a view relation v2 if either v1 depends directly on v2 or there is a path of dependencies from v1 to v2.
  - A view relation v is said to be recursive if it depends on itself.
View Expansion
• A way to define the meaning of views defined in terms of other views.
• Let view v1 be defined by an expression e1 that may itself contain uses of view relations.
• View expansion of an expression repeats the following replacement step:
  repeat
    Find any view relation vi in e1
    Replace the view relation vi by the expression defining vi
  until no more view relations are present in e1
• As long as the view definitions are not recursive, this loop will terminate.
With Clause
• The with clause provides a way of defining a temporary view whose definition is available only to the query in which the with clause occurs.
• Find all accounts with the maximum balance:
  with max_balance (value) as
    select max (balance) from account
  select account_number
  from account, max_balance
  where account.balance = max_balance.value
Complex Queries using With Clause
• Find all branches where the total account deposit is greater than the average of the total account deposits at all branches:
  with branch_total (branch_name, value) as
    select branch_name, sum (balance)
    from account
    group by branch_name
  with branch_total_avg (value) as
    select avg (value)
    from branch_total
  select branch_name
  from branch_total, branch_total_avg
  where branch_total.value >= branch_total_avg.value
• Note: the exact syntax supported by your database may vary slightly. E.g. Oracle syntax is of the form:
  with branch_total as (select ..),
       branch_total_avg as (select ..)
  select ...
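The max-balance query runs as written on SQLite (which, like Oracle, uses the comma-separated CTE syntax). A minimal sketch with made-up account balances:

```python
# Find the account(s) with the maximum balance using a WITH clause.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE account (account_number TEXT, branch_name TEXT, balance INT);
INSERT INTO account VALUES ('A-101', 'Downtown', 500);
INSERT INTO account VALUES ('A-215', 'Mianus',   700);
INSERT INTO account VALUES ('A-217', 'Brighton', 750);
""")

rows = cur.execute("""
WITH max_balance(value) AS (SELECT MAX(balance) FROM account)
SELECT account_number
FROM account, max_balance
WHERE account.balance = max_balance.value
""").fetchall()
print(rows)  # [('A-217',)]
```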
Update of a View
•
Create a view of all loan data in the loan relation, hiding the amount attribute create view loan_branch as select loan_number, branch_name from loan
•
Add a new tuple to loan_branch insert into loan_branch values ('L-37‘, 'Perryridge‘) This insertion must be represented by the insertion of the tuple ('L-37', 'Perryridge', null ) into the loan relation
Formal Relational Query Languages
• Two mathematical query languages form the basis for "real" languages (e.g. SQL), and for implementation:
  – Relational Algebra: more operational, very useful for representing execution plans.
  – Relational Calculus: lets users describe what they want, rather than how to compute it. (Non-operational, declarative.)
Preliminaries
• A query is applied to relation instances, and the result of a query is also a relation instance.
  – Schemas of input relations for a query are fixed (but the query will run regardless of instance!)
  – The schema for the result of a given query is also fixed! Determined by the definition of the query language constructs.
• Positional vs. named-field notation:
  – Positional notation easier for formal definitions, named-field notation more readable.
  – Both used in SQL
Example Instances
• "Sailors" and "Reserves" relations for our examples.
• We'll use positional or named-field notation, and assume that names of fields in query results are `inherited' from names of fields in query input relations.

  S1: sid  sname   rating  age        R1: sid  bid  day
      22   dustin  7       45.0           22   101  10/10/96
      31   lubber  8       55.5           58   103  11/12/96
      58   rusty   10      35.0

  S2: sid  sname   rating  age
      28   yuppy   9       35.0
      31   lubber  8       55.5
      44   guppy   5       35.0
      58   rusty   10      35.0

Relational Algebra
• Basic operations:
  – Selection (σ): selects a subset of rows from a relation.
  – Projection (π): deletes unwanted columns from a relation.
  – Cross-product (×): allows us to combine two relations.
  – Set-difference (−): tuples in reln. 1, but not in reln. 2.
  – Union (∪): tuples in reln. 1 and in reln. 2.
• Additional operations:
  – Intersection, join, division, renaming: not essential, but (very!) useful.
• Since each operation returns a relation, operations can be composed! (Algebra is "closed".)
Projection
• Deletes attributes that are not in the projection list.
• Schema of result contains exactly the fields in the projection list, with the same names that they had in the (only) input relation.
• Projection operator has to eliminate duplicates! (Why??)
  – Note: real systems typically don't do duplicate elimination unless the user explicitly asks for it. (Why not?)

  π_sname,rating(S2): sname   rating        π_age(S2): age
                      yuppy   9                        35.0
                      lubber  8                        55.5
                      guppy   5
                      rusty   10
Selection
• Selects rows that satisfy the selection condition.
• No duplicates in result! (Why?)
• Schema of result identical to schema of (only) input relation.
• Result relation can be the input for another relational algebra operation! (Operator composition.)

  σ_rating>8(S2): sid  sname  rating  age      π_sname,rating(σ_rating>8(S2)): sname  rating
                  28   yuppy  9       35.0                                     yuppy  9
                  58   rusty  10      35.0                                     rusty  10

Union, Intersection, Set-Difference
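The selection and projection examples above can be mimicked with Python sets over the S2 instance; since Python sets eliminate duplicates, they model projection's duplicate elimination directly. The helper names `select` and `project` are illustrative, not standard library functions.

```python
# The S2 instance as a set of (sid, sname, rating, age) tuples.
S2 = {
    (28, "yuppy",  9, 35.0),
    (31, "lubber", 8, 55.5),
    (44, "guppy",  5, 35.0),
    (58, "rusty", 10, 35.0),
}

def select(rel, pred):
    """Selection: keep only the tuples satisfying pred."""
    return {t for t in rel if pred(t)}

def project(rel, *cols):
    """Projection: keep only the listed column positions; building a
    set eliminates duplicate result tuples automatically."""
    return {tuple(t[c] for c in cols) for t in rel}

high = select(S2, lambda t: t[2] > 8)   # sigma_{rating>8}(S2)
names = project(high, 1, 2)             # pi_{sname,rating}(sigma_{rating>8}(S2))
print(sorted(names))                    # [('rusty', 10), ('yuppy', 9)]
print(sorted(project(S2, 3)))           # the two 35.0 ages collapse to one
```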
Cross-Product
• Each row of S1 is paired with each row of R1.
• Result schema has one field per field of S1 and R1, with field names `inherited' if possible.
  – Conflict: both S1 and R1 have a field called sid.
  S1 × R1: (sid)  sname   rating  age   (sid)  bid  day
           22     dustin  7       45.0  22     101  10/10/96
           22     dustin  7       45.0  58     103  11/12/96
           31     lubber  8       55.5  22     101  10/10/96
           31     lubber  8       55.5  58     103  11/12/96
           58     rusty   10      35.0  22     101  10/10/96
           58     rusty   10      35.0  58     103  11/12/96

• Renaming operator: ρ(C(1→sid1, 5→sid2), S1 × R1)
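A cross-product in positional notation is simple tuple concatenation, which is exactly why the two sid columns clash and need the ρ renaming. A sketch over the S1 and R1 instances:

```python
# Cross-product of the S1 and R1 instances: each S1 row paired with each R1 row.
S1 = {(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5), (58, "rusty", 10, 35.0)}
R1 = {(22, 101, "10/10/96"), (58, 103, "11/12/96")}

def cross(a, b):
    """Pair every tuple of a with every tuple of b by concatenation."""
    return {s + r for s in a for r in b}

prod = cross(S1, R1)
print(len(prod))  # 3 * 2 = 6 rows, matching the table above
```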
Joins
• Condition Join: R ⋈_c S = σ_c(R × S)
• Equi-Join: a special case of condition join where the condition c contains only equalities.
• Result schema similar to cross-product, but only one copy of fields for which equality is specified.
• Natural Join: equijoin on all common fields.
Division
•
Not supported as a primitive operator, but useful for expressing queries like: Find sailors who have reserved all boats.
•
Let A have 2 fields, x and y; B have only field y: –
A/B =
–
i.e., A/B contains all x tuples (sailors) such that for every y tuple (boat) in B, there is an xy tuple in A.
–
Or: If the set of y values (boats) associated with an x value (sailor) in A contains all y values in B, the x value is in A/B.
•
In general, x and y can be any lists of fields; y is the list of fields in B, and x y is the list of fields of A.
•
Expressing A/B Using Basic Operators
•
Division is not essential op; just a useful shorthand. 48
– •
(Also true of joins, but joins are so common that systems implement joins specially.)
Idea: For A/B, compute all x values that are not `disqualified’ by some y value in B. –
x value is disqualified if by attaching y value from B, we obtain an xy tuple that is not in A.
π x((π x( A)×B)− A)
Disqualified x values
A/B: all disqualified tuples
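The disqualification idea can be sketched directly from that formula. Here A holds made-up (sailor, boat) reservations and B is the set of boats that must all be covered:

```python
# Division A/B via pi_x(A) - pi_x((pi_x(A) x B) - A).
A = {(22, 101), (22, 102), (22, 103), (31, 101), (58, 103)}
B = {(101,), (103,)}

def divide(a, b):
    x = {(t[0],) for t in a}                      # pi_x(A): all candidate x values
    disq = {xv + yv for xv in x for yv in b} - a  # (pi_x(A) x B) - A: missing pairs
    return x - {(t[0],) for t in disq}            # drop disqualified x values

print(divide(A, B))  # only sailor 22 reserved both boat 101 and boat 103
```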
Find names of sailors who've reserved boat #103
• Solution 1: π_sname((σ_bid=103 Reserves) ⋈ Sailors)
• Solution 2: ρ(Temp1, σ_bid=103 Reserves)
              ρ(Temp2, Temp1 ⋈ Sailors)
              π_sname(Temp2)
• Solution 3: π_sname(σ_bid=103(Reserves ⋈ Sailors))
Find names of sailors who've reserved a red boat
• Information about boat color is only available in Boats; so we need an extra join:
  π_sname((σ_color='red' Boats) ⋈ Reserves ⋈ Sailors)
• A more efficient solution:
  π_sname(π_sid((π_bid σ_color='red' Boats) ⋈ Reserves) ⋈ Sailors)
• A query optimizer can find this, given the first solution!
Find sailors who've reserved a red or a green boat
• Can identify all red or green boats, then find sailors who've reserved one of these boats:
  ρ(Tempboats, (σ_color='red' ∨ color='green' Boats))
  π_sname(Tempboats ⋈ Reserves ⋈ Sailors)
• Can also define Tempboats using union! (How?)
• What happens if ∨ is replaced by ∧ in this query?
Find sailors who've reserved a red and a green boat
• Previous approach won't work! Must identify sailors who've reserved red boats, sailors who've reserved green boats, then find the intersection (note that sid is a key for Sailors):
  ρ(Tempred, π_sid((σ_color='red' Boats) ⋈ Reserves))
  ρ(Tempgreen, π_sid((σ_color='green' Boats) ⋈ Reserves))
  π_sname((Tempred ∩ Tempgreen) ⋈ Sailors)
Relational Calculus
• Comes in two flavors: Tuple relational calculus (TRC) and Domain relational calculus (DRC).
• Calculus has variables, constants, comparison ops, logical connectives and quantifiers.
  – TRC: variables range over (i.e., get bound to) tuples.
  – DRC: variables range over domain elements (= field values).
  – Both TRC and DRC are simple subsets of first-order logic.
• Expressions in the calculus are called formulas. An answer tuple is essentially an assignment of constants to variables that makes the formula evaluate to true.
Domain Relational Calculus
• Query has the form: { ⟨x1, x2, ..., xn⟩ | p(⟨x1, x2, ..., xn⟩) }
• Answer includes all tuples ⟨x1, x2, ..., xn⟩ that make the formula p(⟨x1, x2, ..., xn⟩) true.
• Formula is recursively defined, starting with simple atomic formulas (getting tuples from relations or making comparisons of values), and building bigger and better formulas using the logical connectives.
DRC Formulas
• Atomic formula:
  – ⟨x1, x2, ..., xn⟩ ∈ Rname, or X op Y, or X op constant
  – op is one of <, >, =, ≤, ≥, ≠
• Formula:
  – an atomic formula, or
  – ¬p, p ∧ q, p ∨ q, where p and q are formulas, or
  – ∃X (p(X)), where variable X is free in p(X), or
  – ∀X (p(X)), where variable X is free in p(X)
• The use of quantifiers ∃X and ∀X is said to bind X. A variable that is not bound is free.
Free and Bound Variables
• The use of quantifiers ∃X and ∀X in a formula is said to bind X.
  – A variable that is not bound is free.
• Let us revisit the definition of a query:
  { ⟨x1, x2, ..., xn⟩ | p(⟨x1, x2, ..., xn⟩) }
• There is an important restriction: the variables x1, ..., xn that appear to the left of `|' must be the only free variables in the formula p(...).
Find all sailors with a rating above 7
  { ⟨I, N, T, A⟩ | ⟨I, N, T, A⟩ ∈ Sailors ∧ T > 7 }
• The condition ⟨I, N, T, A⟩ ∈ Sailors ensures that the domain variables I, N, T and A are bound to fields of the same Sailors tuple.
• The term ⟨I, N, T, A⟩ to the left of `|' (which should be read as such that) says that every tuple ⟨I, N, T, A⟩ that satisfies T > 7 is in the answer.
• Modify this query to answer:
  – Find sailors who are older than 18 or have a rating under 9, and are called 'Joe'.
Find sailors rated > 7 who have reserved boat #103
  { ⟨I, N, T, A⟩ | ⟨I, N, T, A⟩ ∈ Sailors ∧ T > 7 ∧
    ∃Ir, Br, D (⟨Ir, Br, D⟩ ∈ Reserves ∧ Ir = I ∧ Br = 103) }
• We have used ∃Ir, Br, D (...) as a shorthand for ∃Ir (∃Br (∃D (...))).
• Note the use of ∃ to find a tuple in Reserves that `joins with' the Sailors tuple under consideration.
Find sailors rated > 7 who've reserved a red boat
  { ⟨I, N, T, A⟩ | ⟨I, N, T, A⟩ ∈ Sailors ∧ T > 7 ∧
    ∃Ir, Br, D (⟨Ir, Br, D⟩ ∈ Reserves ∧ Ir = I ∧
      ∃B, BN, C (⟨B, BN, C⟩ ∈ Boats ∧ B = Br ∧ C = 'red')) }
• Observe how the parentheses control the scope of each quantifier's binding.
• This may look cumbersome, but with a good user interface, it is very intuitive. (MS Access, QBE)
Find sailors who've reserved all boats
  { ⟨I, N, T, A⟩ | ⟨I, N, T, A⟩ ∈ Sailors ∧
    ∀B, BN, C (¬(⟨B, BN, C⟩ ∈ Boats) ∨
      ∃Ir, Br, D (⟨Ir, Br, D⟩ ∈ Reserves ∧ I = Ir ∧ Br = B)) }
• Find all sailors I such that for each 3-tuple ⟨B, BN, C⟩ either it is not a tuple in Boats or there is a tuple in Reserves showing that sailor I has reserved it.
Find sailors who've reserved all boats (again!)
  { ⟨I, N, T, A⟩ | ⟨I, N, T, A⟩ ∈ Sailors ∧
    ∀⟨B, BN, C⟩ ∈ Boats (∃⟨Ir, Br, D⟩ ∈ Reserves (I = Ir ∧ Br = B)) }
• Simpler notation, same query. (Much clearer!)
• To find sailors who've reserved all red boats:
  .... C ≠ 'red' ∨ ∃⟨Ir, Br, D⟩ ∈ Reserves (I = Ir ∧ Br = B)
Unsafe Queries, Expressive Power
• It is possible to write syntactically correct calculus queries that have an infinite number of answers! Such queries are called unsafe.
  – e.g., { S | ¬(S ∈ Sailors) }
• It is known that every query that can be expressed in relational algebra can be expressed as a safe query in DRC/TRC; the converse is also true.
• Relational Completeness: a query language (e.g., SQL) can express every query that is expressible in relational algebra/calculus.
UNIT-IV
1. The Form of a Basic SQL Query
2. Query Operations & Nested Queries
3. Nested Queries
4. Aggregate Operators
5. Null Values
6. Complex Integrity Constraints in SQL-92
7. Triggers and Active Databases
8. Designing Active Databases
History
• IBM Sequel language developed as part of the System R project at the IBM San Jose Research Laboratory
• Renamed Structured Query Language (SQL)
• ANSI and ISO standard SQL:
  – SQL-86
  – SQL-89
  – SQL-92
  – SQL:1999 (language name became Y2K compliant!)
  – SQL:2003
• Commercial systems offer most, if not all, SQL-92 features, plus varying feature sets from later standards and special proprietary features.
  – Not all examples here may work on your particular system.
Data Definition Language
• Allows the specification of:
  – The schema for each relation, including attribute types
  – Integrity constraints
  – Authorization information for each relation
• Non-standard SQL extensions also allow specification of:
  – The set of indices to be maintained for each relation
  – The physical storage structure of each relation on disk
Create Table Construct
• An SQL relation is defined using the create table command:
  create table r (A1 D1, A2 D2, ..., An Dn,
                  (integrity-constraint1),
                  ...,
                  (integrity-constraintk))
  – r is the name of the relation
  – each Ai is an attribute name in the schema of relation r
  – Di is the data type of attribute Ai
• Example:
  create table branch
    (branch_name char(15),
     branch_city char(30),
     assets integer)
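The branch example can be created verbatim through sqlite3 (SQLite accepts char(n) and integer as declared type affinities). The inserted row is made-up sample data:

```python
# Create the branch table and verify a round-trip insert/select.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE branch (
    branch_name char(15),
    branch_city char(30),
    assets      integer
)
""")
cur.execute("INSERT INTO branch VALUES ('Perryridge', 'Horseneck', 1700000)")
row = cur.execute("SELECT * FROM branch").fetchone()
print(row)  # ('Perryridge', 'Horseneck', 1700000)
```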
Domain Types in SQL
•
char(n). Fixed length character string, with user-specified length n.
•
varchar(n). Variable length character strings, with user-specified maximum length n.
•
int. Integer (a finite subset of the integers that is machine-dependent).
•
smallint. Small integer (a machine-dependent subset of the integer domain type).
•
numeric(p,d). Fixed point number, with user-specified precision of p digits, with n digits to the right of decimal point.
•
real, double precision. Floating point and double-precision floating point numbers, with machine-dependent precision. 56
•
float(n). Floating point number, with user-specified precision of at least n digits.
•
More are covered in Chapter 4.
Integrity Constraints on Tables
• not null
• primary key (A1, ..., An)
• Example: declare branch_name as the primary key for branch:
  create table branch
    (branch_name char(15),
     branch_city char(30) not null,
     assets integer,
     primary key (branch_name))
• A primary key declaration on an attribute automatically ensures not null in SQL-92 onwards; it needs to be explicitly stated in SQL-89.
Basic Insertion and Deletion of Tuples
• A newly created table is empty.
• Add a new tuple to account:
  insert into account values ('A-9732', 'Perryridge', 1200)
  – Insertion fails if any integrity constraint is violated.
• Delete all tuples from account:
  delete from account
  – Note: we will see later how to delete selected tuples.
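Both behaviors (an insert failing on a constraint violation, and delete removing all tuples) can be demonstrated in sqlite3; the account values are made up:

```python
# A duplicate primary key makes the second insert fail; delete empties the table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE account (
    account_number char(10) PRIMARY KEY,
    branch_name    char(15),
    balance        integer)""")
cur.execute("INSERT INTO account VALUES ('A-9732', 'Perryridge', 1200)")

failed = False
try:
    # Same primary key again: the integrity constraint rejects it.
    cur.execute("INSERT INTO account VALUES ('A-9732', 'Downtown', 500)")
except sqlite3.IntegrityError:
    failed = True

cur.execute("DELETE FROM account")   # delete all tuples
count = cur.execute("SELECT COUNT(*) FROM account").fetchone()[0]
print(failed, count)  # True 0
```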
Drop and Alter Table Constructs
• The drop table command deletes all information about the dropped relation from the database.
• The alter table command is used to add attributes to an existing relation:
  alter table r add A D
  where A is the name of the attribute to be added to relation r and D is the domain of A.
  – All tuples in the relation are assigned null as the value for the new attribute.
• The alter table command can also be used to drop attributes of a relation:
  alter table r drop A
  where A is the name of an attribute of relation r.
  – Dropping of attributes is not supported by many databases.
Basic Query Structure
–
A typical SQL query has the form: select A1, A2, ..., An from r1, r2, ..., rm where P –
Ai represents an attribute
–
Ri represents a relation
–
P is a predicate.
–
This query is equivalent to the relational algebra expression.
–
The result of an SQL query is a relation.
•
The select Clause
•
The select clause list the attributes desired in the result of a query –
∏ A1 , A2 ,, An (σ P (r1 × r2 × × rm ))
corresponds to the projection operation of the relational algebra
• Example: find the names of all branches in the loan relation:
  select branch_name from loan
• In the relational algebra, the query would be: Π_branch_name(loan)
• NOTE: SQL names are case insensitive (i.e., you may use upper- or lower-case letters.)
  – E.g. Branch_Name ≡ BRANCH_NAME ≡ branch_name
  – Some people use upper case wherever we use bold font.
• SQL allows duplicates in relations as well as in query results.
• To force the elimination of duplicates, insert the keyword distinct after select.
• Find the names of all branches in the loan relation, and remove duplicates:
  select distinct branch_name from loan
• The keyword all specifies that duplicates not be removed:
  select all branch_name from loan
• An asterisk in the select clause denotes "all attributes":
  select * from loan
• The select clause can contain arithmetic expressions involving the operators +, –, *, and /, and operating on constants or attributes of tuples.
• E.g.:
  select loan_number, branch_name, amount * 100 from loan
The where Clause
• The where clause specifies conditions that the result must satisfy
  – corresponds to the selection predicate of the relational algebra
• To find all loan numbers for loans made at the Perryridge branch with loan amounts greater than $1200:
  select loan_number
  from loan
  where branch_name = 'Perryridge' and amount > 1200
• Comparison results can be combined using the logical connectives and, or, and not.
The from Clause
• The from clause lists the relations involved in the query
  – corresponds to the Cartesian product operation of the relational algebra
• Find the Cartesian product borrower × loan:
  select * from borrower, loan
• Find the name, loan number and loan amount of all customers having a loan at the Perryridge branch:
  select customer_name, borrower.loan_number, amount
  from borrower, loan
  where borrower.loan_number = loan.loan_number and branch_name = 'Perryridge'
The Rename Operation
• SQL allows renaming relations and attributes using the as clause: old-name as new-name
• E.g. find the name, loan number and loan amount of all customers; rename the column name loan_number as loan_id:
  select customer_name, borrower.loan_number as loan_id, amount
  from borrower, loan
  where borrower.loan_number = loan.loan_number
Tuple Variables
• Tuple variables are defined in the from clause via the use of the as clause.
• Find the customer names and their loan numbers and amounts for all customers having a loan at some branch:
  select customer_name, T.loan_number, S.amount
  from borrower as T, loan as S
  where T.loan_number = S.loan_number
• Find the names of all branches that have greater assets than some branch located in Brooklyn:
  select distinct T.branch_name
  from branch as T, branch as S
  where T.assets > S.assets and S.branch_city = 'Brooklyn'
• The keyword as is optional and may be omitted: borrower as T ≡ borrower T
  – Some databases, such as Oracle, require as to be omitted.
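The self-join with tuple variables T and S runs as written in sqlite3; the branch rows here are made-up sample data:

```python
# Branches with greater assets than some branch located in Brooklyn.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE branch (branch_name TEXT, branch_city TEXT, assets INT);
INSERT INTO branch VALUES ('Brighton',   'Brooklyn',  7100000);
INSERT INTO branch VALUES ('Downtown',   'Brooklyn',  9000000);
INSERT INTO branch VALUES ('Perryridge', 'Horseneck', 1700000);
INSERT INTO branch VALUES ('Mianus',     'Horseneck',  400000);
""")
rows = cur.execute("""
SELECT DISTINCT T.branch_name
FROM branch AS T, branch AS S
WHERE T.assets > S.assets AND S.branch_city = 'Brooklyn'
""").fetchall()
print(sorted(r[0] for r in rows))  # ['Downtown']
```

Only Downtown beats some Brooklyn branch (Brighton at 7.1M); note that Brighton itself does not qualify, since no Brooklyn branch has fewer assets than it except itself.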
Example Instances
• We will use these instances of the Sailors and Reserves relations in our examples.
• If the key for the Reserves relation contained only the attributes sid and bid, how would the semantics differ?
Basic SQL Query
  SELECT [DISTINCT] target-list
  FROM relation-list
  WHERE qualification
• relation-list: a list of relation names (possibly with a range-variable after each name).
• target-list: a list of attributes of relations in relation-list.
• qualification: comparisons (Attr op const or Attr1 op Attr2, where op is one of <, >, =, ≤, ≥, ≠) combined using AND, OR and NOT.
• DISTINCT is an optional keyword indicating that the answer should not contain duplicates. Default is that duplicates are not eliminated!
Conceptual Evaluation Strategy
• Semantics of an SQL query defined in terms of the following conceptual evaluation strategy:
  1. Compute the cross-product of relation-list.
  2. Discard resulting tuples if they fail qualification.
  3. Delete attributes that are not in target-list.
  4. If DISTINCT is specified, eliminate duplicate rows.
• This strategy is probably the least efficient way to compute a query! An optimizer will find more efficient strategies to compute the same answers.
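The four-step strategy can be implemented literally, which makes the semantics concrete. The function name `conceptual_eval` and the positional target-list encoding are illustrative choices, not standard:

```python
# Conceptual evaluation: cross-product, discard, project, then optional dedup.
Sailors  = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5), (58, "rusty", 10, 35.0)]
Reserves = [(22, 101, "10/10/96"), (58, 103, "11/12/96")]

def conceptual_eval(rels, qualification, target, distinct=False):
    rows = [()]
    for r in rels:                                       # 1. cross-product
        rows = [t + u for t in rows for u in r]
    rows = [t for t in rows if qualification(t)]         # 2. discard failures
    rows = [tuple(t[c] for c in target) for t in rows]   # 3. project target-list
    if distinct:                                         # 4. eliminate duplicates
        rows = list(dict.fromkeys(rows))
    return rows

# SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid = R.sid AND R.bid = 103
# Positions 0-3 come from Sailors, 4-6 from Reserves.
ans = conceptual_eval([Sailors, Reserves],
                      lambda t: t[0] == t[4] and t[5] == 103,
                      target=[1])
print(ans)  # [('rusty',)]
```

This is deliberately the least efficient strategy; a real optimizer would, for example, push the bid = 103 selection into Reserves before joining.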
Example of Conceptual Evaluation
  SELECT S.sname
  FROM Sailors S, Reserves R
  WHERE S.sid = R.sid AND R.bid = 103

  The cross-product of Sailors and Reserves:
  (sid)  sname   rating  age   (sid)  bid  day
  22     dustin  7       45.0  22     101  10/10/96
  22     dustin  7       45.0  58     103  11/12/96
  31     lubber  8       55.5  22     101  10/10/96
  31     lubber  8       55.5  58     103  11/12/96
  58     rusty   10      35.0  22     101  10/10/96
  58     rusty   10      35.0  58     103  11/12/96
A Note on Range Variables
• Really needed only if the same relation appears twice in the FROM clause. The previous query can also be written as:
  SELECT S.sname
  FROM Sailors S, Reserves R
  WHERE S.sid = R.sid AND bid = 103
  or:
  SELECT sname
  FROM Sailors, Reserves
  WHERE Sailors.sid = Reserves.sid AND bid = 103
• It is good style, however, to always use range variables!
Find sailors who've reserved at least one boat
  SELECT S.sid
  FROM Sailors S, Reserves R
  WHERE S.sid = R.sid
• Would adding DISTINCT to this query make a difference?
• What is the effect of replacing S.sid by S.sname in the SELECT clause? Would adding DISTINCT to this variant of the query make a difference?
Expressions and Strings
  SELECT S.age, age1=S.age-5, 2*S.age AS age2
  FROM Sailors S
  WHERE S.sname LIKE 'B_%B'
• Illustrates use of arithmetic expressions and string pattern matching: find triples (of ages of sailors and two fields defined by expressions) for sailors whose names begin and end with B and contain at least three characters.
• AS and = are two ways to name fields in the result.
• LIKE is used for string matching. `_' stands for any one character and `%' stands for 0 or more arbitrary characters.
String Operations
• SQL includes a string-matching operator for comparisons on character strings. The operator like uses patterns that are described using two special characters:
  – percent (%): the % character matches any substring.
  – underscore (_): the _ character matches any character.
• Find the names of all customers whose street includes the substring "Main":
  select customer_name
  from customer
  where customer_street like '%Main%'
• Match the string "Main%":
  like 'Main\%' escape '\'
• SQL supports a variety of string operations such as
  – concatenation (using "||")
  – converting from upper to lower case (and vice versa)
  – finding string length, extracting substrings, etc.
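LIKE patterns, || concatenation, and case conversion all work in sqlite3; the customer rows below are made-up sample data:

```python
# Substring matching with LIKE, plus || and upper().
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (customer_name TEXT, customer_street TEXT);
INSERT INTO customer VALUES ('Hayes', '3 Main St');
INSERT INTO customer VALUES ('Jones', '12 Alma St');
""")
names = cur.execute(
    "SELECT customer_name FROM customer WHERE customer_street LIKE '%Main%'"
).fetchall()
combo = cur.execute(
    "SELECT upper(customer_name || ' ' || customer_street) FROM customer "
    "WHERE customer_name = 'Hayes'"
).fetchone()[0]
print(names, combo)  # [('Hayes',)] HAYES 3 MAIN ST
```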
Ordering the Display of Tuples
• List in alphabetic order the names of all customers having a loan at the Perryridge branch:
  select distinct customer_name
  from borrower, loan
  where borrower.loan_number = loan.loan_number and branch_name = 'Perryridge'
  order by customer_name
• We may specify desc for descending order or asc for ascending order, for each attribute; ascending order is the default.
  – Example: order by customer_name desc
Duplicates
• In relations with duplicates, SQL can define how many copies of tuples appear in the result.
• Multiset versions of some of the relational algebra operators – given multiset relations r1 and r2:
  1. σ_θ(r1): if there are c1 copies of tuple t1 in r1, and t1 satisfies selection σ_θ, then there are c1 copies of t1 in σ_θ(r1).
  2. Π_A(r1): for each copy of tuple t1 in r1, there is a copy of tuple Π_A(t1) in Π_A(r1), where Π_A(t1) denotes the projection of the single tuple t1.
  3. r1 × r2: if there are c1 copies of tuple t1 in r1 and c2 copies of tuple t2 in r2, there are c1 × c2 copies of the tuple t1.t2 in r1 × r2.
• Example: suppose multiset relations r1(A, B) and r2(C) are as follows:
  r1 = {(1, a), (2, a)}    r2 = {(2), (3), (3)}
  Then Π_B(r1) would be {(a), (a)}, while Π_B(r1) × r2 would be {(a,2), (a,2), (a,3), (a,3), (a,3), (a,3)}.
• SQL duplicate semantics:
  select A1, A2, ..., An
  from r1, r2, ..., rm
  where P
  is equivalent to the multiset version of the expression
  Π_{A1, A2, ..., An}(σ_P(r1 × r2 × ... × rm))
Set Operations
• The set operations union, intersect, and except operate on relations and correspond to the relational algebra operations ∪, ∩, −.
• Each of the above operations automatically eliminates duplicates; to retain all duplicates use the corresponding multiset versions union all, intersect all and except all. Suppose a tuple occurs m times in r and n times in s; then it occurs:
  – m + n times in r union all s
  – min(m, n) times in r intersect all s
  – max(0, m − n) times in r except all s
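The three multiset counting rules map directly onto Python's collections.Counter arithmetic, which makes them easy to check (the customer names below are made up):

```python
# m + n, min(m, n), and max(0, m - n) via Counter operators.
from collections import Counter

r = Counter({"Hayes": 2, "Jones": 1})   # Hayes occurs m = 2 times in r
s = Counter({"Hayes": 1, "Smith": 3})   # Hayes occurs n = 1 time in s

union_all     = r + s   # each count is m + n
intersect_all = r & s   # each count is min(m, n)
except_all    = r - s   # each count is max(0, m - n)

print(union_all["Hayes"], intersect_all["Hayes"], except_all["Hayes"])  # 3 1 1
```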
• Find all customers who have a loan, an account, or both:
  (select customer_name from depositor)
  union
  (select customer_name from borrower)
• Find all customers who have both a loan and an account:
  (select customer_name from depositor)
  intersect
  (select customer_name from borrower)
• Find all customers who have an account but no loan:
  (select customer_name from depositor)
  except
  (select customer_name from borrower)
UNIT-VI
1. Transaction concept & state
2. Implementation of atomicity and durability
3. Serializability
4. Recoverability
5. Implementation of isolation
6. Lock based protocols
7. Lock based protocols
8. Timestamp based protocols
9. Validation based protocol
Transaction Concept
• A transaction is a unit of program execution that accesses and possibly updates various data items.
• E.g. a transaction to transfer $50 from account A to account B:
  1. read(A)
  2. A := A – 50
  3. write(A)
  4. read(B)
  5. B := B + 50
  6. write(B)
• Two main issues to deal with:
  – Failures of various kinds, such as hardware failures and system crashes
  – Concurrent execution of multiple transactions
Example of Fund Transfer
• Atomicity requirement
  – If the transaction fails after step 3 and before step 6, money will be "lost", leading to an inconsistent database state.
    • Failure could be due to software or hardware.
  – The system should ensure that updates of a partially executed transaction are not reflected in the database.
• Durability requirement — once the user has been notified that the transaction has completed (i.e., the transfer of the $50 has taken place), the updates to the database by the transaction must persist even if there are software or hardware failures.
• Consistency requirement in the above example:
  – The sum of A and B is unchanged by the execution of the transaction.
• In general, consistency requirements include:
  – Explicitly specified integrity constraints such as primary keys and foreign keys
  – Implicit integrity constraints
    • e.g. the sum of balances of all accounts, minus the sum of loan amounts, must equal the value of cash-in-hand
  – A transaction must see a consistent database.
  – During transaction execution the database may be temporarily inconsistent.
  – When the transaction completes successfully the database must be consistent.
    • Erroneous transaction logic can lead to inconsistency.
Isolation requirement — if between steps 3 and 6, another transaction T2 is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be). T1 T2
1.
read(A)
2.
A := A – 50
3.
write(A) read(A), read(B), print(A+B)
4.
read(B)
5.
B := B + 50
6.
write(B •
Isolation can be ensured trivially by running transactions serially –
•
that is, one after the other.
However, executing multiple transactions concurrently has significant benefits, as we will see later.
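The atomicity requirement for the fund transfer can be demonstrated with sqlite3: if a failure occurs between the debit and the credit, rolling back leaves neither write in the database. The balances are made-up sample data and the crash is simulated with an exception:

```python
# A failed transfer is rolled back, so the partial debit does not persist.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 50)")
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
    raise RuntimeError("crash between step 3 and step 6")  # simulated failure
    conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()
except RuntimeError:
    conn.rollback()   # undo the partial update

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # {'A': 100, 'B': 50}: the debit did not survive
```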
ACID Properties
To preserve the integrity of data, the database system must ensure:
• Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
• Consistency. Execution of a transaction in isolation preserves the consistency of the database.
• Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions.
  – That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished.
• Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
Transaction State
• Active – the initial state; the transaction stays in this state while it is executing.
• Partially committed – after the final statement has been executed.
• Failed – after the discovery that normal execution can no longer proceed.
• Aborted – after the transaction has been rolled back and the database restored to its state prior to the start of the transaction. Two options after it has been aborted:
  – restart the transaction
    • can be done only if there is no internal logical error
  – kill the transaction
• Committed – after successful completion.
Implementation of Atomicity and Durability
• The recovery-management component of a database system implements the support for atomicity and durability.
• E.g. the shadow-database scheme:
  – All updates are made on a shadow copy of the database.
  – db_pointer is made to point to the updated shadow copy after
    • the transaction reaches partial commit and
    • all updated pages have been flushed to disk.
• db_pointer always points to the current consistent copy of the database.
  – In case the transaction fails, the old consistent copy pointed to by db_pointer can be used, and the shadow copy can be deleted.
• The shadow-database scheme:
  – Assumes that only one transaction is active at a time.
  – Assumes disks do not fail.
  – Useful for text editors, but extremely inefficient for large databases (why?)
    • A variant called shadow paging reduces copying of data, but is still not practical for large databases.
  – Does not handle concurrent transactions.
• Will study better schemes in Chapter 17.
Concurrent Executions
• Multiple transactions are allowed to run concurrently in the system. Advantages are:
  – Increased processor and disk utilization, leading to better transaction throughput
    • e.g. one transaction can be using the CPU while another is reading from or writing to the disk
  – Reduced average response time for transactions: short transactions need not wait behind long ones.
• Concurrency control schemes – mechanisms to achieve isolation
  – that is, to control the interaction among the concurrent transactions in order to prevent them from destroying the consistency of the database
• Will study in Chapter 16, after studying the notion of correctness of concurrent executions.
Schedules
• Schedule – a sequence of instructions that specify the chronological order in which instructions of concurrent transactions are executed
  – a schedule for a set of transactions must consist of all instructions of those transactions
  – must preserve the order in which the instructions appear in each individual transaction
• A transaction that successfully completes its execution will have a commit instruction as the last statement
  – by default a transaction is assumed to execute a commit instruction as its last step
• A transaction that fails to successfully complete its execution will have an abort instruction as the last statement
Schedule 1 •
Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B.
•
A serial schedule in which T1 is followed by T2 :
Schedule 2 •
A serial schedule where T2 is followed by T1
Schedule 3 •
Let T1 and T2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1.
In Schedules 1, 2 and 3, the sum A + B is preserved.
Schedule 4 The following concurrent schedule does not preserve the value of (A + B ).
Serializability
• Basic assumption – each transaction preserves database consistency.
• Thus serial execution of a set of transactions preserves database consistency.
• A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:
  1. conflict serializability
  2. view serializability
• Simplified view of transactions:
  – We ignore operations other than read and write instructions.
  – We assume that transactions may perform arbitrary computations on data in local buffers in between reads and writes.
  – Our simplified schedules consist of only read and write instructions.
Conflicting Instructions
• Instructions li and lj of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q:
1. li = read(Q), lj = read(Q): li and lj don’t conflict.
2. li = read(Q), lj = write(Q): they conflict.
3. li = write(Q), lj = read(Q): they conflict.
4. li = write(Q), lj = write(Q): they conflict.
• Intuitively, a conflict between li and lj forces a (logical) temporal order between them.
– If li and lj are consecutive in a schedule and they do not conflict, their results would remain the same even if they had been interchanged in the schedule.
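The four cases above reduce to one rule — different transactions, same item, at least one write. A minimal sketch (the tuple representation is an assumption, not from the notes):

```python
# Sketch: deciding whether two instructions conflict.
# An instruction is modeled as (transaction, action, item).

def conflicts(op1, op2):
    """Two operations conflict iff they come from different transactions,
    touch the same item, and at least one of them is a write."""
    t1, a1, q1 = op1
    t2, a2, q2 = op2
    return t1 != t2 and q1 == q2 and "write" in (a1, a2)

# The four cases from the text:
assert not conflicts(("T1", "read", "Q"), ("T2", "read", "Q"))   # read-read: no conflict
assert conflicts(("T1", "read", "Q"), ("T2", "write", "Q"))      # read-write: conflict
assert conflicts(("T1", "write", "Q"), ("T2", "read", "Q"))      # write-read: conflict
assert conflicts(("T1", "write", "Q"), ("T2", "write", "Q"))     # write-write: conflict
```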
Conflict Serializability
• If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S and S´ are conflict equivalent.
• We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
View Serializability
• Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met for each data item Q:
1. If in schedule S transaction Ti reads the initial value of Q, then in schedule S´ transaction Ti must also read the initial value of Q.
2. If in schedule S transaction Ti executes read(Q), and that value was produced by transaction Tj (if any), then in schedule S´ transaction Ti must read the value of Q that was produced by the same write(Q) operation of transaction Tj.
3. The transaction (if any) that performs the final write(Q) operation in schedule S must also perform the final write(Q) operation in schedule S´.
• As can be seen, view equivalence is based purely on reads and writes alone.
• A schedule S is view serializable if it is view equivalent to a serial schedule.
• Every conflict serializable schedule is also view serializable.
• Below is a schedule which is view serializable but not conflict serializable.
• What serial schedule is the above equivalent to?
• Every view serializable schedule that is not conflict serializable has blind writes.
Other Notions of Serializability
• The schedule below produces the same outcome as the serial schedule <T1, T5>, yet is not conflict equivalent or view equivalent to it.
• Determining such equivalence requires analysis of operations other than read and write.
Recoverable Schedules
• Need to address the effect of transaction failures on concurrently running transactions.
• Recoverable schedule — if a transaction Tj reads a data item previously written by a transaction Ti, then the commit operation of Ti must appear before the commit operation of Tj.
• The following schedule (Schedule 11) is not recoverable if T9 commits immediately after the read.
• If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent database state. Hence, the database must ensure that schedules are recoverable.
Cascading Rollbacks
• Cascading rollback – a single transaction failure leads to a series of transaction rollbacks.
• Consider the following schedule, where none of the transactions has yet committed (so the schedule is recoverable):
• If T10 fails, T11 and T12 must also be rolled back.
• This can lead to the undoing of a significant amount of work.
Cascadeless Schedules
• Cascadeless schedules — cascading rollbacks cannot occur; for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
• Every cascadeless schedule is also recoverable.
• It is desirable to restrict the schedules to those that are cascadeless.
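The cascadeless condition can be checked by a single pass over a schedule. A minimal sketch, assuming a schedule is represented as a list of (transaction, action, item) events (this representation is not from the notes):

```python
# Sketch: checking the cascadeless property of a schedule.
# A schedule is a list of (txn, action, item) events, where action is
# "read", "write", "commit", or "abort" (item is None for commit/abort).

def is_cascadeless(schedule):
    """Every read must see only committed data: if Tj reads Q last written
    by Ti (i != j), then Ti must have committed before that read."""
    last_writer = {}   # item -> transaction that wrote it last
    committed = set()
    for txn, action, item in schedule:
        if action == "commit":
            committed.add(txn)
        elif action == "write":
            last_writer[item] = txn
        elif action == "read":
            w = last_writer.get(item)
            if w is not None and w != txn and w not in committed:
                return False   # dirty read -> cascading rollback possible
    return True

# T11 reads A before T10 (its writer) commits: not cascadeless
s = [("T10", "write", "A"), ("T11", "read", "A"),
     ("T10", "commit", None), ("T11", "commit", None)]
assert not is_cascadeless(s)

# Delay T11's read until after T10's commit: cascadeless
s2 = [("T10", "write", "A"), ("T10", "commit", None),
      ("T11", "read", "A"), ("T11", "commit", None)]
assert is_cascadeless(s2)
```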
Concurrency Control
• A database must provide a mechanism that will ensure that all possible schedules are
– either conflict or view serializable, and
– recoverable, and preferably cascadeless
• A policy in which only one transaction can execute at a time generates serial schedules, but provides a poor degree of concurrency
– Are serial schedules recoverable/cascadeless?
• Testing a schedule for serializability after it has executed is a little too late!
• Goal – to develop concurrency control protocols that will assure serializability.
Concurrency Control vs. Serializability Tests
• Concurrency-control protocols allow concurrent schedules, but ensure that the schedules are conflict/view serializable, and are recoverable and cascadeless.
• Concurrency control protocols generally do not examine the precedence graph as it is being created
– Instead, a protocol imposes a discipline that avoids nonserializable schedules.
– We study such protocols in Chapter 16.
• Different concurrency control protocols provide different tradeoffs between the amount of concurrency they allow and the amount of overhead that they incur.
• Tests for serializability help us understand why a concurrency control protocol is correct.
Weak Levels of Consistency
• Some applications are willing to live with weak levels of consistency, allowing schedules that are not serializable
– E.g. a read-only transaction that wants to get an approximate total balance of all accounts
– E.g. database statistics computed for query optimization can be approximate (why?)
– Such transactions need not be serializable with respect to other transactions
• Tradeoff: accuracy for performance
Levels of Consistency in SQL-92
• Serializable — the default.
• Repeatable read — only committed records may be read, and repeated reads of the same record must return the same value. However, a transaction may not be serializable – it may find some records inserted by another transaction but not find others.
• Read committed — only committed records can be read, but successive reads of a record may return different (but committed) values.
• Read uncommitted — even uncommitted records may be read.
• Lower degrees of consistency are useful for gathering approximate information about the database.
• Warning: some database systems do not ensure serializable schedules by default
– E.g. Oracle and PostgreSQL by default support a level of consistency called snapshot isolation (not part of the SQL standard)
Transaction Definition in SQL
• The data manipulation language must include a construct for specifying the set of actions that comprise a transaction.
• In SQL, a transaction begins implicitly.
• A transaction in SQL ends by:
– Commit work, which commits the current transaction and begins a new one.
– Rollback work, which causes the current transaction to abort.
• In almost all database systems, by default, every SQL statement also commits implicitly if it executes successfully
– Implicit commit can be turned off by a database directive
• E.g. in JDBC: connection.setAutoCommit(false);
Implementation of Isolation
• Schedules must be conflict or view serializable, and recoverable, for the sake of database consistency, and preferably cascadeless.
• A policy in which only one transaction can execute at a time generates serial schedules, but provides a poor degree of concurrency.
• Concurrency-control schemes trade off the amount of concurrency they allow against the amount of overhead that they incur.
• Some schemes allow only conflict-serializable schedules to be generated, while others allow view-serializable schedules that are not conflict-serializable.
Figure 15.6
Testing for Serializability
• Consider some schedule of a set of transactions T1, T2, ..., Tn.
• Precedence graph — a directed graph where the vertices are the transactions (names).
• We draw an arc from Ti to Tj if the two transactions conflict and Ti accessed the data item on which the conflict arose earlier.
• We may label the arc by the item that was accessed.
• Example:
Example Schedule (Schedule A) + Precedence Graph
[Figure: a schedule over transactions T1–T5 interleaving read/write operations on items U, V, W, X, Y, Z, together with the resulting precedence graph.]
Test for Conflict Serializability
• A schedule is conflict serializable if and only if its precedence graph is acyclic.
• Cycle-detection algorithms exist which take order n² time, where n is the number of vertices in the graph.
– (Better algorithms take order n + e, where e is the number of edges.)
• If the precedence graph is acyclic, the serializability order can be obtained by a topological sort of the graph.
– This is a linear order consistent with the partial order of the graph.
– For example, a serializability order for Schedule A would be T5 → T1 → T3 → T2 → T4
• Are there others?
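The test above can be sketched directly: build the precedence graph from the conflicts, then run a topological sort, which succeeds exactly when the graph is acyclic. A minimal sketch (the schedule representation is an assumption):

```python
# Sketch: precedence-graph test for conflict serializability.
# A schedule is a list of (txn, action, item) with action "read" or "write".

def precedence_graph(schedule):
    edges = set()
    for i, (ti, ai, qi) in enumerate(schedule):
        for tj, aj, qj in schedule[i + 1:]:
            if ti != tj and qi == qj and "write" in (ai, aj):
                edges.add((ti, tj))   # Ti accessed the conflicting item first
    return edges

def is_conflict_serializable(schedule):
    """Returns (acyclic?, one serializability order) via Kahn's algorithm."""
    edges = precedence_graph(schedule)
    nodes = {t for t, _, _ in schedule}
    order, remaining = [], set(edges)
    while nodes:
        free = {n for n in nodes if all(dst != n for _, dst in remaining)}
        if not free:
            return False, []          # cycle -> not conflict serializable
        order.extend(sorted(free))
        nodes -= free
        remaining = {(a, b) for a, b in remaining if a not in free}
    return True, order

# A Schedule-3-style interleaving of T1 and T2 on items A and B:
s = [("T1", "read", "A"), ("T1", "write", "A"),
     ("T2", "read", "A"), ("T2", "write", "A"),
     ("T1", "read", "B"), ("T1", "write", "B"),
     ("T2", "read", "B"), ("T2", "write", "B")]
ok, order = is_conflict_serializable(s)
assert ok and order == ["T1", "T2"]   # equivalent to the serial schedule T1, T2
```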
Test for View Serializability
• The precedence graph test for conflict serializability cannot be used directly to test for view serializability.
– An extension of the test for view serializability has cost exponential in the size of the precedence graph.
• The problem of checking if a schedule is view serializable falls in the class of NP-complete problems.
– Thus the existence of an efficient algorithm is extremely unlikely.
• However, practical algorithms that just check some sufficient conditions for view serializability can still be used.
Lock-Based Protocols
• A lock is a mechanism to control concurrent access to a data item.
• Data items can be locked in two modes:
1. exclusive (X) mode: the data item can be both read and written. An X-lock is requested using the lock-X instruction.
2. shared (S) mode: the data item can only be read. An S-lock is requested using the lock-S instruction.
• Lock requests are made to the concurrency-control manager. A transaction can proceed only after the request is granted.
• Lock-compatibility matrix:

          S    X
     S    ✓    ×
     X    ×    ×
• A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
• Any number of transactions can hold shared locks on an item,
– but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.
• If a lock cannot be granted, the requesting transaction is made to wait until all incompatible locks held by other transactions have been released. The lock is then granted.
• Example of a transaction performing locking:
     T2: lock-S(A); read(A); unlock(A); lock-S(B); read(B); unlock(B); display(A+B)
• Locking as above is not sufficient to guarantee serializability — if A and B get updated in between the read of A and the read of B, the displayed sum would be wrong.
• A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.
Pitfalls of Lock-Based Protocols
• Consider the partial schedule:
• Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
• Such a situation is called a deadlock.
– To handle a deadlock, one of T3 or T4 must be rolled back and its locks released.
• The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.
• Starvation is also possible if the concurrency control manager is badly designed. For example:
– A transaction may be waiting for an X-lock on an item while a sequence of other transactions request and are granted an S-lock on the same item.
– The same transaction is repeatedly rolled back due to deadlocks.
• The concurrency control manager can be designed to prevent starvation.
The Two-Phase Locking Protocol
• This is a protocol which ensures conflict-serializable schedules.
• Phase 1: Growing Phase
– the transaction may obtain locks
– the transaction may not release locks
• Phase 2: Shrinking Phase
– the transaction may release locks
– the transaction may not obtain locks
• The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock).
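The growing/shrinking discipline can be checked mechanically from a transaction's lock trace. A minimal sketch (the trace representation is an assumption, not from the notes):

```python
# Sketch: checking that one transaction's lock/unlock trace obeys
# two-phase locking.

def is_two_phase(trace):
    """trace: list of ("lock", item) / ("unlock", item) events for one
    transaction. 2PL: once any lock is released (shrinking phase begins),
    no further lock may be acquired."""
    shrinking = False
    for action, _item in trace:
        if action == "unlock":
            shrinking = True
        elif action == "lock" and shrinking:
            return False   # tried to grow after shrinking began
    return True

# Grows then shrinks: obeys 2PL
assert is_two_phase([("lock", "A"), ("lock", "B"),
                     ("unlock", "A"), ("unlock", "B")])
# Acquires B after releasing A: violates 2PL
assert not is_two_phase([("lock", "A"), ("unlock", "A"), ("lock", "B")])
```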
• Two-phase locking does not ensure freedom from deadlocks.
• Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking: here a transaction must hold all its exclusive locks till it commits/aborts.
• Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.
• There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.
• However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense: given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
Lock Conversions
• Two-phase locking with lock conversions:
– First Phase:
• can acquire a lock-S on an item
• can acquire a lock-X on an item
• can convert a lock-S to a lock-X (upgrade)
– Second Phase:
• can release a lock-S
• can release a lock-X
• can convert a lock-X to a lock-S (downgrade)
• This protocol assures serializability, but still relies on the programmer to insert the various locking instructions.
Automatic Acquisition of Locks
• A transaction Ti issues the standard read/write instruction, without explicit locking calls.
• The operation read(D) is processed as:
     if Ti has a lock on D
        then read(D)
        else begin
           if necessary, wait until no other transaction has a lock-X on D;
           grant Ti a lock-S on D;
           read(D)
        end
• write(D) is processed as:
     if Ti has a lock-X on D
        then write(D)
        else begin
           if necessary, wait until no other transaction has any lock on D;
           if Ti has a lock-S on D
              then upgrade the lock on D to lock-X
              else grant Ti a lock-X on D;
           write(D)
        end
• All locks are released after commit or abort.
Implementation of Locking
• A lock manager can be implemented as a separate process to which transactions send lock and unlock requests.
• The lock manager replies to a lock request by sending a lock grant message (or a message asking the transaction to roll back, in case of a deadlock).
• The requesting transaction waits until its request is answered.
• The lock manager maintains a data structure called a lock table to record granted locks and pending requests.
• The lock table is usually implemented as an in-memory hash table indexed on the name of the data item being locked.
Lock Table
• Black rectangles indicate granted locks, white ones indicate waiting requests.
• The lock table also records the type of lock granted or requested.
• A new request is added to the end of the queue of requests for the data item, and granted if it is compatible with all earlier locks.
• Unlock requests result in the request being deleted, and later requests are checked to see if they can now be granted.
• If a transaction aborts, all waiting or granted requests of the transaction are deleted.
– The lock manager may keep a list of locks held by each transaction, to implement this efficiently.
Graph-Based Protocols
• Graph-based protocols are an alternative to two-phase locking.
• Impose a partial ordering → on the set D = {d1, d2, ..., dh} of all data items.
– If di → dj, then any transaction accessing both di and dj must access di before accessing dj.
– This implies that the set D may now be viewed as a directed acyclic graph, called a database graph.
• The tree protocol is a simple kind of graph protocol.
Tree Protocol
1. Only exclusive locks are allowed.
2. The first lock by Ti may be on any data item. Subsequently, a data item Q can be locked by Ti only if the parent of Q is currently locked by Ti.
3. Data items may be unlocked at any time.
4. A data item that has been locked and unlocked by Ti cannot subsequently be relocked by Ti.
Timestamp-Based Protocols
• Each transaction is issued a timestamp when it enters the system. If an old transaction Ti has time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that TS(Ti) < TS(Tj).
• The protocol manages concurrent execution such that the time-stamps determine the serializability order.
• In order to assure such behavior, the protocol maintains for each data item Q two timestamp values:
– W-timestamp(Q) is the largest time-stamp of any transaction that executed write(Q) successfully.
– R-timestamp(Q) is the largest time-stamp of any transaction that executed read(Q) successfully.
• The timestamp-ordering protocol ensures that any conflicting read and write operations are executed in timestamp order.
• Suppose a transaction Ti issues read(Q):
1. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already overwritten. Hence, the read operation is rejected, and Ti is rolled back.
2. If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q) is set to max(R-timestamp(Q), TS(Ti)).
• Suppose that transaction Ti issues write(Q):
1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed previously, and the system assumed that that value would never be produced. Hence, the write operation is rejected, and Ti is rolled back.
2. If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, this write operation is rejected, and Ti is rolled back.
3. Otherwise, the write operation is executed, and W-timestamp(Q) is set to TS(Ti).
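The read/write tests above can be sketched as a small class. This is a minimal illustration (class and exception names are assumed; rollback is modeled as an exception):

```python
# Sketch: the read/write tests of the timestamp-ordering protocol.
# Timestamps are plain integers; a rejected operation raises Rollback.

class Rollback(Exception):
    pass

class TimestampedItem:
    def __init__(self):
        self.r_ts = 0   # R-timestamp(Q): largest TS of any successful read
        self.w_ts = 0   # W-timestamp(Q): largest TS of any successful write

    def read(self, ts):
        if ts < self.w_ts:                    # value already overwritten
            raise Rollback
        self.r_ts = max(self.r_ts, ts)

    def write(self, ts):
        if ts < self.r_ts or ts < self.w_ts:  # too-late or obsolete write
            raise Rollback
        self.w_ts = ts

q = TimestampedItem()
q.write(2)          # a transaction with TS = 2 writes Q
try:
    q.read(1)       # an older transaction (TS = 1) tries to read Q
    rolled_back = False
except Rollback:
    rolled_back = True
assert rolled_back  # the older reader is rolled back, as the protocol demands
```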
Example Use of the Protocol
[Figure: a partial schedule for several data items, for transactions with timestamps 1, 2, 3, 4, 5; interleaved read/write operations on X, Y, Z cause two of the transactions to abort.]
Correctness of the Timestamp-Ordering Protocol
• The timestamp-ordering protocol guarantees serializability, since all arcs in the precedence graph point from a transaction with a smaller timestamp to one with a larger timestamp. Thus, there will be no cycles in the precedence graph.
• The timestamp protocol ensures freedom from deadlock, as no transaction ever waits.
• But the schedule may not be cascade-free, and may not even be recoverable.
Thomas’ Write Rule
• A modified version of the timestamp-ordering protocol in which obsolete write operations may be ignored under certain circumstances.
• When Ti attempts to write data item Q, if TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q.
– Rather than rolling back Ti as the timestamp-ordering protocol would have done, this write operation can be ignored.
• Otherwise this protocol is the same as the timestamp-ordering protocol.
• Thomas’ Write Rule allows greater potential concurrency.
– It allows some view-serializable schedules that are not conflict-serializable.
Validation-Based Protocol
• Execution of transaction Ti is done in three phases:
1. Read and execution phase: transaction Ti writes only to temporary local variables.
2. Validation phase: transaction Ti performs a “validation test” to determine if the local variables can be written without violating serializability.
3. Write phase: if Ti is validated, the updates are applied to the database; otherwise, Ti is rolled back.
• The three phases of concurrently executing transactions can be interleaved, but each transaction must go through the three phases in that order.
– Assume for simplicity that the validation and write phases occur together, atomically and serially
• i.e., only one transaction executes validation/write at a time.
• Also called optimistic concurrency control, since the transaction executes fully in the hope that all will go well during validation.
• Each transaction Ti has three timestamps:
– Start(Ti): the time when Ti started its execution
– Validation(Ti): the time when Ti entered its validation phase
– Finish(Ti): the time when Ti finished its write phase
• The serializability order is determined by the timestamp given at validation time, to increase concurrency.
– Thus TS(Ti) is given the value of Validation(Ti).
• This protocol is useful and gives a greater degree of concurrency if the probability of conflicts is low,
– because the serializability order is not pre-decided, and
– relatively few transactions will have to be rolled back.
Validation Test for Transaction Tj
• If for all Ti with TS(Ti) < TS(Tj), one of the following conditions holds:
– finish(Ti) < start(Tj)
– start(Tj) < finish(Ti) < validation(Tj), and the set of data items written by Ti does not intersect with the set of data items read by Tj,
then validation succeeds and Tj can be committed. Otherwise, validation fails and Tj is aborted.
• Justification: either the first condition is satisfied, and there is no overlapped execution, or the second condition is satisfied and
– the writes of Tj do not affect reads of Ti, since they occur after Ti has finished its reads.
– the writes of Ti do not affect reads of Tj, since Tj does not read any item written by Ti.
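The two conditions of the validation test translate directly into code. A minimal sketch, with each transaction summarized by its start/validation/finish times and read/write sets (this representation is assumed, not from the notes):

```python
# Sketch: the validation test for one earlier transaction Ti against Tj.

def validates_against(ti, tj):
    """True if Ti (with TS(Ti) < TS(Tj)) does not force Tj to abort."""
    if ti["finish"] < tj["start"]:
        return True                           # condition 1: no overlap at all
    if ti["finish"] < tj["validation"]:
        # condition 2: overlapped, but Ti's writes must miss Tj's reads
        return not (ti["writes"] & tj["reads"])
    return False

t1 = {"start": 1, "validation": 3, "finish": 4, "reads": {"B"}, "writes": {"B"}}
t2 = {"start": 2, "validation": 5, "finish": 6, "reads": {"A", "B"}, "writes": set()}
assert not validates_against(t1, t2)   # T1 wrote B, which T2 read during the overlap

t3 = {"start": 5, "validation": 7, "finish": 8, "reads": {"A"}, "writes": set()}
assert validates_against(t1, t3)       # T1 finished before T3 started: no overlap
```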
Schedule Produced by Validation
• Example of a schedule produced using validation:
[Figure: one transaction reads B and A and displays A+B after validating; the other computes B := B − 50 and A := A + 50 locally, then validates and writes B and A.]
Multiple Granularity
• Allow data items to be of various sizes and define a hierarchy of data granularities, where the small granularities are nested within larger ones.
• This can be represented graphically as a tree (but don’t confuse it with the tree-locking protocol).
• When a transaction locks a node in the tree explicitly, it implicitly locks all the node’s descendants in the same mode.
• Granularity of locking (level in the tree where locking is done):
– fine granularity (lower in the tree): high concurrency, high locking overhead
– coarse granularity (higher in the tree): low locking overhead, low concurrency
Example of Granularity Hierarchy
• The levels, starting from the coarsest (top) level, are:
– database
– area
– file
– record
Intention Lock Modes
• In addition to the S and X lock modes, there are three additional lock modes with multiple granularity:
– intention-shared (IS): indicates explicit locking at a lower level of the tree, but only with shared locks.
– intention-exclusive (IX): indicates explicit locking at a lower level with exclusive or shared locks.
– shared and intention-exclusive (SIX): the subtree rooted at that node is locked explicitly in shared mode, and explicit locking is being done at a lower level with exclusive-mode locks.
• Intention locks allow a higher-level node to be locked in S or X mode without having to check all descendant nodes.
Compatibility Matrix with Intention Lock Modes
• The compatibility matrix for all lock modes is:

          IS    IX    S     SIX   X
    IS    ✓     ✓     ✓     ✓     ×
    IX    ✓     ✓     ×     ×     ×
    S     ✓     ×     ✓     ×     ×
    SIX   ✓     ×     ×     ×     ×
    X     ×     ×     ×     ×     ×
Multiple Granularity Locking Scheme
• Transaction Ti can lock a node Q using the following rules:
1. The lock compatibility matrix must be observed.
2. The root of the tree must be locked first, and may be locked in any mode.
3. A node Q can be locked by Ti in S or IS mode only if the parent of Q is currently locked by Ti in either IX or IS mode.
4. A node Q can be locked by Ti in X, SIX, or IX mode only if the parent of Q is currently locked by Ti in either IX or SIX mode.
5. Ti can lock a node only if it has not previously unlocked any node (that is, Ti is two-phase).
6. Ti can unlock a node Q only if none of the children of Q are currently locked by Ti.
• Observe that locks are acquired in root-to-leaf order, whereas they are released in leaf-to-root order.
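The compatibility matrix above is small enough to encode as a lookup table; granting a lock then amounts to checking the requested mode against every mode already held. A minimal sketch:

```python
# Sketch: the intention-lock compatibility matrix as a lookup table.

COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def can_grant(requested, held_modes):
    """A lock is granted only if it is compatible with every lock already held."""
    return all(COMPAT[requested][h] for h in held_modes)

assert can_grant("IX", ["IS", "IX"])   # intention modes coexist freely
assert can_grant("S", ["IS", "S"])     # S is compatible with IS and S
assert not can_grant("X", ["IS"])      # X is incompatible with everything
assert not can_grant("SIX", ["IX"])    # SIX conflicts with IX
```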
UNIT-VIII
1. Data on external storage & file organization and indexing
2. Index data structures
3. Comparison of file organizations
4. Comparison of file organizations
5. Indexes and performance tuning
6. Indexes and performance tuning
7. Intuition for tree indexes & ISAM
8. B+ tree
Data on External Storage
• Disks: can retrieve a random page at fixed cost
– But reading several consecutive pages is much cheaper than reading them in random order
• Tapes: can only read pages in sequence
– Cheaper than disks; used for archival storage
• File organization: method of arranging a file of records on external storage.
– A record id (rid) is sufficient to physically locate a record
– Indexes are data structures that allow us to find the record ids of records with given values in index search key fields
• Architecture: the buffer manager stages pages from external storage to the main memory buffer pool. The file and index layers make calls to the buffer manager.
Alternative File Organizations
• Many alternatives exist, each ideal for some situations and not so good in others:
– Heap (random order) files: suitable when typical access is a file scan retrieving all records.
– Sorted files: best if records must be retrieved in some order, or only a `range’ of records is needed.
– Indexes: data structures to organize records via trees or hashing.
• Like sorted files, they speed up searches for a subset of records, based on values in certain (“search key”) fields
• Updates are much faster than in sorted files.
Index Classification
• Primary vs. secondary: if the search key contains the primary key, the index is called a primary index.
– Unique index: the search key contains a candidate key.
• Clustered vs. unclustered: if the order of data records is the same as, or `close to’, the order of data entries, the index is called a clustered index.
– Alternative 1 implies clustered; in practice, clustered also implies Alternative 1 (since sorted files are rare).
– A file can be clustered on at most one search key.
– The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!
Clustered vs. Unclustered Index
• Suppose that Alternative (2) is used for data entries, and that the data records are stored in a heap file.
– To build a clustered index, first sort the heap file (with some free space on each page for future inserts).
– Overflow pages may be needed for inserts. (Thus, the order of data records is `close to’, but not identical to, the sort order.)
[Figure: in a clustered index, index entries direct the search to data entries whose order matches the data records in the data file; in an unclustered index, the data entries point to records scattered across the data file.]
Indexes
• An index on a file speeds up selections on the search key fields for the index.
– Any subset of the fields of a relation can be the search key for an index on the relation.
– The search key is not the same as a key (a minimal set of fields that uniquely identify a record in a relation).
• An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
– Given a data entry k*, we can find the record with key k in at most one disk I/O. (Details soon …)
B+ Tree Indexes
• Leaf pages contain data entries, sorted by search key, and are chained (prev & next).
• Non-leaf pages contain index entries; they are only used to direct searches.
Example B+ Tree
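In place of the missing example figure, the leaf-level behavior of a B+ tree can be illustrated with binary search over sorted data entries; equality lookups land on one entry, and range lookups return a contiguous run, as along the leaf chain. A minimal sketch (keys and rids are made up):

```python
# Sketch: leaf-level search over sorted (key, rid) data entries, the way a
# B+ tree's sorted, chained leaf pages support equality and range lookups.
import bisect

entries = sorted([(5, "r5"), (13, "r13"), (17, "r17"), (24, "r24"), (30, "r30")])
keys = [k for k, _ in entries]

def equality(k):
    """Binary search for a single key; None if absent."""
    i = bisect.bisect_left(keys, k)
    return entries[i][1] if i < len(keys) and keys[i] == k else None

def range_search(lo, hi):
    """All entries with lo <= key <= hi: one contiguous slice."""
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return entries[i:j]

assert equality(17) == "r17"
assert equality(18) is None
assert range_search(13, 24) == [(13, "r13"), (17, "r17"), (24, "r24")]
```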
Hash-Based Indexes
[Figure: a hash-based index — a hash function maps search key values to buckets of data entries.]
Alternatives for Data Entry k* in Index
• In a data entry k* we can store:
– the actual data record with key value k, or
– <k, rid of the data record with search key value k>, or
– <k, list of rids of data records with search key value k>
• The choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k.
– Examples of indexing techniques: B+ trees, hash-based structures
– Typically, the index contains auxiliary information that directs searches to the desired data entries
• Alternatives 2 and 3:
– Data entries are typically much smaller than data records. So, better than Alternative 1 with large data records, especially if search keys are small. (The portion of the index structure used to direct the search, which depends on the size of the data entries, is much smaller than with Alternative 1.)
– Alternative 3 is more compact than Alternative 2, but leads to variable-sized data entries even if search keys are of fixed length.
Cost Model for Our Analysis
• We ignore CPU costs, for simplicity:
– B: the number of data pages
– R: number of records per page
– D: (average) time to read or write a disk page
– Measuring the number of page I/O’s ignores the gains of pre-fetching a sequence of pages; thus, even I/O cost is only approximated.
– Average-case analysis; based on several simplistic assumptions.
Comparing File Organizations
• Heap files (random order; insert at eof)
• Sorted files, sorted on <age, sal>
• Clustered B+ tree file, Alternative (1), search key <age, sal>
• Heap file with unclustered B+ tree index on search key <age, sal>
• Heap file with unclustered hash index on search key <age, sal>
Operations to Compare
• Scan: fetch all records from disk
• Equality search
• Range selection
• Insert a record
• Delete a record
Assumptions in Our Analysis
• Heap files:
– Equality selection on key; exactly one match.
• Sorted files:
– Files compacted after deletions.
• Indexes:
– Alt (2), (3): data entry size = 10% of record size
– Hash: no overflow buckets.
• 80% page occupancy => file size = 1.25 × data size
– Tree: 67% occupancy (this is typical).
• Implies file size = 1.5 × data size
• Scans:
– Leaf levels of a tree index are chained.
– Index data entries plus the actual file are scanned for unclustered indexes.
• Range searches:
– We use tree indexes to restrict the set of data records fetched, but ignore hash indexes.
Cost of Operations

                        (a) Scan         (b) Equality        (c) Range                            (d) Insert     (e) Delete
(1) Heap                BD               0.5BD               BD                                   2D             Search + D
(2) Sorted              BD               D log2 B            D(log2 B + # pgs w. match recs)      Search + BD    Search + BD
(3) Clustered           1.5BD            D logF 1.5B         D(logF 1.5B + # pgs w. match recs)   Search + D     Search + D
(4) Unclust. tree idx   BD(R + 0.15)     D(1 + logF 0.15B)   D(logF 0.15B + # pgs w. match recs)  Search + 2D    Search + 2D
(5) Unclust. hash idx   BD(R + 0.125)    2D                  BD                                   Search + 2D    Search + 2D
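Plugging sample numbers into these formulas makes the ordering concrete. A minimal sketch — B = 1000 pages, R = 100 records/page, D = 15 ms per page I/O, and tree fan-out F = 100 are all assumed values, not from the notes:

```python
# Sketch: sample values in the cost formulas above.
# Assumed: B = 1000 data pages, R = 100 records/page, D = 15 ms per page I/O,
# F = 100 (tree fan-out).
from math import log

B, R, D, F = 1000, 100, 15, 100   # D in milliseconds

heap_scan = B * D                          # BD
heap_equality = 0.5 * B * D                # 0.5BD
sorted_equality = D * log(B, 2)            # D log2 B
clustered_equality = D * log(1.5 * B, F)   # D logF 1.5B

assert heap_scan == 15000                    # i.e. 15 seconds of page I/O
assert heap_equality == 7500.0
assert sorted_equality < heap_equality       # binary search beats half a scan
assert clustered_equality < sorted_equality  # fan-out 100 beats fan-out 2
```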
Understanding the Workload
• For each query in the workload:
– Which relations does it access?
– Which attributes are retrieved?
– Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
• For each update in the workload:
– Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
– The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.
Choice of Indexes
• What indexes should we create?
– Which relations should have indexes? What field(s) should be the search key? Should we build several indexes?
• For each index, what kind of index should it be?
– Clustered? Hash/tree?
• One approach: consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.
– Obviously, this implies that we must understand how a DBMS evaluates queries and creates query evaluation plans!
– For now, we discuss simple 1-table queries.
• Before creating an index, we must also consider the impact on updates in the workload!
– Trade-off: indexes can make queries go faster, but updates slower. They require disk space, too.
Index Selection Guidelines
• Attributes in WHERE clause are candidates for index keys.
– Exact match condition suggests hash index.
– Range query suggests tree index.
• Clustering is especially useful for range queries; can also help on equality queries if there are many duplicates.
• Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
– Order of attributes is important for range queries.
– Such indexes can sometimes enable index-only strategies for important queries.
• For index-only strategies, clustering is not important!
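The "see if a better plan is possible with an additional index" approach can be tried interactively. A minimal sketch using SQLite's EXPLAIN QUERY PLAN, mirroring the Emp examples used later in these notes (SQLite builds B-tree indexes only; the data values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Emp (eid INTEGER, dno INTEGER, age INTEGER, sal REAL)")
con.executemany("INSERT INTO Emp VALUES (?,?,?,?)",
                [(i, i % 10, 20 + i % 40, 1000.0 + i) for i in range(1000)])

query = "SELECT dno FROM Emp WHERE age > 40"
before = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
con.execute("CREATE INDEX emp_age ON Emp(age)")   # candidate index on the WHERE attribute
after = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)  # full scan of Emp
print(after)   # range search using emp_age
```

Comparing the two plans shows the optimizer switching from a full scan to an index search once the candidate index exists; whether that is actually faster still depends on the selectivity of `age > 40`.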
Examples of Clustered Indexes

SELECT E.dno FROM Emp E WHERE E.age>40

SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age>10 GROUP BY E.dno

SELECT E.dno FROM Emp E WHERE E.hobby='Stamps'

• B+ tree index on E.age can be used to get qualifying tuples.
– How selective is the condition?
– Is the index clustered?
• Consider the GROUP BY query.
– If many tuples have E.age > 10, using the E.age index and sorting the retrieved tuples may be costly.
– Clustered E.dno index may be better!
• Equality queries and duplicates:
– Clustering on E.hobby helps!
Indexes with Composite Search Keys
• Composite Search Keys: Search on a combination of fields.
– Equality query: Every field value is equal to a constant value. E.g., wrt <age, sal> index: age=20 and sal=75.
– Range query: Some field value is not a constant. E.g.: age=20; or age=20 and sal > 10.
• Data entries in index sorted by search key to support range queries.
– Lexicographic order, or
– Spatial order.
(Figure: data entries sorted by <age, sal> vs. sorted by <sal, age>.)
Composite Search Keys
• To retrieve Emp records with age=30 AND sal=4000, an index on <age, sal> would be better than an index on age or an index on sal.
• If condition is: 20<age<30 AND 3000<sal<5000:
– Clustered tree index on <age, sal> or <sal, age> is best.
• If condition is: age=30 AND 3000<sal<5000:
– Clustered <age, sal> index much better than <sal, age> index!
• Composite indexes are larger, updated more often.
• Choice of index key orthogonal to clustering etc.
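Why attribute order matters for the second condition can be seen with sorted tuples standing in for data entries. A minimal sketch with made-up (age, sal) values:

```python
from bisect import bisect_left, bisect_right

# Data entries as (age, sal) tuples; sorting them is the <age, sal> composite key.
entries = [(age, sal) for age in range(20, 40) for sal in (2000, 4000, 6000)]
by_age_sal = sorted(entries)

# age=30 AND 3000<sal<5000: under <age, sal>, all matches are contiguous,
# so a tree index can do one tight range scan.
lo = bisect_right(by_age_sal, (30, 3000))
hi = bisect_left(by_age_sal, (30, 5000))
matches = by_age_sal[lo:hi]

# Under <sal, age>, every entry with 3000<sal<5000 would have to be scanned
# and filtered on age, because the age=30 rows are scattered through that range.
```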
Index-Only Plans
• A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available. E.g.:

SELECT AVG(E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000

Summary
• Many alternative file organizations exist, each appropriate in some situation.
• If selection queries are frequent, sorting the file or building an index is important.
– Hash-based indexes only good for equality search.
– Sorted files and tree-based indexes best for range search; also good for equality search. (Files rarely kept sorted in practice; B+ tree index is better.)
• Index is a collection of data entries plus a way to quickly find entries with given key values.
• Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
– Choice orthogonal to indexing technique used to locate data entries with a given key value.
• Can have several indexes on a given file of data records, each with a different search key.
• Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. Differences have important consequences for utility/performance.
Introduction
• As for any index, 3 alternatives for data entries k*:
– Data record with key value k
– <k, rid of data record with search key value k>
– <k, list of rids of data records with search key value k>
• Choice is orthogonal to the indexing technique used to locate data entries k*.
• Tree-structured indexing techniques support both range searches and equality searches.
• ISAM: static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes.
Range Searches
• "Find all students with gpa > 3.0"
– If data is in sorted file, do binary search to find first such student, then scan to find others.
– Cost of binary search can be quite high.
• Simple idea: Create an 'index' file.
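The "binary search, then scan" step over a sorted file can be sketched with Python's bisect module (made-up student data):

```python
import bisect

# "File" of records kept sorted by gpa.
students = sorted([("amy", 3.6), ("bob", 2.7), ("cal", 3.1), ("dee", 3.9)],
                  key=lambda rec: rec[1])
gpas = [rec[1] for rec in students]

# Binary search for the first gpa > 3.0, then scan sequentially to the end.
first = bisect.bisect_right(gpas, 3.0)
result = students[first:]
```

The binary search itself touches log_2 N records scattered across the file, which is why an index file over page boundaries (the ISAM idea below) is cheaper than searching the data file directly.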
ISAM
(Figure: index entries directing search to leaf pages.)
Comments on ISAM
• File creation: Leaf (data) pages allocated sequentially, sorted by search key; then index pages allocated, then space for overflow pages.
• Index entries: <search key value, page id>; they 'direct' search for data entries, which are in leaf pages.
• Search: Start at root; use key comparisons to go to leaf. Cost = log_F N; F = # entries/index pg, N = # leaf pgs.
• Insert: Find leaf data entry belongs to, and put it there.
• Delete: Find and remove from leaf; if empty overflow page, deallocate.
Example ISAM Tree
• Each node can hold 2 entries; no need for 'next-leaf-page' pointers. (Why?)
B+ Tree: Most Widely Used Index
• Insert/delete at log_F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages)
• Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree.
• Supports equality and range-searches efficiently.
Example B+ Tree
• Search begins at root, and key comparisons direct it to a leaf (as in ISAM).
• Search for 5*, 15*, all data entries >= 24* ...

B+ Trees in Practice
• Typical order: 100. Typical fill-factor: 67%.
– average fanout = 133
• Typical capacities:
– Height 4: 133^4 = 312,900,721 records
– Height 3: 133^3 = 2,352,637 records
• Can often hold top levels in buffer pool:
– Level 1 = 1 page = 8 Kbytes
– Level 2 = 133 pages = 1 Mbyte
– Level 3 = 17,689 pages = 133 MBytes
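These capacities follow directly from the fanout arithmetic; a quick check:

```python
# Order d=100 at 67% fill: about two-thirds of 2d = 200 entries per node.
fanout = (2 * 100 * 2) // 3                   # 133
height3 = fanout ** 3                         # records reachable 3 levels below root
height4 = fanout ** 4                         # one more level multiplies by the fanout
top_levels_pages = 1 + fanout + fanout ** 2   # levels 1-3 of the tree, pages to buffer
print(fanout, height3, height4, top_levels_pages)
```

Caching under 18,000 pages of the top levels means a search over hundreds of millions of records costs only one or two actual disk I/Os.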
Inserting a Data Entry into a B+ Tree
• Find correct leaf L.
• Put data entry onto L.
– If L has enough space, done!
– Else, must split L (into L and a new node L2). Redistribute entries evenly, copy up middle key. Insert index entry pointing to L2 into parent of L.
• This can happen recursively.
– To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
• Splits "grow" tree; root split increases height.
– Tree growth: gets wider or one level taller at top.
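The leaf-split step above can be sketched in isolation. A minimal sketch (keys only, order d=2, no parent bookkeeping); note the copy-up keeps the middle key in the new right leaf, unlike the push-up used for index nodes:

```python
import bisect

d = 2  # order: a leaf holds at most 2d = 4 entries

def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf; split with copy-up if it overflows."""
    bisect.insort(leaf, key)
    if len(leaf) <= 2 * d:
        return leaf, None, None            # fits: no split needed
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]   # redistribute entries evenly
    return left, right, right[0]           # copy up middle key (it stays in the right leaf)

# Inserting 8 into a full leaf, as in the 8* example below.
left, right, copied_up = insert_into_leaf([2, 3, 5, 7], 8)
```

The caller would then insert an index entry <copied_up, pointer to right> into the parent, which may itself overflow and split recursively.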
Inserting 8* into Example B+ Tree
• Observe how minimum occupancy is guaranteed in both leaf and index pg splits.
• Note difference between copy-up and push-up; be sure you understand the reasons for this.

Example B+ Tree After Inserting 8*
• Notice that root was split, leading to increase in height.
• In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice.
Deleting a Data Entry from a B+ Tree 120
•
Start at root, find leaf L where entry belongs.
•
Remove the entry.
•
–
If L is at least half-full, done!
–
If L has only d-1 entries, •
Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).
•
If re-distribution fails, merge L and sibling.
If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
Merge could propagate to root, decreasing height.
Example Tree After (Inserting 8*, Then) Deleting 19* and 20* ...
• Deleting 19* is easy.
• Deleting 20* is done with re-distribution. Notice how middle key is copied up.

... And Then Deleting 24*
• Must merge.
• Observe 'toss' of index entry (on right), and 'pull down' of index entry (below).
ADDONS

What's a database?
A database is a collection of data organized in a particular way. Databases can be of many types, such as Flat File Databases, Relational Databases, and Distributed Databases.

What's SQL?
In the early 1970s, IBM researchers created a simple non-procedural language called Structured English Query Language, or SEQUEL. It was based on Dr. Edgar F. (Ted) Codd's relational model for data storage, which called for a universal language for accessing databases. In the late 80s, ANSI and ISO (two organizations dealing with standards for a wide variety of things) came out with a standardized version called Structured Query Language, or SQL. SQL is pronounced 'Sequel'. There have been several versions of SQL; the latest is SQL-99, though SQL-92 is the current universally adopted standard.
SQL is the language used to query all databases. It's simple to learn and appears to do very little, but it is the heart of a successful database application. Understanding SQL and using it efficiently is imperative in designing an efficient database application. The better your understanding of SQL, the more versatile you'll be in getting information out of databases.

What's an RDBMS?
The relational concept was first described around 1970 by Dr. Edgar F. Codd in an IBM research paper. A relational database uses the concept of linked two-dimensional tables which consist of rows and columns. A user can draw relationships between multiple tables and present the output as a table again. A user of a relational database need not understand the representation of data in order to retrieve it. Relational programming is non-procedural. [What's procedural and non-procedural? Programming languages are procedural if they use programming elements such as conditional statements (if-then-else, do-while etc.). SQL has none of these types of statements.] In 1979, Relational Software released the world's first commercial relational database, called Oracle V.2.

What's a DBMS?
MySQL and mSQL are database management systems, or DBMSs. These software packages are used to manipulate a database. All DBMSs use their own implementation of SQL. It may be a subset or a superset of the instructions provided by SQL-92. MySQL, due to its simplicity, uses a subset of SQL-92 (also known as SQL2).

What's Database Normalization?
Normalization is the process where a database is designed in a way that removes redundancies and increases the clarity in organizing data. In plain English: take similar stuff out of a collection of data and place it into tables. Keep doing this for each new table recursively and you'll have a normalized database. From this resultant database you should be able to recreate the data in its original state if there is a need to do so.

The important thing here is to know when to normalize and when to be practical. That will come with experience. For now, read on...
Normalization of a database helps in modifying the design at later times and in being prepared if a change is required in the database design. Normalization raises the efficiency of the database in terms of management, data storage and scalability.

Normalization of a database is achieved by following a set of rules called 'forms' in creating the database. These rules are 5 in number (with one extra one stuck in between 3 & 4):

1st Normal Form or 1NF: Each column holds atomic (indivisible) values, and there are no repeating groups of columns.
2nd Normal Form or 2NF: The entity under consideration should already be in 1NF, and all attributes within the entity should depend solely on the entity's unique identifier.
3rd Normal Form or 3NF: The entity should already be in 2NF, and no column entry should be dependent on any other entry (value) other than the key for the table. If such an entry exists, move it outside into a new table.

If these three forms are achieved, the database is considered normalized. But there are three more 'extended' forms for the elitist:

BCNF (Boyce & Codd): The database should be in 3NF, and every determinant (an attribute that other attributes depend on) must be a candidate key.
4NF: Tables cannot have multi-valued dependencies on a Primary Key.
5NF: There should be no cyclic dependencies in a composite key.

This is a highly simplified explanation of database normalization; one can study the process extensively. After working with databases for some time you'll automatically create normalized databases, as it's logical and practical. For now, don't worry too much about normalization. The quickest way to grasp SQL and databases is to plunge headlong into creating tables and start messing around with SQL statements. After you go through the tutorial examples and also the example contacts database, look at the example provided in creating a normalized database near the very end of this tutorial. Then try to think how you would like to create your own database. Much of database design depends on how YOU want to keep the data. In real-life situations you may often find it more convenient to store data in tables designed in a way that falls a bit short of keeping all the NFs happy. But that's what databases are all about: making your life simpler.

Onto SQL
There are four basic commands which are the workhorses of SQL and figure in almost all queries to a database:

INSERT - Insert data
DELETE - Delete data
SELECT - Pull data
UPDATE - Change existing data

As you can see, SQL is like English. Let's build a real-world example database using MySQL and perform some SQL operations on it. A database that practically anyone could use would be a Contacts database. In our example we are going to create a database with the following fields:

FirstName
LastName
BirthDate
StreetAddress
City
State
Zip
Country
TelephoneHome
TelephoneWork
Email
CompanyName
Designation
First, let's decide how we are going to store this data in the database. For illustration purposes, we are going to keep this data in multiple tables.
This will let us exercise all of the SQL commands pertaining to retrieving data from multiple tables. Also, we can separate different kinds of entities into different tables. So let's say you have thousands of friends and need to send a mass email to all of them; a SELECT statement (covered later) will look at only one table.

We can keep the FirstName, LastName and BirthDate in one table; address-related data in another; company details in another; emails in another; and telephones in another. Let's build the database in MySQL.

While building a database you need to understand the concept of data types. Data types define how data is stored in fields or cells within a database, whether it's a date, a string consisting of 20 characters, an integer, etc. When we build tables within a database we define the contents of each field in each row using a data type. It's imperative that you use only the data type that fits your needs and don't use a data type that reserves more memory than the data in the field actually requires.

Let's look at the various data types under MySQL.

Type | Size in bytes | Description
TINYINT(length) | 1 | Integer with unsigned range 0 to 255 and signed range -128 to 127
SMALLINT(length) | 2 | Integer with unsigned range 0 to 65535 and signed range -32768 to 32767
MEDIUMINT(length) | 3 | Integer with unsigned range 0 to 16777215 and signed range -8388608 to 8388607
INT(length) | 4 | Integer with unsigned range 0 to 4294967295 and signed range -2147483648 to 2147483647
BIGINT(length) | 8 | Integer with unsigned range 0 to 18446744073709551615 and signed range -9223372036854775808 to 9223372036854775807
FLOAT(length, decimal) | 4 | Floating point number with max. value +/-3.402823466E38 and min. (non-zero) value +/-1.175494351E-38
DOUBLE PRECISION(length, decimal) | 8 | Floating point number with max. value +/-1.7976931348623157E308 and min. (non-zero) value +/-2.2250738585072014E-308
DECIMAL(length, decimal) | length | Floating point number with the range of DOUBLE that is stored as a CHAR field type
TIMESTAMP(length) | 4 | YYYYMMDDHHMMSS, YYMMDDHHMMSS, YYYYMMDD or YYMMDD; updated each time the row changes; a NULL value sets the field to the current time
DATE | 3 | YYYY-MM-DD
TIME | 3 | HH:MM:SS
DATETIME | 8 | YYYY-MM-DD HH:MM:SS
YEAR | 1 | YYYY or YY
CHAR(length) | length | Fixed-length text string; values shorter than the assigned length are filled with trailing spaces
VARCHAR(length) | length | Variable-length text string (255 character max); unused trailing spaces are removed before storing
TINYTEXT, TINYBLOB | length+1 | A text/binary field with max. length of 255 characters
TEXT, BLOB | length+2 | 64Kb of text/data
MEDIUMTEXT, MEDIUMBLOB | length+3 | 16Mb of text/data
LONGTEXT, LONGBLOB | length+4 | 4GB of text/data
ENUM | 1-2 | One of up to 65535 possible options, e.g. ENUM('abc','def','ghi')
SET | 1-8 | Any number of a set of predefined possible values
The following examples will make things quite clear on declaring data types within SQL statements.

Steps in Creating the Database using MySQL

From the shell prompt (either in DOS or UNIX):

mysqladmin create contacts

This will create an empty database called "contacts". Now run the command-line tool "mysql" and from the mysql prompt do the following:

mysql> use contacts;

(You'll get the response "Database changed".) The following commands entered at the MySQL prompt will create the tables in the database.
mysql> CREATE TABLE names (contact_id SMALLINT NOT NULL AUTO_INCREMENT PRIMARY KEY, FirstName CHAR(20), LastName CHAR(20), BirthDate DATE);

mysql> CREATE TABLE address (contact_id SMALLINT NOT NULL PRIMARY KEY, StreetAddress CHAR(50), City CHAR(20), State CHAR(20), Zip CHAR(10), Country CHAR(20));

mysql> CREATE TABLE telephones (contact_id SMALLINT NOT NULL PRIMARY KEY, TelephoneHome CHAR(20), TelephoneWork CHAR(20));

mysql> CREATE TABLE email (contact_id SMALLINT NOT NULL PRIMARY KEY, Email CHAR(20));

mysql> CREATE TABLE company_details (contact_id SMALLINT NOT NULL PRIMARY KEY, CompanyName CHAR(25), Designation CHAR(15));

Note: Here we assume that one person will have only one email address. If one person had multiple addresses, this design would be a problem: we'd need another field to indicate to whom each email address belonged. In this particular case email ownership is indicated by the primary key. The same is true for telephones: we are assuming that one person has only one home telephone and one work telephone number. This need not be true. Similarly, one person could work for multiple companies at the same time, holding two different designations. In all these cases an extra field would solve the issue. For now, however, let's work with this small design.
This type of table design where one table establishes a relationship with several other tables is known as a 'one to many' relationship. In a 'many to many' relationship we could have several Auto Incremented Primary Keys in various tables with several inter-relationships. Foreign Key: A foreign key is a field in a table which is also the Primary Key in another table. This is known commonly as 'referential integrity'. 129
Execute the following commands to see the newly created tables and their contents. To see the tables inside the database: mysql> SHOW TABLES; +-----------------------+ | Tables in contacts | +-----------------------+ | address | | company_details | | email | | names | | telephones | +----------------------+ 5 rows in set (0.00 sec) To see the columns within a particular table: mysql>SHOW COLUMNS FROM address; +---------------+-------------+------+-----+---------+-------+---------------------------------+ | Field | Type | Null | Key | Default | Extra | Privileges | +---------------+-------------+------+-----+---------+-------+---------------------------------+ | contact_id | smallint(6) | | PRI | 0 | | select,insert,update,references | | StreetAddress | char(50) | YES | | NULL | | select,insert,update,references | | City | char(20) | YES | | NULL | | select,insert,update,references | | State | char(20) | YES | | NULL | | select,insert,update,references | | Zip | char(10) | YES | | NULL | | select,insert,update,references | | Country | char(20) | YES | | NULL | | select,insert,update,references | +---------------+-------------+------+-----+---------+-------+------------------ ---------------+ 6 rows in set (0.00 sec) So we have the tables created and ready. Now we put in some data. Let's start with the 'names' table as it uses a unique AUTO_INCREMENT field which in turn is used in the other tables. Inserting data, one row at a time: mysql> INSERT INTO names (FirstName, LastName, BirthDate) VALUES ('Yamila','Diaz ','1974-10-13'); Query OK, 1 row affected (0.00 sec) Inserting multiple rows at a time: mysql> INSERT INTO names (FirstName, LastName, BirthDate) VALUES ('Nikki','Taylor','1972-03-04'),('Tia','Carrera','1975-09-18'); 130
Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 Let's see what the data looks like inside the table. We use the SELECT command for this. mysql> SELECT * from NAMES; +------------+-----------+----------+------------+ | contact_id | FirstName | LastName | BirthDate | +------------+-----------+----------+------------+ | 3 | Tia | Carrera | 1975-09-18 | | 2 | Nikki | Taylor | 1972-03-04 | | 1 | Yamila | Diaz | 1974-10-13 | +------------+-----------+----------+------------+ 3 rows in set (0.06 sec) Try another handy command called 'DESCRIBE'. mysql> DESCRIBE names; +------------+-------------+------+-----+---------+----------------+---------------------------------+ | Field | Type | Null | Key | Default | Extra | Privileges | +------------+-------------+------+-----+---------+----------------+---------------------------------+ | contact_id | smallint(6) | | PRI | NULL | auto_increment | select,insert,update,references | | FirstName | char(20) | YES | | NULL | | select,insert,update,references | | LastName | char(20) | YES | | NULL | | select,insert,update,references | | BirthDate | date | YES | | NULL | | select,insert,update,references | +------------+-------------+------+-----+---------+----------------+---------------------------------+ 4 rows in set (0.00 sec) Now lets populate the other tables. Observer the syntax used. mysql> INSERT INTO address(contact_id, StreetAddress, City, State, Zip, Country) VALUES ('1', '300 Yamila Ave.', 'Los Angeles', 'CA', '300012', 'USA'),('2','4000 Nikki St.','Boca Raton','FL','500034','USA'),('3','404 Tia Blvd.','New York','NY','10011','USA'); Query OK, 3 rows affected (0.05 sec) Records: 3 Duplicates: 0 Warnings: 0 mysql> SELECT * FROM address; +------------+-----------------+-------------+-------+--------+---------+ | contact_id | StreetAddress | City | State | Zip | Country | +------------+-----------------+-------------+-------+--------+---------+ | 1 | 300 Yamila Ave. 
| Los Angeles | CA | 300012 | USA | | 2 | 4000 Nikki St. | Boca Raton | FL | 500034 | USA | | 3 | 404 Tia Blvd. | New York | NY | 10011 | USA | +------------+-----------------+-------------+-------+--------+---------+ 3 rows in set (0.00 sec) 131
mysql> INSERT INTO company_details (contact_id, CompanyName, Designation) VALUES ('1','Xerox','New Business Manager'), ('2','Cabletron','Customer Support Eng'), ('3','Apple','Sales Manager'); Query OK, 3 rows affected (0.05 sec) Records: 3 Duplicates: 0 Warnings: 0 mysql> SELECT * FROM company_details; +------------+-------------+----------------------+ | contact_id | CompanyName | Designation | +------------+-------------+----------------------+ | 1 | Xerox | New Business Manager | | 2 | Cabletron | Customer Support Eng | | 3 | Apple | Sales Manager | +------------+-------------+----------------------+ 3 rows in set (0.06 sec) mysql> INSERT INTO email (contact_id, Email) VALUES ('1', '[email protected]'),( '2', '[email protected]'),('3','[email protected]'); Query OK, 3 rows affected (0.00 sec) Records: 3 Duplicates: 0 Warnings: 0 mysql> SELECT * FROM email; +------------+-------------------+ | contact_id | Email | +------------+-------------------+ | 1 | [email protected] | | 2 | [email protected] | | 3 | [email protected] | +------------+-------------------+ 3 rows in set (0.06 sec) mysql> INSERT INTO telephones (contact_id, TelephoneHome, TelephoneWork) VALUES ('1','333-50000','333-60000'),('2','444-70000','444-80000'),('3','555-30000','55 5-40000'); Query OK, 3 rows affected (0.00 sec) Records: 3 Duplicates: 0 Warnings: 0 mysql> SELECT * FROM telephones; +------------+---------------+---------------+ | contact_id | TelephoneHome | TelephoneWork | +------------+---------------+---------------+ | 1 | 333-50000 | 333-60000 | | 2 | 444-70000 | 444-80000 | | 3 | 555-30000 | 555-40000 | +------------+---------------+---------------+ 3 rows in set (0.00 sec) Okay, so we now have all our data ready for experimentation. 132
Before we start experimenting with manipulating the data let's look at how MySQL stores the Data. To do this execute the following command from the shell prompt. mysqldump contacts > contacts.sql Note: The reverse operation for this command is: mysql contacts < contacts.sql The file generated is a text file that contains all the data and SQL instruction needed to recreate the same database. As you can see, the SQL here is slightly different than what was typed in. Don't worry about this. It's all good ! It would also be obvious that this is a good way to backup your stuff. # MySQL dump 8.2 # # Host: localhost Database: contacts #-------------------------------------------------------# Server version 3.22.34-shareware-debug # # Table structure for table 'address' # CREATE TABLE address ( contact_id smallint(6) DEFAULT '0' NOT NULL, StreetAddress char(50), City char(20), State char(20), Zip char(10), Country char(20), PRIMARY KEY (contact_id) ); # # Dumping data for table 'address' # INSERT INTO address VALUES (1,'300 Yamila Ave.','Los Angeles','CA','300012','USA'); INSERT INTO address VALUES (2,'4000 Nikki St.','Boca Raton','FL','500034','USA'); INSERT INTO address VALUES (3,'404 Tia Blvd.','New York','NY','10011','USA');
# # Table structure for table 'company_details' # CREATE TABLE company_details ( contact_id smallint(6) DEFAULT '0' NOT NULL, CompanyName char(25), Designation char(20), PRIMARY KEY (contact_id) ); # # Dumping data for table 'company_details' # INSERT INTO company_details VALUES (1,'Xerox','New Business Manager'); INSERT INTO company_details VALUES (2,'Cabletron','Customer Support Eng'); INSERT INTO company_details VALUES (3,'Apple','Sales Manager'); # # Table structure for table 'email' # CREATE TABLE email ( contact_id smallint(6) DEFAULT '0' NOT NULL, Email char(20), PRIMARY KEY (contact_id) ); # # Dumping data for table 'email' # INSERT INTO email VALUES (1,'[email protected]'); INSERT INTO email VALUES (2,'[email protected]'); INSERT INTO email VALUES (3,'[email protected]'); # # Table structure for table 'names' # CREATE TABLE names ( contact_id smallint(6) DEFAULT '0' NOT NULL auto_increment, FirstName char(20), LastName char(20), BirthDate date, 134
PRIMARY KEY (contact_id) ); # # Dumping data for table 'names' # INSERT INTO names VALUES (3,'Tia','Carrera','1975-09-18'); INSERT INTO names VALUES (2,'Nikki','Taylor','1972-03-04'); INSERT INTO names VALUES (1,'Yamila','Diaz','1974-10-13'); # # Table structure for table 'telephones' # CREATE TABLE telephones ( contact_id smallint(6) DEFAULT '0' NOT NULL, TelephoneHome char(20), TelephoneWork char(20), PRIMARY KEY (contact_id) ); # # Dumping data for table 'telephones' # INSERT INTO telephones VALUES (1,'333-50000','333-60000'); INSERT INTO telephones VALUES (2,'444-70000','444-80000'); INSERT INTO telephones VALUES (3,'555-30000','555-40000'); Let's try some SELECT statement variations: To select all names whose corresponding contact_id is greater than 1. mysql> SELECT * FROM names WHERE contact_id > 1; +------------+-----------+----------+------------+ | contact_id | FirstName | LastName | BirthDate | +------------+-----------+----------+------------+ | 3 | Tia | Carrera | 1975-09-18 | | 2 | Nikki | Taylor | 1972-03-04 | +------------+-----------+----------+------------+ 2 rows in set (0.00 sec) As a condition we can also use NOT NULL. This statement will return all names where there exists a contact_id. 135
mysql> SELECT * FROM names WHERE contact_id IS NOT NULL; +------------+-----------+----------+------------+ | contact_id | FirstName | LastName | BirthDate | +------------+-----------+----------+------------+ | 3 | Tia | Carrera | 1975-09-18 | | 2 | Nikki | Taylor | 1972-03-04 | | 1 | Yamila | Diaz | 1974-10-13 | +------------+-----------+----------+------------+ 3 rows in set (0.06 sec) Result's can be arranged in a particular way using the statement ORDER BY. mysql> SELECT * FROM names WHERE contact_id IS NOT NULL ORDER BY LastName; +------------+-----------+----------+------------+ | contact_id | FirstName | LastName | BirthDate | +------------+-----------+----------+------------+ | 3 | Tia | Carrera | 1975-09-18 | | 1 | Yamila | Diaz | 1974-10-13 | | 2 | Nikki | Taylor | 1972-03-04 | +------------+-----------+----------+------------+ 3 rows in set (0.06 sec) 'asc' and 'desc' stand for ascending and descending respectively and can be used to arrange the results. mysql> SELECT * FROM names WHERE contact_id IS NOT NULL ORDER BY LastName desc; +------------+-----------+----------+------------+ | contact_id | FirstName | LastName | BirthDate | +------------+-----------+----------+------------+ | 2 | Nikki | Taylor | 1972-03-04 | | 1 | Yamila | Diaz | 1974-10-13 | | 3 | Tia | Carrera | 1975-09-18 | +------------+-----------+----------+------------+ 3 rows in set (0.04 sec) You can also place date types into conditional statements. mysql> SELECT * FROM names WHERE BirthDate > '1973-03-06'; +------------+-----------+----------+------------+ | contact_id | FirstName | LastName | BirthDate | +------------+-----------+----------+------------+ | 3 | Tia | Carrera | 1975-09-18 | | 1 | Yamila | Diaz | 1974-10-13 | +------------+-----------+----------+------------+ 136
2 rows in set (0.00 sec)

LIKE is a statement to match field values using wildcards. The % sign denotes a wildcard and can represent multiple characters.

mysql> SELECT FirstName, LastName FROM names WHERE LastName LIKE 'C%';
+-----------+----------+
| FirstName | LastName |
+-----------+----------+
| Tia | Carrera |
+-----------+----------+
1 row in set (0.06 sec)

'_' is used to represent a single-character wildcard.

mysql> SELECT FirstName, LastName FROM names WHERE LastName LIKE '_iaz';
+-----------+----------+
| FirstName | LastName |
+-----------+----------+
| Yamila | Diaz |
+-----------+----------+
1 row in set (0.00 sec)

SQL Logical Operations (evaluated from left to right):
1. NOT or !
2. AND or &&
3. OR or ||
4. = : Equal
5. <> or != : Not Equal
6. <=
7. >=
8. < , >

Here are some more variations with logical operators and using the IN statement.

mysql> SELECT FirstName FROM names WHERE contact_id < 3 AND LastName LIKE 'D%';
+-----------+
| FirstName |
+-----------+ | Yamila | +-----------+ 1 row in set (0.00 sec) mysql> SELECT contact_id FROM names WHERE LastName IN ('Diaz','Carrera'); +------------+ | contact_id | +------------+ |3| |1| +------------+ 2 rows in set (0.02 sec) To return the number of rows in a table mysql> SELECT count(*) FROM names; +----------+ | count(*) | +----------+ |3| +----------+ 1 row in set (0.02 sec) mysql> SELECT count(FirstName) FROM names; +------------------+ | count(FirstName) | +------------------+ |3| +------------------+ 1 row in set (0.00 sec) To do some basic arithmetic aggregate functions. mysql> SELECT SUM(contact_id) FROM names; +-----------------+ | SUM(contact_id) | +-----------------+ |6| +-----------------+ 1 row in set (0.00 sec) To select a largest value from a row. Substitute 'MIN' and see what happens next. mysql> SELECT MAX(contact_id) FROM names; +-----------------+ | MAX(contact_id) | 138
+-----------------+
|3|
+-----------------+
1 row in set (0.00 sec)

HAVING
Take a look at the first query using WHERE and the second using HAVING.

mysql> SELECT * FROM names WHERE contact_id >=1;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 1 | Yamila | Diaz | 1974-10-13 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 3 | Tia | Carrera | 1975-09-18 |
+------------+-----------+----------+------------+
3 rows in set (0.03 sec)

mysql> SELECT * FROM names HAVING contact_id >=1;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
+------------+-----------+----------+------------+
3 rows in set (0.00 sec)

Now let's work with multiple tables and see how information can be pulled out of the data.

mysql> SELECT names.contact_id, FirstName, LastName, Email FROM names, email WHERE names.contact_id = email.contact_id;
+------------+-----------+----------+-------------------+
| contact_id | FirstName | LastName | Email |
+------------+-----------+----------+-------------------+
| 1 | Yamila | Diaz | [email protected] |
| 2 | Nikki | Taylor | [email protected] |
| 3 | Tia | Carrera | [email protected] |
+------------+-----------+----------+-------------------+
3 rows in set (0.11 sec)

Note that when three tables are involved, the two join conditions must be combined with AND; writing names.contact_id=email.contact_id=telephones.contact_id would compare the boolean result of the first test against the third column and produce wrong matches.

mysql> SELECT DISTINCT names.contact_id, FirstName, Email, TelephoneWork FROM names, email, telephones WHERE names.contact_id=email.contact_id AND email.contact_id=telephones.contact_id;
+------------+-----------+-------------------+---------------+
| contact_id | FirstName | Email | TelephoneWork |
+------------+-----------+-------------------+---------------+
| 1 | Yamila | [email protected] | 333-60000 |
| 2 | Nikki | [email protected] | 444-80000 |
| 3 | Tia | [email protected] | 555-40000 |
+------------+-----------+-------------------+---------------+
3 rows in set (0.05 sec)

So what's a JOIN? JOIN is the action performed on multiple tables that returns a result as a table. It's what makes a database 'relational'. There are several types of joins. Let's look at LEFT JOIN (OUTER JOIN) and RIGHT JOIN.

Let's first check out the contents of the tables we're going to use.

mysql> SELECT * FROM names;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
+------------+-----------+----------+------------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM email;
+------------+-------------------+
| contact_id | Email |
+------------+-------------------+
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
+------------+-------------------+
3 rows in set (0.00 sec)

A LEFT JOIN first:

mysql> SELECT * FROM names LEFT JOIN email USING (contact_id);
+------------+-----------+----------+------------+------------+-------------------+
| contact_id | FirstName | LastName | BirthDate | contact_id | Email |
+------------+-----------+----------+------------+------------+-------------------+
| 3 | Tia | Carrera | 1975-09-18 | 3 | [email protected] |
| 2 | Nikki | Taylor | 1972-03-04 | 2 | [email protected] |
| 1 | Yamila | Diaz | 1974-10-13 | 1 | [email protected] |
+------------+-----------+----------+------------+------------+-------------------+
3 rows in set (0.16 sec)

To find the people who have a home phone number:

mysql> SELECT names.FirstName FROM names LEFT JOIN telephones ON names.contact_id = telephones.contact_id WHERE TelephoneHome IS NOT NULL;
+-----------+
| FirstName |
+-----------+
| Tia |
| Nikki |
| Yamila |
+-----------+
3 rows in set (0.02 sec)

The same query leaving out the 'names.' qualifier (from names.FirstName) will generate the same result.

mysql> SELECT FirstName FROM names LEFT JOIN telephones ON names.contact_id = telephones.contact_id WHERE TelephoneHome IS NOT NULL;
+-----------+
| FirstName |
+-----------+
| Tia |
| Nikki |
| Yamila |
+-----------+
3 rows in set (0.00 sec)

And now a RIGHT JOIN:

mysql> SELECT * FROM names RIGHT JOIN email USING(contact_id);
+------------+-----------+----------+------------+------------+-------------------+
| contact_id | FirstName | LastName | BirthDate | contact_id | Email |
+------------+-----------+----------+------------+------------+-------------------+
| 1 | Yamila | Diaz | 1974-10-13 | 1 | [email protected] |
| 2 | Nikki | Taylor | 1972-03-04 | 2 | [email protected] |
| 3 | Tia | Carrera | 1975-09-18 | 3 | [email protected] |
+------------+-----------+----------+------------+------------+-------------------+
3 rows in set (0.03 sec)

BETWEEN
+------------+-----------+----------+------------+------------+-------------------+ 3 rows in set (0.16 sec) To find the people who have a home phone number. mysql> SELECT names.FirstName FROM names LEFT JOIN telephones ON names.contact_id = telephones.contact_id WHERE TelephoneHome IS NOT NULL; +-----------+ | FirstName | +-----------+ | Tia | | Nikki | | Yamila | +-----------+ 3 rows in set (0.02 sec) These same query leaving out 'names' (from names.FirstName) is still the same and will generate the same result. mysql> SELECT FirstName FROM names LEFT JOIN telephones ON names.contact_id = telephones.contact_id WHERE TelephoneHome IS NOT NULL; +-----------+ | FirstName | +-----------+ | Tia | | Nikki | | Yamila | +-----------+ 3 rows in set (0.00 sec) And now a RIGHT JOIN: mysql> SELECT * FROM names RIGHT JOIN email USING(contact_id); +------------+-----------+----------+------------+------------+----------------- --+ | contact_id | FirstName | LastName | BirthDate | contact_id | Email | +------------+-----------+----------+------------+------------+-------------------+ | 1 | Yamila | Diaz | 1974-10-13 | 1 | [email protected] | | 2 | Nikki | Taylor | 1972-03-04 | 2 | [email protected] | | 3 | Tia | Carrera | 1975-09-18 | 3 | [email protected] | +------------+-----------+----------+------------+------------+------------------+ 3 rows in set (0.03 sec) BETWEEN 141
This conditional statement is used to select rows where a column's value falls within a given range. The following example illustrates its use.

mysql> SELECT * FROM names;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate  |
+------------+-----------+----------+------------+
|          3 | Tia       | Carrera  | 1975-09-18 |
|          2 | Nikki     | Taylor   | 1972-03-04 |
|          1 | Yamila    | Diaz     | 1974-10-13 |
+------------+-----------+----------+------------+
3 rows in set (0.06 sec)

mysql> SELECT FirstName, LastName FROM names WHERE contact_id BETWEEN 2 AND 3;
+-----------+----------+
| FirstName | LastName |
+-----------+----------+
| Tia       | Carrera  |
| Nikki     | Taylor   |
+-----------+----------+
2 rows in set (0.00 sec)

ALTER

The ALTER statement is used to add a new column to an existing table or to make changes to it.

mysql> ALTER TABLE names ADD Age SMALLINT;
Query OK, 3 rows affected (0.11 sec)
Records: 3  Duplicates: 0  Warnings: 0

Now let's take a look at the ALTERed table.

mysql> SHOW COLUMNS FROM names;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| contact_id | smallint(6) |      | PRI | 0       | auto_increment |
| FirstName  | char(20)    | YES  |     | NULL    |                |
| LastName   | char(20)    | YES  |     | NULL    |                |
| BirthDate  | date        | YES  |     | NULL    |                |
| Age        | smallint(6) | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+
5 rows in set (0.06 sec)
But we don't require Age to be a SMALLINT when a TINYINT would suffice, so we use another ALTER statement.

mysql> ALTER TABLE names CHANGE COLUMN Age Age TINYINT;
Query OK, 3 rows affected (0.02 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SHOW COLUMNS FROM names;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| contact_id | smallint(6) |      | PRI | NULL    | auto_increment |
| FirstName  | char(20)    | YES  |     | NULL    |                |
| LastName   | char(20)    | YES  |     | NULL    |                |
| BirthDate  | date        | YES  |     | NULL    |                |
| Age        | tinyint(4)  | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

MODIFY

You can also use the MODIFY statement to change column data types.

mysql> ALTER TABLE names MODIFY COLUMN Age SMALLINT;
Query OK, 3 rows affected (0.03 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SHOW COLUMNS FROM names;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| contact_id | smallint(6) |      | PRI | NULL    | auto_increment |
| FirstName  | char(20)    | YES  |     | NULL    |                |
| LastName   | char(20)    | YES  |     | NULL    |                |
| BirthDate  | date        | YES  |     | NULL    |                |
| Age        | smallint(6) | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

To rename a table:

mysql> ALTER TABLE names RENAME AS mynames;
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW TABLES;
+--------------------+
| Tables_in_contacts |
+--------------------+
| address            |
| company_details    |
| email              |
| mynames            |
| telephones         |
+--------------------+
5 rows in set (0.00 sec)

We rename it back to the original name.

mysql> ALTER TABLE mynames RENAME AS names;
Query OK, 0 rows affected (0.01 sec)

UPDATE

The UPDATE command is used to set or change the value of a field in a table.

mysql> UPDATE names SET Age ='23' WHERE FirstName='Tia';
Query OK, 1 row affected (0.06 sec)
Rows matched: 1  Changed: 1  Warnings: 0

The table after the UPDATE:

mysql> SELECT * FROM names;
+------------+-----------+----------+------------+------+
| contact_id | FirstName | LastName | BirthDate  | Age  |
+------------+-----------+----------+------------+------+
|          3 | Tia       | Carrera  | 1975-09-18 |   23 |
|          2 | Nikki     | Taylor   | 1972-03-04 | NULL |
|          1 | Yamila    | Diaz     | 1974-10-13 | NULL |
+------------+-----------+----------+------------+------+
3 rows in set (0.05 sec)

The table after a further UPDATE of the same kind (setting Age to 24):

mysql> SELECT * FROM names;
+------------+-----------+----------+------------+------+
| contact_id | FirstName | LastName | BirthDate  | Age  |
+------------+-----------+----------+------------+------+
|          3 | Tia       | Carrera  | 1975-09-18 |   24 |
|          2 | Nikki     | Taylor   | 1972-03-04 | NULL |
|          1 | Yamila    | Diaz     | 1974-10-13 | NULL |
+------------+-----------+----------+------------+------+
3 rows in set (0.00 sec)
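The ALTER and UPDATE steps shown above can be sketched end-to-end in a self-contained script. This is an illustration only, using SQLite through Python's sqlite3 module rather than MySQL: SQLite's ALTER TABLE dialect supports ADD COLUMN and RENAME but not MySQL's CHANGE/MODIFY COLUMN, and the table and column names simply mirror the tutorial's.

```python
import sqlite3

# In-memory database standing in for the 'contacts' database used above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE names (contact_id INTEGER PRIMARY KEY, "
            "FirstName TEXT, LastName TEXT, BirthDate TEXT)")
cur.executemany("INSERT INTO names VALUES (?, ?, ?, ?)",
                [(1, "Yamila", "Diaz",    "1974-10-13"),
                 (2, "Nikki",  "Taylor",  "1972-03-04"),
                 (3, "Tia",    "Carrera", "1975-09-18")])

# ALTER: add an Age column; it starts out NULL for every existing row.
cur.execute("ALTER TABLE names ADD COLUMN Age INTEGER")

# UPDATE: set a value for one row, selected by a WHERE clause.
cur.execute("UPDATE names SET Age = 23 WHERE FirstName = 'Tia'")

cur.execute("SELECT FirstName, Age FROM names ORDER BY contact_id")
rows = cur.fetchall()
print(rows)   # [('Yamila', None), ('Nikki', None), ('Tia', 23)]
```

Note how only the row matched by the WHERE clause gets a value; the others keep the NULL that ALTER TABLE gave them, just as in the MySQL transcript.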
DELETE

mysql> DELETE FROM names WHERE Age=23;
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM names;
+------------+-----------+----------+------------+------+
| contact_id | FirstName | LastName | BirthDate  | Age  |
+------------+-----------+----------+------------+------+
|          2 | Nikki     | Taylor   | 1972-03-04 | NULL |
|          1 | Yamila    | Diaz     | 1974-10-13 | NULL |
+------------+-----------+----------+------------+------+
2 rows in set (0.00 sec)

A DEADLY MISTAKE...

mysql> DELETE FROM names;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM names;
Empty set (0.00 sec)

One more destructive tool... DROP TABLE

mysql> DROP TABLE names;
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW TABLES;
+--------------------+
| Tables_in_contacts |
+--------------------+
| address            |
| company_details    |
| email              |
| telephones         |
+--------------------+
4 rows in set (0.05 sec)

mysql> DROP TABLE address, company_details, telephones;
Query OK, 0 rows affected (0.06 sec)

mysql> SHOW TABLES;
Empty set (0.00 sec)

As you can see, the table 'names' no longer exists. MySQL does not give a warning before deleting or dropping, so be careful.
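The difference between a qualified DELETE, the "deadly mistake" of an unqualified DELETE, and DROP TABLE can be demonstrated in a few lines. Again this is a sketch using SQLite via Python's sqlite3 purely for illustration; the behavior shown (no confirmation, no warning) matches the MySQL session above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE names (contact_id INTEGER, FirstName TEXT, Age INTEGER)")
cur.executemany("INSERT INTO names VALUES (?, ?, ?)",
                [(1, "Yamila", None), (2, "Nikki", None), (3, "Tia", 23)])

# A qualified DELETE removes only the matching rows...
cur.execute("DELETE FROM names WHERE Age = 23")
n_after_qualified = cur.execute("SELECT COUNT(*) FROM names").fetchone()[0]
print(n_after_qualified)   # 2

# ...whereas DELETE without a WHERE clause silently empties the whole table.
cur.execute("DELETE FROM names")
n_after_unqualified = cur.execute("SELECT COUNT(*) FROM names").fetchone()[0]
print(n_after_unqualified)   # 0

# DROP TABLE removes the table definition itself, again without confirmation.
cur.execute("DROP TABLE names")
```

After the DROP, any further query against 'names' would raise an error, since the table no longer exists at all.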
FULL TEXT INDEXING and Searching

Since version 3.23.23, full-text indexing and searching has been available in MySQL. FULLTEXT indexes can be created from VARCHAR and TEXT columns, and FULLTEXT searches are performed with the MATCH function. The MATCH function runs a natural-language query against a text collection and returns a relevance value for each row in the table; the resulting rows are ordered by relevance. Full-text search is a very powerful way to search through text, but it is not ideal for small tables of text, where it may produce inconsistent results; it works best with large collections of textual data.

Optimizing your Database

Databases tend to get large at some point or another, and here the issue of database optimization arises. Queries take longer and longer as the database grows, and certain things can be done to speed them up.

Clustering

The easiest method is 'clustering'. If you run a certain kind of query often, it is faster if the database contents are arranged in the same way the data is requested. To keep the tables in sorted order you need a clustering index; some databases keep data sorted automatically.

Ordered Indices

These are 'lookup' tables of sorts. For each column that may be of interest to you, you can create an ordered index. Note that these optimization techniques impose a system load of their own, since the index must be rebuilt each time the data is rearranged. There are additional methods, such as B-trees and hashing, which you may like to read up on but which will not be discussed here.

Replication

Replication is the process by which databases synchronize with each other: one database updates its own data with respect to another, or according to criteria for updates specified by the programmer. Replication can be used under various circumstances; examples are safety and backup, or providing a copy of the database closer to certain users.

What are Transactions?
In an RDBMS, when several people access the same data, or when a server dies in the middle of an update, there has to be a mechanism to protect the integrity of the data. That mechanism is the transaction. A transaction groups a set of database actions into a single, all-or-nothing event: it either succeeds completely or fails completely. The definition of a transaction is captured by the acronym 'ACID':

(A)tomicity: even if an action consists of multiple steps, it is treated as one operation.
(C)onsistency: the database is in a valid and accurate operating state before and after a transaction.
(I)solation: processes within one transaction are independent and cannot interfere with those in others.
(D)urability: changes effected by a committed transaction are permanent.

To enable transactions, a mechanism called 'logging' is needed. Logging means the DBMS writes details of the tables, columns and results of a transaction, both before and after it runs, to a log file. This log file is used in the process of recovery.

To protect a database resource (e.g. a table) from being read and written simultaneously, several techniques are used. One is 'locking'; another is to put a 'timestamp' on each action. With locking, the DBMS must acquire locks on all resources needed to complete an action, and the locks are released only when the transaction completes. So if a large number of tables were involved in a particular action, say 50, all 50 tables would stay locked until the transaction finished. To improve things a bit, there is another technique called two-phase locking (2PL): locks are acquired only when needed, but are still released only when the transaction completes. This ensures that altered data can be safely restored if the transaction fails for any reason.
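The atomicity property described above — the funds-transfer either completes or does not happen at all — can be sketched concretely. This example uses SQLite via Python's sqlite3 for illustration (not MySQL's Berkeley DB tables); the 'accounts' table and the simulated failure are invented for the demonstration.

```python
import sqlite3

# Funds transfer as a transaction: both updates commit together, or neither does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # First half of the transfer: debit alice.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    # Simulate the server dying between the two halves of the transfer.
    raise RuntimeError("server died mid-transfer")
    # Second half (never reached in this run): credit bob.
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()
except RuntimeError:
    conn.rollback()   # the partial update (the debit) is undone

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)   # {'alice': 100, 'bob': 50} -- no money vanished
```

Because the failure occurred before COMMIT, the rollback restores the state from before the transaction began: alice is not debited without bob being credited.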
This technique can also result in problems such as 'deadlocks', where two processes that require the same resources lock each other up, each preventing the other from completing its action. The options here are to abort one of the processes, or to let the programmer handle it.

MySQL implements transactions by building the Berkeley DB libraries into its own code, so it is the source distribution of MySQL you would want to install here. Read the MySQL manual on enabling this.

Beyond MySQL
What are Views?

A view allows you to assign the result of a query to a new virtual table, given the name used in your VIEW query. Although MySQL does not support views yet, a sample SQL VIEW statement would look like:

CREATE VIEW TESTVIEW AS SELECT * FROM names;

What are Triggers?

A trigger is a pre-programmed set of actions that may be commonly required, executed automatically before or after a specified event occurs. Triggers are very useful: they increase efficiency and accuracy in performing operations on databases, and they increase productivity by reducing the time needed for application development. Triggers do, however, carry a price in processing overhead.

What are Procedures?

Like triggers, procedures (or 'stored procedures') are productivity enhancers. Suppose you need to perform the same action from programming interfaces to the database in, say, Perl and ASP. If the programmed action is stored at the database level, it obviously has to be written only once and can then be called by any programming language interacting with the database. Procedures can also be invoked from triggers.

Beyond RDBMS

Distributed Databases (DDB)

A distributed database is a collection of several logically interrelated databases located at multiple sites of a computer network. A distributed database management system manages such a database and makes the distribution transparent to the user. Good examples of distributed databases are those used by banks and by multinational firms with several office locations, where each distributed site works only with the data relevant to its own operations. DDBs have the full functionality of any DBMS. It is also important to know that a distributed database is considered to be one database rather than a set of discrete files, and that the data within it is logically interrelated.
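Although MySQL lacked views and triggers when this tutorial was written, other engines already supported them, so the two constructs described above can be tried out. This is a sketch using SQLite via Python's sqlite3; the table names ('names', 'audit') and the trigger are invented for the illustration, and the syntax shown is SQLite's, not MySQL's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (contact_id INTEGER PRIMARY KEY, FirstName TEXT)")
conn.execute("CREATE TABLE audit (note TEXT)")

# A view: the result of a query exposed under its own name as a virtual table.
conn.execute("CREATE VIEW testview AS SELECT FirstName FROM names")

# A trigger: a pre-programmed action fired automatically after an event
# (here, after every INSERT on 'names' it writes a note into 'audit').
conn.execute("""
    CREATE TRIGGER log_insert AFTER INSERT ON names
    BEGIN
        INSERT INTO audit VALUES ('added ' || NEW.FirstName);
    END""")

conn.execute("INSERT INTO names (FirstName) VALUES ('Tia')")

view_rows  = conn.execute("SELECT * FROM testview").fetchall()
audit_rows = conn.execute("SELECT * FROM audit").fetchall()
print(view_rows)    # [('Tia',)]
print(audit_rows)   # [('added Tia',)]
```

The application code never wrote to 'audit' directly; the trigger did it, which is exactly the efficiency and accuracy benefit the text describes.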
Object Database Management Systems or ODBMS
When the capabilities of a database are integrated with the capabilities of an object programming language, the resulting product is an ODBMS. Database objects appear as programming objects in an ODBMS. Using an ODBMS offers several advantages; the ones most readily appreciated are:

1. Efficiency: When you use an ODBMS, you use data the way you store it. You write less code because you are not dependent on an intermediary like SQL or ODBC, and you can create highly complex data structures directly through your programming language.

2. Speed: When data is stored the way you'd like it to be stored (i.e. natively), there is a large performance gain because no to-and-fro translation is required.

A Quick Tutorial on Database Normalization

Let's start off by taking some data represented in a table.

Table Name: College Table

StudentName  StudentID  StudentAdvisor   CourseID1  CourseTitle1              CourseProfessor1  CourseID2  CourseTitle2                 CourseProfessor2
Tia Carrera  400        Fred Flintstone  CS123      Perl Regular Expressions  Don Corleone      CS003      Object Oriented Programming  Daffy Duck
John Wayne   401        Barney Rubble    CS456      Socket Programming        DJ Tiesto         CS004      Algorithms                   Homer Simpson
Lara Croft   402        Seven of Nine    CS789      OpenGL                    Bill Clinton      CS001      Data Structures              Papa Smurf

The First Normal Form: (each column type is unique and there are no repeating groups [types] of data)

This essentially means that you identify data that can exist as a separate table, thereby reducing repetition and reducing the width of the original table. We can see that for every student, course information is repeated for each course; if a student took a third course, you would need to add yet another set of columns for CourseID, CourseTitle and CourseProfessor. So student information and course information can be considered two broad groups.
Table Name: Student Information
StudentID (Primary Key)
StudentName
AdvisorName

Table Name: Course Information
CourseID (Primary Key)
CourseTitle
CourseDescription
CourseProfessor

It's obvious that we have here a many-to-many relationship between students and courses.

Note: In a many-to-many relationship we need something called a relating table, which contains nothing but the pairs of keys recording which relationships exist between the two tables. In a one-to-many relationship we use a foreign key.

So in this case we need another little table:

Table Name: Students and Courses
SnCStudentID
SnCCourseID
The Second Normal Form: (all attributes within the entity should depend solely on the entity's unique identifier)

The AdvisorName under Student Information does not depend on the StudentID, so it can be moved to its own table.

Table Name: Student Information
StudentID (Primary Key)
StudentName

Table Name: Advisor Information
AdvisorID
AdvisorName
Table Name: Course Information
CourseID (Primary Key)
CourseTitle
CourseDescription
CourseProfessor

Table Name: Students and Courses
SnCStudentID
SnCCourseID

Note: Relating tables can be created as required.

The Third Normal Form: (no column entry should be dependent on any other entry (value) other than the key for the table)

In simple terms, a table should contain information about only one thing. In Course Information, we can pull the CourseProfessor information out and store it in another table.

Table Name: Student Information
StudentID (Primary Key)
StudentName

Table Name: Advisor Information
AdvisorID
AdvisorName

Table Name: Course Information
CourseID (Primary Key)
CourseTitle
CourseDescription

Table Name: Professor Information
ProfessorID
CourseProfessor
Table Name: Students and Courses
SnCStudentID
SnCCourseID

Note: Relating tables can be created as required.
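The normalized schema built up above can be exercised end-to-end, showing how the relating table resolves the many-to-many link between students and courses. This sketch uses SQLite via Python's sqlite3 for illustration, and only populates one student and her two courses from the College Table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The normalized tables from the tutorial (advisor/professor tables omitted).
    CREATE TABLE student_information (StudentID INTEGER PRIMARY KEY,
                                      StudentName TEXT);
    CREATE TABLE course_information  (CourseID TEXT PRIMARY KEY,
                                      CourseTitle TEXT);
    -- The relating table: nothing but key pairs.
    CREATE TABLE students_and_courses (SnCStudentID INTEGER,
                                       SnCCourseID TEXT);

    INSERT INTO student_information VALUES (400, 'Tia Carrera');
    INSERT INTO course_information  VALUES ('CS123', 'Perl Regular Expressions'),
                                           ('CS003', 'Object Oriented Programming');
    INSERT INTO students_and_courses VALUES (400, 'CS123'), (400, 'CS003');
""")

# Reassemble the original wide row by joining through the relating table.
rows = conn.execute("""
    SELECT s.StudentName, c.CourseTitle
    FROM student_information s
    JOIN students_and_courses sc ON s.StudentID = sc.SnCStudentID
    JOIN course_information  c  ON sc.SnCCourseID = c.CourseID
    ORDER BY c.CourseID""").fetchall()
print(rows)
```

A third course for the same student now costs one extra row in the relating table rather than three extra columns in every row of the College Table, which is the whole point of First Normal Form.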