1.课程简介
(1)教师是Rachid Hamadi,名字发音是Ra’shid。
(2)课程主要是理论,少量编程,使用开源软件PostgreSQL 9.3.3,这个软件是engineer-based,市面常见的是web-based MySQL。
(3)共3个assignment:
– Assignment 1: ER Diagram / Relational Mapping (10%) (due Week 5)
– Assignment 2: SQL Queries and Functions (15%) (due Week 8)
– Assignment 3: Normalisation / Relational Algebra / Transaction (15%) (due Week 11)
database的一个核心是消除redundancy,一条数据不要重复储存。database最常见的问题是如何同时处理数据,比如,在一个账户下多人同时save和withdraw。
(4)9311的后续课程
• COMP9315: how to build relational DBMSs (write your own Oracle)
• COMP9318: techniques for data mining (discovering patterns in DB)
• COMP6714: information retrieval, web search (dealing with text data)
• COMP9319: web search and data compression (dealing searching compressed web data)
• COMP932(1|2|3): service-oriented computing, which relies on DB background
课件中在大标题上经常能看到“cont’d”,含义是continue
2.Data Modeling
(1)什么是database
最简单的定义是Database is a set of tables that data are related to each other. 换言之,无关联的数据不应该用数据库处理。
Textbook的定义:a collection of related data
(2)database中的重要概念
1.Data:能描述所代表的对象特征
2.Relationships:定义数据间的关联
3.Constraints:关于数据和关联的限制,比如bank teller不能修改客户的账户金额
4.Redundancy: 每条数据只在一处存储
5.Data manipulation:常见declarative和procedural两类,data manipulation是数据操作,比如筛选数据、插入、删除。declarative programming描述对象的性质而不是流程,主要在逻辑层面构建管理数据库,procedural programming起源于structured programming,基于程序调用(procedure call)或函数调用(function call)进行流程控制。
6.Transactions:多条数据的同时处理,一条数据的多用户处理,以及引发的数据集聚效应
7.Concurrency:多用户同时调用一个数据
8.Scale:big data被应用的越来越多,如何构建成规模的数据库变成越来越重要的问题
(3)什么是data
Textbook的定义:Known facts that can be recorded and have explicit meaning. 这意味着数据应该有数据类型和存储方式等信息。
孤立存在的数据没有什么价值,有关联的数据形成确定的数据结构,再通过database management system(DBMS)管理就实现了数据的价值。
(4)什么是DBMS
Textbook的定义:a collection of programs that enables users to create and maintain a database.
(5)什么是database system
Textbook的定义:the database and DBMS together.
按照范围来记忆,上文提到最小的单元是data,data关联起来后形成database,构建对database的管理形成DBMS,database和DBMS加和构成了database system。即
data < database < DBMS < database system
(6)Database Users
1.Database Administrator(DBA):管理数据库权限,协调服务用户,维护数据库的软硬件,提升数据库的效率
2.Databse Designer:定义数据结构,数据限制,数据传输,要直接对终端客户的需求负责
3.End Users:数据查询,数据更新等。其下又分为4类用户--a. Casual,偶尔使用数据库;b. Naive,占据绝大多数的end users,他们不需要了解数据库的底层构建,而是在顶层使用,例如之前提到的bank-teller;c. Sophisticated,这些用户非常熟悉database,他们的工作和database密切相关,例如商业分析师、科学家和工程师;d. Stand-alone,维护使用自己的数据库,并不分享,例如个人通讯录
(7)DBMS
课程核心是学会SQL-Structured Query Language,具体使用什么样的DBMS不重要,一般Relational database system is more efficient and flexible than object-oriented database system.
注:PL-Procedural Language
1.Oracle:resource hungry, non-flexible
2.MySQL:开源,高效,web-based,但oracle控制了MySQL,未来可能解除开源的属性
3.PostgreSQL:课程使用
4.MongoDB:开源,NoSQL database program
(8)数据库开发流程
database application development:
1.analyse application requirements
2.develop a data model to meet these requirements
3.define operations (transactions) on this model
4.implement the data model as relational schema
5.implement transactions via SQL and PLs
6.construct a web interface to these transactions
分析需求-建立模型-定义操作-实现关系模型-实现操作-构建交互界面
(9)数据库语言
1.Request to DBMS(e.g. SQL):
data manipulation language, DML (data level);
data definition language, DDL (data structure constraints);
create and drop database, indexes, functions
2.Results/effects from DBMS requests
tuples or sets of tuples
changes to underlying data store
(10)Data Modeling
1.Data modeling的目标:describe information, relationships, constraints.
2.Data modelling is a design process, it converts requirements into a data model.
3.Data modeling类型:logical(抽象的概念设计,比如Entity Relationship, Object Definition Language),physical(实用型,例如relational)。一般先design using abstract mode,再 map to physical model。
4.design ideas:从简单的信息入手,先找到对象然后分析关联,一般自然语言的名词是数据,而动词是关联,考虑所有可能的数据。没有完美的design,design时要多考虑正确性correctness,完整性completeness,一致性consistency。
Example Gmail
最重要的是分析model的核心要素:information/data, relationships, constraints
Data Objects:
1.Users--username, password, security questions, backup email, phone, etc.
2.Emails--subject, body, from, to, cc, bcc, date-sent, status, etc.
3.Attachments--content-type, content, etc.
4.Sessions--user, expiry-time, location, etc.
5.Labels--name, color, etc.
Relationships:
Email-has-Label
Email-has-Attachment
Email-is a reply to-Email
3.ER Notation(以Textbook的Entity Relationship notation为准)
(1)ER Model
世界是由互相关联的entities构建起来的。
ER的三个核心要素是:
attributes属性--data item describing a property of interest
entity实体--collection of attributes describing object of interest
relationship关联--association between entities (objects)
(2)ER Diagrams
ER diagrams are a graphical tool for data modelling.
It is consist of Data, Relationships, Attributes, Connections among them.
图中有3种一维几何图形(线段,加粗线段,箭头)和3种二维几何图形(矩形,椭圆,菱形)
1.椭圆,代表attributes
2.矩形,代表data/entities
3.菱形,代表relationships
4.线段,代表participation中的”部分关系“
5.加粗线段,代表participation中的”完全关系“
6.箭头,代表cardinality中的”1“;非箭头,代表cardinality中的”多“
7.加横线的椭圆,代表key attributes,一般是unique的
另外
1.entity/data通常用单数名词表示一类概念
2.如果一个椭圆是可以依据其他椭圆计算得到的,则用虚线椭圆表示,比如年龄可以用出生日期计算。叫做derived attributes
3.如果一个椭圆包含多个椭圆,则用双线椭圆表示,比如最喜欢的食物,其中包含多个attributes。叫做multivalued attributes
(3)entity sets
一种是extensional view,通过同类事物扩展得到一个entity set;
另一种是intensional view,通过抽象事物得到一个class的entity set。
(4)attributs的keys
1.key/superkey指的是attributes中比较独特可以代表object的一个/一组,从superkey中如果删除一个attribute,往往还可以代表object;
2.candidate key值得也是attributes中比较独特可以代表object的一个/一组,但是如果从candidate key中删除一个attribute,不能再代表object;
3.primary key是由databse designer选择的
注:ER图中key用下划线标注。
(5)relational sets
Relationship被定义为an association among several entities;
Relationship set被定义为collection of relationships of the same type;
Degree = # entities involved in relationship (in ER model, ≥ 2);
Cardinality = # associated entities on each side of relationship;
Participation = must every entity be in the relationship
(6)Subclasses and Inheritance
1.A subclass of an entity set A is a set of entities:
• with all attributes of A, plus (usually) its own attributes
• that is involved in all of A's relationships, plus its own
2.Properties of subclasses:
• overlapping or disjoint
• total or partial