Fundamentals of Databases

These notes cover:

  • What a database is, and
  • What a database management system is, and
  • What the major topics of this course are.

Database: a model of reality.

Why use models?

  • Models can be useful when we want to examine or manage part of the real world.
  • Costs of using a model are often considerably lower than the costs of using or experimenting with the real world itself.
  • Example models:
    • map
    • model of US Economy
    • tsunami warning system
    • traffic simulation
  • Message to Model Makers
    • a model is a means of communication
    • users of a model must have a certain amount of knowledge in common
    • North is up, true to scale, colors right, etc.
    • a model
    • only emphasizes selected aspects
    • is described in some language
    • can be erroneous
    • may have features that do not exist in reality
    • contour-lines, international date line, etc.

Database Management System (DBMS)

  • Reasons to Use
    • Data Intensive apps (as opposed to process-intensive apps). Data Intensive: a lot of data flows between memory and secondary storage, and a lot of data may be communicated between user and database.
    • Persistent storage of data
    • Centralized Control of Data
    • Control of Redundancy – limits of prescribes duplications in data
    • Control of Consistency and Integrity
    • Multiple User Support (tens or hundreds of thousands of users, flight reservation systems, point of sales, etc)
    • Sharing of Data – key to communication
    • Data Documentation – databases cannot run without definitions of the data structure, and so contain their own documentation
    • Data Independence – ability to change the implementation of a database to be more efficient without changing the user interface
    • Control of access and security
    • Backup and recovery
  • Reasons Not to Use
    • the initial investment in hardware, software, and training is too high
    • generality is not needed – examples:
    • Overhead for security, concurrency control, and recovery is too high
    • data and applications are simple and stable
    • real-time requirements cannot be met by it
    • multiple user access is not needed

Outline of Major Topics

  1. Data modeling
  2. Process modeling
  3. Database Efficiency

Data Modeling

  • The Model represents a perception of structures of reality
  • The data modeling process is to …
    • fix a perception of structures of reality (accomplished using Extended Entity Relationship Model), and
    • represent this perception (accomplished using Relational Model).
  • In this process, we select aspects and we abstract.

Process Modeling

  • Fixing and representing a perception of processes of reality.
  • Processes may be represented:
    • Embedded in program code (Data Manipulation Language, SQL, program code)
    • Executed ad-hoc (use Data Manipulation Language, SQL, directly)

Data Models, Database Architecture, and Database Management System Architecture

Data Models

  • Data Structures
  • Constraints
  • Operations
  • Keys and Identifiers
  • Integrity and Consistency
  • Null Values
  • Surrogates

Database Architecture

  • ANSI/SPARC 3-Level Database Architecture
  • Data Independence

Database Management System Metadata

Data Model - A data model is not the same as a model of data.

  • We defined a database to be a model of structures of reality.
  • A data model is a tool or formalism that we use to create such a model.

Three Elements must be kept in mind when discussing data models

  • Data Structures
  • Integrity Constraints
  • Operations

Examples of Data Models

  • Entity-Relationship Model – fixes a perception of reality
  • Relational Model – implements a model in a DBMS
  • Hierarchical Model – implemented in the first DBMS, underpins XML today

Relational Model

  • Data is represented in tables
  • Table name, column name, and datatypes constitute the database’s “Schema”
  • Schema represents aspects of the data that are stable over time, IE, not expected to change over time. Schema represents structure.
  • Rows of the table represent the “State” of a table. State represents the state of reality the database represents.
  • Constraints express rules that cannot be expressed by the data structures alone. Constraints are not included in the columns/datatypes. Examples:
    • Emails must be unique
    • Emails are not allowed to be NULL
    • BirthDate must be after 1900-01-01
    • Hometown must be cities in the US
  • Operations support change of and retrieval of data (INSERT and SELECT)
  • Keys are uniqueness constraints
    • Making email the primary key in a table will force all emails to be unique in the table
  • Integrity and Consistency
    • Integrity: does the database reflect reality well?
    • Consistency: is the database without internal conflicts? IE different cities in “Address” and “CurrentCity” fields.
  • NULL Values
    • NULL values can represent an unknown piece of information, which is okay.
    • NULL values can also represent an inapplicable piece of information, which is not okay. It indicates that the table should probably be redesigned.
  • Surrogates – Things and Names
    • Name-based representations: the piece of information is what is known about it, and no more. The fields can change.
    • Surrogate-based representations: surrogates identify each piece of information. They are system-generated, unique, internal identifiers that never change.

ANSI/SPARC 3-Level Database Architecture

  • A database is divided into schema and data
    • The schema describes the intension (types)
    • The data describes the extension (data)
  • ANSI/SPARC separates out the schema into three levels
    • External Schema – outermost layer(s), one for each application. Focused on the use of data.
    • Conceptual Schema – Focused on the meaning of data.
    • Internal Schema – Focused on the storage of data.

Conceptual Schema

  • Describes all conceptually relevant, general, time-invariant structural aspects of reality
  • Excludes aspects of data representation, physical organization, and access
  • Applications can only “see” these structures

External Schema

  • Describes parts of the information in the conceptual schema in a form convenient to a particular user group’s view
  • It is derived from the conceptual schema
  • These are “views,” or windows, into the conceptual schema, designed to fit a particular application’s use case.

Internal Schema

  • Describes how the information described in the conceptual schema is physically represented to provide the best overall query performance.
  • May have a special sort in place, to facilitate lookups
  • If there are frequently queries run on a particular field, such as Salary, there might be an index defined on Salary that enables logarithmic time access
    • These be removed with no visible impact from the level of the external schema.

Conceptual vs External vs Internal Schema – Office Hour 1/24/19

Conceptual Schema

  • Overall logical model of the piece of the universe/reality.
  • Consists entirely and only of concepts.
  • No info about how files are laid out, indexed, stored, sorted, etc.
  • Logical, conceptual model of reality.
  • Constitutes the whole and complete single model of the piece of reality we are working with in the database.

Internal Schema

  • Implementation of the Conceptual Schema
  • Contains the structure and organization for every single one of the structures in the conceptual schema.
  • Includes things such as sorting
  • One item in the internal schema might be two in the conceptual, and vice versa.
  • In rare cases, may be identical to the conceptual schema.

External Schema

  • There are multiple external schemata
  • Each is a subset that is logically derived from the conceptual schema using operators like SQL or relational algebra/calculus queries
  • Defines a subset or view of the database
  • Conceptual schemas contain one or more subsets or views
  • Applications that are developed to run on the database are developed to run on the external schema or on the conceptual schema.
  • Since sorting, storage, physical structures, locations, indexing is not unveiled in the conceptual schema and SQL queries only run on the conceptual schema, there is no way to come to any conclusions about the way the data is stored.
    • This is the Physical Data Independence that is allowed by the separation of Conceptual schema and internal schema

Two types of independence

  • Physical Data Independence – a measure of how much the internal schema can change without affecting the application programs.
  • Logical Data Independence – a measure of how much the conceptual schema can change without affecting the application programs. Logical data independence is more difficult to provide than Physical.

Metadata

System Metadata:

  • Where data came from
  • How data were changed
  • How data are stored
  • How data are mapped
  • Who owns data
  • Who can access data
  • Data usage history
  • Data usage statistics.

Business metadata:

  • What data are available
  • Where data are located
  • What does the data mean
  • How do you access the data
  • Predefined reports
  • Predefined queries
  • How current the data are

Metadata is very important

  • Systems metadata are critical in a DBMS
  • Business metadata are critical in a data warehouse