Basic Definitions

Database

An organized collection of information (nowadays almost invariably electronic). The rest of this section will focus on relational databases

Table

A structure (on disk) that consists of rows and columns. The rows are also called records, the columns are also called fields. In relational databases, the table is a fundamental building block. A database consists of one or more tables, which are related (conceptually linked) to each other.

For example, a table of Students will have the fields Social Security Number, first name, last name, date of birth, etc. There will be one row (record) for each student.

In E.F. Codd's original papers "relation" was a synonym for table. But now it refers to the fact that different tables are logically linked, or "related", to each other, as we shall see shortly.

Datatypes

A field must have a definite datatype. Examples of datatypes are:

"Numeric" data such as phone numbers and SSNs should be stored as text, not as numbers because leading zeros are significant, and arithmetic operations on such numbers are meaningless.

Entities and Relationships

When planning a database, one needs to identify Entities (the things about which we want to capture information) and the Relationships (logical links) between them. Relationships between entities are one-to-one, one-to-many, or many-to-many. The last is implemented through a bridge table between entities, as shown below.(In the diagram, the infinity symbol implies "many".)   One student can take many courses, and the same course can be taken by multiple students, so Students and Courses are related many-to-many to each other.

schema.jpg (10547 bytes)

Index

A data structure (also on disk) that supports fast searching and sorting on one or more fields of a table. "Fast" implies that search time is proportional to the logarithm of the number of records. A table can have multiple indexes, each one used as needed by the database engine. Indexing technology antedates relational databases, but indexing is what makes database engines far superior to spreadsheets in dealing with vast volumes of data: for high-end database engines, the volume of data is limited, for practical purposes, to whatever willl fit on your disk. For large volumes of data, logarithmic search can be hundreds or thousands of times faster than a linear search.

Key

A combination of one or more fields that are used for indexed search.

A candidate key: a combination of one or more fields whose value uniquely identifies a record in a table-that is, no two records in a table can have the same key value. When you create an index on one or more fields, you can specify that the index must contain unique values only. In such a case, any attempt to create a duplicate value will cause the database engine to complain.

Primary Key: A table may have one or more candidate keys, but in any case one of these keys must be designated as the primary key. The concept of primary keys is fundamental to good table design. A good choice for the primary key of a table is a field that is  reasonably short, and (preferably) changes rarely, if at all. For people in the USA, a good choice of primary key is social security number, since it is both unique and never changes. Names are not a good choice, because there may be more than one individual with the same first and last names (and names may change).

Designating a primary key for a table results automatically in a unique index being created on the primary key field/s. 

In the schema diagram above, the fields in bold indicate the primary key for each table. Note that the table Students_Courses has a primary key consisting of two fields, student SSN and course ID.

It is not absolutely essential that every table have a primary key, though it helps to think of the combination of fields in a table that could uniquely identify a record. However, if you are going to use the key values from one table in another table, the first table must have a primary key.

Foreign Key: When a key in one table is referenced in another table, the key in the second table is called a foreign key. (In the Students_Courses table in the diagram, the fields Soc_Sec_No and Course_ID are, respectively, foreign keys into the Students and Courses tables.) The concept of foreign keys is critical to good database design; primary keys and foreign keys together form the basis of inter-table links.

Referential Integrity

Referential integrity implies several things:

The referential integrity dialog activated by clicking on a link between two tables in MS-Access (here, the link between Students and Student_Courses) is shown below.

ref_integrity.jpg (29319 bytes)

 

Equijoins and Outer Joins

(In the diagram above, the dialog on the right is activated by clicking on the "Join Type..." button in the dialog on the left.) In almost all cases, the purpose of the link between the two tables (called a "join") is intended to link only matching values. Thus, if we are looking at data that combines rows from Students and Students_Courses, records in the "one" (Students) table that lack counterparts in the "many" (Students_Courses) table will not be visible. However, if we wish to look at all students, including those who are not currently taking any course, we resort to an "outer join". An outer join is indicated by options 2 and 3 in the right-hand dialog above.

Artificial Primary Keys (Surrogate Keys): Autonumber Fields

Very often, a combination of two or three fields will determine the primary key. If this key is to be referenced in other tables, then, in order to save space as well as improve performance, database engines will often allow the designer to create a field that contains a value (a machine-generated Unique Identifier or UI) that increments each time a new record is added, and  is guaranteed to be unique (such values will not be recycled if records are deleted). Depending on the package, such fields are called autonumber fields or identity fields. The artificial primary key created by using an autonumber fields is also called a surrogate key. Surrogate keys have the advantage that, because they are "meaningless", they do not change their meaning and so do not need to be changed if the semantics of a key value changes (so cascading updates, which can be resource-intensive, are not necessary).

An example is a database for recording outpatient clinical patient data. We have the following tables:

  1. Basic patient demographics: candidate key SSN
  2. Visits: candidate key SSN and visit date/time
  3. Outpatient clinics/sites visited: candidate key SSN, visit date, clinic name
  4. Details of findings at each clinic site: candidate key SSN, visit date, clinic name, finding code.

Here, each table is related many-to-one to the previous one. To save space, we can create artificial primary keys for tables 2 (a visit ID) and 3 (a visit-clinic ID). The modified schema then becomes:

  1. Basic patient demographics: primay key SSN
  2. Visits: primary key: visit ID (SSN and visit date/time are still recorded)
  3. Outpatient clinics/sites visited: primary key: visit-clinic ID. Clinic ID is still recorded, as is visit ID.
  4. Details of findings at each clinic site: primay key: visit-clinic ID, finding code.

The modified schema is shown below:

autonumber.jpg (12302 bytes)

Drawbacks of Surrogate Keys

There is only one potential drawback of autonumber fields: the unique identifiers apply only to the database in which they reside.  If several people or groups are sharing data which is being generated from heterogeneous sources, the only way to use UIs is to designate a central location that will assign UIs across common data pooled across all groups.

There are also some situations where the use of machine-generated UIs (as opposed to natural UIs such as Social Security Number) can result in a major amount of work. If two different organizations decide to merge, their databases also have to be merged. In this circumstance, there will almost certainly be a clash of UIs— thus, product 1235 may mean Widgets in the Products database of organization A and Gidgets in the Products database of organization B. After the merger, considerable data massaging (and the assignment of new IDs) has to be done in order to ensure that each product ID is unique once more. (This is one of the reasons why, after a merger, a business rarely gets its data processing act together until several months have elapsed.)

Constraints

Constraints are "rules" that limit a given field's permissible values. The most common types of constraints, which we have already addressed earlier, are data type constraints (a field must be of a particular data type) and referential integrity constraints (a field's values must be limited to those occurring in a primary key of another specified table).

However, constraints can be far more general. You may, for example, determine that a field, CourseGrade, that sets the grade for a student, must be one of the text values: (A+, A, A-,B+,B, B-,C+, C, C-, D+, D, D-, F), or that a person's age must be between 0 and 121 years.

The ease with which you can specify constraints depends a lot on the database engine you are using. In Microsoft Access, for example, there is support for an "Input Mask", which activates during data entry, and automatically presents a kind of template for input data such as phone numbers, social security numbers and the like. The template for phone numbers includes parentheses for area code and a hyphen separating the exchange code from the last four digits. Entry of alphabetical characters in a phone number is prevented, as is entry of more (or less) than the requisite number of digits.  

In the high-end databases, which allow you to write program code using a highly expressive programming language like Java, C# or Visual Basic,  there are no limits to how complex or sophisticated a constraint can be: you can support any constraint that you are capable of programming.