CPS 430/542 Lecture notes: Normalization



Coverage: [FCDB] §§3.6-3.7 (pp. 102-127)


The BIG picture

  • ideas → E/R → relations → better (normalized) relations
  • why study FD's? inferring FD's is very important to identifying flaws in the design of a database
  • final goal: Boyce-Codd Normal Form (BCNF)


Students relation

  • key: {id, advisor_id}
  • FD's: {idname level favorite_advisor, advisor_idadvisor_office}
  • name | id | level | advisor_id | advisor_office | favorite_advisor
    ------------------------------------------------------------------
    Mark   1    Senior  349          AN151            350
    ???    1    ???     350          MH134            ???
    Kathy  2    Senior  146          AN240            146
    ???    1    ???     351          AN130            ???
    ???    1    ???     352          AN131            ???
    ???    2    ???     351          ???              ???
    David  3    Junior  349          ???              349
    ------------------------------------------------------------------
    
    `???' denotes redundant, i.e., can be inferred from other tuples


Sources of redundancy

  • update anomaly (more serious of the two): what happens if we update the favorite_advisor of Mark?
  • deletion anomaly: if 350 is not the advisor for anybody, we lose their room info!


Root cause of the problem

  • main idea: if each FD is not in the form `key → {all other attributes}' then the relation has too much stuff in it
  • specifically, each FD has a lhs which is a proper subset of the key! (a BCNF violation)


Normalization

  • process of making relations better by decomposing them into smaller relations to
    • reduce redundancy
    • eliminate update anomalies
    • eliminate deletion anomalies
  • final goal: all relations in Boyce-Codd Normal Form (BCNF)


Boyce-Codd Normal Form

  • the lhs of every nontrivial FD is a super-key
  • above anomalies are guaranteed not to exist in relations which satisfy this condition
  • formally, R is in BCNF if for every nontrivial FD XA in R, X is a superkey
  • advantages
    • removes redundancy
    • removes update anomalies
    • removes deletion anomalies


Proof that every two-column relation is in BCNF

  1. there are no nontrivial FD's; key: {A, B}; R in BCNF
  2. AB is the only nontrivial FD; key: {A}; R in BCNF
  3. BA is the only nontrivial FD; key: {B}; R in BCNF
  4. AB, BA are the only nontrivial FD's keys: {A}, {B}; R in BCNF


Decomposition into BCNF

  • need to ensure that we can reconstruct the decomposed relation
  • therefore, we cannot just break a relation schema into a collection of two-attribute relations
  • process:
    1. find a nontrivial FD where the lhs is not a superkey; make sure the rhs is maximally expanded
    2. split the original relation into two new relations
      • one containing only the attributes from both sides of the violating FD
      • the other containing all attributes from the original relation except the rhs of the violating FD


Decomposing Movies

Movies(title, year, length, filmType, studioName, starName)
  • has update and deletion anomalies
  • is it in BCNF? no; violating FD: title year → length, filmType, studioName
  • decompose into
    • Movies1(title, year, length, filmType, studioName)
    • Movies2(title, year, starName)
  • are these relations in BCNF?
  • must determine its complete set of FD's by computing the closure of each subset of its attributes, using the full set of FD's satisfied by the decomposed relation
  • does Movies2 have an update anomaly?


Another Movie example

  • MovieStudio(title, year, length, filmType, studioName, studioAddr)
  • FD's: title year → length filmType studioName studioAddr, studioName → studioAddr
  • key: {title, year}
  • decomposition:
    • MovieStudio1(title, year, length, filmType, studioName)
    • MovieStudio2(studioName, studioAddr)


Is one application sufficient? No

  • StudioPres(title, year, studioName, president, presAddr)
  • FD's:
      title year → studioName
      studioName → president
    + president → presAddr
    --------------------------
      title year → studioName president presAddr
      studioName → president presAddr
      president → presAddr
    
  • key: {title, year}
  • decomposition:
    • StudioPres1(title, year, studioName)
    • StudioPres2(studioName, president, presAddr)
  • StudioPres1 is in BCNF; key: {title, year}
  • StudioPres2 is not in BCNF;
    • key: {studioName}
    • violating FD: president → presAddr
  • solution: decompose again
  • final relational schema
    • R1(title, year, studioName)
    • R2(studioName, president)
    • R3(president, presAddr)


Reconstructing a relation from a decomposition

  • join the relations on the common attributes
  • guaranteed to produce the original relation if we decomposed based on the BCNF decomposition (a violating FD)
  • decomposing in a way not based on an FD:
    -----
    A B C
    -----
    1 2 3
    4 2 5
    -----
    
    ---
    A B
    ---
    1 2
    4 2
    ---
    
    ---
    B C
    ---
    2 3
    2 5
    ---
    
    after joining, we get
    -----
    A B C
    -----
    1 2 3
    1 2 5
    4 2 3
    4 2 5
    -----
    


Third Normal Form (3NF)

  • consider:
    • Bookings(title, theatre, city) with
    • FD's: {theatre → city, title city → theatre}
  • FD theatre → city violates the condition for BCNF
  • decompose:
    • R1(theatre, city)
    • R2(theatre, title)
    • --------------------
      theatre | city
      --------------------
      Guild   | Menlo Park
      Park    | Menlo Park
      --------------------
      
      -----------------
      theatre | title
      -----------------
      Guild   | The Net
      Park    | The Net
      -----------------
      
      ------------------------------
      theatre | city       | title
      ------------------------------
      Guild   | Menlo Park | The Net
      Park    | Menlo Park | The Net
      ------------------------------
      
  • now FD title city → theatre is not preserved
  • solution: slightly relaxing BCNF requirement leads to the third normal form (3NF)
  • a relation is in 3NF if for each nontrivial FD XY, either X is a superkey or Y is prime (a member of some key)
  • decomposition into relations in 3NF
    • does not lose any information, and
    • allows FD's to be verified,
    • but if the relations are not also in BCNF, then there will be some redundancy in the schema


Recap

  • the decomposition into BCNF provides a lossless join decomposition, i.e., we can reconstruct the tuples of the original relation by joining
  • the BCNF decomposition however does not preserve dependencies
  • 3NF is weaker than BCNF
  • decomposition into 3NF (not covered)
    • preserves dependencies, and
    • provides a lossless join,
    • but does not guarantee the elimination of redundancy (unless, of course, the relations are also in BCNF)


Good properties of breakups (if such things exist :-)

  1. removes redundancy
  2. lossless join decomposition
  3. dependency preservation
  • BCNF guarantees 1 & 2
  • 3NF guarantees 2 & 3
  • no normal form guarantees all 3


Interesting example

    ----------------------
    name   address  car
    ----------------------
    Carla   1       Honda
    Carla   1       Toyota
    Carla   2       Honda
    Carla   2       Toyota
    Chuck   1       Ford
    Chuck   1       Chevy
    Chuck   3       Ford
    Chuck   3       Chevy 
    ----------------------
    
    There are no nontrivial FD's in this relation.

    Is this relation in BCNF?

    Is there any redundancy?

    for one person, each address repeated for each car
    means attribute address is independent of attribute car
    specified by MVD: name →→ address


Multivalued dependencies (MVD's)

  • assertion that two attributes or sets of attributes are independent of each other
  • schematic view
  • written A →→ B
  • MVD's are a generalization of FD's: all FD's are MVD's, why?


Every FD is an MVD

    Every FD XY is also an MVD X →→ Y:

    Consider the following relation which satisfies the FD idname:
    -----------------------
    id  name    likes_color
    -----------------------
    1      X            red
    1      X           blue
    2      Y            red
    2      Y           blue
    -----------------------
    
    We want to see if this relation also satisfies the MVD id →→ name.

    To do that we need to find a tuple v which
    • agrees with two tuples t and u on the lhs of the MVD (id),
    • agrees with t on the rhs of the MVD (name), and
    • agrees on all else (likes_color) with u.
    ----------------------------------
               id  name    likes_color
    ----------------------------------
    (t)        1      X            red
    (u & v)    1      X           blue
    ----------------------------------
    

    The tuple u can be such a tuple v. Tuple v agrees with both t and u on id, it agrees with t on name, and it agrees with u on likes_color. Must find such a tuple v for every id.

    Therefore this relation satisfies the MVD id →→ likes_color.

    In general, every FD is also a MVD. You can find never an FD satisfied by a relation R that is not also a MVD satisfied by R.

    Note that every MVD is not necessarily an FD.


Rules for MVD's

  • MVD's do not obey all of the rules of FD's
  • MVD's do not obey the splitting/combining rule
    • i.e., X →→ Y and X →→ Z does not imply X →→ Y Z
    • e.g., name →→ street does not hold in StarsIn
  • complementation rule
    • means name →→ car
    • has no analog in the world of FD's
  • transitive rule
  • every FD is a MVD; why?
  • trivial dependencies rule (resembles reflexivity)


Fourth Normal Form (4NF)

  • eliminates the redundancy caused by MVD's
  • nontrivial MVD A →→ B
    • none of the Bs are among the As
    • ABR
  • condition: essentially BCNF condition applied to MVD's
    • for all nontrivial MVD's A →→ B, A is a superkey
    • means every BCNF violation is a 4NF violation
    • every relation in 4NF is in BCNF


Decomposition into 4NF

  • analogous to decomposition in BCNF; just string replace acronym FD with MVD
  • Cars example
    • now MVD name →→ address is trivial
    • ditto for name →→ car
  • how to determine the new MVD's? there is no easy way (such as computing the closure in BCNF)


Relationship of normal forms

  • concentric circles: 4NF ⊂ BCNF ⊂ 3NF
  • 1NF: requires every column to have an atomic value
  • 2NF: primarily historical at this point
  • 5NF
    • further generalization of MVD's to JDs
    • beyond the scope of CPS 430/542


Good properties of breakups (final)

  1. removes redundancy caused by FD's
  2. removes redundancy caused by MVD's
  3. lossless join decomposition
  4. preserves FD's
  5. preserves MVD's
  • 4NF guarantees 1, 2, & 3
  • BCNF guarantees 1 & 3
  • 3NF guarantees 3 & 4
  • no normal form guarantees all 5


Goal

  • BCNF (no redundancy due to FD's & lossless join)
  • if not possible, aim for
    • 3NF
    • lossless join
    • dependency preservation


Frequenty asked question (and summary)

If a relation R is in BCNF, then it is automatically in 3NF because 3NF is a weaker normal form than BCNF. Then does not BCNF preserve dependencies like 3NF if all relations in BCNF are also in 3NF?

First, we must realize we are talking about two different things (`normal forms' and `decomposition algorithms') and we should not confuse the two.

When we talk of normal forms, the following holds:

  • the set of relations in 4NF is a proper subset of the set of relations in BCNF
  • the set of relations in BCNF is a proper subset of the set of relations in 3NF

This means any relation in BCNF is automatically in 3NF.

Now, when we talk of properties of `decompositions' (not normal forms), there is no such subset relationship.

We say the `decomposition into BCNF' removes redundancy (but for all we know, the smaller relations resulting may not be in BCNF at all (and we saw such an example in class). In such cases, we must apply the decomposition algorithm again.

We say the `decomposition into 3NF' preserves dependencies. Saying that relations in 3NF preserve dependencies is nonsense (what would that mean?).

When we talk of good properties of breakups, we are talking about `decompositions' NOT normal forms. We say that the `decomposition procedure into 3NF' preserves FD's, but that has nothing to do with saying that all relations in BCNF are also in 3NF. You will notice that we never showed the `decomposition procedure into 3NF.' We only covered the breakup procedure into BCNF and 4NF.

One can never find a relation in BCNF that is not in 3NF.

In summary, a relation is in

  • 3NF, if for every nontrivial FD XY, X is either a superkey or Y is part of a key
  • BCNF, if for every nontrivial FD XY, X is a superkey
  • 4NF if for every nontrivial MVD X →→ Y, X is a superkey

To decompose into

  • 3NF (not covered in this course)
  • BCNF: find a violating FD XY
    • create a new relation R1(X, Y)
    • create a new relation R2(X, R-Y)
  • 4NF: find a violating MVD X →→ Y
    • create a new relation R1(X, Y)
    • create a new relation R2(X, R-Y)


References

    [FCDB] J.D. Ullman and J. Widom. A First Course in Database Systems. Prentice Hall, Upper Saddle River, NJ, Second edition, 2002.

Return Home