Data Modeling Fundamentals

What is Data Model?

Unlike a real database or a real big data environment in the real world, the data model doesn’t have data in it. It is a representation of what that data actually is in the real world. It provides us with a great deal of insight into a lot of the characteristics and rules that apply to our data.

  • Major data subjects
  • Attributes of data subjects
  • Relationships among data subjects
  • Business rules for our data

Value of a data model

  • Abstraction from database implementation specifics
  • Helps even with relational databases
  • Even more valuable with ‘non-intuitive’ data implementations

Basic Data Modeling Concepts and Terminology

Data Subjects

  • Commonly called ‘entities’
  • Some methodologies use ‘objects’ or ‘classes’
  • Somewhat analogous to a database table
  • Think it as something that exists

Attributes

  • Analogous to a database column
  • Think ‘field’
  • Attributes typically associated with entities (subjects)
  • Attribute types often shared across multiple entities

Relationship among Data Subjects

Business Rules for data

  • Cardinality
  • Mandatory or optional relationships
  • Permissible attribute values (including NULLs)
  • Data change dynamics (when a row is deleted, other rows in other tables will be deleted too)

Compare Transactional Data Modeling to Analytical Data Modeling

Transactional Analytical
Conceptual Level mirror real world dimensional
Logical level (relationsl) data normalization rules with deliverate denormalization fact and dimension tables in accordance with best practices
Logical level (non-relational) NoSQL, OODBMS constructs cubes, columnar databases
Physical level blocks/tracks, MPP distribution blocks/trakcs, MPP distribution, AWS buckets, HDFS NameNodes and DataNodes

The Building Blocks of Data Modeling

Entities

Think it as\

  • the real-world subject.
  • collection of attributes (fields)

Representing entities in a model

  • Classic ER modeling: square, squared-off (not round) corners
  • Crow’s foot notation

Attributes

Field, the details of an entity

  • Represented by a circle

Attributes have descriptions and rules

  • Data types and sizes
  • Whether NULL values allowed
  • Permissible values

Attribute domains

  • Reuseable general classes of descriptions
  • Applied to selective attributes that fit the domain
  • Supported by some data modeling software tools
  • Example: Valid Business Date is between 1/1/1980 and 21/31/2199, attribute not only have type DATE but also have domain ‘Valid Business Date’

Multi-valued attribute (MVA)

  • More than one possible values for each instance

  • Example: a student could have more than one email addresses

  • represented by double circles

  • Modeling software often doesn’t permit MVAs in crow’s foot notation

  • MVAs is one of the two most obvious differentiators between conceptual and logical modeling

  • the other is many-to-many relationships vs database intersection tables

Hierarchies for the Entities

  • A special type of relationship
  • Two or more entities that have a lot in common but also at least a little bit different
  • Parent and Child entities
  • Concept of inheritance
  • Hierarchies can be inclusive (the item can be in both child entities) or exclusive (the item can be in only one child entity)

Constraints for your Attributes

  • Data types and sizes
  • Whether NULL values allowed
  • Permissible values
    • Range of values (can only be the value inside a range)
    • List of values (enum, can only be one of the values in the list)

Strong and Weak entities

Think instead in terms of dependencies

  • Independent entities

  • Dependent entities

    • Identification dependency
    • Existence dependency
  • Strong entity exists on its own terms

    • exists independent of any other entity
    • does not require any other entity instances to help identify its own instances
  • Weak entity needs some help

    • to identify specific instance of that entity
    • can’t exist without an instance of another entity
    • or both

Relationships

  • Classic ER notation: a diamond shape

  • Crow’s foot notation: a stright line

  • Multiple relationships between two entities

  • Recursive relationship involving just one entitiy

  • Ternary (three-entity) relationships

  • Relationships that seem like entities

Cardinalities

The number of something

  • Each relationship has two cardinalities
    • Maximum cardinality
    • Minimum cardinality
  • Representing cardinality in various notations
    • Classic ER
    • Crow’s foot

Maximum Cardinality

  • The maximum number of instances of both sides of a relationship
  • typical values: 1 or M
  • Can also be a specific numeric value

1:1 relationship

  • An instance from each side of a relationship is related to exactly 1 instance from the other side
  • Business rules:
    • a university has exactly 1 president
    • a person can only be president of 1 university

1:M relationship

  • An instance from one side of a relationship is related to 1 or more instances from the other side
  • Business rules:
    • A faculty member can advise many students
    • A student is advised by only one faculty member

M:M relationship

  • Any instance from either side of the relationship can be associated with one or many instances from the other side
  • Business rules:
    • A student can enroll in 1 or many classes
    • A class can enroll 1 or many students

Specific number of max cardinality

  • An instance from one side of a relationship can be related to at most some number from the other side
  • can be one-directional or bi-directional
  • One-directional 1 to (some number)
  • Bi-directional (some number) to (some number)

Example

  1. A student can take no more than 7 classes in one semester, a class can have many students. (7:M)
  2. A student can take no more than 7 classes and any class can enroll up to but no more than 300 students (7:300)

Minimum Cardinality

3 possible value for minimum cardinality

  • 0: optional/partial participation

  • 1: mandatory/total participation

  • n: some explicit number of minimum instances

    • A full-time lecturer must teach at least 6 classes
  • Difficult to represent explicit numbers with 0 | 1 notation for minimum cardinality

  • Number pairs

  • Left: min cardinality, right: max cardinality

Cardinality

  • An active student must enroll in at least 1 but up to many courses
  • A course can have many students but could possibly have zero students

Normalization

1st Normal Form

  • Every row (tuple) must be unique
  • No repeating groups
  • MVAs is a violation of 1st NF

2nd Normal Form

  • Must be in 1st NF
  • No Partial key dependencies
  • must have single-column primary key

3rd Normal Form

  • Must be in 2NF
  • No non-key dependencies
  • All attributes are dependent on the primary key

The key, the whole key, nothing but the key.

At the conceptual level, there is no real issues violating normalization. We address the issues at logicl level. Then, sometimes we deliverately violate the normalization at the physical level to improve performance based on real situations.

Conceptual level to logical level

Addressing normalization violations

  • 1NF violations: move offending data to separate table
  • 2NF violations: partial key dependencies, move offending attribute to correct entity
  • 3NF violations: non-key dependencies, same as 2NF, move attribute the correct place

Transform many-to-many relationships

  • Add intersection entity to your model
  • Also referred to as associative entity
  • Purpose: decompose M:M relationship into multiple semantically equivalent relationships

Software for Data Modeling

Advantages of data modeling tools

  • Enforcement of methodology, technique, and notation rules
  • Automated or semi-automated forward and reverse engineering

Options available

  1. Microsoft Visio
  2. CA ERwin
  3. ER/Studio Data Architect
  4. https://dbmstools.com/categories/data-modeling-tools

Microsoft Visio

  • From guided drawing templates to semantically aware models
    • Business process models
    • Data models
  • Multi-functioned drawing tool
  • Close alignment with Microsoft SQL server
  • Widely used

CA ERwin

  • One of the oldest data modeling tools still being used
  • Rich capabilities for forward and reverse engineering

ER/Studio Data Architect

  • Another old timer that still is a market leader
  • Rich feature set including dimensional modeling (for analytical)