Data modeling is the process of documenting a complex software system design as an easily understood diagram, using text and symbols to represent the way data needs to flow. The diagram can be used to ensure efficient use of data, as a blueprint for the construction of new software or for re-engineering a legacy application.
Data modeling is an important skill for data scientists and others involved in data analysis. Traditionally, data models have been built during the analysis and design phases of a project to ensure that the requirements for a new application are fully understood. Data models can also be invoked later in the data lifecycle to rationalize data designs that were originally created by programmers on an ad hoc basis.
Data modeling approaches
Data modeling can be a painstaking upfront process and, as such, is sometimes seen as being at odds with rapid development methodologies. As Agile programming has come into wider use to speed up development projects, after-the-fact methods of data modeling are being adopted in some instances. Typically, a data model can be thought of as a flowchart that illustrates the relationships among data. It enables stakeholders to identify errors and make changes before any programming code has been written. Alternatively, models can be introduced as part of reverse-engineering efforts that extract them from existing systems, as is often done with NoSQL databases.
Data modelers often use multiple models to view the same data and to ensure that all processes, entities, relationships and data flows have been identified. They initiate new projects by gathering requirements from business stakeholders. Data modeling work roughly breaks down into two stages: creating a logical data model, which shows the specific entities, their attributes and the relationships among entities, and creating a physical data model.
The logical data model serves as the basis for creation of a physical data model, which is specific to the application and database to be implemented. A data model can become the basis for building a more detailed data schema.
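The step from logical to physical model can be sketched in code. In this hypothetical example (the entity, attribute and table names are invented for illustration), a logical Employee entity with a few attributes is mapped to a concrete table for a specific DBMS, here SQLite, where physical concerns such as column types and a surrogate key are decided:

```python
import sqlite3

# Hypothetical mapping of a logical Employee entity to a physical schema.
# The logical model lists the attributes; the physical model commits to a
# target database, concrete types and a primary key.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        employee_id    INTEGER PRIMARY KEY,  -- surrogate key chosen at the physical stage
        last_name      TEXT NOT NULL,
        first_name     TEXT NOT NULL,
        years_employed INTEGER
    )
""")

# Inspect the columns actually realized in the physical schema.
cols = [row[1] for row in conn.execute("PRAGMA table_info(employee)")]
print(cols)  # ['employee_id', 'last_name', 'first_name', 'years_employed']
```

A more detailed data schema would then layer on indexes, constraints and storage settings for the chosen database.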
Hierarchical data modeling
Data modeling as a discipline began to arise in the 1960s, accompanying the upswing in use of database management systems (DBMSes). Data modeling enabled organizations to bring consistency, repeatability and well-ordered development to data processing. Application end users and programmers were able to use the data model as a reference in communications with data designers.
Hierarchical data models, which array data in treelike, one-to-many arrangements, marked these early efforts and replaced file-based systems in many popular use cases. IBM's Information Management System (IMS) is a prime example of the hierarchical approach, which found wide use in businesses, especially in banking. Although hierarchical data models were largely superseded by relational data models beginning in the 1980s, the hierarchical method is still common today in XML (Extensible Markup Language) and geographic information systems (GISes). Network data models also arose in the early days of DBMSes as a means to give data designers a broad conceptual view of their systems. The best-known is the CODASYL model, named for the Conference on Data Systems Languages, a consortium formed in the late 1950s to guide the development of a standard programming language that could be used across various types of computers.
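The treelike, one-to-many shape of a hierarchical model can be illustrated with a small sketch. This is a loose, hypothetical analogy to IMS-style parent-child segments, not IMS itself; the banking names are invented:

```python
# Hypothetical hierarchical layout: each parent record owns many children,
# and data is reached only by walking a path down from the root.
bank = {
    "branch": "Downtown",
    "customers": [                         # one branch -> many customers
        {"name": "Ada", "accounts": [      # one customer -> many accounts
            {"number": "001", "balance": 250},
            {"number": "002", "balance": 900},
        ]},
        {"name": "Ben", "accounts": [
            {"number": "003", "balance": 40},
        ]},
    ],
}

def balances(root):
    """Traverse root -> customer -> account, the only access path available."""
    return [acct["balance"]
            for cust in root["customers"]
            for acct in cust["accounts"]]

print(balances(bank))  # [250, 900, 40]
```

Note that there is no way to reach an account except through its owning customer, which is exactly the dependency on access paths that later models set out to remove.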
Relational data modeling
While it reduced program complexity versus file-based systems, the hierarchical model still required detailed understanding of the specific physical data storage employed. Proposed as an alternative to the hierarchical data model, the relational data model does not require developers to define data paths. Relational data modeling was first described in a 1970 technical paper by IBM researcher E.F. Codd. Codd's relational model set the stage for industry use of relational databases, in which data segments are explicitly joined through tables, in contrast to the hierarchical model, where data is implicitly joined by its position in the tree. Soon after its inception, the relational data model was coupled with the Structured Query Language (SQL) and began to gain an ever larger foothold in enterprise computing as an efficient means to process data.
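The explicit-join idea can be shown concretely. The following is a minimal sketch using SQLite through Python's standard library; the table and column names are invented for illustration:

```python
import sqlite3

# Sketch of the relational idea: data lives in flat tables, and related rows
# are joined explicitly through shared key values rather than through
# physical storage paths.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee   (emp_id  INTEGER PRIMARY KEY, name TEXT,
                             dept_id INTEGER REFERENCES department(dept_id));
    INSERT INTO department VALUES (1, 'Engineering'), (2, 'Finance');
    INSERT INTO employee   VALUES (10, 'Ada', 1), (11, 'Ben', 2);
""")
rows = conn.execute("""
    SELECT e.name, d.name
    FROM employee e
    JOIN department d ON e.dept_id = d.dept_id  -- the join is stated in the query
    ORDER BY e.emp_id
""").fetchall()
print(rows)  # [('Ada', 'Engineering'), ('Ben', 'Finance')]
```

Because the join is expressed in the query rather than baked into storage, the same tables can be combined in ways the original designer never anticipated.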
The entity relationship model
Relational data modeling took another step forward beginning in the mid-1970s as use of entity relationship (ER) models became more prevalent. Closely integrated with relational data models, ER models use diagrams to graphically depict the elements in a database and to ease understanding of underlying models.
With relational modeling, data types are determined and rarely changed over time. Entities comprise attributes; for example, an employee entity's attributes could include last name, first name, years employed and so on. Relationships are visually mapped, providing a ready means to communicate data design objectives to various participants in data development and maintenance. Over time, modeling tools, including Idera's ER/Studio, Erwin Data Modeler and SAP PowerDesigner, gained wide use among data architects for designing systems.
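The building blocks named above can be sketched directly in code. This is a hypothetical illustration of entities, attributes and a relationship, reusing the employee example; it is not the output of any modeling tool:

```python
from dataclasses import dataclass

# ER building blocks expressed in code: an entity is a thing being modeled,
# its attributes are its fields, and a relationship links entities together.
@dataclass
class Department:              # entity
    name: str                  # attribute

@dataclass
class Employee:                # entity
    last_name: str             # attributes, as in the example above
    first_name: str
    years_employed: int
    department: Department     # a "works in" relationship (many-to-one)

eng = Department("Engineering")
emp = Employee("Lovelace", "Ada", 7, eng)
print(emp.department.name)  # Engineering
```

An ER diagram conveys the same information graphically: boxes for the entities, listed attributes and a labeled line for the relationship.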
As object-oriented programming gained ground in the 1990s, object-oriented modeling emerged as yet another way to design systems. While bearing some resemblance to ER methods, object-oriented approaches differ in that they focus on object abstractions of real-world entities. Objects are grouped in class hierarchies, and the objects within such class hierarchies can inherit attributes and methods from parent classes. Because of this inheritance trait, object-oriented data models have some advantages over ER modeling in ensuring data integrity and supporting more complex data relationships. Also arising in the 1990s were data models specifically oriented toward data warehousing needs. Notable examples are snowflake schema and star schema dimensional models.
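The inheritance trait can be shown with a small class hierarchy. This is a hypothetical sketch with invented account classes, illustrating how a child class inherits attributes and methods from its parent:

```python
# Object-oriented modeling sketch: objects abstract real-world entities,
# and subclasses inherit attributes and methods from parent classes.
class Account:                       # parent class
    def __init__(self, number, balance):
        self.number = number
        self.balance = balance

    def deposit(self, amount):       # method inherited by every subclass
        self.balance += amount

class SavingsAccount(Account):       # child inherits number, balance, deposit()
    def __init__(self, number, balance, rate):
        super().__init__(number, balance)
        self.rate = rate             # subclass adds its own attribute

    def add_interest(self):
        self.balance += self.balance * self.rate

acct = SavingsAccount("S-001", 100.0, 0.05)
acct.deposit(100.0)      # inherited from Account
acct.add_interest()      # defined on SavingsAccount
print(acct.balance)
```

Shared behavior such as deposit() is defined once on the parent, which is what helps keep rules about the data consistent across all account types.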
Graph data models
An offshoot of hierarchical and network data modeling is the property graph model, which, together with graph databases, has found increased use for describing complex relationships within data sets, particularly in social media, recommender and fraud detection applications.
Using the graph data model, designers describe their system as a connected graph of nodes and relationships, much as they might do with ER or object data modeling. Graph data models can be used for text analysis, creating models that uncover relationships among data points within documents.
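A property graph can be sketched with plain data structures. In this hypothetical social-network example (the node and relationship names are invented), both nodes and relationships carry key-value properties, and queries work by walking the connections:

```python
# Property graph sketch: nodes and relationships both carry properties,
# and traversal follows the connections -- the access pattern behind
# social media, recommender and fraud detection workloads.
nodes = {
    "u1": {"label": "User", "name": "Ada"},
    "u2": {"label": "User", "name": "Ben"},
    "u3": {"label": "User", "name": "Cyd"},
}
edges = [
    # (source, relationship type, target, relationship properties)
    ("u1", "FOLLOWS", "u2", {"since": 2021}),
    ("u2", "FOLLOWS", "u3", {"since": 2022}),
]

def follows(user_id):
    """Names of users that user_id follows, found by walking FOLLOWS edges."""
    return [nodes[dst]["name"]
            for src, rel, dst, props in edges
            if src == user_id and rel == "FOLLOWS"]

print(follows("u1"))  # ['Ben']
```

A graph database makes this traversal a first-class query operation, so multi-hop questions (who follows a follower of Ada?) stay cheap as the data grows.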