Data Management/Data Warehousing Definitions

This glossary explains the meaning of key words and phrases that information technology (IT) and business professionals use when discussing data management and related software products. You can find additional definitions by visiting WhatIs.com.

  • A

    Apache Falcon

    Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters, with a goal of ensuring consistent and dependable performance on complex processing jobs.

  • Apache Flink

    Apache Flink is an in-memory and disk-based distributed data processing platform for use in big data streaming applications.

  • Apache Giraph

    Apache Giraph is graph processing software that is mostly used to analyze social media data. Giraph was originally developed at Yahoo! and later donated to the Apache Software Foundation, which now manages the project.

  • Apache Hadoop YARN

    Apache Hadoop YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework.

  • Apache HBase

    Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS).

  • Apache Hive

    Apache Hive is an open source data warehouse system for querying and analyzing large data sets that are principally stored in Hadoop files.

  • Apache Incubator

    Apache Incubator is the starting point for projects and software seeking to become part of the Apache Software Foundation’s efforts. The ASF is a non-profit organization that oversees the development of Apache software.

  • Apache Pig

    Apache Pig is an open-source technology that offers a high-level mechanism for parallel programming of MapReduce jobs to be executed on Hadoop clusters.

  • Apache Spark

    Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.
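
    As a rough sketch of what parallel processing in Spark looks like from Python, the snippet below counts words with the RDD API. It assumes the pyspark package is installed and that a local file named input.txt exists; the app name and file are illustrative, not part of the definition above.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

lines = spark.sparkContext.textFile("input.txt")       # read the file as a distributed dataset (RDD)
counts = (lines.flatMap(lambda line: line.split())     # split each line into words
               .map(lambda word: (word, 1))            # pair every word with a count of 1
               .reduceByKey(lambda a, b: a + b))       # sum the counts per word across the cluster
print(counts.take(10))                                 # pull back the first 10 (word, count) pairs
spark.stop()
```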

  • B

    Big data

    Big data is a combination of structured, semistructured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications.

  • big data management

    Big data management is the organization, administration and governance of large volumes of both structured and unstructured data.

  • C

    column database management system (CDBMS)

    A column database management system (CDBMS) is a DBMS that stores data by column, or by column families, instead of by row. There are different types of CDBMS offerings, but columnar storage is the common defining feature.

  • columnar database

    A columnar database is a database management system (DBMS) that stores data in columns instead of rows.
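
    To make the row-versus-column distinction concrete, here is a small, database-agnostic illustration in plain Python; the records and field names are invented for the example.

```python
# The same three records, stored row by row ...
rows = [
    {"id": 1, "name": "Alice", "amount": 120.0},
    {"id": 2, "name": "Bob",   "amount": 75.5},
    {"id": 3, "name": "Cara",  "amount": 210.0},
]

# ... and stored column by column, the way a columnar database lays data out.
columns = {
    "id":     [1, 2, 3],
    "name":   ["Alice", "Bob", "Cara"],
    "amount": [120.0, 75.5, 210.0],
}

# An aggregate over one column only has to scan that column's values ...
print(sum(columns["amount"]))

# ... whereas the row layout touches every field of every record.
print(sum(r["amount"] for r in rows))
```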

  • compliance

    Compliance is the act of being in alignment with guidelines, regulations and/or legislation. Organizations must ensure that they are in compliance with software licensing terms set by vendors, for example, as well as regulatory mandates.

  • conformed dimension

    In data warehousing, a conformed dimension is a dimension that has the same meaning to every fact with which it relates.

  • consumer privacy (customer privacy)

    Consumer privacy, also known as customer privacy, involves the handling and protection of the sensitive personal information provided by customers in the course of everyday transactions.

  • cooked data

    Cooked data is raw data after it has been processed - that is, extracted, organized, and perhaps analyzed and presented - for further use.

  • corporate performance management (CPM)

    Corporate performance management (CPM) is a term used to describe the various processes and methodologies involved in aligning an organization's strategies and goals to its plans and executions in order to control the success of the company.

  • CouchDB

    CouchDB is an open source, document-oriented NoSQL database based on common web standards. NoSQL databases are useful for very large sets of distributed data, especially for the large amounts of non-uniform data in various formats that are characteristic of web-based data.

  • CRUD cycle (Create, Read, Update and Delete Cycle)

    The CRUD cycle describes the elemental functions of a persistent database in a computer.
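
    The sketch below walks the four CRUD steps against an in-memory SQLite database using Python's built-in sqlite3 module; the customers table and its values are illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

con.execute("INSERT INTO customers (name) VALUES (?)", ("Alice",))         # Create
print(con.execute("SELECT id, name FROM customers").fetchall())            # Read
con.execute("UPDATE customers SET name = ? WHERE id = ?", ("Alicia", 1))   # Update
con.execute("DELETE FROM customers WHERE id = ?", (1,))                    # Delete
con.close()
```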

  • customer data integration (CDI)

    Customer data integration (CDI) is the process of defining, consolidating and managing customer information across an organization's business units and systems to achieve a "single version of the truth" for customer data.

  • D

    dark data

    Dark data is digital information that is not being used. Consulting and market research company Gartner Inc. describes dark data as "information assets that an organization collects, processes and stores in the course of its regular business activity, but generally fails to use for other purposes."

  • data

    In computing, data is information that has been translated into a form that is efficient for movement or processing.

  • data access rights

    A data access right (DAR) is a permission that has been granted that allows a person or computer program to locate and read digital information at rest. Data access rights play an important role in information security and compliance.

  • data activation

    Data activation is a marketing approach that uses consumer information and data analytics to help companies gain real-time insight into target audience behavior and plan for future marketing initiatives.

  • data analytics (DA)

    Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information.

  • Data as a Service (DaaS)

    Data as a Service (DaaS) is an information provision and distribution model in which data files (including text, images, sounds, and videos) are made available to customers over a network, typically the Internet.

  • data catalog

    A data catalog is a metadata management tool designed to help organizations find and manage large amounts of data – including tables, files and databases – stored in their ERP, human resources, finance and e-commerce systems as well as other sources like social media feeds.

  • data classification

    Data classification is the process of organizing data into categories that make it easy to retrieve, sort and store for future use.

  • data dredging (data fishing)

    Data dredging, sometimes referred to as data fishing, is a data mining practice in which large volumes of data are searched to find any possible relationships within the data.

  • data engineer

    A data engineer is a worker whose primary job responsibilities involve preparing data for analytical or operational uses.

  • data federation software

    Data federation software is programming that provides an organization with the ability to collect data from disparate sources and aggregate it in a virtual database where it can be used for business intelligence (BI) or other analysis.

  • data integration

    Data integration is the process of combining data from multiple source systems to create unified sets of information for both operational and analytical uses.

  • data management-as-a-service (DMaaS)

    Data Management-as-a-Service (DMaaS) is a type of cloud service that provides protection, governance and intelligence across a company’s various data sources.

  • data mart (datamart)

    A data mart is a repository of data that is designed to serve a particular community of knowledge workers.

  • data modeling

    Data modeling is the process of documenting a complex software system design as an easily understood diagram, using text and symbols to represent the way data needs to flow.

  • data profiling

    Data profiling is the process of examining, analyzing and reviewing data to collect statistics surrounding the quality and hygiene of the dataset.
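
    A minimal profiling sketch using only the Python standard library: it counts nulls, distinct values and the most common value for each field. The sample records and field names are made up for illustration.

```python
from collections import Counter

records = [
    {"email": "a@example.com", "age": 34},
    {"email": None,            "age": 29},
    {"email": "a@example.com", "age": None},
]

for field in ("email", "age"):
    values = [r[field] for r in records]
    non_null = [v for v in values if v is not None]
    print(field,
          "| nulls:", values.count(None),
          "| distinct:", len(set(non_null)),
          "| most common:", Counter(non_null).most_common(1))
```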

  • data quality

    Data quality is a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability and whether it's up to date.

  • data scrubbing (data cleansing)

    Data scrubbing, also called data cleansing, is the process of cleaning up data in a database that is incorrect, incomplete, or duplicated.
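
    As a simple illustration (standard library only), the snippet below normalizes email addresses, drops incomplete records and removes duplicates; the field names and rows are hypothetical.

```python
raw = [
    {"email": " Alice@Example.com ", "city": "Boston"},
    {"email": "alice@example.com",   "city": "Boston"},   # duplicate once normalized
    {"email": None,                  "city": "Denver"},   # incomplete record
]

seen, clean = set(), []
for rec in raw:
    email = (rec["email"] or "").strip().lower()   # trim whitespace, normalize case
    if not email or email in seen:                 # drop incomplete or duplicate rows
        continue
    seen.add(email)
    clean.append({"email": email, "city": rec["city"].strip()})

print(clean)   # [{'email': 'alice@example.com', 'city': 'Boston'}]
```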

  • data silo

    A data silo exists when an organization's departments and systems cannot, or do not, communicate freely with one another, which discourages the sharing of business-relevant data.

  • data stewardship

    Data stewardship is the management and oversight of an organization's data assets to help provide business users with high-quality data that is easily accessible in a consistent manner.

  • data transformation

    Data transformation is the process of converting data from one format, such as a database file, XML document or Excel spreadsheet, into another.
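
    A small format-conversion sketch using the Python standard library, turning CSV text into JSON; the sample data is invented.

```python
import csv, io, json

csv_text = "id,name,amount\n1,Alice,120.0\n2,Bob,75.5\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))   # parse the CSV into dictionaries
print(json.dumps(rows, indent=2))                    # emit the same records as JSON
```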

  • data virtualization

    Data virtualization is an umbrella term used to describe any approach to data management that allows an application to retrieve and manipulate data without needing to know any technical details about the data such as how it is formatted or where it is physically located. 

  • data warehouse

    A data warehouse is a federated repository for all the data collected by an enterprise's various operational systems, be they physical or logical.

  • data warehouse as a service (DWaaS)

    Data warehousing as a service (DWaaS) is an outsourcing model in which a service provider configures and manages the hardware and software resources a data warehouse requires, and the customer provides the data and pays for the managed service.

  • database as a service (DBaaS)

    Database as a service (DBaaS) is a cloud computing managed service offering that provides access to a database without requiring the setup of physical hardware, the installation of software or the need to configure the database.

  • database replication

    Database replication is the frequent electronic copying of data from a database in one computer or server to a database in another -- so that all users share the same level of information.

  • database-agnostic

    Database-agnostic is a term describing the capacity of software to function with any vendor’s database management system (DBMS). In information technology (IT), agnostic refers to the ability of something – such as software or hardware – to work with various systems, rather than being customized for a single system.

  • DataOps (data operations)

    DataOps (data operations) is an Agile approach to designing, implementing and maintaining a distributed data architecture that will support a wide range of open source tools and frameworks in production. The goal of DataOps is to create business value from big data. 

  • denormalization

    In a relational database, denormalization is an approach to optimizing performance in which the administrator selectively adds back specific instances of duplicate data after the data structure has been normalized.

  • deterministic/probabilistic data

    Deterministic and probabilistic are opposing terms that can be used to describe customer data and how it is collected. Deterministic data is also referred to as first party data. Probabilistic data is information that is based on relational patterns and the likelihood of a certain outcome.

  • dimension

    In data warehousing, a dimension is a collection of reference information about a measurable event (fact).

  • dimension table

    A dimension table is a table in a star schema of a data warehouse. A dimension table stores attributes, or dimensions, that describe the objects in a fact table.

  • dirty data

    In a data warehouse, dirty data is a database record that contains errors.

  • disambiguation

    Disambiguation (also called word sense disambiguation) is the act of interpreting the intended sense or meaning of a word. Disambiguation is a common problem in computer language processing, since it is often difficult for a computer to distinguish a word’s sense when the word has multiple meanings or spellings.

  • data governance (DG)

    Data governance (DG) is the process of managing the availability, usability, integrity and security of the data in enterprise systems, based on internal data standards and policies that also control data usage.

  • data management

    Data management is the process of ingesting, storing, organizing and maintaining the data created and collected by an organization.

  • E

    Entity Relationship Diagram (ERD)

    An entity relationship diagram (ERD), also known as an entity relationship model, is a graphical representation that depicts relationships among people, objects, places, concepts or events within an information technology (IT) system.

  • Extract, Load, Transform (ELT)

    Extract, Load, Transform (ELT) is a data integration process for transferring raw data from a source server to a data system (such as a data warehouse or data lake) on a target server and then preparing the information for downstream uses.

  • extract, transform, load (ETL)

    In managing databases, extract, transform, load (ETL) refers to three separate functions combined into a single programming tool.
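
    To show the three functions in sequence, here is a toy ETL pipeline in Python: extract rows from CSV text, transform them (type conversion and a simple validity filter), and load them into an in-memory SQLite table. All table and field names are illustrative.

```python
import csv, io, sqlite3

csv_text = "id,amount\n1,120.0\n2,-5.0\n3,75.5\n"

# Extract: read the source records.
raw_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Transform: convert types and drop invalid (negative) amounts.
rows = [(int(r["id"]), float(r["amount"])) for r in raw_rows if float(r["amount"]) >= 0]

# Load: write the cleaned rows to the target table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
print(con.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())   # (2, 195.5)
```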

  • F

    fact table

    A fact table is the central table in a star schema of a data warehouse. A fact table stores quantitative information for analysis and is often denormalized.

  • fixed data (permanent data, reference data, archival data, or fixed-content data)

    Fixed data (sometimes referred to as permanent data) is data that is not, under normal circumstances, subject to change. Any type of historical record is fixed data. For example, meteorological details for a given location on a specific day in the past are not likely to change (unless the original record is found, somehow, to be flawed).

  • G

    Google BigQuery

    Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets. BigQuery was designed for analyzing data on the order of billions of rows, using a SQL-like syntax.

  • Google Bigtable

    Google Bigtable is a distributed, column-oriented data store created by Google Inc. to handle very large amounts of structured data associated with the company's Internet search and Web services operations.

  • Google Cloud Dataflow

    Google Cloud Dataflow is a cloud-based data processing service for both batch and real-time data streaming applications.

  • Google Cloud Spanner

    Google Cloud Spanner is a distributed relational database service that runs on Google Cloud.

  • H

    Hadoop

    Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems.

  • Hadoop 2

    Apache Hadoop 2 is the second iteration of the Hadoop framework for distributed data processing. Hadoop 2 adds support for running non-batch applications as well as new features to improve system availability.

  • Hadoop data lake

    A Hadoop data lake is a data management platform comprising one or more Hadoop clusters.

  • Hadoop Distributed File System (HDFS)

    The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications.

  • Heap

    Heap is a user analytics tool that can be used to capture all web, mobile and cloud-based user interactions in an application.

  • HP e3000

    The HP e3000 is a line of midrange business servers that carries on the well-known series of 3000 computers from Hewlett-Packard (HP).

  • I

    in-memory database management system (IMDBMS)

    An in-memory database management system (IMDBMS) stores, manages and provides access to data from main memory.

  • J

    JAQL (JSON query language)

    JAQL is a query language for the JavaScript Object Notation (JSON) data interchange format. Pronounced "jackal," JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data.

  • M

    MariaDB

    MariaDB is an open source relational database management system (DBMS) that is a compatible drop-in replacement for the widely used MySQL database technology.

  • master data

    Master data is the core data that is essential to operations in a specific business or business unit.

  • master data management (MDM)

    Master data management (MDM) is a process that creates a uniform set of data on customers, products, suppliers and other business entities from different IT systems.

  • MongoDB

    MongoDB is an open source database management system (DBMS) that uses a document-oriented database model which supports various forms of data.
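
    The sketch below shows the document model in practice. It assumes the pymongo driver is installed and a MongoDB server is reachable on localhost:27017; the database, collection and documents are all hypothetical.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]            # database "shop", collection "orders"

# Documents in the same collection do not need identical structures.
orders.insert_one({"customer": "Alice", "items": ["book", "pen"], "total": 18.5})
orders.insert_one({"customer": "Bob", "total": 42.0, "gift": True})

print(orders.find_one({"customer": "Alice"}))
client.close()
```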

  • MPP database (massively parallel processing database)

    An MPP database is a database optimized for massively parallel processing, in which many operations are performed simultaneously by many separate processing units.

  • multimodel database

    A multimodel database is a data processing platform that supports multiple data models, which define the parameters for how the information in a database is organized and arranged.

  • N

    NewSQL

    NewSQL is a term coined by the analyst firm The 451 Group as shorthand to describe vendors of new, scalable, high-performance SQL databases.

  • NoSQL (Not Only SQL database)

    NoSQL is an approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases in which data is placed in tables and data schema is carefully designed before the database is built.

  • NoSQL DBMS (Not only SQL database management system)

    A NoSQL DBMS is a database management system built on a nonrelational data model, such as a key-value, document, columnar or graph store. Understanding the types of NoSQL DBMSes that are available is an important requirement for modern application development.

  • NuoDB

    NuoDB is a SQL-oriented transactional database management system designed for distributed deployment in the cloud.

  • O

    OLAP (online analytical processing)

    OLAP (online analytical processing) enables a user to easily and selectively extract and view data from different points of view.

  • OLAP cube

    An OLAP cube is a multidimensional database that is optimized for data warehouse and online analytical processing (OLAP) applications.

  • P

    privacy

    On the Internet, privacy, a major concern of users, can be divided into these concerns: what personal information can be shared with whom; whether messages can be exchanged without anyone else seeing them; and whether and how one can send messages anonymously. Most web users want assurance that the personal information they share will not be passed on to anyone else without their permission.

  • R

    raw data (source data or atomic data)

    Raw data (sometimes called source data or atomic data) is data that has not been processed for meaningful use.

  • RDBMS (relational database management system)

    A relational database management system (RDBMS) is a collection of programs and capabilities that enable IT teams and others to create, update, administer and otherwise interact with a relational database.

  • relational database

    A relational database is a collection of information that organizes data points with defined relationships for easy access.

  • RFM analysis (recency, frequency, monetary)

    RFM (recency, frequency, monetary) analysis is a marketing technique used to determine which customers are the best ones by examining how recently a customer has purchased (recency), how often they purchase (frequency) and how much they spend (monetary).
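
    A worked example in plain Python: derive each customer's recency (days since last purchase), frequency (number of purchases) and monetary value (total spend) from a made-up transaction list.

```python
from datetime import date

today = date(2024, 1, 31)
transactions = [                        # (customer, purchase date, amount)
    ("alice", date(2024, 1, 29), 40.0),
    ("alice", date(2024, 1, 5),  25.0),
    ("bob",   date(2023, 11, 2), 300.0),
]

rfm = {}
for customer, when, amount in transactions:
    recency, frequency, monetary = rfm.get(customer, (None, 0, 0.0))
    days_ago = (today - when).days
    rfm[customer] = (
        days_ago if recency is None else min(recency, days_ago),  # recency: days since last purchase
        frequency + 1,                                             # frequency: number of purchases
        monetary + amount,                                         # monetary: total spend
    )

print(rfm)   # {'alice': (2, 2, 65.0), 'bob': (90, 1, 300.0)}
```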

  • S

    semantic technology

    Semantic technology is a set of methods and tools that provide advanced means for categorizing and processing data, as well as for discovering relationships within varied data sets.

  • snowflaking (snowflake schema)

    In data warehousing, snowflaking is a form of dimensional modeling where dimensions are stored in multiple related dimension tables. 

  • sparsity and density

    Sparsity and density are terms used to describe the percentage of cells in a database table that are unpopulated (sparsity) and populated (density), respectively. The sum of the sparsity and density should equal 100%.
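
    A quick worked example: in the small table below, three of six cells are populated, so density is 50% and sparsity is 50%, and the two figures sum to 100%.

```python
table = [
    [10,   None, 3],
    [None, None, 7],
]

cells = [cell for row in table for cell in row]
density = sum(cell is not None for cell in cells) / len(cells) * 100   # populated cells
sparsity = 100 - density                                               # unpopulated cells
print(f"density={density:.1f}%  sparsity={sparsity:.1f}%")             # density=50.0%  sparsity=50.0%
```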

  • SQL-on-Hadoop

    SQL-on-Hadoop is a class of analytical application tools that combine established SQL-style querying with newer Hadoop data framework elements.

  • star schema

    In data warehousing, a star schema is the simplest form of dimensional model, with data organized into facts and dimensions. 
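
    A minimal star schema sketch using Python's built-in sqlite3: one fact table referencing two dimension tables, followed by a typical roll-up query. All table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_key INTEGER REFERENCES dim_date(date_key),
                              product_key INTEGER REFERENCES dim_product(product_key),
                              amount REAL);
    INSERT INTO dim_date    VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'toys');
    INSERT INTO fact_sales  VALUES (1, 1, 120.0), (1, 2, 80.0), (2, 1, 60.0);
""")

# Facts are summed (rolled up) by joining out to the dimensions.
print(con.execute("""
    SELECT d.month, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
""").fetchall())   # [(1, 'books', 120.0), (1, 'toys', 80.0), (2, 'books', 60.0)]
```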

  • synthetic backup

    Synthetic backup is the process of generating a file from a complete copy of a file created at some past time and one or more incremental copies created at later times.

  • T

    TensorFlow

    TensorFlow is an open source framework developed by Google researchers to run machine learning, deep learning and other statistical and predictive analytics workloads.

  • three-phase commit (3PC)

    Three-phase commit (3PC) is a protocol based on a distributed algorithm that ensures all the nodes in a system agree on a transaction before it is committed.

  • tree structure

    A tree structure is an algorithm for placing and locating files (called records or keys) in a database. The algorithm finds data by repeatedly making choices at decision points called nodes. A node can have as few as two branches (also called children).
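
    A minimal binary search tree sketch in Python: insertion places each key at a branch chosen by comparison, and lookup repeatedly picks a branch at each node until the key is found. The keys and record values are illustrative.

```python
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def insert(node, key, value):
    """Place the key by walking left for smaller keys, right otherwise."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        node.left = insert(node.left, key, value)
    else:
        node.right = insert(node.right, key, value)
    return node

def find(node, key):
    """Locate a record by making a choice at each decision point (node)."""
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return None if node is None else node.value

root = None
for k, v in [(5, "record-5"), (2, "record-2"), (8, "record-8")]:
    root = insert(root, k, v)
print(find(root, 8))   # record-8
```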
