Relational databases are based on the relational model, an intuitive, straightforward way of representing data in tables. In a relational database, each row in the table is a record with a unique ID called the primary key. Each column in the table represents a different field in the record, making it easy to establish relationships between different tables. For example, consider a university database with tables for Students, Courses, and Enrollment. The Students and Courses tables have a primary key for each student and course, respectively, while the Enrollment table uses foreign keys to represent enrollments in specific courses by specific students.
Non-relational databases, or NoSQL databases, are designed for specific data models and have flexible schemas for building modern applications. Non-relational databases are capable of storing structured, semi-structured, and unstructured data. This flexibility is their main advantage over relational databases. For example, a MongoDB database storing Twitter posts would be a Non-relational database, as each post (document) can have different fields (user, message content, timestamp, etc.), and the order of fields can vary from post to post.
A data mart is a subset of a data warehouse that is generally oriented to a specific business line or team. Whereas a data warehouse combines databases across an entire enterprise, data marts are usually smaller and focus on a particular subject or department. For instance, a company might have a data mart dedicated to finance, sales, or HR. These data marts would contain data specifically related to those departments.
A data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating analytical reports for knowledge workers throughout the enterprise. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store and may require data cleansing for additional operations to ensure data quality before it is used in the data warehouse for reporting. For instance, a data warehouse for a large retail chain might integrate data from various sources like point-of-sale systems, online sales, and supply chain management systems to give executives a comprehensive view of their business.
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. It’s a place to store every type of data in its native format with no fixed limits on account size or file. It offers high data quantity to increase analytic performance and native integration. Unlike a hierarchical data warehouse which stores data in files or folders, a data lake uses a flat architecture to store data. An example of a data lake could be a Hadoop ecosystem storing diverse data, from log files and social media feeds to financial records, and the data could be structured, semi-structured, or unstructured. The main advantage of a data lake is the ability to store a vast amount of data at a low cost, and the flexibility of the types of data it can store.
Online transaction processing (OLTP) is a class of systems that supports or facilitates high transaction-oriented applications. OLTP systems are designed to efficiently process transactions in real time. An example of an OLTP system could be a hotel reservation system. When you book a room, the system checks whether a room is available, reserves the room, deducts the cost from your credit card, and confirms your reservation, all in real time.
In contrast to OLTP, online analytical processing (OLAP) is a category of software tools that analyze data stored in databases and provide fast analysis of multidimensional data. OLAP systems allow users to analyze database information from multiple database systems at one time. An example of an OLAP operation could be an annual report of a company’s product sales, which would draw data from different database systems, including sales, products, and time.
The snowflake schema is a variation of the star schema used in a data warehouse. The difference is that the tables, which describe the dimensions of the star schema, are normalized. This schema is called a snowflake because its diagram resembles a Snowflake. For example, if you have a Product Sales database, the product category, product, and supplier details are separated into different dimension tables in a snowflake schema.
The star schema is the simplest type of Data Warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a center. The center of the star consists of a fact table, and the points of the star are the dimension tables. For instance, in a Sales data warehouse, a single Fact table could hold the core transactional data (like total sale amount), and the Dimension tables could contain descriptive attributes related to the Facts (like product, date, store, and customer details).
Slowly changing dimensions (SCDs) are dimensions that change over time. The concept comes from data warehousing and describes how to version data that changes over time. There are multiple types of SCDs, with the most commonly used ones being SCD Type 1, SCD Type 2, and SCD Type 3.
It’s crucial to select the right SCD type based on the business requirement, as they offer different ways to track history and can significantly affect the size and complexity of your data warehouse.
In a dimension table, when changes occur in the original data source, and you want to overwrite that information in the table, it’s a “keep current information” approach. For example, if a customer changes their home address, the new address replaces the old address in the table.
On the other hand, there are cases where preserving historical data is necessary. In such cases, a new record is added to the table to represent the new information, without discarding the old record. This way, the table keeps both historical and current information. For instance, when a product’s price changes, rather than simply overwriting the old price, a new record is created with the new price and the date of the change.
The Date data type is used to store date values. A Date value typically includes the year, month, and day, and sometimes also the time. For example, an ‘order_date’ column in an ‘orders’ table in an e-commerce database would be a Date data type.
Numeric data types store numeric values. These values can be integers (without decimal points) or floating-point numbers (with decimal points). For example, a ‘price’ column in a ‘products’ table storing the prices of products would be a Numeric data type.
An Alphanumeric data type can include letters, numbers, and other characters typically found on a keyboard. They’re used when the data to be stored is a mix of letters and numbers. For instance, a product ID, such as ‘A123’, would be stored as an Alphanumeric data type.
The Currency data type is used to store monetary values. It’s a type of numeric data type that represents a currency value. This data type is commonly used in business databases. For example, the ‘revenue’ column in a ‘sales’ table storing the revenue from each sale would be a Currency data type.
The Text data type is used to store alphanumeric characters up to the max allowed length. They’re used when the data to be stored is primarily composed of letters, as opposed to numbers. For instance, a ‘description’ column in a ‘products’ table storing a text description of each product would be a Text data type.
Discrete data is countable and cannot be broken down into a smaller unit. It represents items that can be counted. For instance, the number of calls received by a call center per day is discrete data.
Continuous data, on the other hand, represents measurements and can be broken down into smaller parts and get more precise values. For example, the time a customer spends on a website is continuous data.
Categorical, or dimension, data types are used for data that can be divided into groups or categories. For example, a ‘gender’ column in a ‘customers’ table storing the gender of each customer is a categorical data type.
The Image data type is used to store image data, which can be a graphic, a drawing, or a photo. For example, a ‘profile_picture’ field in a ‘users’ table storing the profile picture of each user would be an Image data type.
The Audio data type is used to store audio data, which can be music, speech, or any type of sound. For example, an ‘audio_clip’ field in a ‘voicemails’ table storing voicemail messages would be an Audio data type.
The Video data type is used to store video data. For example, a ‘tutorial_video’ field in a ‘lessons’ table storing video tutorials would be a Video data type.
JSON is a lightweight data interchange format that is easy for humans to read and write and easy for machines to parse and generate. JSON uses key-value pairs to store data, making it similar in some ways to a dictionary in Python. For example, a .json file might contain a list of students in a class, with each student represented by a JSON object with keys for name, ID number, and grade level.
XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML data is stored in text format and is both human-readable and machine-readable. For example, an .xml file might store data about a collection of books in a library, with tags for title, author, and publication date for each book.
HTML is the standard markup language for documents designed to be displayed in a web browser. It can include text, images, and other types of content. For example, a .html file might display a webpage containing a news article, with tags for the article’s title, the body of the article, and images to accompany the article.