Big data glossary: 50+ terms defined
What exactly is big data? And what about clustering or Hadoop? Our big data glossary will help you navigate the world of big data by walking you through key terms and definitions, from the basic to the advanced.
ACID
ACID stands for atomicity, consistency, isolation, and durability: the four properties that a transactional database guarantees.
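Atomicity, for example, can be demonstrated with Python's built-in sqlite3 module: if any statement in a transaction fails, every statement in that transaction is rolled back together. A minimal sketch, using made-up account data:

```python
import sqlite3

# In-memory database; the connection object manages transactions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # the 'with' block is one atomic transaction
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("INSERT INTO accounts VALUES ('alice', 999)")  # fails: duplicate key
except sqlite3.IntegrityError:
    pass  # both statements were rolled back together

balance = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
print(balance)  # prints 100: alice's debit was undone along with the failed insert
```

Because the second statement violates the primary key, the debit in the first statement is also undone: the transaction either happens entirely or not at all.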
Aggregation
Data gathered and presented in a summary form, usually for statistical analysis. Aggregation may also be one component of a process that helps ensure user anonymity.
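A small Python sketch of the idea, rolling hypothetical trip records up into per-vehicle totals (only the summary survives, not the individual rows):

```python
from collections import defaultdict

# Raw trip records: (vehicle, distance_km) -- invented sample data
trips = [("van-1", 12.0), ("van-2", 30.5), ("van-1", 8.0), ("van-2", 4.5)]

totals = defaultdict(float)
for vehicle, km in trips:
    totals[vehicle] += km  # individual trips are rolled up per vehicle

print(dict(totals))  # prints {'van-1': 20.0, 'van-2': 35.0}
```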
Algorithm
An algorithm is a step-by-step set of operations that can be performed to solve a particular class of problem. The steps may involve calculation, processing, or reasoning. To be effective, an algorithm must reach a solution in finite space and time. As an example, Google uses algorithms extensively to rank page results and autocomplete user queries.
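A classic small example is Euclid's algorithm for the greatest common divisor, which is guaranteed to terminate in a finite number of steps:

```python
def gcd(a, b):
    """Euclid's algorithm: a finite, step-by-step procedure."""
    while b:
        a, b = b, a % b  # each step strictly shrinks b, so the loop terminates
    return a

print(gcd(48, 18))  # prints 6
```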
See Also: Read about the curve algorithm, Geotab’s patented method for GPS tracking.
Anonymization
Data which has been stripped of personally identifiable information, or which has had this information replaced with a randomly generated identifier. Data anonymization is just one part of a collection of methods used to help protect user privacy and identity.
Apache Kafka
An open source stream processing platform that uses a publisher and subscriber model to handle real-time data feeds. It is a highly scalable distributed system with high throughput and low latency.
Artificial Intelligence (AI)
The process of developing software and intelligent machines that can recognize and react to their environment, take appropriate action when required, and learn from those actions. It may include tasks that normally require human intelligence, such as translation between languages, speech recognition, visual perception, and decision-making. Learn how artificial intelligence is impacting the mobility industry.
Automatic Identification and Data Capture (AIDC)
A system for collecting data and automatically identifying items without human involvement. Examples include facial recognition systems, magnetic stripes, smart cards, voice recognition, and bar codes.
Big Data
Large volumes of data, structured and unstructured, that are gathered and analyzed to improve customer experience and business efficiency, among other goals.
Big Data Graveyard
This “graveyard” consists of big repositories of unused data. The data usually gets stored in server farms or the cloud, mostly to never be seen or used again. Read how to resurrect value from data in this blog post from Mike Branch.
Big Data Scientist
An individual who performs data mining, statistical analysis, machine learning, and retrieval processes on large amounts of data to identify trends and patterns, and to forecast and make predictions.
Business Intelligence (BI)
A term that refers to the tools, applications, technologies, and practices used for the collection, extraction, identification, and analysis of business data.
Cloud Computing
Cloud computing is data storage and processing over the internet. Instead of locally managing computers, hard drives, and servers, a third-party service manages the physical infrastructure and the end user accesses the resources remotely.
The rise of cloud computing has been made possible by the increasing affordability of internet services, along with enhanced security and flexibility. Employees no longer need to be at their offices and computers to access their data, tools, and applications, and organizations no longer need to be in the data-storage business; they can outsource this aspect of their operations. Read about the pros and cons of cloud computing.
Clustering
An essential machine learning technique for analyzing big data. Sometimes referred to as "cluster analysis," it is the task of grouping a set of objects together in a way that differentiates them from other groups. It may be used, for example, to find certain "types" of customers and identify their commonalities and needs.
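A rough sense of how clustering works can be sketched with a toy one-dimensional k-means in pure Python (the spend figures and starting centers below are invented for illustration; real workloads would use a library such as scikit-learn):

```python
def kmeans_1d(points, centers, rounds=10):
    """A toy 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(rounds):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

# Two obvious "types" of customers by annual spend (hypothetical numbers)
spend = [10, 12, 11, 95, 100, 98]
print(kmeans_1d(spend, centers=[0, 50]))  # two centers, one per customer type
```

The algorithm converges on one center near the low spenders and one near the high spenders, separating the two groups.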
Dashboard
An information management tool that assists with visually tracking and displaying information, metrics, key performance indicators, and key data points to monitor processes, individuals, data quality, or important business areas.
Data Cleansing
The process of detecting and removing, tagging, or correcting inaccurate, invalid, or corrupt records from a database.
Data Lake
A storage repository that holds raw data until it is needed. It can include unstructured data, semi-structured data, relational databases, and binary data.
Data-Driven Decision Making
Using data, rather than intuition alone, to support making decisions.
data.geotab.com
A Geotab website offering free smart city and intelligence data to users. (Get everything you need to know about data.geotab.com in this blog post.)
Data Governance
A set of rules and processes that ensure data quality, consistency, integrity, and security over time.
Data Warehouse
A system used to store data for analysis and reporting.
Data Mart
A subset of a data warehouse that provides data to a specific group of users, such as a single department or line of business.
Data Mining
The analysis step in the "knowledge discovery in databases" process, it involves sorting raw data into information that can be used to solve problems and identify patterns. The aim of data mining is to extract information from a data set and convert it into a more understandable structure for future use.
Data Model
A structure that defines the organization of data in a database system.
Data Science
A discipline that uses algorithms, processes, scientific methods and systems to gain insight and knowledge from data in different forms. This field usually incorporates data visualization, data mining, statistics, machine learning and programming to solve complicated problems using big data.
Data Security
The practice of protecting data from destruction, corruption, or unauthorized access.
Data Visualization
A visual abstraction of data that uses plots, information graphics and statistical graphics to communicate information effectively. Read more on data visualization here.
Distributed Computing
A network that enables the same application to be run on multiple computers. The term can also refer to running multiple computers in parallel to execute a data processing pipeline or algorithm.
Distributed File System
A system that gives clients access to data stored on a server. It is often used to share files and information among users in a controlled and authorized manner.
Extract, Transform, and Load (ETL)
Three functions combined and used in data warehousing to prepare data for analytics or reporting.
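A minimal Python sketch of the three steps, using an invented CSV snippet and the standard library's csv and sqlite3 modules:

```python
import csv, io, sqlite3

# Extract: read raw rows from a CSV source (an in-memory file here)
raw = io.StringIO("name,salary\nAda,100000\nBob,\nCy,80000\n")
rows = list(csv.DictReader(raw))

# Transform: drop incomplete records and cast the salary to an integer
clean = [(r["name"], int(r["salary"])) for r in rows if r["salary"]]

# Load: insert the prepared rows into the warehouse table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staff (name TEXT, salary INTEGER)")
db.executemany("INSERT INTO staff VALUES (?, ?)", clean)

print(db.execute("SELECT COUNT(*), SUM(salary) FROM staff").fetchone())  # (2, 180000)
```

Bob's row is filtered out during the transform step, so only the two complete records reach the destination table.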
Graph Database
A database that uses graphs with edges and nodes to represent and store data. This allows records to be linked directly together so they can be retrieved with a single operation.
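The node-and-edge idea can be sketched with a plain Python adjacency list (the people and relationship labels below are hypothetical; real graph databases such as Neo4j add persistence, indexing, and a query language on top of this model):

```python
# A tiny adjacency-list sketch of the graph model: nodes plus labeled edges.
edges = {
    "alice": [("FRIEND_OF", "bob"), ("WORKS_AT", "geotab")],
    "bob":   [("WORKS_AT", "acme")],
}

def neighbors(node, label):
    """Follow edges directly from a node -- one operation, no table joins."""
    return [target for rel, target in edges.get(node, []) if rel == label]

print(neighbors("alice", "FRIEND_OF"))  # prints ['bob']
```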
Hadoop
Hadoop is an open source framework — or software platform — that allows for storing and analyzing vast quantities of data. Fun fact: The creator of “Hadoop” named the open source software after a stuffed yellow elephant toy belonging to his young son. This article describes when and when not to use Hadoop.
In-Memory Database
A database management system that relies on the system's main memory, rather than disk, for data storage. Because it avoids the latency of disk access, an in-memory database is much faster.
Internet of Things (IoT)
Gartner defines the Internet of Things (IoT) as “the network of physical objects that contain embedded technology to communicate and sense or interact with their internal states or the external environment.” No longer do computers and tablets exclusively generate data. In the near future, cars, refrigerators, wearable devices, and many other things will provide interesting insights.
See Also: Automotive IoT Is Disrupting the Car Rental Industry
Load Balancing
The process of distributing workload across a computer network or computer cluster to optimize performance.
Machine Learning
The study and practice of designing systems that can adjust, learn, and improve based on the data they are fed. Machine learning is commonly used to let a computer analyze data to “learn” what action to take when an event or specific pattern occurs. Examples include self-driving cars, Netflix's recommendation system, and the Facebook news feed. Go to the full explainer: What Is Machine Learning?
Metadata
This type of data gives information and further context about other data. Put simply, this is data about data. If you take a photo with your camera, the photo itself is data. The time, date, location, and other details of that photo represent the metadata.
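In Python, the operating system's record about a file illustrates the idea: the file's bytes are the data, while os.stat reports metadata about it (the throwaway file below is created just for the example):

```python
import os, tempfile

# The file's contents are the data; what the OS records about the
# file (size, timestamps, permissions) is metadata.
with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as f:
    f.write(b"\xff\xd8 fake image bytes")
    path = f.name

info = os.stat(path)
print(info.st_size)       # size in bytes -- data about the data
print(info.st_mtime > 0)  # last-modified timestamp exists
os.unlink(path)           # clean up the throwaway file
```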
NoSQL
Databases that are designed outside the widely used relational database management system model. For decades, users have written Structured Query Language (SQL) statements to extract, update, and create data from structured and related tables. While still enormously powerful, SQL doesn’t work nearly as well on large, messier, unstructured datasets. This is why NoSQL exists. Note that it stands for “not only SQL”.
Open Data
Data that anyone can access, use, or share without limitations or restrictions. Download our free white paper on open data and big data privacy.
Pattern Recognition
Machine learning that is focused on the classification, recognition, or labeling of an identified pattern.
Petabyte
A unit of data equal to 1,024 terabytes (roughly one million gigabytes).
Predictive Modeling
The process of developing a model and using statistics to predict a trend or the outcome of an event.
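As a tiny illustration, an ordinary least squares line fit in pure Python builds a model from past observations and uses it to forecast the next one (the data points below are invented):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, the simplest predictive model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx  # slope and intercept

# Fit a trend on past observations, then predict the next value
xs, ys = [1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1]
a, b = fit_line(xs, ys)
print(a * 5 + b)  # forecast for x = 5
```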
Query Optimization
A process used in databases to make query execution as efficient and fast as possible. This is important because it improves the overall performance of query processing, which speeds up database functions, data analysis, and reporting tools.
Real-Time Data
Data that is delivered and presented immediately after it is acquired.
Text Analytics
The application of linguistic, statistical, and machine learning techniques to text-based sources to uncover the insight or meaning behind them. With structured data, one can easily determine figures such as average sales or maximum salaries; these techniques make similar analysis possible for unstructured text.
The 3 Vs
A term originally coined by Gartner’s Doug Laney to describe how data today streams at us with increasing velocity (speed of data processing), variety (types of data), and volume (amount of data).
Transactional Data
In a data management context, transactional data describes the information that results from transactions. Note: transactional data always has a time dimension.
Supervised Learning
The machine learning task of learning a function that maps inputs to outputs based on example input-output pairs. By providing the algorithm with “training” data where the result is known, it infers the function that describes the input-result relationship, so the output for new inputs can be predicted.
Scalability
The capability of a network, system, or process to maintain or improve performance as the workload increases. A system is usually considered scalable when it can increase its total output under an increased load as resources such as hardware are added. Scalability is a key characteristic of Geotab’s software development kit.
Spatial Analysis
The process of analyzing spatial data to study entities by their geographic, geometric, and topological properties.
Unstructured Data
Data that has no identifiable model or structure. Unlike its structured counterpart, unstructured data is messier: think photos, videos, emails, audio, and so on. If you can analyze it in Excel, the data is probably not unstructured.
Unsupervised Learning
The machine learning task of inferring a function that finds hidden structure in unlabeled data (data that has not been categorized or classified).
ZooKeeper
An open source software project that provides a naming registry and centralized configuration services for large distributed systems.
Ready to geek out on more terminology? Go to our telematics glossary here.
To stay updated on more stories about big data, please subscribe to the blog!
If you liked this post, let us know!