AI: Glossary
General AI
Algorithm: A set of instructions that a computer follows to solve a problem or perform a task.
AlphaGo: A computer program developed by DeepMind that defeated Go champion Lee Sedol in 2016. AlphaGo is a widely cited example of the progress made in AI in recent years.
Artificial Intelligence: The simulation of human intelligence in machines that are programmed to think and learn like humans.
Autonomous: Able to operate independently without human input or intervention.
Bayesian network: A probabilistic graphical model that represents a set of variables and their conditional dependencies as a directed acyclic graph, which makes it possible to reason under uncertainty. Bayesian networks are used in a variety of applications, including medical diagnosis, fraud detection, and natural language processing.
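As a rough illustration of reasoning with a Bayesian network, the sketch below uses a hypothetical two-node network (Disease -> TestResult) and applies Bayes' rule to update a belief after a positive test. All probabilities and names are made-up assumptions for the example, not values from any real study.

```python
# Hypothetical two-node network: Disease -> TestResult.
# Every probability below is an invented illustration value.
P_DISEASE = 0.01             # prior P(disease)
P_POS_GIVEN_DISEASE = 0.95   # P(positive test | disease)
P_POS_GIVEN_HEALTHY = 0.05   # P(positive test | no disease)


def posterior_disease_given_positive() -> float:
    """Apply Bayes' rule to get P(disease | positive test)."""
    joint_disease = P_DISEASE * P_POS_GIVEN_DISEASE
    joint_healthy = (1 - P_DISEASE) * P_POS_GIVEN_HEALTHY
    return joint_disease / (joint_disease + joint_healthy)


print(round(posterior_disease_given_positive(), 3))  # 0.161
```

Even with a fairly accurate test, the low prior keeps the posterior modest, which is the kind of inference these models make explicit.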
Bias: A tendency to favour one outcome or group of outcomes over another. In AI, bias can occur when the training data is not representative of the real world.
Bias-Variance Trade-off: The balance between overfitting and underfitting in ML models. Bias refers to errors caused by overly simple models, while variance refers to errors caused by overly complex models. Achieving the right balance is crucial for optimal model performance.
Big Data: Large and complex data sets that cannot be easily managed, processed, or analysed using traditional methods.
Chatbot: A computer program designed to simulate conversation with human users, typically through text-based interactions.
Convolutional neural network (CNN): A type of neural network that is commonly used for image recognition. CNNs are able to learn to identify patterns in images, which makes them well-suited for tasks such as facial recognition and object detection.
Deep Learning: A subset of machine learning that uses artificial neural networks to model and understand complex patterns and relationships in data.
Deep reinforcement learning: A type of reinforcement learning that uses deep learning techniques to train agents. Deep reinforcement learning has been used to achieve impressive results in a variety of domains, including game playing and robotics.
Emergent behaviour: Unexpected or unintended abilities that arise in a complex system. In AI, emergent behaviour can occur when a machine learning model is trained on a large dataset.
Emotion recognition: The ability to identify and understand the emotions of others. Emotion recognition is a challenging task, but it is becoming increasingly important as AI systems are being used in more and more social applications.
Expert System: A computer system that emulates the decision-making ability of a human expert in a specific domain.
Fuzzy Logic: A mathematical logic that deals with reasoning that is approximate rather than precise.
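One common way to express approximate reasoning is a membership function that returns a degree of truth between 0 and 1 instead of a strict true or false. The sketch below assumes a triangular membership function and a hypothetical fuzzy set "warm" (temperatures in degrees Celsius); the function name and thresholds are illustrative only.

```python
def triangular_membership(x: float, low: float, peak: float, high: float) -> float:
    """Degree (0.0 to 1.0) to which x belongs to a triangular fuzzy set.

    Assumes low < peak < high.
    """
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)   # rising edge
    return (high - x) / (high - peak)     # falling edge


# Hypothetical fuzzy set "warm": starts at 15, fully warm at 20, ends at 25.
for temp in (10, 17.5, 20, 24):
    print(temp, triangular_membership(temp, 15, 20, 25))
```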
Generalization: The ability of an AI model to perform well on unseen data or data from a different distribution than the training data. Generalization is a key goal in ML to ensure that models can make accurate predictions in real-world scenarios.
Genetic Algorithm: A search algorithm inspired by the process of natural selection that is used to find optimal solutions to complex problems.
Heuristic: A problem-solving approach that uses rules of thumb or approximate methods to find solutions.
Hierarchical reinforcement learning: A type of reinforcement learning that decomposes a task into a hierarchy of subtasks, with higher-level policies choosing goals for lower-level ones. Hierarchical reinforcement learning has been shown to be effective for tasks that require long-term planning.
Inference: The process of using a trained AI model to make predictions or decisions on new, unseen data.
Intelligent Agent: A software program that can perform tasks autonomously and make decisions based on its environment and goals.
Jupyter Notebook: An open-source web application that allows users to create and share documents containing live code, equations, visualizations, and narrative text.
Knowledge Base: A repository of information and data that is used by an intelligent system to make decisions and solve problems.
Machine Learning: A branch of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed.
Natural Language Processing: The ability of a computer program to understand and interpret human language in a way that is meaningful and useful.
Natural language understanding (NLU): The ability to understand the meaning of natural language. NLU is a key component of many AI systems, such as chatbots and virtual assistants.
Neural Network: A computational model inspired by the structure and function of the human brain, consisting of interconnected nodes or "neurons" that process and transmit information.
Neural Network Architectures: The structure and organization of artificial neural networks, such as feedforward networks, recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers. Each architecture is designed to solve specific types of problems.
Ontology: A formal representation of knowledge that defines the concepts, relationships, and properties within a specific domain.
Overfitting: A problem that occurs in machine learning when a model learns the training data too well and is unable to generalize to new data.
Pattern recognition: The ability to identify patterns in data. In AI, pattern recognition is used for tasks such as image recognition and natural language processing.
Predictive Analytics: The use of statistical techniques and data mining to analyse current and historical data in order to make predictions about future events or outcomes.
Prescriptive analytics: The use of data to recommend actions. In AI, prescriptive analytics is used to optimise business processes, allocate resources, and make decisions.
Recurrent neural network (RNN): A type of neural network that is able to process sequential data. RNNs are used in a variety of applications, including speech recognition, machine translation, and natural language generation.
Reinforcement Learning: A type of machine learning where an agent learns to make decisions and take actions in an environment to maximize a reward signal.
Supervised Learning: A type of machine learning where a model is trained on labelled data to make predictions or classifications.
TensorFlow: An open-source machine learning framework developed by Google that provides a comprehensive ecosystem of tools, libraries, and resources for building and deploying machine learning models.
Training Data: The data used to train an AI model. It consists of input features and corresponding target outputs or labels, which the model uses to learn patterns and make predictions.
Transfer learning: A technique that allows a machine learning model to be trained on one task and then applied to another task. Transfer learning can be used to improve the performance of machine learning models on new tasks.
Unstructured data: Data that does not have a predefined format. Unstructured data can be text, images, audio, or video.
Unsupervised Learning: A type of machine learning where a model is trained on unlabelled data to discover patterns, relationships, and structures.
Virtual assistant: A software program that can perform tasks or answer questions on behalf of a user. Virtual assistants are often used to control smart home devices, book appointments, and provide customer service.
Virtual Reality: A computer-generated simulation of a three-dimensional environment that can be interacted with and explored by a user.
Watson: A question-answering computer system developed by IBM, known for winning the quiz show Jeopardy! in 2011. IBM has since applied the Watson brand to a range of AI services, including question answering, machine translation, and medical diagnosis.
Weak AI: Artificial intelligence that is designed to perform specific tasks and simulate human intelligence within a limited scope.
XML: Extensible Markup Language, a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
YAML: YAML Ain't Markup Language (originally Yet Another Markup Language), a human-readable data serialization format (for data rather than documents) that is often used for configuration files and data exchange between languages.
Zero-shot Learning: A type of machine learning where a model recognizes and classifies objects or concepts from classes that did not appear in its training data, typically by relying on auxiliary information such as textual descriptions.
AI Chipsets
AI chipset: A silicon-based chipset designed to accelerate AI workloads, offering high computing capability to devices such as laptops, smart wearables, and smartphones.
Application-specific integrated circuit (ASIC): A type of chip that is designed for a specific application. ASICs are often used for AI applications because they can be optimized for the specific task at hand.
Central processing unit (CPU): The main processing unit of a computer. CPUs are responsible for executing instructions and performing calculations. Because CPUs are general-purpose and have relatively few cores, they are typically less efficient than GPUs or TPUs at the highly parallel computations common in AI workloads.
Field-programmable gate array (FPGA): A type of chip that can be programmed to perform different tasks. FPGAs are often used for AI applications because they can be reconfigured to meet the needs of different applications.
Graphics processing unit (GPU): A type of chip that is designed for graphics processing. GPUs are very efficient at performing matrix multiplication, which is a key operation in many AI algorithms. GPUs are becoming increasingly popular for AI applications because they offer a significant performance boost over CPUs.
Neuromorphic chip: A type of chip that is inspired by the human brain. Neuromorphic chips are designed to mimic the way that neurons in the brain process information. Neuromorphic chips are still in development, but they have the potential to revolutionize AI.
Tensor processing unit (TPU): A type of chip that is designed for machine learning. TPUs are very efficient at performing matrix multiplication, which is a key operation in many machine learning algorithms. TPUs are developed by Google and are used in their AI products such as Google Cloud Platform and Google Search.
Data Annotation
Data annotation: The process of labeling or tagging data with relevant metadata to make it easier for machines to understand and interpret.
Label: A piece of metadata that is attached to a piece of data to indicate its meaning. For example, a label might be "cat" or "dog" for an image of an animal.
Metadata: Data that describes other data. Metadata can be used to store information about the data, such as its source, format, and meaning.
Machine learning: A type of artificial intelligence that allows machines to learn without being explicitly programmed. Machine learning models are trained on data, and they learn to identify patterns in the data.
Natural language processing (NLP): A field of computer science that deals with the interaction between computers and human (natural) languages. NLP is often used for data annotation, as it can be used to identify and extract relevant information from text data.
Object detection: The task of identifying and locating objects in an image or video. Object detection is a common task in data annotation, as it is often necessary to identify objects in images and videos before they can be labeled.
Segmentation: The task of dividing an image or video into different segments. Segmentation is often used in data annotation, as it can be used to identify different parts of an image or video.
Supervised learning: A type of machine learning where the model is trained on labeled data. The labels tell the model what the desired output should be for a given input. Supervised learning is the most common type of machine learning used for data annotation.
Unsupervised learning: A type of machine learning where the model is trained on unlabeled data. The model learns to identify patterns in the data without being told what the desired output should be. Unsupervised learning is less common than supervised learning for data annotation, but it can be used for tasks such as clustering and anomaly detection.
Synthetic Data
Synthetic data: Data that is artificially generated rather than collected from the real world. Synthetic data can be used to train machine learning models, test software, and explore new ideas.
Data generation: The process of creating synthetic data. Data generation can be done using a variety of techniques, such as sampling, simulation, and machine learning.
Data augmentation: A technique for increasing the size and diversity of a dataset by creating new data points from existing data points. Data augmentation can be used to improve the performance of machine learning models.
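A minimal sketch of data augmentation, assuming images are stored as nested lists of pixel values and that a horizontal flip preserves the label (true for many object classes, but not for text or digits). The function names and the tiny dataset are hypothetical.

```python
def flip_horizontal(image):
    """Return a mirrored copy of an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def augment(dataset):
    """Return each (image, label) pair plus a flipped copy with the same label."""
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((flip_horizontal(image), label))
    return augmented


# Hypothetical 2x2 "images" with labels.
dataset = [([[1, 2], [3, 4]], "cat"), ([[5, 6], [7, 8]], "dog")]
print(len(augment(dataset)))  # 4
```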
Privacy-preserving data generation: A technique for generating synthetic data that protects the privacy of individuals. Privacy-preserving data generation can be used to create synthetic data that can be used for machine learning without compromising the privacy of individuals.
Sampling: A technique for creating synthetic data by randomly selecting data points from a real-world dataset. Sampling can be used to create synthetic data that is representative of the real world.
Simulation: A technique for creating synthetic data by simulating the real world. Simulation can be used to create synthetic data that is not possible to collect from the real world.
Machine learning: A type of artificial intelligence that allows machines to learn without being explicitly programmed. Machine learning models can be used to generate synthetic data that is similar to real-world data.
Data De-Identification
Data de-identification: The process of removing or altering personal information from data so that it can no longer be used to identify individuals.
Personally identifiable information (PII): Information that can be used to identify an individual, such as their name, address, phone number, or social security number.
De-identification techniques: There are a variety of techniques that can be used to de-identify data, such as:
- Redaction: This involves removing PII from the data.
- Generalization: This involves replacing specific values with more general ones, such as an age range instead of an exact age or a region instead of a full zip code.
- Suppression: This involves removing entire records that contain PII.
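The three techniques above can be sketched on simple dictionary records. This is an illustrative example only: the field names, the ten-year age bands, and the rule used for suppression are assumptions, and real de-identification needs a proper re-identification risk assessment.

```python
def redact(record, fields):
    """Redaction: return a copy of the record without the listed PII fields."""
    return {key: value for key, value in record.items() if key not in fields}


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress(records, predicate):
    """Suppression: drop whole records for which predicate returns True."""
    return [record for record in records if not predicate(record)]


# Hypothetical records; names and values are invented.
records = [
    {"name": "Alice Example", "age": 34, "zip": "90210", "diagnosis": "flu"},
    {"name": "Bob Example", "age": 97, "zip": "10001", "diagnosis": "cold"},
]

# Assumed rule: very high ages are rare, so those records are suppressed.
kept = suppress(records, lambda record: record["age"] >= 90)
deidentified = [
    {**redact(record, {"name", "age"}), "age_range": generalize_age(record["age"])}
    for record in kept
]
print(deidentified)
```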
Data de-identification challenges: There are a number of challenges associated with data de-identification, such as:
- The risk of re-identification: Even if data is de-identified, it is possible to re-identify individuals by combining it with other data sources.
- The loss of data utility: De-identification can sometimes lead to the loss of data utility, as the data may no longer be as useful for research or analysis.
Data de-identification standards: There are a number of standards that can be used to guide the de-identification of data, such as:
- HIPAA: The Health Insurance Portability and Accountability Act (HIPAA) sets standards for the de-identification of health data.
- NIST: The National Institute of Standards and Technology (NIST) has published guidelines for the de-identification of data.
Data Quality & Observability
Data quality: The degree to which data meets the requirements of its intended use. Data quality is often expressed in terms of accuracy, completeness, consistency, timeliness, validity, and uniqueness.
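Two of the dimensions listed above, completeness and uniqueness, can be approximated with simple ratios. The sketch below is one possible way to compute them over a list of dictionaries; the function names, the sample rows, and the exact metric definitions are assumptions rather than a standard.

```python
def completeness(rows, field):
    """Fraction of rows in which the field is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)


def uniqueness(rows, field):
    """Fraction of the non-missing values of the field that are distinct."""
    values = [row[field] for row in rows if row.get(field) is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical rows: one missing email and one duplicated email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
    {"id": 4, "email": "b@example.com"},
]
print(completeness(rows, "email"))  # 0.75
print(uniqueness(rows, "email"))
```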
Data observability: The ability to understand and monitor the health and performance of data in a system. Data observability is often described as the ability to answer the following questions about data:
- What data do I have?
- Where is my data?
- What is the quality of my data?
- How is my data being used?
Data profiling: The process of collecting information about data, such as its structure, format, and content. Data profiling can be used to assess the quality of data and to identify potential problems.
Data cleansing: The process of identifying and correcting errors in data. Data cleansing can be a manual or automated process.
Data enrichment: The process of adding additional information to data. Data enrichment can be used to improve the quality of data and to make it more useful.
Data monitoring: The process of tracking the changes in data over time. Data monitoring can be used to identify problems with data quality and to ensure that data is being used in accordance with its intended purpose.
Root cause analysis: The process of identifying the underlying causes of a problem. Root cause analysis can be used to prevent problems from recurring.
Data lineage: The tracking of the history of data as it moves through a system. Data lineage can be used to understand how data is being used and to troubleshoot problems.
Version Control & Experiment Tracking
Version control: A system for tracking changes to files over time. Version control systems allow you to revert to previous versions of files, compare different versions of files, and track who made changes to files.
Experiment tracking: The process of tracking the changes made to an experiment over time. Experiment tracking systems allow you to track the parameters of an experiment, the results of an experiment, and the changes made to an experiment.
Git: A popular version control system. Git is a distributed version control system, which means that it does not require a central server. Git is often used for software development, but it can be used for any type of project.
GitHub: A popular online hosting service for Git repositories. GitHub makes it easy to collaborate on projects with others.
DVC: A data version control system. DVC is a command-line tool that allows you to track the changes made to data over time. DVC can be used to track data files, code files, and experiment results.
MLflow: A machine learning experiment tracking system. MLflow is a platform for managing the entire machine learning lifecycle, from experimentation to production. MLflow can be used to track the parameters of experiments, the results of experiments, and the changes made to experiments.
Model Validation & Monitoring
Model validation: The process of evaluating a model to ensure that it is performing as expected. Model validation can be done using a variety of techniques, such as:
- Holdout sets: A holdout set is a set of data that is not used to train the model. The holdout set is used to evaluate the model's performance on unseen data.
- Cross-validation: Cross-validation is a technique for evaluating a model by dividing the data into multiple folds. The model is trained on all folds but one and evaluated on the held-out fold; this is repeated so that each fold serves once as the evaluation set, and the results are averaged.
- Bootstrapping: Bootstrapping is a technique for evaluating a model by randomly sampling the data with replacement. The model is trained on each bootstrapped sample and typically evaluated on the data points that were not drawn into that sample (the out-of-bag points).
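The fold-splitting step of k-fold cross-validation can be sketched as follows. This is a minimal illustration that works on index positions only and does not shuffle the data first (in practice shuffling is usually advisable); the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

Each index lands in exactly one test fold, so every data point is used for evaluation once and for training k - 1 times.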
Model monitoring: The process of tracking the performance of a model over time. Model monitoring can be used to identify problems with the model, such as degrading accuracy or a shift in the input data (data drift).
Overfitting: A problem that occurs when a model learns the training data too well and is unable to generalize to new data.
Underfitting: A problem that occurs when a model does not learn the training data well enough and is unable to make accurate predictions.
Bias: A tendency for a model to favor one outcome or group of outcomes over another.
Variance: A measure of how much the model's predictions change when it is trained on different samples of the training data. High variance is associated with overfitting.
Machine Learning Platforms
Machine learning platform: A software platform that provides the tools and infrastructure needed to build, deploy, and manage machine learning models.
Cloud-based machine learning platform: A machine learning platform that is hosted in the cloud. Cloud-based machine learning platforms offer a number of advantages, such as scalability, flexibility, and cost-effectiveness.
On-premises machine learning platform: A machine learning platform that is hosted on-premises. On-premises machine learning platforms offer a number of advantages, such as control and security.
Open-source machine learning platform: A machine learning platform that is open source. Open-source machine learning platforms offer a number of advantages, such as flexibility and cost-effectiveness.
Proprietary machine learning platform: A machine learning platform that is proprietary. Proprietary machine learning platforms offer a number of advantages, such as support and integration with other products.
Machine Learning Deployment
Machine learning deployment: The process of making a machine learning model available to users so that it can be used to make predictions.
Model serving: The process of hosting a trained machine learning model behind an interface, such as an API, so that it can respond to prediction requests from applications or users.
Model containerization: The process of packaging a machine learning model into a container so that it can be deployed to a variety of environments.
Model monitoring: The process of tracking the performance of a machine learning model in production.
Model retraining: The process of updating a machine learning model with new data so that it can improve its performance.
Model rollback: The process of reverting to a previous version of a machine learning model if the current version is not performing well.
Model versioning: The process of tracking the different versions of a machine learning model so that it can be easily rolled back if necessary.
Model pipeline: A set of steps that are used to deploy a machine learning model to production.
Continuous integration/continuous delivery (CI/CD): A set of practices for automatically building, testing, and releasing software changes. In machine learning, CI/CD is used to automate the testing and deployment of models to production.
Resource Optimisation
Resource optimization: The process of maximizing the efficiency of resources such as computing power, memory, and storage.
Resource allocation: The process of assigning resources to tasks or workloads.
Resource scheduling: The process of determining when and how resources are used.
Resource virtualization: The process of abstracting resources from the physical hardware so that they can be shared.
Resource monitoring: The process of tracking the usage of resources so that they can be optimized.
Load balancing: Load balancing distributes workload across multiple resources to improve performance and reliability.
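One of the simplest load-balancing strategies is round-robin, where tasks are handed to workers in a fixed rotating order. The sketch below assumes that strategy; the class and worker names are hypothetical, and real balancers usually also account for worker health and current load.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Assign each incoming task to the next worker in a repeating rotation."""

    def __init__(self, workers):
        if not workers:
            raise ValueError("at least one worker is required")
        self._rotation = cycle(workers)

    def assign(self, task):
        """Return a (worker, task) pair for the next worker in the rotation."""
        return next(self._rotation), task


balancer = RoundRobinBalancer(["worker-a", "worker-b"])
for task_id in range(4):
    print(balancer.assign(task_id))
```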
Caching: Caching stores frequently used data in memory so that it can be accessed more quickly.
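As a small illustration of in-memory caching, the sketch below uses Python's standard `functools.lru_cache` decorator to avoid recomputing a result. The function `slow_square` and the counter are hypothetical stand-ins for an expensive computation.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body actually runs


@lru_cache(maxsize=128)
def slow_square(n):
    """Stand-in for an expensive computation; results are cached by argument."""
    global call_count
    call_count += 1
    return n * n


slow_square(12)   # computed and stored in the cache
slow_square(12)   # served from the cache, body does not run again
print(call_count)                      # 1
print(slow_square.cache_info().hits)   # 1
```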
Pre-fetching: Pre-fetching retrieves data in advance of when it is needed to improve performance.
Compression: Compression reduces the size of data so that it can be stored and transmitted more efficiently.
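A brief sketch of lossless compression using Python's standard `zlib` module. The sample text is an arbitrary, highly repetitive string chosen so the size reduction is easy to see; real data will compress less well.

```python
import zlib

# Repetitive sample data compresses well; this input is purely illustrative.
text = ("the quick brown fox " * 50).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), len(compressed))  # original size vs. compressed size in bytes
print(restored == text)            # True: lossless round trip
```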
Deduplication: Deduplication removes duplicate data so that it does not take up unnecessary space.
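Deduplication is often implemented by comparing content hashes rather than the data itself. The sketch below assumes that approach, using SHA-256 from the standard `hashlib` module; the function name and the sample chunks are hypothetical.

```python
import hashlib


def deduplicate(chunks):
    """Return the chunks with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


print(deduplicate([b"alpha", b"beta", b"alpha"]))  # [b'alpha', b'beta']
```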
Computer Vision
Computer vision: A field of artificial intelligence that gives computers the ability to see and understand the world around them.
Image recognition: The ability to identify objects in images.
Object detection: The ability to locate and identify objects in images.
Face recognition: The ability to identify or verify a person from their face in an image, as opposed to face detection, which only locates faces.
Scene understanding: The ability to understand the context of an image, such as the objects in the scene and their relationships to each other.
Gesture recognition: The ability to recognize human gestures from images or videos.
Object tracking: The ability to track the movement of objects in images or videos.
Visual search: The ability to search for images or videos that match a given query.
Natural Language Processing
Natural language processing (NLP): A field of computer science that gives computers the ability to understand and process human language.
Corpus: A collection of documents used in a natural language processing system, commonly for benchmarking and comparing natural language processing models.
Tokenization: The process of breaking down text into tokens, which may be words, subwords, characters, or phrases.
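A minimal word-level tokenizer can be sketched with a regular expression. The rule used here (lowercase runs of letters, digits, and apostrophes) is an assumption for illustration; production tokenizers handle punctuation, Unicode, and subwords far more carefully.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


print(tokenize("The cat's looking, looked!"))
# ['the', "cat's", 'looking', 'looked']
```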
Stemming: The process of reducing a word to its root form, usually by stripping suffixes. For example, "looking" and "looked" both stem to the root "look".
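The idea can be shown with a deliberately naive suffix stripper. This is a toy sketch, not the Porter algorithm or any library's stemmer; the suffix list and the minimum stem length of three characters are arbitrary assumptions.

```python
def naive_stem(word):
    """Strip one common English suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ("looking", "looked", "looks", "look"):
    print(word, "->", naive_stem(word))
```

All four example words reduce to "look", while short words such as "red" are left unchanged by the length check.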
Inverted index: A data structure used in information retrieval and search systems where tokens are mapped to documents to allow for performant search over large collections of documents.
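A small sketch of an inverted index built with a dictionary of sets. It assumes whitespace tokenization and an AND-style query in which every query token must appear in a document; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


documents = {1: "the cat sat", 2: "the dog sat", 3: "a cat ran"}
index = build_inverted_index(documents)
print(search(index, "cat"))      # {1, 3}
print(search(index, "cat sat"))  # {1}
```

Because lookups go from token to documents, a query touches only the entries for its own tokens instead of scanning every document.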
Indexing: The process of analyzing one or more documents, extracting key terms, and storing them in an index to allow for performant search over large collections of documents.
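Lemmatization: The process of reducing the inflected forms of a word to its dictionary form, or lemma, so that they can be treated as a single item. For example, "better" resolves to "good" and "ran" resolves to "run".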
Part-of-speech tagging: The process of identifying the part of speech of each word in a sentence.
Named entity recognition: The process of identifying named entities in text, such as people, places, and organizations.
Semantic parsing: The process of converting natural language text into a formal representation of its meaning, such as a logical form or a database query.
Machine translation: The process of translating text from one language to another.