Artificial Intelligence (AI) Learning Centre

Last Updated: May 1, 2025

No Results Found 0/0

What’s new

Cheat Sheets on Bias and Hallucinations in AI Scribes
AI Scribes: Know the legal and privacy risks (OMD)

The CEP’s AI Learning Centre is designed to help practitioners understand how AI will impact their work and build confidence in using AI effectively and ethically in clinical practice.

Key messages

Expect change and safely embrace experimentation

Innovation cycles are faster for AI tools, and machine learning tools improve independently and with the help of developers and consumers. As capabilities and limitations evolve, so do the associated risks and mitigation strategies. Stay vigilant and prioritize safety when integrating AI into practice.

Take a balanced approach considering benefits and risks

Understanding the limitations and potential risks of AI tools is crucial. Not all aspects of care will benefit from AI, and traditional methods may still be preferable in many situations. Critically evaluate AI tools before use, prioritizing safety and assessing risk, to ensure they are appropriate for your practice.

Ensure professional obligations are met

As AI expands into primary care, practitioners’ responsibilities remain the same: maintaining accuracy and accountability of health records, protecting patient data privacy, and ensuring informed patient consent. Practitioners should stay alert and verify that all AI applications comply with established privacy and security standards.

A note on AI regulation

AI is not regulated in Canada, and there are no current plans for proposed regulations. Some health regulations do apply to certain AI uses.

However, several non-binding principles are in various stages of development. The Voluntary Code of Conduct on the Responsible Development and Management of Advanced Generative AI Systems (Government of Canada) is already in effect. Meanwhile, the Trustworthy AI Framework (Government of Ontario) is still under development.

Expand All

Cheat Sheets New

Putting the “Informed” in Informed Consent: Outlining the risks of AI to your patients

Informed Use: Understanding AI Scribes and Safe Implementation

Fact & Fiction: AI-Generated Hallucinations

Built-In Bias: How Training Data Shapes AI Scribe Outputs

When AI Gets It Wrong: Hallucinations in AI Scribes

AI Tools in Primary Care

AI in the digital health spectrum

In the realm of digital health, algorithms have long been integral to tools supporting clinical decision-making. The integration of AI brings both opportunities and challenges:

AI tools offer advanced capabilities and adaptability, pushing the boundaries of what’s possible in healthcare technology. However, they also present unique challenges, particularly around transparency and potential biases.
In contrast, traditional non-AI digital health tools provide more predictable and transparent performance but lack the flexibility and depth of AI solutions.

What tools should I use?

Tools designed for healthcare are recommended for use in clinical workflow over generic tools. Custom pre-training and fine-tuning will improve performance for specialized environments such as healthcare. However, this is a nuanced topic. It has been shown that larger generic models trained on more data may have better performance than boutique healthcare-specific models trained on smaller data sets.

In comparable models, fine-tuning can improve performance over generic models in several ways:

How design impacts performance

Physical environment

Healthcare AI tools are tailored to operate in busy healthcare environments.

Generalized AI tools may not function well in busy or unpredictable environments – with varied lighting, many people talking or moving, and background noise.

Healthcare terminology

Healthcare AI tools are trained to recognize specialized healthcare terminology, non-English languages and accents.

Generalized AI tools may struggle with medical terms, accents, dialects, or ways of speaking not part of their training data.

Understand context

Healthcare AI tools are trained to interpret conversations and data products typical in healthcare settings.

Generalized AI tools may struggle with fragmented or non-linear conversations and healthcare-specific data formats.

Optimize clinical accuracy

Healthcare AI tools prioritize reducing errors that impact clinical accuracy and patient safety.

Generalized AI tools typically treat all errors equally, without considering the critical nature of healthcare information.

Regulatory compliance

Some healthcare AI tools may be developed with healthcare regulations in mind to ensure data privacy and security.

Generalized AI tools may not inherently comply with healthcare-specific regulatory requirements.

Accuracy and performance

Reliable, up-to-date information about accuracy rates for specific AI-powered tools is scarce. Often accuracy rate information is included within promotional materials produced by the product developers, and lack of transparency makes it difficult to assess validity of advertised performance (HAI, 2024).

At this time, the best way to assess the accuracy of an AI tool in clinical use is to: 

Use a well-known product advertised specifically for use in clinical documentation.  
Solicit information on accuracy from colleagues who are using the product
Try the product in practice and do your own assessment of accuracy.

Current limitations in assessing performance of AI models

As AI tools become increasingly prevalent in healthcare, practitioners should be aware of several key challenges in evaluating their true capabilities and limitations.

Training Bias: AI models may be trained on standardized test data, skewing test results. For example, high MCAT scores of an LLM might reflect exposure to practice tests in training, not true medical understanding.
Lack of Transparency: Most private companies developing foundation models do not disclose training data. It’s difficult to trust an AI tool’s recommendations without knowing what it was trained on.
Inconsistent Standards: There are no standards for AI training or performance in healthcare. Comparisons between tools are unreliable, which can make selection of a product challenging.
“Overfitting”: Overfitting refers to when an AI model is over-trained on its data set and has difficulty extrapolating to new data or information. This results in AI models that excel in test scenarios but falter with real patient complexities.

What can I do right now?

Inform yourself. Spend some time reading about AI in healthcare (see Resources), attend webinars, and talk with colleagues who are using AI-powered tools to get a sense of the benefits and limitations.
Get help with the fine print. Most AI tools available right now are not regulated by Health Canada. The OMA’s legal department will review any AI vendor contract submitted to legal.affairs@oma.org, and flag any potential contractual issues.
Start low. Starting with low-risk, non-clinical tasks is a good approach to gain an understanding and confidence in how AI-powered tools function in clinical workflow.

Though use of AI in primary care is evolving, practitioners have medico-legal responsibilities when it comes to the use of AI in clinical practice. For a current, detailed overview of issues related to privacy, bias, and consent, see the Navigating AI in healthcare webinar (60 min – MainPro+ certified) (CMPA, June 4, 2024), or AI Webinar—Important takeaways (CMPA, 2024).

Risk spectrum

Tasks in primary care range from low-risk administrative functions to high-risk clinical decision support, with risk levels generally correlating to the degree of clinical involvement. As risk increases, so should the level of management, mitigation, and oversight, including stricter protocols, mandatory clinician review, regular audits, and clear escalation procedures.

Currently, there are no explainable/interpretable products that meet the specific needs of healthcare – see Explainability, interpretability, and the challenge of the “black box”. Given that, the ability to validate outputs and correct errors simply and easily is paramount. For examples, see AI scribes (easy to validate and correct), or AI in CDSS (difficult to validate and correct).

Varied risk levels in AI clinical assistant products

Private clinical software companies are producing comprehensive AI-powered clinic assistant suites that execute a range of tasks with varying degrees of risk. For safe implementation using clinical assistant suites, practitioners should:

Understand the capabilities and limitations of each component in the suite.
Evaluate each component individually for risk level and appropriate use.
Implement tailored strategies for different risk levels.
Regularly reassess tool performance and risk profiles.

AI in CDSS: Challenges and considerations

Use of non-explainable, non-transparent CDSS in clinical practice presents serious risks. Without insight into the underlying evidence or the ability to trace recommendation logic, use of one of these systems would compromise clinicians’ medico-legal responsibilities and jeopardize patient care.

Clinical decision support systems (CDSS) are typically categorized as either knowledge-based or non-knowledge based.

Knowledge-based CDSS function by employing rules (if-then statements). The system pulls data to assess the rule and generates a corresponding action or outcome. These rules can be derived from literature-based, practice-based, or patient-specific evidence.
Non-knowledge-based CDSS still require a data source but make decisions using AI, machine learning, or statistical pattern recognition instead of following predefined expert medical knowledge.

A significant challenge of non-knowledge-based CDSS is the “black box” effect. Deep learning models are highly complex, and even a model’s developers and engineers cannot trace precisely how the AI generated a response. Furthermore, many private companies are not transparent about training data, leaving users blind to even the information sources the AI is using to generate responses.

Assessing AI-CDSS products

Currently, explainable or transparent systems are not required by law. Users must rely solely on vendor-supplied information. Avoid use of non-explainable, non-interpretable AI-powered CDSS in clinical practice.

A note on evidence:

Many products advertise use of “best” or “highest quality” evidence. This as a standalone statement is not sufficient to engender trust. Confidence in evidence can be related to currency, context, curation, risk of bias, generalizability, and myriad other factors.  

Does a vendor get specific about which evidence is used, how it is selected, and how often it is updated? If the answer is NO, the risk of using such a tool in practice is high.

Is the evidence source a “walled garden,” or are there other inputs (such as an LLM) that could add an additional and possibly inaccurate layer of interpretation? If the answer is YES, the risk of using such a tool in practice is high.

Resources:

AI Scribes

AI Scribes: Know the legal and privacy risks (OMD)

Clinicians are increasingly using AI scribes to manage patient encounter documentation.

Before adopting AI scribes, you should:

Understand the risks and mitigation strategies.
Know relevant laws and your professional obligations.
Consider privacy implications, like patient consent.
Understand patient data usage.
Utilize free change management supports.

Visit OMD’s AI Knowledge Zone for legal and privacy expertise, resources, and change management support.

OMA support to review AI scribe vendor contracts

It may feel daunting to make the decision about whether to adopt an AI scribe in practice, particularly since they’re not currently regulated by Health Canada. But, you’re not alone: the OMA’s legal department will review any AI vendor contract submitted to legal.affairs@oma.org, and flag any potential contractual issues

Ontario primary care evaluation study

A recent AI scribe evaluation study commissioned by OntarioMD enlisted 150 Ontario primary care clinicians and nurse practitioners to trial 3 different AI scribes for 3 months, using the scribes both in day-to-day practice and in clinical lab simulations.

Key results and takeaways

Privacy and security: For the study, OntarioMD negotiated stricter privacy and security requirements than the selected (anonymized) vendors typically offer, illustrating a gap in the privacy/security offerings of the piloted products.
Documentation, administration, and cognitive load
- 69.5% reduction in documentation time per encounter when supported by a scribe, compared to documentation without a scribe.
- Average 3 hours/week reduction in time spent on administrative tasks with use of an AI scribe.
- Over 80% of NPs and close to 60% of FPs reported a reduction in after-hours work.
- Over 75% of practitioners reported use of an AI scribe reduced cognitive load during patient encounters
Patient perspectives: The study characterizes patient perceptions as mostly positive. Feedback included that many patients appreciated having an “objective, word-for-word transcription” of encounters, though some expressed discomfort at having sensitive topics recorded and transcribed.
Cost: While most participant clinicians saw value in the use of AI scribes, the cost of current market products was seen as prohibitive for many.

Clinical Evaluation of Artificial Intelligence and Automation Technology to Reduce Administrative Burden in Primary Care (Centre for Digital Health Evaluation, WIHV, July 31, 2024)

Examples of AI medical scribe outputs

Depending on the product, AI medical scribes generate valuable outputs for clinical workflow, including:

SOAP notes for EMRs
Referral letters and request for consult
Insurance documentation

Effective use: Practical guidance from clinicians

AI 101: Intro to AI and AI scribes (OMDEducates, March 2024)

Alleviating Administrative Burden Through AI (COVID-19 Community of Practice, May 2024)

Ethical Landscape

Ethical landscape

This section will be updated on an ongoing basis to track bias and potential mitigation strategies in AI-powered healthcare tools.

Biases in AI: Implications for primary care

AI models are influenced by biases from a variety of sources:

Limitations of model types
Biases and lack of diversity in pre-training data and fine-tuning process
Biases and lack of diversity of developers and human users

Consumers of AI products should expect that products do contain and perpetuate biases.

In healthcare, AI biases have the potential to result in considerable harms.

Spotlight: Impact of biases in training data

The data that AI models draw on is not better than existing data sources, and even high-quality health population data and data from academic or scholarly literature have known biases and limitations (Oxford CEBM, 2024). Furthermore, without transparency in training data, there is no way to know what data sources are used to inform an AI model. It is safe to assume that datasets used by AI models over-represent certain demographic groups, under-represent others, use flawed proxies, and articulate patterns or trends that are in themselves flawed.

In primary care, this could translate to:

Diagnostic Errors: Biased AI systems may lead to missed or delayed diagnoses, particularly for underrepresented patient groups.
Treatment Disparities: AI-driven treatment recommendations might not be equally effective across all patient populations.
Communication Barriers: Biased language models or speech recognition systems could impede effective communication with diverse patient populations.
Reinforced Inequities: Unchecked AI bias could exacerbate existing health disparities and inequities in healthcare access and outcomes.
Erosion of Trust: If patients perceive or experience bias in AI-assisted care, it could damage trust in their healthcare providers and the healthcare system overall.

Movements to incorporate ethics into the design and use of AI

Responsible AI refers to principles guiding the design, development, deployment, and use of AI to foster trust and empower organizations and collaborators. It addresses the broader societal impacts of AI systems, ensuring they align with social values, legal standards, and ethical principles. The aim is to integrate these ethical guidelines into AI applications and workflows, thereby reducing risks and negative outcomes while enhancing positive results.

Explainable AI (XAI) involves methods and processes that allow human users to comprehend and trust the outcomes of machine learning algorithms. It clarifies an AI model’s anticipated impact, possible biases, accuracy, fairness, transparency, and decision-making results. XAI is crucial for fostering trust and confidence in AI models and promoting a responsible approach to AI development, especially as AI becomes more sophisticated and less understandable.

Sources for this section include:

AI Glossary

AI glossary

AI biases result from human biases captured in training data or algorithms. They will be reflected in responses and may be overt or subtle and difficult to detect.

AI Models are used to make decisions or predictions.

Algorithms define the logic by which an AI model operates.

Artificial intelligence (AI) is technology that allows computers and machines to mimic human intelligence and problem-solving abilities. It can be used to execute tasks that would otherwise require a human. Examples include GPS or self-driving cars.

Automated data extraction uses an automated process to transform unstructured or semi-structured data into structured information.

Automatic speech recognition (ASR) systems convert human speech into text through a highly complex process involving linguistics, mathematics, and statistics. The performance of speech recognition technology is measured by its word error rate (WER) and speed. Factors like pronunciation, accent, pitch, volume, and background noise can affect ASR systems’ performance.

Machine learning models with inner workings so complex that it is impossible for algorithm designers, engineers, and users to know precisely how the model came to a specific result.

Classification algorithms in machine learning aim to categorize data accurately. They identify patterns in existing data to determine how to label or define previously unseen examples.

Computer vision is a broad discipline that employs machine learning and neural networks to enable computers and systems to extract information from digital images, videos, and other visual inputs. It allows these systems to make recommendations or take actions based on their visual analysis.

Deep learning arranges machine learning algorithms in layers to form neural networks. Deep learning has had significant breakthroughs in recent years, and most of the AI products consumers interact with today are powered by deep learning models.

Products that combine multiple base AI models together.

Explainable AI (XAI) involves methods and processes that allow human users to comprehend and trust the outcomes of machine learning algorithms. It clarifies an AI model’s anticipated impact, possible biases, accuracy, fairness, transparency, and decision-making results. XAI is crucial for fostering trust and confidence in AI models and promoting a responsible approach to AI development, especially as AI becomes more sophisticated and less understandable.

Foundation models are large-scale, using vast amounts of training data to facilitate use across a variety of contexts.

Generative AI, or GenAI, can produce original content based on a user’s prompt or request, such as text, images, audio or video.

Hallucinations are specific to language models and refer to when the model’s response to a prompt is nonsensical or inaccurate. Hallucinations can be difficult or impossible to detect if the subject is outside the prompter’s realm of expertise.

Forced hallucination is a term for when a user attempts to use an LLM in ways contrary to how they are designed to work, which forces the model into a position where their limitations are more stark. For instance, LLMs are not designed to be accurate or to “find” information. Asking an LLM-powered chatbot for data or statistics that are impossible to know may result in a response that it doesn’t know but could also result in the model inventing an answer.

The extent to which users can grasp the reasoning behind an algorithm’s decision. It measures how accurately an AI’s output can be predicted by human users.

Large Language Models (LLMs) are a type of generative AI model trained on vast amounts of data. They can produce original content based on a user’s prompt or request such as text, images, audio, or video that uses Natural Language Processing (NLP).

Machine Learning (ML) is a subset of artificial intelligence rooted in training machines or programs on existing data. Once training is complete, the machine or program can apply what it learned to new, unseen data to identify patterns, make predictions, or execute tasks.

Machine learning classification (MLC) uses algorithms and aims to categorize data accurately. They identify patterns in existing data to determine how to label or define previously unseen examples.

Multimodal models analyze and produce many different types of data. Current LLM chatbots such as ChatGPT and Gemini are multimodal – they can read and analyze many types of data (text, video, audio, images), and produce other data types as part of requested outputs, such as taking a text input and producing an infographic.

Natural Language Processing (NLP) combines computational linguistics, statistical modeling, machine learning, and deep learning to enable computers to understand and generate realistic humanlike text and speech.

Neural networks are complex, layered algorithms that mimic the structure of human brains. Nodes in neural networks perform calculations on numbers passed between them on connective pathways.

Optical character recognition (OCR) is a technology using automatic data extraction and both hardware and software systems to convert images of text into a format readable by a computer. It identifies individual letters in an image, assembles them into words, and then forms those words into sentences, thus transforming physical, printed documents into text that a computer can read.

Overfitting in machine learning happens when an algorithm models the training data too precisely, making it ineffective at predicting or interpreting new data. This undermines the model’s utility, as the ability to generalize to new data is crucial for making accurate predictions and classifications.

Retrieval-augmented generation enhances LLM-generated responses by incorporating external sources of knowledge, enriching the model’s internal understanding. This approach allows the system to access the most current and reliable data while providing users with source references to verify the accuracy of its claims.

Responsible AI refers to principles guiding the design, development, deployment, and use of AI to foster trust and empower organizations and collaborators. It addresses the broader societal impacts of AI systems, ensuring they align with social values, legal standards, and ethical principles. The aim is to integrate these ethical guidelines into AI applications and workflows, thereby reducing risks and negative outcomes while enhancing positive results.

Timeboxing refers to the fact that machine learning models’ training data is only current up to a certain date. Foundation models require a massive amount of time, resources, and computing power to train, and can cost hundreds of millions or close to a billion dollars to train. Thus, continuous training is not feasible. For more information on the costs to develop foundation models, see 2024 AI Index Report (Stanford) under Top Resources.

Training data is used by machine learning systems to learn how to recognize patterns and generate writing. The quality of training data, plus additional learning, directly impacts the quality of system outputs.

- For ASR, training data that contains varied voices, accents, conversation styles, background noise levels, etc., will improve the model’s performance. Models that have not been trained with diverse data will struggle to accurately capture conversations that include sounds, accents, or manners of speaking they were not trained on.
- For LLMs, training data that includes biased, discriminatory, or violent content will be reflected in model reasoning and outputs, in ways that may or may not be clear in any given conversation. Foundation models have added extensive fine-tuning after training to attempt to surface, correct, and eradicate harmful vectors and concepts.
- Currently, private development companies keep details of training data secret, so consumers have no way to discover what data current frontier models were trained on.

Transformer models are advanced neural networks designed to handle data such as words or sequences of words. They work by converting input sentences into numerical values that capture their meaning. These values are processed through multiple layers, each refining the data by focusing on different aspects of the sentence. This method allows transformer models to effectively translate languages and perform other complex tasks by learning patterns and relationships within the data.

Unimodal models can only read and produce a single type of data, such as an LLM that can only read and produce text.

Weight refers to how LLMs map the probability of different possible word or concept pairings. The more frequently words or concepts are paired together, the more weight the model will give them.

Word error rate (WER) is the percentage of words inaccurately translated by the ASR system. WER counts all words equally in an exchange. For specialized fields, such as medicine, the use of specialized WERs can help fine-tune the model to ensure high performance for capturing critical health information. Healthcare WERs include Medical Concept WER (MC-WER), Doctor-Specific Word Error Rate (D-WER), or Patient-Specific Word Error Rate (P-WER).

Word vectors are created by LLMs during training to map how words are related. They are created to capture relationships between words or concepts and extrapolate to identify analogies.

Acknowledgement and legal

The AI Learning Centre was developed using the Centre for Effective Practice’s (CEP’s) rapid knowledge translation approach as part of the Knowledge Translation in Primary Care (KTinPC) Initiative. Funded by the Ministry of Health, the KTinPC Initiative supports primary care clinicians with a series of clinical tools and health information resources. Learn more about the KTinPC Initiative.

This Tool was developed for licensed health care professionals in Ontario as a guide only and does not constitute medical or other professional advice. Health care professionals are required to exercise their own clinical judgement in using this tool. Neither the CEP, Government of Ontario, collaborating partner(s), nor any of their respective agents, appointees, directors, officers, employees, contractors, members or volunteers: (i) are providing medical, diagnostic or treatment services through this Tool; (ii) to the extent permitted by applicable law, accept any responsibility for the use or misuse of this Tool by any individual including, but not limited to, primary care providers or entity, including for any loss, damage or injury (including death) arising from or in connection with the use of this Tool, in whole or in part; or (iii) give or make any representation, warranty or endorsement of any external sources referenced in this Tool (whether specifically named or not) that are owned or operated by their parties, including any information or advice contained therein.

This Tool is a product of the CEP. Permission to use, copy, and distribute this material for all non-commercial and research purposes is granted, provided the above disclaimer, this paragraph and the following paragraphs, and appropriate citations appear in all copies, modifications, and distributions. Use of the CEP: AI Learning Centre resource for commercial purposes or any modifications of the Learning Centre are subject to charge and use must be negotiated with the CEP (Email: info@cep.health).

For statistical and bibliographic purposes, please notify the CEP (info@cep.health) of any use or reprinting of information from the AI Learning Centre. Please use the following citation when referencing the Tool: Reprinted with Permission from Centre for Effective Practice. (Month Year). [insert tool name]: Ontario. Toronto: Centre for Effective Practice.

Developed by:

Artificial Intelligence (AI) Learning Centre

What’s new

Key messages

Expect change and safely embrace experimentation

Take a balanced approach considering benefits and risks

Ensure professional obligations are met

A note on AI regulation

Cheat Sheets New

AI Tools in Primary Care

Jump to:

AI in the digital health spectrum

What tools should I use?

Physical environment

Healthcare terminology

Understand context

Optimize clinical accuracy

Regulatory compliance

Accuracy and performance

Current limitations in assessing performance of AI models

What can I do right now?

Risk spectrum

Key Terms

Varied risk levels in AI clinical assistant products

AI in CDSS: Challenges and considerations

Assessing AI-CDSS products

A note on evidence:

Resources:

AI Scribes

AI Scribes: Know the legal and privacy risks (OMD)

OMA support to review AI scribe vendor contracts

Ontario primary care evaluation study

Examples of AI medical scribe outputs

Ethical Landscape

Ethical landscape

Biases in AI: Implications for primary care

Spotlight: Impact of biases in training data

Movements to incorporate ethics into the design and use of AI

Sources for this section include:

AI Glossary

AI glossary

AI biases

AI models

Algorithms

Artificial intelligence (AI)

Automated data extraction

Automatic speech recognition (ASR)

Black box

Classification algorithms

Computer vision

Deep learning

Ensemble models

Explainable AI/Explainability (XAI)

Foundation models

Generative AI

Hallucinations

Interpretability

Large language models (LLMs)

Machine Learning (ML)

Machine learning classification (MLC)

Multimodal models

Natural Language Processing (NLP)

Neural networks

Optical character recognition (OCR)

Overfitting

Retrieval-Augmented Generation (RAG)

Responsible AI

Timeboxing

Training data

Transformer models

Unimodal models

Weight

Word error rate (WER)

Word vectors

Acknowledgement and legal