
Risks of Gen AI: the black box problem

Tell me in two minutes

  • A key factor to consider when deploying an AI system is its level of explainability — i.e., can the behaviour of an AI system be explained in understandable human terms?
  • Explainability may be achieved either by relying on AI models that are low complexity and easily understandable (intrinsic interpretability) or through techniques that seek to explain more complex models (post-hoc explanations).
  • Providing good explanations involves balancing numerous factors, such as ensuring that the explanation is not only accurate but also understandable and useful for its intended audience.
  • Explainability brings with it a number of benefits, including improving accuracy, safety and trust in AI systems. A lack of explainability, by contrast, could have both commercial and legal consequences, including under negligence, anti-discrimination law and administrative law.
  • Organisations deploying AI systems should consider the explainability requirements of their intended use case at the outset, including in the decision of whether to use an AI system in the first place, and ensure that any deployed AI system is capable of providing sufficient explanations to meet these requirements.

The “black box” problem

Recent advances in the capabilities of artificial intelligence (AI) systems have led many organisations to begin to explore new use cases and new AI technologies. However, some of the new wave of AI technologies suffer from an important shortcoming: they lack explainability. That is, it is difficult for the user to understand how the AI system arrives at its prediction or decision. In other words, these systems operate as “black boxes”.

Back in May 2024, McKinsey reported that many companies had already experienced negative consequences from deploying generative AI systems, with explainability cited as the third most common cause.

In light of this, as organisations increase their reliance on technologies like large language models (LLMs) to assist in tasks such as analysing data and making decisions, it will be increasingly important to get to grips with the issue of explainability and the risks of deploying AI systems that lack explainability.

This article, part of KWM’s series on AI-related risks, examines the concept of explainability and offers some recommendations for addressing the “black box” problem.

What is explainability?

Australia’s AI Ethics Principles

The concept of explainability is included within the following principle in the Australian government’s AI Ethics Principles:

Transparency and explainability: There should be transparency and responsible disclosure so people can understand when they are being significantly impacted by AI, and can find out when an AI system is engaging with them.

Notably, the term “explainability” is not defined in this principle. In fact, it is difficult to even work out which part of the text of the principle relates to the concept of “explainability”. So to get a grip on the concept of explainability, it is necessary to dive a bit deeper.

Explainability v interpretability

A good way to understand the concept of explainability is in contrast to the related concept of “interpretability”:

  • Interpretability: Interpretability focuses on examining the inner mechanics of a model to understand exactly how and why a model is generating predictions. Can you understand each step, or otherwise understand what the algorithm does in simple terms? In the context of large language models, this includes mechanistic interpretability, which is the attempt to reverse engineer the detailed computations that occur within LLMs and other similar AI models.
  • Explainability: Explainability, by contrast, is higher level. It relates “to conveying an effective mental model of the system’s decision process to a stakeholder, even if they don’t fully understand the internal workings”. It is about explaining the behaviour of an AI model in human terms.

It should be noted that the terms “explainability” and “interpretability” are sometimes used inconsistently, interchangeably or without a clear definition, often making discussions of these concepts confusing. However, the distinction above is consistent with the one adopted by the National Artificial Intelligence Centre (NAIC), now part of the Commonwealth Department of Industry, Science and Resources (DISR), in its guidance on implementing Australia’s AI Ethics Principles.

These concepts are also not equivalent to algorithmic transparency, which, understood in a narrow sense, concerns disclosure of the algorithm that creates the AI model from its training data. It is possible to have this level of transparency without understanding the behaviour of the learned model or the reasons behind its predictions.

What is intrinsic interpretability?

There are two main ways of achieving explainability. The first is through intrinsic interpretability, which refers to a model being simple enough for its outputs to be interpretable on their face. If a user wants to understand the reason for an output, they can trace the relationship from the inputs, through the model, all the way to the outputs.

This level of interpretability is achievable in “low complexity” models — i.e., models with relatively few parameters and simple functions. The following are a few examples of types of AI models where it is possible to achieve explainability through intrinsic interpretability:

Decision tree
What is it? A model that predicts an output by splitting the data into branches based on feature values until a decision is made.
How to achieve explainability: By examining the decision rules and paths from the root to the leaves.

Linear regression model
What is it? A model that predicts an output as a linear weighted sum of multiple inputs.
How to achieve explainability: By examining the weights in the linear model.

Nearest neighbours
What is it? A model that predicts an output based on how close the input data points are to each other.
How to achieve explainability: By providing the neighbours used for prediction.
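To make this concrete, the sketch below illustrates the decision tree example: it trains a deliberately shallow tree using the open-source scikit-learn library and its bundled Iris dataset (both chosen here purely for illustration, neither is referenced above) and prints the decision rules from root to leaf.

```python
# A minimal sketch of intrinsic interpretability: a shallow decision tree
# whose decision rules can be read directly. scikit-learn and the Iris
# dataset are used purely for illustration.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True, as_frame=True)

# Keep the tree small so the rules stay human-readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints each root-to-leaf path as nested if/else rules, i.e.
# the "examining the decision rules and paths" described in the table.
print(export_text(tree, feature_names=list(X.columns)))
```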

This can be contrasted with more complex types of models, like LLMs (which use neural networks). If you tried to trace back through an LLM to determine what calculations led to its output, you would be confronted with a long list of numbers without any clear intuitive meaning. So while the scale and complexity of neural networks are among the reasons for their success, they come at the expense of interpretability and therefore explainability.

As a result, even though neural networks may offer better performance on certain tasks than simpler models, simpler models may be preferred for their explainability. For example, simpler models are capable of performing classification tasks like credit scoring. Given that the output of this task could have a significant impact on individuals, a simpler, intrinsically interpretable model may be a preferable alternative to a neural network for these applications. In fact, in some jurisdictions the transparency afforded by interpretable models may be required by law.
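As a hedged illustration of that point, the sketch below fits a logistic regression model to synthetic data standing in for a credit-scoring task (the feature names "income", "existing_debt" and "late_payments" are invented for the example) and reads the explanation directly from the learned weights.

```python
# A minimal sketch of an intrinsically interpretable linear model for a
# credit-scoring-style task. The data and feature names are synthetic and
# purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["income", "existing_debt", "late_payments"]  # hypothetical
X = rng.normal(size=(500, 3))
# Synthetic ground truth: higher income helps, debt and late payments hurt.
y = (X[:, 0] - X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500)) > 0

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# The learned weights are the explanation: the sign and magnitude of each
# (standardised) feature show how it pushes an application towards
# approval or rejection.
for name, weight in zip(feature_names, model[-1].coef_[0]):
    print(f"{name}: {weight:+.2f}")
```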

What are post-hoc explanations?

The second way of achieving explainability is through post-hoc explanations. These are methods that analyse a model after it has been trained and can be applied even to models that are not intrinsically interpretable.

Some examples of post-hoc techniques include:

  • providing summary statistics about how particular features of the input data impact the prediction of a model (e.g. feature importance).
  • developing partial dependence plots, which attempt to visualise the relationship between a particular feature of the input data and the output of the model.
  • approximating the predictions of a complex model with a simpler, intrinsically interpretable model (a minimal sketch of this approach follows this list).
  • example-based approaches, which identify similar, contrasting or representative prototype examples of model predictions.
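To illustrate the surrogate-model approach in the third bullet, the sketch below (a generic example assuming the open-source scikit-learn library and one of its bundled datasets, not a tool referred to in this article) trains a black-box model, fits a shallow decision tree to the black box's predictions, and reports the surrogate's fidelity, i.e. how closely it tracks the model it is explaining.

```python
# A minimal sketch of a global post-hoc surrogate: approximate a black-box
# model with a shallow, intrinsically interpretable decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Stand-in "black box": an ensemble whose internals are hard to inspect.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Train the surrogate on the black box's *predictions*, not the true labels,
# so the tree approximates the model rather than the underlying task.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often the surrogate's output matches the black box's.
fidelity = accuracy_score(black_box.predict(X), surrogate.predict(X))
print(f"Surrogate fidelity: {fidelity:.1%}")
print(export_text(surrogate, feature_names=list(X.columns)))
```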

Some of these methods are global, attempting to describe how the model works as a whole across a wide range of possible inputs. Others are local, attempting to describe why the model makes a particular prediction or decision.
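The sketch below contrasts the two, again assuming scikit-learn and one of its bundled toy datasets purely for illustration: permutation importance gives a global summary of which features matter to the model overall, while a per-feature weight-times-value breakdown explains one particular prediction of a linear model.

```python
# A minimal sketch contrasting a global and a local post-hoc explanation.
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = Ridge(alpha=1.0).fit(X, y)

# Global: permutation importance summarises how much each feature matters
# to the model's performance across the whole dataset.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, importance in ranked[:3]:
    print(f"Global importance of {name}: {importance:.3f}")

# Local: for one prediction of a linear model, each feature's contribution
# is simply its weight multiplied by its value, explaining *this* output.
row = X.iloc[0]
contributions = sorted(zip(X.columns, model.coef_ * row.values),
                       key=lambda t: -abs(t[1]))
for name, contribution in contributions[:3]:
    print(f"Local contribution of {name}: {contribution:+.1f}")
```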

All of these techniques are approximations of model behaviour, and there can be significant differences in the fidelity of the explanations they produce (that is, how closely the explanation approximates the predictions of the black box model). However, complete accuracy and comprehensiveness are not the goal. Instead, post-hoc explanations require balancing factors such as completeness and accuracy with ensuring that the explanation is understandable and useful to its intended audience.

As a result, whether explainability has been achieved through post-hoc methods is not a simple 'yes or no' question, but instead a question of whether the explanation is sufficient in the context in which the AI system is deployed. So, what makes an explanation sufficient? There is no settled answer to this, but the following have been identified as important features of a good explanation:

  • selective: focusing on the key factors influencing a prediction, rather than seeking to be completely comprehensive (which can be practically impossible with complex models).
  • contrastive: explaining why a particular prediction was made instead of another (e.g. what features of a loan application led to it being rejected instead of being approved).
  • identifying the abnormal: highlighting any abnormal features in the input data that may have had a significant impact on the decision.
  • appropriate for the audience: taking into account the level of background knowledge of the audience for the explanation.
  • accurate: correctly identifying the actual reasons that led to the prediction and capable of anticipating future outputs of the model.
  • generalisable: capable of providing explanations for similar scenarios.

Why is explainability important?

Why do we care about explainability? Firstly, explainability brings with it a number of benefits:

  • Improving accuracy and safety: Understanding how an AI model makes predictions and decisions can help identify the cause of errors and potentially anticipate future errors. For example, if the explanation for an AI vision system’s ability to detect bicycles shows that it relies solely on identifying two wheels, this could indicate that the system might fail in edge cases where one wheel is obscured from view.
  • Providing actionable insights: Similarly, understanding the reasons why a decision has been made can help identify what would change the decision. For example, it would be useful for a person whose mortgage application has been rejected to know that the rejection was due to credit card debt, or that the application would have been approved if their income had been slightly higher.
  • Identifying bias: Explainability can also help identify bias. For example, deep learning models make predictions based on statistical models derived from their training data. This training data can reflect numerous societal and historical biases and an AI model derived from it may make decisions based on factors that are inappropriate to rely upon (such as gender or race). If an AI model lacks explainability, the impact of these biases can go undiscovered for years.
  • Trust: Being able to provide an accurate account of how a decision was made can improve trust in, and adoption of, AI systems. Currently, Australia is one of the world leaders in AI anxiety, with a recent survey by Ipsos reporting that 64% of Australians were nervous about the use of AI in products and services and only 42% trusting AI not to discriminate or show bias towards groups of people. Deploying explainable AI systems will help alleviate these concerns and may lead to increased trust in and broader acceptance of these systems.

Secondly, there can be legal consequences for using unexplainable systems:

  • Negligence: Organisations need to exercise reasonable care in deploying AI systems. The use of unexplainable AI systems will make it more difficult to meet this standard of care, potentially exposing the deployer to liability, especially in high-risk scenarios.
  • Anti-discrimination law: The issues of bias mentioned above may expose organisations to liability for breach of anti-discrimination laws.
  • Misleading or deceptive conduct: Attempts to explain the output of AI systems that lack explainability could run afoul of the Australian Consumer Law’s prohibitions against misleading or deceptive conduct. This risk is heightened by the fact that LLMs can produce purported explanations of their decisions that do not accurately reflect how they arrived at a decision – which can be thought of as hallucinated explanations.
  • Administrative law: The Commonwealth Ombudsman has described providing reasons as “a fundamental requirement of good administrative decision-making”. The use of unexplainable AI systems may make it difficult to meet this standard of good decision-making and decision makers relying on such systems may find their decisions open to challenge under administrative law, including on the grounds of failing to consider relevant factors or considering irrelevant ones.

What now?

Explainable AI is an open area of research and the techniques intended to make AI systems explainable continue to evolve. For example, the AI company Anthropic has recently taken steps towards understanding what is going on inside LLMs through an approach called mechanistic interpretability, which seeks to identify patterns of neuron activations that are associated with human-interpretable concepts.

In the meantime, organisations intending to deploy AI systems should:

  • assess the level of explainability that will be required for their use case (which may be determined in part by legal requirements).
  • consider whether these explainability requirements rule out the use of models that are not intrinsically interpretable (such as LLMs).
  • where post-hoc methods may be sufficient, explore some of the explainability tools that are currently commercially available.
  • continue to ensure that any explanations provided are sufficiently accurate and useful in the context of the use case and intended audience.
  • monitor for changes in law (including, for example, the proposed rules on automated decision-making that the Commonwealth government agreed to in its response to the Privacy Act Review Report).

