{"id":9918,"date":"2024-01-08T19:09:36","date_gmt":"2024-01-08T19:09:36","guid":{"rendered":"https:\/\/www.afiniti.com\/?p=9918"},"modified":"2025-04-08T18:01:05","modified_gmt":"2025-04-08T18:01:05","slug":"dr-caroline-obrien-professor-elazer-r-edelman-discuss-ai-safety-testing-in-venturebeat","status":"publish","type":"post","link":"https:\/\/staging.afiniti.com\/dr-caroline-obrien-professor-elazer-r-edelman-discuss-ai-safety-testing-in-venturebeat\/","title":{"rendered":"Dr. Caroline O&#8217;Brien &#038; Professor Elazer R. Edelman Discuss AI Safety Testing in VentureBeat"},"content":{"rendered":"<h6>The use of AI in consumer-facing businesses is on the rise \u2014 as is the concern for how best to govern the technology over the long-term. Pressure to better govern AI is only growing with the Biden administration\u2019s <a href=\"https:\/\/www.whitehouse.gov\/briefing-room\/statements-releases\/2023\/10\/30\/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence\/\" rel=\"sponsored nofollow\">recent executive order <\/a>that mandated new measurement protocols for the development and use of advanced AI systems.<\/h6>\n<p>AI providers and regulators today are highly focused on explainability as a pillar of AI governance, enabling those affected by AI systems to best <a href=\"https:\/\/oecd.ai\/en\/dashboards\/ai-principles\/P7\" target=\"_blank\" rel=\"noreferrer noopener\">understand and challenge<\/a> those systems\u2019 outcomes, including bias.<\/p>\n<p>While explaining AI is practical for simpler algorithms, like those used to approve car loans, more recent AI technology uses complex algorithms that can be extremely complicated to explain but still provide powerful benefits.<\/p>\n<p>OpenAI\u2019s GPT-4 is trained on massive amounts of data, with billions of parameters, and can <a href=\"https:\/\/www.technologyreview.com\/2023\/03\/14\/1069823\/gpt-4-is-bigger-and-better-chatgpt-openai\/\" target=\"_blank\" 
rel=\"noreferrer noopener\">produce human-like conversations<\/a> that are revolutionizing entire industries. Similarly, Google DeepMind\u2019s <a href=\"https:\/\/www.nature.com\/articles\/s41586-019-1799-6\" target=\"_blank\" rel=\"noreferrer noopener\">cancer screening models<\/a> use deep learning methods to deliver accurate disease detection that can save lives.<\/p>\n<p>These complex models can make it nearly impossible to trace how a decision was made, but it may not even be meaningful to do so. The question we must ask ourselves is: Should we deprive the world of these technologies that are only partially explainable, when we can ensure they bring benefit while limiting harm?<\/p>\n<p>Even US lawmakers who seek to regulate AI are quickly <a href=\"https:\/\/time.com\/6289953\/schumer-ai-regulation-explainability\/\" target=\"_blank\" rel=\"noreferrer noopener\">understanding the challenges around explainability<\/a>, revealing the need for a different approach to AI governance for this complex technology \u2014 one focused more on outcomes than solely on explainability.<\/p>\n<p><strong>Dealing with uncertainty around novel technology isn\u2019t new<\/strong><\/p>\n<p>The medical science community has long recognized that to avoid harm when developing new therapies, one must first identify what the potential harm might be. To <a href=\"https:\/\/venturebeat.com\/security\/the-password-identity-crisis-evolving-authentication-methods-in-2024-and-beyond\/\">assess the risk<\/a> of this harm and reduce uncertainty, the randomized controlled trial was developed.<\/p>\n<p>In a randomized controlled trial, a type of clinical trial, participants are randomly assigned to treatment and control groups. 
The treatment group is exposed to the medical intervention and the control is not, and the outcomes in both cohorts are observed.<\/p>\n<p>By comparing the two demographically comparable cohorts, causality can be identified \u2014 meaning the observed impact is a result of a specific treatment.<\/p>\n<p>Historically, medical researchers have relied on a stable testing design to determine a therapy\u2019s long-term <a href=\"https:\/\/venturebeat.com\/security\/cybersecurity-new-years-resolutions-every-enterprise-leader-and-user-should-make\/\">safety and efficacy<\/a>. But in the world of AI, where the system is continuously learning, new benefits and risks can emerge every time the algorithms are retrained and deployed.<\/p>\n<p>The classical randomized controlled trial may not be fit for purpose to assess AI risks. But there could be utility in a similar framework, like A\/B testing, that can measure an AI system\u2019s outcomes in perpetuity.<\/p>\n<p><strong>How A\/B testing can help determine AI safety<\/strong><\/p>\n<p>Over the last 15 years, A\/B testing has been used extensively in product development, where groups of users are treated differentially to measure the impacts of certain product or experiential features. This can include identifying which buttons are more clickable on a web page or mobile app, and when to time a marketing email.<\/p>\n<p>The former head of experimentation at Bing, Ronny Kohavi, introduced the concept of <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/1281192.1281295\" target=\"_blank\" rel=\"noreferrer noopener\">online continuous experimentation<\/a>. In this testing framework, Bing users were randomly and continuously allocated to either the current version of the site (the control) or the new version (the treatment).<\/p>\n<p>These groups were constantly monitored, then assessed on several metrics based on overall impact. 
Randomizing users ensures that the observed differences in the outcomes between treatment and control groups are due to the interventional treatment and not something else \u2014 such as time of day, differences in the demographics of the user, or some other treatment on the website.<\/p>\n<p>This framework allowed Bing \u2014 and later technology companies like Uber, Airbnb and many others \u2014 to make iterative changes to their products and user experience and understand the benefit of these changes on key business metrics. Importantly, they built infrastructure to do this at scale, with these businesses now managing potentially <a href=\"https:\/\/hbr.org\/2017\/09\/the-surprising-power-of-online-experiments\" target=\"_blank\" rel=\"noreferrer noopener\">thousands of experiments concurrently<\/a>.<\/p>\n<p>The result is that many companies now have a system to iteratively test changes to a technology against a control or a benchmark: one that can be adapted to measure not just business benefits like clickthrough, sales and revenue, but also causally identify harms like disparate impact and discrimination.<\/p>\n<p><strong>What effective measurement of AI safety looks like<\/strong><\/p>\n<p>A large bank, for instance, might be concerned that its new pricing algorithm for personal lending products is unfair in its treatment of women. While the model does not use protected attributes like gender explicitly, the business is concerned that proxies for gender may be present in the training data, and so it sets up an experiment.<\/p>\n<p>Those in the treatment group are priced with this new algorithm. 
For a control group of customers, lending decisions are made using a benchmark model that has been in use for the last 20 years.<\/p>\n<p>Assuming demographic attributes like gender are known, equally distributed and of sufficient volume across the treatment and control groups, any disparate impact between men and women can be measured, answering whether the AI system is fair in its treatment of women.<\/p>\n<p>The exposure of AI to human subjects can also occur more gradually through a controlled rollout of new product features, where the feature is released to a progressively larger proportion of the user base.<\/p>\n<p>Alternatively, the treatment can be limited to a smaller, less risky population first. For instance, <a href=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/2023\/08\/07\/microsoft-ai-red-team-building-future-of-safer-ai\/#:~:text=The%20practice%20of%20AI%20red,generation%20of%20potentially%20harmful%20content.\" target=\"_blank\" rel=\"noreferrer noopener\">Microsoft uses red teaming<\/a>, where a group of employees interacts with the AI system in an adversarial way to test its most significant harms before releasing it to the general population.<\/p>\n<p>Where explainability can be subjective and poorly understood in many cases, evaluating an AI system in terms of its outputs on different populations provides a quantitative and tested framework for determining whether an AI algorithm is actually harmful.<\/p>\n<p><strong>Measuring AI safety ensures accountability<\/strong><\/p>\n<p>Critically, it establishes <a href=\"https:\/\/oecd.ai\/en\/dashboards\/ai-principles\/P9\" target=\"_blank\" rel=\"noreferrer noopener\">accountability of the AI system<\/a>, where an AI provider can be held responsible for the system\u2019s proper functioning and alignment with ethical principles. 
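The bank experiment described above can be sketched as a simple difference-in-differences comparison. This is a minimal illustration under assumed data, not the bank's or any provider's actual method; all numbers, group sizes and function names below are hypothetical:

```python
# Hypothetical sketch of the bank's fairness experiment: a
# difference-in-differences check for disparate impact by gender.
# All figures below are illustrative assumptions, not real data.

def mean(xs):
    return sum(xs) / len(xs)

def gender_gap(rates_by_gender):
    """Average interest rate offered to women minus that offered to men."""
    return mean(rates_by_gender["F"]) - mean(rates_by_gender["M"])

# Illustrative interest rates (%) observed in each randomized cohort:
# treatment = priced by the new algorithm, control = legacy benchmark model.
treatment = {"F": [7.9, 8.1, 8.4, 8.0], "M": [7.2, 7.4, 7.3, 7.5]}
control = {"F": [7.5, 7.6, 7.4, 7.7], "M": [7.4, 7.5, 7.3, 7.6]}

# Because customers were randomized into cohorts, any extra gender gap in
# the treatment cohort can be attributed to the new algorithm rather than
# to differences in the underlying population.
disparate_impact = gender_gap(treatment) - gender_gap(control)
print(f"gender gap, treatment cohort: {gender_gap(treatment):+.2f} pp")
print(f"gender gap, control cohort:   {gender_gap(control):+.2f} pp")
print(f"gap attributable to new model: {disparate_impact:+.2f} pp")
```

In practice each cohort would contain thousands of customers and the measured gap would be tested for statistical significance before concluding that the new model is unfair.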
In increasingly complex environments where users are being treated by many AI systems, continuous measurement using a control group can determine which AI treatment caused the harm and hold that treatment accountable.<\/p>\n<p>While explainability remains a heightened focus for AI providers and regulators across industries, the techniques first used in healthcare and later adopted in tech to deal with uncertainty can help achieve what is a universal goal \u2014 that AI is working as intended and, most importantly, is safe.<br \/>\n<b><br \/>\nAbout Dr. Caroline O\u2019Brien<\/b><br \/>\nCaroline O\u2019Brien is chief data officer and head of product at <a href=\"https:\/\/www.afiniti.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Afiniti<\/a>, a customer experience AI company.<\/p>\n<p><b>About Professor Elazer R. Edelman<br \/>\n<\/b>Elazer R. Edelman is the Edward J. Poitras professor in medical engineering and science at MIT, professor of medicine at Harvard Medical School and senior attending physician in the coronary care unit at the Brigham and Women\u2019s Hospital in Boston.<\/p>\n<p><em>Originally <a href=\"https:\/\/venturebeat.com\/ai\/how-important-is-explainability-applying-critical-trial-principles-to-ai-safety-testing\/\">published<\/a> in VentureBeat on January 7, 2024.<br \/>\n<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The use of AI in consumer-facing businesses is on the rise \u2014 as is the concern for how best to govern the technology over the long-term. Pressure to better govern AI is only growing with the Biden administration\u2019s recent executive order that mandated new measurement protocols for the development and use of advanced AI systems. 
[&hellip;]<\/p>\n","protected":false},"author":7,"featured_media":13076,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[42,41],"tags":[32],"class_list":["post-9918","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-leader-insights","category-responsible-ai","tag-latest-thinking"],"acf":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/posts\/9918","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/comments?post=9918"}],"version-history":[{"count":10,"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/posts\/9918\/revisions"}],"predecessor-version":[{"id":10215,"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/posts\/9918\/revisions\/10215"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/media\/13076"}],"wp:attachment":[{"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/media?parent=9918"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/categories?post=9918"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/staging.afiniti.com\/api\/wp\/v2\/tags?post=9918"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}