AI / LLM Testing Podcast

Transcription

Introduction

Hello, my name is Alan Barr and this is the Alan Barr Show. I’m passionate about software, software testing, and writing. This is the intersection of metacognition, organization, and technology. It’s January 2024 and a lot is going to change this year. I’m going to open today’s episode with a question, and that question is: should I do this myself, or would I pay somebody else to do it? And this is basically about AI and using large language models as a tool. In the past I would research, write, and organize, and it would take a lot of time and effort. Many of those things I would probably pay somebody else to do because I’m not the best at all of them. Now, with ChatGPT and Copilot, I’m more independent and I have more energy. In today’s episode, I’m going to talk about how we’re using AI, AI products and testing, and AI testing.

How are we using AI?

How are we using AI? Testers, software engineers, architects: we are all using these tools every day. Why is this more important than before? In the past, when I needed to learn, I would go to a search engine and craft a query on Google. Then I would browse a handful of blogs. I would get lost and distracted. I might find the answer I needed, keep searching, or move on. With an AI assistant, the conversation has far less friction and is less distracting. Today, there is much less friction to learning.

People are using AI, whether that’s ChatGPT or Copilot, for a lot of reasons. That could be a coding task or a piece of database logic. It could be cleaning up your grammar, generating pictures or visuals as a jumping-off point, or asking questions back and forth with a virtual assistant. You can expand on topics that you want to learn and understand, even recipes. Now, with Microsoft Copilot on the app stores, there really is no reason not to at least try it. The more you play with AI, the more you learn what works and what doesn’t. It’s not magic. It’s a gift: it might be a good gift, it might not be what you expected, or it could be wrong.

To be clear, I’m really practical about how I use technology. AI and large language models are weird, kind of wrong, not really correct all the time. But I can shape that, and it saves a lot of time and energy. I have a stepladder now. We’ve all seen this over the years: something seems amazing, and then we come to understand how it really works and see it a little more clearly. We get a little grandiose with big ideas like artificial general intelligence, or AGI. But let’s be realistic. Be grounded.

AI Products and Testing

AI products and testing. The value of specialized AI products is that they enable the work to flow. Many of us are asking ourselves: when should I use this tool in a product so that I can continue to do my work without being distracted? We need testing, and testing is changing. Teaching how to test is difficult because the bar keeps rising. We’ve seen software go from physical machines to virtual machines to containers, and how we test is different each time. There is still a lot of testing that is product focused. And with software development today, we just require fewer people than before. If you can tell a good story, write, be thorough, and work with a team of people, then you’ll still be critical.

How do you spend your time on testing these days? In the past, we would test all the potential risks of a database. These days, an entire company or platform may have a group of people dedicated to testing that software. The rest of us don’t need to worry about that testing as much. With AI products, it’s not clear what the boundaries of the product are. Latency could be a big problem. An example: when I use ChatGPT with my voice, sometimes it’s magical, and then sometimes the latency degrades and it stops working. I might be driving, and as I’m driving, ChatGPT stops answering. And it’s annoying. Is it, you know, all the usual issues with technology, or is it something else? There’s a small latency-measurement sketch at the end of this section.

With testing, we don’t really know the bounds of testing yet. In a way, that’s exciting, because we have so much work to do. But also, we don’t know what’s possible and what’s not. How can we improve it? Is it going to be a good business or not? And at the back of our minds, we’re also probably thinking about AI transparency, ethics, bias, safety, preparation, and security. They’re all important topics that we’re learning about today. However, you know, we want to use the tool today and we want the features. It’s going to be an exciting year for sure, especially with testing and these new products.
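Here is that latency sketch: a minimal illustration in Python of how you might measure response times and look at percentiles rather than a single run. This is my own sketch, not any particular product’s test harness; ask_model is a hypothetical stub standing in for whatever API call your product actually makes, so you would swap in your real client.

    import random
    import statistics
    import time

    def ask_model(prompt: str) -> str:
        # Hypothetical stub standing in for a real LLM API call; the random
        # sleep simulates the variable latency you would see in production.
        time.sleep(random.uniform(0.2, 2.5))
        return "stubbed answer"

    def measure_latency(prompt: str, runs: int = 20) -> None:
        # Time each call individually so slow outliers stay visible
        # instead of disappearing into an average.
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            ask_model(prompt)
            samples.append(time.perf_counter() - start)
        samples.sort()
        print(f"median: {statistics.median(samples):.2f}s")
        # p95 is often the latency a user actually feels, say, while driving.
        print(f"p95:    {samples[int(0.95 * (len(samples) - 1))]:.2f}s")

    measure_latency("What is the weather like today?")

The point of reporting a p95 alongside the median is that a product can feel magical on the typical call and still be annoying in practice if one call in twenty hangs.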

AI Testing

All right, let’s talk about AI testing. In this section, I want to talk about one particular blog post that I read: “Testing large language model-based applications: strategy and challenges,” from Scott Logic. I might play around with this topic each episode if there’s a lot of interest. In this case, we’re going to talk about text generated through a prompt. Vision and other modalities could be another thing to test, but for now, we’re going to just talk about text.

In this article, the author Xin Chen covers the challenges of building an internal chatbot on a large language model. In summary: nondeterminism in tests, the cost of running the tests, and the statistical nature of AI testing, meaning 100% passing tests are not realistic. Imagine you are using Microsoft Word and you’re typing a sentence. You expect to see exactly what you typed. Large language models are a combination of math, statistics, and weights, and as you prompt the AI, the next token could change completely. Running the tests is expensive and may require a lot of structured prompts. Most of us are going to be calling an API like OpenAI’s or others, and there is a price for those tokens. If you plan on running a large language model in production for a product, you’ll need test cases. Those test cases will require scenarios with the terminology and knowledge of the business domain. As a tester, you want to make sure that the consumer can still achieve their goal through each update. You may be caught between the price of running the tests and the statistical mishaps, and then deciding what’s worth your time when interpreting a test result.

ScottBot can search the internal Confluence pages, Google, and Wikipedia for the answer a questioner wants. In order to test a large language model product, there will be a lengthy checklist. I definitely recommend reading the article and the checklist; it’s very thorough. If you can afford the price, run the test suites multiple times, aggregate the pass/fail rate over time, and look at specific test failures to evaluate performance. There’s a sketch of that idea at the end of this section.

I have some questions. How does Scott Logic write their internal documentation? In my experience, people do not write down the obvious. What information architecture do they use for their knowledge base? Are they using labels, spaces, or other kinds of organization? I’m excited to encourage others to write in their internal wikis and documentation systems. I think there is a lot of missed opportunity in these businesses because that knowledge is never shared.

So what do you think? Please let me know. Give me feedback. I’d like to do more articles on this topic. One of those is testing language models and prompts the way we test software, and another is “Principled Instructions Are All You Need for Questioning.” So let me know what you think and we’ll keep going, right? Please subscribe on YouTube or on your podcast app.
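Here is that sketch: a minimal illustration in Python of running each test case several times and aggregating pass rates. To be clear, this is my illustration, not Scott Logic’s actual harness: ask_model is a hypothetical stub for the system under test, and the keyword check is the simplest possible assertion for generated text; a real suite would use richer checks.

    import random

    def ask_model(prompt: str) -> str:
        # Hypothetical stub for the chatbot under test. A real harness would
        # call your API here, and pay for those tokens on every single run.
        return random.choice([
            "Our refund window is 30 days.",
            "Please contact support for help.",
        ])

    # Each scenario pairs a domain-specific question with keywords the answer
    # must contain: the simplest useful check for nondeterministic text.
    SCENARIOS = [
        {"prompt": "What is the refund window?", "must_contain": ["30 days"]},
        {"prompt": "How do I reset my password?", "must_contain": ["reset"]},
    ]

    def run_suite(runs: int = 10, threshold: float = 0.8) -> None:
        for case in SCENARIOS:
            # One pass or fail means little when outputs vary; the pass
            # rate across many runs is the real signal.
            passes = sum(
                all(kw.lower() in ask_model(case["prompt"]).lower()
                    for kw in case["must_contain"])
                for _ in range(runs)
            )
            rate = passes / runs
            status = "OK  " if rate >= threshold else "FAIL"
            print(f"{status} {rate:.0%}  {case['prompt']}")

    run_suite()

The cost point follows directly from this structure: a suite of 500 scenarios run 10 times at, say, 1,000 tokens per case is roughly 5 million tokens per release, so deciding which tests are worth repeating matters. Setting the temperature to 0 reduces the run-to-run variation, though it doesn’t fully eliminate it.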

Summary

In summary, we are using AI every day. How about you? AI products and testing? It’s messy. It’s not magic. And it’s not the future yet, but it’s getting pretty good, so there’s more work for us to do. AI testing: checklists, scenarios, and statistics. It’s not always going to be deterministic, but we have to do what we can. And so maybe the big challenge in the next year is really trying to make this easier for all of us. Thank you for listening. Let me know what you think online; I’m on Twitter. Just reach out anytime, and I’d like to keep doing this. So thank you.