Tools for Thought: Coding Qualitative Insight in a Generative World

Beyond the Interface: AI and the Epistemological Disruption of Qualitative Research

In traditional research – be it qualitative, quantitative or cultural – we have the ability to link the output on the screen back to the real world. We can follow the breadcrumbs from the PowerPoint slide, through the analysis and the discussion guide, all the way to the field. Like a Michael Burry of insight, we can unpack the information to understand how it represents the real world.

If our consumers have an average income of £1000, we do not expect the next consumer to have an income of exactly £1000 – one might, but if we have seen a whole lot under the £1000 mark, we expect to see a few above it soon.

Qual is a bit fuzzier. When a participant says, "I would totally buy that," we know it's not gospel. It's a social statement ("I want to be agreeable"), it's partly informed guesswork ("I think I would want this... under ideal conditions"), and it's wishful thinking ("I might, in a version of the future I'm making up right now"). We understand this complexity, not just because we've been trained to analyse it, but because we are also people.

When AI enters the picture, that relationship to knowledge has the potential to shift drastically.

The Illusion That Persists Even When Understood

Pornography is (in some respects) an interesting parallel to AI. This might not have crossed the average enthusiast's mind, but what they are looking at is an illusion. And I'm not talking about the onscreen romance, I'm talking about the image itself. Millions of pixels light up in formation and, if you sit far enough away, your brain recognises an image of people doing things. It doesn't matter whether you are aware of the illusion or not, your physiological reaction remains captured by it. Your body responds as if it's real. The illusion captures you whether you cognitively assent to it or not. AI works the same way. We type cohesive sentences, thoughts and ideas, and something comes back. Equally cohesive (if not more so), it ties into our thinking and it makes sense – it feels like a response. And whether you understand LLMs or not, your physiology responds – your mind sees another mind. It feels like you're talking to someone. And the impact of that feeling is not clear.

The implications of this phenomenon will no doubt be vast. Recent history is littered with examples of technology feeding into immutable urges, in part because we overestimate our ability and our intuition to tell truth from fiction. We end up blindly walking into blunders that only become obvious in hindsight.

For this article I'd like to focus on the quiet little suburb of market research. This is in itself quite a complex topic, so in order to talk about the potential and the risks of AI, I'm going to look at a few broad applications ranging from the intuitive to the seemingly magical.

Transcript tamers

These are tools that clean up transcripts and remove filler. You can certainly argue that they are not AI tools at all, but the definition of AI is ambiguous and I like this as a starting point. These are programs and systems that take our transcripts, the messy verbatim output from respondents, and turn them into a format that feels more like reading material. These pretrained models do more than just a word search. Utilising NLP tools (for example NLTK or spaCy) they can make sense of language and remove empty phrases and fillers.

By removing all the 'speech' terminology, you not only cut down on the sheer number of words you have to read, but you also lower cognitive load, because the text is now better suited to reading.

These systems can be understood and they feel contained. They do not add information, they largely remove and clarify, and you can definitely recognise the original work in the output. More than that, they can often work through large sets of data using nothing more than a modest laptop.

Tools of this nature – let's call them non-generative or minimal-generation tools – do not require us to philosophise too much about what we know when we know it. That question belongs on the input side – once you understand your input data, you'll be comfortable with the output data of these tools.

To create a tool like this you can think of it as two processes: data shuffling and data processing. The first requires nothing more than Python and a script to deal with data in the format you tend to create it in – for example a folder full of Word documents. In its simplest terms, this program should grab all those files, convert them to Python strings, pass them through a process and save the result somewhere intuitive.
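As a minimal sketch of that data shuffling step (assuming the python-docx package and a folder of .docx transcripts; the folder names are placeholders):

```python
# Data shuffling: gather Word documents and turn each into a plain Python string.
from pathlib import Path
from docx import Document  # pip install python-docx

def load_transcripts(folder: str) -> dict[str, str]:
    """Return {filename: full text} for every .docx file in the folder."""
    transcripts = {}
    for path in Path(folder).glob("*.docx"):
        doc = Document(str(path))
        transcripts[path.stem] = "\n".join(p.text for p in doc.paragraphs)
    return transcripts

# Save the raw strings somewhere intuitive, ready for the processing step.
out_dir = Path("cleaned")
out_dir.mkdir(exist_ok=True)
for name, text in load_transcripts("transcripts").items():
    (out_dir / f"{name}.txt").write_text(text, encoding="utf-8")
```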

The second step is the data processing. I like keeping things like this separate. It means you can change the data processing without having to tinker with your file management and analysis setup (where you store things, the files you use and so on). I won't spend too much time on this as it's a whole topic on its own, but it basically takes the output of your data shuffle process, does stuff and hands back the output. The 'stuff' can be a simple word search, removing duplicate sentences or even handing off to an LLM (local or API, depending on how much firepower you have under your desk).
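A sketch of the processing side, kept separate from the shuffling: here the 'stuff' is a simple filler-removal pass with spaCy. The filler list is illustrative and assumes the en_core_web_sm model is installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
FILLERS = {"um", "uh", "erm", "hmm", "basically", "literally"}

def remove_fillers(text: str) -> str:
    """Drop common filler tokens while preserving the rest of the sentence."""
    doc = nlp(text)
    return "".join(tok.text_with_ws for tok in doc if tok.lower_ not in FILLERS)

print(remove_fillers("So, um, I basically just stopped buying it."))
# fillers removed; a further pass could tidy any stray punctuation
```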

Why write your own program for this? Firstly because it’s great to create new things. And secondly, the world of language processing is vast and once you get stuck in you’ll find that existing products often combine functions and hide parameters in the interest of usability or common use cases. It limits your own imagination and strangles curiosity. Once you have the freedom to build, you can explore not only what is available, but start peering into what is possible.

Meaning mappers

Transcripts are supposed to paint pictures. We want to overlap them, combine them and get some idea of where all these opinions fit in – what the bigger map of meaning looks like. Here is where we need to start understanding the context of the information we're getting: how do the ideas we're seeing here relate to other ideas?

To do this locally you can use a combination of sentence-transformers and scikit-learn. Sentence-transformers turns words into vectors – or, simply put, into something a computer can understand. It allows the machine to code proximity – to say this idea is like that idea, and to give that likeness a value. Sentence-transformers allows the machine to 'understand' words in relation to other words, creating the impression of understanding meaning.

Scikit-learn on the other hand gives you the ability to organise, explore and test the findings. You can create clusters, map them and find out how close they are in similarity – giving you a clear image of what meaning in your data looks like.
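A minimal local sketch of the idea (the model name and cluster count are illustrative choices, not recommendations):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "I'm not sure I'd use it every day.",
    "I feel uncertain about the price.",
    "The packaging looks premium.",
    "It feels expensive but worth it.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small enough for a laptop
embeddings = model.encode(sentences)             # one vector per sentence

kmeans = KMeans(n_clusters=2, n_init="auto", random_state=0)
labels = kmeans.fit_predict(embeddings)

for label, sentence in zip(labels, sentences):
    print(label, sentence)  # sentences grouped by 'likeness'
```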

The alternative, of course, is to take all of your data, send it to a model like GPT-4 and ask it to do the job. You will always get a response and, with a good prompt, you will get good results.

So why bother with all the libraries and development work when you can just paste and wait? The benefit of speed and ease comes with a cost. Other than your prompt, you have no control over what happens to the data, how it is interpreted and what the generation process does. One alternative here is to use more than one provider and cross-reference, but at some point this becomes a more complicated task and you're back to developing.

With your own local models, you understand what they do, you can control and standardise parameters and you can start comparing your own analysis to that done by the machine.

An ideal solution, however, is a combination of the two. For example:

Take your data, feed it through a sentence transformer, and use scikit-learn to create clusters – bundles of words that sort of sit together based on their 'likeness' mentioned earlier. Here you can decide whether you want a fixed number of clusters or whether it should simply create clusters based on similarity. So it either sorts the words into a set number of buckets, or it bundles them and creates however many bundles the parameters you've set allow. Once you've got your clusters, you can send these to GPT to find the nuance of those clusters in the larger dataset. That is to say, you get a cluster around 'uncertainty': lots of people say things like 'I'm not sure' or 'I feel uncertain'. GPT now has a far narrower window of interpretation: tell me about 'uncertainty' in the context of the whole conversation. The result is a standard and predictable clustering model with an ability to find nuance.
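As a sketch of that hand-off (using the OpenAI Python client; the model name and prompt wording are assumptions you would tune):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

cluster_label = "uncertainty"
cluster_quotes = [
    "I'm not sure I'd use it every day.",
    "I feel uncertain about the price.",
]

prompt = (
    f"A cluster of respondent quotes has emerged around '{cluster_label}':\n"
    + "\n".join(f"- {q}" for q in cluster_quotes)
    + "\nIn the context of the wider conversation, what nuance sits behind this theme?"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```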

This is one example, and once you get more familiar with the abilities of scikit-learn or matplotlib, your imagination can take over.

In this case, however, we can see the value of understanding what happens to the information. Sentence-transformers creates vectors and looks for similarity – you're using a dimension of the data and comparing. This is a fundamentally different approach from an LLM, which takes input, builds an unfathomably large attention map and starts calculating tokens to generate an output.

This is not to say that simply pasting and asking is a waste of time, or that the only way to really get anything done is through a more complex system, but building your own does give you more options given the data you want to work through.

If you have one or two transcripts, some decent prompting will be fine. A large dataset might lose some nuance, and if the dataset you're pasting exceeds the context window, part of it will simply be dropped. For such a large dataset, it might work to embed with sentence-transformers (or even use GPT's embedding API for better output). This can be done in chunks if the dataset is too large: because the vectors (which are used for likeness) are not context dependent, you can simply process them in batches. You can then use the local machine to find clusters and go back to GPT-4 with your clusters for nuance. I recommend mixing and matching to find the setup that balances speed, cost, accuracy and control.
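A sketch of the chunked embedding idea – because each sentence's vector does not depend on the rest of the corpus, you can encode in batches and stack the results before clustering (the chunk size here is arbitrary):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_in_chunks(sentences: list[str], chunk_size: int = 500) -> np.ndarray:
    """Encode a large list of sentences in manageable batches."""
    parts = []
    for start in range(0, len(sentences), chunk_size):
        parts.append(model.encode(sentences[start:start + chunk_size]))
    return np.vstack(parts)  # one big matrix, ready for clustering
```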

Personally, I find the local clustering process more reassuring. Dumping long texts (or anything, for that matter) into GPT will always give you an output, but you then have to validate this output, which in some ways asks you to redo the work. On smaller datasets, or when you have a good idea of what you're looking for, it's a bit more reliable because you're really asking it to summarise work you already understand. But with large sets, where you really don't know or can't remember what's in there, clustering allows you to surface themes that GPT might overlook. And once you have them you can head back to ask for nuance.

We're looking at transcripts in this example, but there is no reason why you can't throw in desk research, social media posts and anything else. If you have different types of content in a single study, you may want to separate them out to compare field research to online sentiment to traditional media and so on. But in principle, you'll apply the same logic to analyse and map everything.

Virtual co-moderators

That’s all on the output side. What about the input side?

Here we are looking at software that follows your conversation and co-moderates with you – it can, for example, evaluate the conversation and gently nudge you in directions of interest, or recommend that you stick around in an area for a bit longer before moving on. It's easy to conceptualise: a voice-to-text function that sends your conversation to a computer of sorts, which then responds to a prompt asking whether a given question has been answered, or for recommendations on further prompting. You'll have to play a little with prompts and workflows so it doesn't become too clunky, but in principle we can see that we have all the parts to make at least the tech work.

This example demonstrates the potential for being 'misled' by AI. Misled might be the wrong word – there is no intent, the machine does what the machine does – but you are trying to marry the flow of a conversation to the flow of a program.

Software development is a deeply creative endeavour so there will be many ways to skin this particular cat. I’d like to go through one particular stream of thinking that can demonstrate the areas where you could lose too much control.

The first step in building this program is a matter of formatting – getting data into the program. Where our previous example used something like cut and paste, this will require a speech-to-text mechanism of some sort. We could talk about data loss and inaccuracy in speech-to-text here, but this is not the sort of data loss I'm generally concerned about. Words can go missing or be misheard, but that is down to hardware and software inefficiencies – not logical missteps or shortcomings inherent in the process. For the purposes of this piece we assume a perfect speech-to-text transcription.

The next two steps are where the opportunity for 'not knowing what you're not knowing' becomes risky – not because of data security (all of this can in theory happen on a local machine) but because the process of interpreting and making judgement calls on recommendations becomes nuanced.

For the next step, an open-ended clustering (k-means) would work. You can play with the libraries and packages, but this step listens and starts to build out the themes of the discussion, and how that happens is key. Here I would recommend steering away from an LLM as much as possible. The reason is twofold. Firstly, there is the concern about how it comes up with themes – you do not know how it surfaces one theme over another. Whatever you get will be correct to an extent, but it introduces variation to a process where we desperately look for anchors and points of comparison.

Secondly, we're trying to surface themes in real time (or as close to real time as possible). This means we have to come up with a dynamic system that looks for themes as they emerge. The actual 'theme spotting' mechanism will borrow from the previous section. You might start wondering how this becomes dynamic – remember, our theme spotter looks at static texts after the research has been done. In order to create dynamism, we will reanalyse at a certain interval. This can be either at a word count, after a period of time or at moments of silence, depending on the availability of computing power and how quickly you want the themes updated.

For example, set this to 100 words and the themes will update after every 100 words; set it to 60 seconds and the analysis will run every 60 seconds. Pauses might take a bit more thinking – you'll be looking for an audio threshold that triggers the process, but this is also very possible. For processes that depend on 'softer' triggers like this I tend to add a failsafe: "check themes when you hear silence or after one minute" sort of logic.
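A sketch of that trigger logic, with the word count and timer acting as a failsafe pair (the analyse() function stands in for the clustering step from the previous section):

```python
import time

WORD_TRIGGER = 100   # re-analyse after every 100 new words...
TIME_TRIGGER = 60.0  # ...or every 60 seconds, whichever comes first

def run_live_analysis(transcript_stream, analyse):
    """transcript_stream yields newly transcribed words; analyse() takes the full text so far."""
    words, words_since, last_run = [], 0, time.monotonic()
    for word in transcript_stream:
        words.append(word)
        words_since += 1
        if words_since >= WORD_TRIGGER or time.monotonic() - last_run >= TIME_TRIGGER:
            analyse(" ".join(words))  # always the whole conversation, not just the latest chunk
            words_since, last_run = 0, time.monotonic()
```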

This process will check the same text over and over again – themes have to evolve in the context of the entire conversation, not just the last 100 words or 60 seconds. If you run this process through GPT you're likely to find variations on every iteration, which means your whole set of themes gets altered on every run rather than evolving slowly as the conversation moves.

Now that you have your series of themes, slowly evolving as you speak, you can measure them against a series of predetermined themes – the themes you want to cover off in the discussion. You can do this using the same logic: measuring the likeness of themes. If the new themes are 'like' your predetermined ones, you can tick them off as discussed. This likeness is not straightforward – you'll have to play with variables like the number of instances that show a likeness and how high that likeness has to be. This might even vary per theme. But this sort of tinkering – how do we quantify likeness – is probably closer to what researchers should be thinking about in the context of software.
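A sketch of that likeness check, comparing emerging themes to the discussion guide with cosine similarity (the model name and threshold are assumptions to be calibrated per study, and possibly per theme):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

guide_themes = ["price sensitivity", "trust in the brand", "uncertainty about usage"]
emerging_themes = ["not sure how often they'd use it", "worried it costs too much"]

guide_vecs = model.encode(guide_themes, convert_to_tensor=True)
new_vecs = model.encode(emerging_themes, convert_to_tensor=True)

similarity = util.cos_sim(new_vecs, guide_vecs)  # emerging x guide similarity matrix

THRESHOLD = 0.5  # calibrate this; it may differ per theme
for i, emerging in enumerate(emerging_themes):
    for j, guide in enumerate(guide_themes):
        if similarity[i][j] >= THRESHOLD:
            print(f"'{guide}' looks covered (matched '{emerging}')")
```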

This requires a bit of adjustment, but you do end up having complete control over where your program will prompt or stay quiet. And when you're in a real-time conversation, this sort of understanding helps. A random word from a machine you don't understand introduces a new thought process: you now have to evaluate the machine's output. A program that you've calibrated according to parameters you understand, on the other hand, gives you the ability to judge and respond appropriately.

At this point you may have surfaced new themes that you did not anticipate during questionnaire design. It looks like a rich area of interrogation but you need a good angle in. This is where generative AI shines.

A prompt is pre-populated with the demographics of the group and, if you really went for it, the demographics of the individuals around the table. The topic that was surfaced still has its context behind it and your transcription program is still transcribing in real time. A prompt goes to a generative AI of your choice: "this demographic mentioned this topic in the following conversation, please suggest three prompts that allow me to interrogate it further." You can engineer this prompt and the UI as much as you see fit, but the principle stands: the analysis was done by you, the generation is done by AI. You may not know how the AI came to these prompts, and you don't have to. It's not analysis. It's a tool for moving the discussion forward, and this is exactly what LLMs are good at – figuring out how to get to the next word.
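A sketch of that generative step (the model name, demographics and prompt template are all assumptions; the analysis feeding into it is still yours):

```python
from openai import OpenAI

client = OpenAI()

group = "UK parents of teenagers, mid-income"                          # pre-populated demographics
topic = "uncertainty about subscription pricing"                       # surfaced by your theme spotter
recent_context = "...the last few minutes of the live transcript..."   # placeholder

prompt = (
    f"A group of {group} has just raised the topic '{topic}'. "
    f"Here is the recent conversation:\n{recent_context}\n"
    "Suggest three open, non-leading prompts I could use to explore this further."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```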

This is one example – there are an infinite number of architectures, tools and designs that can be dreamt up (and AI will help you build them). The point is that once we understand what the strength of each tool is and what our hardware limits are, we can start making informed decisions about what happens where, which decisions are crucial and which we can almost ignore.

Synthetic Participants / Virtual Personas

This has been a hot topic ever since computers started talking back: the idea that you can prompt an AI into being your consumer and then interrogate it. It's faster, cheaper, scalable – the list goes on. Pick any particular weakness of real-life qualitative fieldwork and you can argue that AI has improved on it. At no cost?

We can be lulled into thinking there is no cost – mostly because we are vulnerable to things that 'look' like another mind, and because AI is really good at looking like another mind. It basically just has to speak plausible and coherent sentences and our brain starts treating it like a brain.

I don't think the qualitative research industry bought this idea. Not for one second. Qualitative researchers understand the nuance of the interview – especially when we start entering the space of biases, aspirations and contradictions. We're not listening for words at face value, we're looking for meaning and intent. There is the familiar struggle of a respondent trying to sound coherent while realising we've stumbled upon a seeming contradiction, then trying to make sense of it themselves in real time. If you can maintain an open and honest relationship with a respondent, you slowly start unpacking the seeming contradiction, very often revealing a series of values and beliefs that explains any apparent tension. On countless occasions I've ended an interview with respondents saying that it felt like a therapy session, even offering to do another session.

It's not that AI is attempting to do this but can't; rather, it's that AI was designed to do something else entirely. It's not supposed to be an individual. You cannot interrogate its decisions, its point of view on life or its social perception – it has none.

That is not to say AI is not welcome here – there are use cases where virtual personas can help us drive insight further.

Virtual respondents can help us break into new space. While true insight might be hard to find, we can leverage AI to formulate well-known facts about opinion, behaviour and values into a coherent response. It's not a human response, but it is a great search engine. So you're not getting insight, you're getting useful facts. The conversational nature also prompts you into new spaces, asking questions and building an understanding of an entirely new audience. Knowing the audience better shapes an intuition that can later be useful when you engage in face-to-face research.

We can explore fringe cases, allowing us to expand the edges of our findings. Adjusting parameters and demographics means we can start shining light on interesting corners of our audience. Once again, this is not definitive insight, but it is a very educated guess based on a lot of training data.

We can stress test our existing approach by running it on virtual respondents first. Anyone who has spent time interviewing a narrowly defined demographic on the same topic will find that answers and conversations become predictable. We can leverage this natural predictability of conversations to test the broad flow of our questionnaire.

We can run our questionnaire or shooting guide several times - go through it manually or use our previous tools to surface themes and areas of investigation. While we can make this discovery in the field, virtual respondents can give us the opportunity to make small adjustments or to make fundamental changes to our approach to get a better angle on certain topics or to make room for exploring new areas.

Creating a program that will do this is, in some ways, very straightforward. You're essentially creating a prompt editor with API functionality. So let's talk through this in broad terms.

In a separate piece I demonstrated a very simple piece of code that would either send a single question to GPT or allow you to engage in a longer conversation. In this case you'd obviously want to create a conversation, but you'd also want the system to behave in a certain way.

Create two datasets describing respondents. The first dataset captures universal group characteristics that all respondents share (e.g., interest in a certain topic, geographic region, etc.). The second dataset should describe individuals – age, gender and so on. You can play with these, but breaking them up means you can use one set as a general descriptor and then loop through the second set to simulate a slightly diverse group. What you'd like to achieve is that if you send three calls, it's not just one person responding, but rather three different people from the group. You can even store these and slowly create a large bank of respondents.
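A sketch of the two datasets and how they combine into a persona prompt (all attributes here are illustrative):

```python
group_profile = {
    "interest": "home fitness equipment",
    "region": "southeast England",
}

individuals = [
    {"age": 34, "gender": "female", "household": "two young children"},
    {"age": 52, "gender": "male", "household": "lives alone"},
    {"age": 27, "gender": "female", "household": "shared flat"},
]

def persona_prompt(group: dict, person: dict) -> str:
    """Combine the shared group descriptor with one individual's details."""
    return (
        "You are a research respondent. "
        f"You are interested in {group['interest']} and live in {group['region']}. "
        f"You are {person['age']}, {person['gender']}, and your household is: {person['household']}. "
        "Answer questions in the first person, in your own words."
    )

prompts = [persona_prompt(group_profile, person) for person in individuals]
```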

Using the code from my earlier article, you can simply loop through a series of questions, and because each API call carries the conversation history with it, the model can "remember" the exchange – you are in effect simulating a real chat with a respondent.
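A sketch of that loop, using the persona prompts above as system messages and carrying the running history forward on each call (the model name and questions are placeholders):

```python
from openai import OpenAI

client = OpenAI()

questions = [
    "How do you currently keep fit at home?",
    "What would make you consider new equipment?",
    "What puts you off buying online?",
]

def interview(system_prompt: str, questions: list[str]) -> list[str]:
    """Ask each question in turn, sending the full history so answers build on each other."""
    messages = [{"role": "system", "content": system_prompt}]
    answers = []
    for question in questions:
        messages.append({"role": "user", "content": question})
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        answer = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})  # the 'memory'
        answers.append(answer)
    return answers
```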

Using a series of loops and API calls, you will now run your conversation through all your carefully crafted respondents. And by using our earlier programs, you can easily have it highlight areas where your conversation is unlikely to naturally hit certain topics, where new and interesting topics emerge and where responses might become repetitive or predictable.

You can also check for similarity between responses and revisit the sample. Here I would recommend caution. This is, after all, a system that only creates patterns, and because AI responses are messy, variation or similarity between responses needs to be measured with a carefully calibrated tool.
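A sketch of that check – embed each respondent's reply to the same question and flag pairs that sit suspiciously close together (the 0.8 threshold is illustrative and needs calibration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

replies = [
    "I'd probably wait for a sale before buying.",
    "I would hold out for a discount, honestly.",
    "Price isn't really the issue for me, space is.",
]

vecs = model.encode(replies, convert_to_tensor=True)
pairwise = util.cos_sim(vecs, vecs)  # respondent x respondent similarity

for i in range(len(replies)):
    for j in range(i + 1, len(replies)):
        if pairwise[i][j] > 0.8:  # illustrative threshold
            print(f"Respondents {i} and {j} sound suspiciously alike")
```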

Either way, we can see how we can, at the click of a button (and, to be honest, a good few minutes of staring at your screen), get an extensive list of responses that look real… but are not. This is a very AI-heavy methodology and, as I've been cautioned, you can't really out-prompt the model's training. When it comes to large models like GPT-4, you're safe on the scale of training data, but guardrails might still leave you vulnerable. Uncensored models, on the other hand, will tell you anything, but you might want to double-check the last training update and do some reading on the actual scale of data used.

But when used with the caveats in mind, we can see how a very powerful augmenting tool can be created to run on a modest to (dare I say) very modest machine.

It is also worth mentioning that a virtual personality can, in principle, be created whose answers and behaviours will be indistinguishable from the real personality. Based on patterns and data, the output will match the likeness of such a character. A philosophical point is to ask whether the behaviour will be indistinguishable in likeness or in practice. That is to say, 'that sounds exactly like something that person would say' versus 'that is exactly what that person would say'. In most respects this is irrelevant, but it's worth keeping these details in mind. What's that old saying about details and the devil?

The point is that custom LLMs can get you a lot closer to something like insight into human behaviour and some agencies are already experimenting with synthetic respondents to accelerate preliminary hypothesis generation and streamline research workflows. But I’m making the case for light, local tools.

Training your own LLM is possible – it's local but not really light. Mistral's 7B model will run on an RTX 3060 (a very affordable GPU). If you add a good amount of system memory and use LoRA, you can actually train the thing, but it's a slow and tedious process. If you're looking to get work done you're better off spending your graphics card money on API credits. But then you don't have a generative AI living under your desk.
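For the curious, a minimal sketch of what attaching LoRA adapters to Mistral 7B looks like (assumes the transformers, peft and bitsandbytes packages and a CUDA GPU; the dataset, training loop and hyperparameters are left out, and everything here is illustrative rather than a recipe):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"

# 4-bit quantisation keeps the base model within consumer GPU memory
bnb = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb, device_map="auto"
)

# LoRA trains small adapter matrices instead of the full 7B parameters
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a tiny fraction of the model is trainable
```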

Conclusion

This is by no means an extensive list of what can be done – it's barely a list at all. Read it instead as a few examples, or as inspiration, for when you think about your research challenges and want to create tools that improve your efficiency without compromising the integrity of your research.

The biggest challenge with generative AI today seems not to be seeing whether it can do something; it's knowing whether it has really done something or not. Ask it to be a 35-year-old man from a middle-class background, ask it a question, and the maths will without a doubt generate token after token, printing word after word in cohesive language that even conceptually makes sense. You'll get an answer. But this is not what we as researchers should be looking for.

Words, images, gestures, a pause in the middle of a sentence and more are all just the symbols we use to approximate what is going on inside the mind and heart of a respondent. Generative AI is extremely good at creating a narrow stream of symbolism. But if you're looking for the intent, meaning and mind behind the words, you won't find it. Not because the technology 'is not there yet' but because the technology cannot be there. This is not a matter of speculation or scepticism. It is a logical conclusion. Insight into AI is by definition not insight into humans, and we are looking for human insight. We want to understand human behaviour rather than AI behaviour, for no other reason than the fact that humans are the only consumers of branded goods. One day, when AI has grown a heart and a soul, orders body soap or trainers or digital products online, and we can delve into the hidden drivers behind its behaviour to sell it more products, maybe then there will be a case for seriously interrogating an AI as an end in itself.

Lastly – and perhaps most importantly in the long term – is the ethical dimension. As qualitative researchers, we talk to people to understand them. We do so to make real-world decisions, sometimes on their behalf, sometimes to alter their lives and behaviours. We have no idea what will happen if we slowly start engaging machines that are like humans to alter the lives not of the machine but of the human. It is therefore crucial to understand how these systems work, how they relate to the world of the people we talk to, and to continue to actively engage in thoughtful discussions about the implications of our innovations.
