Bloggo, Buildo’s blog
Artificial Intelligence

Transforming research insights into AI parameters

How can we turn user needs into actionable AI parameters? We’ll explore how we transformed research insights for a healthcare project, addressing challenges, solutions, and the role of design in connecting users with AI innovation.

Livia Stevenin
UX/UI Designer
February 27, 2025
14 minutes read

The rise of artificial intelligence in design is both exciting and overwhelming. On one hand, it unlocks possibilities that once seemed unimaginable—automating complex tasks, analyzing massive data sets, and even adapting interfaces dynamically. On the other hand, it forces us designers to rethink everything we know about how people interact with technology.

Traditional design principles, which often focus on aesthetics, usability, and functionality, are being reshaped to include the complexities of working with AI—both as a tool and a core component of the user experience.

Designing for AI requires a shift in mindset. Designers must now navigate unfamiliar territories, such as creating user experiences for systems driven by algorithms and probabilistic decision-making. Designers must step up to bridge this gap, ensuring that AI works effectively and aligns with human goals, values, and expectations.

In this article, I’ll walk through how we transformed research insights into AI parameters, highlight the lessons learned in balancing technology and human-centred design, and explain why getting hands-on with AI is crucial when building tools like this.

The project

Have you ever received a medical prescription and struggled to understand the treatment you needed to follow? Or perhaps, after a doctor's visit, you found it difficult to remember their recommendations for the next steps? Clear and precise medical documentation ensures patients receive the proper care and guidance.

For this to happen, doctors must follow specific guidelines when writing reports to provide a better experience and service. This ensures that medical information is structured, accessible, and easily interpretable for patients and healthcare professionals. However, evaluating the quality and usability of medical reports remains a critical yet inefficient process in healthcare, often relying heavily on human effort.

Together with ReportAId, a company dedicated to transforming medical reports through AI-powered solutions to make healthcare more organized, personalized, and accessible, we designed a new compliance feature to automate the most repetitive and time-consuming aspects of the evaluation workflow.

In the current process, health directors and quality managers manually review reports, assessing clarity, proper acronym usage, and structural consistency. This project aimed to reduce cognitive load, enhance objectivity, and streamline evaluations, all while maintaining high-quality standards and ensuring compliance with stringent medical regulations.

The discovery phase

The design discovery process relied on traditional research methods to understand the problem in depth and translate findings into an AI-driven solution. This phase included:

  • Talking to stakeholders: We interviewed health directors and quality managers to learn about their frustrations and needs.
  • Mapping workflows: We broke down the current process to pinpoint where AI could make an impact.
  • Analyzing reports: We looked at actual reports to identify common mistakes and patterns.
  • Building user personas and journeys: We created profiles of the key people involved and mapped how they interact with the system to understand their needs better.

As the discovery phase is not the primary focus of this article, we won’t discuss the process in more detail. However, in the article “What We Learned from Google's People + AI Guidebook,” we explore the methodology and guidelines we also applied to this project.

Translating research into AI parameters

One of this project's most fascinating (and challenging) parts was figuring out how to turn our research into something AI could actually use. We had all these insights from the discovery phase—interviews, report analyses, workflow breakdowns—but AI doesn’t think as we do. It doesn’t just “get it.” It needs clear rules, structured data, and precise instructions. So, how do you translate something as nuanced as a medical report evaluation into a format AI can interpret and act upon?

I started by looking at how medical reports were currently assessed. The process wasn’t random—health directors and quality managers already had criteria they followed to evaluate reports. Through interviews, they shared the best practices they relied on and the common mistakes they encountered repeatedly.

The first step was to make sense of these insights. I put everything into a detailed table, defining:

  • Each evaluation criterion (e.g., acronym usage, structural consistency, copy-paste detection).
  • How it was assessed (Was it a manual check? Was there a clear yes/no rule?).
  • Best practices (What did a “good” report look like?).
  • Common errors (What were the patterns of mistakes?).

This was the foundation of everything that came next.
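
To make this concrete, here is a minimal sketch of how such a criteria table could be represented as structured data (in Python). The field names and example entries are illustrative, not the actual table from the project.

from dataclasses import dataclass, field

@dataclass
class EvaluationCriterion:
    # One row of the illustrative criteria table built from the research insights.
    name: str                  # e.g. "Acronym usage"
    how_assessed: str          # manual check, clear yes/no rule, subjective judgment...
    best_practices: list[str] = field(default_factory=list)
    common_errors: list[str] = field(default_factory=list)

# Example entries, not the real project data
criteria = [
    EvaluationCriterion(
        name="Structural consistency",
        how_assessed="Yes/no rule: are all mandatory sections present and filled?",
        best_practices=["Every mandatory section contains real content"],
        common_errors=["Sections skipped with placeholder symbols like '-' or '.'"],
    ),
    EvaluationCriterion(
        name="Acronym usage",
        how_assessed="Cross-check against the approved acronym list",
        best_practices=["Only approved acronyms appear in the report"],
        common_errors=["Non-standard abbreviations", "Ambiguous acronyms"],
    ),
]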

Of course, knowing how humans evaluate reports wasn’t enough. I had to translate this knowledge into something AI could process. That meant looking at the reports themselves and figuring out how they were structured.

Some parts of the reports had mandatory fields—sections that must be filled out to meet regulatory compliance. These were easy to flag for AI evaluation. If a required field was missing or incomplete, that was a clear issue.

Other parts were dynamic fields, meaning they only appeared under specific conditions. This made things trickier. If a patient was a smoker, for example, the report needed a section specifying what they smoked, how often, and for how long. If they weren’t a smoker, that section wasn’t required. AI needed to understand when to check for specific fields and when to ignore them.

By analyzing past reports, I could spot patterns—which sections were most prone to errors, which fields were often incomplete, and where inconsistencies tended to appear. This allowed me to create clear rules for the AI to follow.
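
As an illustration of that conditional logic, a rule for the smoking section might look like the sketch below. The field names are hypothetical; the point is that the check only applies when the condition is met.

def check_smoking_section(report: dict) -> list[str]:
    # Hypothetical conditional rule: smoking details are only required
    # when the patient is recorded as a current or former smoker.
    issues: list[str] = []
    status = report.get("smoking_status")  # "smoker", "former smoker", "non-smoker", or missing
    if status in (None, "non-smoker"):
        return issues  # the dynamic section is not required
    details = report.get("smoking_details") or {}
    for required_field in ("what", "quantity_per_day", "duration_or_quit_date"):
        if not details.get(required_field):
            issues.append(f"Missing smoking detail: '{required_field}'")
    return issues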

Once I had the structure down, it was time to map everything into a set of AI parameters. Each parameter was tied to an evaluation criterion, forming a parameters map.

This parameters map became the foundation for testing AI prompts. It helped us determine which aspects of report evaluation AI could reliably automate and which would still require human oversight.
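
For illustration, the parameters map could be as simple as a mapping from each AI parameter to the criterion it serves and the kind of check involved; the entries below are examples rather than the real map.

# Illustrative parameters map: parameter -> (evaluation criterion, type of check)
parameters_map = {
    "mandatory_sections_present": ("Structure", "rule-based validation"),
    "smoking_details_complete":   ("Structure", "conditional rule"),
    "approved_acronyms_only":     ("Acronym consistency", "lookup against the approved list"),
    "content_clarity":            ("Quality", "needs human judgment"),
    "duplicated_text":            ("Copy-paste detection", "text similarity"),
}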

Custom ChatGPT

I didn’t want to dive straight into designing solutions without truly understanding how AI would interact with the structured data we had defined. I’ve seen it happen before—designers create great-looking interfaces only to realize later that the technology can’t support what was envisioned.

So, instead of locking into a solution too soon, I took a hands-on approach. I wanted to explore the technological constraints before designing anything. This meant experimenting directly with AI, refining data interpretation, and identifying limitations and opportunities before shaping a final solution.

To properly test how AI would handle medical report evaluations, I created a custom version of ChatGPT and gave it a specific job: act as a medical report analyst. It needed to assess reports based on four key criteria:

  1. Structure – Were the mandatory sections present and filled correctly?
  2. Quality – Was the content complete and precise?
  3. Copy-Paste Detection – Had parts of the report been reused improperly?
  4. Acronym Consistency – Were approved medical acronyms used correctly?

To reinforce the correct handling of acronyms in medical reports, I uploaded a PDF containing a predefined set of approved acronyms, which the AI could reference while analyzing reports.

This process ensured that acronyms were used correctly and consistently. It also helped reduce errors where the AI misinterpreted medical terminology or flagged incorrect information.

Each report needed an overall conformity score determined by aggregating the results of each criterion. I wanted to see how accurately and consistently the AI could perform these evaluations.
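
The exact scoring scheme isn’t described here, but as a minimal sketch (the 0-100 scale and equal weights are assumptions), the aggregation could look like this:

# Hypothetical aggregation: each criterion scored 0-100, combined with equal weights.
def conformity_score(criterion_scores: dict[str, float]) -> float:
    if not criterion_scores:
        return 0.0
    return sum(criterion_scores.values()) / len(criterion_scores)

score = conformity_score({"structure": 90, "quality": 75, "copy_paste": 100, "acronyms": 85})
print(f"Overall conformity: {score:.1f}/100")  # -> Overall conformity: 87.5/100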

I pulled real-world examples of past errors as a baseline to refine how AI identified issues in medical reports. Rather than relying on theoretical mistakes, I wanted the AI to learn from actual problems that had occurred in reports.

For example, when checking report structure, I found that some doctors used symbols like dashes (“-”) or periods (“.”) to bypass mandatory fields instead of filling them out correctly. The AI needed to recognize these workarounds and flag them as non-conformant.

To address this, I created detailed prompts explaining how each report section should be structured. The biggest challenge was ensuring the AI followed consistent logic. I couldn’t just tell it, for example, “Check if the report includes smoking history.” That was too vague. I had to spell out precisely what that meant.

Here is an example of a prompt segment that evaluates the structure criterion, explicitly focusing on the patient’s smoking history.

General Anamnesis (Mandatory)
- Subcategory "Smoking" (Optional)
  - If present, it must indicate whether the patient is a smoker, non-smoker, or former smoker.
  - If the patient is a former smoker, the following details must be specified:
    - What they used to smoke (cigarettes, cigars, or pipe)
    - The quantity per day
    - Until when they smoked
  - If the patient is a current smoker, the following details must be specified:
    - What they smoke (cigarettes, cigars, or pipe)
    - The quantity per day
    - How long they have been smoking

Example:

- Former Smoker
  - 2-3 cigarettes/day until one year ago.
- Current Smoker
  - 5 cigars/day for the past 10 years.

This level of detail ensured the AI understood precisely what information it needed to extract and how to format it consistently.

After finalizing the structure of the prompts, I uploaded anonymous reports into the custom ChatGPT for testing. The goal was to simulate real-world conditions and see how well the AI performed when analyzing reports with different structures and potential errors.
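
We ran these tests through a custom GPT in the ChatGPT interface, but if you wanted to reproduce a similar setup programmatically, a minimal sketch with the OpenAI Python SDK could look like the following. The system prompt is abbreviated and the model name is a placeholder.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = """You are a medical report analyst.
Evaluate the report against four criteria: Structure, Quality,
Copy-Paste Detection, and Acronym Consistency.
Return one finding per criterion and an overall conformity score."""

def evaluate_report(report_text: str) -> str:
    # Sends one anonymized report to the model and returns its evaluation.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content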

Testing prompts

When I started working on this project, I had no idea how much fine-tuning would go into crafting the right prompts. I assumed that if I clearly described what I needed, the AI would figure it out. That wasn’t the case.

Instead, I faced inconsistent results, unexpected formatting, and outputs that didn’t align with what I envisioned. The AI wasn’t interpreting my instructions the way I intended.

So, I had to take a step back and rethink my approach. This wasn’t just about writing prompts; it was about understanding how AI processes information and adjusting accordingly. I kept refining my prompts through an iterative testing process to improve accuracy and consistency. This trial-and-error approach was essential to shaping the AI into a tool that could deliver reliable results.

At first, I was unfamiliar with this level of testing and prompt engineering. I wasn’t used to tweaking phrases repeatedly to see what worked and what didn’t. To bridge this gap, I started talking with developers who had experience working with AI models. They had insights into how these systems processed language and gave me strategies to refine my approach.

I learned a lot along the way, and I’m sharing my process here in case you’re also struggling with AI-generated inconsistencies and want to improve your prompts.

  • Breaking down prompts into smaller, testable parts: One of my first big realizations was that long, complex prompts are a nightmare to troubleshoot. If I packed too many instructions into one giant prompt, it was impossible to figure out which part was breaking the output. So, I started splitting my prompts into smaller, focused sections and testing them individually.
  • Writing specific instructions: Vague prompts lead to vague results. The AI doesn’t “guess” what I want—it follows whatever logic I give it. If my instructions were too open-ended, the AI would make assumptions, which led to inconsistent outputs. By spelling out exactly what I expected, the AI stopped improvising and delivered consistent, structured responses.
  • Giving clear examples and use cases: I quickly learned that AI works best with examples to follow. When I wasn’t getting the desired results, I started adding example outputs instead of rewriting the prompt from scratch.
  • Testing, testing, and more testing: If I had to give one ultimate piece of advice, it’s this: test your prompts as many times as possible. Each iteration helps fine-tune the results, making them more accurate, more structured, and less prone to errors. Initially, I thought my prompts were clear enough. But the more I tested, the more small improvements I found that made a huge difference.

Another major challenge was keeping track of tested prompts. I ran the same tests multiple times because I wasn’t documenting what worked and what didn’t.

To fix this, I maintained a detailed record of every tested prompt, the exact AI output, and why it worked (or didn’t). This turned out to be invaluable, especially when working with developers. Instead of starting from scratch, they could build on my testing, avoiding repetition and improving the AI model more efficiently.
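
A record like this doesn’t need to be sophisticated. For illustration, even a simple structure like the one sketched below (the field names are illustrative, not a standard) is enough to make iterations traceable for developers.

from dataclasses import dataclass
from datetime import date

@dataclass
class PromptTestRecord:
    # One entry in an illustrative prompt-testing log.
    test_date: date
    prompt_version: str   # e.g. "structure-check v3"
    prompt_text: str
    ai_output: str
    worked: bool
    notes: str            # why it worked, or why it didn't

log: list[PromptTestRecord] = [
    PromptTestRecord(
        test_date=date(2025, 1, 15),  # illustrative date
        prompt_version="structure-check v3",
        prompt_text="Check that the General Anamnesis section is present and...",
        ai_output="Missing mandatory detail: quantity per day for current smoker.",
        worked=True,
        notes="Listing the sub-fields explicitly removed the ambiguity from v2.",
    ),
]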

It was a new workflow for me, but it highlighted how design processes must evolve when collaborating with AI technologies. This refinement process ensured the AI not only met the technical requirements but also delivered results that aligned with the expectations and needs of the end users.

Feasibility Workshop

When I completed my initial testing and started analyzing the results, I felt a mix of confidence and uncertainty. The AI was showing promise, but I wasn’t entirely sure how realistic some of the assumptions regarding actual implementation were. Theoretical feasibility is one thing, but making something work in a real-world system is another.

That’s why we decided to bring in the technical team for a workshop. The goal was simple: refine everything I had learned from testing with input from those who knew the technical constraints inside out. I wanted to validate which parameters were workable, which needed more fine-tuning, and which were just wishful thinking.

During the session, I presented my findings—how the AI was interpreting data, where it struggled, and what seemed to work well. We went through the evaluation parameters one by one and sorted them into three groups:

  1. Feasible – Things AI could handle well right away.
  2. Difficult – Areas that required some workarounds or additional refinement.
  3. Impossible – Parameters that, given current limitations, the AI couldn’t realistically evaluate accurately.

What became immediately apparent was that some things were much more manageable for AI to handle than others. By discussing with the team, I pinpointed where AI could be most powerful and deliver the most value to users, even in an MVP (Minimum Viable Product) version. This was key because not everything needed to be perfect immediately—some features could be refined over time, while others needed to be spot-on from the beginning.

Key findings from the workshop revealed clear differences in feasibility among the parameters. For example:

  • Some parameters were rule-based and straightforward, making them well-suited for automation. Acronyms and Structure were placed in the high feasibility category because they followed clear, standardized rules. AI could cross-check acronyms against an existing database and flag inconsistencies without ambiguity. Similarly, ensuring a report’s structure adhered to the required sections was a simple matter of validation.
  • Other parameters weren’t as black and white. Quality assessment fell into the moderately feasible category because it was more subjective. The AI could flag obvious issues, like missing sections, but judging the clarity of a report required a level of human interpretation that AI wasn’t fully capable of handling. This meant we needed hybrid solutions—where AI could do a first-pass evaluation, but human reviewers would still be involved in more nuanced decision-making.
  • Some parameters weren’t feasible at this stage. Copy-paste detection was a significant challenge, not because AI couldn’t identify duplicated text but because of privacy constraints around anonymized patient data. AI couldn’t always access the necessary context to determine whether copy-pasting was problematic. This realization made us rethink whether this was a feature worth pursuing in the short term or if it needed to be left out of the MVP version.

This workshop helped align expectations with reality. We identified where AI could have the most impact, where it needed human oversight, and where it wasn’t worth pursuing—at least not yet. It reinforced an important truth:

AI design isn’t just about what’s theoretically possible. It’s about what’s practical, what’s functional, and what can actually be implemented to benefit users.

And that’s a lesson I’ll carry into every AI-driven project I work on in the future.

Conclusion: Designing AI with intention

Working on this project made me realize that designing for AI isn’t just about interfaces—it’s about shaping how AI interacts with people. Unlike traditional design, where everything is predefined, AI-driven products evolve, learn, and sometimes behave unpredictably. That meant I couldn’t just assume things would work. I had to test, refine, and challenge assumptions to ensure the technology aligned with real user needs.

I’ve always believed that great design comes from understanding the problem firsthand. With AI, that means going beyond the surface level and really understanding how the technology works. This isn’t about becoming a developer or data scientist but ensuring that the AI aligns with real user needs rather than just showcasing its technical capabilities. That’s why engaging multidisciplinary teams is essential—bringing together designers, developers, and AI specialists to bridge the gap between user experience and AI capabilities.

One of the biggest challenges is keeping the design user-centred rather than technology-driven. AI is powerful, and it’s tempting to focus on what it can do instead of what it should do.

I won’t lie—this process was a steep learning curve. It required me to embrace uncertainty, collaborate across disciplines, and rethink how I approached design. But through this, I realized something important:

When designers actively engage with AI, we don’t just adapt to new technology—we shape it.

This is what excites me the most about working with AI. We have the power to define how these systems interact with people. If we step back and leave it all to engineers, we risk creating technically impressive tools disconnected from human needs.

But if we get involved—if we experiment, challenge assumptions, and refine AI with intention—we can ensure that technology serves people, not the other way around.

And that’s a challenge worth taking on.

Livia Stevenin
UX/UI Designer

Livia is a designer at Buildo, with expertise in UI, UX, and design systems. Her Brazilian background adds a valuable layer of cultural diversity and perspective to the team. She is driven by her passion for research and for collaborating with others to deliver top-quality projects.

Still curious? Dive deeper

Artificial Intelligence
Prefect for Generative AI Pipelines
February 18, 2025 · 12 minutes read

Artificial Intelligence
What we learned from Google's People + AI Guidebook
September 9, 2024 · 10 minutes read

Artificial Intelligence
Software Development Companies in the AI Era
November 3, 2023 · 6 minutes read
