Developing Generative AI products introduces challenges like indeterminism, data volume, and scalability. Teams must adapt, and selecting the right tools becomes essential, as we did with Prefect, a Pythonic workflow engine.
With the introduction of GenAI technologies, teams building digital products worldwide are leveraging the latest model capabilities to create enhanced user experiences. Together with the new opportunities come new and old challenges: non-functional concerns such as scalability, long-running computations, and costly API calls are now common, even for smaller projects and teams. In this post, we share our latest experience in tackling these issues, focusing on adapting our processes and choosing the proper tooling for the job: Prefect.
By indeterminism, we mean that, at the end of the day, you can't be sure about the results of your AI program. You can and should increase overall confidence in the results by employing, for example, ad-hoc LLM-as-judge test suites (e.g., DeepEval), basic unit tests, and human expert reviews. However, confidence cannot be guaranteed by design, and it also comes down to a matter of cost: the more confident you want to be, the more it costs you.
You have direct costs due to additional API calls to LLM-as-a-Service providers or to maintaining infrastructure for custom LLM models. You also incur indirect costs in designing, implementing, and keeping the metrics of your LLM-as-a-judge well-defined, which is, in general, more complex and error-prone than writing unit tests. The cost factor can't be ignored: these challenges now arise even in smaller teams and projects, where budget constraints may be a primary concern.
For this reason, observability, inspection, and error management become essential for a project's success. Enabling the team from the beginning to debug and inspect how the AI program is behaving provides significant value in the medium to long term.
GenAI projects typically involve handling vast amounts of data and dependencies on external providers, leading to expensive API calls, complex data management, and time-consuming computations. It is therefore necessary to structure the software into data-driven and/or domain-driven modules.
By structuring the application in a data-driven (or domain-driven) manner, teams can differentiate development loops. This helps decouple the workload within the team, allowing each team member to run their module without incurring the computation time and budget overhead of other modules. To achieve this, module dependencies must be cached or mocked. This is nothing new from standard software engineering practices; however, these strategies now become essential for cost reduction, as each module may require intensive computation or data retrieval routines.
From a governance perspective, aligning expectations for AI-based deliverables is inherently challenging. Whether it's a client or an end-user, setting clear expectations for an LLM’s output is more complex. In deterministic software, specific functional requirements and user experience can be agreed upon more straightforwardly. Results can also be anticipated using low- or high-fidelity mockups, making it easy to compare expectations against final outcomes. In GenAI projects, however, outputs cannot be previewed with the same precision, and results may vary from one generation to the next.
Moreover, in domain-specific contexts, users tend to have even higher expectations for AI-generated content, as they are accustomed to high-quality, highly specialized outputs. Think about medical documentation, financial reports, or engineering consultancy.
Technical solutions alone cannot address these challenges. Proper governance with well-aligned processes must be established to mitigate the risks of misaligned expectations. Consumers must be informed about what is feasible and the degree of potential variation, while producers must provide detailed insights in advance through product investigation and PoCs.
We want to share how we tackled the challenges described earlier using a real-world use case. In a nutshell, the project's key characteristics are:
Textual Reports Generation — The objective is to generate a large volume of text reports (one page each), which involves a high level of indeterminism.
Financial Domain — A high degree of specificity in wording, tone, and text structure is expected. Additionally, precise numeric data is mandatory.
Massive Amount of Data — Various structured (e.g., performance figures, portfolio data) and unstructured (e.g., market news) data sources must be queried and processed to generate the final reports.
Several Financial Products — The project scope includes dozens of financial products, their comparable products, and indices. Additionally, reports are generated in multiple languages and for different reference timeframes, which results in computationally expensive and costly report generation.
We found the right tool in Prefect to bridge our processes to code implementation. Prefect is a Pythonic workflow engine designed to build, deploy, and run pipelines. It fits nicely in scenarios involving a lot of data, processing, and offline computation.
The first thing we did was break down the content of the textual report examples. We focused on segregating the content based on financial domain insights, ensuring that the data required for each output part was as decoupled as possible from other parts.
This breakdown enabled us to architect our software modules based on domain-driven and data-driven dependencies. As a result, team members could develop each module independently.
Prefect allows you to design a workflow using flows and tasks. Flows can be nested, while tasks are the smallest unit of work.
We mapped our textual generation process into flows aligned with our module segregation. This made running isolated flows (i.e. sub-contents) easy and cost-effective. Team members could configure which flows to run and specify parameters, allowing us to optimize development and computation costs.
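To make this concrete, here is a minimal sketch of how such a structure can look in Prefect; the flow, task, and section names are illustrative, not our actual modules:

from prefect import flow, task

@task
def fetch_performance_data(product_id: str, year: int, month: int) -> dict:
    # Placeholder for querying structured data sources
    return {"product_id": product_id, "year": year, "month": month}

@flow
def generate_performance_section(product_id: str, year: int, month: int) -> str:
    # Sub-flow that can be run in isolation during development
    data = fetch_performance_data(product_id, year, month)
    return f"Performance section for {data['product_id']} ({year}-{month})"

@flow
def generate_market_section(product_id: str, year: int, month: int) -> str:
    # Another independent sub-flow (e.g. unstructured market news)
    return f"Market section for {product_id} ({year}-{month})"

@flow
def generate_report(product_id: str, year: int, month: int, sections: list[str]) -> str:
    # Parent flow: runs only the configured sub-flows (i.e. sub-contents)
    parts = []
    if "performance" in sections:
        parts.append(generate_performance_section(product_id, year, month))
    if "market" in sections:
        parts.append(generate_market_section(product_id, year, month))
    return "\n\n".join(parts)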
Beyond the traditional organizational benefits, we also significantly reduced costs. Team members could generate sub-content without relying on external data, API calls, or the computation time of other modules.
At deployment time, periodic (i.e. weekly) generations were as simple as configuring a trigger through an RRule. The Prefect engine handled error management and retries with detailed tracking, ensuring that scaling generation to hundreds of reports remained safe and controlled.
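One way to wire this up (the flow name, schedule, and retry values below are illustrative) is to serve the flow with an RRule schedule and task-level retries:

from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def call_llm(prompt: str) -> str:
    # Prefect retries this task on failure and tracks every attempt
    ...

@flow
def generate_reports(year: int, month: int):
    # Orchestrates the periodic generation of all reports
    ...

if __name__ == "__main__":
    # Serve the flow with a weekly RRule trigger (schedule string is hypothetical)
    generate_reports.serve(
        name="weekly-report-generation",
        rrule="FREQ=WEEKLY;BYDAY=MO;BYHOUR=6;BYMINUTE=0",
    )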
We leveraged the Prefect Artifacts feature to support a continuous feedback loop, human feedback, and historical tracking. Artifacts are persisted outputs designed for human consumption.
Data inputs transformed into markdown, prompts, and outputs are stored as artifacts, enabling the team to easily inspect generation runs for debugging and review. Additionally, intermediate, more fine-grained artifacts were employed for data validation (pre-generation) and data auditing (during generation).
from prefect import task
from prefect.artifacts import create_table_artifact

@task(name="Get Data", task_run_name="Get Data {year}-{month}")
async def get_data(year: int, month: int):
    # Business logic...
    # `df` is the pandas DataFrame produced by the business logic above
    rows = [row.to_dict() for index, row in df.iterrows()]
    # Persist the retrieved data as a table artifact for later inspection
    await create_table_artifact(
        key=f"{year}-{month}-data",
        table=rows,
        description="...",
    )
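We followed the same pattern for prompts and generated text using markdown artifacts; a minimal sketch (the task and key names are illustrative) could look like this:

from prefect import task
from prefect.artifacts import create_markdown_artifact

@task(name="Generate Sub-content")
async def generate_subcontent(product_id: str, prompt: str) -> str:
    output = ...  # LLM call (placeholder)
    # Store prompt and output as a human-readable markdown artifact
    # (artifact keys allow lowercase letters, numbers, and dashes)
    await create_markdown_artifact(
        key=f"{product_id}-subcontent",
        markdown=f"## Prompt\n\n{prompt}\n\n## Output\n\n{output}",
        description="Prompt and generated sub-content for review",
    )
    return output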
We found in Artifacts a human-oriented observability and inspection tool: not efficient in terms of quantity, but highly valuable in terms of quality. Based on a colleague's reference or user feedback, developers could check intermediate and final results with a few simple clicks.
The risk of diverging expectations was a primary concern. We mitigated this risk by establishing a periodic and formal feedback loop with end-users from the outset.
A specific example that benefited from this approach was wording and phrasing. From the very first versions of partial contents, users raised concerns about how acronyms, financial lingo, and text structure didn’t align well with human-generated reports. Users were accustomed to wording that was different from what the LLM had learned.
By incorporating early feedback, we managed to implement specific solutions from the beginning, ensuring they propagated effortlessly and without additional costs to the yet-to-be-implemented sub-content. In this case, the solution involved building a dataset for fine-tuning and also providing the LLM with well-defined vocabulary instructions and dictionary mappings, along with deterministic post-generation routines.
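As a rough illustration of the deterministic post-generation step (the mappings below are invented, not the client's actual terminology), a simple dictionary-based routine can enforce the expected wording:

# Hypothetical mapping from the LLM's phrasing to the domain-preferred wording
TERMINOLOGY_MAP = {
    "yearly return": "annualized return",
    "NAV value": "NAV",
}

def normalize_terminology(text: str, mapping: dict[str, str] = TERMINOLOGY_MAP) -> str:
    # Deterministically replace known terms after generation
    for source, target in mapping.items():
        text = text.replace(source, target)
    return text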
It is easy to update prompts to improve or fix results without realizing the unintended side effects. In our case, it was common to encounter issues with certain financial products or with specific parts of the output, and fixing them often introduced new issues in other products, sometimes without immediate notice.
The testing field is evolving rapidly, with new solutions emerging to address AI-driven challenges. LLM-as-a-judge is a widely used approach in which an LLM evaluates another LLM’s response.
We developed a test suite using DeepEval to assess regressions and output quality. To reduce cost, we re-used the Artifacts already generated for the same code changes instead of generating new outputs to evaluate.
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# GEval metric judging the generated sub-content (evaluation steps omitted here)
subcontent_generic_phrases = GEval(
    name="Avoid generic phrases",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[...],
    threshold=0.2,
)

# read_artifacts is our helper that loads the outputs already stored as Prefect artifacts
targets = read_artifacts(id)
test_cases = [
    LLMTestCase(input="", actual_output=target)
    for target in targets
]
dataset = EvaluationDataset(test_cases=test_cases)

@pytest.mark.parametrize("test_case", dataset)
def test_subcontent(test_case: LLMTestCase):
    assert_test(test_case, [subcontent_generic_phrases])
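Assuming DeepEval's pytest integration, the suite runs like any other pytest module; it can also be executed with the deepeval test run CLI to obtain DeepEval's own report.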
However, we struggled to determine the overall quality of deliverables through automated testing alone. Designing precise metrics that fully align with user expectations is complex and not always feasible.
At the end of the day, human feedback on a sample of final outputs remains essential. Automated tests can still guide humans in selecting the "most problematic" outputs to review. Integrating automated tests with a two-step human review process, both internal and external, proved extremely valuable. During the feature development phase, we shared progressive iterations of the final output for internal review. Then, we relied on the previously described continuous feedback loop for external review and further iteration.
Most of the time, we ended up updating already successful tests based on human feedback—AI deliverables evolve, and so do their evaluation metrics.
Historical Tracking
Domain segregation, continuous feedback loops, and human feedback require a solid approach to tracking data inputs, prompts, and outputs. To correctly evaluate how the final textual outputs are progressing, you must continuously compare new changes against previous ones. This is crucial when dealing with large text outputs, as in our case.
What helped us the most was establishing a well-defined process and structure to store information for development workflow purposes. We used the Prefect Artifacts feature to persist input data, intermediate results, and final outputs as markdown reports. After a development workflow run, developers could easily check how new changes affected the entire pipeline. Moreover, the generated artifacts were easy to share, enabling collaborative discussion among team members and easing change reviews.
Having a historical record of (data, prompt, output) helped us precisely identify which changes impacted certain outputs. This allowed us to revert or update changes based on external feedback, even if it arrived in later feedback cycles.
Prefect is designed with caching in mind, to avoid re-computing a task when its parameter values have not changed. In our case, this was essential for two main reasons:
First, we had to iterate on business logic and prompts that rely on a large amount of data, such as performance metrics, news, and portfolio compositions. This data changes over time, but retrieving it on every run is unnecessary when updating parts it does not affect. By implementing your workflow with Prefect caching in mind, avoiding pointless data retrieval becomes effortless.
Second, because module segregation can still involve a certain degree of data dependency, Prefect caching provides a straightforward way to share data between modules. Even though it might seem an unusual choice, we found that leveraging the disk cache greatly simplified avoiding unnecessary data retrieval between different tasks. Thankfully, Prefect caching behaves the same, at no extra effort, even when used via a Docker work pool, so it can equally be used in different environments, including production.
Below is an example of how to set up caching for an expensive task:
from dataclasses import dataclass
from typing import Any, Optional

from prefect import task
from prefect.cache_policies import TASK_SOURCE, CachePolicy
from prefect.utilities.hashing import hash_objects

# Names of the task parameters that should contribute to the cache key;
# other arguments (e.g. clients or non-serializable objects) are ignored.
# The value here is illustrative.
custom_list = ["arg"]

@dataclass
class CustomCachePolicy(CachePolicy):
    def compute_key(self, inputs: dict[str, Any], **kwargs) -> Optional[str]:
        # Build the cache key only from the selected argument names
        hashed_inputs = {}
        inputs = inputs or {}
        if not inputs:
            return None
        for key, val in inputs.items():
            if key in custom_list:
                hashed_inputs[key] = val
        return hash_objects(hashed_inputs)

@task(persist_result=True, cache_policy=(TASK_SOURCE + CustomCachePolicy()))
async def get_expensive_data(arg: dict):
    # Business logic ...
    return ...
We leveraged Prefect's custom cache policy API: the cache key is computed only from a selected subset of the task's arguments. Additionally, a custom CachePolicy can support caching for tasks that receive non-serializable arguments. Different flows invoking the same task with the same relevant inputs result in a disk read instead of a re-computation. This helped us avoid a large number of expensive API calls across different financial products.
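For instance (the flow names and arguments below are hypothetical), two sub-flows invoking the same expensive task with the same relevant argument share the cached result:

from prefect import flow

@flow
async def generate_performance_section(year: int, month: int):
    # First invocation computes the data and persists the result to disk
    data = await get_expensive_data({"product": "fund-a", "year": year, "month": month})
    # ... generate the performance sub-content from `data` ...

@flow
async def generate_market_section(year: int, month: int):
    # Same relevant argument: this invocation reads the cached result
    # instead of calling the external provider again
    data = await get_expensive_data({"product": "fund-a", "year": year, "month": month})
    # ... generate the market sub-content from `data` ...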
Prefect provides logging utilities for monitoring, troubleshooting, and auditing. Compared to the Artifacts review, we found logging useful for orchestration observability rather than for inspecting data and outputs.
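A minimal sketch of the pattern (task name and message are illustrative): Prefect's run logger ties log lines to the corresponding task or flow run, making them visible alongside the orchestration state.

from prefect import get_run_logger, task

@task
def fetch_market_news(product_id: str):
    logger = get_run_logger()
    # Log lines are attached to this task run and visible in the Prefect UI
    logger.info("Fetching market news for %s", product_id)
    ...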
When generating textual reports, we rely on multiple providers to fetch large quantities of data. Providers typically have rate limits, both in terms of frequency and overall usage. A typical use case implemented in Prefect involves several parallel runs, which means sending many API calls to external providers in a short time.
Prefect provides global concurrency and rate limit utilities, allowing you to set locks at specific points in your application to prevent API calls from exceeding the limit. This was extremely helpful in avoiding temporary provider blocks on the requesting client.
Creating a task concurrency limit (applied through task tags) in Prefect is as simple as running
prefect concurrency-limit create get_data 50
while setting up a global limit
prefect global-concurrency-limit create -l 1 --slot-decay-per-second 1 custom-limit
allows you to wait client-side at any arbitrary point in your code:
from prefect import task
from prefect.concurrency.asyncio import rate_limit

@task(name="Get Data")
async def get_data(year: int, month: int):
    # Business logic...
    # Block until a "custom-limit" slot is available before calling the provider
    await rate_limit("custom-limit")
    # Provider API call...
    # Process response...
Projects based on Generative AI bring both new challenges and common ones typical of large-scale products. For a project to succeed, even small teams must systematically address these challenges through governance, processes, and ad-hoc tooling. Flexibility is key: designing ad-hoc strategies for your specific use case and domain is typically a winning move, even when they contradict best practices.
At Buildo, we are evolving our processes, adapting methodologies and workflows to better suit the needs of GenAI projects: choosing the proper tooling makes a difference.
We found in Prefect the right tool to support our internal processes and enhance our deliverables. Prefect's authors now also provide ControlFlow, which is GenAI-oriented and specifically addresses the project needs discussed in this post: we will certainly consider it for future projects in this space.
Matteo is a product-minded Software Engineer. He thrives on bootstrapping and scaling digital products, seamlessly integrating AI into software engineering. He cares about business impacts and UX/UI while keenly nerding on software architectures.
Are you searching for a reliable partner to develop your tailor-made software solution? We'd love to chat with you and learn more about your project.