Software engineer Ayodeji Erinfolami shares insights from architecting AI systems

When software engineer Ayodeji Erinfolami first started building AI-driven platforms, he thought the hardest part would be getting the artificial intelligence to work properly. He was wrong.
“People get excited about the AI part, but that’s only about 20% of the work,” he says. “The other 80% is about building systems that can deal with AI’s unpredictability, like responses taking different amounts of time, costs that can quickly get out of hand, and having backup plans for when the models don’t work.”
As a co-founder of an AI-powered SaaS company and former senior engineer at multiple tech firms, he has been grappling with AI infrastructure challenges as the technology has evolved. His experiences offer insights into what it takes to build AI systems that work reliably.
The Token Cost Reality
Ayodeji learned about AI costs early. “We realized early in development that costs could quickly get out of hand if we weren’t careful about context management,” he recalls. “A single conversation could get expensive if you’re not strategic about it.”
The problem wasn’t just the spending; it was understanding how AI pricing differs from traditional APIs. Unlike regular APIs that charge per request, AI services charge based on the amount of text processed, including the conversation history that provides context for responses.
“Every message in a conversation gets sent to the AI along with all the previous messages,” he explains. “So a long conversation isn’t just one API call, it’s processing all that text every time.”
His solution was what he calls “smart context pruning.” The system tracks what’s actually important in a conversation and gradually removes unnecessary information while preserving context that matters for response quality.
“We’re building custom logic that understands conversation patterns,” Ayodeji says. “A customer asking about product features needs different context preservation than someone going through technical support. You can’t just cut everything randomly.”
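In practice, a pruning routine along these lines might look like the sketch below. The token heuristic and the message format are illustrative assumptions rather than the platform’s actual code.

```python
# Illustrative sketch of conversation-context pruning; the heuristics and
# message format are assumptions, not the platform's production logic.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def prune_context(messages: list[dict], max_tokens: int = 3000) -> list[dict]:
    """Keep the system prompt and the most recent messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    # Walk backwards so the newest turns are preserved first.
    for message in reversed(rest):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))
```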
The challenge becomes more complex when the AI needs fresh information from external sources. “Sometimes the AI needs to know current information that’s not in its training data,” he explains. “That’s where tools like crawl4ai come in handy – we can grab fresh content from websites and feed it to the AI along with the conversation context. But that adds even more tokens to manage.”
This approach helps reduce token usage significantly without affecting conversation quality; at scale, that is the difference between a sustainable business model and burning cash.
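For the fresh-content step, crawl4ai’s documented async API can be paired with a hard cap on how much crawled text gets added to the prompt; how the result is folded into the context in this sketch is an assumption.

```python
import asyncio
from crawl4ai import AsyncWebCrawler  # pip install crawl4ai

async def fetch_page_context(url: str, max_chars: int = 4000) -> str:
    """Fetch a page and return clean markdown, truncated to keep token costs bounded."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        # result.markdown is the cleaned, LLM-friendly rendering of the page.
        return str(result.markdown)[:max_chars]

# The crawled text would then be injected as extra context, e.g.
# {"role": "system", "content": f"Reference material:\n{page_text}"}.
if __name__ == "__main__":
    page_text = asyncio.run(fetch_page_context("https://example.com"))
    print(page_text[:500])
```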
Message Queues as Critical Infrastructure
Most developers think of message queues as nice-to-have infrastructure. For AI platforms, he says they’re absolutely essential.
“RabbitMQ became our backbone pretty quickly, but not in the way most people use it,” Ayodeji explains. “We had to think about priority queues because not all AI requests are equally urgent. When someone’s asking the AI a question, that needs to happen fast. But updating analytics or processing billing information can wait a bit.”
The priority system treats user-facing AI responses as highest priority, while background tasks like analytics updates and billing can wait. “If someone’s chatting with the AI, they expect a response quickly,” he notes. “But if their usage analytics update a bit later, nobody notices.”
The message queue also handles the complexity of AI workflows that involve multiple steps. “Getting an AI response isn’t just one API call,” he explains. “You might need to retrieve context from Pinecone, crawl some fresh content with crawl4ai, process the user’s message, call the AI API, then store the response and update usage tracking. RabbitMQ helps orchestrate all those steps reliably.”
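A minimal sketch of such a priority queue, using RabbitMQ’s per-message priorities through the pika client; the queue name and priority values are illustrative.

```python
import json
import pika  # pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# RabbitMQ supports per-message priorities once the queue declares a maximum.
channel.queue_declare(
    queue="ai_requests",
    durable=True,
    arguments={"x-max-priority": 10},
)

def publish(task: dict, priority: int) -> None:
    channel.basic_publish(
        exchange="",
        routing_key="ai_requests",
        body=json.dumps(task),
        properties=pika.BasicProperties(priority=priority, delivery_mode=2),
    )

# User-facing chat completions jump the line; analytics and billing can wait.
publish({"type": "chat_completion", "conversation_id": "abc123"}, priority=9)
publish({"type": "usage_analytics", "tenant_id": "t-42"}, priority=1)
```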
More importantly, the message queue system handles the inevitable failures that happen with AI services. APIs go down, models get overloaded, and rate limits hit without warning.
“We built what I call ‘smart backoff,’” he describes. “Instead of just retrying failed requests immediately, the system waits longer each time and eventually routes to fallback options.”
This approach ensures the platform can handle AI service outages by automatically switching to cached responses and alternative models when needed.
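Stripped down, a “smart backoff” loop can look something like the following; the delays and the fallback hook are illustrative assumptions.

```python
import random
import time
from typing import Callable

def call_with_backoff(primary: Callable[[], str],
                      fallback: Callable[[], str],
                      max_attempts: int = 4) -> str:
    """Retry the primary model with exponential backoff, then route to a fallback."""
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception:
            # Wait roughly 1s, 2s, 4s, 8s (plus jitter) before the next attempt.
            time.sleep(2 ** attempt + random.random())
    # All retries exhausted: serve a cached response or an alternative model.
    return fallback()
```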
Database Design for Conversations
Traditional database design assumes relatively predictable data patterns. AI platforms break those assumptions.
“Conversations grow indefinitely, token usage needs precise tracking for billing, and you might have multiple versions of the same conversation as the AI generates different responses,” he says, listing the challenges.
The solution centers on MongoDB for conversation storage, complemented by Pinecone for vector search and crawl4ai for extracting content from websites. “MongoDB handles the conversation flow and user data well, but for the AI’s knowledge retrieval, we use Pinecone,” Ayodeji explains. “And when we need to pull information from websites, crawl4ai helps us get clean, structured content that the AI can actually work with.”
The query patterns prioritize retrieving recent conversation history quickly. “Conversation recency is more important than being able to search every word someone ever said to the AI,” he notes. “And when the AI needs to find relevant information from documents or knowledge bases, Pinecone handles that vector search much better. The crawl4ai integration means we can feed the AI current information from websites.”
The multi-tenant nature of AI platforms adds another layer of complexity. “Every query needs to be tenant-aware,” he emphasizes. “You can’t accidentally show one customer’s conversation to another, and you need to quickly calculate usage by tenant for billing.”
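A tenant-aware, recency-first query along these lines might look like the following pymongo sketch; the collection and field names are assumptions.

```python
from pymongo import MongoClient, DESCENDING  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
messages = client["chat_platform"]["messages"]

# Compound index so recent-history lookups stay fast as conversations grow.
messages.create_index([("tenant_id", 1), ("conversation_id", 1), ("created_at", -1)])

def recent_history(tenant_id: str, conversation_id: str, limit: int = 20) -> list[dict]:
    """Every query filters on tenant_id first, so one customer's data never leaks into another's."""
    cursor = (
        messages.find({"tenant_id": tenant_id, "conversation_id": conversation_id})
        .sort("created_at", DESCENDING)
        .limit(limit)
    )
    return list(cursor)[::-1]  # return oldest-first for prompt assembly
```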
Redis for Conversation Memory
Traditional web applications cache static content and database query results. AI platforms need caching for a different purpose – conversation memory.
“We use Redis for conversation memory,” he explains. “The AI needs to remember what happened earlier in a conversation, and Redis is perfect for that kind of temporary but important data.”
The key challenge was finding the right approach for conversation memory. “Langgraph memory was good but wasn’t giving us exactly what we wanted,” Ayodeji notes. “So we built our own conversation memory system using Redis. It’s much more responsive and gives us better control over how conversation context gets stored and retrieved.”
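A Redis-backed conversation memory can be as simple as a capped list per conversation with a sliding expiry. The key naming, cap, and TTL below are illustrative assumptions, not the team’s actual implementation.

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 60 * 60 * 24  # keep memory for a day of inactivity

def remember(conversation_id: str, role: str, content: str) -> None:
    key = f"conv:{conversation_id}:memory"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.ltrim(key, -50, -1)        # cap the number of stored turns
    r.expire(key, TTL_SECONDS)   # refresh the TTL on every write

def recall(conversation_id: str) -> list[dict]:
    key = f"conv:{conversation_id}:memory"
    return [json.loads(item) for item in r.lrange(key, 0, -1)]
```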
Planning for Monitoring
Standard application monitoring tracks response times and error rates. AI platforms will need different metrics that reflect the quality and cost of interactions.
“We’re planning to track token usage per conversation, response quality scores, fallback activation rates, and cost per interaction,” he lists. “These metrics will tell you if your AI platform is healthy in ways that traditional monitoring can’t capture.”
One particularly useful metric will be conversation abandonment rate: how often users stop talking to the AI mid-conversation. “If this spikes, it usually means the AI is giving poor responses or the system is too slow,” Ayodeji notes. “It’s an early warning sign of user dissatisfaction.”
The monitoring setup will use Prometheus and Grafana with custom metrics designed for AI workloads. “We’ll alert on things like ‘token usage spike’ or ‘high fallback rate’ because those indicate problems that matter for AI platforms specifically,” he says.
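With the prometheus_client library, AI-specific metrics like these can be exposed alongside standard ones; the metric names and labels below are illustrative.

```python
from prometheus_client import Counter, Histogram, start_http_server  # pip install prometheus-client

TOKENS_USED = Counter(
    "ai_tokens_total", "Tokens consumed by AI requests", ["tenant", "model"]
)
FALLBACK_ACTIVATIONS = Counter(
    "ai_fallback_activations_total", "Times a fallback model or cached response was used", ["reason"]
)
RESPONSE_LATENCY = Histogram(
    "ai_response_seconds", "End-to-end latency of AI responses"
)

start_http_server(9100)  # expose /metrics for Prometheus to scrape

# Inside the request path:
TOKENS_USED.labels(tenant="t-42", model="primary").inc(1350)
with RESPONSE_LATENCY.time():
    pass  # call the model here
```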
Planning for Failure
AI services will fail. His philosophy is to accept this reality and build accordingly.
“We’re building three levels of fallback,” Ayodeji outlines. “First, we try a different model from the same provider. If that fails, we switch to a different AI provider entirely. If everything fails, we fall back to pre-written responses.”
The key is making these transitions invisible to users. “Users shouldn’t know or care which AI model is answering their question,” he emphasizes. “They just want good, fast responses.”
This requires building an abstraction layer that routes requests to different AI providers based on availability and cost. “We can switch between different AI services without changing any application code,” Ayodeji explains. “The routing logic handles provider differences automatically.”
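One way such an abstraction layer can be shaped is as an ordered chain of providers, as in this sketch; the interface and error handling are deliberately simplified assumptions.

```python
from typing import Callable

def route_request(prompt: str,
                  providers: list[Callable[[str], str]],
                  canned_reply: str) -> str:
    """Levels 1 and 2: try each configured model/provider in order.
    Level 3: fall back to a pre-written response if everything fails."""
    for complete in providers:
        try:
            return complete(prompt)
        except Exception:
            continue  # provider outage, rate limit, timeout, ...
    return canned_reply

# Application code only calls route_request and never references a specific
# vendor, so providers can be reordered or swapped purely by configuration.
```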
Handling Unpredictable User Behavior
One of the biggest challenges in building AI platforms is how unpredictable real user behavior can be compared to development assumptions.
“During development, you think about normal use cases,” he says. “But users will try to break your AI, ask it to write novels, or have conversations that go on for hundreds of messages.”
Some users will try to get the AI to generate extremely long content, which is expensive in tokens and slow to complete. “Our cost controls need to kick in for these situations,” Ayodeji notes. “We need better request validation to handle edge cases.”
The solution involves intelligent request filtering that detects unusual requests before they reach the AI. “We can spot when someone’s trying to push the limits and either adjust their request or suggest alternatives,” he says.
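A first line of defence can be cheap validation that runs before any tokens are spent. The thresholds and suspect phrases below are illustrative assumptions.

```python
MAX_INPUT_CHARS = 8000
SUSPECT_PHRASES = ("write a novel", "repeat this 1000 times")  # illustrative only

def validate_request(message: str, estimated_output_tokens: int,
                     output_token_cap: int = 2000) -> tuple[bool, str]:
    """Cheap checks that run before the request ever reaches the model."""
    if len(message) > MAX_INPUT_CHARS:
        return False, "Message too long; please shorten your request."
    if estimated_output_tokens > output_token_cap:
        return False, "Requested output exceeds the per-response limit."
    if any(phrase in message.lower() for phrase in SUSPECT_PHRASES):
        return False, "This request looks unusually large; try narrowing it down."
    return True, ""
```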
Cost Controls That Actually Work
Token costs can quickly make AI platforms unprofitable without proper controls.
“We’re building spending limits per user, per conversation, and globally,” Ayodeji says, describing the multi-layered approach. “But the limits are smart: they consider the user’s subscription level and the value of their requests.”
The system also learns from usage patterns to optimize costs automatically.
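Layered spending limits can be reduced to a single pre-flight check, as in this sketch; the plan tiers and caps are illustrative numbers, not the platform’s actual limits.

```python
# Illustrative layered budget check; limits and plan tiers are assumptions.
PLAN_LIMITS = {"free": 50_000, "pro": 2_000_000}  # tokens per user per month
CONVERSATION_CAP = 100_000                         # tokens per conversation
GLOBAL_DAILY_CAP = 50_000_000                      # platform-wide safety net

def within_budget(user_plan: str, user_month_tokens: int,
                  conversation_tokens: int, global_day_tokens: int,
                  requested_tokens: int) -> bool:
    """Reject the request before calling the model if any layer would overflow."""
    return (
        user_month_tokens + requested_tokens <= PLAN_LIMITS[user_plan]
        and conversation_tokens + requested_tokens <= CONVERSATION_CAP
        and global_day_tokens + requested_tokens <= GLOBAL_DAILY_CAP
    )
```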
Security Beyond Traditional Threats
AI platforms face unique security challenges that traditional applications don’t encounter. “People will try to get your AI to reveal information about other users, expose your prompts, or bypass safety restrictions,” he warns.
The security approach involves multiple protection layers: input sanitization before reaching the AI, output filtering before reaching users, and comprehensive logging for audit purposes.
“But the real security is in the architecture,” Ayodeji emphasizes. “Making sure tenant data is completely isolated and that the AI can’t accidentally access information it shouldn’t.”
The audit trail is particularly important. “We need to be able to trace every response back to the specific inputs and context that generated it,” he explains. “If there’s ever a security issue or compliance question, we need that history.”
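An audit trail like that can start as structured log entries that tie every response back to its inputs. The field names here are assumptions for illustration.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("ai_audit")

def record_interaction(tenant_id: str, user_id: str, prompt: str,
                       context_ids: list[str], response: str, model: str) -> str:
    """Persist everything needed to trace a response back to its inputs."""
    entry_id = str(uuid.uuid4())
    audit_log.info(json.dumps({
        "id": entry_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tenant_id": tenant_id,
        "user_id": user_id,
        "model": model,
        "prompt": prompt,
        "context_ids": context_ids,  # which retrieved documents/chunks were injected
        "response": response,
    }))
    return entry_id
```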
Scaling Complexity, Not Just Volume
Scaling AI platforms isn’t just about handling more users; it’s about managing exponential complexity growth. “Every new user doesn’t just add one more conversation stream,” he explains. “They add conversation history, usage tracking, personalization data, and unique interaction patterns.”
The solution requires building systems where different components can scale independently. “We can scale the AI processing separately from the message handling, separately from the conversation storage,” Ayodeji describes. “When AI services are slow, we scale up our retry mechanisms. When we have lots of new conversations, we scale up storage.”
Each component scales based on its specific bottlenecks rather than scaling everything uniformly.
Future-Proofing for AI Evolution
The AI landscape evolves rapidly, with new models and capabilities appearing regularly. He believes successful platforms will be those that can adapt quickly without rebuilding everything.
“We’re designing our system so we can plug in new AI models, new providers, or even completely different types of AI without changing the core platform,” Ayodeji explains. “When new AI models come out, we want to be able to integrate and evaluate them quickly.”
The flexibility comes from treating AI as a pluggable component rather than the center of the architecture. “You’re not just building software,” he concludes. “You’re building a bridge between unpredictable AI capabilities and users who expect reliable, fast, and cost-effective experiences. That bridge needs to be incredibly robust.”
His advice for other engineers tackling similar challenges is straightforward: expect the unexpected, plan for failures, and remember that the AI is just one piece of a much larger puzzle.
Ayodeji Erinfolami is a Professional Member of BCS, The Chartered Institute for IT, a Senior Software Engineer, and a co-founder of an AI-powered SaaS company.










