Orchestrating Multiple LLMs: Comparing GPT-4o, Claude, Gemini & DeepSeek

Today, we will focus on orchestrating multiple large language models (LLMs). This session will be practical, involving a lot of coding and calling various LLMs.

What is LLM Orchestration?

LLM orchestration is the process of managing and coordinating multiple Large Language Models (LLMs) to work together effectively to perform complex tasks. Instead of relying on a single model to handle everything, orchestration allows you to assign specialized tasks to models that excel at them, use smaller, faster models for simple tasks, and combine different models to enhance accuracy. This approach is crucial for optimizing performance, managing costs, and improving the reliability and scalability of LLM-driven applications.

The orchestration layer acts as the central control system, managing the interactions between LLMs, prompt templates, vector databases, and other AI components to ensure a cohesive workflow.


Key Concepts and Components

To understand LLM orchestration, it's helpful to break down the core components and principles:

  • Prompt Management and Chaining: Orchestration frameworks manage and store prompts in a knowledge base or library. They can dynamically select and sequence prompts based on real-time inputs, user preferences, and conversation history. This also involves chaining multiple prompts or models together to achieve more nuanced or multi-step results.
  • Data and Context Management: Orchestrators retrieve and preprocess data from various sources (e.g., databases, APIs, documents) to make it compatible with LLMs. This ensures the models have the relevant, high-quality context they need to generate accurate responses. This is often achieved through a Retrieval-Augmented Generation (RAG) pipeline, which involves fetching external data and using it to inform the model's response.
  • Multi-Agent Systems: In a multi-agent system, specialized AI agents work collaboratively to complete a task. An orchestrator can delegate sub-problems to these agents, which are often trained for specific roles (e.g., a web researcher, a proofreader, a trend analyst). Common multi-agent architectures include a supervisor model, where a central agent directs the workflow, and a network model, where agents communicate with each other.
  • Error Handling and Reliability: To ensure robust systems, orchestration includes features like retry logic with exponential backoff to handle failed API requests and graceful error handling to prevent crashes.
  • Performance Monitoring and Evaluation: Orchestration frameworks provide tools for continuous monitoring of LLM performance. This includes tracking metrics like latency, throughput, and token consumption, as well as using diagnostic tools for debugging. Dashboards can provide real-time visibility into how models are performing. Human evaluation is also essential for catching subtleties that automated systems might miss, such as source selection biases or hallucinated answers.
  • Dynamic Routing: A key function of an orchestrator is to intelligently route user queries to the most suitable model or tool. This allows for cost optimization and improved performance by using smaller, cheaper models for simple tasks and reserving more powerful, expensive models for complex, multi-step problems. A minimal sketch combining routing with the retry logic above follows this list.
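
To make the last two ideas concrete, here is a minimal sketch in Python combining keyword-based routing with retries and exponential backoff. It uses the official openai package; the routing rule and model names are illustrative assumptions, not recommendations.

```python
# Minimal sketch: keyword-based routing plus retry with exponential backoff.
# The routing rule and model names are illustrative assumptions.
import time
from openai import OpenAI, APIError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def route_model(query: str) -> str:
    """Send complex, multi-step queries to a stronger model; default to a cheap one."""
    complex_markers = ("step by step", "analyze", "compare", "plan")
    return "gpt-4o" if any(m in query.lower() for m in complex_markers) else "gpt-4o-mini"

def ask_with_retry(query: str, max_retries: int = 3) -> str:
    model = route_model(query)
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": query}],
            )
            return response.choices[0].message.content
        except APIError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    return ""

print(ask_with_retry("Compare two caching strategies step by step."))
```

In production systems, routing is often handled by a small classifier model rather than keyword matching, but the control flow is the same.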

We will be working with both paid APIs and open-source models, running them in the cloud as well as locally.

If you prefer not to spend any money, you can use local models exclusively, though performance may vary. You can compare your results with those I obtain using the models I select, and explore what you can achieve with free open-source models.

For more detailed information on selecting, applying, and deploying models—whether open or closed source—I recommend my other course. We will not cover those details extensively here to avoid distractions. I assume you have some familiarity with different model types and their appropriate use cases. If not, you can either follow along and learn as we go or consult my other course and the guides I provide.

Overview of Models We Will Use

Let's review the cast of characters—the different models we will experience in this course:

  • GPT-4o Mini from OpenAI: The most well-known model, which we have already used in previous calls.
  • GPT-4o: The larger, more capable cousin of GPT-4o Mini.
  • Reasoning Models: Models trained to think through intermediate steps before answering, which improves outcomes in agentic workflows. Examples include o1 and o3-mini, though these are less central to this course.

Most of the time, we will focus on GPT-4o Mini.
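
Since GPT-4o Mini will be our workhorse, here is the basic call pattern we will use throughout, as a minimal sketch (it assumes the openai package is installed and OPENAI_API_KEY is set in your environment):

```python
# A basic GPT-4o Mini call using the openai package (v1+ client style).
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "In one sentence, what is LLM orchestration?"},
    ],
)
print(response.choices[0].message.content)
```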

  • Anthropic's Claude Models: Anthropic, founded by former OpenAI members, offers highly competitive models. We will primarily use Claude 3.7 Sonnet, with Claude 3 Haiku as a cheaper alternative.
  • Google's Gemini 2.0 Flash: We will use the Flash version, which is currently free within certain usage limits. Gemini offers a path to use frontier models without cost.
  • DeepSeek: A Chinese startup that surprised the community with powerful models like DeepSeek V3 and R1. While not the absolute strongest, DeepSeek achieved performance comparable to OpenAI's GPT-4 at a fraction of the training cost (roughly 30 times less). They also open-sourced their models, including smaller distilled versions based on Llama and Qwen, fine-tuned on DeepSeek-generated data and freely available.
  • Grok and Groq: Two similarly named entities. Grok, with a K, is the model from xAI, the company associated with X (formerly Twitter), which we might use later. Groq, with a Q, is a company providing a fast, low-cost inference runtime for large open-source models such as Llama 3.3 (70 billion parameters) and DeepSeek variants.
  • Ollama: A platform for running open-source models locally, exposing API endpoints consistent with OpenAI's. Under the hood it uses llama.cpp, a high-performance C++ library.
  • Llama: Meta's family of open-source models, which we will run locally through Ollama and in the cloud via Groq (a sketch of the shared call pattern follows this list).
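
Conveniently, several of these providers expose OpenAI-compatible endpoints, so a single client class can talk to all of them by changing the base_url. A sketch under that assumption (the URLs reflect provider documentation at the time of writing, and model names change often, so verify both):

```python
# One OpenAI-style client per provider; only base_url and api_key differ.
from openai import OpenAI

gemini = OpenAI(
    api_key="YOUR_GOOGLE_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)
groq = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1",
)
ollama = OpenAI(
    api_key="ollama",  # Ollama ignores the key, but the client requires one
    base_url="http://localhost:11434/v1",
)

# The call shape is identical across providers:
reply = ollama.chat.completions.create(
    model="llama3.2",  # any model you have pulled, e.g. `ollama pull llama3.2`
    messages=[{"role": "user", "content": "Say hello from a local model."}],
)
print(reply.choices[0].message.content)
```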

If some of these terms are unfamiliar, such as "inference," or if you are unsure about the differences between running models on Ollama versus Groq, I suggest reviewing the background materials in the guides or considering my LLM engineering course, which covers these topics in depth.

Leaderboards and Model Performance

I am a strong advocate for using leaderboards to compare model performance and costs. In fact, I was humorously nicknamed "Leaderboard" by my friend Jon Krohn on his Super Data Science podcast.

One excellent resource is the Vellum leaderboard website, which compares many leading closed-source and open-source models side by side. It includes metrics such as cost, context window size, and benchmark performance across various dimensions. I highly recommend bookmarking this site and using it to evaluate the costs and capabilities of different APIs as you explore them.

Resources and Encouragement

There are abundant resources available for this course, including videos, guides, troubleshooting tips, and a GitHub repository that I continuously update. As new model versions are released, I will update the labs to keep you current.

While the lectures provide one perspective, you will gain a deeper understanding by working through the labs yourself.

Expect to encounter roadblocks and problems; this is where real learning happens. Debugging and diagnosing issues, though sometimes painful, is highly rewarding when you resolve them and get your code working correctly.

If you get stuck, do not hesitate to reach out to me via LinkedIn or email. I am responsive and eager to help you overcome challenges.

Additionally, leveraging multiple LLMs like ChatGPT and Claude can be a powerful technique. For difficult questions, I often ask both models to compare answers. Sometimes one model's response is too long-winded or tangential, so cross-checking helps.

You can even create a manual agent workflow by asking a question to ChatGPT, then to Claude, and having one evaluate the other's response for accuracy. This evaluator-optimizer pattern can improve the quality of your results.
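
As a quick sketch of that manual cross-check (assuming you have both an OpenAI and an Anthropic key; the model names are illustrative):

```python
# Cross-check: ask GPT a question, then have Claude grade the answer.
from openai import OpenAI
import anthropic

openai_client = OpenAI()               # reads OPENAI_API_KEY
claude_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

question = "Explain the difference between concurrency and parallelism."

gpt_answer = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

review = claude_client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": f"Question: {question}\n\nAnswer: {gpt_answer}\n\n"
                   "Evaluate this answer for accuracy and concision; list any errors.",
    }],
)
print(review.content[0].text)
```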

Conclusion and Next Steps

With this introduction and overview, we are ready to begin our next lab. Let's get started with orchestrating multiple LLMs.

Key Takeaways

  • The lecture introduces orchestrating multiple large language models (LLMs) including GPT-4o, Claude, Gemini, DeepSeek, and others.
  • Emphasis on practical coding experience with both paid APIs and open-source models, locally and in the cloud.
  • Overview of different LLMs, their origins, capabilities, and cost considerations.
  • Encouragement to use resources, troubleshoot actively, and leverage multiple LLMs for better results.

Conceptual Diagram: An LLM Orchestration Workflow

This diagram illustrates how an orchestration layer acts as a central control system, managing the flow of data and tasks between various large language models and other components. It shows the journey of a user's request from start to finish.

+----------------+
|                |
|     USER       |
|                |
+-------+--------+
        |
        | User Request
        v
+-------+--------------------+
|                            |
|    ORCHESTRATION LAYER     |
|   (LangChain, LlamaIndex,  |
|    etc.)                   |
|                            |
+-------+--------------------+
        |
+------------------------------------------------------------------------------------------+
|  Routing Logic (decides which model to call based on the task)                           |
+------------------------------------------------------------------------------------------+
    |            |            |                  |                    |                 |
    v            v            v                  v                    v                 v
+--------+   +--------+   +--------+   +-------------------+   +-------------+   +--------------+
|        |   |        |   |        |   |                   |   |             |   |              |
| GPT-4o |   | Claude |   | Gemini |   | Local Models      |   | External    |   | Knowledge    |
| (API)  |   | (API)  |   | (API)  |   | (via Ollama)      |   | APIs        |   |  Base        |
|        |   |        |   |        |   |                   |   | (Weather,   |   | (Vector DB)  |
|        |   |        |   |        |   |                   |   | Stock, etc.)|   |              |
+--------+   +--------+   +--------+   +-------------------+   +-------------+   +--------------+
    ^            ^            ^                  ^                    ^                 ^
    |            |            |                  |                    |                 |
    +------------+------------+------------------+--------------------+-----------------+
                 |
                 | Processed Response
                 v
+-------+--------------------+
|                            |
|    ORCHESTRATION LAYER     |
| (post-processing, parsing) |
|                            |
+-------+--------------------+
        |
        | Final Output
        v
+-------+--------+
|                |
|     USER       |
|                |
+----------------+

Explanation:

  1. User Request: The user submits a query or task.
  2. Orchestration Layer: This is the brain of the operation. It receives the request and uses predefined logic to determine the best path forward. It can:
    • Route the request to a specific LLM based on its strengths (e.g., GPT-4o for complex reasoning, Gemini Flash for speed and low cost).
    • Call external tools (APIs) to get real-time information.
    • Access a Knowledge Base to retrieve specific data (a pattern known as Retrieval Augmented Generation or RAG).
  3. Model Calls: The orchestration layer makes calls to the chosen models, which can be cloud-based paid APIs (like GPT and Claude) or local, open-source models (like Llama via Ollama).
  4. Processed Response: Once a model or tool completes its task, it returns the output to the orchestration layer.
  5. Final Output: The orchestration layer post-processes the response—combining information from different sources, cleaning up text, or running a final "evaluator" model—before delivering the polished output to the user.
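
To tie the five steps together, here is a compact, illustrative pipeline. The retrieval step is a toy stand-in for a real vector database, and the routing rule is a placeholder assumption:

```python
# Illustrative end-to-end pipeline mirroring the diagram's five steps.
from openai import OpenAI

client = OpenAI()

def retrieve_context(query: str) -> str:
    """Stand-in for a vector-DB lookup (the RAG step)."""
    docs = {"orchestration": "Orchestration coordinates multiple LLMs and tools."}
    return " ".join(text for key, text in docs.items() if key in query.lower())

def route(query: str) -> str:
    """Cheap model by default; a stronger model for longer, complex queries."""
    return "gpt-4o" if len(query.split()) > 30 else "gpt-4o-mini"

def handle_request(query: str) -> str:
    context = retrieve_context(query)              # step 2: gather context
    model = route(query)                           # step 2: pick a model
    response = client.chat.completions.create(     # step 3: model call
        model=model,
        messages=[
            {"role": "system", "content": f"Context: {context or 'none'}"},
            {"role": "user", "content": query},
        ],
    )
    answer = response.choices[0].message.content   # step 4: processed response
    return answer.strip()                          # step 5: final output

print(handle_request("What is LLM orchestration?"))  # step 1: user request
```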

Study Materials and Key Takeaways

Here is a structured breakdown of the models, concepts, and resources discussed, perfect for a quick review before you start the labs.

The Cast of Characters: A Model Reference Guide

| Model Name | Provider | Type | Key Characteristics |
|---|---|---|---|
| GPT-4o Mini | OpenAI | Paid API | Fast, widely known, and a cost-effective version of the flagship model. Often the default choice for general tasks. |
| GPT-4o | OpenAI | Paid API | The larger, more powerful cousin of GPT-4o Mini. Known for strong general capabilities and reasoning. |
| Claude 3.7 Sonnet | Anthropic | Paid API | A strong competitor to the GPT models, with a focus on "Constitutional AI" for safety. Known for clear, structured outputs. |
| Claude 3 Haiku | Anthropic | Paid API | A cheaper, faster alternative to Sonnet. Ideal for high-volume, quick tasks. |
| Gemini 2.0 Flash | Google | Paid API (free usage limits) | A very fast, low-cost model. Excellent when speed is the top priority, such as chatbots. |
| DeepSeek V3 & R1 | DeepSeek | Paid API & open source | Models from a Chinese startup that achieved GPT-4-level performance at significantly lower cost. Open-source versions are also available. |
| Llama | Meta | Open source | A family of openly available models. A popular choice for local, private, and customizable applications. |
| Ollama | Ollama | Local platform | A tool providing a consistent, OpenAI-style API for running open-source models like Llama locally. |
| Groq | Groq | Inference platform | A company providing a lightning-fast, low-cost runtime for running large, open-source models. |

Core Concepts and Patterns

  • Orchestration: The process of coordinating multiple LLMs and other tools (like APIs and databases) to perform complex, multi-step tasks.
  • Paid vs. Open-Source: Paid APIs (like GPT and Claude) are easy to use but cost money. Open-source models (like Llama) are free to run but require more setup and local hardware.
  • Local vs. Cloud: Running models locally on your machine offers privacy and no cost, but performance depends on your computer's power. Running in the cloud provides access to powerful models and high speed but comes with costs.
  • The "Evaluator-Optimizer" Pattern: An advanced workflow where one model generates a response (optimizer) and another model critically evaluates that response (evaluator). If the response isn't good enough, the evaluator provides feedback to the optimizer for another attempt, creating a self-correcting loop. This pattern is key to improving the reliability and quality of your LLM applications.

Essential Resources

  • Vellum Leaderboard: A highly recommended website for comparing the performance, cost, speed, and context window of different LLMs. Bookmark this resource to make informed decisions about which models to use for your projects.
  • Course Guides & GitHub: The provided guides and GitHub repository contain all the code and reference materials for the labs. Use them as a primary resource for troubleshooting and staying up to date.
  • Debugging: Expect to encounter problems in your code. The lecture emphasizes that debugging is a crucial part of the learning process. When you hit a roadblock, take a step back and methodically diagnose the issue.

Study Materials and Frameworks

To learn more about LLM orchestration, you can explore the following frameworks and resources:

  • ZenML: This tool acts as an orchestrator, bridging the gap between ML and MLOps by creating reproducible workflows and managing pipelines, artifacts, and metadata. It uses the concept of "stacks" to connect to different cloud services and can be used to build modular and scalable LLM systems. The LLM Engineer's Handbook uses ZenML to orchestrate its pipelines.
  • LangChain: A popular open-source framework that provides a comprehensive toolkit for building LLM-powered applications. It is known for its wide plugin ecosystem and for making it easier to implement RAG. It allows for programmatic control over complex workflows using "chains" and "agents" (a minimal chain example follows this list).
  • LangGraph: An extension of LangChain that uses a graph-based approach to visually manage and orchestrate complex AI workflows and decision-making. It gives developers explicit control over the state and flow of their multi-agent systems.
  • LlamaIndex: A data-centric framework that focuses on ingesting, structuring, and retrieving private or domain-specific data to be used with LLMs. It's particularly useful for applications that involve querying large datasets or building knowledge bases.
  • CrewAI: A framework designed for orchestrating teams of specialized LLM agents. It's built for collaborative multi-agent systems and helps with task decomposition and delegation.
  • Mirascope: A library designed to handle the complexity of LLM orchestration while adhering to software development best practices. It offers tools for data integration and allows developers to code in Python.
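
To give a flavor of these frameworks in practice, here is a minimal LangChain chain. This is a sketch assuming the langchain-openai and langchain-core packages and an OPENAI_API_KEY in the environment; these APIs evolve quickly, so check the current documentation:

```python
# Minimal LangChain chain: prompt template -> model -> string output.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = prompt | llm | StrOutputParser()  # LCEL pipe syntax composes the steps

print(chain.invoke({"text": "LLM orchestration coordinates many models and tools."}))
```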