Quantitative Analysis of AI Model Architectures (2023–2025)

Introduction

Since 2023, leading AI companies have developed increasingly powerful large language models (LLMs) with varying architectural approaches. A key distinction is monolithic (dense) models versus mixture-of-experts (MoE) models. Monolithic models are traditional transformers where all parameters are active for every input, while MoE models contain multiple sub-model “experts” and activate only a subset of parameters per token (Is GPT-4 a Mixture of Experts Model? Exploring MoE Architectures for Language Models | Now Next Later AI). This report examines the model architectures used by OpenAI, Google, Anthropic, xAI, Meta, and DeepSeek, comparing their production and research models, parameter scales, compute efficiency, and trends in adopting (or abandoning) MoE architectures. Quantitative details and comparisons are drawn from corporate announcements, technical papers, and benchmark reports.

Monolithic vs. Mixture-of-Experts Architectures

Monolithic architectures (dense transformers) have been the default for most production LLMs. Every layer processes the input with the full set of model weights, which simplifies deployment but requires enormous computation as model size grows. Mixture-of-experts architectures, by contrast, route each input token through only a few specialized “expert” subnetworks (Is GPT-4 a Mixture of Experts Model? Exploring MoE Architectures for Language Models | Now Next Later AI). This sparse activation lets MoE models scale to trillions of parameters without a proportional increase in per-token computation. For example, Google’s GLaM (Generalist Language Model) has 1.2 trillion parameters but activates only ~8% of them per input, matching or exceeding the performance of the 175B-parameter dense GPT-3 at roughly half the inference FLOPs ([2112.06905] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts). Table 1 summarizes key models and their architectures.
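The routing mechanism described above can be sketched in a few lines. This is a minimal, generic top-k gating illustration (not any company’s actual router): a learned gate scores all experts, only the k highest-scoring experts are evaluated, and their outputs are combined with softmax weights.

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=2):
    """Route one token through the top-k of n expert subnetworks.

    x       : (d,) token representation
    gate_W  : (d, n_experts) router weights
    experts : list of callables, each mapping (d,) -> (d,)
    """
    logits = x @ gate_W                       # one score per expert
    top_k = np.argsort(logits)[-k:]           # indices of the k best experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                  # softmax over selected experts only
    # Only k of n_experts run per token: this is the sparse activation
    # that decouples total parameter count from per-token compute.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, n = 8, 16                                  # 16 experts, 2 active, echoing GPT-4 leaks
gate_W = rng.normal(size=(d, n))
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n)]
out = moe_forward(rng.normal(size=d), gate_W, experts, k=2)
# out.shape == (8,); 14 of the 16 experts were never evaluated
```

Per-token cost scales with k, not n, which is why total parameters can grow far faster than inference FLOPs.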

Table 1. Major LLMs (2023–2025) by architecture, size, and context length.

| Company / Model | Release | Architecture | Parameters (B) | Context (tokens) | Source |
| --- | --- | --- | --- | --- | --- |
| OpenAI – GPT-4 | 2023 | Mixture-of-Experts Transformer, 16 experts (2 active per token) | ~1,800 total (≈2×111 active) | 8,000 (text); vision-enabled | GPT-4 architecture, datasets, costs and more leaked |
| Google – PaLM 2 | 2023 | Dense Transformer (monolithic) | 340 | — | What Is Llama 2? (IBM) |
| Google – GLaM | 2022 (R&D) | Mixture-of-Experts, 64 experts (~8% active) | 1,200 total (96 active) | — | [2112.06905] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts |
| Anthropic – Claude 2 | 2023 | Dense Transformer (monolithic) | ~175 (est.) | — | What Is Llama 2? (IBM) |
| xAI – Grok-1 | 2024 | Mixture-of-Experts Transformer (~25% of weights active) | 314 total (≈78 active) | 8,000 | Open Release of Grok-1 |
| Meta – LLaMA 2 | 2023 | Dense Transformer (monolithic) | 70 | — | What Is Llama 2? (IBM) |
| Meta – LLaMA 3.1 | 2024 | Dense Transformer (monolithic) | 405 | — | Meta unleashes its most powerful AI model, Llama 3.1 (VentureBeat) |
| DeepSeek – R1 | 2025 | Mixture-of-Experts, 256 experts/layer (8 active) | 671 | — | DeepSeek-R1 Now Live With NVIDIA NIM (NVIDIA Blog) |

(“Active” parameters = subset used per token; context lengths for text models unless noted.)
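As a quick sanity check on Table 1, the active-parameter fractions implied by the cited figures can be computed directly. This is a throwaway calculation using only the numbers quoted above; the GPT-4 figures are leaked estimates, not official.

```python
# Active-parameter fraction per token for the MoE models in Table 1.
# Figures (in billions) are those cited in the table; GPT-4's are leaked estimates.
models = {
    "GPT-4":  {"total_b": 1800, "active_b": 2 * 111},  # 2 of 16 experts active
    "GLaM":   {"total_b": 1200, "active_b": 96},       # ~8% active
    "Grok-1": {"total_b": 314,  "active_b": 78},       # ~25% of weights active
}

for name, p in models.items():
    frac = p["active_b"] / p["total_b"]
    print(f"{name}: {p['active_b']}B of {p['total_b']}B active = {frac:.0%}")
```

The printed fractions (≈12%, 8%, and 25% respectively) agree with the sparsity levels cited in the table, confirming the quoted active-parameter counts are internally consistent.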

As shown above, OpenAI, xAI, and DeepSeek have deployed MoE models ranging from roughly 300 billion to ~1.8 trillion total parameters, whereas Google, Anthropic, and Meta have primarily used large dense models (tens to hundreds of billions of parameters). We next analyze each company’s approach in detail, including model specifications, efficiency, and scalability.

OpenAI: From GPT-3 to GPT-4 (Monolith to MoE)

OpenAI’s flagship production model since 2023 is GPT-4, a major scale-up from the 175B-parameter GPT-3. OpenAI did not publish GPT-4’s architecture in detail, but leaked reports indicate GPT-4 uses a mixture-of-experts design to reach an unprecedented scale: roughly 1.8 trillion total parameters split across 16 experts of ~111B parameters each, with two experts active per token (GPT-4 architecture, datasets, costs and more leaked).

OpenAI has not publicly announced other new model architectures since GPT-4. The company’s strategy appears to favor a single, very large general model (GPT-4 and its successors) rather than many specialized models. There are rumors of a “GPT-5” in development, but no official details. In summary, OpenAI’s production architecture moved to MoE in 2023 to achieve greater scale, at the cost of added engineering complexity and inference expense.

(OpenAI’s earlier models like GPT-3 and GPT-3.5 (ChatGPT’s base) were dense transformers, but those precede 2023. Since the focus is 2023 onward, GPT-4 is the primary model of interest.)

Google: Large Dense Models with MoE in Research

Google’s AI efforts (including Google Brain/DeepMind) in 2023 centered on the PaLM family for production while exploring MoE in research. Its flagship LLM, PaLM 2, which served products such as Bard, uses a conventional dense architecture.

Looking forward, Google’s Pathways system (announced 2021) envisioned using MoE to handle multi-modal and multi-task learning in a unified model. In late 2023, Google/DeepMind also teased a next-generation model called Gemini, expected to be multi-modal and highly capable; while details are sparse, some speculate it might integrate techniques from both dense and MoE models. As of 2024, though, Google’s trend has been to rely on large monolithic LLMs for deployed services, even as its researchers continue to publish advances in MoE training algorithms (e.g. improved routing, load balancing (Mixture-of-Experts with Expert Choice Routing - Google Research)).

Efficiency and Scalability: Google’s approach illustrates a cautious balance. Dense models like PaLM 2 offer reliability and easier deployment, whereas its MoE experiments (Switch Transformer, GLaM) showed impressive training-efficiency gains; GLaM, for example, used about one-third of GPT-3’s training energy while delivering better results ([2112.06905] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts). The absence of a production MoE model in 2023 suggests that factors like inference efficiency and infrastructure maturity shaped Google’s decisions. Still, the research confirms that MoE is a viable path to scale if those challenges are overcome.