Quantitative Analysis of AI Model Architectures (2023–2025)

Introduction

Since 2023, leading AI companies have developed increasingly powerful large language models (LLMs) with varying architectural approaches. A key distinction is monolithic (dense) models versus mixture-of-experts (MoE) models. Monolithic models are traditional transformers where all parameters are active for every input, while MoE models contain multiple sub-model “experts” and activate only a subset of parameters per token (Is GPT-4 a Mixture of Experts Model? Exploring MoE Architectures for Language Models | Now Next Later AI). This report examines the model architectures used by OpenAI, Google, Anthropic, xAI, Meta, and DeepSeek, comparing their production and research models, parameter scales, compute efficiency, and trends in adopting (or abandoning) MoE architectures. Quantitative details and comparisons are drawn from corporate announcements, technical papers, and benchmark reports.

Monolithic vs. Mixture-of-Experts Architectures

Monolithic architectures (dense transformers) have been the default for most production LLMs. Every layer processes the input with the full set of model weights, which simplifies deployment but requires enormous computation as model size grows. Mixture-of-experts architectures, by contrast, route each input token through only a few specialized “expert” subnetworks (Is GPT-4 a Mixture of Experts Model? Exploring MoE Architectures for Language Models | Now Next Later AI). This sparse activation lets MoE models scale to trillions of parameters without a proportional increase in per-token computation. For example, Google’s GLaM (Generalist Language Model) has 1.2 trillion parameters but activates only ~8% of them per input, matching or exceeding the performance of the 175B-parameter dense GPT-3 at roughly half the inference FLOPs ([2112.06905] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts). Table 1 summarizes key models and their architectures.
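The routing mechanism described above can be sketched in a few lines. This is a minimal, generic top-k gating illustration (not any company’s actual router): a learned gate scores all experts, only the k highest-scoring experts are evaluated, and their outputs are combined with softmax weights.

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=2):
    """Route one token through the top-k of n expert subnetworks.

    x       : (d,) token representation
    gate_W  : (d, n_experts) router weights
    experts : list of callables, each mapping (d,) -> (d,)
    """
    logits = x @ gate_W                       # one score per expert
    top_k = np.argsort(logits)[-k:]           # indices of the k best experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                  # softmax over selected experts only
    # Only k of n_experts run per token: this is the sparse activation
    # that decouples total parameter count from per-token compute.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, n = 8, 16                                  # 16 experts, 2 active, echoing GPT-4 leaks
gate_W = rng.normal(size=(d, n))
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n)]
out = moe_forward(rng.normal(size=d), gate_W, experts, k=2)
# out.shape == (8,); 14 of the 16 experts were never evaluated
```

Per-token cost scales with k, not n, which is why total parameters can grow far faster than inference FLOPs.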

Table 1. Major LLMs (2023–2025) by architecture, size, and context length.

| Company / Model | Release | Architecture | Parameters (B) | Context (tokens) | Source |
| --- | --- | --- | --- | --- | --- |
| OpenAI – GPT-4 | 2023 | Mixture-of-Experts Transformer, 16 experts (2 active per token) | ~1,800 total (≈2×111 active) | 8,000 (text); vision-enabled | GPT-4 architecture, datasets, costs and more leaked |
| Google – PaLM 2 | 2023 | Dense Transformer (monolithic) | 340 | — | What Is Llama 2? (IBM) |
| Google – GLaM | 2022 (R&D) | Mixture-of-Experts, 64 experts (~8% active) | 1,200 total (96 active) | — | [2112.06905] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts |
| Anthropic – Claude 2 | 2023 | Dense Transformer (monolithic) | ~175 (est.) | — | What Is Llama 2? (IBM) |
| xAI – Grok-1 | 2024 | Mixture-of-Experts Transformer (~25% of weights active) | 314 total (≈78 active) | 8,000 | Open Release of Grok-1 |
| Meta – LLaMA 2 | 2023 | Dense Transformer (monolithic) | 70 | — | What Is Llama 2? (IBM) |
| Meta – LLaMA 3.1 | 2024 | Dense Transformer (monolithic) | 405 | — | Meta unleashes its most powerful AI model, Llama 3.1 (VentureBeat) |
| DeepSeek – R1 | 2025 | Mixture-of-Experts, 256 experts/layer (8 active) | 671 | — | DeepSeek-R1 Now Live With NVIDIA NIM (NVIDIA Blog) |

(“Active” parameters = subset used per token; context lengths for text models unless noted.)
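As a quick sanity check on Table 1, the active-parameter fractions implied by the cited figures can be computed directly. This is a throwaway calculation using only the numbers quoted above; the GPT-4 figures are leaked estimates, not official.

```python
# Active-parameter fraction per token for the MoE models in Table 1.
# Figures (in billions) are those cited in the table; GPT-4's are leaked estimates.
models = {
    "GPT-4":  {"total_b": 1800, "active_b": 2 * 111},  # 2 of 16 experts active
    "GLaM":   {"total_b": 1200, "active_b": 96},       # ~8% active
    "Grok-1": {"total_b": 314,  "active_b": 78},       # ~25% of weights active
}

for name, p in models.items():
    frac = p["active_b"] / p["total_b"]
    print(f"{name}: {p['active_b']}B of {p['total_b']}B active = {frac:.0%}")
```

The printed fractions (≈12%, 8%, and 25% respectively) agree with the sparsity levels cited in the table, confirming the quoted active-parameter counts are internally consistent.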

As shown above, OpenAI, xAI, and DeepSeek have deployed MoE models ranging from roughly 300 billion to ~1.8 trillion total parameters, whereas Google, Anthropic, and Meta have primarily used large dense models (tens to hundreds of billions of parameters). We next analyze each company’s approach in detail, including model specifications, efficiency, and scalability.

OpenAI: From GPT-3 to GPT-4 (Monolith to MoE)

OpenAI’s flagship production model since 2023 is GPT-4, a major scale-up from the 175B-parameter GPT-3. OpenAI did not publish GPT-4’s architecture in detail, but leaked reports indicate GPT-4 uses a mixture-of-experts design to reach an unprecedented scale: roughly 1.8 trillion total parameters split across 16 experts of ~111B parameters each, with two experts active per token (GPT-4 architecture, datasets, costs and more leaked).

OpenAI has not publicly announced other new model architectures since GPT-4. The company’s strategy appears to favor a single, very large general model (GPT-4 and its successors) rather than many specialized models. There are rumors of a “GPT-5” in development, but no official details. In summary, OpenAI’s production architecture moved to MoE in 2023 to achieve greater scale, at the cost of added engineering complexity and inference expense.

(OpenAI’s earlier models like GPT-3 and GPT-3.5 (ChatGPT’s base) were dense transformers, but those precede 2023. Since the focus is 2023 onward, GPT-4 is the primary model of interest.)

Google: Large Dense Models with MoE in Research

Google’s AI efforts (including Google Brain/DeepMind) in 2023 centered on the PaLM family for production while exploring MoE in research. Its flagship LLM, PaLM 2, which served products such as Bard, uses a conventional dense architecture.

Looking forward, Google’s Pathways system (announced 2021) envisioned using MoE to handle multi-modal and multi-task learning in a unified model. In late 2023, Google/DeepMind also teased a next-generation model called Gemini, expected to be multi-modal and highly capable; while details are sparse, some speculate it might integrate techniques from both dense and MoE models. As of 2024, though, Google’s trend has been to rely on large monolithic LLMs for deployed services, even as its researchers continue to publish advances in MoE training algorithms (e.g. improved routing, load balancing (Mixture-of-Experts with Expert Choice Routing - Google Research)).

Efficiency and Scalability: Google’s approach illustrates a cautious balance. Dense models like PaLM 2 offer reliability and easier deployment, whereas its MoE experiments (Switch Transformer, GLaM) showed impressive training-efficiency gains; GLaM, for example, used about one-third of GPT-3’s training energy while delivering better results ([2112.06905] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts). The absence of a production MoE model in 2023 suggests that factors like inference efficiency and infrastructure maturity shaped Google’s decisions. Still, the research confirms that MoE is a viable path to scale if those challenges are overcome.