AI · LLM · Model Compression · MoE · Quantization · Python

Compressing GLM & MiniMax to Run on a MacBook

Client: Personal Research · January 2025

| Metric | Value |
| --- | --- |
| Original Model Size | 717 GB |
| Compressed Size | 92 GB |
| Compression Ratio | ~7.8× |
| Infrastructure | 8×H200 GPUs |
| Total Runtime | ~18 hours |
| Estimated Cost | ~$1,000 |

Executive Summary

This case study documents the compression of GLM-4.7, a 358-billion-parameter Mixture-of-Experts (MoE) language model, from an unwieldy 717GB BF16 format to a deployable 92GB INT4 representation; the same pipeline is slated for MiniMax (see Future Work). By combining REAP expert pruning (from Cerebras Research) with AutoRound quantization, we achieved a roughly 7.8× size reduction that enables inference on consumer hardware while preserving model quality.

"REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation and tool-calling tasks." — Lasby et al., REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

The compression pipeline required approximately 18 hours on an 8×H200 GPU cluster, costing roughly $1000 USD. The resulting model checkpoints are available on HuggingFace under the 0xSero namespace.

The Problem: Giant Models Can't Run On Consumer Hardware

Large language model development has accelerated dramatically, with frontier models exceeding hundreds of billions of parameters. GLM-4.7 exemplifies this trend:

| Specification | Value |
| --- | --- |
| Parameters | 358 billion |
| Experts | 160 (sparse activation) |
| Architecture | Mixture-of-Experts (MoE) |
| Full Precision (BF16) | ~717 GB |
| Inference VRAM (FP16) | ~1 TB |

This is a significant deployment challenge. Even with access to expensive GPU clusters, loading the model consumes all available VRAM, leaving no headroom for inference batching, KV cache, or overhead. The practical implications:

  • Datacenter access required: No single machine can load the model
  • High operational costs: Running on cloud GPUs at scale is expensive
  • Limited accessibility: Researchers and small teams cannot experiment
  • Environmental impact: Massive compute requirements per inference

Why MoE Models Are Ideal for Compression

Mixture-of-Experts architectures present a unique opportunity for compression. Unlike dense models, where every parameter participates in every forward pass, MoE models route each token through only a subset of experts:

┌─────────────────────────────────────────────────────────────┐
│                    GLM-4.7 Architecture                     │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │  Expert 1   │    │  Expert 2   │    │  Expert 3   │      │
│  │  (active)   │    │  (active)   │    │  (inactive) │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│         │                 │                 │               │
│         ▼                 ▼                 ○               │
│    ┌──────────────────────────────────────────────────┐     │
│    │              Router/Gate Mechanism               │     │
│    │         Selects top-k experts per token          │     │
│    └──────────────────────────────────────────────────┘     │
│                          │                                  │
│                          ▼                                  │
│                  ┌─────────────┐                            │
│                  │   Output    │                            │
│                  └─────────────┘                            │
└─────────────────────────────────────────────────────────────┘
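
To make the routing step concrete, here is a minimal toy sketch of top-k expert routing in PyTorch. It is illustrative only (made-up dimensions, not GLM-4.7's actual implementation): the gate scores every expert per token, keeps the top-k, and mixes their outputs by the normalized gate weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy top-k MoE layer for illustration; not GLM-4.7's real code."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: [tokens, d_model]
        scores = self.gate(x)                         # router score per expert
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out                                    # only top-k experts ran per token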

Key observations about MoE redundancy:

  1. Not all experts are equal: Router biases means some experts are selected far more frequently
  2. Expert specialization: Certain experts specialize in overlapping capabilities
  3. Submodular contribution: The impact of removing an expert's contribution diminishes predictably with depth
  4. Structural sparsity: The model is already designed for sparse activation

REAP (Router-weighted Expert Activation Pruning) exploits these properties by measuring expert importance on a task-specific calibration dataset, then removing the least critical experts entirely.

The Solution: Two-Stage Compression Pipeline

Stage 1: REAP Expert Pruning

REAP operates in three phases:

  1. Calibration Forward Pass: Run representative samples through the model, collecting activation statistics for each expert
  2. Importance Scoring: Compute saliency scores for each expert based on activation norms and gate values
  3. Expert Removal: Remove the least important experts

The critical insight is that expert importance is task-dependent. For our compression target (code generation + function calling + agentic workflows), we curated a calibration dataset matching these use cases.
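
As a simplified sketch of phases 1-2 (not the Cerebras implementation), expert saliency can be pictured as a router-weighted activation norm accumulated over calibration tokens, where gate_probs and expert_outputs would be collected via hooks during the calibration forward pass:

import torch

def expert_saliency(gate_probs, expert_outputs):
    """
    gate_probs:     [tokens, n_experts]           router probability per token/expert
    expert_outputs: [tokens, n_experts, d_model]  each expert's output per token
    Returns one score per expert: mean router weight times the norm of the
    expert's contribution (a REAP-style heuristic, simplified here).
    """
    contribution = expert_outputs.norm(dim=-1)       # [tokens, n_experts]
    return (gate_probs * contribution).mean(dim=0)   # [n_experts]

def experts_to_prune(saliency, prune_ratio=0.5):
    """Indices of the lowest-scoring experts to remove (phase 3)."""
    n_drop = int(len(saliency) * prune_ratio)
    return torch.argsort(saliency)[:n_drop].tolist()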

Calibration Dataset Composition

| Dataset | Samples | Purpose |
| --- | --- | --- |
| theblackcat102/evol-codealpaca-v1 | 700 | Code generation |
| Salesforce/xlam-function-calling-60k | 330 | Function calling / tool use |
| SWE-bench/SWE-smith-trajectories | 330 | Agentic multi-turn workflows |
| Total | 1,360 | Domain-appropriate calibration |

This mirrors the approach in the original Cerebras REAP paper, which found that calibration domain significantly impacts downstream quality. Our dataset balances the three primary use cases for compressed GLM-4.7.
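
For reference, a mixture like the table above can be assembled with the datasets library roughly as follows. The split names and the crude flattening to a single text column are assumptions for illustration; a real pipeline would apply proper prompt formatting per source:

from datasets import load_dataset, concatenate_datasets

SOURCES = [
    ("theblackcat102/evol-codealpaca-v1", 700),      # code generation
    ("Salesforce/xlam-function-calling-60k", 330),   # function calling / tool use
    ("SWE-bench/SWE-smith-trajectories", 330),       # agentic multi-turn workflows
]

parts = []
for name, n_samples in SOURCES:
    ds = load_dataset(name, split="train").shuffle(seed=42).select(range(n_samples))
    # Flatten every example into one "text" field so the differing schemas can
    # be concatenated; real calibration would format each source properly.
    ds = ds.map(lambda ex: {"text": str(ex)}, remove_columns=ds.column_names)
    parts.append(ds)

calibration = concatenate_datasets(parts)            # 1,360 samples total
# calibration.push_to_hub("0xSero/glm47-reap-calibration-v2")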

Command and Configuration

python /mnt/work/reap/src/reap/prune.py \
  --model-name /mnt/work/model/GLM-4.7 \
  --dataset-name 0xSero/glm47-reap-calibration-v2 \
  --compression-ratio 0.50 \
  --seed 42 \
  --distance_measure angular

Key parameters:

  • compression-ratio: Fraction of experts to prune (0.50 removes half of the experts)
  • seed: Reproducible randomness for observation caching
  • distance_measure: Angular distance between expert output vector subspaces, used to measure redundancy between experts

Runtime Observations

| Phase | Duration |
| --- | --- |
| Observation collection (1,360 samples) | ~12.5 hours |
| Pruning and model repacking | ~1.5 hours |
| Total REAP 50% prune | ~14 hours |

Once observations are cached at /root/artifacts/GLM-4.7/glm47-reap-calibration-v2/all/observations_1360_angular-seed_42.pt, additional prune ratios (35%, 40%, 45%, 50%) can be computed in minutes rather than hours.
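
In practice this means the expensive observation pass is paid once, and a sweep over ratios can be scripted as repeated invocations of the same command (relying on the cached observations file to make each rerun fast), for example:

import subprocess

# Re-run the prune step at several ratios; with observations already cached,
# each invocation skips the ~12.5-hour collection phase.
for ratio in (0.35, 0.40, 0.45, 0.50):
    subprocess.run([
        "python", "/mnt/work/reap/src/reap/prune.py",
        "--model-name", "/mnt/work/model/GLM-4.7",
        "--dataset-name", "0xSero/glm47-reap-calibration-v2",
        "--compression-ratio", str(ratio),
        "--seed", "42",
        "--distance_measure", "angular",
    ], check=True)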

Stage 2: AutoRound INT4 Quantization

Pruning alone reduces model size but not sufficiently for practical deployment. We applied Intel's AutoRound, a GPTQ-compatible quantization method that:

  • Compresses weights to 4-bit integers
  • Uses group-wise quantization (128 parameters per group)
  • Applies asymmetric quantization with GPTQ-style optimization
  • Preserves activation precision (W4A16 format)
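
To make the W4A16 scheme concrete, the toy snippet below quantizes a weight row into 4-bit groups of 128 with a per-group scale and zero point, then dequantizes it. This is a generic illustration of group-wise asymmetric quantization, independent of AutoRound's actual optimization:

import torch

def quantize_group_int4(w, group_size=128):
    """Asymmetric 4-bit quantization of a 1-D weight slice, one group at a time."""
    w = w.reshape(-1, group_size)                          # [n_groups, group_size]
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0         # 4 bits -> 16 levels (0..15)
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero), 0, 15)  # stored as INT4
    return q, scale, zero

def dequantize_group_int4(q, scale, zero):
    return (q - zero) * scale                              # back to float at load time

w = torch.randn(4096)                                      # a toy weight row
q, scale, zero = quantize_group_int4(w)
w_hat = dequantize_group_int4(q, scale, zero).reshape(-1)
print(f"max abs error: {(w - w_hat).abs().max():.4f}")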

AutoRound Configuration

auto-round \
  --model /path/to/pruned-model \
  --bits 4 \
  --group_size 128 \
  --format auto_gptq \
  --output_dir /mnt/work/outputs/GLM-4.7-REAP-50-W4A16

The quantization used the same calibration dataset as REAP, ensuring consistent evaluation across both compression stages.
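
The same step can also be driven from Python. The sketch below follows AutoRound's documented Python usage at the time of writing; class and argument names should be checked against the installed auto-round version:

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_path = "/path/to/pruned-model"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit weights, 128-weight groups, activations left at 16-bit (W4A16)
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("/mnt/work/outputs/GLM-4.7-REAP-50-W4A16", format="auto_gptq")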

Quantization Runtime

| Phase | Duration |
| --- | --- |
| Per-layer quantization (~90-92 s/layer) | ~3-4 hours |
| Model compilation and export | ~10 minutes |
| Total AutoRound pass | ~4h 10m |

Combined Results

| Model | Size | VRAM Required | Compression |
| --- | --- | --- | --- |
| GLM-4.7 Original (BF16) | 717 GB | ~1 TB | 1× (baseline) |
| GLM-4.7-REAP-50 (BF16) | ~359 GB | ~500 GB | ~2× |
| GLM-4.7-REAP-50-W4A16 | 92 GB | ~120 GB | 7.8× |

The 92GB final model can run inference on a high-end MacBook or 8× RTX 3090s, making it accessible to independent researchers and small-team deployments on hardware costing roughly $3,000 rather than the $300,000 required for an 8×H200 cluster.

Infrastructure and Economics

Compute Environment

All experiments ran on Prime Intellect cloud infrastructure:

| Resource | Specification |
| --- | --- |
| GPUs | 8× NVIDIA H200 (143 GB each, ~1.15 TB total VRAM) |
| Pricing | ~$13/hour (spot pricing) |

Available Model Checkpoints

All compressed model variants are available on HuggingFace:

| Repository | Description |
| --- | --- |
| 0xSero/GLM-4.7-REAP-40 | 40% of experts pruned (BF16) |
| 0xSero/GLM-4.7-REAP-40-W4A16 | 40% pruned + 4-bit quantization (~108GB) |
| 0xSero/GLM-4.7-REAP-50 | 50% of experts pruned (BF16) |
| 0xSero/GLM-4.7-REAP-50-W4A16 | 50% pruned + 4-bit quantization (~92GB) |
| 0xSero/glm47-reap-calibration-v2 | Calibration dataset |
| 0xSero/glm47-reap-observations | Cached observations |

For most use cases: GLM-4.7-REAP-40-W4A16 offers the optimal balance of quality and size. With 40% of experts pruned, quality degradation is minimal across most benchmarks while delivering a compact 108GB model.

For extreme VRAM constraints: GLM-4.7-REAP-50-W4A16 at 92GB can run on single high-VRAM GPUs, but expect measurable quality degradation on reasoning-heavy tasks.

Quality Recovery: Optional Distillation

If quantization introduces unacceptable quality degradation, self-distillation with LoRA can recover performance:

  1. Generate synthetic data using the BF16 pruned model (teacher) via Magpie-style prompting
  2. Train LoRA adapter on the quantized model (student) with KL divergence loss against teacher
  3. Apply adapter during inference to improve output quality

This is the approach used by Apple and Ellora for post-training recovery. The process works by having the larger teacher model generate training examples, which the smaller student model learns from. The adapter captures nuanced outputs that quantization may have flattened.

Most gains come from the first few thousand samples. The distillation curve exhibits rapidly diminishing returns: 5,000 to 10,000 high-quality samples typically recover most of the lost performance, and additional samples beyond that yield minimal improvement.

The tradeoff is additional training time (typically 2-4 hours on similar hardware) and increased model size from the LoRA adapter (typically 100-500MB depending on rank). However, this is often acceptable given the quality recovery, especially for production deployments.
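
A condensed sketch of such a recovery loop is shown below, using peft for the LoRA adapter and a temperature-scaled KL divergence against the teacher's logits. Paths, target modules, rank, and temperature are illustrative assumptions, not the exact recipe referenced above:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

teacher = AutoModelForCausalLM.from_pretrained("path/to/GLM-4.7-REAP-50", torch_dtype=torch.bfloat16)
student = AutoModelForCausalLM.from_pretrained("path/to/GLM-4.7-REAP-50-W4A16")
# Target module names are architecture-dependent; q_proj/v_proj is an assumption.
student = get_peft_model(student, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))
tokenizer = AutoTokenizer.from_pretrained("path/to/GLM-4.7-REAP-50")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
T = 2.0  # distillation temperature

def distill_step(batch_texts):
    inputs = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        teacher_logits = teacher(**inputs).logits      # BF16 pruned teacher
    student_logits = student(**inputs).logits          # quantized student + LoRA adapter
    # KL divergence between temperature-softened teacher and student distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()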

Challenges

1. Cost

Running the full compression pipeline cost approximately $1,000 USD in compute costs. This includes:

  • REAP pruning: ~$600 (14 hours × 8×H200 at $13/hour spot pricing)
  • AutoRound quantization: ~$200 (4 hours × 8×H200)
  • Experimentation and failed runs: ~$200 (additional prune ratios, calibration testing)

This cost is primarily driven by:

  • Observation caching overhead: Each new calibration dataset requires 12+ hours of forward passes
  • Multi-GPU requirements: MoE models demand distributed inference, limiting cheaper options
  • Iterative experimentation: Finding optimal prune ratios and calibration datasets requires repeated runs

At on-demand pricing (~$40-50/hour), the same work would cost $3,000 to $4,000, making spot pricing essential for research budgets.

2. Consequences of Failure

Model compression has meaningful failure modes that waste both time and money:

Irreversible model damage: Once experts are pruned, there is no way to recover them. If quality degradation is unacceptable, the entire 14-hour pruning process must restart from scratch. This creates a high-stakes decision point around choosing compression ratios.

Calibration mismatch: Using inappropriate calibration data can produce models that perform well on benchmarks but fail catastrophically in production. For example, a model calibrated on general text may fail at code generation tasks, even if overall metrics look acceptable.

Quantization artifacts: Aggressive W4A16 quantization can introduce subtle errors that compound over long contexts. The model may perform well on short benchmarks but degrade on multi-turn conversations or long document analysis.

Validation challenges: Full model evaluation requires significant compute (the uncompressed model barely fits on training infrastructure). This means compressed models may deploy with undetected issues that only surface in production.

Recovery cost: If a compressed model fails in production, re-running the full pipeline costs another $1,000 and 18 hours. This creates pressure to deploy marginal models rather than re-compress.

3. Dataset Curation

The quality of compression is directly tied to the calibration dataset. Several challenges emerged:

Domain representation: The calibration data must represent the target use case. Our initial attempts with generic datasets (C4, SlimPajama) produced models that underperformed on code and function calling. We had to curate task-specific datasets:

  • Code generation: evol-codealpaca-v1 for programming tasks
  • Function calling: xlam-function-calling-60k for tool use
  • Agentic workflows: SWE-smith-trajectories for multi-turn reasoning

Sample quality vs. quantity: More samples are not always better. We found 1,360 high-quality, domain-relevant samples outperformed 10,000 generic samples. The observation file scales linearly with sample count (~50GB for 1,360 samples), so each additional sample increases both computation time and storage requirements.

Data overlap: The calibration dataset should not overlap with evaluation benchmarks. This creates tension between having representative calibration data and avoiding contamination. We manually filtered samples to ensure no benchmark leakage.
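
A simple way to approximate that filtering is n-gram overlap screening against benchmark prompts; the sketch below is a generic decontamination pass, not the exact filter used here:

def ngrams(text, n=8):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def filter_contaminated(calibration_texts, benchmark_texts, n=8):
    """Drop calibration samples that share any n-gram with a benchmark example."""
    bench_grams = set()
    for t in benchmark_texts:
        bench_grams |= ngrams(t, n)
    return [t for t in calibration_texts if not (ngrams(t, n) & bench_grams)]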

Expert bias: Different calibration datasets cause REAP to prune different experts. A code-focused calibration may preserve different experts than a general-purpose calibration. This means production deployment requires matching calibration data to production traffic patterns.

Discussion: What Could Go Wrong

Users of compressed MoE models should understand the limitations and potential failure modes:

Quality Tradeoffs

Expert loss means capability loss: When REAP removes 50% of experts, those experts are gone permanently. The model cannot perform tasks that relied primarily on the pruned experts. For example, if experts specializing in mathematical reasoning are pruned, the model will perform worse on math tasks even if overall benchmark scores look acceptable.

Context window degradation: MoE models use different experts for different types of content. Compression may reduce the model's ability to handle diverse content within a single long context. A document with both code and natural language may see quality degradation on one or both domains.

Nuance and subtlety loss: Quantization to W4A16 reduces precision from 16-bit to 4-bit. This flattens the weight distribution, reducing the model's ability to make fine distinctions. Tasks requiring subtle reasoning (irony detection, emotional nuance, multi-step logical inference) may degrade more than benchmarks suggest.

Performance Variability

Task-dependent quality: The REAP paper demonstrates that compression impact varies significantly by task. Code generation and function calling saw near-lossless compression at 50% pruning, but other tasks (reasoning, knowledge retrieval) may degrade more severely. Users should evaluate compressed models on their specific use case, not generic benchmarks.

Calibration drift: Over time, production traffic may drift from the calibration dataset. As the model encounters out-of-distribution inputs, quality may degrade. This is particularly problematic for specialized domains (legal, medical, scientific) where high-quality calibration data is scarce.

Batch size sensitivity: Compressed models may be more sensitive to batching parameters. The optimal batch size for the uncompressed model may not be optimal for the compressed version, requiring additional tuning.

Operational Risks

Single point of failure: The 92GB model is more deployable but also a single point of failure. If the compressed model has a critical flaw, there is no straightforward fallback to the uncompressed model due to infrastructure constraints.

Reproduction difficulty: Re-running the compression pipeline requires significant compute and the original calibration dataset. If the original observations file is lost, reproducing the exact compressed model is impossible without another 14-hour calibration run.

Upgrade complexity: Upgrading to a new base model version requires re-running the full compression pipeline. This creates friction for staying current with model improvements.

Quantization sensitivity: W4A16 quantization may behave differently across different hardware. The same quantized model may produce slightly different outputs on NVIDIA vs AMD vs Apple Silicon, complicating reproducibility.

When to Avoid Compression

Given these risks, compression may not be appropriate for:

  • Safety-critical applications: Medical diagnosis, legal advice, financial planning where errors have real consequences
  • High-stakes decisions: Model outputs directly cause significant actions without human oversight
  • Regulated industries: Compliance requirements may prohibit modifications to certified models
  • Unknown use cases: When deployment patterns don't match available calibration data

For these scenarios, the cost of quality degradation outweighs infrastructure savings.

Expected Losses

For code generation and function calling at 40% expert pruning + W4A16 quantization:

  • Benchmark performance: 0-5% degradation on HumanEval, MBPP
  • Nuance: Slight reduction in code style consistency
  • Edge cases: 5-10% increase in failure rate on complex multi-step problems

At 50% pruning:

  • Benchmark performance: 5-15% degradation
  • Nuance: Noticeable reduction in subtle reasoning
  • Edge cases: 10-20% increase in failure rate

For general language tasks at 50% pruning:

  • Benchmark performance: 10-20% degradation on MMLU, GSM8K
  • Nuance: Noticeable reduction in output quality
  • Edge cases: Significant increase in hallucinations and errors

These are estimates based on the REAP paper and our evaluation. Actual performance varies by specific task and calibration quality.

Technical Challenges and Solutions

Challenge 1: Calibration Domain Mismatch

Problem: Generic calibration datasets may not capture the expert specializations relevant to our target use cases.

Solution: We selected a domain-specific dataset combining code generation (evol-codealpaca), function calling (xlam), and agentic trajectories (SWE-smith). This mirrors the intended deployment scenarios and ensures REAP scores experts based on relevant performance.

Challenge 2: Multi-GPU Inference Compatibility

Problem: Compressed model checkpoints spanned 8 GPUs, but evaluation frameworks (evalplus) couldn't handle distributed inference.

Solution: This remains a limitation for automated benchmarking. Manual evaluation or single-GPU deployment required for quantitative assessment. Future work involves distributed inference infrastructure for compressed MoE models.

Challenge 3: Spot Instance Reliability

Problem: Long-running jobs (18 hours) face inevitable preemption risk on spot instances.

Solution: Designed pipeline with checkpoint resumability. Observations file (~50GB) cached to persistent storage. Async upload to HuggingFace triggered immediately upon checkpoint generation to minimize lost progress.
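
A minimal version of that upload hook, using huggingface_hub (repo ID and paths illustrative; the real pipeline's retry and resumption logic is omitted):

import threading
from huggingface_hub import HfApi

api = HfApi()

def upload_checkpoint_async(local_dir, repo_id):
    """Kick off the upload in a background thread so the pipeline keeps running;
    the non-daemon thread ensures the process waits for the transfer on exit."""
    worker = threading.Thread(
        target=lambda: api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="model"),
        daemon=False,
    )
    worker.start()
    return worker

# e.g. immediately after a checkpoint is written:
# upload_checkpoint_async("/mnt/work/outputs/GLM-4.7-REAP-50-W4A16", "0xSero/GLM-4.7-REAP-50-W4A16")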

Lessons Learned

  1. Cache observations strategically: The ~50GB observation file is reusable across multiple prune-ratio experiments. Build it once, then test many configurations rapidly.

  2. 40% is the sweet spot: For MoE compression, removing 40-45% of experts provides optimal quality/size tradeoff. 50% is aggressive but useful for extreme constraint scenarios.

  3. Spot pricing changes economics: At ~$13/hr versus on-demand pricing (~$40-50/hr), the full pipeline cost drops from ~$3,000 to ~$1,000. Essential for research budget management.

  4. Quantization after pruning compounds gains: W4A16 quantization after pruning achieves better compression than either alone. The weight distribution after pruning appears more amenable to quantization.

  5. Upload before termination: Always trigger model upload before the final checkpoint completes. Lost upload progress is expensive, wasted compute time.

  6. Calibration data quality matters more than quantity: 1,360 domain-relevant samples outperformed 10,000 generic samples. Invest in dataset curation.

Future Work

  1. Evaluate compressed variants: Systematic benchmark comparison across MMLU, HumanEval, and agentic benchmark suites

  2. Distillation recovery: Implement LoRA-based self-distillation pipeline for quality recovery on aggressive compression

  3. Additional prune ratios: Complete the 35% and 45% variants for a finer-grained quality/size tradeoff curve

  4. GGUF conversion: Export quantized model to GGUF format for local inference via llama.cpp on consumer hardware

  5. Distributed inference: Infrastructure for compressed MoE distributed inference across multiple consumer GPUs

  6. MiniMax compression: Apply the same pipeline to MiniMax models for comparative analysis

References

  • Lasby et al., "REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression," Cerebras Research.
  • Intel AutoRound: https://github.com/intel/auto-round

This compression work was conducted on Prime Intellect infrastructure. Model checkpoints available under permissive licensing on HuggingFace.
