Delivering Large LLMs via Google Play AI Pack: Limits and Workarounds

Overview

This article is a report evaluating the feasibility of implementing local LLM inference on Android. Specifically, it summarizes the limits encountered when distributing large language models to Android apps via Google Play On-Device AI (AI Pack), and presents implementation patterns and workarounds discovered during verification.

The models under test are in the 2–3 GB range, such as Gemma 4 E2B. The report covers the inference mechanism (LiteRT-LM), the distribution method (AI Pack split delivery), and the implementation constraints found along the way.

Background: Strategic Positioning of Local LLMs

Why validate local LLMs?

When integrating natural language processing features into an Android app, developers can choose from multiple approaches. The table below summarizes the pros and cons of each approach:

Approach	Model vendor API	Backend inference (GPU)	Local LLM
Initial development cost	Low (API integration only)	Medium (server setup & operations)	Medium (model optimization & testing)
Operational cost	High (per-call billing)	Medium–High (GPU instances, scaling)	Low (almost zero after distribution)
Latency	High (network round-trip)	Medium (server response time)	Low (on-device)
Inference quality	High (can use the latest models)	High (customizable)	Medium–High (possible degradation from quantization)
Offline support	No	No	Yes
Privacy	No (data sent off-device)	Partial (own servers)	Yes (data stays on device)
Scaling	Automatic (handled by API provider)	Manual (need to scale servers)	Zero (client-side)
Device compatibility	All devices	All devices	Memory-dependent (see Gemma docs)
Use case	Cutting-edge AI use, enterprise features	Large user base, custom models required	Offline support, privacy-sensitive apps

The local LLM examined in this report is intended for scenarios where the advantages in the table—offline support, privacy protection, and zero scaling costs—are important. Typical scenarios where validating a local LLM is valuable include:

Offline-first requirements: airplane mode, remote areas without reliable connectivity
Privacy protection: user data must not leave the device
Cost optimization: avoid per-call API billing and eliminate scaling costs
Latency reduction: avoid network round-trip delays for a better UX
Reduced external dependencies: minimize reliance on specific cloud services

Inference mechanisms: Why LiteRT-LM suits Gemma 4

When implementing local inference, developers must choose an inference mechanism. The Android AI Overview outlines the main options:

Gemini Nano: Google’s official on-device LLM, but available only on a limited set of devices such as Pixel 8/9
LiteRT-LM (TensorFlow Lite for Language Models): runs open models (Gemma, etc.) on device; wider device compatibility and a broader choice of models
ML Kit: general-purpose ML tasks, not optimized for LLMs

We selected Gemma 4 for validation to avoid the limited device availability of Gemini Nano and to evaluate feasibility across a wider range of devices. The google-ai-edge/gallery repository provides reference Android implementations that are useful for this purpose.

LiteRT-LM (see Custom Models) extends LiteRT (formerly TensorFlow Lite) for language model inference. Standard LiteRT is optimized for static models such as image classifiers, whereas language model inference introduces the following challenges:

Sequence generation: text generation requires iterative token-by-token inference; efficient KV cache management is crucial
Memory efficiency: arranging large model weights and managing intermediate tensors during inference
Quantization compatibility: reducing model size while preserving inference quality

LiteRT-LM provides optimizations addressing these challenges. Quantizing open models such as Gemma and converting them to LiteRT-LM format enables practical inference performance even on lower-end devices.

Gemma 4: positioning and selection rationale

Gemma is a series of lightweight, efficient open-source language models from Google. Gemma 4 is the latest release and offers multiple size variants (E2B, E4B, etc.) suitable for devices with different memory and compute capabilities. We chose Gemma 4 for the following reasons:

Open-source licensing: suitable for commercial use without special distribution constraints
Multiple size variants: e.g., E2B (lighter) and E4B (higher accuracy) allow selecting the best trade-off for target devices
Broader device compatibility: unlike Gemini Nano’s limited device support, Gemma models can run on many Android devices
Reference implementations: the google-ai-edge/gallery repository contains Android reference examples and rich learning resources

This validation aims to clarify the challenges and solutions for production deployment by leveraging these advantages.

Problem: distribution limits for large models

Despite the benefits above, there are practical challenges. LLM files are large, and distribution through Google Play On-Device AI (AI Pack) imposes size limits. Uploading the unsplit models produced the following errors:

The (base_llm_e2b) asset pack exceeds the maximum compressed download size (1,500 MB).
The (base_llm_e2b_e4b) asset pack exceeds the maximum compressed download size (1,500 MB).
Your app bundle's asset packs exceed the total maximum compressed download size (4,000 MB).

AI Pack allows staged downloads that reduce the initial APK size, but the AI Pack documentation specifies a strict limit of 1,500 MB per pack (compressed), so single large models cannot be distributed in one pack.

Gemma 4 E2B: 2.4 GB
Gemma 4 E4B: 3.4 GB

Quantized LLM weight files compress poorly, remaining close to their original size even after ZIP compression. Therefore, single-pack distribution is infeasible.
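
The size arithmetic behind these errors can be sketched in Kotlin. The two limits come from the error messages above; the function names are illustrative only, and the sketch assumes 1 GB = 1,000 MB and that compression saves nothing.

```kotlin
// Limits quoted in the Play Console errors (compressed download size, in MB).
const val MAX_PACK_MB = 1_500L
const val MAX_TOTAL_MB = 4_000L

// Approximate model sizes from this report. Assumption: 1 GB = 1,000 MB and
// no compression gain for quantized weights.
val modelSizesMb = mapOf("gemma4_e2b" to 2_400L, "gemma4_e4b" to 3_400L)

/** True if a single pack of this size passes the per-pack limit. */
fun fitsInOnePack(sizeMb: Long): Boolean = sizeMb <= MAX_PACK_MB

/** True if all packs together pass the total limit for the bundle. */
fun fitsInBundle(sizesMb: Collection<Long>): Boolean = sizesMb.sum() <= MAX_TOTAL_MB

/** Minimum number of packs needed to respect the per-pack limit (ceiling division). */
fun minPacks(sizeMb: Long): Long = (sizeMb + MAX_PACK_MB - 1) / MAX_PACK_MB
```

Under these assumptions E2B needs at least two packs and E4B at least three. Shipping both models in one bundle totals about 5,800 MB, which exceeds the 4,000 MB total no matter how the files are split.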

Conclusion of the validation: feasibility of integrating Gemma 4 + LiteRT-LM + AI Pack split delivery

This report evaluates the feasibility of combining the following three elements:

Model: Gemma 4 (open-source, broad device support)
Inference: LiteRT-LM (language model-specific optimizations)
Distribution: AI Pack split delivery (to distribute large models via Google Play)

Although each element is technically established, integrating them in production raises numerous challenges. This validation documents the limits encountered, their root causes, and practical workarounds so that other developers can assess whether to adopt this architecture.

The report focuses on practical findings rather than theoretical descriptions, delivering real-world measurements, API constraints (for example, that AiPackManager.fetch() must be called from a foreground Activity), tool issues (e.g., bundletool behavior), and working Kotlin examples.

Implementation prerequisites: managing device compatibility

Gemma 4 E2B runs on devices that meet the requirements described in the Gemma 4 specifications. In this validation, basic inference tests ran successfully on a Pixel 9a. However, long-term memory behavior was not tested exhaustively, so adequate load testing on target devices is recommended before production deployment.

It is possible to limit AI Pack distribution targets to reduce the risk of delivering large models to unsuitable devices. Google Play’s AI Pack settings allow specifying requirements such as minimum RAM, API level, and device models. Adding such distribution filters avoids sending large models to devices likely to suffer memory shortages and preserves user experience.

Solution: splitting model files and staged installation

1. Split the model binary across multiple packs

Divide the model file so that each part fits within the 1,500 MB per-pack limit. Because the model file is a single contiguous binary, splitting it at byte boundaries and concatenating the parts in order reproduces the original exactly. Example splitting for Gemma 4 E2B:

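# Split into 1,200 MiB parts; split names them ...partaa, ...partab, and so on.
split -b 1200m gemma4_e2b.litertlm gemma4_e2b_part
# Move each part into the assets directory of its own AI Pack module.
mv gemma4_e2b_partaa base_llm_e2b_1/src/main/assets/gemma4_e2b_part1.bin
mv gemma4_e2b_partab base_llm_e2b_2/src/main/assets/gemma4_e2b_part2.bin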

Note that split -b 1200m produces parts of 1,200 MiB (about 1,258 MB). Keeping each part at this size leaves headroom under the 1,500 MB limit even if compression saves almost nothing.
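
The same split-and-concatenate idea can be checked with a small round trip in Kotlin. This is a minimal sketch rather than the tooling used in this report: splitFile, concatFiles, and the part naming scheme are hypothetical names introduced for illustration.

```kotlin
import java.io.File

/** Splits [source] into files of at most [partBytes] bytes in [outDir], in order. */
fun splitFile(source: File, outDir: File, partBytes: Long): List<File> {
    require(partBytes > 0) { "partBytes must be positive" }
    outDir.mkdirs()
    val parts = mutableListOf<File>()
    val buffer = ByteArray(64 * 1024)
    source.inputStream().buffered().use { input ->
        var endOfInput = false
        while (!endOfInput) {
            // Hypothetical naming scheme: <name>.part1, <name>.part2, ...
            val part = File(outDir, "${source.name}.part${parts.size + 1}")
            var written = 0L
            part.outputStream().use { out ->
                while (written < partBytes && !endOfInput) {
                    // Never read past the end of the current part.
                    val want = minOf(buffer.size.toLong(), partBytes - written).toInt()
                    val n = input.read(buffer, 0, want)
                    if (n == -1) {
                        endOfInput = true
                    } else {
                        out.write(buffer, 0, n)
                        written += n
                    }
                }
            }
            // Drop the empty trailing file created when the input ends on a boundary.
            if (written > 0) {
                parts += part
            } else {
                part.delete()
            }
        }
    }
    return parts
}

/** Writes [parts] back to back into [dest]; the order of the list is the order on disk. */
fun concatFiles(parts: List<File>, dest: File) {
    dest.outputStream().use { out ->
        parts.forEach { part -> part.inputStream().use { it.copyTo(out) } }
    }
}
```

Splitting a file, concatenating the parts in list order, and comparing the bytes with the original is a cheap way to confirm that no format-aware handling is needed.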

2. Configure each AI Pack module

Create two AI Pack modules for the split model files. Each module’s build.gradle.kts should be configured like this:

plugins { id("com.android.ai-pack") }

aiPack {
    packName = "base_llm_e2b_1"   // use underscores only (hyphens are not allowed)
    dynamicDelivery {
        deliveryType = "on-demand"  // use on-demand rather than fast-follow
    }
}

Important: AI Pack names must use underscores (_); hyphens (-) cause upload errors in Google Play Console.

3. Why choose `on-demand`

AI Pack delivery types include fast-follow and on-demand. While fast-follow auto-downloads packs after install, it can cause automatic re-downloads after removePack() is called. With on-demand, packs download only when fetch() is explicitly called, allowing effective removal and better storage control.

4. Release AI Pack storage after concatenation using `removePack()`

After both packs complete downloading, concatenate the binaries into an internal app file and remove the AI Pack storage to reclaim space.

The example below follows the patterns used in the production code: pack names and filenames are constants rather than hardcoded strings, and the destination path comes from a helper method.

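import java.io.File
import java.io.FileOutputStream
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

private suspend fun assembleAndInstall(assetsPath1: String, assetsPath2: String) {
    _state.value = AiPackState.Assembling
    val dest = assembledModelFile()
    try {
        // Copying ~2.4 GB blocks for a while, so run it on the IO dispatcher.
        withContext(Dispatchers.IO) {
            dest.parentFile?.mkdirs()
            if (dest.exists()) dest.delete()
            FileOutputStream(dest).use { out ->
                // Append the parts in their original order.
                File("$assetsPath1/$PART1_FILENAME").inputStream().use { it.copyTo(out) }
                File("$assetsPath2/$PART2_FILENAME").inputStream().use { it.copyTo(out) }
            }
        }
        // Release the AI Pack copies only after the assembled file is fully written.
        manager.removePack(PACK_NAME_1)
        manager.removePack(PACK_NAME_2)
        _state.value = AiPackState.Installed(dest.absolutePath)
    } catch (e: Exception) {
        // Remove a partially written file so that a retry starts clean.
        dest.delete()
        _state.value = AiPackState.Failed("Assembly failed: ${e.message}")
    }
}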

private fun assembledModelFile() = File(context.filesDir, "models/$MODEL_FILENAME")

companion object {
    const val PACK_NAME_1 = "base_llm_e2b_1"
    const val PACK_NAME_2 = "base_llm_e2b_2"
    const val PART1_FILENAME = "gemma4_e2b_part1.bin"
    const val PART2_FILENAME = "gemma4_e2b_part2.bin"
    const val MODEL_FILENAME = "gemma4_e2b.litertlm"
}
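
The implementation above does not verify the assembled file. One optional safeguard, sketched here as an assumption rather than as part of the validated code, is to compare a SHA-256 digest of the result against a value recorded at build time. The function names and the idea of shipping an expected digest are hypothetical.

```kotlin
import java.io.File
import java.security.MessageDigest

/** Streams [file] through SHA-256 and returns the digest as lowercase hex. */
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    val buffer = ByteArray(64 * 1024)
    file.inputStream().use { input ->
        var n = input.read(buffer)
        while (n != -1) {
            digest.update(buffer, 0, n)
            n = input.read(buffer)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

/** True if [file] hashes to [expectedSha256Hex]; hex case is ignored. */
fun matchesExpected(file: File, expectedSha256Hex: String): Boolean =
    sha256Hex(file).equals(expectedSha256Hex, ignoreCase = true)
```

If such a check were added, it would run after the copy and before removePack(), so that a corrupted result can be rebuilt from the packs without downloading them again.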

Storage usage timeline for this approach:

Phase	AI Pack area	filesDir	Total
Download / during concatenation	~2.4 GB	increasing	up to ~4.8 GB
After concatenation & removal	0 MB	~2.4 GB	~2.4 GB
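
Because the peak footprint is roughly twice the model size, it may be worth checking free space before assembly starts. The following is a minimal sketch under stated assumptions: the function names and the 200 MiB safety margin are illustrative and were not part of this validation.

```kotlin
/**
 * Returns true if [usableBytes] of free storage can hold the assembled model
 * while the downloaded parts still exist in the AI Pack area.
 * The default margin is an arbitrary assumption, not a measured value.
 */
fun hasRoomForAssembly(
    usableBytes: Long,
    partSizesBytes: List<Long>,
    safetyMarginBytes: Long = 200L * 1024 * 1024,
): Boolean {
    // The assembled file is exactly as large as all parts combined.
    val assembledBytes = partSizesBytes.sum()
    return usableBytes >= assembledBytes + safetyMarginBytes
}

/** Peak footprint: the parts in the AI Pack area plus the assembled copy. */
fun peakFootprintBytes(partSizesBytes: List<Long>): Long = 2 * partSizesBytes.sum()
```

On Android, usableBytes could be read from context.filesDir.usableSpace just before calling assembleAndInstall(). For two parts totalling 2.4 GB, peakFootprintBytes returns 4.8 GB, matching the first row of the table.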

`AiPackManager.fetch()` call restrictions

In this verification, AiPackManager.fetch() succeeded only when called from the main thread while the app had an Activity in the foreground. Calls from the following contexts failed:

Calling from an InputMethodService (results in error code -7: "Asset Pack requested but in background")
Calling from a WorkManager Worker (even when elevated to a foreground service with setForeground())

A pattern that worked is to invoke fetch() from a ViewModel scoped to a settings Activity that is in the foreground:

@HiltViewModel
class LLMSettingsViewModel @Inject constructor(
    private val aiPackRepository: AiPackRepository,
) : ViewModel() {
    init {
        viewModelScope.launch {
            aiPackRepository.fetch()  // Settings Activity must be in the foreground
        }
    }
}

Initial pack state detection

If the packs are already present at app startup, you can start assembly from the cached pack locations and avoid re-downloading:

val loc1 = manager.getPackLocation(PACK_NAME_1)
val loc2 = manager.getPackLocation(PACK_NAME_2)
if (loc1 != null && loc2 != null) {
    val path1 = loc1.assetsPath() ?: run {
        _state.value = AiPackState.Failed("assetsPath is null for $PACK_NAME_1")
        return
    }
    val path2 = loc2.assetsPath() ?: run {
        _state.value = AiPackState.Failed("assetsPath is null for $PACK_NAME_2")
        return
    }
    scope.launch { assembleAndInstall(path1, path2) }
    return
}
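
Once removePack() has run, the packs are no longer present, so a startup check presumably also needs to look for the assembled file before deciding to fetch. The decision can be sketched as a pure function. The enum and function names are hypothetical, and the first branch is an assumption that the snippet above does not show.

```kotlin
/** What the app should do at startup, given what is already on the device. */
enum class StartupAction { USE_ASSEMBLED_MODEL, ASSEMBLE_FROM_PACKS, FETCH_PACKS }

fun decideStartupAction(
    assembledModelExists: Boolean,
    pack1Present: Boolean,
    pack2Present: Boolean,
): StartupAction = when {
    // Assumption: an assembled file in filesDir means installation already finished.
    assembledModelExists -> StartupAction.USE_ASSEMBLED_MODEL
    // Both parts are cached, so assembly can start without any download.
    pack1Present && pack2Present -> StartupAction.ASSEMBLE_FROM_PACKS
    // Anything else, including a single cached part, requires fetch().
    else -> StartupAction.FETCH_PACKS
}
```

In the code above, pack1Present and pack2Present would correspond to getPackLocation() returning a non-null value for each pack, and assembledModelExists to assembledModelFile().exists().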

Download state transitions and completion detection

Because the AI Pack API does not provide a single method to list registered pack names, completion detection must rely on getPackLocation() and an update listener. The following example uses AiPackStateUpdateListener:

private fun handlePackStateUpdate(packState: PlayAiPackState) {
    when (packState.status()) {
        AiPackStatus.COMPLETED -> {
            val loc1 = manager.getPackLocation(PACK_NAME_1)
            val loc2 = manager.getPackLocation(PACK_NAME_2)
            if (loc1 != null && loc2 != null) {
                manager.unregisterListener(listener)
                val path1 = loc1.assetsPath() ?: run {
                    _state.value = AiPackState.Failed("assetsPath is null for $PACK_NAME_1")
                    return
                }
                val path2 = loc2.assetsPath() ?: run {
                    _state.value = AiPackState.Failed("assetsPath is null for $PACK_NAME_2")
                    return
                }
                scope.launch { assembleAndInstall(path1, path2) }
            }
        }
        AiPackStatus.FAILED -> handleTerminalError("Download failed (${packState.errorCode()})")
        AiPackStatus.CANCELED -> handleTerminalError("Download canceled")
        AiPackStatus.WAITING_FOR_WIFI, AiPackStatus.REQUIRES_USER_CONFIRMATION ->
            _state.value = AiPackState.WaitingForConfirmation

        else -> {
            val total = packState.totalBytesToDownload()
            val downloaded = packState.bytesDownloaded()
            val progress = if (total > 0) downloaded.toFloat() / total else 0f
            _state.value = AiPackState.Downloading(progress)
        }
    }
}

private fun handleTerminalError(cause: String) {
    manager.unregisterListener(listener)
    _state.value = AiPackState.Failed(cause)
}
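
The handler above reports progress from whichever pack sent the latest update, so a progress bar may jump as updates from the two packs interleave. One possible refinement is to track each pack separately and average the fractions. This is a sketch: OverallProgress is a hypothetical helper, it assumes each update can be attributed to a pack name, and equal weighting is only reasonable because the parts are similar in size.

```kotlin
/** Combines per-pack download progress into one value between 0 and 1. */
class OverallProgress(private val packNames: Set<String>) {
    init {
        require(packNames.isNotEmpty()) { "at least one pack name is required" }
    }

    // Latest known fraction per pack; packs that have not reported count as 0.
    private val fractions = mutableMapOf<String, Double>()

    /** Records one update and returns the averaged progress across all packs. */
    fun update(packName: String, bytesDownloaded: Long, totalBytes: Long): Float {
        if (packName in packNames && totalBytes > 0) {
            fractions[packName] = (bytesDownloaded.toDouble() / totalBytes).coerceIn(0.0, 1.0)
        }
        return (fractions.values.sum() / packNames.size).toFloat()
    }
}
```

With two packs, a half-finished first pack yields 0.25 overall, and the value reaches 1.0 only when both packs report completion.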

Local testing pitfalls

Using bundletool for local testing has pitfalls. The bundletool build-apks --local-testing command uses 32-bit integers for ZIP offset calculations, causing ArithmeticException: integer overflow for files larger than ~2.1 GB. This issue affects bundletool 1.17.x and 1.18.x.
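
The ~2.1 GB threshold corresponds to the largest value a signed 32-bit integer can hold (2^31 - 1 bytes). The snippet below illustrates this class of failure only. It is not bundletool's code, and the function names are hypothetical.

```kotlin
/** Largest byte count a signed 32-bit integer can represent (2^31 - 1). */
const val MAX_INT32_BYTES = 2_147_483_647L

/** True if a size or offset of [sizeBytes] survives narrowing to a 32-bit Int. */
fun fitsInInt32(sizeBytes: Long): Boolean = sizeBytes in 0L..MAX_INT32_BYTES

/**
 * Narrows a Long to an Int the checked way. Math.toIntExact throws
 * ArithmeticException("integer overflow") when the value does not fit.
 */
fun narrowOffset(sizeBytes: Long): Int = Math.toIntExact(sizeBytes)
```

A single 1,200 MiB part (1,258,291,200 bytes) fits, but the combined 2.4 GB does not, which is consistent with the error appearing only for the full-size model files.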

Workarounds:

Use small dummy files in place of the real model parts for local testing
Use Google Play Internal App Sharing for real-device validation with actual model files

Gemma 4 implementation notes

During verification, we gathered the following observations about the E2B and E4B variants of Gemma 4. Take note of them when implementing these models:

Gemma 4 E2B: In a brief verification, it ran without issues on Pixel 9a for simple inference tasks.
Gemma 4 E4B: Higher accuracy but increased memory usage; on memory-constrained devices the risk of OOM is higher, and the model may be unsuitable without careful testing.

Model selection must balance inference quality and device memory characteristics; run real-device tests in the prototype phase and choose the model variant appropriate for your target device set.

Summary

Key takeaways when distributing large LLMs via Google Play AI Pack:

File splitting: Split models so each pack fits under 1,500 MB
Choose on-demand: Prefer explicit control over automatic downloads to manage storage efficiently
Respect API constraints: Understand that fetch() must be called from a foreground Activity
Local testing considerations: Use Internal App Sharing to validate with real model files
Model selection: Balance accuracy and device memory characteristics

This implementation pattern can be applied to other Android apps distributing large assets. Understand AI Pack behavior and design distribution strategies that respect current constraints to maintain a good user experience.

Presently, this approach is a workaround within AI Pack’s current constraints (1,500 MB per pack). Techniques such as splitting into multiple packs, staged on-demand downloads, and deleting packs after concatenation can make distribution possible in practice. However, the following areas would benefit from improvements in the platform:

Increasing the per-pack limit would greatly reduce distribution complexity
A unified API for querying multiple pack status would simplify orchestration
Improvements in tooling (e.g., bundletool handling of >2GB files)

While these platform improvements are desirable, the patterns presented here are viable workarounds using currently available Android AI APIs and tools.
