Overview
This report evaluates the feasibility of running local LLM inference on Android. It summarizes the limits encountered when distributing large language models to Android apps via Google Play On-Device AI (AI Pack), and presents the implementation patterns and workarounds found during verification.
The models under test are in the 2–3 GB range, such as Gemma 4 E2B. The report covers the inference mechanism (LiteRT-LM), the distribution method (AI Pack split delivery), and the implementation constraints.
Background: Strategic Positioning of Local LLMs
Why validate local LLMs?
When integrating natural language processing features into an Android app, developers can choose from multiple approaches. The table below summarizes the pros and cons of each approach:
| Approach | Model vendor API | Backend inference (GPU) | Local LLM |
|---|---|---|---|
| Initial development cost | Low (API integration only) | Medium (server setup & operations) | Medium (model optimization & testing) |
| Operational cost | High (per-call billing) | Medium–High (GPU instances, scaling) | Low (almost zero after distribution) |
| Latency | High (network round-trip) | Medium (server response time) | Low (on-device) |
| Inference quality | High (can use the latest models) | High (customizable) | Medium–High (possible degradation from quantization) |
| Offline support | No | No | Yes |
| Privacy | No (data sent off-device) | Partial (own servers) | Yes (data stays on device) |
| Scaling | Automatic (handled by API provider) | Manual (need to scale servers) | Zero (client-side) |
| Device compatibility | All devices | All devices | Memory-dependent (see Gemma docs) |
| Use case | Cutting-edge AI use, enterprise features | Large user base, custom models required | Offline support, privacy-sensitive apps |
The local LLM examined in this report is intended for scenarios where the advantages in the table—offline support, privacy protection, and zero scaling costs—are important. Typical scenarios where validating a local LLM is valuable include:
- Offline-first requirements: airplane mode, remote areas without reliable connectivity
- Privacy protection: user data must not leave the device
- Cost optimization: avoid per-call API billing and eliminate scaling costs
- Latency reduction: avoid network round-trip delays for a better UX
- Reduced external dependencies: minimize reliance on specific cloud services
Inference mechanisms: Why LiteRT-LM suits Gemma 4
When implementing local inference, developers must choose an inference mechanism. The Android AI Overview outlines the main options:
- Gemini Nano: Google’s official on-device LLM, but available only on a limited set of devices such as Pixel 8/9
- LiteRT-LM (LiteRT, formerly TensorFlow Lite, for language models): runs open models (Gemma, etc.) on device; wider device compatibility and a broader choice of models
- ML Kit: general-purpose ML tasks, not optimized for LLMs
We selected Gemma 4 for validation to avoid the limited device availability of Gemini Nano and to evaluate feasibility across a wider range of devices. The google-ai-edge/gallery repository provides reference Android implementations that are useful for this purpose.
LiteRT-LM (see Custom Models) is an extension of LiteRT (formerly TensorFlow Lite) targeted at language model inference. Standard LiteRT is optimized for static models such as image classifiers, whereas language model inference introduces the following challenges:
- Sequence generation: text generation requires iterative token-by-token inference; efficient KV cache management is crucial
- Memory efficiency: arranging large model weights and managing intermediate tensors during inference
- Quantization compatibility: reducing model size while preserving inference quality
LiteRT-LM provides optimizations addressing these challenges. Quantizing open models such as Gemma and converting them to LiteRT-LM format enables practical inference performance even on lower-end devices.
Gemma 4: positioning and selection rationale
Gemma is a series of lightweight, efficient open-source language models from Google. Gemma 4 is the latest release and comes in multiple size variants (E2B, E4B, etc.) suited to devices with different memory and compute capabilities. We chose Gemma 4 for the following reasons:
- Open-source licensing: suitable for commercial use without special distribution constraints
- Multiple variants: e.g., E2B (lighter) and E4B (higher accuracy) allow selecting the best trade-off for target devices
- Broader device compatibility: unlike Gemini Nano’s limited device support, Gemma models can run on many Android devices
- Reference implementations: the google-ai-edge/gallery repository contains Android reference examples and rich learning resources
This validation builds on these advantages to identify the challenges of production deployment and the solutions to them.
Problem: distribution limits for large models
Despite the benefits above, there are practical challenges. LLMs are large, and Google Play On-Device AI (AI Pack) imposes size limits on distribution. Attempting to ship the models as single packs produced errors such as:
The (base_llm_e2b) asset pack exceeds the maximum compressed download size (1,500 MB).
The (base_llm_e2b_e4b) asset pack exceeds the maximum compressed download size (1,500 MB).
Your app bundle's asset packs exceed the total maximum compressed download size (4,000 MB).
AI Pack supports staged downloads that keep the initial install small, but the AI Pack documentation specifies a strict limit of 1,500 MB per pack (compressed download size), and the messages above also show a 4,000 MB limit on the total across all packs. Both models under consideration exceed the per-pack limit:
- Gemma 4 E2B: 2.4 GB
- Gemma 4 E4B: 3.4 GB
Quantized LLM weight files compress poorly and stay close to their original size even after ZIP compression, so single-pack distribution is infeasible.
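The pack-count arithmetic implied by these limits can be sketched as follows. This is a rough illustration that treats 1 GB as 1,000 MB and assumes the files do not shrink under compression; the function and constant names are made up for this example.

```kotlin
// Limits quoted in the error messages, in MB of compressed download size.
const val PER_PACK_LIMIT_MB = 1_500L
const val TOTAL_LIMIT_MB = 4_000L

// Smallest number of packs that can hold a model of the given size.
// Assumption: the file is effectively incompressible, so compressed size is about the raw size.
fun packsNeeded(modelSizeMb: Long, perPackLimitMb: Long = PER_PACK_LIMIT_MB): Long =
    (modelSizeMb + perPackLimitMb - 1) / perPackLimitMb // ceiling division

// True if the combined size of all models stays within the bundle-wide limit.
fun fitsTotalLimit(vararg modelSizesMb: Long): Boolean =
    modelSizesMb.sum() <= TOTAL_LIMIT_MB

fun main() {
    val e2bMb = 2_400L // Gemma 4 E2B, 2.4 GB taken as 2,400 MB
    val e4bMb = 3_400L // Gemma 4 E4B, 3.4 GB taken as 3,400 MB
    println("E2B packs: ${packsNeeded(e2bMb)}")
    println("E4B packs: ${packsNeeded(e4bMb)}")
    println("E2B + E4B within total limit: ${fitsTotalLimit(e2bMb, e4bMb)}")
}
```

Under these assumptions E2B needs two packs and E4B three. Shipping both in one bundle would total 5,800 MB, which is over the 4,000 MB bundle-wide limit and would be consistent with the third error message.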
Conclusion of the validation: feasibility of integrating Gemma 4 + LiteRT-LM + AI Pack split delivery
This report evaluates the feasibility of combining the following three elements:
- Model: Gemma 4 (open-source, broad device support)
- Inference: LiteRT-LM (language model-specific optimizations)
- Distribution: AI Pack split delivery (to distribute large models via Google Play)
Although each element is technically established, integrating them in production raises numerous challenges. This validation documents encountered limits, root causes, and practical workarounds so other developers can assess adopting this architecture.
The report focuses on practical findings rather than theory: real-world measurements, API constraints (for example, AiPackManager.fetch() must be called from a foreground Activity), tooling issues (for example, bundletool behavior), and working Kotlin examples.
Implementation prerequisites: managing device compatibility
Gemma 4 E2B runs in the environments described in the Gemma 4 specifications. In this validation, basic inference tests ran successfully on a Pixel 9a. Long-running memory behavior was not tested exhaustively, however, so load testing on the target devices is recommended before production deployment.
AI Pack distribution can also be restricted to reduce the risk of delivering large models to unsuitable devices. Google Play’s AI Pack settings allow specifying requirements such as minimum RAM, API level, and device models. Such distribution filters keep large models away from devices that are likely to run out of memory, which protects the user experience.
Solution: splitting model files and staged installation
1. Split the model binary across multiple packs
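Divide the model file so that each part fits within the 1,500 MB per-pack limit. Because the model is a single contiguous binary file, it can be split at arbitrary byte offsets and restored by concatenating the parts in order. Example split for Gemma 4 E2B:
```sh
# Split into parts of at most 1,200 MiB: gemma4_e2b_partaa, gemma4_e2b_partab
split -b 1200m gemma4_e2b.litertlm gemma4_e2b_part
# Move each part into the assets directory of its AI Pack module
mv gemma4_e2b_partaa base_llm_e2b_1/src/main/assets/gemma4_e2b_part1.bin
mv gemma4_e2b_partab base_llm_e2b_2/src/main/assets/gemma4_e2b_part2.bin
```
With parts capped at about 1,200 MB, each pack stays under the 1,500 MB limit with headroom, even though the weights barely compress.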
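The claim that split parts can be restored by plain concatenation can be checked with a small round trip. The following is a minimal sketch using only the JVM standard library; the function names and the .partN naming are hypothetical, and the demo file is a tiny stand-in rather than a real model.

```kotlin
import java.io.File
import java.nio.file.Files
import java.security.MessageDigest

// Splits `source` into consecutive parts of at most `partSize` bytes (like `split -b`).
fun splitFile(source: File, partSize: Long, outDir: File): List<File> {
    require(partSize > 0) { "partSize must be positive" }
    outDir.mkdirs()
    val parts = mutableListOf<File>()
    val buffer = ByteArray(64 * 1024)
    source.inputStream().use { input ->
        var remainingInFile = source.length()
        while (remainingInFile > 0) {
            val part = File(outDir, "${source.name}.part${parts.size + 1}")
            var remainingInPart = minOf(partSize, remainingInFile)
            part.outputStream().use { out ->
                while (remainingInPart > 0) {
                    val want = minOf(buffer.size.toLong(), remainingInPart).toInt()
                    val n = input.read(buffer, 0, want)
                    if (n == -1) { remainingInFile = 0; break } // unexpected end of file
                    out.write(buffer, 0, n)
                    remainingInPart -= n
                    remainingInFile -= n
                }
            }
            parts += part
        }
    }
    return parts
}

// Concatenates the parts, in list order, into `dest`.
fun joinFiles(parts: List<File>, dest: File) {
    dest.outputStream().use { out ->
        for (part in parts) part.inputStream().use { it.copyTo(out) }
    }
}

// Hex-encoded SHA-256 of a file, streamed so large files are not loaded into memory.
fun sha256(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    val buffer = ByteArray(64 * 1024)
    file.inputStream().use { input ->
        while (true) {
            val n = input.read(buffer)
            if (n == -1) break
            digest.update(buffer, 0, n)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

fun main() {
    val dir = Files.createTempDirectory("split-demo").toFile()
    val original = File(dir, "model.bin").apply {
        writeBytes(ByteArray(2_500) { (it % 251).toByte() }) // stand-in for the model
    }
    val parts = splitFile(original, partSize = 1_200, outDir = File(dir, "parts"))
    val rebuilt = File(dir, "rebuilt.bin")
    joinFiles(parts, rebuilt)
    println("parts=${parts.size}, identical=${sha256(original) == sha256(rebuilt)}")
    dir.deleteRecursively()
}
```

Recording the hash of the original model at build time and comparing it after on-device assembly would be one way to detect a truncated or misordered concatenation.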
2. Configure each AI Pack module
Create two AI Pack modules for the split model files. Each module’s build.gradle.kts should be configured like this:
```kotlin
plugins { id("com.android.ai-pack") }

aiPack {
    packName = "base_llm_e2b_1" // use underscores only (hyphens are not allowed)
    dynamicDelivery {
        deliveryType = "on-demand" // use on-demand rather than fast-follow
    }
}
```
Important: AI Pack names must use underscores (_); hyphens (-) cause upload errors in Google Play Console.
3. Why choose on-demand
AI Pack delivery types include fast-follow and on-demand. While fast-follow auto-downloads packs after install, it can cause automatic re-downloads after removePack() is called. With on-demand, packs download only when fetch() is explicitly called, allowing effective removal and better storage control.
4. Release AI Pack storage after concatenation using removePack()
After both packs complete downloading, concatenate the binaries into an internal app file and remove the AI Pack storage to reclaim space.
The implementation keeps pack names and file names in constants and isolates the destination path in a helper method instead of hardcoding values inline. Example implementation:
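```kotlin
import java.io.File
import java.io.FileOutputStream
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

private suspend fun assembleAndInstall(assetsPath1: String, assetsPath2: String) {
    _state.value = AiPackState.Assembling
    val dest = assembledModelFile()
    try {
        // Copying ~2.4 GB is slow, blocking I/O; keep it off the main thread.
        withContext(Dispatchers.IO) {
            dest.parentFile?.mkdirs()
            if (dest.exists()) dest.delete()
            // Append the parts in their original order to rebuild the model file.
            FileOutputStream(dest).use { out ->
                File(assetsPath1, PART1_FILENAME).inputStream().use { it.copyTo(out) }
                File(assetsPath2, PART2_FILENAME).inputStream().use { it.copyTo(out) }
            }
        }
        // Free the AI Pack storage only after the assembled file is complete.
        manager.removePack(PACK_NAME_1)
        manager.removePack(PACK_NAME_2)
        _state.value = AiPackState.Installed(dest.absolutePath)
    } catch (e: CancellationException) {
        dest.delete() // do not leave a partial file behind
        throw e       // let coroutine cancellation propagate
    } catch (e: Exception) {
        dest.delete()
        _state.value = AiPackState.Failed("Assembly failed: ${e.message}")
    }
}

private fun assembledModelFile() = File(context.filesDir, "models/$MODEL_FILENAME")

companion object {
    const val PACK_NAME_1 = "base_llm_e2b_1"
    const val PACK_NAME_2 = "base_llm_e2b_2"
    const val PART1_FILENAME = "gemma4_e2b_part1.bin"
    const val PART2_FILENAME = "gemma4_e2b_part2.bin"
    const val MODEL_FILENAME = "gemma4_e2b.litertlm"
}
```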
Storage usage timeline for this approach:
| Phase | AI Pack area | filesDir | Total |
|---|---|---|---|
| Download / during concatenation | ~2.4 GB | increasing | up to ~4.8 GB |
| After concatenation & removal | 0 MB | ~2.4 GB | ~2.4 GB |
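The peak in the table suggests checking free space before starting the download. The following is a minimal sketch, assuming the AI Pack area and filesDir share one storage partition; the margin value and function names are hypothetical.

```kotlin
import java.io.File

// While the packs and the assembled copy coexist, usage peaks at about twice the model size.
fun peakStorageBytes(modelSizeBytes: Long): Long = 2 * modelSizeBytes

// `marginBytes` is an arbitrary safety buffer chosen for this example.
fun hasRoomFor(
    modelSizeBytes: Long,
    usableBytes: Long,
    marginBytes: Long = 256L * 1024 * 1024,
): Boolean = usableBytes >= peakStorageBytes(modelSizeBytes) + marginBytes

fun main() {
    val modelSizeBytes = 2_400_000_000L // roughly Gemma 4 E2B
    // On Android, context.filesDir.usableSpace would be the natural source of this value.
    val usable = File(System.getProperty("java.io.tmpdir")).usableSpace
    println("peak need: ${peakStorageBytes(modelSizeBytes)} bytes")
    println("enough space here: ${hasRoomFor(modelSizeBytes, usable)}")
}
```

Running such a check before calling fetch() lets the app explain a storage shortfall to the user up front, rather than failing midway through assembly.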
AiPackManager.fetch() call restrictions
The AI Pack API requires AiPackManager.fetch() to be called from the main thread while the app is in the foreground. Specifically, calls from the following contexts fail:
- Calling from an InputMethodService (results in error code -7: "Asset Pack requested but in background")
- Calling from a WorkManager Worker (even when elevated to a foreground service with setForeground())
A correct example is invoking fetch() from a ViewModel scoped to a settings Activity that is in the foreground:
```kotlin
@HiltViewModel
class LLMSettingsViewModel @Inject constructor(
    private val aiPackRepository: AiPackRepository,
) : ViewModel() {
    init {
        viewModelScope.launch {
            aiPackRepository.fetch() // Settings Activity must be in the foreground
        }
    }
}
```
Initial pack state detection
If the packs are already present at app startup, you can start assembly from the cached pack locations and avoid re-downloading:
```kotlin
val loc1 = manager.getPackLocation(PACK_NAME_1)
val loc2 = manager.getPackLocation(PACK_NAME_2)
if (loc1 != null && loc2 != null) {
    val path1 = loc1.assetsPath() ?: run {
        _state.value = AiPackState.Failed("assetsPath is null for $PACK_NAME_1")
        return
    }
    val path2 = loc2.assetsPath() ?: run {
        _state.value = AiPackState.Failed("assetsPath is null for $PACK_NAME_2")
        return
    }
    scope.launch { assembleAndInstall(path1, path2) }
    return
}
```
Download state transitions and completion detection
Because the AI Pack API does not provide a single method to list registered pack names, completion detection must rely on getPackLocation() and an update listener. The following example uses AiPackStateUpdateListener:
```kotlin
// PlayAiPackState is presumably an import alias for the Play library's state class,
// used to avoid a name clash with the app's own AiPackState.
private fun handlePackStateUpdate(packState: PlayAiPackState) {
    when (packState.status()) {
        AiPackStatus.COMPLETED -> {
            // The listener fires once per pack; proceed only when both are on disk.
            val loc1 = manager.getPackLocation(PACK_NAME_1)
            val loc2 = manager.getPackLocation(PACK_NAME_2)
            if (loc1 != null && loc2 != null) {
                manager.unregisterListener(listener)
                val path1 = loc1.assetsPath() ?: run {
                    _state.value = AiPackState.Failed("assetsPath is null for $PACK_NAME_1")
                    return
                }
                val path2 = loc2.assetsPath() ?: run {
                    _state.value = AiPackState.Failed("assetsPath is null for $PACK_NAME_2")
                    return
                }
                scope.launch { assembleAndInstall(path1, path2) }
            }
        }
        AiPackStatus.FAILED -> handleTerminalError("Download failed (${packState.errorCode()})")
        AiPackStatus.CANCELED -> handleTerminalError("Download canceled")
        AiPackStatus.WAITING_FOR_WIFI, AiPackStatus.REQUIRES_USER_CONFIRMATION ->
            _state.value = AiPackState.WaitingForConfirmation
        else -> {
            // Progress of the pack that triggered this update, as a fraction in 0..1.
            val total = packState.totalBytesToDownload()
            val downloaded = packState.bytesDownloaded()
            val progress = if (total > 0) downloaded.toFloat() / total else 0f
            _state.value = AiPackState.Downloading(progress)
        }
    }
}

private fun handleTerminalError(cause: String) {
    manager.unregisterListener(listener)
    _state.value = AiPackState.Failed(cause)
}
```
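The handler above reports progress for whichever pack triggered the update. If a single combined figure is wanted for the UI, per-pack updates can be folded into one tracker. The following is a minimal sketch in plain Kotlin, decoupled from the Play library; PackProgress and MultiPackTracker are hypothetical types, and each pack is weighted equally rather than by byte count.

```kotlin
// Hypothetical snapshot of one pack; in the app these values would be copied
// from the state object passed to the update listener.
data class PackProgress(val bytesDownloaded: Long, val totalBytes: Long, val completed: Boolean)

class MultiPackTracker(private val packNames: Set<String>) {
    private val latest = mutableMapOf<String, PackProgress>()

    fun update(packName: String, progress: PackProgress) {
        require(packName in packNames) { "Unknown pack: $packName" }
        latest[packName] = progress
    }

    // True only when every expected pack has reported completion.
    fun allCompleted(): Boolean = packNames.all { latest[it]?.completed == true }

    // Average of per-pack fractions in 0..1; packs that have not reported yet count as 0.
    fun overallProgress(): Float =
        packNames.map { name ->
            val p = latest[name]
            when {
                p == null -> 0f
                p.completed -> 1f
                p.totalBytes <= 0 -> 0f
                else -> p.bytesDownloaded.toFloat() / p.totalBytes
            }
        }.average().toFloat()
}

fun main() {
    val tracker = MultiPackTracker(setOf("base_llm_e2b_1", "base_llm_e2b_2"))
    tracker.update("base_llm_e2b_1", PackProgress(600, 1_200, completed = false))
    println("progress=${tracker.overallProgress()}, done=${tracker.allCompleted()}")
    tracker.update("base_llm_e2b_1", PackProgress(1_200, 1_200, completed = true))
    tracker.update("base_llm_e2b_2", PackProgress(1_100, 1_100, completed = true))
    println("progress=${tracker.overallProgress()}, done=${tracker.allCompleted()}")
}
```

Equal weighting is adequate when the parts are of similar size; weighting by bytes would be more accurate when the last part is much smaller than the others.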
Local testing pitfalls
Using bundletool for local testing has pitfalls. The bundletool build-apks --local-testing command uses 32-bit integers for ZIP offset calculations, causing ArithmeticException: integer overflow for files larger than ~2.1 GB. This issue affects bundletool 1.17.x and 1.18.x.
Workarounds:
- Use dummy empty files for local testing
- Use Google Play Internal App Sharing for real-device validation with actual model files
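The ~2.1 GB threshold corresponds to the largest signed 32-bit value (2,147,483,647). The sketch below reproduces the same exception message with Math.toIntExact from the JDK. Whether bundletool uses this exact call is an assumption, so treat it as an illustration of the failure mode rather than of bundletool's code; the helper names are hypothetical.

```kotlin
// True if a byte count or offset cannot be represented as a signed 32-bit integer.
fun exceedsInt32(sizeBytes: Long): Boolean = sizeBytes > Int.MAX_VALUE

// Narrowing with an overflow check; fails with ArithmeticException when the value does not fit.
fun narrowOffset(offset: Long): Result<Int> = runCatching { Math.toIntExact(offset) }

fun main() {
    val modelSizeBytes = 2_400_000_000L // roughly the Gemma 4 E2B file size
    println("Int.MAX_VALUE = ${Int.MAX_VALUE}")
    println("exceeds 32-bit range: ${exceedsInt32(modelSizeBytes)}")
    narrowOffset(modelSizeBytes).onFailure { e ->
        println("${e::class.simpleName}: ${e.message}")
    }
}
```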
Gemma 4 implementation notes
During verification, we gathered the following observations about the Gemma 4 variants (E2B, E4B):
- Gemma 4 E2B: In a brief verification, it ran without issues on a Pixel 9a for simple inference tasks.
- Gemma 4 E4B: Higher accuracy but higher memory usage; on memory-constrained devices the risk of OOM is greater, and the model may be unsuitable without careful testing.
Model selection must balance inference quality and device memory characteristics; run real-device tests in the prototype phase and choose the model variant appropriate for your target device set.
Summary
Key takeaways when distributing large LLMs via Google Play AI Pack:
- File splitting: Split models so each pack fits under 1,500 MB
- Choose on-demand: Prefer explicit control over automatic downloads to manage storage efficiently
- Respect API constraints: Understand that fetch() must be called from a foreground Activity
- Local testing considerations: Use Internal App Sharing to validate with real model files
- Model selection: Balance accuracy and device memory characteristics
This implementation pattern can be applied to other Android apps distributing large assets. Understand AI Pack behavior and design distribution strategies that respect current constraints to maintain a good user experience.
Presently, this approach is a workaround within AI Pack’s current constraints (1,500 MB per pack). Techniques such as splitting into multiple packs, staged on-demand downloads, and deleting packs after concatenation can make distribution possible in practice. However, the following areas would benefit from improvements in the platform:
- Increasing the per-pack limit would greatly reduce distribution complexity
- A unified API for querying multiple pack status would simplify orchestration
- Better tooling support for files over 2 GB (e.g., in bundletool) would make local testing with real models practical
While these platform improvements are desirable, the patterns presented here are viable workarounds using currently available Android AI APIs and tools.
References: