AI Powered Mobile App

Explore how AI will change mobile dev

Siamak (Ash) Ashrafi
12 min read · Mar 1, 2024

AI/ML is transforming mobile apps!

Explore how modern smartphones are packed with the processing power and sensors to run AI on-device, and dive into the different tools and frameworks available to turn your ideas into reality. Get ready to unleash the true potential of your phone and discover how AI is revolutionizing mobile development …

Introducing Photodo, a mobile application that helps users manage tasks using machine learning.

The app’s home screen consists of three sections: Favorites, Tasks, and a button to add new tasks. To add a task, you take a picture of it, record a voice memo describing it, and set a due date and time. The app’s machine learning then identifies the object in the picture and works out the task. For example, in the video the user takes a picture of broken sunglasses and the app recognizes them and suggests fixing them as the task.

Users can also add notes to their tasks by speaking or taking a picture of the note. The app then transcribes the image into text. Finally, users can set a budget for the task and the app helps search for local businesses to complete the task. Once a business is chosen, the app provides directions to the business. After completing the task, users can either mark it as complete or delete it.

Video: AI Powered Photodo:

AI Powered ToDo App

Software Architecture

For a detailed understanding of the modern Android architecture, please see our article below:

Following this architecture allows for a testable and maintainable code base.

This is a phone/watch app in a multi-modular architecture.

  • Wear (watch) is just another feature!
  • Every feature is an isolated application with its own build environment.
  • MVVM: Hilt-DI & Room-Reactive Updates (see the sketch below)
The Wear (watch) App is just another feature.
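
A minimal sketch of that MVVM pattern, using illustrative names (PhotodoRepository, PhotodoListUiState) rather than the app’s real ones: a Hilt-injected ViewModel exposes a StateFlow built from a Room-backed Flow, so any database change reactively re-renders the UI.

import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import dagger.hilt.android.lifecycle.HiltViewModel
import javax.inject.Inject
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.SharingStarted
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.map
import kotlinx.coroutines.flow.stateIn

// Illustrative types; the real app's names may differ.
data class PhotodoTask(val id: Long, val title: String)
data class PhotodoListUiState(val tasks: List<PhotodoTask> = emptyList())

interface PhotodoRepository {
    fun observeTasks(): Flow<List<PhotodoTask>> // typically backed by a Room DAO returning Flow
}

@HiltViewModel
class PhotodoListViewModel @Inject constructor(
    repository: PhotodoRepository
) : ViewModel() {

    // Room emits a new list whenever the table changes; Compose just collects uiState.
    val uiState: StateFlow<PhotodoListUiState> =
        repository.observeTasks()
            .map { PhotodoListUiState(it) }
            .stateIn(viewModelScope, SharingStarted.WhileSubscribed(5_000), PhotodoListUiState())
}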

ML App

Android System Resource

  • Voice to text
override fun startSpeechToText(updateText: (String) -> Unit, finished: () -> Unit) {
    val speechRecognizer = SpeechRecognizer.createSpeechRecognizer(appContext)
    val speechRecognizerIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
    speechRecognizerIntent.putExtra(
        RecognizerIntent.EXTRA_LANGUAGE_MODEL,
        RecognizerIntent.LANGUAGE_MODEL_FREE_FORM,
    )
    speechRecognizer.setRecognitionListener(object : RecognitionListener {
        override fun onReadyForSpeech(bundle: Bundle?) {}
        override fun onBeginningOfSpeech() {}
        override fun onRmsChanged(v: Float) {}
        override fun onBufferReceived(bytes: ByteArray?) {}
        override fun onEndOfSpeech() { finished() }
        override fun onError(i: Int) {}
        override fun onResults(bundle: Bundle) {
            // The recognizer returns a ranked list of transcriptions; use the best one.
            val result = bundle.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
            if (result != null) {
                updateText(result[0]) // updates the ViewModel
            }
        }
        override fun onPartialResults(bundle: Bundle) {}
        override fun onEvent(i: Int, bundle: Bundle?) {}
    })
    speechRecognizer.startListening(speechRecognizerIntent)
}

Called from the ViewModel, the UiState is updated and the UI is rendered.

is AddPhotodoEvent.StartCaptureSpeech2Txt -> {
    viewModelScope.launch {
        // The actual work happens in the use case.
        audioFun.startSpeechToText(event.updateText, event.finished)
    }
}
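
A hedged sketch of what the `event.updateText` callback can do on the ViewModel side; the `_uiState` and `description` names are assumptions based on the other snippets in this article:

// Hedged sketch: copy the transcribed speech into the UiState so the Composable re-renders.
fun onSpeechResult(transcribedText: String) {
    viewModelScope.launch {
        _uiState.emit(_uiState.value.copy(description = transcribedText))
    }
}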

Documentation:

ML Frameworks

For a review of which ML framework to use when, please see our article below:

  • ML Powerhouse: Gemini + AICore (Pixel 8 Pro only) for blazing fast, on-device text tasks like summaries and smart replies. Easy drop-in, but Pixel exclusive.
  • Pre-built Powerhouse: ML Kit for vision & language magic like face detection and text recognition. Fast & simple, great for basic needs.
  • Firebase Fusion: (Mostly Deprecated) Firebase ML offers familiar pre-built models within your existing Firebase ecosystem. One-stop shop for basic functionalities.
  • Visual Pipeline Playground: MediaPipe builds complex pipelines for vision, audio, and more with pre-built blocks. Drag & drop ease, even for ML newbies.
  • Deep Dive (Adventurous): TensorFlow Lite for training & tweaking your own custom models. Maximum control, requires expertise and longer development.

TensorFlow Lite (TFLite)

The hardest system to use:

Data

Get Data / Clean Data — Very Hard!

Explore, analyze, and share quality data. Learn more about data types, creating, and collaborating.

Model

Machine Learning Mastery is aimed at developers and practitioners interested in learning machine learning.

Machine Learning Mastery
  • Target Audience: Developers transitioning into machine learning
  • Content Focus: Practical application of machine learning with code examples (often Python)
  • Learning Style: Straightforward explanations without heavy emphasis on the underlying math (unlike academic papers)
  • Resources: Tutorials, articles, courses, and a free email course
  • Technology Coverage: Wide range of machine learning topics including deep learning, neural networks, natural language processing, and more.

Overall, Machine Learning Mastery is a valuable resource for developers who want to get started with machine learning and see real-world examples of how it’s used.

Machine learning is taught by academics, for academics.
That’s why most material is so dry and math-heavy.

Developers need to know what works and how to use it.
We need less math and more tutorials with working code.

https://machinelearningmastery.com/start-here/

Setup TFLite

After adding the model to Android Studio we can inspect the model’s metadata. The most important number here is the image size: we must feed the model images of 321 by 321 pixels or it will not work!

Place the model in the assets folder, not the ml folder.

Always try to start with a pre-trained model; it will make your life much easier.
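
With the model file sitting in assets, creating the classifier with the TFLite Task Vision library takes only a few lines. This is a minimal sketch; the `landmarks.tflite` file name is a placeholder for whatever model you added, and `context` is whatever Context your repository holds:

// Hedged sketch: load a TFLite model shipped in src/main/assets (not the ml/ folder).
val options = ImageClassifier.ImageClassifierOptions.builder()
    .setMaxResults(1)          // we only need the top classification per frame
    .setScoreThreshold(0.5f)   // drop low-confidence guesses
    .build()

val classifier = ImageClassifier.createFromFileAndOptions(
    context,
    "landmarks.tflite",
    options
)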

ML Model to ML App

For a detailed, full tutorial please watch the video below …

Philipp Lackner — Full Tutorial:

Model and Source included in tutorial …

Get the source code for this video on GitHub:

Only Thing … do not put your code in the view. Pass it in from the ViewModel!

Build the classifier in the Repository of your app.

Building the TFLite ML in the project repository

Call the analyzer:

class LandmarkImageAnalyzer(
    private val classifier: LandmarkClassifier,
    private val onResults: (List<LandMarkClassification>) -> Unit
) : ImageAnalysis.Analyzer {

    private var frameSkipCounter = 0

    override fun analyze(image: ImageProxy) {
        // Only classify one frame out of every 60 to keep the preview responsive.
        if (frameSkipCounter % 60 == 0) {
            val rotationDegrees = image.imageInfo.rotationDegrees

            val bitmap = image
                .toBitmap()
                .centerCrop(321, 321) // off by even one pixel and the model will not work

            val results = classifier.classify(bitmap, rotationDegrees)
            Log.d("Photodo Pre Class", results.toString())
            onResults(results)
        }
        frameSkipCounter++

        image.close()
    }
}

LandmarkClassifier calls the TF Classifier with the image

// Setup Classifier
classifier = ImageClassifier.createFromFileAndOptions( ... )
// Call Classifier and get the list of landmarks.
override fun classify(bitmap: Bitmap, rotation: Int): List<LandMarkClassification> {
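
A possible body for that `classify()` method, sketched against the TFLite Task Vision API; `getOrientationFromRotation()` and the `LandMarkClassification` fields are assumed helpers, not the article’s exact code:

// Hedged sketch: run the TFLite Task Vision classifier and map its output to our model.
override fun classify(bitmap: Bitmap, rotation: Int): List<LandMarkClassification> {
    val imageProcessingOptions = ImageProcessingOptions.builder()
        .setOrientation(getOrientationFromRotation(rotation)) // map camera rotation to tensor orientation
        .build()

    val tensorImage = TensorImage.fromBitmap(bitmap)
    val results = classifier.classify(tensorImage, imageProcessingOptions)

    // Flatten the classifier output into our own objects for the ViewModel.
    return results.flatMap { classifications ->
        classifications.categories.map { category ->
            LandMarkClassification(
                name = category.displayName, // or category.label, depending on the model metadata
                score = category.score
            )
        }
    }.distinctBy { it.name }
}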

The ML classifier generates a list of landmarks for the ViewModel.

The ViewModel updates the UiState, and as the UiState changes the list is published to the Composable view.
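
On the Compose side this amounts to collecting the state flow; a minimal sketch with assumed names (`LandmarkViewModel`, `classifications`):

import androidx.compose.foundation.lazy.LazyColumn
import androidx.compose.foundation.lazy.items
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.runtime.getValue
import androidx.hilt.navigation.compose.hiltViewModel
import androidx.lifecycle.compose.collectAsStateWithLifecycle

// Hedged sketch: the Composable collects the ViewModel's UiState and
// re-renders whenever a new classification list is emitted.
@Composable
fun LandmarkResults(viewModel: LandmarkViewModel = hiltViewModel()) {
    val uiState by viewModel.uiState.collectAsStateWithLifecycle()

    LazyColumn {
        items(uiState.classifications) { classification ->
            Text(text = "${classification.name}  ${classification.score}")
        }
    }
}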

TFLite LandMark App

ML Kit

DO NOT USE TFLite if you can use ML Kit !!!

ML Kit Face Detection

  • Recognize and locate facial features: get the coordinates of the eyes, ears, cheeks, nose, and mouth of every face detected.
  • Get the contours of facial features: get the contours of detected faces and their eyes, eyebrows, lips, and nose.
  • Recognize facial expressions: determine whether a person is smiling or has their eyes closed.
  • Track faces across video frames: get an identifier for each unique detected face. The identifier is consistent across invocations, so you can perform image manipulation on a particular person in a video stream.
  • Process video frames in real time: face detection is performed on the device and is fast enough to be used in real-time applications, such as video manipulation.

Set up the cameraController and the Composable list

Again we just want to build the `LifecycleCameraController`

// Setup the LifecycleCameraController to pass to the camera
val cameraController = LifecycleCameraController(currContext)

val lifecycleOwner = LocalLifecycleOwner.current
cameraController.bindToLifecycle(lifecycleOwner)
cameraController.cameraSelector = CameraSelector.DEFAULT_BACK_CAMERA

Set the ImageAnalysis `setImageAnalysisAnalyzer` for the camera …

// Reference example code to set up the detector

cameraController.setImageAnalysisAnalyzer(executor) { imageProxy ->
    imageProxy.image?.let { image ->
        val img = InputImage.fromMediaImage(
            image,
            imageProxy.imageInfo.rotationDegrees
        )

        val options =
            FaceDetectorOptions.Builder()
                .setPerformanceMode(FaceDetectorOptions.PERFORMANCE_MODE_FAST)
                .setContourMode(FaceDetectorOptions.CONTOUR_MODE_ALL)
                .build()

        val detector: FaceDetector = FaceDetection.getClient(options)

        detector.process(img)
            .addOnSuccessListener(
                OnSuccessListener<List<Any?>?> { faces ->
                    // mFaceButton.setEnabled(true)
                    // processFaceContourDetectionResult(faces)
                    faceList = faces
                    Log.d("Photodo", "Faces are here $faces")
                })
            .addOnFailureListener(
                OnFailureListener { e -> // Task failed with an exception
                    // mFaceButton.setEnabled(true)
                    e.printStackTrace()
                })
            .addOnCompleteListener {
                imageProxy.close() // release the frame so the analyzer keeps receiving images
            }
    }
}

Face Detection is just a cameraController passed in from the ViewModel!
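
For reference, here is roughly what that looks like on the Compose side when the ViewModel hands us the controller; the composable name is illustrative:

import androidx.camera.view.LifecycleCameraController
import androidx.camera.view.PreviewView
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.viewinterop.AndroidView

// Hedged sketch: render the controller from the ViewModel inside Compose.
// AndroidView bridges the classic PreviewView into the Compose tree.
@Composable
fun FaceCameraPreview(
    cameraController: LifecycleCameraController,
    modifier: Modifier = Modifier
) {
    AndroidView(
        factory = { context ->
            PreviewView(context).apply { controller = cameraController }
        },
        modifier = modifier
    )
}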

Output from ML Kit Face Detection …

Face{boundingBox=Rect(94, 238 - 345, 490), 
trackingId=-1,
rightEyeOpenProbability=0.8734263,
leftEyeOpenProbability=0.99285287,
smileProbability=0.009943936,
eulerX=2.2682128, eulerY=-6.074356, eulerZ=4.808043,
landmarks=Landmarks{
landmark_0=FaceLandmark{type=0, position=PointF(221.54976, 470.18063)},
landmark_1=FaceLandmark{type=1, position=PointF(148.79044, 402.5858)},
landmark_3=FaceLandmark{type=3, position=PointF(131.24284, 376.60953)},
landmark_4=FaceLandmark{type=4, position=PointF(162.24553, 330.65063)},
landmark_5=FaceLandmark{type=5, position=PointF(184.89821, 442.79837)},
landmark_6=FaceLandmark{type=6, position=PointF(208.09193, 385.20282)},
landmark_7=FaceLandmark{type=7, position=PointF(282.59137, 387.56708)},
landmark_9=FaceLandmark{type=9, position=PointF(330.84723, 373.07874)},
landmark_10=FaceLandmark{type=10, position=PointF(251.61719, 320.842)},
landmark_11=FaceLandmark{type=11, position=PointF(266.47348, 435.18048)}},

...

contours=Contours{Contour_1=FaceContour{type=1, points=[PointF(198.0, 252.0),
... PointF(179.0, 254.0)]},

Contour_2=FaceContour{type=2, points=[PointF(115.0, 322.0),
PointF(121.0, 312.0), PointF(133.0, 304.0), PointF(151.0, 301.0),
PointF(174.0, 300.0)]},
...
Contour_7=FaceContour{type=7, points=[PointF(238.0, 337.0),
PointF(241.0, 334.0), PointF(247.0, 329.0), PointF(256.0, 325.0),
PointF(266.0, 324.0), PointF(276.0, 325.0), PointF(282.0, 328.0),
PointF(285.0, 330.0), PointF(288.0, 332.0), PointF(283.0, 335.0),
PointF(278.0, 338.0), PointF(271.0, 340.0), PointF(261.0, 341.0),
PointF(253.0, 341.0), PointF(245.0, 339.0), PointF(240.0, 338.0)]},
...
Contour_8= ... Contour_15=FaceContour{type=15, points=[PointF(278.0, 410.0)]}}}]

Use the data to get info about the person.

for (face in faces) {
    val bounds = face.boundingBox
    val rotY = face.headEulerAngleY // Head is rotated to the right rotY degrees
    val rotZ = face.headEulerAngleZ // Head is tilted sideways rotZ degrees

    // If landmark detection was enabled (mouth, ears, eyes, cheeks, and
    // nose available):
    val leftEar = face.getLandmark(FaceLandmark.LEFT_EAR)
    leftEar?.let {
        val leftEarPos = leftEar.position
    }

    // If contour detection was enabled:
    val leftEyeContour = face.getContour(FaceContour.LEFT_EYE)?.points
    val upperLipBottomContour = face.getContour(FaceContour.UPPER_LIP_BOTTOM)?.points

    // If classification was enabled:
    if (face.smilingProbability != null) {
        val smileProb = face.smilingProbability
    }
    if (face.rightEyeOpenProbability != null) {
        val rightEyeOpenProb = face.rightEyeOpenProbability
    }

    // If face tracking was enabled:
    if (face.trackingId != null) {
        val id = face.trackingId
    }
}

Upcoming ML Project :: Using face features to predict shopping habits.

AI/ML-powered smart ads.

Fun Results:

It works perfectly with images of real people, never works with stylized art, and is hit or miss with realistic art and AI-generated faces …

MLKit Face Detection

It’s interesting …

  • The 1st and 2nd are always seen as faces.
  • The 3rd is never seen as a face.
  • The 4th is seen as a face about 30% of the time.
Trying to fool MLKit Face Detection

ML Kit Photo to Label & Image to Text

ML Kit

We are using ML Kit features:

Photo (Image) to label —

We just use ML Kit `ImageLabeling`

// To use default options:
val labeler = ImageLabeling.getClient(ImageLabelerOptions.DEFAULT_OPTIONS)

Give it an image and get the label.

try {
    val image = InputImage.fromFilePath(applicationContext, event.fromFile)
    labeler.process(image)
        .addOnSuccessListener { labels ->
            // Task completed successfully
            for (label in labels) {
                val text = label.text
                val confidence = label.confidence
                val index = label.index
                msg += "label: $text, $confidence, $index \n"
            }
            mlDescription = labels.firstOrNull()?.text // guard against an empty label list

            viewModelScope.launch {
                val updatedUiState = _uiState.value.copy(
                    photoPath = event.fromFile,
                    title = mlDescription ?: _uiState.value.title,
                    description = "fix " + (mlDescription
                        ?: _uiState.value.description)
                )
                _uiState.emit(updatedUiState)
            }
            triggerAlert("ML Categories", msg)
        }
        .addOnFailureListener { e ->
            // Task failed with an exception
            // ...
        }
} catch (e: IOException) {
    e.printStackTrace()
}

Photo to text —

Again `setImageAnalysisAnalyzer` …

// Example Code
val textRecognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)

...

cameraController.setImageAnalysisAnalyzer(executor) { imageProxy ->
    imageProxy.image?.let { image ->
        val img = InputImage.fromMediaImage(
            image,
            imageProxy.imageInfo.rotationDegrees
        )

        textRecognizer.process(img).addOnCompleteListener { task ->
            isLoading = false
            text =
                if (!task.isSuccessful) task.exception!!.localizedMessage.toString()
                else task.result.text

            onEvent(NextTaskEvent.GetTextFromImg(text)) // Send to ViewModel
            textFieldValue.value = TextFieldValue(text)

            cameraController.clearImageAnalysisAnalyzer()
            imageProxy.close()
        }
    }
}

The same goes for barcodes … set the `setImageAnalysisAnalyzer`

// Create the BarcodeScanner object
val options = BarcodeScannerOptions.Builder()
    .setBarcodeFormats(Barcode.FORMAT_QR_CODE)
    .build()
val barcodeScanner = BarcodeScanning.getClient(options)

cameraController.setImageAnalysisAnalyzer(
    ContextCompat.getMainExecutor(this),
    MlKitAnalyzer(
        listOf(barcodeScanner),
        COORDINATE_SYSTEM_VIEW_REFERENCED,
        ContextCompat.getMainExecutor(this)
    ) { result: MlKitAnalyzer.Result? ->
        // The value of result?.getValue(barcodeScanner) can be used directly to draw a UI overlay.
    }
)
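
Inside that result callback the detections come back via `getValue()`; a hedged sketch of reading them out (forwarding to the ViewModel is left as a comment):

// Hedged sketch: pull the scanned QR codes back out of the MlKitAnalyzer result.
fun onBarcodeResult(result: MlKitAnalyzer.Result?, barcodeScanner: BarcodeScanner) {
    val barcodes: List<Barcode> = result?.getValue(barcodeScanner).orEmpty()
    barcodes.firstOrNull()?.rawValue?.let { qrText ->
        Log.d("Photodo", "QR code: $qrText")
        // e.g. send qrText to the ViewModel as an event
    }
}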

Gemini AI

Gemini AI Android App

This is literally too easy to use … just like using a chat app, but inside your app.

val generativeModel = GenerativeModel(
    // For text-only input, use the gemini-pro model
    modelName = "gemini-pro",
    // Access your API key as a Build Configuration variable (see "Set up your API key" above)
    apiKey = BuildConfig.apiKey
)

val prompt = "Write a story about a magic backpack."
val response = generativeModel.generateContent(prompt) // suspend function: call from a coroutine
print(response.text)

For our app we use both image-plus-text and text-only prompts.

val generativeModelTxt = GenerativeModel(
    // For text-only input, use the gemini-pro model
    modelName = "gemini-pro",
    // Access your API key as a Build Configuration variable (see "Set up your API key" above)
    apiKey = BuildConfig.GEMINI_API_KEY
)

val generativeModelImg = GenerativeModel(
    // For text-and-images input (multimodal), use the gemini-pro-vision model
    modelName = "gemini-pro-vision",
    // Access your API key as a Build Configuration variable (see "Set up your API key" above)
    apiKey = BuildConfig.GEMINI_API_KEY
)

We ask what this image shows and how much it would cost to fix.

`text(“What is this and how much to fix this?”)`

is MLEvent.GenAiResponseImg -> {
    viewModelScope.launch {
        val inputContent = content {
            image(event.value)
            text("What is this and how much to fix this?")
        }

        // Stream the response and accumulate the chunks as they arrive.
        var response = ""
        generativeModelImg.generateContentStream(inputContent).collect { chunk ->
            print(chunk.text)
            response += chunk.text
        }

        val updatedUiState = _uiState.value.copy(
            aiResponse = response.ifEmpty { "Nothing sent" }
        )
        _uiState.emit(updatedUiState)
    }
}

Give the app a little more info.

“how much to fix Ray Ban Sunglasses”

is MLEvent.GenAiResponseTxt -> {
    viewModelScope.launch {
        val prompt = event.value
        var response = ""
        generativeModelTxt.generateContentStream(prompt).collect { chunk ->
            print(chunk.text)
            response += chunk.text
        }

        val updatedUiState = _uiState.value.copy(
            aiResponse = response.ifEmpty { "Nothing sent" }
        )
        _uiState.emit(updatedUiState)
    }
}
Image/Text (left) — Text (right)

ML gives us lots of powerful features.

  • Convert the photo to text to understand the basic task.
  • Convert the voice memo to text
  • Convert the notes to text
  • Ask Gemini to identify the issue and get a budget
  • Ask Gemini to give more details about our issue.

Putting all this together

Photodo is like having a personal assistant in your pocket, always ready to help you get things done.

Steps

1. Take a picture — AI/ML system identifies the object associated with the task.

2. Record a voice memo — AI/ML system translates the voice to text and processes the text to determine the task.

3. Take a picture of any notes to add to the task — AI/ML system converts the image to text and further processes the task.

The app determines the Yelp category so we can search for local businesses to help with the task; one possible search call is sketched below.

In the future we can add our own list of local businesses in the app.
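
One way to do that search is against Yelp’s Fusion API. This is a hedged Retrofit sketch: the interface name, response models, and how the API key is supplied are assumptions, while the endpoint, query parameters, and Bearer-token auth come from Yelp’s public docs.

import retrofit2.http.GET
import retrofit2.http.Header
import retrofit2.http.Query

// Minimal response models; Yelp returns many more fields than shown here.
data class YelpSearchResponse(val businesses: List<YelpBusiness>)
data class YelpBusiness(val name: String, val rating: Double, val phone: String)

interface YelpService {
    @GET("v3/businesses/search")
    suspend fun searchBusinesses(
        @Header("Authorization") auth: String,   // "Bearer <YELP_API_KEY>"
        @Query("term") term: String,             // e.g. the ML-detected task ("sunglass repair")
        @Query("categories") categories: String, // the Yelp category the app derived
        @Query("latitude") latitude: Double,
        @Query("longitude") longitude: Double,
        @Query("limit") limit: Int
    ): YelpSearchResponse
}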

Output

  • The AI/ML system generates a budget.
  • The AI/ML system locates a local business to help with the task.
  • The AI/ML system will try to group and schedule tasks together.
ML understands what needs to be done and finds a business to help with the task.

Here we give an example of how AI/ML can make a simple ToDo app work like magic. This is just the beginning …

~Ash

Gemini Nano (only Pixel 8 Pro) — Coming soon to your phone?

Say hello to Gemini Nano, Google’s new AI helper living directly on your Pixel 8 Pro (even when offline)!

Developers get a bonus — Android AICore makes integrating Gemini Nano into their apps a breeze, and they can even tailor it to specific tasks.

From the above we can see what Gemini Nano can handle:

  • Text summarization: Condensing content like meeting notes.
  • Contextual smart replies: Crafting responses in messaging apps.
  • Proofreading and grammar correction: Improving written communication.
