Andrea Silverman
AI Perception
Prototype
AI systems don’t just generate responses.
They continuously perceive, interpret, and act on the world around them.
This project explores how to turn that invisible process into an interactive, spatial interface.
Multimodal AR Engine: Making AI Perception Interactive
Role: AI Product Designer / Multimodal Systems Prototyper
Focus: Perception → Reasoning → Interaction systems
Problem
AI systems already perceive rich signals from images, audio, and context—but that perception is hidden from users.
Today’s interfaces collapse all of that complexity into a single output.
As a result:
- Users cannot see what the AI is actually detecting in a scene
- Users cannot understand how the system forms its decisions
- Users cannot interact with AI at the level of perception
This creates a black-box interaction model, where trust is low and control is limited.
Insight
AI is not just a response generator.
It is a continuous perception system.
Trust and usability improve when users can see:
- what the system perceives
- how it structures context
- where reasoning originates
Instead of hiding perception, I designed a system where:
AI perception becomes interactive UI
The interface shifts from:
prompt → response
to:
perception → reasoning → interaction
This prototype explores how an interface might visualize that internal reasoning loop.
Solution
I built a multimodal AR engine that connects AI perception directly to user interaction.
The system:
- analyzes visual input using AI
- converts perception into structured spatial data (bounding boxes + labels)
- renders that perception as an interactive overlay
- connects visual context to language-based reasoning
- enables users to interact with the AI through the environment itself
This creates a continuous interaction loop, where perception and reasoning are no longer hidden—they are directly manipulable.
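To make that concrete, here is a minimal sketch of the structured spatial data that flows through the loop. The field names are illustrative, not the prototype's exact schema:

```ts
// Illustrative shape of the structured spatial data produced by perception.
// Field names are hypothetical; the actual prototype's schema may differ.
interface BoundingBox {
  x: number;      // normalized [0, 1] left edge
  y: number;      // normalized [0, 1] top edge
  width: number;  // normalized width
  height: number; // normalized height
}

interface DetectedObject {
  id: string;         // stable id so UI anchors can track the same object
  label: string;      // e.g. "laptop", "coffee cup"
  confidence: number; // detector confidence, 0 to 1
  box: BoundingBox;   // where the object sits in the frame
}

// One frame of perception: everything the system currently "sees".
interface PerceptionFrame {
  timestamp: number;
  objects: DetectedObject[];
}
```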
Key Features
Object Detection → Spatial Anchors
AI identifies elements in a scene and maps them into coordinate space, allowing UI to attach directly to real-world or image-based objects.
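A rough sketch of that mapping, assuming normalized detector boxes and the PerceptionFrame type above (the anchor shape itself is illustrative):

```ts
// Convert normalized bounding boxes into pixel-space anchor points,
// so UI elements can attach to the center of each detected object.
interface SpatialAnchor {
  objectId: string;
  label: string;
  x: number; // pixel x of the box center
  y: number; // pixel y of the box center
}

function toSpatialAnchors(
  frame: PerceptionFrame,
  viewportWidth: number,
  viewportHeight: number
): SpatialAnchor[] {
  return frame.objects.map((obj) => ({
    objectId: obj.id,
    label: obj.label,
    x: (obj.box.x + obj.box.width / 2) * viewportWidth,
    y: (obj.box.y + obj.box.height / 2) * viewportHeight,
  }));
}
```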
Structured Reasoning Layer
Instead of returning raw text, the system produces structured outputs that can drive interface behavior and interaction.
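For instance, the reasoning step could return a typed result the interface can act on directly. This shape is a hypothetical example, not the system's actual output format:

```ts
// A structured reasoning result the interface can render without parsing text.
type UIAction =
  | { kind: "highlight"; objectId: string; reason: string }
  | { kind: "annotate"; objectId: string; text: string }
  | { kind: "suggest"; prompt: string };

interface ReasoningResult {
  summary: string;          // short explanation of what was inferred
  focusObjectIds: string[]; // which detected objects ground the reasoning
  actions: UIAction[];      // concrete things the interface layer should do
}
```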
Interactive Spatial UI
Users interact with detected objects directly, transforming AI from a passive responder into an active interface system.
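One plausible way to wire that up in Three.js is to raycast pointer events against the anchor markers. A sketch, assuming each marker mesh carries the detected object's id in its userData:

```ts
import * as THREE from "three";

const raycaster = new THREE.Raycaster();
const pointer = new THREE.Vector2();

function onPointerDown(
  event: PointerEvent,
  camera: THREE.Camera,
  markers: THREE.Object3D[],
  onSelect: (objectId: string) => void
) {
  // Convert the click into normalized device coordinates (-1..1)
  pointer.x = (event.clientX / window.innerWidth) * 2 - 1;
  pointer.y = -(event.clientY / window.innerHeight) * 2 + 1;

  raycaster.setFromCamera(pointer, camera);
  const hits = raycaster.intersectObjects(markers, true);
  if (hits.length > 0) {
    // Hand the selected detection back to the reasoning layer
    onSelect(hits[0].object.userData.objectId as string);
  }
}
```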
Camera-Based Perception Loop
The system supports real-time visual input, enabling continuous updates to perception and context.
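In a browser prototype, that loop can be as simple as streaming the camera into a video element and sampling frames on an interval. A minimal sketch; analyzeFrame is a placeholder for whatever detection call the system makes:

```ts
// Stream the camera and hand frames to the perception layer on a timer.
async function startPerceptionLoop(
  video: HTMLVideoElement,
  analyzeFrame: (frame: ImageData) => Promise<void>,
  intervalMs = 500
) {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  video.srcObject = stream;
  await video.play();

  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d")!;

  setInterval(async () => {
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);
    // Latest frame goes to detection, which feeds the reasoning layer
    await analyzeFrame(ctx.getImageData(0, 0, canvas.width, canvas.height));
  }, intervalMs);
}
```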
3D AR Rendering Layer (Three.js)
A lightweight AR layer introduces depth, motion, and spatial presence—bridging 2D perception with 3D interaction.
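A stripped-down version of such a layer: a transparent Three.js renderer stacked over the camera feed, with one marker per spatial anchor. The orthographic camera keeps the screen-space mapping simple; the details are illustrative rather than the prototype's exact setup:

```ts
import * as THREE from "three";

const renderer = new THREE.WebGLRenderer({ alpha: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

const scene = new THREE.Scene();
const camera = new THREE.OrthographicCamera(-1, 1, 1, -1, 0.1, 10);
camera.position.z = 1;

// Draw one ring marker per anchor from the perception step.
function renderAnchors(anchors: SpatialAnchor[]) {
  scene.clear();
  for (const anchor of anchors) {
    const marker = new THREE.Mesh(
      new THREE.RingGeometry(0.03, 0.04, 32),
      new THREE.MeshBasicMaterial({ color: 0x4ade80 })
    );
    // Map pixel coordinates into the camera's -1..1 space
    marker.position.set(
      (anchor.x / window.innerWidth) * 2 - 1,
      -((anchor.y / window.innerHeight) * 2 - 1),
      0
    );
    marker.userData.objectId = anchor.objectId;
    scene.add(marker);
  }
  renderer.render(scene, camera);
}
```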
Architecture
User Input → Perception → Structured Representation → Reasoning → UI Overlay → Interaction Loop
This system reframes AI as a real-time cognitive pipeline:
- Perception Layer: captures visual input (image / camera)
- Structuring Layer: converts perception into usable data
- Reasoning Layer: interprets context and determines actions
- Interface Layer: renders perception as interactive UI
Rather than producing a single response, the system continuously updates and exposes its internal state.
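Put together, one tick of the pipeline might look like the sketch below; detect, reason, and applyActions are placeholders for the layers described above:

```ts
// One tick of the loop, wiring the four layers together.
// detect(), reason(), and applyActions() stand in for the perception,
// reasoning, and interface layers; toSpatialAnchors() and renderAnchors()
// are the helpers sketched earlier.
async function pipelineTick(frame: ImageData) {
  // Perception layer: raw pixels in, detections out
  const perception: PerceptionFrame = await detect(frame);

  // Structuring layer: detections become anchors the UI can use
  const anchors = toSpatialAnchors(
    perception,
    window.innerWidth,
    window.innerHeight
  );

  // Reasoning layer: interpret the structured scene
  const reasoning: ReasoningResult = await reason(perception);

  // Interface layer: expose both perception and reasoning as interactive UI
  renderAnchors(anchors);
  applyActions(reasoning.actions);
}
```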
Demo
Why This Matters
Most AI products stop at reasoning.
They produce outputs, but hide the process that led to them.
This project explores the missing layer:
Perception → Reasoning → Interaction
Making this loop visible unlocks:
- more intuitive human-AI interaction
- higher trust through transparency
- new interface paradigms beyond chat
This becomes critical for:
- AR glasses and spatial computing systems
- real-time AI assistants
- multimodal copilots
- embodied AI and robotics
In these contexts, AI must operate as a continuous system, not a prompt-based tool.
Interaction Model
This prototype introduces a new interaction pattern:
- Users interact with what the AI sees, not just what it says
- AI exposes its internal state (listening, reasoning, responding), as sketched below
- The system operates as a continuous loop rather than discrete steps
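A minimal sketch of how that exposed state could be modeled; the state names come from the list above, and everything else is illustrative:

```ts
// The interface surfaces where the system currently is in its loop.
type AgentState = "idle" | "listening" | "reasoning" | "responding";

// A tiny observable wrapper so overlay components can react to state changes.
function createAgentState(onChange: (state: AgentState) => void) {
  let current: AgentState = "idle";
  return {
    get: () => current,
    set: (next: AgentState) => {
      current = next;
      onChange(next); // e.g. update a status indicator in the AR overlay
    },
  };
}
```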
This shifts AI from:
tool
to:
interface layer between perception and action