Architecting an Autonomous Personal Knowledge Assistant: Integrating iOS Shortcuts, Whisper.cpp, and Claude for Local-First PKM
The paradigm of Personal Knowledge Management (PKM) is undergoing a fundamental shift. For years, the industry has been dominated by specialized note-taking applications such as Notion, Obsidian, and Evernote, many with proprietary databases and complex linking structures. However, the emergence of Large Language Models (LLMs) with robust file-system access and agentic capabilities has rendered the "app-centric" approach obsolete. We are moving toward a "Local-First" architecture: a Personal Knowledge Assistant (PKA) where the primary interface is a simple, structured local directory, and the intelligence resides in an orchestration layer capable of processing raw inputs into structured knowledge.
The Problem: The Friction of Capture
The greatest bottleneck in any PKM system is "capture friction." If a user must unlock their phone, navigate to a specific app, create a new note, and manually title it, the window of inspiration is often lost. Third-party capture tools also tend to introduce a "silo" problem, where data is trapped in a proprietary cloud environment and requires manual export or complex API integrations to reach the user's primary knowledge base.
To solve this, we need a system that treats the mobile device not as a destination for notes, but as a high-frequency sensor for raw data—specifically audio—that feeds directly into a local-first ecosystem.
The Architecture: A Three-Tiered Pipeline
The architecture I have implemented relies on three distinct layers: the Capture Layer (iOS/Mobile), the Synchronization Layer (Cloud-to-Local Bridge), and the Processing Layer (The AI Agent).
1. The Capture Layer: iOS Shortcuts and Low-Latency Input
The goal of the Capture Layer is to minimize the time between "thought" and "storage." Using the iOS Shortcuts framework, we can bypass the standard UI overhead.
The implementation involves a specific Shortcut configuration:
- Action 1: Record Audio. The Record Audio action is configured with the Immediately parameter for the Stop Recording trigger. This ensures that the moment the user stops interacting with the shortcut, the buffer is flushed.
- Action 2: Save File. Instead of saving to the local iOS sandbox, the Save File action is directed to a specific path within the Dropbox or iCloud Drive directory.
By leveraging the Action Button (on iPhone 15 Pro/16) or the Control Center widgets, the user can trigger the Record Audio sequence without ever unlocking the device. This transforms the smartphone into a seamless audio-input peripheral for the local machine.
2. The Synchronization Layer: The Dropbox/iCloud Bridge
The synchronization layer acts as the transport protocol. By using Dropbox as the intermediary, we achieve a high-availability, cross-platform sync. The local MacBook monitors the .../Dropbox/PKA/Inbox/Audio_Captures directory. As soon as the iOS Shortcut completes the Save File operation, the file is pushed to the cloud and pulled to the local machine via the Dropbox daemon. This creates a "hot" folder that is constantly being updated with new, raw, unstructured data.
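One practical wrinkle with a "hot" folder is that a file can appear before the sync client has finished writing it. The sketch below shows one way to list captures that look ready to process, using file age as a rough proxy for "fully synced." The function name, the suffix set, and the five-second threshold are illustrative assumptions, not part of the original setup.

```python
import time
from pathlib import Path
from typing import List, Optional

# Assumed set of capture formats; adjust to whatever the Shortcut actually saves.
AUDIO_SUFFIXES = {".m4a", ".wav", ".mp3"}


def find_ready_captures(
    inbox: Path,
    min_age_seconds: float = 5.0,
    now: Optional[float] = None,
) -> List[Path]:
    """Return audio files in `inbox` that have not been modified recently.

    A file that was last modified at least `min_age_seconds` ago is treated
    as fully synced. This is a heuristic, not a guarantee from the sync client.
    """
    now = time.time() if now is None else now
    ready = []
    for path in sorted(inbox.iterdir()):
        # Skip directories and anything that is not a known audio format.
        if not path.is_file() or path.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        if now - path.stat().st_mtime >= min_age_seconds:
            ready.append(path)
    return ready
```

Calling `find_ready_captures(Path(".../Dropbox/PKA/Inbox/Audio_Captures"))` on a schedule (with the elided prefix filled in for your machine) is a simple alternative to a file-system watcher.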
3. The Processing Layer: The Agentic Orchestration
This is where the "Assistant" aspect of the PKA emerges. The core of the processing engine is not a single model, but an orchestrated workflow involving Claude, Whisper.cpp, and FFmpeg.
The Role of Claude as the Orchestrator
The processing is executed within a terminal environment or a VS Code integrated terminal. Using Claude (via the Claude Code interface or similar agentic CLI tools), I can issue high-level instructions to the local file system.
Instead of manual transcription, the instruction is: "Process the audio captures in the team inbox folder."
The Transcription Engine: Whisper.cpp
For the heavy lifting of Speech-to-Text (STT), the system utilizes Whisper.cpp, a high-performance C++ port of OpenAI's Whisper model. Using whisper.cpp is critical for several reasons:
- Latency: It provides near-instantaneous transcription on local hardware (Apple Silicon).
- Privacy: The transcription happens entirely offline; no audio data leaves the local environment during the STT phase.
- Efficiency: It allows for the processing of large batches of audio files without the overhead of API calls or cloud-based transcription costs.
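As a rough illustration of the local transcription step, the helper below assembles a whisper.cpp command line. It assumes the binary is named whisper-cli (recent builds use this name; older builds call it main) and that a ggml model file has already been downloaded. The function name and all paths are hypothetical.

```python
from pathlib import Path
from typing import List


def whisper_cmd(
    wav: Path,
    model: Path,
    out_base: Path,
    binary: str = "whisper-cli",  # assumption: older whisper.cpp builds name this "main"
) -> List[str]:
    """Build a whisper.cpp invocation that writes a plain-text transcript.

    -m   path to the ggml model file
    -f   input audio file
    -otxt  write the transcript as a .txt file
    -of  output path without extension (whisper.cpp appends ".txt")
    """
    return [binary, "-m", str(model), "-f", str(wav), "-otxt", "-of", str(out_base)]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.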
The Normalization Layer: FFmpeg
Audio captured via iOS may arrive in various containers or sample rates. To ensure the transcription engine receives a standardized input, FFmpeg is used within the automation script to normalize the audio (e.g., converting .m4a to .wav or .mp3 with a consistent mono-channel configuration). For whisper.cpp specifically, the commonly documented input is a 16 kHz, 16-bit PCM WAV file.
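A minimal sketch of the normalization step, assuming the target is the 16 kHz, mono, 16-bit PCM WAV format that whisper.cpp's documentation recommends. The function name and directory layout are illustrative. The FFmpeg flags are standard: -ar sets the sample rate, -ac the channel count, and -c:a the audio codec.

```python
from pathlib import Path
from typing import List


def ffmpeg_normalize_cmd(src: Path, dst_dir: Path) -> List[str]:
    """Build an ffmpeg command that converts `src` to 16 kHz mono 16-bit WAV.

    The output keeps the source file's stem and is placed in `dst_dir`.
    """
    dst = dst_dir / (src.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                # overwrite the output file if it already exists
        "-i", str(src),      # input file in any container ffmpeg can read
        "-ar", "16000",      # resample to 16 kHz
        "-ac", "1",          # downmix to mono
        "-c:a", "pcm_s16le", # 16-bit little-endian PCM
        str(dst),
    ]
```

As with the transcription step, the list can be handed to `subprocess.run(cmd, check=True)` so that a failed conversion raises an error instead of silently producing no output.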
The Automated Workflow Loop
The complete automated loop functions as follows:
- Detection: The agent (Claude) scans the Inbox/Audio_Captures directory for new files.
- Normalization: An ffmpeg command is triggered to convert the incoming file to a compatible format for whisper.cpp.
- Transcription: whisper.cpp processes the normalized audio, generating a raw text transcript.
- Structuring: Claude takes the raw transcript and performs several high-level tasks:
  - Summarization: Creating a concise summary of the audio content.
  - Metadata Extraction: Identifying key entities, dates, or action items.
  - Markdown Generation: Creating a new .md file in the primary Knowledge Base, containing the transcript, the summary, and links to related existing notes.
- File Cleanup: Moving the original audio file from the Inbox to an Archive folder to prevent re-processing.
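The final two steps of the loop (Markdown generation and file cleanup) can be sketched as plain file operations. In the sketch below, the summary and related-note names are passed in as strings, standing in for whatever Claude produces. The note layout, the [[wiki-link]] syntax, and both function names are assumptions made for illustration, not the author's actual template.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import List


def render_note(
    title: str, summary: str, transcript: str, related: List[str], created: date
) -> str:
    """Render a Markdown note with a summary, related links, and the transcript."""
    # Assumed link style: Obsidian-like [[wiki links]]; swap for your own convention.
    links = "\n".join(f"- [[{name}]]" for name in related) or "- (none)"
    return (
        f"# {title}\n\n"
        f"Created: {created.isoformat()}\n\n"
        f"## Summary\n\n{summary.strip()}\n\n"
        f"## Related notes\n\n{links}\n\n"
        f"## Transcript\n\n{transcript.strip()}\n"
    )


def write_note_and_archive(
    audio: Path, note_text: str, notes_dir: Path, archive_dir: Path
) -> Path:
    """Write the note, then move the source audio out of the inbox.

    The note is written first so that a failure leaves the audio in the
    inbox, where the next run will pick it up again.
    """
    notes_dir.mkdir(parents=True, exist_ok=True)
    archive_dir.mkdir(parents=True, exist_ok=True)
    note_path = notes_dir / f"{audio.stem}.md"
    note_path.write_text(note_text, encoding="utf-8")
    shutil.move(str(audio), str(archive_dir / audio.name))
    return note_path
```

Ordering the write before the move is a deliberate choice: re-processing a capture is a recoverable annoyance, while archiving audio that never became a note would silently lose it.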
Conclusion: The Future of Agentic PKM
This architecture moves us away from "managing notes" and toward "managing data streams." By treating audio captures as raw telemetry that is processed by an agentic pipeline, we eliminate the cognitive load of manual organization. The system becomes self-organizing. As LLMs continue to improve their ability to interact with local file systems and execute shell commands, the boundary between "input" and "knowledge" will continue to dissolve, leaving us with a truly autonomous Personal Knowledge Assistant.