A decade ago this didn't require LLMs and cutting edge hardware and a trillion dollars of GPUs. This was a Facebook feature in like 2012.
>What I really want is my phone to transcribe all of my phone calls to a Notes document
This has been doable for decades. Why haven't you done it? My Pixel phones did this with voicemail before LLMs.
Windows Vista shipped with full-featured dictation, and it works better than you would expect: all local, all classical algorithms, all cheap to evaluate. If that wasn't accurate enough, Dragon's speech-to-text tools were the gold standard for most of modern computing history and greatly surpassed the accuracy of the built-in system.
BTW, on any Windows machine right now you can access that built-in voice recognition, and with a "constrained vocabulary" (say, if you only want a few specific voice commands) it gets near-perfect accuracy consistently. You have to search for old documentation now because Microsoft wants to hide that you don't need an internet connection or an Azure account and a monthly bill to ship accurate voice recognition with your app. It's trivial to use from C++, C#, and anything else that lets you invoke native code, and the workflow is easy to understand. I built an app on it instead of buying one of those $10 "Voice control your game" apps to add voice control to ARMA, and implementing the voice recognition was easier than copying and pasting the native Win32 API invocations to inject keystrokes. I don't even write C# in general.
https://learn.microsoft.com/en-us/previous-versions/windows/...
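To illustrate why a constrained vocabulary helps so much, here is a toy sketch in Python. It is not the Windows speech API and it works on text rather than audio: it just assumes some recognizer handed you a slightly wrong transcript and snaps it to the closest entry in a closed command list using the standard library's `difflib`. The command list, the `snap_to_command` name, and the 0.6 cutoff are all made up for illustration.

```python
import difflib

# Hypothetical command list; a real app would use its own phrases.
COMMANDS = ["reload", "open map", "hold fire", "regroup", "cease fire"]


def snap_to_command(heard, commands=COMMANDS, cutoff=0.6):
    """Return the allowed command closest to `heard`, or None if none is close.

    `cutoff` is the minimum difflib similarity ratio (0..1) to accept a match.
    """
    normalized = heard.lower().strip()
    matches = difflib.get_close_matches(normalized, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Slightly garbled transcripts still land on the intended command.
    for heard in ["re load", "open nap", "cease fir"]:
        print(repr(heard), "->", snap_to_command(heard))
```

With only a handful of legal outputs, a sloppy guess has nowhere to go except the nearest command or a rejection, which is roughly the intuition behind the accuracy jump. A real recognizer applies the constraint inside the decoder rather than after the fact, so treat this as an analogy only.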
There's tons of documentation about "Grammar" and configuration, but the default configuration, IIRC, is to just turn speech input into text, and it does so with at least 85% accuracy even without the user training the recognizer to their voice. If you build context-specific grammars or a hierarchical grammar to support a real UX, rather than just hoping some code knows how to interpret raw speech, you will get dramatically better recognition performance.
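As a rough picture of what a hierarchical grammar buys you, here is another toy Python sketch. Again, this is not the Windows API and it operates on already-transcribed text: each state only allows a few phrases, so every utterance is matched against a tiny list instead of the whole language. The grammar contents, the `parse_utterances` name, and the cutoff are hypothetical.

```python
import difflib

# Hypothetical two-level grammar. Each state maps an allowed phrase to the
# next state, or to None when the phrase completes a command.
GRAMMAR = {
    "root": {"squad": "squad", "vehicle": "vehicle"},
    "squad": {"regroup": None, "hold fire": None},
    "vehicle": {"get in": None, "get out": None},
}


def parse_utterances(utterances, grammar=GRAMMAR, cutoff=0.6):
    """Walk the grammar one utterance at a time.

    Returns the list of matched phrases for a complete command, or None if
    an utterance matches nothing in the current state or the input ends
    before a command is complete.
    """
    state = "root"
    path = []
    for heard in utterances:
        allowed = list(grammar[state])
        match = difflib.get_close_matches(
            heard.lower().strip(), allowed, n=1, cutoff=cutoff
        )
        if not match:
            return None  # nothing in this context is close enough
        phrase = match[0]
        path.append(phrase)
        next_state = grammar[state][phrase]
        if next_state is None:
            return path  # reached a complete command
        state = next_state
    return None  # ran out of input mid-command


if __name__ == "__main__":
    print(parse_utterances(["squat", "hold fir"]))
    print(parse_utterances(["vehicle", "get oot"]))
```

The point of the design is that "hold fire" is only ever compared against the two squad phrases, never against vehicle commands or free dictation, so a misheard word has far fewer wrong answers to collide with.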
This is IMO a frequent pattern. Time and time again the people who keep saying "I want LLMs to do X" don't seem to be aware that "X" was a robust and mature area of research decades ago! They don't seem to be aware that you could already do X and even buy ready to go software for that purpose! Often enough the LLM version is an outright regression in functionality, as things that were doable with a single microchip in 1960 now require an internet connection.
>Since it isn't recording an audio conversation,
So to be clear, you want this functionality explicitly to bypass the law? Federally and in 39ish states, you only need your own consent anyway.