I'm using the smaller vision models (Qwen3.5-4B currently) with Frigate, a FOSS self-hosted "AI" NVR. It's good enough at analyzing images to figure out mostly what's happening, and doesn't require the big knowledge base that bigger models have.
Also use a bigger model for summarizing or translating text, which I don't consume in realtime, so doesn't need to be fast. Would be a thing I could use OpenAI's batch APIs for if I did need something higher quality.