I'm going by the features Apple showed in its iPhone 16 ad. Take out your phone, point it at a restaurant, and ask it to a) analyze the video/image and b) understand what's going on.
Or pull out the phone and ask "Who's the person I met on X day ...".
Sure, many local models can already do all of that today, since they support vision and tool calling.