The detection prepass plus text reasoning pipeline is effectively a perception-to-symbol translation layer, and that is where most of the brittleness will hide. Once you collapse a continuous 3D scene into discrete labels, you lose uncertainty, relative geometry, and temporal consistency unless you explicitly model them. The LLM then reasons over a clean but lossy world model, so action quality is capped by what the detector chose to surface.
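To make the lossiness concrete, here is a minimal sketch (all names and numbers hypothetical, not from your pipeline) of the difference between a bare label snapshot and a detection record that keeps confidence and relative geometry visible to the reasoning model:

```python
from dataclasses import dataclass

# What typically reaches the LLM after the collapse: labels only.
label_snapshot = ["person", "doorway", "table"]

# A richer record that preserves what the detector actually knew.
@dataclass
class Detection:
    label: str          # class name from the detector
    confidence: float   # detector score, not a calibrated probability
    bearing_deg: float  # horizontal angle relative to the camera's forward axis
    range_m: float      # estimated distance; noisy for monocular depth

detections = [
    Detection("person", 0.91, bearing_deg=-12.0, range_m=3.4),
    Detection("doorway", 0.55, bearing_deg=20.0, range_m=6.1),
]

# Serialising the richer record keeps uncertainty and geometry in the prompt
# instead of discarding them at the perception-to-symbol boundary.
def to_prompt(dets: list[Detection]) -> str:
    return "\n".join(
        f"{d.label}: conf={d.confidence:.2f}, "
        f"bearing={d.bearing_deg:+.0f}deg, range={d.range_m:.1f}m"
        for d in dets
    )
```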
The failure mode is not just missed objects; it is state aliasing. Two physically different scenes can map to the same label set, especially under occlusion, depth ambiguity, or near-boundary conditions. In a control task like drone navigation, that can produce confident but wrong actions, because the planner has no access to the underlying geometry or sensor noise. Errors compound over time, since each step re-anchors on an already simplified state.
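A toy illustration of the aliasing point (purely invented scenes and numbers): two scenes with very different geometry reduce to the same label set once confidence and position are dropped.

```python
# Two physically different scenes: in the first the obstacle is far and
# off to the side, in the second it is close and dead ahead.
scene_a = [("obstacle", 0.88, {"bearing_deg": 35.0, "range_m": 9.0}),
           ("landing_pad", 0.74, {"bearing_deg": -5.0, "range_m": 12.0})]
scene_b = [("obstacle", 0.52, {"bearing_deg": 0.0, "range_m": 1.2}),
           ("landing_pad", 0.74, {"bearing_deg": -5.0, "range_m": 12.0})]

def to_label_set(scene):
    # The collapse that causes aliasing: geometry and confidence vanish.
    return {label for label, _conf, _geom in scene}

assert to_label_set(scene_a) == to_label_set(scene_b)
# The planner sees identical state, yet "fly forward" is safe in one
# scene and a collision in the other.
```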
Are you carrying forward any notion of uncertainty or temporal tracking from the vision stage, or is each step a stateless label snapshot fed to the reasoning model?
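If the answer is currently "stateless snapshot", even a very light tracking layer between the detector and the prompt helps. A sketch of the kind of thing I mean (class names, label-only matching, and the smoothing constant are all made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Track:
    label: str
    confidence: float   # smoothed detector confidence
    range_m: float      # smoothed range estimate
    misses: int = 0     # consecutive frames without a matching detection

class LabelTracker:
    """Keeps per-label state across frames instead of re-anchoring from scratch."""

    def __init__(self, alpha: float = 0.5, max_misses: int = 3):
        self.alpha = alpha            # EMA weight for new observations
        self.max_misses = max_misses  # drop a track after this many misses
        self.tracks: dict[str, Track] = {}

    def update(self, detections: list[tuple[str, float, float]]) -> list[Track]:
        seen = set()
        for label, conf, range_m in detections:
            seen.add(label)
            t = self.tracks.get(label)
            if t is None:
                self.tracks[label] = Track(label, conf, range_m)
            else:
                # Blend the new measurement into the existing track.
                t.confidence = self.alpha * conf + (1 - self.alpha) * t.confidence
                t.range_m = self.alpha * range_m + (1 - self.alpha) * t.range_m
                t.misses = 0
        # Age out tracks that stopped being detected.
        for label, t in list(self.tracks.items()):
            if label not in seen:
                t.misses += 1
                if t.misses > self.max_misses:
                    del self.tracks[label]
        return list(self.tracks.values())
```

Serialising the surviving tracks, miss counts included, into each prompt would at least let the reasoning model see which parts of its world model are fresh and which are stale guesses.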