We use a two-layer approach.
The raw sync layer (Gmail, calendar, transcripts, etc.) is idempotent and file-based. Each thread, event, or transcript is stored as its own Markdown file keyed by the source ID, and we track sync state to avoid re-ingesting the same item. That layer is append-only, and no entity-level deduplication happens there.
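Roughly this shape, sketched in Python (the paths and the state-file format here are illustrative, not what we actually ship):

```python
import json
from pathlib import Path

RAW_DIR = Path("raw")                  # one Markdown file per thread/event/transcript
STATE_FILE = Path("sync_state.json")   # hypothetical: tracks which source IDs we've seen

def sync_item(source_id: str, markdown_body: str) -> None:
    """Write an item to its own file, keyed by source ID; skip if already ingested."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    if source_id in state:
        return  # already synced; re-running the sync is a no-op
    RAW_DIR.mkdir(exist_ok=True)
    (RAW_DIR / f"{source_id}.md").write_text(markdown_body)
    state[source_id] = True
    STATE_FILE.write_text(json.dumps(state))
```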
Entity consolidation happens in a separate graph-building step. An LLM processes batches of those raw files along with an index of existing entities (people, orgs, projects and their aliases). Instead of relying on string matching, the model decides whether a mention like “Sarah” maps to an existing “Sarah Chen” node or represents a new entity, and then either updates the existing note or creates a new one.
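As pseudocode, the consolidation step looks something like the sketch below. The prompt wording, the `entities/` layout, and the decision schema are made up to show the shape; `call_llm` stands in for whatever model client you use:

```python
import json
from pathlib import Path

ENTITIES_DIR = Path("entities")  # hypothetical layout: one Markdown note per entity

def consolidate(raw_files: list[str], entity_index: dict[str, list[str]], call_llm):
    """Resolve mentions in a batch of raw files against the existing entity index.

    entity_index maps canonical names to known aliases, e.g.
    {"Sarah Chen": ["Sarah", "schen@example.com"]}. call_llm is any callable
    that sends the prompt to a model and returns its text output.
    """
    prompt = (
        "Existing entities and aliases:\n" + json.dumps(entity_index, indent=2)
        + "\n\nNew raw notes:\n" + "\n---\n".join(raw_files)
        + "\n\nFor each person/org/project mentioned, return a JSON list of objects "
          'like {"mention": "Sarah", "action": "update", "target": "Sarah Chen", '
          '"summary": "..."}; use action "create" with target null for new entities.'
    )
    decisions = json.loads(call_llm(prompt))
    ENTITIES_DIR.mkdir(exist_ok=True)
    for d in decisions:
        name = d["target"] if d["action"] == "update" else d["mention"]
        # append mode both updates an existing note and creates a missing one
        with (ENTITIES_DIR / f"{name}.md").open("a") as f:
            f.write(d["summary"] + "\n")
```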
> the model decides whether a mention like “Sarah” maps to an existing “Sarah Chen” node or represents a new entity, and then either updates the existing note or creates a new one.
Thanks! How much context does the model get for the consolidation step? Just the immediate file? Related files? The existing knowledge graph? If it gets the graph, does consolidation need to be multi-pass?