We use a two-layer approach.
The raw sync layer (Gmail, calendar, transcripts, etc.) is idempotent and file-based. Each thread, event, or transcript is stored as its own Markdown file keyed by the source ID, and we track sync state to avoid re-ingesting the same item. That layer is append-only, and no entity-level deduplication happens there.
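Roughly this shape, sketched in Python (the paths and the state-file format here are illustrative, not what we actually ship):

```python
import json
from pathlib import Path

RAW_DIR = Path("raw")                  # one Markdown file per thread/event/transcript
STATE_FILE = Path("sync_state.json")   # hypothetical: tracks which source IDs we've seen

def sync_item(source_id: str, markdown_body: str) -> None:
    """Write an item to its own file, keyed by source ID; skip if already ingested."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    if source_id in state:
        return  # already synced; re-running the sync is a no-op
    RAW_DIR.mkdir(exist_ok=True)
    (RAW_DIR / f"{source_id}.md").write_text(markdown_body)
    state[source_id] = True
    STATE_FILE.write_text(json.dumps(state))
```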
Entity consolidation happens in a separate graph-building step. An LLM processes batches of those raw files along with an index of existing entities (people, orgs, projects and their aliases). Instead of relying on string matching, the model decides whether a mention like “Sarah” maps to an existing “Sarah Chen” node or represents a new entity, and then either updates the existing note or creates a new one.
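As pseudocode, the consolidation step looks something like the sketch below. The prompt wording, the `entities/` layout, and the decision schema are made up to show the shape; `call_llm` stands in for whatever model client you use:

```python
import json
from pathlib import Path

ENTITIES_DIR = Path("entities")  # hypothetical layout: one Markdown note per entity

def consolidate(raw_files: list[str], entity_index: dict[str, list[str]], call_llm):
    """Resolve mentions in a batch of raw files against the existing entity index.

    entity_index maps canonical names to known aliases, e.g.
    {"Sarah Chen": ["Sarah", "schen@example.com"]}. call_llm is any callable
    that sends the prompt to a model and returns its text output.
    """
    prompt = (
        "Existing entities and aliases:\n" + json.dumps(entity_index, indent=2)
        + "\n\nNew raw notes:\n" + "\n---\n".join(raw_files)
        + "\n\nFor each person/org/project mentioned, return a JSON list of objects "
          'like {"mention": "Sarah", "action": "update", "target": "Sarah Chen", '
          '"summary": "..."}; use action "create" with target null for new entities.'
    )
    decisions = json.loads(call_llm(prompt))
    ENTITIES_DIR.mkdir(exist_ok=True)
    for d in decisions:
        name = d["target"] if d["action"] == "update" else d["mention"]
        # append mode both updates an existing note and creates a missing one
        with (ENTITIES_DIR / f"{name}.md").open("a") as f:
            f.write(d["summary"] + "\n")
```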
> the model decides whether a mention like “Sarah” maps to an existing “Sarah Chen” node or represents a new entity, and then either updates the existing note or creates a new one.
Thanks! How much context does the model get for the consolidation step? Just the immediate file? Related files? The existing knowledge graph? If it gets the graph, does consolidation need to be multi-pass?