Part 2: A deep dive into the implementation of cross-document semantic relation extraction, multi-hop graph traversal queries, and multi-vector document representation (entities, relations, document summaries, chunks) that allows queries across different vector spaces.
Introduction
In Part 1 of this series, we explored the foundational architecture of the GraphRAG Document Repository, covering document ingestion, entity extraction, and knowledge graph construction. We established a multi-pass pipeline that processes PDF and web documents, extracts entities using AI-powered analysis, and builds a comprehensive knowledge graph in Neo4j.
Building upon that foundation, Part 2 introduces advanced cross-document intelligence capabilities that transform isolated document knowledge into an interconnected information network. This installment covers three critical enhancements:
- Cross-Document Relationship Aggregation & Multi-Hop Traversal – Discovering how entities and concepts relate across multiple documents with confidence-weighted relationship tracking and intelligent graph exploration
- Subgraph Extraction & LOCAL Query Context – Generating entity-centric knowledge subgraphs with AI-powered summarization and comprehensive provenance tracking
- Multi-Vector Representation (MVR) – Moving beyond single-vector chunk embeddings to a sophisticated multi-space retrieval system supporting entity, relationship, summary, and chunk vectors
These capabilities enable our GraphRAG system to answer complex queries that require synthesizing information from multiple sources, understanding semantic relationships between entities, and providing contextually rich responses with transparent source attribution.
System Architecture Update
The GraphRAG system architecture described in Part 1 has evolved to support cross-document intelligence through enhanced and new services in the “Graph Operations & Search Services” functional area:

New Service Capabilities
The Graph Operations & Search Services layer now includes four specialized services working in concert:
Graph Service – We have updated the existing Graph Service to support advanced cross-document relationship operations. Key capabilities now include counting relationship occurrences throughout the entire repository and calculating aggregate confidence scores using average and maximum values. Additionally, the service tracks document support using first-seen metadata and allows for filtering based on relationship type and confidence thresholds.
Graph Traversal Service – we have introduced a new dedicated Graph Traversal Service designed for complex knowledge graph exploration. Key features include multi-hop path finding with cross-document awareness, alongside document diversity scoring for assessing path quality. The service supports configurable traversal strategies—such as breadth-first, shortest-path, and confidence-weighted approaches—and enables both neighborhood exploration and targeted entity-to-entity navigation.
Subgraph Service – we have optimized the Subgraph Service to focus on entity-centric knowledge extraction. Key capabilities now include generating single and multi-entity subgraphs and tracking cross-document provenance. Additionally, the service leverages AI-powered summarization using Claude 3.5 API and provides query-focused LOCAL context formatting.
Search Service – we have upgraded the Search Service to handle multi-modal retrieval orchestration. Key features now include performing vector search across multiple embedding spaces and facilitating graph-based relationship discovery. Additionally, the service employs hybrid fusion to combine semantic and structural search, alongside VoyageAI API re-ranking for result optimization.
This architecture enables sophisticated information retrieval that goes far beyond traditional document search, providing graph-aware, cross-document intelligence with transparent provenance tracking.
Cross-Document Relationship Aggregation & Multi-Hop Graph Traversal
The Challenge: From Isolated Documents to Connected Knowledge
Traditional document repositories often treat each document as an isolated information silo. When users ask questions like “What do multiple papers say about transformer architectures?” or “How do different authors view the relationship between AI safety and model scaling?”, single-document approaches fail to synthesize cross-document insights.
Our solution addresses this through two complementary capabilities: relationship aggregation that consolidates entity connections across documents, and multi-hop traversal that discovers indirect relationships through the knowledge graph.
Cross-Document Relationship Aggregation
Design Approach
Currently, separate entity relationship instances are created during ingestion for each mention in a document. At runtime, relationship edges are dynamically consolidated with aggregate metadata tracking document support so that each relationship between two entities maintains:
- Occurrence Count – Number of documents mentioning this relationship
- Aggregate Confidence – Both average and maximum confidence scores across mentions
- Document Support – List of documents supporting the relationship with first-seen tracking
- Relationship Properties – Contextual attributes like project names, dates, or roles
Neo4j Cypher:
MATCH (e1:Entity {name: "Claude"})-[r]-(e2:Entity {name: "Anthropic"}) RETURN e1, r, e2
One of the Claude entities is of Product type and the other of Technology type. In addition to the MENTIONED_WITH relation appearing in multiple documents, you can see CREATED and USES relations.
Implementation Details
The relationship aggregation leverages Neo4j’s native graph traversal capabilities combined with Cypher aggregation functions. When querying cross-document relationships for an entity, the system:
- Identifies all relationship edges connected to the target entity across all documents
- Groups relationships by type and target entity to consolidate duplicate connections
- Aggregates metadata using Cypher's collect() and aggregate functions (see below)
- Filters by confidence thresholds to surface high-quality relationships
- Ranks results by occurrence count and confidence scores
MATCH (e:Entity {id: $entity_id})-[r]-(related:Entity)
WITH e, type(r) as rel_type, related,
     collect(DISTINCT r.document_id) as docs,
     avg(r.confidence) as avg_confidence,
     max(r.confidence) as max_confidence
RETURN related,
       rel_type,
       size(docs) as occurrence_count,
       avg_confidence,
       max_confidence,
       docs as supporting_documents
This approach provides several key advantages:
- Efficient queries – a single graph traversal retrieves all cross-document connections
- Transparent provenance – complete document support tracking for every relationship
- Quality signals – confidence aggregation helps identify strongly supported relationships
- Scalability – aggregation occurs at query time without preprocessing overhead
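To make the aggregation concrete, here is a minimal Python sketch of the same grouping and ranking logic applied to hypothetical in-memory edge records (field names are illustrative; in the real system this work happens inside the Cypher query):

```python
from collections import defaultdict

def aggregate_relationships(edges):
    """Group per-document relationship edges by (type, target), then compute
    occurrence count, average/maximum confidence, and supporting documents,
    ranking by document support and confidence — mirroring the Cypher query."""
    groups = defaultdict(list)
    for edge in edges:
        groups[(edge["rel_type"], edge["target"])].append(edge)

    results = []
    for (rel_type, target), rels in groups.items():
        docs = sorted({r["document_id"] for r in rels})
        confs = [r["confidence"] for r in rels]
        results.append({
            "target": target,
            "rel_type": rel_type,
            "occurrence_count": len(docs),
            "avg_confidence": sum(confs) / len(confs),
            "max_confidence": max(confs),
            "supporting_documents": docs,
        })
    # Rank by document support first, then by aggregate confidence
    results.sort(key=lambda r: (r["occurrence_count"], r["avg_confidence"]),
                 reverse=True)
    return results

edges = [
    {"rel_type": "CREATED", "target": "Claude", "document_id": "doc1", "confidence": 0.9},
    {"rel_type": "CREATED", "target": "Claude", "document_id": "doc2", "confidence": 0.8},
    {"rel_type": "USES", "target": "Claude", "document_id": "doc1", "confidence": 0.7},
]
top = aggregate_relationships(edges)[0]
```

The relationship supported by two documents ranks first, with both average and maximum confidence preserved for downstream filtering.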
Multi-Hop Graph Traversal with Cross-Document Awareness
Design Approach
While direct relationships between entities are valuable, many insights require discovering indirect connections through intermediary entities. Our multi-hop traversal system implements cross-document aware path finding that discovers entity relationships spanning multiple documents with configurable traversal strategies.
The GraphRAG system supports three traversal strategies:
- Breadth-First Traversal – Explores all relationships at depth N before moving to depth N+1, ideal for discovering nearby entity neighborhoods
- Shortest-Path Traversal – Finds minimum-hop paths between entities, optimized for targeted entity-to-entity queries
- Confidence-Weighted Traversal – Prioritizes high-confidence relationships, ensuring path quality over path length
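As an illustration of the third strategy, here is a minimal, self-contained sketch of confidence-weighted traversal over an in-memory adjacency map (the graph layout and field names are hypothetical; the production service runs against Neo4j):

```python
import heapq

def confidence_weighted_paths(graph, start, end, max_depth=3, min_confidence=0.5):
    """Best-first search that always expands the highest-confidence partial
    path (path confidence = product of edge confidences). `graph` maps an
    entity name to (neighbor, confidence, document_id) tuples — a hypothetical
    in-memory stand-in for the Neo4j adjacency."""
    heap = [(-1.0, [start], [])]  # max-heap via negated confidence
    while heap:
        neg_conf, path, docs = heapq.heappop(heap)
        node = path[-1]
        if node == end:
            return {"path": path, "confidence": -neg_conf,
                    "documents": sorted(set(docs))}
        if len(path) > max_depth:  # hop budget exhausted
            continue
        for neighbor, conf, doc in graph.get(node, ()):
            if conf < min_confidence or neighbor in path:
                continue  # prune low-confidence edges and cycles
            heapq.heappush(heap, (neg_conf * conf, path + [neighbor], docs + [doc]))
    return None

graph = {
    "A": [("B", 0.9, "doc1"), ("C", 0.6, "doc3")],
    "B": [("C", 0.8, "doc2")],
}
result = confidence_weighted_paths(graph, "A", "C")
```

Note that the two-hop path A→B→C (confidence 0.9 × 0.8 = 0.72) beats the direct A→C edge (0.6): the strategy prefers path quality over path length, exactly as described above.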
Document Diversity Scoring
A key innovation in our traversal system is document diversity scoring, which quantifies how well a path spans multiple information sources:
document_diversity = unique_documents_in_path / total_document_mentions
The score reaches 1.0 when every relationship mention in the path comes from a different document, and falls toward its minimum of 1/N when all N mentions come from a single document. Higher diversity scores indicate paths supported by multiple independent sources, increasing confidence in the discovered relationships.
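The diversity formula is straightforward to express in code; a minimal sketch, assuming each path edge carries the document_id of its mention:

```python
def document_diversity(path_edges):
    """unique_documents_in_path / total_document_mentions for a traversal
    path; each edge is assumed to carry the document_id of its mention."""
    if not path_edges:
        return 0.0
    docs = [edge["document_id"] for edge in path_edges]
    return len(set(docs)) / len(docs)

# Two of three relationship mentions come from the same document
path = [{"document_id": "d1"}, {"document_id": "d1"}, {"document_id": "d2"}]
score = document_diversity(path)
```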
Neo4j Cypher:
MATCH path = (e1:Entity {name: "Transformer"})-[*1..3]-(e2:Entity {name: "BERT"}) RETURN path LIMIT 5
Traversal Configuration
The traversal service accepts configuration parameters for tailoring graph exploration:
- Max Depth – Limits hop count to prevent excessive traversal (default: 2-3 hops)
- Max Entities – Caps total entities in results to maintain performance (default: 50)
- Min Confidence – Filters relationships below quality thresholds (default: 0.5)
- Document Filter – Optionally restricts traversal to specific document subsets
- Relationship Type Filter – Focuses traversal on specific relationship categories
These parameters allow us to support use cases ranging from broad entity neighborhood exploration to targeted path finding between specific entities.
Subgraph Extraction & LOCAL Query Context
The Challenge: From Graph Queries to Contextual Understanding
While the knowledge graph enables powerful relationship queries, using graph query results for generative AI applications (like answering user questions or summarization) requires transforming graph structure into natural language context. To meet this challenge, we implemented entity-centric subgraph extraction that captures all relevant relationships, entities, and supporting text while maintaining cross-document provenance.
Subgraph extraction forms the foundation of GraphRAG’s LOCAL context generation – the system provides focused, entity-centric information with complete source attribution for AI-powered question answering.
Subgraph Data Model
Subgraph representation in GraphRAG extends traditional knowledge graph extraction with cross-document metadata designed specifically for AI consumption:
- Entity Focus – central entity that anchors the subgraph
- Relationships – all connected entities with relationship types and confidence scores
- Cross-Document Metadata – is_cross_doc flag, document_count, and primary_document identification
- Document Mentions – frequency tracking for each supporting document
- Chunk Aggregation – relevant text chunks from all documents with provenance
- AI Analysis – Claude-generated summary, keywords, and confidence assessment
Neo4j Cypher:
MATCH (e:Entity {name: "Attention Mechanism"})-[r*1..2]-(related) OPTIONAL MATCH (e)-[:MENTIONED_IN]->(chunk:Chunk) RETURN e, r, related, chunk LIMIT 30
This rich metadata structure enables GraphRAG to identify the primary document (the document with the most entity/relationship mentions) as the main source, track document diversity to highlight cross-document insights, and provide transparent provenance for every piece of information, including contextual text for semantic understanding by the AI.
Entity-Centric Subgraph Extraction
Single Entity Extraction
The subgraph extraction process implements the following algorithm:
- Entity Identification – locate the target entity in the knowledge graph
- Relationship Traversal – discover connected entities within configurable depth (typically 1-2 hops)
- Chunk Retrieval – gather relevant text chunks mentioning any entity in the subgraph (see Cypher query below)
- Document Frequency Analysis – calculate mention counts per document to identify primary source
- Cross-Document Detection – flag subgraphs spanning multiple documents with diversity metadata
- AI Analysis – submit subgraph to Claude API for summarization and keyword extraction
MATCH (e:Entity)-[:MENTIONED_IN]->(c:Chunk)
WHERE e.id IN $entity_ids
MATCH (c)-[:FROM_DOCUMENT]->(d:Document)
RETURN DISTINCT c, d
ORDER BY c.index ASC
The extraction process respects the following configurable parameters to allow balancing the analysis depth with its performance:
- Document Filter – restricts to specific document subsets (optional)
- Max Depth – controls relationship traversal distance
- Max Entities – limits subgraph size for manageable context
- Min Confidence – filters low-quality relationships
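Steps 4 and 5 of the extraction algorithm above (document frequency analysis and cross-document detection) can be sketched as follows, assuming a hypothetical list of (chunk_id, document_id) pairs returned by the chunk-retrieval query:

```python
from collections import Counter

def cross_document_metadata(chunk_records):
    """Document frequency analysis and cross-document detection: count
    mentions per document, pick the primary document, and flag subgraphs
    that span multiple documents. `chunk_records` is a hypothetical list
    of (chunk_id, document_id) pairs from the chunk-retrieval query."""
    mentions = Counter(doc_id for _, doc_id in chunk_records)
    primary_document, _ = mentions.most_common(1)[0]
    return {
        "document_mentions": dict(mentions),
        "document_count": len(mentions),
        "primary_document": primary_document,
        "is_cross_doc": len(mentions) > 1,
    }

records = [("c1", "paper_a"), ("c2", "paper_a"), ("c3", "paper_b"), ("c4", "paper_a")]
meta = cross_document_metadata(records)
```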
Multiple Entity Extraction
Complex queries require information about multiple entities. The GraphRAG system supports parallel subgraph extraction with intelligent de-duplication. This capability enables queries like “Compare the approaches of three different authors to LLM interpretability” by extracting and analyzing subgraphs for each author simultaneously.
AI-Powered Subgraph Analysis
Each extracted subgraph undergoes AI analysis using Claude API to generate:
- Subgraph Summary – a concise natural language description of the subgraph’s key information, explicitly noting cross-document sources when applicable
- Keywords – extracted key concepts and terms for semantic indexing and retrieval enhancement
- Confidence Score – an AI assessment of subgraph quality and information completeness
The AI prompt used for this analysis includes cross-document context cues, ensuring Claude understands when it is synthesizing information from multiple sources:
“This subgraph spans 3 documents with the primary source being ‘transformers_paper.pdf’. When summarizing, note information that appears consistently across multiple sources versus information from a single source.”
Claude AI analysis transforms raw graph structure into semantically rich context optimized for downstream AI applications.
LOCAL Query Context Formatting
Context Structure
The LOCAL context system formats multiple subgraphs into a hierarchical, query-focused prompt designed for Large Language Model (LLM) consumption:
# LOCAL Context for Query: [User Query]
## Context 1: [Primary Entity Name]
*Sources: 3 documents* (cross-document indicator)
[AI-Generated Summary explaining the entity’s role and key relationships]
**Key Relationships:**
– Entity A → RELATIONSHIP_TYPE → Entity B [mentioned in 3 docs]
– Entity C → RELATIONSHIP_TYPE → Entity D [confidence: 0.92]
**Primary Document**: transformers_paper.pdf
**All Sources**: transformers_paper.pdf, bert_paper.pdf, gpt_paper.pdf
**Key Entities:**
– Entity Name (Type) – brief description
– Related Entity (Type) – brief description
**Relevant Text Excerpts:**
“The attention mechanism computes…” (transformers_paper.pdf, chunk 5)
—
## Context 2: [Secondary Entity Name]
…
Design Principles
The LOCAL context format follows several key principles:
- Query Focus – each context section directly addresses the user’s query by centering on relevant entities
- Provenance Transparency – clear attribution of every piece of information to source documents with cross-document indicators
- Hierarchical Organization – numbered sections enable LLMs to reference specific parts of context in responses
- Source Prioritization – primary document identification helps LLMs understand the main information source
- Cross-Document Awareness – explicit marking of multi-document subgraphs signals when information synthesis occurred
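The principles above can be sketched as a small formatter over hypothetical subgraph dictionaries (the production formatter emits the richer layout shown earlier; field names here are illustrative):

```python
def format_local_context(query, subgraphs):
    """Render subgraphs into a hierarchical, query-focused Markdown context:
    numbered sections, a cross-document source indicator, and explicit
    relationship and provenance lines."""
    lines = [f"# LOCAL Context for Query: {query}", ""]
    for i, sg in enumerate(subgraphs, start=1):
        lines.append(f"## Context {i}: {sg['entity']}")
        if sg["document_count"] > 1:  # cross-document indicator
            lines.append(f"*Sources: {sg['document_count']} documents*")
        lines.append(sg["summary"])
        lines.append("**Key Relationships:**")
        for src, rel, dst in sg["relationships"]:
            lines.append(f"- {src} -> {rel} -> {dst}")
        lines.append(f"**Primary Document**: {sg['primary_document']}")
        lines.append("")
    return "\n".join(lines)

sg = {
    "entity": "Transformer",
    "summary": "Architecture based entirely on attention mechanisms.",
    "document_count": 3,
    "primary_document": "transformers_paper.pdf",
    "relationships": [("Transformer", "USES", "Attention Mechanism")],
}
context = format_local_context("What is a transformer?", [sg])
```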
Multi-Vector Representation (MVR)
The Challenge: Beyond Single-Vector Chunk Retrieval
Traditional RAG systems rely on a single vector space – chunk embeddings – for information retrieval. While effective for basic semantic search, this approach has fundamental limitations:
- Granularity Mismatch – entity-focused queries (“Tell me about OpenAI”) don’t align well with chunk-level vectors that contain multiple entities and concepts
- Relationship Blindness – chunk embeddings don’t explicitly capture the semantic meaning of relationships between entities (e.g., “OpenAI CREATED GPT-4” vs “Microsoft INVESTED_IN OpenAI”)
- Document-Level Gaps – high-level questions about document topics or themes are difficult to answer when only chunk-level vectors exist
- Fact vs Context Confusion – specific factual relationships get diluted in chunk embeddings that include surrounding contextual text
Multi-Vector Architecture
The Multi-Vector Representation system implemented in the GraphRAG Document Repository addresses these limitations by creating dedicated embedding spaces for entities, relations, and document summaries in addition to the chunk embedding space. Each of the four specialized collections in Chroma DB is optimized for specific query patterns:

Embedding Generation: From Graph Data to Vector Representations
Before exploring the Multi-Vector Representation (MVR) design, it is essential to understand how raw graph data is transformed into embeddings. Unlike chunk embeddings, which directly embed document text, entity and relationship embeddings require aggregating information from multiple sources across the knowledge graph.
Entity Embedding Generation Process:
Consider an entity like “OpenAI” of Organization type that appears in multiple documents. The system constructs a rich contextual representation by aggregating:
- Core Entity Information: name, type, aliases, and description extracted during entity recognition
- Cross-Document Statistics: document frequency (appears in N documents), document diversity score (Shannon entropy of mention distribution)
- Relationship Context: relationship count, unique relationship partners, top relationship types (e.g., FOUNDED, CREATED, INVESTED_IN)
- Co-Occurrence Patterns: frequently co-mentioned entities across documents
The aggregated text sent to VoyageAI for embedding generation looks like:
Entity: OpenAI (Type: Organization)
Description: Artificial intelligence research laboratory
Document Frequency: Appears in 8 documents
Relationship Count: 34 total relationships
Document Diversity Score: 0.763
Top Relationships: FOUNDED, CREATED, INVESTED_IN
Frequent Co-entities: GPT-4, Sam Altman, Microsoft, ChatGPT, Anthropic
Key Context: AI research organization developing large language models
and AGI technologies, founded in 2015, known for GPT series and ChatGPT.
This rich contextual embedding enables semantic search to find “OpenAI” not just by name matching, but by understanding its role as an AI research organization with specific relationships and cross-document presence patterns.
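Assembling that aggregated text from an entity record can be sketched as follows (the field names are illustrative, not the actual service's schema):

```python
def build_entity_embedding_text(entity):
    """Assemble the aggregated contextual text for an entity before sending
    it to the embedding model: core identity, cross-document statistics,
    relationship context, and co-occurrence patterns."""
    lines = [
        f"Entity: {entity['name']} (Type: {entity['type']})",
        f"Description: {entity['description']}",
        f"Document Frequency: Appears in {entity['doc_frequency']} documents",
        f"Relationship Count: {entity['relationship_count']} total relationships",
        f"Document Diversity Score: {entity['diversity_score']:.3f}",
        f"Top Relationships: {', '.join(entity['top_relationship_types'])}",
        f"Frequent Co-entities: {', '.join(entity['co_entities'])}",
    ]
    return "\n".join(lines)

entity = {
    "name": "OpenAI", "type": "Organization",
    "description": "Artificial intelligence research laboratory",
    "doc_frequency": 8, "relationship_count": 34, "diversity_score": 0.763,
    "top_relationship_types": ["FOUNDED", "CREATED", "INVESTED_IN"],
    "co_entities": ["GPT-4", "Sam Altman", "Microsoft"],
}
text = build_entity_embedding_text(entity)
```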
Relationship Embedding Generation Process:
For a relationship like “Sam Altman -FOUNDED-> OpenAI”, the system constructs fact-based representations by aggregating:
- Structured Relationship Data: source entity name/type, relationship type, target entity name/type
- Cross-Document Evidence: occurrence count (mentioned in N contexts), aggregate confidence, supporting documents
- Consensus Metrics: document consensus score measuring agreement across sources
- Contextual Information: sample context from document chunks where the relationship appears
The aggregated text sent to VoyageAI for embedding generation looks like:
Relationship: Sam Altman -FOUNDED-> OpenAI
Source Entity: Sam Altman (Person)
Target Entity: OpenAI (Organization)
Occurrence Count: Mentioned in 5 different contexts
Aggregate Confidence: 0.94
Supporting Documents: 5 documents
Document Consensus Score: 1.0
Context: Sam Altman co-founded OpenAI in December 2015 along with Elon Musk,
Greg Brockman, and others with the mission to ensure AGI benefits humanity.
He served as CEO of OpenAI and has been instrumental in the development
of GPT series models.
This fact-based embedding allows queries like “Who founded OpenAI?” to directly match against relationship embeddings where the FOUNDED relationship type and entities are explicitly represented, avoiding the semantic drift that occurs when such facts are buried within general chunk text.
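The text above does not spell out the consensus formula; one plausible definition, shown here purely as an assumption, is the share of cross-document mentions that agree with the most common relationship type for an entity pair:

```python
from collections import Counter

def document_consensus_score(mentions):
    """Hypothetical consensus metric: fraction of mentions agreeing with the
    most common relationship type for an entity pair. 1.0 means every
    supporting document asserts the same relationship."""
    if not mentions:
        return 0.0
    counts = Counter(m["relation_type"] for m in mentions)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(mentions)

# Four documents say FOUNDED, one says LEADS
mentions = [{"relation_type": "FOUNDED"}] * 4 + [{"relation_type": "LEADS"}]
score = document_consensus_score(mentions)
```

Under this definition, the "Sam Altman -FOUNDED-> OpenAI" example above, where all 5 supporting documents agree, yields a consensus score of 1.0.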
Summary Embedding Generation Process:
Document summaries are generated using Claude API, which analyzes the complete document to produce structured summaries with:
- AI-Generated Summary: 2-3 sentence overview of document content and significance
- Topic Extraction: key topics and themes identified by Claude
- Entity Integration: key entities mentioned in the document with frequencies
- Cross-Document Metrics: shared entities with other documents for connectivity analysis
The output text sent to VoyageAI for summary embedding generation looks like:
Document: "Attention Is All You Need" (Research Paper)
Summary: This seminal paper introduces the Transformer architecture, a novel
neural network model based entirely on attention mechanisms without recurrence
or convolution. The architecture achieves state-of-the-art results on machine
translation tasks while being more parallelizable and requiring less training time.
Key Topics: transformer architecture, self-attention mechanism, encoder-decoder,
sequence-to-sequence modeling, neural machine translation
Key Entities: Transformer, attention mechanism, BERT, neural networks, NLP
Cross-Document Connectivity: Shares 8 entities with other documents
Document Type: Academic research paper introducing foundational architecture
Relevance Score: 0.95
This Claude-powered approach enables document-level queries to find papers by their themes and contributions rather than requiring exact keyword matches.
Key Insight: The multi-vector approach transforms the knowledge graph’s structured data (entities, relationships, document metadata) into semantically rich text representations that preserve the graph’s structural information while enabling vector similarity search. This bridges the gap between symbolic graph representation and semantic vector search.
Vector Space Design
All vector spaces use VoyageAI’s voyage-3 embedding model (1024 dimensions) for semantic consistency, enabling mathematically sound similarity calculations across collections.
1. Entity Embeddings Collection
Use Cases: Entity-centric queries like “What do you know about BERT?” or “Find all Technology entities mentioned in multiple documents”.
Entity Embedding Chroma DB object metadata:
- entity_name, entity_type: core entity identification
- doc_frequency: cross-document presence indicator (enables filtering for widely-discussed entities)
- document_diversity_score: Shannon entropy-based measurement of entity mention distribution
- unique_relationship_partners: connectivity metric showing how well-connected the entity is
- top_relationship_types: most frequent relationship patterns for context
- confidence: entity extraction confidence
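A sketch of building this metadata payload for storage alongside the embedding; the field names follow the list above, while the filter syntax in the comment reflects Chroma's where-clause style (an assumption about how the collection is queried):

```python
def entity_metadata(entity):
    """Flat metadata payload stored alongside each entity embedding.
    Chroma metadata values must be scalars, so the list of top relationship
    types is joined into a single string; a query could then filter with
    e.g. where={"doc_frequency": {"$gte": 3}} for widely-discussed entities."""
    return {
        "entity_name": entity["name"],
        "entity_type": entity["type"],
        "doc_frequency": entity["doc_frequency"],
        "document_diversity_score": entity["diversity_score"],
        "unique_relationship_partners": entity["partners"],
        "top_relationship_types": ",".join(entity["top_relationship_types"]),
        "confidence": entity["confidence"],
    }

meta = entity_metadata({
    "name": "OpenAI", "type": "Organization", "doc_frequency": 8,
    "diversity_score": 0.763, "partners": 21,
    "top_relationship_types": ["FOUNDED", "CREATED", "INVESTED_IN"],
    "confidence": 0.9,
})
```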
2. Relation Embeddings Collection
Use Cases: Fact-based queries like “Who founded OpenAI?” or “What technologies does Google use?” with confidence scoring.
Relation Embedding Chroma DB object metadata:
- source_entity, target_entity, relation_type: structured relationship information
- occurrence_count: how many times the relationship appears across documents
- aggregate_confidence: average confidence across all mentions (Phase 3.2 aggregation)
- document_consensus_score: cross-document agreement measurement
- supporting_docs: list of documents containing this relationship
- has_context: boolean indicating contextual information availability
3. Document Summary Embeddings Collection
Use Cases: Document discovery queries like “Find papers about transformer architectures” or “What documents discuss attention mechanisms?”
Summary Embedding Chroma DB object metadata:
- title, source_type: document identification
- entity_count: number of entities in document
- key_entities: top entities by mention frequency
- key_topics: Claude-extracted topics and themes
- shared_entities_with_other_docs: cross-document connectivity metric
- relevance_score: Claude-assigned document significance score
4. Document Chunks Collection
The existing chunk embeddings collection continues to serve detail-oriented queries requiring specific textual evidence, maintaining backward compatibility with all existing functionality.
Multi-Vector Indexing Pipeline
Embedding Generation Sequence:
- Entity Embeddings – parallel embedding generation with configurable batch sizes (default: 50 entities per batch)
  - Aggregates 12 statistical metadata fields per entity
  - Calculates Shannon entropy for document diversity scoring
  - Tracks relationship connectivity and co-entity patterns
- Relation Embeddings – parallel embedding generation leveraging relationship aggregation (default: 100 relations per batch)
  - Uses the existing aggregate_relationship_edges() for cross-document data
  - Generates stable MD5-based relation IDs for consistent identification
  - Calculates document consensus scores
- Summary Embeddings – sequential embedding generation with Claude API rate limit management (default: 10 documents per batch)
  - Claude API generates structured summaries with topics and themes
  - Extracts key entities and calculates shared entity metrics
  - Produces relevance scores and document type analysis
Service Architecture
Three New Specialized Services:
- EntityEmbeddingService – rich entity context generation with 12 metadata fields
- RelationEmbeddingService – fact-based relationship embeddings with consensus scoring
- SummaryEmbeddingService – Claude-powered document summarization with topic extraction
Coordination Layer:
- MultiVectorServiceManager – health monitoring, collection management, and orchestration
- MultiVectorSearchService – unified search interface across four collections
The service architecture maintains clean separation of concerns with dedicated services for each embedding type, coordinated by a manager service that handles initialization, health monitoring, and batch generation orchestration.
Configuration System
The MVR service introduces 24 new configuration parameters that provide complete control over multi-vector behavior:
- Core Settings – enable/disable multi-vector indexing, version tracking, default search modes
- Per-Collection Settings – batch sizes, minimum thresholds, collection names
- Analytics Toggles – document diversity calculation, consensus scoring, shared entity metrics
All settings are environment-variable configurable for GraphRAG deployment flexibility.
New REST API Endpoints
The cross-document intelligence and multi-vector capabilities are exposed through several new REST API endpoints in the Graph Operations & Search Services area.
Cross-Document Analysis APIs
1. Get Cross-Document Relations
Use case: Discovering how an entity relates to others across multiple documents with confidence metrics.
GET /api/documents/entities/{entity_id}/cross-doc-relations
Query Parameters:
- relationship_types: Optional[List[str]] - Filter by relationship types
- min_confidence: Optional[float] - Minimum confidence threshold
- min_occurrences: Optional[int] - Minimum document occurrence count
Response: List of relationships with aggregate statistics
- occurrence_count: Number of documents supporting the relationship
- avg_confidence: Average confidence across all mentions
- max_confidence: Maximum confidence observed
- supporting_documents: List of document IDs with first-seen metadata
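As a usage sketch, a small helper that builds the request URL for this endpoint (the base URL and the comma-separated serialization of relationship_types are assumptions about the deployment):

```python
from urllib.parse import urlencode

def cross_doc_relations_url(base_url, entity_id, relationship_types=None,
                            min_confidence=None, min_occurrences=None):
    """Build the GET URL for the cross-document relations endpoint,
    attaching only the query parameters the caller actually set."""
    params = {}
    if relationship_types:
        params["relationship_types"] = ",".join(relationship_types)
    if min_confidence is not None:
        params["min_confidence"] = min_confidence
    if min_occurrences is not None:
        params["min_occurrences"] = min_occurrences
    url = f"{base_url}/api/documents/entities/{entity_id}/cross-doc-relations"
    return f"{url}?{urlencode(params)}" if params else url

url = cross_doc_relations_url("http://localhost:8000", "entity-42",
                              min_confidence=0.7, min_occurrences=2)
```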
2. Get Entity Document Distribution
Use case: Understanding which documents discuss an entity and identifying related entities through co-occurrence patterns.
GET /api/documents/entities/{entity_id}/documents
Response: Document distribution with co-occurrence statistics
- document_list: Documents mentioning the entity
- mention_count: Frequency per document
- co_entities: Other entities mentioned in the same documents
- co_occurrence_scores: Statistical co-occurrence metrics
3. Multi-Hop Graph Traversal
Use case: Discovering indirect relationships between entities and evaluating path quality through document diversity scoring.
POST /api/documents/entities/traverse
Request Body:
- start_entity_id: Starting entity for traversal
- end_entity_id: Optional target entity for directed search
- strategy: Traversal strategy (breadth_first, shortest_path, confidence_weighted)
- max_depth: Maximum hop count (default: 2)
- max_entities: Entity limit (default: 50)
- min_confidence: Relationship confidence threshold
Response: Discovered paths with cross-document metadata
- paths: List of entity paths with relationships
- document_diversity: Diversity score for each path
- path_confidence: Aggregate confidence for path
- supporting_documents: Documents supporting each path segment
Subgraph Extraction APIs
1. Single Entity Subgraph
Use case: Extracting comprehensive entity-centric knowledge with AI-powered summarization.
POST /api/documents/entities/{entity_id}/subgraph
Request Body:
- max_depth: Relationship traversal depth (default: 2)
- max_entities: Entity limit for subgraph (default: 50)
- min_confidence: Relationship confidence threshold (default: 0.5)
- include_summary: Generate AI summary (default: true)
- include_keywords: Extract keywords (default: true)
- document_filter: Optional document ID list
Response: Subgraph with cross-document metadata
- entity: Central entity details
- relationships: Connected entities with relationship types
- is_cross_doc: Boolean indicating multi-document subgraph
- document_count: Number of supporting documents
- primary_document: Main source document
- document_mentions: Frequency per document
- summary: Claude-generated subgraph summary
- keywords: Extracted key concepts
- chunks: Relevant text excerpts with provenance
2. Multiple Entity Subgraphs
Use case: Batch extraction of subgraphs for multiple entities with parallel processing.
POST /api/documents/entities/subgraphs/extract-multiple
Query Parameters:
- max_depth: Relationship depth (default: 2)
- max_entities: Per-subgraph entity limit (default: 30)
Request Body: List of entity IDs
Response: List of subgraphs with parallel extraction
- subgraphs: Array of entity subgraphs
- extraction_time: Performance metrics
- total_entities: Aggregate entity count across subgraphs
3. LOCAL Context Formatting
Use case: Generating query-focused LOCAL context for LLM consumption with comprehensive provenance tracking.
POST /api/documents/entities/subgraphs/format-context
Request Body:
- query: User query string for context framing
- entity_ids: List of entity IDs to include
- max_subgraphs: Limit on number of subgraphs (default: 5)
Response: Formatted LOCAL context string
- context: Markdown-formatted hierarchical context
- subgraph_count: Number of included subgraphs
- total_entities: Aggregate entity count
- cross_doc_count: Number of cross-document subgraphs
All APIs include comprehensive error handling, request validation, and detailed response schemas auto-documented via OpenAPI/Swagger.
Conclusion
In this blog post I described the new features that transform isolated document knowledge into an interconnected information network:
- Cross-Document Intelligence – relationship aggregation and multi-hop traversal enabled discovering how entities and concepts relate across multiple information sources, with confidence scoring and document diversity metrics providing quality signals for multi-document insights.
- LOCAL Context Generation – entity-centric subgraph extraction with AI-powered summarization delivered focused, provenance-rich context optimized for Large Language Model consumption that supports transparent information synthesis from multiple documents.
- Multi-Vector Representation – moving beyond traditional single-vector chunk embeddings, new four-collection architecture (entity, relation, summary, and chunk embedding vectors) enabled precision retrieval tailored to specific query patterns. The system integrates cross-document analytics including document diversity scores, consensus measurements, and shared entity metrics that are unavailable in traditional vector search systems.
In the coming Part 3 of this blog series we will cover MVR optimization by shifting the focus to indexing only semantically important relationships; document community and topic detection with MVR indexing support, enabling GLOBAL queries about corpus-wide patterns and themes; and finally, hybrid document retrieval orchestration that coordinates vector search, graph traversal, and subgraph extraction to generate user query responses combining semantic similarity and structural relationships.


