Part 2: A deep dive into the implementation of cross-document semantic relation extraction, multi-hop graph traversal queries, and multi-vector document representation (entities, relations, document summaries, chunks) that allows queries across different vector spaces.
Introduction
In Part 1 of this series, we explored the foundational architecture of the GraphRAG Document Repository, covering document ingestion, entity extraction, and knowledge graph construction. We established a multi-pass pipeline that processes PDF and web documents, extracts entities using AI-powered analysis, and builds a comprehensive knowledge graph in Neo4j.
Building upon that foundation, Part 2 introduces advanced cross-document intelligence capabilities that transform isolated document knowledge into an interconnected information network. This installment covers three critical enhancements:
- Cross-Document Relationship Aggregation & Multi-Hop Traversal – Discovering how entities and concepts relate across multiple documents with confidence-weighted relationship tracking and intelligent graph exploration
- Subgraph Extraction & LOCAL Query Context – Generating entity-centric knowledge subgraphs with AI-powered summarization and comprehensive provenance tracking
- Multi-Vector Representation (MVR) – Moving beyond single-vector chunk embeddings to a sophisticated multi-space retrieval system supporting entity, relationship, summary, and chunk vectors
These capabilities enable our GraphRAG system to answer complex queries that require synthesizing information from multiple sources, understanding semantic relationships between entities, and providing contextually rich responses with transparent source attribution.
System Architecture Update
The GraphRAG system architecture described in Part 1 has evolved to support cross-document intelligence through enhanced and new services in the “Graph Operations & Search Services” functional area:

New Service Capabilities
The Graph Operations & Search Services layer now includes four specialized services working in concert:
Graph Service – We have updated the existing Graph Service to support advanced cross-document relationship operations. Key capabilities now include counting relationship occurrences throughout the entire repository and calculating aggregate confidence scores using average and maximum values. Additionally, the service tracks document support using first-seen metadata and allows for filtering based on relationship type and confidence thresholds.
Graph Traversal Service – we have introduced a new dedicated Graph Traversal Service designed for complex knowledge graph exploration. Key features include multi-hop path finding with cross-document awareness, alongside document diversity scoring for assessing path quality. The service supports configurable traversal strategies—such as breadth-first, shortest-path, and confidence-weighted approaches—and enables both neighborhood exploration and targeted entity-to-entity navigation.
Subgraph Service – we have optimized the Subgraph Service to focus on entity-centric knowledge extraction. Key capabilities now include generating single and multi-entity subgraphs and tracking cross-document provenance. Additionally, the service leverages AI-powered summarization using Claude 3.5 API and provides query-focused LOCAL context formatting.
Search Service – we have upgraded the Search Service to handle multi-modal retrieval orchestration. Key features now include performing vector search across multiple embedding spaces and facilitating graph-based relationship discovery. Additionally, the service employs hybrid fusion to combine semantic and structural search, alongside VoyageAI API re-ranking for result optimization.
This architecture enables sophisticated information retrieval that goes far beyond traditional document search, providing graph-aware, cross-document intelligence with transparent provenance tracking.
Cross-Document Relationship Aggregation & Multi-Hop Graph Traversal
The Challenge: From Isolated Documents to Connected Knowledge
Traditional document repositories often treat each document as an isolated information silo. When users ask questions like “What do multiple papers say about transformer architectures?” or “How do different authors view the relationship between AI safety and model scaling?”, single-document approaches fail to synthesize cross-document insights.
Our solution addresses this through two complementary capabilities: relationship aggregation that consolidates entity connections across documents, and multi-hop traversal that discovers indirect relationships through the knowledge graph.
Cross-Document Relationship Aggregation
Design Approach
Currently, separate entity relationship instances are created during ingestion for each mention in a document. At runtime, relationship edges are dynamically consolidated with aggregate metadata tracking document support so that each relationship between two entities maintains:
- Occurrence Count – Number of documents mentioning this relationship
- Aggregate Confidence – Both average and maximum confidence scores across mentions
- Document Support – List of documents supporting the relationship with first-seen tracking
- Relationship Properties – Contextual attributes like project names, dates, or roles
Neo4j Cypher:
MATCH (e1:Entity {name: "Claude"})-[r]-(e2:Entity {name: "Anthropic"}) RETURN e1, r, e2
One of the Claude entities is of Product type and the other of Technology type. In addition to the MENTIONED_WITH relation appearing in multiple documents, you can see CREATED and USES relations.
Implementation Details
The relationship aggregation leverages Neo4j’s native graph traversal capabilities combined with Cypher aggregation functions. When querying cross-document relationships for an entity, the system:
- Identifies all relationship edges connected to the target entity across all documents
- Groups relationships by type and target entity to consolidate duplicate connections
- Aggregates metadata using Cypher's collect() and aggregate functions (see below)
- Filters by confidence thresholds to surface high-quality relationships
- Ranks results by occurrence count and confidence scores
MATCH (e:Entity {id: $entity_id})-[r]-(related:Entity)
WITH e, type(r) as rel_type, related,
     collect(DISTINCT r.document_id) as docs,
     avg(r.confidence) as avg_confidence,
     max(r.confidence) as max_confidence
RETURN related,
       rel_type,
       size(docs) as occurrence_count,
       avg_confidence,
       max_confidence,
       docs as supporting_documents
This approach provides several key advantages:
- Efficient queries – a single graph traversal retrieves all cross-document connections
- Transparent provenance – complete document support tracking for every relationship
- Quality signals – confidence aggregation helps identify strongly supported relationships
- Scalability – aggregation occurs at query time without preprocessing overhead
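To make the aggregation concrete, here is a minimal Python sketch of the same grouping and ranking logic applied to hypothetical in-memory edge records (field names are illustrative; in the real system this work happens inside the Cypher query):

```python
from collections import defaultdict

def aggregate_relationships(edges):
    """Group per-document relationship edges by (type, target), then compute
    occurrence count, average/maximum confidence, and supporting documents,
    ranking by document support and confidence — mirroring the Cypher query."""
    groups = defaultdict(list)
    for edge in edges:
        groups[(edge["rel_type"], edge["target"])].append(edge)

    results = []
    for (rel_type, target), rels in groups.items():
        docs = sorted({r["document_id"] for r in rels})
        confs = [r["confidence"] for r in rels]
        results.append({
            "target": target,
            "rel_type": rel_type,
            "occurrence_count": len(docs),
            "avg_confidence": sum(confs) / len(confs),
            "max_confidence": max(confs),
            "supporting_documents": docs,
        })
    # Rank by document support first, then by aggregate confidence
    results.sort(key=lambda r: (r["occurrence_count"], r["avg_confidence"]),
                 reverse=True)
    return results

edges = [
    {"rel_type": "CREATED", "target": "Claude", "document_id": "doc1", "confidence": 0.9},
    {"rel_type": "CREATED", "target": "Claude", "document_id": "doc2", "confidence": 0.8},
    {"rel_type": "USES", "target": "Claude", "document_id": "doc1", "confidence": 0.7},
]
top = aggregate_relationships(edges)[0]
```

The relationship supported by two documents ranks first, with both average and maximum confidence preserved for downstream filtering.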
Multi-Hop Graph Traversal with Cross-Document Awareness
Design Approach
While direct relationships between entities are valuable, many insights require discovering indirect connections through intermediary entities. Our multi-hop traversal system implements cross-document aware path finding that discovers entity relationships spanning multiple documents with configurable traversal strategies.
The GraphRAG system supports three traversal strategies:
- Breadth-First Traversal – Explores all relationships at depth N before moving to depth N+1, ideal for discovering nearby entity neighborhoods
- Shortest-Path Traversal – Finds minimum-hop paths between entities, optimized for targeted entity-to-entity queries
- Confidence-Weighted Traversal – Prioritizes high-confidence relationships, ensuring path quality over path length
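As an illustration of the third strategy, here is a minimal, self-contained sketch of confidence-weighted traversal over an in-memory adjacency map (the graph layout and field names are hypothetical; the production service runs against Neo4j):

```python
import heapq

def confidence_weighted_paths(graph, start, end, max_depth=3, min_confidence=0.5):
    """Best-first search that always expands the highest-confidence partial
    path (path confidence = product of edge confidences). `graph` maps an
    entity name to (neighbor, confidence, document_id) tuples — a hypothetical
    in-memory stand-in for the Neo4j adjacency."""
    heap = [(-1.0, [start], [])]  # max-heap via negated confidence
    while heap:
        neg_conf, path, docs = heapq.heappop(heap)
        node = path[-1]
        if node == end:
            return {"path": path, "confidence": -neg_conf,
                    "documents": sorted(set(docs))}
        if len(path) > max_depth:  # hop budget exhausted
            continue
        for neighbor, conf, doc in graph.get(node, ()):
            if conf < min_confidence or neighbor in path:
                continue  # prune low-confidence edges and cycles
            heapq.heappush(heap, (neg_conf * conf, path + [neighbor], docs + [doc]))
    return None

graph = {
    "A": [("B", 0.9, "doc1"), ("C", 0.6, "doc3")],
    "B": [("C", 0.8, "doc2")],
}
result = confidence_weighted_paths(graph, "A", "C")
```

Note that the two-hop path A→B→C (confidence 0.9 × 0.8 = 0.72) beats the direct A→C edge (0.6): the strategy prefers path quality over path length, exactly as described above.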
Document Diversity Scoring
A key innovation in our traversal system is document diversity scoring, which quantifies how well a path spans multiple information sources:
document_diversity = unique_documents_in_path / total_document_mentions
The score reaches 1.0 when every relationship mention in the path comes from a different document, and falls toward its minimum of 1/N when all N mentions come from a single document. Higher diversity scores indicate paths supported by multiple independent sources, increasing confidence in the discovered relationships.
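The diversity formula is straightforward to express in code; a minimal sketch, assuming each path edge carries the document_id of its mention:

```python
def document_diversity(path_edges):
    """unique_documents_in_path / total_document_mentions for a traversal
    path; each edge is assumed to carry the document_id of its mention."""
    if not path_edges:
        return 0.0
    docs = [edge["document_id"] for edge in path_edges]
    return len(set(docs)) / len(docs)

# Two of three relationship mentions come from the same document
path = [{"document_id": "d1"}, {"document_id": "d1"}, {"document_id": "d2"}]
score = document_diversity(path)
```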
Neo4j Cypher:
MATCH path = (e1:Entity {name: "Transformer"})-[*1..3]-(e2:Entity {name: "BERT"}) RETURN path LIMIT 5
Traversal Configuration
The traversal service accepts configuration parameters for tailoring graph exploration:
- Max Depth – Limits hop count to prevent excessive traversal (default: 2-3 hops)
- Max Entities – Caps total entities in results to maintain performance (default: 50)
- Min Confidence – Filters relationships below quality thresholds (default: 0.5)
- Document Filter – Optionally restricts traversal to specific document subsets
- Relationship Type Filter – Focuses traversal on specific relationship categories
These parameters allow us to support use cases ranging from broad entity neighborhood exploration to targeted path finding between specific entities.
Subgraph Extraction & LOCAL Query Context
The Challenge: From Graph Queries to Contextual Understanding
While the knowledge graph enables powerful relationship queries, using graph query results for generative AI applications (like answering user questions or summarization) requires transforming graph structure into natural language context. To meet this challenge, we implemented entity-centric subgraph extraction that captures all relevant relationships, entities, and supporting text while maintaining cross-document provenance.
Subgraph extraction forms the foundation of GraphRAG’s LOCAL context generation – the system provides focused, entity-centric information with complete source attribution for AI-powered question answering.
Subgraph Data Model
Subgraph representation in GraphRAG extends traditional knowledge graph extraction with cross-document metadata designed specifically for AI consumption:
- Entity Focus – central entity that anchors the subgraph
- Relationships – all connected entities with relationship types and confidence scores
- Cross-Document Metadata – is_cross_doc flag, document_count, and primary_document identification
- Document Mentions – frequency tracking for each supporting document
- Chunk Aggregation – relevant text chunks from all documents with provenance
- AI Analysis – Claude-generated summary, keywords, and confidence assessment
Neo4j Cypher:
MATCH (e:Entity {name: "Attention Mechanism"})-[r*1..2]-(related) OPTIONAL MATCH (e)-[:MENTIONED_IN]->(chunk:Chunk) RETURN e, r, related, chunk LIMIT 30
This rich metadata structure enables GraphRAG to identify the primary document (the document with the most entity/relationship mentions) as the main source, track document diversity to highlight cross-document insights, and provide transparent provenance for every piece of information, including contextual text for semantic understanding by the AI.
Entity-Centric Subgraph Extraction
Single Entity Extraction
The subgraph extraction process implements the following algorithm:
- Entity Identification – locate the target entity in the knowledge graph
- Relationship Traversal – discover connected entities within configurable depth (typically 1-2 hops)
- Chunk Retrieval – gather relevant text chunks mentioning any entity in the subgraph (see Cypher query below)
- Document Frequency Analysis – calculate mention counts per document to identify primary source
- Cross-Document Detection – flag subgraphs spanning multiple documents with diversity metadata
- AI Analysis – submit subgraph to Claude API for summarization and keyword extraction
MATCH (e:Entity)-[:MENTIONED_IN]->(c:Chunk)
WHERE e.id IN $entity_ids
MATCH (c)-[:FROM_DOCUMENT]->(d:Document)
RETURN DISTINCT c, d
ORDER BY c.index ASC
The extraction process respects the following configurable parameters to allow balancing the analysis depth with its performance:
- Document Filter – restricts to specific document subsets (optional)
- Max Depth – controls relationship traversal distance
- Max Entities – limits subgraph size for manageable context
- Min Confidence – filters low-quality relationships
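Steps 4 and 5 of the extraction algorithm above (document frequency analysis and cross-document detection) can be sketched as follows, assuming a hypothetical list of (chunk_id, document_id) pairs returned by the chunk-retrieval query:

```python
from collections import Counter

def cross_document_metadata(chunk_records):
    """Document frequency analysis and cross-document detection: count
    mentions per document, pick the primary document, and flag subgraphs
    that span multiple documents. `chunk_records` is a hypothetical list
    of (chunk_id, document_id) pairs from the chunk-retrieval query."""
    mentions = Counter(doc_id for _, doc_id in chunk_records)
    primary_document, _ = mentions.most_common(1)[0]
    return {
        "document_mentions": dict(mentions),
        "document_count": len(mentions),
        "primary_document": primary_document,
        "is_cross_doc": len(mentions) > 1,
    }

records = [("c1", "paper_a"), ("c2", "paper_a"), ("c3", "paper_b"), ("c4", "paper_a")]
meta = cross_document_metadata(records)
```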
Multiple Entity Extraction
Complex queries require information about multiple entities. The GraphRAG system supports parallel subgraph extraction with intelligent de-duplication. This capability enables queries like “Compare the approaches of three different authors to LLM interpretability” by extracting and analyzing subgraphs for each author simultaneously.
AI-Powered Subgraph Analysis
Each extracted subgraph undergoes AI analysis using Claude API to generate:
- Subgraph Summary – a concise natural language description of the subgraph’s key information, explicitly noting cross-document sources when applicable
- Keywords – extracted key concepts and terms for semantic indexing and retrieval enhancement
- Confidence Score – an AI assessment of subgraph quality and information completeness
The AI prompt used for this analysis includes cross-document context cues, ensuring Claude understands when it is synthesizing information from multiple sources:
“This subgraph spans 3 documents with the primary source being ‘transformers_paper.pdf’. When summarizing, note information that appears consistently across multiple sources versus information from a single source.”
Claude AI analysis transforms raw graph structure into semantically rich context optimized for downstream AI applications.
LOCAL Query Context Formatting
Context Structure
The LOCAL context system formats multiple subgraphs into a hierarchical, query-focused prompt designed for Large Language Model (LLM) consumption:
# LOCAL Context for Query: [User Query]
## Context 1: [Primary Entity Name]
*Sources: 3 documents* (cross-document indicator)
[AI-Generated Summary explaining the entity’s role and key relationships]
**Key Relationships:**
– Entity A → RELATIONSHIP_TYPE → Entity B [mentioned in 3 docs]
– Entity C → RELATIONSHIP_TYPE → Entity D [confidence: 0.92]
**Primary Document**: transformers_paper.pdf
**All Sources**: transformers_paper.pdf, bert_paper.pdf, gpt_paper.pdf
**Key Entities:**
– Entity Name (Type) – brief description
– Related Entity (Type) – brief description
**Relevant Text Excerpts:**
“The attention mechanism computes…” (transformers_paper.pdf, chunk 5)
—
## Context 2: [Secondary Entity Name]
…
Design Principles
The LOCAL context format follows several key principles:
- Query Focus – each context section directly addresses the user’s query by centering on relevant entities
- Provenance Transparency – clear attribution of every piece of information to source documents with cross-document indicators
- Hierarchical Organization – numbered sections enable LLMs to reference specific parts of context in responses
- Source Prioritization – primary document identification helps LLMs understand the main information source
- Cross-Document Awareness – explicit marking of multi-document subgraphs signals when information synthesis occurred
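The principles above can be sketched as a small formatter over hypothetical subgraph dictionaries (the production formatter emits the richer layout shown earlier; field names here are illustrative):

```python
def format_local_context(query, subgraphs):
    """Render subgraphs into a hierarchical, query-focused Markdown context:
    numbered sections, a cross-document source indicator, and explicit
    relationship and provenance lines."""
    lines = [f"# LOCAL Context for Query: {query}", ""]
    for i, sg in enumerate(subgraphs, start=1):
        lines.append(f"## Context {i}: {sg['entity']}")
        if sg["document_count"] > 1:  # cross-document indicator
            lines.append(f"*Sources: {sg['document_count']} documents*")
        lines.append(sg["summary"])
        lines.append("**Key Relationships:**")
        for src, rel, dst in sg["relationships"]:
            lines.append(f"- {src} -> {rel} -> {dst}")
        lines.append(f"**Primary Document**: {sg['primary_document']}")
        lines.append("")
    return "\n".join(lines)

sg = {
    "entity": "Transformer",
    "summary": "Architecture based entirely on attention mechanisms.",
    "document_count": 3,
    "primary_document": "transformers_paper.pdf",
    "relationships": [("Transformer", "USES", "Attention Mechanism")],
}
context = format_local_context("What is a transformer?", [sg])
```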
Multi-Vector Representation (MVR)
The Challenge: Beyond Single-Vector Chunk Retrieval
Traditional RAG systems rely on a single vector space – chunk embeddings – for information retrieval. While effective for basic semantic search, this approach has fundamental limitations:
- Granularity Mismatch – entity-focused queries (“Tell me about OpenAI”) don’t align well with chunk-level vectors that contain multiple entities and concepts
- Relationship Blindness – chunk embeddings don’t explicitly capture the semantic meaning of relationships between entities (e.g., “OpenAI CREATED GPT-4” vs “Microsoft INVESTED_IN OpenAI”)
- Document-Level Gaps – high-level questions about document topics or themes are difficult to answer when only chunk-level vectors exist
- Fact vs Context Confusion – specific factual relationships get diluted in chunk embeddings that include surrounding contextual text
Multi-Vector Architecture
The Multi-Vector Representation system implemented in the GraphRAG Document Repository addresses these limitations by creating dedicated embedding spaces for entities, relations, and document summaries in addition to the chunk embedding space. Each of the four specialized collections in Chroma DB is optimized for specific query patterns:

Embedding Generation: From Graph Data to Vector Representations
Before exploring the Multi-Vector Representation (MVR) design, it is essential to understand how raw graph data is transformed into embeddings. Unlike chunk embeddings, which directly embed document text, entity and relationship embeddings require aggregating information from multiple sources across the knowledge graph.
Entity Embedding Generation Process:
Consider an entity like “OpenAI” of Organization type that appears in multiple documents. The system constructs a rich contextual representation by aggregating:
- Core Entity Information: name, type, aliases, and description extracted during entity recognition
- Cross-Document Statistics: document frequency (appears in N documents), document diversity score (Shannon entropy of mention distribution)
- Relationship Context: relationship count, unique relationship partners, top relationship types (e.g., FOUNDED, CREATED, INVESTED_IN)
- Co-Occurrence Patterns: frequently co-mentioned entities across documents
The aggregated text sent to VoyageAI for embedding generation looks like:
Entity: OpenAI (Type: Organization)
Description: Artificial intelligence research laboratory
Document Frequency: Appears in 8 documents
Relationship Count: 34 total relationships
Document Diversity Score: 0.763
Top Relationships: FOUNDED, CREATED, INVESTED_IN
Frequent Co-entities: GPT-4, Sam Altman, Microsoft, ChatGPT, Anthropic
Key Context: AI research organization developing large language models
and AGI technologies, founded in 2015, known for GPT series and ChatGPT.
This rich contextual embedding enables semantic search to find “OpenAI” not just by name matching, but by understanding its role as an AI research organization with specific relationships and cross-document presence patterns.
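Assembling that aggregated text from an entity record can be sketched as follows (the field names are illustrative, not the actual service's schema):

```python
def build_entity_embedding_text(entity):
    """Assemble the aggregated contextual text for an entity before sending
    it to the embedding model: core identity, cross-document statistics,
    relationship context, and co-occurrence patterns."""
    lines = [
        f"Entity: {entity['name']} (Type: {entity['type']})",
        f"Description: {entity['description']}",
        f"Document Frequency: Appears in {entity['doc_frequency']} documents",
        f"Relationship Count: {entity['relationship_count']} total relationships",
        f"Document Diversity Score: {entity['diversity_score']:.3f}",
        f"Top Relationships: {', '.join(entity['top_relationship_types'])}",
        f"Frequent Co-entities: {', '.join(entity['co_entities'])}",
    ]
    return "\n".join(lines)

entity = {
    "name": "OpenAI", "type": "Organization",
    "description": "Artificial intelligence research laboratory",
    "doc_frequency": 8, "relationship_count": 34, "diversity_score": 0.763,
    "top_relationship_types": ["FOUNDED", "CREATED", "INVESTED_IN"],
    "co_entities": ["GPT-4", "Sam Altman", "Microsoft"],
}
text = build_entity_embedding_text(entity)
```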
Relationship Embedding Generation Process:
For a relationship like “Sam Altman -FOUNDED-> OpenAI”, the system constructs fact-based representations by aggregating:
- Structured Relationship Data: source entity name/type, relationship type, target entity name/type
- Cross-Document Evidence: occurrence count (mentioned in N contexts), aggregate confidence, supporting documents
- Consensus Metrics: document consensus score measuring agreement across sources
- Contextual Information: sample context from document chunks where the relationship appears
The aggregated text sent to VoyageAI for embedding generation looks like:
Relationship: Sam Altman -FOUNDED-> OpenAI
Source Entity: Sam Altman (Person)
Target Entity: OpenAI (Organization)
Occurrence Count: Mentioned in 5 different contexts
Aggregate Confidence: 0.94
Supporting Documents: 5 documents
Document Consensus Score: 1.0
Context: Sam Altman co-founded OpenAI in December 2015 along with Elon Musk,
Greg Brockman, and others with the mission to ensure AGI benefits humanity.
He served as CEO of OpenAI and has been instrumental in the development
of GPT series models.
This fact-based embedding allows queries like “Who founded OpenAI?” to directly match against relationship embeddings where the FOUNDED relationship type and entities are explicitly represented, avoiding the semantic drift that occurs when such facts are buried within general chunk text.
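The text above does not spell out the consensus formula; one plausible definition, shown here purely as an assumption, is the share of cross-document mentions that agree with the most common relationship type for an entity pair:

```python
from collections import Counter

def document_consensus_score(mentions):
    """Hypothetical consensus metric: fraction of mentions agreeing with the
    most common relationship type for an entity pair. 1.0 means every
    supporting document asserts the same relationship."""
    if not mentions:
        return 0.0
    counts = Counter(m["relation_type"] for m in mentions)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(mentions)

# Four documents say FOUNDED, one says LEADS
mentions = [{"relation_type": "FOUNDED"}] * 4 + [{"relation_type": "LEADS"}]
score = document_consensus_score(mentions)
```

Under this definition, the "Sam Altman -FOUNDED-> OpenAI" example above, where all 5 supporting documents agree, yields a consensus score of 1.0.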
Summary Embedding Generation Process:
Document summaries are generated using Claude API, which analyzes the complete document to produce structured summaries with:
- AI-Generated Summary: 2-3 sentence overview of document content and significance
- Topic Extraction: key topics and themes identified by Claude
- Entity Integration: key entities mentioned in the document with frequencies
- Cross-Document Metrics: shared entities with other documents for connectivity analysis
The output text sent to VoyageAI for summary embedding generation looks like:
Document: "Attention Is All You Need" (Research Paper)
Summary: This seminal paper introduces the Transformer architecture, a novel
neural network model based entirely on attention mechanisms without recurrence
or convolution. The architecture achieves state-of-the-art results on machine
translation tasks while being more parallelizable and requiring less training time.
Key Topics: transformer architecture, self-attention mechanism, encoder-decoder,
sequence-to-sequence modeling, neural machine translation
Key Entities: Transformer, attention mechanism, BERT, neural networks, NLP
Cross-Document Connectivity: Shares 8 entities with other documents
Document Type: Academic research paper introducing foundational architecture
Relevance Score: 0.95
This Claude-powered approach enables document-level queries to find papers by their themes and contributions rather than requiring exact keyword matches.
Key Insight: The multi-vector approach transforms the knowledge graph’s structured data (entities, relationships, document metadata) into semantically rich text representations that preserve the graph’s structural information while enabling vector similarity search. This bridges the gap between symbolic graph representation and semantic vector search.
Vector Space Design
All vector spaces use VoyageAI’s voyage-3 embedding model (1024 dimensions) for semantic consistency, enabling mathematically sound similarity calculations across collections.
1. Entity Embeddings Collection
Use Cases: Entity-centric queries like “What do you know about BERT?” or “Find all Technology entities mentioned in multiple documents”.
Entity Embedding Chroma DB object metadata:
- entity_name, entity_type: core entity identification
- doc_frequency: cross-document presence indicator (enables filtering for widely-discussed entities)
- document_diversity_score: Shannon entropy-based measurement of entity mention distribution
- unique_relationship_partners: connectivity metric showing how well-connected the entity is
- top_relationship_types: most frequent relationship patterns for context
- confidence: entity extraction confidence
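A sketch of building this metadata payload for storage alongside the embedding; the field names follow the list above, while the filter syntax in the comment reflects Chroma's where-clause style (an assumption about how the collection is queried):

```python
def entity_metadata(entity):
    """Flat metadata payload stored alongside each entity embedding.
    Chroma metadata values must be scalars, so the list of top relationship
    types is joined into a single string; a query could then filter with
    e.g. where={"doc_frequency": {"$gte": 3}} for widely-discussed entities."""
    return {
        "entity_name": entity["name"],
        "entity_type": entity["type"],
        "doc_frequency": entity["doc_frequency"],
        "document_diversity_score": entity["diversity_score"],
        "unique_relationship_partners": entity["partners"],
        "top_relationship_types": ",".join(entity["top_relationship_types"]),
        "confidence": entity["confidence"],
    }

meta = entity_metadata({
    "name": "OpenAI", "type": "Organization", "doc_frequency": 8,
    "diversity_score": 0.763, "partners": 21,
    "top_relationship_types": ["FOUNDED", "CREATED", "INVESTED_IN"],
    "confidence": 0.9,
})
```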
2. Relation Embeddings Collection
Use Cases: Fact-based queries like “Who founded OpenAI?” or “What technologies does Google use?” with confidence scoring.
Relation Embedding Chroma DB object metadata:
- source_entity, target_entity, relation_type: structured relationship information
- occurrence_count: how many times the relationship appears across documents
- aggregate_confidence: average confidence across all mentions (Phase 3.2 aggregation)
- document_consensus_score: cross-document agreement measurement
- supporting_docs: list of documents containing this relationship
- has_context: boolean indicating contextual information availability
3. Document Summary Embeddings Collection
Use Cases: Document discovery queries like “Find papers about transformer architectures” or “What documents discuss attention mechanisms?”
Summary Embedding Chroma DB object metadata:
- title, source_type: document identification
- entity_count: number of entities in document
- key_entities: top entities by mention frequency
- key_topics: Claude-extracted topics and themes
- shared_entities_with_other_docs: cross-document connectivity metric
- relevance_score: Claude-assigned document significance score
4. Document Chunks Collection
The existing chunk embeddings collection continues to serve detail-oriented queries requiring specific textual evidence, maintaining backward compatibility with all existing functionality.
Multi-Vector Indexing Pipeline
Embedding Generation Sequence:
- Entity Embeddings – parallel embedding generation with configurable batch sizes (default: 50 entities per batch)
  - Aggregates 12 statistical metadata fields per entity
  - Calculates Shannon entropy for document diversity scoring
  - Tracks relationship connectivity and co-entity patterns
- Relation Embeddings – parallel embedding generation leveraging relationship aggregation (default: 100 relations per batch)
  - Uses the existing aggregate_relationship_edges() for cross-document data
  - Generates stable MD5-based relation IDs for consistent identification
  - Calculates document consensus scores
- Summary Embeddings – sequential embedding generation with Claude API rate limit management (default: 10 documents per batch)
  - Claude API generates structured summaries with topics and themes
  - Extracts key entities and calculates shared entity metrics
  - Produces relevance scores and document type analysis
Service Architecture
Three New Specialized Services:
- EntityEmbeddingService – rich entity context generation with 12 metadata fields
- RelationEmbeddingService – fact-based relationship embeddings with consensus scoring
- SummaryEmbeddingService – Claude-powered document summarization with topic extraction
Coordination Layer:
- MultiVectorServiceManager – health monitoring, collection management, and orchestration
- MultiVectorSearchService – unified search interface across four collections
The service architecture maintains clean separation of concerns with dedicated services for each embedding type, coordinated by a manager service that handles initialization, health monitoring, and batch generation orchestration.
Configuration System
The MVR service introduces 24 new configuration parameters that provide complete control over multi-vector behavior:
- Core Settings – enable/disable multi-vector indexing, version tracking, default search modes
- Per-Collection Settings – batch sizes, minimum thresholds, collection names
- Analytics Toggles – document diversity calculation, consensus scoring, shared entity metrics
All settings are environment-variable configurable for GraphRAG deployment flexibility.
New REST API Endpoints
The cross-document intelligence and multi-vector capabilities are exposed through several new REST API endpoints in the Graph Operations & Search Services area.
Cross-Document Analysis APIs
1. Get Cross-Document Relations
Use case: Discovering how an entity relates to others across multiple documents with confidence metrics.
GET /api/documents/entities/{entity_id}/cross-doc-relations
Query Parameters:
- relationship_types: Optional[List[str]] - Filter by relationship types
- min_confidence: Optional[float] - Minimum confidence threshold
- min_occurrences: Optional[int] - Minimum document occurrence count
Response: List of relationships with aggregate statistics
- occurrence_count: Number of documents supporting the relationship
- avg_confidence: Average confidence across all mentions
- max_confidence: Maximum confidence observed
- supporting_documents: List of document IDs with first-seen metadata
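As a usage sketch, a small helper that builds the request URL for this endpoint (the base URL and the comma-separated serialization of relationship_types are assumptions about the deployment):

```python
from urllib.parse import urlencode

def cross_doc_relations_url(base_url, entity_id, relationship_types=None,
                            min_confidence=None, min_occurrences=None):
    """Build the GET URL for the cross-document relations endpoint,
    attaching only the query parameters the caller actually set."""
    params = {}
    if relationship_types:
        params["relationship_types"] = ",".join(relationship_types)
    if min_confidence is not None:
        params["min_confidence"] = min_confidence
    if min_occurrences is not None:
        params["min_occurrences"] = min_occurrences
    url = f"{base_url}/api/documents/entities/{entity_id}/cross-doc-relations"
    return f"{url}?{urlencode(params)}" if params else url

url = cross_doc_relations_url("http://localhost:8000", "entity-42",
                              min_confidence=0.7, min_occurrences=2)
```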
2. Get Entity Document Distribution
Use case: Understanding which documents discuss an entity and identifying related entities through co-occurrence patterns.
GET /api/documents/entities/{entity_id}/documents
Response: Document distribution with co-occurrence statistics
- document_list: Documents mentioning the entity
- mention_count: Frequency per document
- co_entities: Other entities mentioned in the same documents
- co_occurrence_scores: Statistical co-occurrence metrics
3. Multi-Hop Graph Traversal
Use case: Discovering indirect relationships between entities and evaluating path quality through document diversity scoring.
POST /api/documents/entities/traverse
Request Body:
- start_entity_id: Starting entity for traversal
- end_entity_id: Optional target entity for directed search
- strategy: Traversal strategy (breadth_first, shortest_path, confidence_weighted)
- max_depth: Maximum hop count (default: 2)
- max_entities: Entity limit (default: 50)
- min_confidence: Relationship confidence threshold
Response: Discovered paths with cross-document metadata
- paths: List of entity paths with relationships
- document_diversity: Diversity score for each path
- path_confidence: Aggregate confidence for path
- supporting_documents: Documents supporting each path segment
Subgraph Extraction APIs
1. Single Entity Subgraph
Use case: Extracting comprehensive entity-centric knowledge with AI-powered summarization.
POST /api/documents/entities/{entity_id}/subgraph
Request Body:
- max_depth: Relationship traversal depth (default: 2)
- max_entities: Entity limit for subgraph (default: 50)
- min_confidence: Relationship confidence threshold (default: 0.5)
- include_summary: Generate AI summary (default: true)
- include_keywords: Extract keywords (default: true)
- document_filter: Optional document ID list
Response: Subgraph with cross-document metadata
- entity: Central entity details
- relationships: Connected entities with relationship types
- is_cross_doc: Boolean indicating multi-document subgraph
- document_count: Number of supporting documents
- primary_document: Main source document
- document_mentions: Frequency per document
- summary: Claude-generated subgraph summary
- keywords: Extracted key concepts
- chunks: Relevant text excerpts with provenance
2. Multiple Entity Subgraphs
Use case: Batch extraction of subgraphs for multiple entities with parallel processing.
POST /api/documents/entities/subgraphs/extract-multiple
Query Parameters:
- max_depth: Relationship depth (default: 2)
- max_entities: Per-subgraph entity limit (default: 30)
Request Body: List of entity IDs
Response: List of subgraphs with parallel extraction
- subgraphs: Array of entity subgraphs
- extraction_time: Performance metrics
- total_entities: Aggregate entity count across subgraphs
3. LOCAL Context Formatting
Use case: Generating query-focused LOCAL context for LLM consumption with comprehensive provenance tracking.
POST /api/documents/entities/subgraphs/format-context
Request Body:
- query: User query string for context framing
- entity_ids: List of entity IDs to include
- max_subgraphs: Limit on number of subgraphs (default: 5)
Response: Formatted LOCAL context string
- context: Markdown-formatted hierarchical context
- subgraph_count: Number of included subgraphs
- total_entities: Aggregate entity count
- cross_doc_count: Number of cross-document subgraphs
All APIs include comprehensive error handling, request validation, and detailed response schemas auto-documented via OpenAPI/Swagger.
Conclusion
In this blog post I described the new features that transform isolated document knowledge into an interconnected information network:
- Cross-Document Intelligence – relationship aggregation and multi-hop traversal enabled discovering how entities and concepts relate across multiple information sources, with confidence scoring and document diversity metrics providing quality signals for multi-document insights.
- LOCAL Context Generation – entity-centric subgraph extraction with AI-powered summarization delivered focused, provenance-rich context optimized for Large Language Model consumption that supports transparent information synthesis from multiple documents.
- Multi-Vector Representation – moving beyond traditional single-vector chunk embeddings, new four-collection architecture (entity, relation, summary, and chunk embedding vectors) enabled precision retrieval tailored to specific query patterns. The system integrates cross-document analytics including document diversity scores, consensus measurements, and shared entity metrics that are unavailable in traditional vector search systems.
In the coming Part 3 of this blog series we will cover MVR optimization by shifting the focus to indexing only semantically important relationships; document community and topic detection with MVR indexing support, enabling GLOBAL queries about corpus-wide patterns and themes; and finally, hybrid document retrieval orchestration that coordinates vector search, graph traversal, and subgraph extraction to generate user query responses combining semantic similarity and structural relationships.


