GraphRAG Part 4 – Community Detection and Embedding, Search and Hybrid Retrieval Integration

Part 4 explores the community detection and community-based GLOBAL search features newly introduced in GraphRAG, which allow the system to identify and leverage hierarchical community structures within knowledge graphs. We’ll explore how graph algorithms in Neo4j detect natural groupings of interconnected entities, how communities are embedded into a dedicated vector space for semantic search, and how GLOBAL queries leverage community summaries to provide thematic analysis.

Introduction

In Part 3 of this series, we explored how intelligent query routing enables GraphRAG to automatically select optimal retrieval strategies across four vector spaces (entities, relationships, summaries, and chunks), delivering context-aware responses through six specialized retrieval approaches. While this Multi-Vector Retrieval (MVR) system excels at LOCAL queries focusing on specific entities and their immediate connections, GLOBAL queries that seek insights across the entire document corpus remained limited to document-level summaries.

Recently, we transformed GLOBAL search capabilities by introducing hierarchical community detection powered by Neo4j’s Graph Data Science (GDS) library. Communities represent natural groupings of interconnected entities within the knowledge graph—clusters of people, organizations, technologies, and concepts that frequently occur together and share semantic relationships. By applying the Louvain algorithm to our entity graph, we detect these communities automatically, generate AI-powered summaries for each community, and embed them into a dedicated fifth vector space for semantic search.

This enables a fundamentally different approach to GLOBAL queries. Instead of aggregating document summaries, the system now identifies relevant communities, samples their representative entities, and synthesizes insights that span multiple documents while maintaining coherent topical focus. A query like “What are the major AI research themes in my document collection?” no longer returns a flat list of document summaries; instead, it surfaces groups of researchers, organizations, and technologies organized around shared research themes like “transformer architectures,” “reinforcement learning,” or “multimodal AI systems.”

The implementation introduces two new core services, CommunityService for detection and CommunityEmbeddingService for vector operations, plus a 7th retrieval strategy that integrates with our existing hybrid retrieval orchestration service. This architectural approach maintains backward compatibility with all existing conversational AI features while enabling powerful new community-based search through six new REST API endpoints.

In this post, I will explore how community detection works at the graph algorithm level, how AI generates rich community embedding vectors, and how the hybrid retrieval system leverages community structures to deliver superior GLOBAL search results that reveal the thematic organization of your knowledge graph.

Technical Architecture Update

The GraphRAG architecture (see the diagram in Part 2 of the series) has been extended with two new services for community detection and thematic analysis, integrated with five existing services to support community-based GLOBAL search.

CommunityService – Graph Analytics Foundation

The new CommunityService provides the backbone for community detection, using Neo4j’s Graph Data Science (GDS) library to identify natural groupings within the entity graph. The service supports the following capabilities:

  • Neo4j GDS Integration: The Louvain algorithm in the GDS library detects communities on in-memory graph projections.
  • Hierarchical Detection: The algorithm is configured to detect multi-level community structures (levels 0-3) with parent-child relationships.
  • CRUD Operations: The service supports complete lifecycle management for community entities and hierarchies.
  • AI-Powered Analysis: Community theme, summary, and description are generated using the Claude API.

The service creates graph projections from existing Entity nodes and their relationships (such as MENTIONED_WITH, WORKED_AT, etc.), applies the Louvain algorithm with configurable resolution parameters (0.1-5.0), and then structures detected communities hierarchically in Neo4j.

CommunityEmbeddingService – Vector Space Integration

The second new service, CommunityEmbeddingService, transforms communities into searchable vector representations, which are stored as the 5th vector space collection in ChromaDB. The service supports the following capabilities:

  • VoyageAI Integration: The voyage-3 model generates a 1024-dimensional embedding vector from the AI-generated community summary, theme, and keywords.
  • ChromaDB Management: Generated vectors are stored in the community_embeddings collection with metadata to support filtering. Embeddings can be generated efficiently for multiple communities.
  • Similarity Search: Community retrieval supports filtering by level and size.

Each community is embedded based on its AI-generated theme, summary, keywords, and entity descriptions, enabling semantic search across community structures rather than individual documents.
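As a rough illustration of how the text for a community embedding might be assembled before it is sent to the embedding model, here is a minimal sketch. The function name, the field labels, and the `max_entities` cap are assumptions made for this example and are not taken from the actual CommunityEmbeddingService.

```python
from typing import Iterable


def build_community_text(theme: str, summary: str,
                         keywords: Iterable[str],
                         entity_descriptions: Iterable[str],
                         max_entities: int = 10) -> str:
    """Combine community fields into one string to pass to an embedding model."""
    parts = [f"Theme: {theme.strip()}", f"Summary: {summary.strip()}"]
    cleaned_keywords = [k.strip() for k in keywords if k.strip()]
    if cleaned_keywords:
        parts.append("Keywords: " + ", ".join(cleaned_keywords))
    # Cap the number of entity descriptions so the text stays compact
    # (the limit of 10 is an assumed value, not one from the article).
    descriptions = [d.strip() for d in entity_descriptions if d.strip()]
    if descriptions:
        parts.append("Entities: " + "; ".join(descriptions[:max_entities]))
    return "\n".join(parts)


text = build_community_text(
    theme="AI Safety Research",
    summary="Researchers and labs working on alignment.",
    keywords=["alignment", "interpretability"],
    entity_descriptions=["Jane Doe (Person): alignment researcher"],
)
print(text)
```

Putting the theme and summary first keeps the most discriminative text at the start of the input, which is a design choice of this sketch rather than a documented requirement.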

Enhanced Existing Services

HybridRetrievalService: 7th Retrieval Strategy

A new global_with_communities search strategy has been added as the 7th retrieval approach in the GraphRAG orchestration subsystem. It searches the community_embeddings collection for thematically relevant communities and then aggregates entities and relationships within the matched communities. If communities are not available, it falls back to document summaries.

The enhancement maintains backward compatibility with all six existing strategies while enabling intelligent community-based GLOBAL query processing.

SearchService: Community Search Capabilities

Added support for semantic search across community embedding vectors, with metadata filtering by hierarchy level and community size. The new method is integrated with the existing multi-vector search architecture and provides community-aware result ranking and scoring.

ContextFormatterService: Community Context Formatting

Added a new method that transforms community search results into structured context with the community theme, up to 5 entity samples, and a hierarchy presentation. The Markdown report generation option was enhanced to include community representation.

QueryRouter: Enhanced Query Classification

Updated the Claude AI prompts and classification logic to recognize community-based GLOBAL queries. Query analysis now includes community detection patterns, which route matching queries to the new global_with_communities search strategy with community-aware retrieval.

MultiVectorServiceManager: 5th Vector Collection

Integrated community embedding support into the unified multi-vector management system in GraphRAG, including community embedding generation, regeneration with force refresh, and a consistent interface across all five vector collections.

Community Data Flow

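The integration of the new services and the enhancements to the existing services create a seven-step flow from graph structure to searchable embedding vectors. The individual steps are described in the Community Detection Workflow section below.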

This approach maintains the separation of concerns established earlier in the project while adding powerful community-based capabilities that enhance GLOBAL search coherence and thematic organization.

Community Detection

Community detection identifies natural groupings of entities within the knowledge graph—clusters where entities share dense interconnections and weak ties to other clusters. Unlike arbitrary grouping by document or entity type, communities emerge from the actual relationship patterns in your data, revealing thematic structures that span multiple documents.

Community Data Model

To define the community structure and metadata, we introduced a set of Pydantic models:

These data models enable rich metadata storage in both Neo4j (graph relationships) and ChromaDB (vector metadata) and thus, supporting sophisticated filtering and search operations.
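To give a feel for what such a model could contain, here is a minimal sketch. It uses standard-library dataclasses instead of Pydantic so that it runs without extra dependencies, and every field name is hypothetical rather than taken from the project's actual models. The `vector_metadata` helper assumes the vector store accepts only flat scalar metadata values, which is why keywords are joined into one string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Community:
    """Hypothetical community record; field names are illustrative only."""
    community_id: str
    level: int                       # hierarchy level, 0 (coarse) to 3 (granular)
    entity_ids: List[str] = field(default_factory=list)
    theme: Optional[str] = None      # filled in by AI analysis when enabled
    summary: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    parent_id: Optional[str] = None  # target of a PARENT_COMMUNITY link, if any

    def __post_init__(self) -> None:
        # The article describes hierarchy levels 0-3.
        if not 0 <= self.level <= 3:
            raise ValueError("level must be between 0 and 3")

    @property
    def size(self) -> int:
        return len(self.entity_ids)

    def vector_metadata(self) -> dict:
        """Flat, scalar-only metadata to store next to the embedding."""
        return {
            "level": self.level,
            "size": self.size,
            "theme": self.theme or "",
            "keywords": ", ".join(self.keywords),
        }


community = Community("c1", level=1, entity_ids=["e1", "e2"],
                      theme="Quantum Computing",
                      keywords=["qubits", "error correction"])
print(community.vector_metadata())
```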

Graph Projection and Louvain Algorithm

Community detection begins with creating an in-memory graph projection in Neo4j using the Graph Data Science (GDS) library. The projection includes Entity nodes connected through five relationship types: MENTIONED_WITH (co-occurrence relationship), SEMANTIC_SIMILARITY (AI-detected semantic relationship), WORKED_AT (Person and Organization relationship), RELATED_TO (general entity associations), and CONNECTS_TO (cross-document entity connections).

Next, the Louvain algorithm analyzes this projection to detect communities by maximizing modularity, a measure of how well-separated communities are from each other. The algorithm iteratively moves nodes between communities to optimize modularity, resulting in natural groupings where intra-community connections are dense and inter-community connections are sparse.
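To make the modularity objective concrete, here is a small, self-contained sketch that scores a given partition of a weighted undirected graph. It is not the GDS implementation of Louvain, only the quantity that Louvain tries to maximize. The function name and the toy graph are illustrative, and the `resolution` argument follows the common generalized-modularity convention, which I am assuming corresponds to the resolution parameter mentioned earlier.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def modularity(edges: Iterable[Tuple[str, str, float]],
               membership: Dict[str, int],
               resolution: float = 1.0) -> float:
    """Modularity of a partition of a weighted, undirected graph.

    Q = sum over communities c of
        internal_weight(c) / m - resolution * (degree(c) / (2 * m)) ** 2
    where m is the total edge weight of the graph.
    """
    total_weight = 0.0
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed node degree per community
    for u, v, w in edges:
        total_weight += w
        degree[membership[u]] += w
        degree[membership[v]] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w
    if total_weight == 0:
        return 0.0
    return sum(
        internal[c] / total_weight
        - resolution * (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )


# Toy graph: two triangles joined by a single bridge edge (c-d).
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
         ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
         ("c", "d", 1.0)]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {node: 0 for node in split}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(q_split > q_single)
```

Splitting the graph along the bridge scores higher than placing every node in one community, which is the kind of improvement Louvain searches for when it moves nodes between communities.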

Neo4j Community Graph Structure

Detected communities are persisted in Neo4j as first-class nodes with descriptive metadata and rich relationship modeling:

Relationship Types:

1. HAS_MEMBER – links Community to Entity nodes

2. PARENT_COMMUNITY – creates a hierarchical structure

3. SHARES_ENTITIES – connects overlapping communities (Louvain typically doesn’t create these)

4. RELATED_THEME – AI-detected thematic relationships

5. EVOLVED_FROM – tracks community changes across subsequent detection runs

This schema enables complex graph queries like “Find all communities containing Person entities who worked at Technology companies” or “Show me the parent communities for all Climate Science communities.”

Hierarchical Community Generation

The Louvain algorithm can produce hierarchical community structures through recursive application at increasing resolutions:

Level 0 (Coarse): broad thematic communities (10-100+ entities, resolution: 0.5-1.0), for example: “Healthcare Technology”, “Climate Research”, “Financial Services”

Level 1 (Medium): sub-themes within broader communities (20-50 entities, resolution: 1.0-2.0), for example: within the “Healthcare Technology” community → “Medical Imaging AI”, “Drug Discovery”, “Clinical Diagnostics”

Level 2 (Fine): specific topic clusters (5-20 entities, resolution: 2.0-3.0), for example: within the “Medical Imaging AI” community → “Radiology Networks”, “Pathology Systems”, “Diagnostic Tools”

Level 3 (Granular): narrow focus areas (2-10 entities, resolution: 3.0-5.0), for example: within the “Radiology Networks” community → “CNN Architectures”, “Transfer Learning”

The system creates `PARENT_COMMUNITY` relationships automatically by detecting entity overlap between levels. A Level 1 community becomes a child of a Level 0 community if they share ≥30% of entities.
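The overlap rule can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the function name is hypothetical, overlap is measured as the share of the child's entities that also appear in the candidate parent (the text does not say which side is the denominator), and the best-overlapping parent wins when several qualify.

```python
from typing import Dict, Optional, Set


def assign_parents(children: Dict[str, Set[str]],
                   parents: Dict[str, Set[str]],
                   min_overlap: float = 0.30) -> Dict[str, Optional[str]]:
    """Map each child community to its best-overlapping parent, or None."""
    result: Dict[str, Optional[str]] = {}
    for child_id, child_entities in children.items():
        best_id, best_share = None, 0.0
        for parent_id, parent_entities in parents.items():
            if not child_entities:
                continue
            # Assumed definition: fraction of the child's entities found in the parent.
            share = len(child_entities & parent_entities) / len(child_entities)
            if share > best_share:
                best_id, best_share = parent_id, share
        result[child_id] = best_id if best_share >= min_overlap else None
    return result


level0 = {
    "L0-health": {"a", "b", "c", "d", "e", "f"},
    "L0-climate": {"g", "h", "i"},
}
level1 = {
    "L1-imaging": {"a", "b", "c"},      # fully contained in L0-health
    "L1-mixed": {"f", "x", "y", "z"},   # only 25% overlap with L0-health
}
hierarchy = assign_parents(level1, level0)
print(hierarchy)
```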

Here is an example of 2- and 3-level hierarchical communities detected in the test repository:

Community Detection Workflow

The end-to-end detection pipeline coordinates multiple services and AI models:

Step 1: Graph Projection Creation – CommunityService queries Neo4j for Entity nodes and relationships and calls GDS to create an in-memory, weighted, undirected graph.

Step 2: Louvain Algorithm Execution – the algorithm runs iteratively for each hierarchy level (0-3) to identify optimal community assignments based on modularity criteria.

Step 3: Neo4j Community Storage – Community nodes are created in Neo4j with initial metadata. Next, we establish [:HAS_MEMBER] relationships to entities and build [:PARENT_COMMUNITY] hierarchical relationships.

Step 4: AI Theme Analysis – we use Claude to analyze entity names, types, and descriptions within each community. Claude generates concise themes (2-4 words), such as “AI Safety Research” or “Quantum Computing”, and produces summaries (2-3 sentences) explaining community coherence. It also extracts 5-8 relevant keywords representing domain concepts for a community.

Step 5: Vector Embedding Generation – CommunityEmbeddingService constructs rich text by combining theme, summary, keywords, and entity info. It then uses the VoyageAI voyage-3 model to generate 1024-dimensional embedding vectors that capture the community's semantic meaning for similarity search.

Step 6: ChromaDB Storage – Community embedding vectors are stored in the dedicated community_embeddings collection. Vector metadata includes level, size, theme, keywords, and entity types, which allows filtering such as “Find Level 1 communities with >20 members about ‘machine learning’”.

Step 7: Search Integration – HybridRetrievalService was extended to route GLOBAL queries to community search, and the QueryRouter gained the ability to detect community-appropriate queries.

AI-based community analysis is optional; when enabled, it may take up to 10 seconds to analyze a large community. Because community detection only needs to run periodically rather than on every query, this cost is not a showstopper even when AI analysis is always enabled.

Search and Hybrid Retrieval Integration

To enable community-based search capabilities we extended the hybrid retrieval orchestration service by introducing the 7th retrieval strategy in addition to the existing six strategies. This enhancement enables GLOBAL queries to leverage thematic community structures rather than relying solely on document-level summaries.

Community Similarity Search

The new search_communities method in the SearchService provides flexible community retrieval with semantic similarity and metadata filtering:

Core Search Parameters:

Search Process Flow:

  1. Query Embedding: VoyageAI voyage-3 converts the query into a 1024-dimensional vector
  2. Vector Search: ChromaDB performs cosine similarity search across community_embeddings collection
  3. Metadata Filtering: results are filtered by level, size, and quality constraints
  4. Ranking: matching communities are ranked by similarity score and secondary quality metrics
  5. Result Enrichment: finally, entity samples and statistics are added from Neo4j
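The scoring, filtering, and ranking steps above can be sketched without any external services. In this illustration, plain Python lists stand in for the ChromaDB collection, 3-dimensional toy vectors stand in for the 1024-dimensional voyage-3 embeddings, and the function and parameter names (`search_communities_sketch`, `level`, `min_size`, `top_k`) are hypothetical. Query embedding and Neo4j enrichment are left out.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_communities_sketch(query_vec: Sequence[float],
                              communities: List[Dict],
                              level: Optional[int] = None,
                              min_size: int = 0,
                              top_k: int = 5) -> List[Tuple[str, float]]:
    # Vector search: score every stored community against the query.
    scored = [(c, cosine(query_vec, c["vector"])) for c in communities]
    # Metadata filtering: keep only the requested level and minimum size.
    kept = [(c["id"], score) for c, score in scored
            if (level is None or c["level"] == level) and c["size"] >= min_size]
    # Ranking: highest similarity first.
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:top_k]


communities = [
    {"id": "c1", "level": 1, "size": 25, "vector": [1.0, 0.0, 0.0]},
    {"id": "c2", "level": 1, "size": 8, "vector": [0.9, 0.1, 0.0]},
    {"id": "c3", "level": 0, "size": 60, "vector": [0.0, 1.0, 0.0]},
    {"id": "c4", "level": 1, "size": 30, "vector": [0.6, 0.8, 0.0]},
]
hits = search_communities_sketch([1.0, 0.0, 0.0], communities, level=1, min_size=20)
print([community_id for community_id, _ in hits])
```

A real vector store would typically apply the metadata filter inside the query itself instead of scoring every record first; the exhaustive loop here is only meant to keep the example readable.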

Example Search Query:

The metadata filtering enables precise control: “Find all Level 1 sub-communities within the Healthcare domain that contain at least 10 Person entities”.

QueryRouter Community Integration

The QueryRouter now recognizes queries suited to community-based retrieval and routes them to the global_with_communities strategy automatically. The enhanced Claude classification prompt includes community detection patterns:

The router identifies the following community patterns in user queries that request thematic analysis:

  1. Pattern 1: Explicit Theme Requests → “What are the main research themes?”
  2. Pattern 2: Trend Analysis → “What trends emerge across documents?”
  3. Pattern 3: Topic Clustering → “What major topics are covered?”
  4. Pattern 4: Domain Overview → “Give me an overview of the AI research landscape”
  5. Pattern 5: Comparative Themes → “What different approaches exist?”

Routing Decision:

This ensures GLOBAL queries automatically leverage community structures when available, falling back to document summaries if communities haven’t been detected yet.
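The router itself relies on Claude for classification. Purely to illustrate the five patterns listed above, here is a keyword-based stand-in that is deliberately much cruder than an LLM classifier. The pattern labels and regular expressions are my own assumptions, and a heuristic like this would misfire on many real queries.

```python
import re
from typing import Optional

# Hypothetical labels and regexes, one per pattern from the list above.
COMMUNITY_PATTERNS = [
    ("explicit_theme", re.compile(r"\bthemes?\b", re.IGNORECASE)),
    ("trend_analysis", re.compile(r"\btrends?\b", re.IGNORECASE)),
    ("topic_clustering", re.compile(r"\b(major|main)\s+topics?\b", re.IGNORECASE)),
    ("domain_overview", re.compile(r"\b(overview|landscape)\b", re.IGNORECASE)),
    ("comparative_themes", re.compile(r"\bdifferent\s+approaches\b", re.IGNORECASE)),
]


def community_pattern(query: str) -> Optional[str]:
    """Return the first matching pattern label, or None for other queries."""
    for label, pattern in COMMUNITY_PATTERNS:
        if pattern.search(query):
            return label
    return None


for question in ["What are the main research themes?",
                 "What trends emerge across documents?",
                 "What year did X occur?"]:
    print(question, "->", community_pattern(question))
```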

Community-Based GLOBAL Query Processing

The new _global_with_communities() method in the HybridRetrievalService orchestrates community-based GLOBAL search using the following processing pipeline:

The fallback logic is used if communities are not detected yet or no communities match the query:

This ensures graceful degradation when community data is unavailable.
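A minimal sketch of that fallback behavior might look like the following. The two search callables are injected so the example stays self-contained; their names, the result shape, and the "document_summaries" label are assumptions for illustration and do not reflect the actual method signature.

```python
from typing import Callable, Dict, List


def global_search_with_fallback(
        query: str,
        search_communities: Callable[[str], List[dict]],
        search_summaries: Callable[[str], List[dict]]) -> Dict[str, object]:
    """Prefer community results; fall back to document summaries if none."""
    communities = search_communities(query)
    if communities:
        return {"strategy": "global_with_communities", "results": communities}
    # No communities detected yet, or none matched the query.
    return {"strategy": "document_summaries", "results": search_summaries(query)}


# Stub searchers standing in for the real vector searches.
with_data = global_search_with_fallback(
    "main research themes",
    search_communities=lambda q: [{"id": "c1", "theme": "AI Safety Research"}],
    search_summaries=lambda q: [{"doc": "d1"}],
)
without_data = global_search_with_fallback(
    "main research themes",
    search_communities=lambda q: [],
    search_summaries=lambda q: [{"doc": "d1"}],
)
print(with_data["strategy"], without_data["strategy"])
```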

Community Theme Presentation

The new format_global_community_context() method in the ContextFormatterService creates rich, structured presentations of community search results. Below is an example of the markdown presentation template that is used to generate a report:

**Entity Sampling Strategy:**

To keep context manageable, the presentation generation service samples up to 5 representative entities per community using the following pattern:

  1. Highest-degree entities (entities with the most connections)
  2. Diverse entity types (a representative mix of Person, Organization, Technology, etc.)
  3. Core community members

This provides a representative snapshot without overwhelming the AI with hundreds of entity names.
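One way to combine the degree and diversity criteria is a two-pass selection, sketched below. The function name and the dictionary keys (`name`, `type`, `degree`) are hypothetical, and the "core community members" criterion is not modeled because the text does not define how core membership is measured.

```python
from typing import Dict, List


def sample_entities(entities: List[Dict], limit: int = 5) -> List[Dict]:
    """Pick up to `limit` entities, favoring type diversity, then degree."""
    ranked = sorted(entities, key=lambda e: e["degree"], reverse=True)
    picked: List[Dict] = []
    picked_names = set()
    seen_types = set()
    # First pass: the best-connected entity of each type, for diversity.
    for entity in ranked:
        if len(picked) == limit:
            break
        if entity["type"] not in seen_types:
            picked.append(entity)
            picked_names.add(entity["name"])
            seen_types.add(entity["type"])
    # Second pass: fill the remaining slots with the highest-degree leftovers.
    for entity in ranked:
        if len(picked) == limit:
            break
        if entity["name"] not in picked_names:
            picked.append(entity)
            picked_names.add(entity["name"])
    return picked


members = [
    {"name": "A", "type": "Person", "degree": 12},
    {"name": "B", "type": "Person", "degree": 9},
    {"name": "C", "type": "Organization", "degree": 7},
    {"name": "D", "type": "Technology", "degree": 3},
    {"name": "E", "type": "Person", "degree": 8},
    {"name": "F", "type": "Organization", "degree": 2},
    {"name": "G", "type": "Person", "degree": 1},
]
sample = sample_entities(members)
print([e["name"] for e in sample])
```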

Hybrid Retrieval Integration

The community strategy has been integrated with the six existing strategies:

| Strategy | Vector Spaces | Graph Queries | Use Case |
|---|---|---|---|
| entity_centric_with_relations | entities, relations | Multi-hop traversal | “Tell me about X” |
| relation_centric_graph_first | relations | Relationship-first | “How are X and Y connected?” |
| summary_based_parallel | summaries | Document metadata | “Quick overview of topic” |
| multi_vector_balanced | All 4 collections | Entity lookup | “Compare X vs Y” |
| chunk_centric_precise | chunks | Minimal graph | “What year did X occur?” |

The orchestration service implements the same interface for all search strategies:

This enables the conversational AI service to consume community results in the same way as other retrieval strategies.
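The idea of a shared interface can be illustrated with a small dispatch table. Everything in this sketch is hypothetical: the `RetrievalResult` shape, the stub strategy functions, and the `retrieve` helper are stand-ins for whatever the orchestration service actually defines.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RetrievalResult:
    """Hypothetical common result shape returned by every strategy."""
    strategy: str
    context: str
    sources: List[str] = field(default_factory=list)


def entity_centric(query: str) -> RetrievalResult:
    # Placeholder body; a real strategy would query vectors and the graph.
    return RetrievalResult("entity_centric_with_relations",
                           f"entities for: {query}")


def global_communities(query: str) -> RetrievalResult:
    # Placeholder body; a real strategy would search community embeddings.
    return RetrievalResult("global_with_communities",
                           f"communities for: {query}",
                           sources=["community:c1"])


STRATEGIES: Dict[str, Callable[[str], RetrievalResult]] = {
    "entity_centric_with_relations": entity_centric,
    "global_with_communities": global_communities,
}


def retrieve(strategy_name: str, query: str) -> RetrievalResult:
    """Dispatch by name; callers need no strategy-specific handling."""
    return STRATEGIES[strategy_name](query)


result = retrieve("global_with_communities", "main research themes")
print(result.strategy, result.sources)
```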

Community REST API Endpoints

In this section we’ll talk about six new REST API endpoints for community detection, querying, and analysis that were introduced in addition to the existing 36 document and query endpoints.

POST /api/communities/detect – Community Detection Trigger

This endpoint initiates community detection on the current knowledge graph, applying the Louvain algorithm across all hierarchy levels.

Using AI analysis to generate the community theme, summary, and description is a prerequisite for good-quality embedding vectors for GLOBAL search. For a community with 100+ entities, processing may take 10+ seconds with AI enabled. Therefore, if your priority is to check what types of communities and hierarchies exist in the repository, you can run detection with AI disabled, which will still generate names and themes.
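As an illustration of how a client might prepare a request body for this endpoint, here is a small sketch that validates the parameters against the ranges mentioned earlier (resolution 0.1-5.0, hierarchy levels 0-3) before serializing them. The JSON field names (`resolution`, `max_level`, `use_ai_analysis`) are assumptions and may not match the real API contract.

```python
import json


def build_detect_payload(resolution: float = 1.0,
                         max_level: int = 3,
                         use_ai_analysis: bool = True) -> str:
    """Validate detection settings and serialize them as a JSON body."""
    if not 0.1 <= resolution <= 5.0:
        raise ValueError("resolution must be within 0.1-5.0")
    if not 0 <= max_level <= 3:
        raise ValueError("max_level must be within 0-3")
    # Field names below are hypothetical, not the documented request schema.
    return json.dumps({
        "resolution": resolution,
        "max_level": max_level,
        "use_ai_analysis": use_ai_analysis,
    })


# A quick structural check with AI analysis turned off.
body = build_detect_payload(resolution=1.5, max_level=2, use_ai_analysis=False)
print(body)
```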

In response you will get a list of detected communities with detailed information, for example:

GET /api/communities – Community Listing with Pagination

Retrieves a list of existing communities with filtering, pagination, and sorting options.

GET /api/communities/{id} – Detailed Community Information

This endpoint retrieves comprehensive details for a specific community:

Response includes entity samples and relationships, for example:

Note that entities (with their aliases and descriptions), relationships, and community summaries, keywords, and descriptions are all extracted using Claude AI.

POST /api/communities/global-summary – GLOBAL Search with Communities

This endpoint executes community-based GLOBAL search queries, leveraging thematic structures for comprehensive analysis.

The response in markdown form looks like:

GET /api/communities/{id}/entities – Community Member Entities

This endpoint retrieves all entities belonging to a specific community with detailed metadata.

GET /api/communities/statistics – Community Analytics

This endpoint provides comprehensive statistics and analytics across all detected communities.

Response includes summary information about detected communities, their average size, quality, document coverage, and hierarchies.

Conclusion

The introduction of hierarchical community detection powered by Neo4j’s Graph Data Science library transformed how GraphRAG handles GLOBAL queries. Rather than relying solely on document-level summaries, the system can now identify natural thematic clusters in the knowledge graph: groups of entities that frequently appear together and share semantic relationships across multiple documents. This shift from document-centric to theme-centric analysis delivers more coherent, insightful responses to queries like “What are the major research themes?” or “What trends emerge across my document collection?”

GraphRAG now includes two new services, CommunityService and CommunityEmbeddingService, that are integrated into the existing infrastructure to add the 5th vector collection (community_embeddings) and enable the 7th retrieval strategy (global_with_communities). The architectural changes maintain the same design patterns: separation of concerns between detection and embedding, Pydantic data models, intelligent fallback logic, and robust REST APIs with full validation and monitoring.

The six new REST API endpoints allow building complex retrieval applications on the GraphRAG platform. From triggering detection with configurable parameters (POST /api/communities/detect) to executing sophisticated GLOBAL searches (POST /api/communities/global-summary), the API interface enables rich integration scenarios including analytics dashboards, automated theme monitoring, and intelligent content navigation systems.

Looking Ahead: Schema-Driven Ingestion

While Parts 1-4 of this series focused on processing unstructured documents (PDFs, web pages) through AI-powered text extraction and entity recognition, real-world GraphRAG deployments often need the ability to ingest structured data from external systems, such as databases, ERP, or CX apps. In Part 5 we will talk about a schema-driven ingestion pipeline that bridges this gap, enabling the GraphRAG system to process structured documents by combining deterministic steps (defined by the schema) with probabilistic steps (AI analysis).

This approach delivers the best of both worlds. For a support ticket, deterministic processing ensures the ‘customer_name’ field always creates an Organization node and the ‘assigned_to’ field always creates a correct ASSIGNED_TO relationship to the relevant Person entity, while probabilistic AI analysis of the ticket description discovers mentions of technologies, related tickets, or affected systems that aren’t captured in structured fields.

Stay tuned to learn how we extended GraphRAG beyond processing articles and web pages into the realm of structured data integration, the final piece for building an intelligent platform that synthesizes insights across your entire information ecosystem.


Read the complete GraphRAG series:

  1. Building a GraphRAG System – Core Infrastructure & Document Ingestion
  2. GraphRAG Part 2 – Cross-Doc & Sub-graph Extraction, Multi-Vector Entity Representation
  3. GraphRAG Part 3 – Intelligent MVR, Query Routing and Context Generation
