Feature Specification: Knowledge Base Management
1. Overview & Vision
Knowledge Base Management is the "Brain" of the AI Assistant. It allows organization admins to curate and maintain the private data sources that power the AI's responses. By transforming documents and links into a searchable vector store, it ensures that the AI's intelligence is always grounded in the latest organizational context.
2. Personas & Stakeholders
| Persona | Goal |
|---|---|
| Knowledge Admin | Add, remove, and refresh data sources to keep the AI updated. |
| Org Admin | Monitor storage usage and audit the organization's knowledge perimeter. |
| Developer | Integrate new data connectors (e.g., custom API indexing). |
3. User Stories
- As an admin, I want to link our "Employee Handbook" URL so the AI can answer policy questions.
- As an admin, I want to see the "Sync Status" of a Drive folder to ensure the latest files are indexed.
- As an admin, I want to delete outdated sources to prevent the AI from giving obsolete information.
4. Functional Requirements (FR)
- REQ-KB-001: Support for multiple source types: Web URLs, Drive Files, and Document App Pages.
- REQ-KB-002: Real-time indexing progress tracking (Extraction, Chunking, Embedding).
- REQ-KB-003: Automated daily sync for active sources.
- REQ-KB-004: Usage analytics showing total chunks and tokens used per source.
5. Non-Functional Requirements (NFR)
- Scalability: Support for indexing up to 10,000 documents per organization.
- Accuracy: 100% text extraction integrity from supported formats (PDF, DOCX, HTML).
- Security: Zero leakage of vector data between organizations.
6. Business Logic & Rules
- Indexing Pipeline: Extraction → Cleaning → Chunking (500 tokens) → Embedding (1536 dim) → Persistence.
- Update Logic: Re-indexing a source replaces all previous vector chunks associated with its ID.
- Fail-safe: Indexing errors are logged, and the source is marked with an "Error" status for admin review.
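The three rules above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: it assumes whitespace-separated words stand in for real tokenizer tokens, `fake_embed` is a placeholder for the embedding API, and plain dicts stand in for `kb_chunks` and the source status column. All function and variable names are hypothetical.

```python
import logging
from typing import Callable, Dict, List

CHUNK_TOKENS = 500   # chunk size from the pipeline rule
EMBED_DIM = 1536     # embedding dimensionality from the pipeline rule

logger = logging.getLogger("kb.indexing")


def chunk_tokens(tokens: List[str], size: int = CHUNK_TOKENS) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def fake_embed(text: str) -> List[float]:
    """Placeholder embedder (assumption): returns a zero vector of EMBED_DIM."""
    return [0.0] * EMBED_DIM


def index_source(
    store: Dict[str, List[dict]],
    status: Dict[str, str],
    source_id: str,
    raw_text: str,
    embed: Callable[[str], List[float]] = fake_embed,
) -> None:
    """Run Cleaning -> Chunking -> Embedding -> Persistence for one source."""
    status[source_id] = "Indexing"
    try:
        # Cleaning + tokenization: split on any whitespace (assumption).
        tokens = raw_text.split()
        chunks = []
        for position, piece in enumerate(chunk_tokens(tokens)):
            text = " ".join(piece)
            chunks.append({"position": position, "text": text, "embedding": embed(text)})
        # Update Logic: assignment replaces every chunk previously stored for this ID.
        store[source_id] = chunks
        status[source_id] = "Synced"
    except Exception as exc:
        # Fail-safe: log the error and flag the source for admin review.
        logger.error("indexing failed for %s: %s", source_id, exc)
        status[source_id] = "Error"
```

In this sketch the old chunks are only replaced once every new chunk has been embedded, so a failed re-index leaves the previous chunks in place. The specification does not say which behavior is intended on failure; that choice is an assumption here.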
7. User Interface (UI/UX)
- Source List: Table view with Status badges (Indexing, Synced, Error, Paused).
- Add Source Modal: Type selection (URL/Drive/Doc) with validation.
- Detail View: Side-panel showing chunk statistics and last sync timestamp.
8. Information Architecture
- "Knowledge Base" section in the AI Assistant sidebar.
- Link to "KB Settings" for organization administrators.
9. Data Model & Persistence
- Table: `kb_sources` (Registry).
- Table: `kb_chunks` (Vector Store with `pgvector`).
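As a rough illustration of what the two tables might hold, the sketch below models one row of each as a Python dataclass. The specification names only the tables and the use of pgvector, so every column name here (`org_id`, `source_type`, `location`, `position`, `token_count`, and so on) is an assumption, chosen to match the status badges, usage analytics, and organization isolation described elsewhere in this document.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

EMBED_DIM = 1536  # embedding size stated in the indexing pipeline


@dataclass
class KbSource:
    """One row of kb_sources (the registry). Column names are assumptions."""
    id: str
    org_id: str
    source_type: str                      # "url", "drive", or "doc"
    location: str                         # URL, Drive file ID, or page ID
    status: str = "Indexing"              # Indexing | Synced | Error | Paused
    last_synced_at: Optional[datetime] = None


@dataclass
class KbChunk:
    """One row of kb_chunks (the vector store). Column names are assumptions."""
    source_id: str
    org_id: str
    position: int
    content: str
    token_count: int
    embedding: List[float]

    def __post_init__(self) -> None:
        # Reject vectors that do not match the configured dimensionality.
        if len(self.embedding) != EMBED_DIM:
            raise ValueError(
                f"expected {EMBED_DIM} dimensions, got {len(self.embedding)}"
            )
```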
10. API & Service Layer
- `POST /sources` (Initiates indexing).
- `GET /sources` (List registry).
- `POST /sources/:id/sync` (Manual refresh).
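The behavior behind these three endpoints could look roughly like the in-memory service below. It is a sketch under stated assumptions: the class name, method names, payload fields, and the choice to scope every call by `org_id` are all hypothetical, and no web framework or database is involved.

```python
import uuid
from typing import Dict, List


class SourceService:
    """In-memory stand-in for the service behind the three endpoints."""

    def __init__(self) -> None:
        self._sources: Dict[str, dict] = {}

    def create_source(self, org_id: str, source_type: str, location: str) -> dict:
        """POST /sources: register a source and mark it as indexing."""
        source = {
            "id": str(uuid.uuid4()),
            "org_id": org_id,
            "type": source_type,
            "location": location,
            "status": "Indexing",
        }
        self._sources[source["id"]] = source
        return source

    def list_sources(self, org_id: str) -> List[dict]:
        """GET /sources: list the registry, scoped to the caller's organization."""
        return [s for s in self._sources.values() if s["org_id"] == org_id]

    def sync_source(self, org_id: str, source_id: str) -> dict:
        """POST /sources/:id/sync: start a manual refresh of one source."""
        source = self._sources.get(source_id)
        if source is None or source["org_id"] != org_id:
            # A real handler would likely translate this into HTTP 404.
            raise KeyError(source_id)
        source["status"] = "Indexing"
        return source
```

Treating a source that belongs to another organization as "not found" avoids confirming that the ID exists at all, which fits the zero-leakage requirement.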
11. Integration Patterns
- Scraper Service: Headless browser for extracting text from public/private URLs.
- Drive Link: Programmatic access to S3 objects via the Drive module's internal service.
12. Security & Permissions
- RBAC: `ai_assistant:manage` required to add or remove sources.
- Encryption: Source credentials (if any) are stored in the platform Vault.
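The RBAC rule could be enforced with a small guard that runs before any add or remove operation. This is a minimal sketch: only the permission string `ai_assistant:manage` comes from this specification, while `PermissionDenied`, `require_permission`, and `delete_source` are hypothetical names, and the caller's permissions are assumed to arrive as a plain collection of strings.

```python
from typing import Dict, Iterable

MANAGE_PERMISSION = "ai_assistant:manage"  # permission named in this spec


class PermissionDenied(Exception):
    """Raised when the caller lacks a required permission."""


def require_permission(granted: Iterable[str], required: str = MANAGE_PERMISSION) -> None:
    """Raise PermissionDenied unless `required` is among the granted permissions."""
    if required not in set(granted):
        raise PermissionDenied(required)


def delete_source(granted: Iterable[str], sources: Dict[str, dict], source_id: str) -> None:
    """Remove a source from the registry, but only for callers with manage rights."""
    require_permission(granted)
    sources.pop(source_id, None)
```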
13. Error Handling & Resilience
- Retry Mechanism: Exponential backoff for embedding API rate limits.
- Validation: Rejection of unsupported file types or malformed URLs.
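Both resilience rules are simple to sketch. The example below assumes the embedding client raises a `RateLimitError` (a hypothetical exception) when rate limited, doubles the wait after each failed attempt with a small random jitter, and validates inputs using only the standard library. The retry count, base delay, and the accepted extensions (taken from the formats listed under Accuracy) are illustrative assumptions.

```python
import os
import random
import time
from typing import Callable, TypeVar
from urllib.parse import urlparse

T = TypeVar("T")

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".html"}


class RateLimitError(Exception):
    """Hypothetical error raised by the embedding client when rate limited."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimitError, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # Up to 10% jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")


def is_valid_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host."""
    try:
        parts = urlparse(url)
    except ValueError:
        return False
    return parts.scheme in {"http", "https"} and bool(parts.netloc)


def is_supported_file(filename: str) -> bool:
    """Accept only file names whose extension is in the supported set."""
    return os.path.splitext(filename)[1].lower() in SUPPORTED_EXTENSIONS
```

Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.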
14. Performance & Scalability
- Parallel chunk processing using background worker queues (planned).
- Efficient vector indexing using HNSW for sub-200ms retrieval.
15. Globalization & i18n
- Support for Vietnamese and international character sets during extraction.
16. Accessibility (a11y)
- Accessible data tables with visible focus states on sorting and filtering controls.
17. Observability & Analytics
- Tracking of "Indexing Failures" by error type (Timeout, Auth, Format).
- Analytics on "Knowledge Density" (Chunks per Source).
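Both metrics reduce to simple counts. The sketch below assumes indexing events and chunk rows are available as dictionaries with `status`, `error_type`, and `source_id` keys; those field names and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def failures_by_type(events: Iterable[dict]) -> Dict[str, int]:
    """Count failed indexing events per error type (Timeout, Auth, Format)."""
    return dict(Counter(e["error_type"] for e in events if e["status"] == "Error"))


def knowledge_density(chunk_rows: Iterable[dict]) -> Dict[str, int]:
    """Count stored chunks per source ID."""
    return dict(Counter(row["source_id"] for row in chunk_rows))
```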
18. Testing & Quality
- Integration tests for the extraction pipeline (PDF to plaintext).
- Stress tests for large-scale indexing (100MB+ files).
19. Constraints & Assumptions
- Assumes organization has sufficient token quota for embeddings.
20. Future Enhancements
- Slack / Microsoft Teams workspace indexing.
- Manual "Chunk Editor" for fine-tuning specific AI responses.