1. Overview
The IngestionDocs workflow is a fully automated document
ingestion and knowledge management system built with n8n. Its
purpose is to continuously ingest organizational documents
from Google Drive , transform them into vector embeddings using
OpenAI , store them in Pinecone , and make them searchable and
retrievable through an AI-powered Q &A interface.
This ensures that employees always have access to the most up-to-date
knowledge base without requiring manual intervention.
2. Key Objectives
- Automated Ingestion → Seamlessly process new and updated
documents from Google Drive.\
- Change Detection → Track and differentiate between new, updated,
and previously processed documents.\
- Knowledge Base Construction → Convert documents into embeddings
for semantic search.\
- AI-Powered Assistance → Provide an intelligent Q&A system for
employees to query manuals.\
- Scalable & Maintainable → Modular design using n8n, LangChain,
and Pinecone.
3. Workflow Breakdown
A. Document Monitoring and Retrieval
- The workflow begins with two Google Drive triggers :
- File Created Trigger → Fires when a new document is
uploaded.\
- File Updated Trigger → Fires when an existing document is
modified.\
- A search operation lists the files in the designated Google
Drive folder.\
- Non-downloadable items (e.g., subfolders) are filtered out.\
- For valid files:
- The file is downloaded.\
- A SHA256 hash is generated to uniquely identify the file's
content.
B. Record Management (Google Sheets Integration)
To keep track of ingestion states, the workflow uses a Google
Sheets--based Record Manager:\
- Each file entry contains:\
- Id (Google Drive file ID)\
- Name (file name)\
- hashId (SHA256 checksum)\
- The workflow compares the current file's hash with the stored one:\
- New Document → File not found in records → Inserted into the
Record Manager.\
- Already Processed → File exists and hash matches → Skipped.\
- Updated Document → File exists but hash differs → Record is
updated.
This guarantees that only new or modified content is processed, avoiding
duplication.
C. Document Processing and Vectorization
Once a document is marked as new or updated:\
- Default Data Loader extracts its content (binary files
supported).\
- Pages are split into individual chunks.\
- Metadata such as file ID and name are attached.\
- Recursive Character Text Splitter divides the content into
manageable segments with overlap.\
- OpenAI Embeddings (
text-embedding-3-large
) transform each text
chunk into a semantic vector.\
- Pinecone Vector Store stores these vectors in the configured
index:\
- For new documents , embeddings are inserted into a namespace based
on the file name.\
- For updated documents , the namespace is cleared first, then
re-ingested with fresh embeddings.
This process builds a scalable and queryable knowledge base.
D. Knowledge Base Q &A Interface
The workflow also provides an interactive form-based user
interface:\
- Form Trigger → Collects employee questions.\
- LangChain AI Agent :\
- Receives the question.\
- Retrieves relevant context from Pinecone using vector similarity
search.\
- Processes the response using OpenAI Chat Model (
gpt-4.1-mini
).\
- Answer Formatting :\
- Responses are returned in HTML format for readability.\
- A custom CSS theme ensures a modern, user-friendly design.\
- Answers may include references to page numbers when available.
This creates a self-service knowledge base assistant that employees
can query in natural language.
4. Technologies Used
- n8n → Orchestration of the entire workflow.\
- Google Drive API → File monitoring, listing, and downloading.\
- Google Sheets API → Record manager for tracking file states.\
- OpenAI API :
- text-embedding-3-large for semantic vector creation.\
- gpt-4.1-mini for conversational Q&A.\
- Pinecone → Vector database for embedding storage and retrieval.\
- LangChain → Document loaders, text splitters, vector store
connectors, and agent logic.\
- Crypto (SHA256) → File hash generation for change detection.\
- Form Trigger + Form Node → Employee-facing Q&A submission and
answer display.\
- Custom CSS → Provides a modern, responsive, styled UI for the
knowledge base.
5. End-to-End Data Flow
- Employee uploads or updates a document → Google Drive detects
the change.\
- Workflow downloads and hashes the file → Ensures uniqueness and
detects modifications.\
- Record Manager (Google Sheets) → Decides whether to skip,
insert, or update the record.\
- Document Processing → Splitting + Embedding + Storing into
Pinecone.\
- Knowledge Base Updated → The latest version of documents is
indexed.\
- Employee asks a question via the web form.\
- AI Agent retrieves embeddings from Pinecone + uses GPT-4.1-mini
→ Generates a contextual answer.\
- Answer displayed in styled HTML → Delivered back to the employee
through the form interface.
6. Benefits
- Always Up-to-Date → Automatically syncs documents when uploaded
or changed.\
- No Duplicates → Smart hashing ensures only relevant updates are
reprocessed.\
- Searchable Knowledge Base → Employees can query documents
semantically, not just by keywords.\
- Enhanced Productivity → Answers are immediate, reducing time
spent browsing manuals.\
- Scalable → New documents and users can be added without workflow
redesign.
✅ In summary, IngestionDocs is a robust AI-driven
document ingestion and retrieval system that integrates Google
Drive, Google Sheets, OpenAI, and Pinecone within n8n. It
continuously builds and maintains a knowledge base of manuals while
offering employees an intelligent, user-friendly Q &A assistant for
fast and accurate knowledge retrieval.