Generative AI (GenAI) has emerged as a promising way to bridge this critical gap, offering the potential to transform vast quantities of organizational data into strategic insight. By harnessing GenAI, organizations can uncover hidden patterns, trends, and correlations in their data, enabling them to make better-informed decisions, optimize their operations, and ultimately gain a competitive edge in the marketplace.
What is Docling?
Docling processes a document in five stages:

Document Parsing: Docling begins by parsing the input document, regardless of its format (PDF, DOCX, HTML, etc.), and extracting its underlying structure and content. This involves identifying the different elements within the document, such as headings, paragraphs, tables, images, and lists, and understanding their relationships to each other.

Content Extraction: Once the document has been parsed, Docling extracts the relevant content from each element. This may involve OCR for scanned documents, text extraction from digital documents, or image processing to extract information from images.

Contextualization: Docling then analyzes the extracted content to understand its context within the document. This involves identifying the relationships between different elements and determining the overall meaning of the text.

Transformation: Docling transforms the extracted and contextualized content into a desired output format, such as Markdown or JSON. This involves structuring the data in a way that is easily consumable by GenAI applications.

Output: Finally, Docling outputs the transformed data, ready to be used for fine-tuning pipelines, RAG systems, or other GenAI applications.
Python 3.7 or higher
The pip package installer
A vector database (e.g., ChromaDB, Pinecone, Weaviate); for this tutorial, we'll use ChromaDB for simplicity
An OpenAI API key (only if using OpenAI embeddings)
pip install docling chromadb sentence-transformers pypdf
docling: for document parsing and conversion
chromadb: for the vector database
sentence-transformers: for generating embeddings
pypdf: for reading PDF files
import os
from docling.document_converter import DocumentConverter
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
import pypdf


def load_and_convert_pdf(pdf_path):
    """
    Loads a PDF document from the given path and converts it to Markdown using Docling.
    """
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    markdown_content = result.document.export_to_markdown()
    return markdown_content


def split_into_chunks(text, chunk_size=500, chunk_overlap=50):
    """
    Splits the input text into smaller, overlapping chunks.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - chunk_overlap
    return chunks


def setup_chroma(embedding_model_name="all-MiniLM-L6-v2", persist_directory="chroma"):
    """
    Sets up ChromaDB with the specified embedding function.
    """
    # Initialize the SentenceTransformer embedding function
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=embedding_model_name
    )
    # Initialize the ChromaDB client (allow_reset is required for client.reset() later)
    client = chromadb.PersistentClient(
        path=persist_directory,
        settings=chromadb.Settings(allow_reset=True),
    )
    # Create or get the collection
    collection = client.get_or_create_collection(
        name="rag_collection", embedding_function=sentence_transformer_ef
    )
    return client, collection


def embed_and_store(collection, chunks):
    """
    Generates embeddings for the text chunks and stores them in ChromaDB.
    """
    ids = [str(i) for i in range(len(chunks))]
    collection.add(
        documents=chunks,
        ids=ids
    )


def query_chroma(collection, query, n_results=5):
    """
    Queries ChromaDB with the given query and returns the most relevant results.
    """
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    return results


def main(pdf_path, query):
    """
    Main function to orchestrate the RAG pipeline.
    """
    # Load and convert the PDF
    markdown_content = load_and_convert_pdf(pdf_path)
    # Split into chunks
    chunks = split_into_chunks(markdown_content)
    # Set up ChromaDB
    client, collection = setup_chroma()
    # Embed and store the chunks
    embed_and_store(collection, chunks)
    # Query ChromaDB
    results = query_chroma(collection, query)
    # Print the results
    print("Query:", query)
    for i, (document, distance) in enumerate(
        zip(results["documents"][0], results["distances"][0])
    ):
        print(f"Result {i + 1}:")
        print(f"Document: {document}")
        print(f"Distance: {distance}")
        print("-" * 20)
    # Clean up: delete the collection and reset the client
    client.delete_collection("rag_collection")
    client.reset()


# Example Usage
if __name__ == "__main__":
    pdf_file_path = "your_document.pdf"  # Replace with the path to your PDF file
    user_query = "What is the main topic of this document?"
    main(pdf_file_path, user_query)
Loading and Converting the PDF: The load_and_convert_pdf function uses Docling to convert the PDF document into Markdown format. This involves extracting text, formatting, and structural elements from the PDF.
Splitting into Chunks: The split_into_chunks function divides the Markdown content into smaller, manageable chunks. This is crucial for efficient retrieval because it allows the vector database to find the most relevant segments of the document quickly.
Setting up ChromaDB: The setup_chroma function initializes the ChromaDB client and creates a collection to store the embeddings. It uses sentence-transformers to generate embeddings, which are numerical representations of the text that capture semantic meaning.
Embedding and Storing Chunks: The embed_and_store function adds the chunks to the ChromaDB collection, which converts each chunk into an embedding using the sentence-transformers embedding function attached to the collection. These embeddings are then stored alongside the text, enabling semantic search.
Querying ChromaDB: The query_chroma function takes a query string and searches the ChromaDB collection for the most relevant documents based on semantic similarity. It returns the documents along with their distances, indicating how closely they match the query.
Main Function: The main function orchestrates the entire process. It calls the functions to load the PDF, split it into chunks, set up ChromaDB, embed and store the chunks, and query the database. Finally, it prints the results, showing the most relevant documents and their distances. The ChromaDB collection is deleted, and the client is reset at the end.
Save the Code: Save the provided code as a Python file (e.g., rag_pdf.py).
Prepare Your PDF Document: Place the PDF document you want to query in the same directory as the script. Update the pdf_file_path variable in the main function to point to your PDF file.
Run the Script: Open a terminal or command prompt. Navigate to the directory where you saved the script and the PDF file. Run the script using the command: python rag_pdf.py
Examine the Output: The script will print the query and the most relevant documents found in the PDF, along with their distances. The distance indicates how closely each document matches the query.
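What does the distance mean concretely? By default a ChromaDB collection compares embeddings with squared L2 (Euclidean) distance, so smaller values mean a closer match. The toy computation below uses made-up 3-dimensional vectors rather than real embeddings, purely to illustrate the idea:

```python
def squared_l2(a, b):
    # Squared Euclidean distance: sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

query_vec = [1.0, 0.0, 0.0]   # hypothetical query embedding
doc_close = [1.0, 1.0, 0.0]   # hypothetical near-match chunk
doc_far = [0.0, 1.0, 1.0]     # hypothetical unrelated chunk

print(squared_l2(query_vec, doc_close))  # 1.0 (closer match)
print(squared_l2(query_vec, doc_far))    # 3.0 (farther away)
```

Real sentence-transformer embeddings have hundreds of dimensions (384 for all-MiniLM-L6-v2), but the comparison works the same way: the chunk with the smallest distance is the best match.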
docling: This library is used for parsing and converting PDF documents into a more manageable format, such as Markdown. Docling excels at preserving the structure and formatting of the original document, making it easier to extract meaningful information.

chromadb: ChromaDB is a vector database that allows you to store and retrieve embeddings efficiently. Embeddings are numerical representations of text that capture their semantic meaning, enabling similarity searches.

sentence-transformers: This library provides pre-trained models for generating high-quality embeddings. SentenceTransformer models are designed to produce embeddings that capture the semantic meaning of sentences and paragraphs, making them ideal for RAG applications.

pypdf: PyPDF is a Python library used for reading PDF files and extracting text content.
os: This module provides functions for interacting with the operating system, such as accessing file paths and environment variables.

docling.document_converter.DocumentConverter: This class is used to convert documents from one format to another. In this case, we will use it to convert PDF documents to Markdown format.

chromadb: This module provides the core functionalities for interacting with the ChromaDB vector database.

chromadb.utils.embedding_functions: This module provides utility functions for working with embeddings, such as the SentenceTransformerEmbeddingFunction.

sentence_transformers.SentenceTransformer: This class is used to load and use pre-trained SentenceTransformer models for generating embeddings.

pypdf: This module is used to extract text content from PDF files.
converter = DocumentConverter(): Creates an instance of the DocumentConverter class, which will be used to perform the conversion.

result = converter.convert(pdf_path): Calls the convert method of the DocumentConverter instance to convert the PDF document to Markdown format. The pdf_path variable specifies the path to the PDF file.

markdown_content = result.document.export_to_markdown(): Extracts the Markdown content from the conversion result.

return markdown_content: Returns the Markdown content.
chunks = []: Initializes an empty list to store the chunks.

start = 0: Initializes the start index to 0.

while start < len(text): This loop iterates over the text, creating chunks until the end of the text is reached.

end = min(start + chunk_size, len(text)): Calculates the end index for the current chunk as the minimum of the start index plus the chunk size and the length of the text, ensuring the chunk never runs past the end of the text.

chunk = text[start:end]: Creates the current chunk by slicing the text from the start index to the end index.

chunks.append(chunk): Adds the current chunk to the list of chunks.

start += chunk_size - chunk_overlap: Advances the start index by the chunk size minus the chunk overlap, so that consecutive chunks share some text; this overlap helps preserve context across chunk boundaries.

return chunks: Returns the list of chunks.
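To see the sliding-window arithmetic in action, here is the same logic as split_into_chunks run standalone with a deliberately small chunk size, so the overlap between consecutive chunks is visible:

```python
def split_into_chunks(text, chunk_size=500, chunk_overlap=50):
    # Same sliding-window logic as the tutorial function.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - chunk_overlap
    return chunks

# Toy example: 10-character chunks that overlap by 3 characters.
demo = "abcdefghijklmnopqrst"  # 20 characters
print(split_into_chunks(demo, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrst']
```

Each chunk's last three characters ("hij", then "opq") reappear at the start of the next chunk; that shared slice is the overlap that preserves context across boundaries. Note that chunk_overlap must stay smaller than chunk_size, or the start index never advances.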
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=embedding_model_name): Initializes the SentenceTransformer embedding function. The embedding_model_name variable specifies the name of the SentenceTransformer model to use.

client = chromadb.PersistentClient(path=persist_directory): Initializes the ChromaDB client. The persist_directory variable specifies the directory where the ChromaDB data will be stored.

collection = client.get_or_create_collection(name="rag_collection", embedding_function=sentence_transformer_ef): Creates or gets the ChromaDB collection. The name argument specifies the name of the collection, and embedding_function specifies the embedding function to use.

return client, collection: Returns the ChromaDB client and collection.
ids = [str(i) for i in range(len(chunks))]: Creates a list of unique string IDs, one per chunk.

collection.add(documents=chunks, ids=ids): Adds the chunks to the ChromaDB collection. The documents argument is the list of text chunks and ids is the list of unique IDs for the chunks.
results = collection.query(query_texts=[query], n_results=n_results): Queries the ChromaDB collection with the given query. The query_texts argument carries the query string and n_results specifies the number of results to return.

return results: Returns the query results.
markdown_content = load_and_convert_pdf(pdf_path): Loads the PDF document and converts it to Markdown format.

chunks = split_into_chunks(markdown_content): Splits the Markdown content into smaller chunks.

client, collection = setup_chroma(): Initializes ChromaDB and sets up the embedding function.

embed_and_store(collection, chunks): Generates embeddings for the text chunks and stores them in the ChromaDB collection.

results = query_chroma(collection, query): Queries ChromaDB with the given query and returns the most relevant results.

print("Query:", query): Prints the query string.

for i, (document, distance) in enumerate(zip(results['documents'][0], results['distances'][0])): This loop iterates over the query results, printing the result number, the document content, the distance between the query and the document, and a separator line for each result.
Different Embedding Models: Experiment with different SentenceTransformer models to find the one that works best for your data. Some models are better suited for specific types of text.

Chunk Size and Overlap: Adjust the chunk_size and chunk_overlap parameters in the split_into_chunks function to optimize the retrieval performance. Smaller chunk sizes may improve accuracy, while larger chunk sizes may improve efficiency.

Prompt Engineering: Integrate the retrieved documents into a prompt for a large language model (LLM) to generate more comprehensive and context-aware answers.

Metadata: Add metadata to the chunks before storing them in ChromaDB, such as the page number or section heading. This can help to improve the accuracy and relevance of the search results.

Error Handling: Add error handling to the code to gracefully handle exceptions, such as file not found errors or network errors.
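As a sketch of the prompt-engineering suggestion above, the retrieved chunks can be stitched into a single prompt for an LLM. The build_rag_prompt helper and its template are illustrative placeholders, not part of the tutorial's code; the chunks list stands in for results['documents'][0] as returned by query_chroma:

```python
def build_rag_prompt(query, retrieved_chunks):
    # Concatenate the retrieved chunks into a numbered context block,
    # then ask the model to answer strictly from that context.
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical chunks, standing in for results['documents'][0]:
chunks = ["Docling converts PDFs to Markdown.", "ChromaDB stores embeddings."]
print(build_rag_prompt("What does Docling do?", chunks))
```

The resulting string can be passed to any chat or completion API. Numbering the chunks also lets you ask the model to cite which chunk supported its answer.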