Friday, April 18, 2025

Create a Simple RAG PDF with Docling


Generative AI (GenAI) has emerged as a promising way to turn the vast quantities of data organizations accumulate into strategic insight. By harnessing GenAI, organizations can uncover hidden patterns, trends, and correlations within their data, enabling them to make better-informed decisions, optimize their operations, and ultimately gain a competitive edge in the marketplace.

However, the potential of GenAI to revolutionize data-driven decision-making cannot be fully realized when the data is trapped in formats that are not directly consumable by large language models (LLMs). This can occur when valuable information is stored in PDF files or proprietary document formats like DOCX, effectively hindering the alignment of the model to the specific needs and requirements of the organization. This challenge becomes a significant roadblock, preventing organizations from fully leveraging their data assets to drive innovation and achieve their strategic objectives.

To overcome these challenges, organizations must adopt a proactive approach to preprocess their existing documents into formats that can be readily consumed by fine-tuning pipelines or used for the creation of retrieval-augmented generation (RAG) systems. This preprocessing step is crucial for unlocking the potential of GenAI and ensuring that the models can effectively learn from and leverage the organization's data.

It's important to emphasize that this process is not simply about performing basic Optical Character Recognition (OCR) or extracting raw text from documents. A critical aspect of this preprocessing stage is the implementation of context and element-aware techniques for data extraction. For example, if a table spans multiple pages within a document, it must be extracted as a single, cohesive table, preserving its structural integrity and the relationships between its constituent elements. Similarly, if the document features a complex layout with multiple text columns per page or incorporates a mix of elements such as images and tables, each of these elements must be extracted consistently, while maintaining awareness of the context from which it is being extracted.

Over the years, numerous open-source tools have attempted to address specific aspects of this challenge, but none have provided a comprehensive solution that tackles the entire problem holistically. This fragmented approach has forced organizations to rely on complex pipelines that interconnect disparate tools, requiring them to process the same documents multiple times using different software. The resulting workflows are often inconsistent, computationally expensive, and difficult to maintain, leading to variable output quality and hindering the overall effectiveness of the data ingestion process.

In our own experience, we encountered these same challenges when developing document ingestion pipelines for processing documents to fine-tune models with InstructLab. It was during this process that we discovered and adopted Docling, a project developed at IBM Research, which proved to be a transformative solution for our workflow. IBM's decision to release Docling as an open-source project in the fall of 2024 has solidified its position as a prominent tool in the NLP field, validating our early adoption and underscoring its potential to transform document processing for GenAI applications. We are excited to witness its growing impact on the open-source GenAI ecosystem and believe that it will play a critical role in democratizing access to these powerful technologies.

What Is Docling?

Docling stands out as an upstream open-source project and tool designed to streamline the process of parsing documents from a variety of formats, including PDF, DOCX, PPTX, and HTML, and converting them into easily consumable formats like Markdown or JSON. This capability simplifies the process of preparing content for GenAI applications, enabling organizations to unlock the value of their unstructured data and leverage it for a wide range of use cases. Docling supports advanced PDF processing, including Optical Character Recognition (OCR) for scanned documents, making it suitable for handling both digital and digitized content. Furthermore, Docling seamlessly integrates with popular tools like LlamaIndex and LangChain, facilitating the creation of RAG systems and question-answering applications.
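
To give a feel for the API, here is a minimal sketch of a conversion; the file name is just a placeholder, and the dict/JSON export method name may differ slightly between Docling releases:

    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert("annual_report.pdf")  # also accepts DOCX, PPTX, HTML, ...

    markdown_text = result.document.export_to_markdown()  # Markdown, ready for RAG or fine-tuning pipelines
    doc_dict = result.document.export_to_dict()           # structured dict/JSON representation (name may vary by version)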

Understanding the Docling Process

The Docling process can be broken down into the following key steps:

  1. Document Parsing: Docling begins by parsing the input document, regardless of its format (PDF, DOCX, HTML, etc.), and extracting its underlying structure and content. This involves identifying the different elements within the document, such as headings, paragraphs, tables, images, and lists, and understanding their relationships to each other.

  2. Content Extraction: Once the document has been parsed, Docling extracts the relevant content from each element. This may involve OCR for scanned documents, text extraction from digital documents, or image processing to extract information from images.

  3. Contextualization: Docling then analyzes the extracted content to understand its context within the document. This involves identifying the relationships between different elements and determining the overall meaning of the text.

  4. Transformation: Docling transforms the extracted and contextualized content into a desired output format, such as Markdown or JSON. This involves structuring the data in a way that is easily consumable by GenAI applications.

  5. Output: Finally, Docling outputs the transformed data, ready to be used for fine-tuning pipelines, RAG systems, or other GenAI applications.

Step-by-Step Tutorial: Building a RAG PDF System with Docling and a Vector Database

This tutorial will guide you through the process of building a Retrieval-Augmented Generation (RAG) system for PDF documents using Docling to process the documents and a vector database to store the embeddings.

Prerequisites:

  • Python 3.9 or higher (Docling and recent ChromaDB releases do not support older interpreters)

  • Pip package installer

  • A vector database (e.g., ChromaDB, Pinecone, Weaviate). For this tutorial, we'll use ChromaDB for simplicity.

  • An OpenAI API key (only if you opt for OpenAI embeddings; this tutorial uses local SentenceTransformer embeddings and does not need one).

Step 1: Install Necessary Libraries

First, install the required Python libraries using pip:

      pip install docling chromadb sentence-transformers pypdf
    
  • docling: For document parsing and conversion.

  • chromadb: For the vector database.

  • sentence-transformers: For generating embeddings.

  • pypdf: For low-level PDF reading (optional here, since Docling handles the PDF parsing itself).

Step 2: Set Up the Environment

Import the necessary libraries and set up your environment:

import os
from docling.document_converter import DocumentConverter
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
import pypdf
    

Step 3: Load and Convert PDF Documents

Create a function to load and convert PDF documents using Docling:

    def load_and_convert_pdf(pdf_path):
        """
        Loads a PDF document from the given path and converts it to Markdown using Docling.
        """
        converter = DocumentConverter()
        result = converter.convert(pdf_path)
        markdown_content = result.document.export_to_markdown()
        return markdown_content
    

This function takes the path to a PDF file as input, uses Docling to convert it to Markdown format, and returns the Markdown content.

Step 4: Split the Document into Chunks

To effectively use the document in a RAG system, split the Markdown content into smaller chunks. This helps in retrieving more relevant information during the query.

    def split_into_chunks(text, chunk_size=500, chunk_overlap=50):
        """
        Splits the input text into smaller chunks.
        """
        chunks = []
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunk = text[start:end]
            chunks.append(chunk)
            start += chunk_size - chunk_overlap
        return chunks
    

This function splits the input text into chunks of a specified size, with a defined overlap between chunks to maintain context.
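
As a quick sanity check, running the splitter on a tiny string with toy parameters shows how the overlap works:

    # Toy example: 10 characters, chunk_size=4, chunk_overlap=2
    print(split_into_chunks("ABCDEFGHIJ", chunk_size=4, chunk_overlap=2))
    # ['ABCD', 'CDEF', 'EFGH', 'GHIJ', 'IJ']  -> each chunk starts 2 characters before the previous one ends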

Step 5: Set Up the Vector Database

Initialize ChromaDB and set up the embedding function. Here, we'll use SentenceTransformer embeddings, which are efficient and don't require an OpenAI API key:

    def setup_chroma(embedding_model_name="all-MiniLM-L6-v2", persist_directory="chroma"):
        """
        Sets up ChromaDB with the specified embedding function.
        """
        # Initialize SentenceTransformer embedding function
        sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name=embedding_model_name
        )

        # Initialize ChromaDB client
        client = chromadb.PersistentClient(path=persist_directory)

        # Create or get the collection
        collection = client.get_or_create_collection(
            name="rag_collection", embedding_function=sentence_transformer_ef
        )
        return client, collection
    

This function initializes ChromaDB with SentenceTransformer embeddings. You can change the embedding_model_name to other models supported by SentenceTransformer. It creates a collection (or retrieves it if it already exists) to store the embeddings.
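
For instance, swapping in a different model and storage directory is just a matter of changing the arguments; the model name below is one of the standard sentence-transformers checkpoints, chosen here only as an illustration:

    # Hypothetical alternative configuration
    client, collection = setup_chroma(
        embedding_model_name="all-mpnet-base-v2",  # larger and generally more accurate, but slower than all-MiniLM-L6-v2
        persist_directory="chroma_mpnet",
    )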

Step 6: Generate Embeddings and Store in the Vector Database

Now, generate embeddings for the text chunks and store them in the ChromaDB collection:

    def embed_and_store(collection, chunks):
        """
        Generates embeddings for the text chunks and stores them in ChromaDB.
        """
        ids = [str(i) for i in range(len(chunks))]
        collection.add(
            documents=chunks,
            ids=ids
        )
    

This function adds the text chunks to the ChromaDB collection along with unique IDs; ChromaDB generates the embeddings automatically using the SentenceTransformer embedding function attached to the collection.

Step 7: Create the RAG Pipeline

Now we create the core of our RAG pipeline:

    def query_chroma(collection, query, n_results=5):
        """
        Queries ChromaDB with the given query and returns the most relevant results.
        """
        results = collection.query(
            query_texts=[query],
            n_results=n_results
        )
        return results
    

This function takes a query string as input, uses the ChromaDB collection to find the most relevant documents, and returns the results.
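
To make the later result handling easier to follow, the returned object is roughly a dictionary of parallel lists, one inner list per query text (the exact keys can vary slightly between ChromaDB versions):

    # Approximate shape of the query result:
    # {
    #     "ids":       [["12", "3", ...]],
    #     "documents": [["chunk text ...", ...]],
    #     "distances": [[0.31, 0.42, ...]],   # smaller distance = closer semantic match
    #     "metadatas": [[None, None, ...]],
    # }
    results = query_chroma(collection, "What is the main topic of this document?")
    top_chunk = results["documents"][0][0]   # best-matching chunk for the first (and only) query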

Step 8: Putting it All Together

Now, let's create a main function to tie everything together:

    def main(pdf_path, query):
        """
        Main function to orchestrate the RAG pipeline.
        """
        # Load and convert PDF
        markdown_content = load_and_convert_pdf(pdf_path)

        # Split into chunks
        chunks = split_into_chunks(markdown_content)

        # Setup ChromaDB
        client, collection = setup_chroma()

        # Embed and store chunks
        embed_and_store(collection, chunks)

        # Query ChromaDB
        results = query_chroma(collection, query)

        # Print results
        print("Query:", query)
        for i, (document, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
            print(f"Result {i+1}:")
            print(f"Document: {document}")
            print(f"Distance: {distance}")
            print("-" * 20)

        # Clean up: remove the demo collection. (client.reset() would also clear the store,
        # but requires the client to be created with allow_reset=True, so it is not used here.)
        client.delete_collection("rag_collection")

    # Example Usage
    if __name__ == "__main__":
        pdf_file_path = "your_document.pdf"  # Replace with the path to your PDF file
        user_query = "What is the main topic of this document?"
        main(pdf_file_path, user_query)
    

Complete Code:

import os
from docling.document_converter import DocumentConverter
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
import pypdf

def load_and_convert_pdf(pdf_path):
    """
    Loads a PDF document from the given path and converts it to Markdown using Docling.
    """
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    markdown_content = result.document.export_to_markdown()
    return markdown_content

def split_into_chunks(text, chunk_size=500, chunk_overlap=50):
    """
    Splits the input text into smaller chunks.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - chunk_overlap
    return chunks

def setup_chroma(embedding_model_name="all-MiniLM-L6-v2", persist_directory="chroma"):
    """
    Sets up ChromaDB with the specified embedding function.
    """
    # Initialize SentenceTransformer embedding function
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=embedding_model_name
    )

    # Initialize ChromaDB client
    client = chromadb.PersistentClient(path=persist_directory)

    # Create or get the collection
    collection = client.get_or_create_collection(
        name="rag_collection", embedding_function=sentence_transformer_ef
    )
    return client, collection

def embed_and_store(collection, chunks):
    """
    Generates embeddings for the text chunks and stores them in ChromaDB.
    """
    ids = [str(i) for i in range(len(chunks))]
    collection.add(
        documents=chunks,
        ids=ids
    )

def query_chroma(collection, query, n_results=5):
    """
    Queries ChromaDB with the given query and returns the most relevant results.
    """
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    return results

def main(pdf_path, query):
    """
    Main function to orchestrate the RAG pipeline.
    """
    # Load and convert PDF
    markdown_content = load_and_convert_pdf(pdf_path)

    # Split into chunks
    chunks = split_into_chunks(markdown_content)

    # Setup ChromaDB
    client, collection = setup_chroma()

    # Embed and store chunks
    embed_and_store(collection, chunks)

    # Query ChromaDB
    results = query_chroma(collection, query)

    # Print results
    print("Query:", query)
    for i, (document, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
        print(f"Result {i+1}:")
        print(f"Document: {document}")
        print(f"Distance: {distance}")
        print("-" * 20)

    # Clean up: remove the demo collection. (client.reset() would also clear the store,
    # but requires the client to be created with allow_reset=True, so it is not used here.)
    client.delete_collection("rag_collection")

# Example Usage
if __name__ == "__main__":
    pdf_file_path = "your_document.pdf"  # Replace with the path to your PDF file
    user_query = "What is the main topic of this document?"
    main(pdf_file_path, user_query)
    

Explanation:

  1. Loading and Converting the PDF:

    • The load_and_convert_pdf function uses Docling to convert the PDF document into Markdown format. This involves extracting text, formatting, and structural elements from the PDF.

  2. Splitting into Chunks:

    • The split_into_chunks function divides the Markdown content into smaller, manageable chunks. This is crucial for efficient retrieval because it allows the vector database to find the most relevant segments of the document quickly.

  3. Setting up ChromaDB:

    • The setup_chroma function initializes the ChromaDB client and creates a collection to store the embeddings. It uses sentence-transformers to generate embeddings, which are numerical representations of the text that capture semantic meaning.

  4. Embedding and Storing Chunks:

    • The embed_and_store function takes the generated chunks and converts them into embeddings using the sentence-transformers model. These embeddings are then stored in the ChromaDB collection, allowing for semantic search capabilities.

  5. Querying ChromaDB:

    • The query_chroma function takes a query string and searches the ChromaDB collection for the most relevant documents based on semantic similarity. It returns the documents along with their distances, indicating how closely they match the query.

  6. Main Function:

    • The main function orchestrates the entire process. It calls the functions to load the PDF, split it into chunks, set up ChromaDB, embed and store the chunks, and query the database. Finally, it prints the results, showing the most relevant documents and their distances.

    • The demo ChromaDB collection is deleted at the end so that repeated runs start from a clean state.

How to Run the Code:

  1. Save the Code:

    • Save the provided code as a Python file (e.g., rag_pdf.py).

  2. Prepare Your PDF Document:

    • Place the PDF document you want to query in the same directory as the script.

    • Update the pdf_file_path variable in the main function to point to your PDF file.

  3. Run the Script:

    • Open a terminal or command prompt.

    • Navigate to the directory where you saved the script and the PDF file.

    • Run the script using the command: python rag_pdf.py

  4. Examine the Output:

    • The script will print the query and the most relevant documents found in the PDF, along with their distances. The distance indicates how closely each document matches the query.

Detailed Explanation of Each Step:

1. Installing Libraries:

The first step is to ensure that you have all the necessary libraries installed. These libraries provide the functionalities needed for document processing, embedding generation, and vector database management.

  • docling: This library is used for parsing and converting PDF documents into a more manageable format, such as Markdown. Docling excels at preserving the structure and formatting of the original document, making it easier to extract meaningful information.

  • chromadb: ChromaDB is a vector database that allows you to store and retrieve embeddings efficiently. Embeddings are numerical representations of text that capture their semantic meaning, enabling similarity searches.

  • sentence-transformers: This library provides pre-trained models for generating high-quality embeddings. SentenceTransformer models are designed to produce embeddings that capture the semantic meaning of sentences and paragraphs, making them ideal for RAG applications.

  • pypdf: PyPDF is a Python library used for reading PDF files and extracting text content.

2. Setting Up the Environment:

The next step is to import the necessary libraries and set up your environment. This involves importing the required modules from each library and configuring any necessary settings.

  • os: This module provides functions for interacting with the operating system, such as accessing file paths and environment variables.

  • docling.document_converter.DocumentConverter: This class is used to convert documents from one format to another. In this case, we will use it to convert PDF documents to Markdown format.

  • chromadb: This module provides the core functionalities for interacting with the ChromaDB vector database.

  • chromadb.utils.embedding_functions: This module provides utility functions for working with embeddings, such as the SentenceTransformerEmbeddingFunction.

  • sentence_transformers.SentenceTransformer: This class is used to load and use pre-trained SentenceTransformer models for generating embeddings.

  • pypdf: This module is used to extract text content from PDF files.

3. Loading and Converting PDF Documents:

The load_and_convert_pdf function is responsible for loading PDF documents and converting them into Markdown format using Docling. This function takes the path to a PDF file as input and returns the Markdown content.

  • converter = DocumentConverter(): This line creates an instance of the DocumentConverter class, which will be used to perform the conversion.

  • result = converter.convert(pdf_path): This line calls the convert method of the DocumentConverter instance to convert the PDF document to Markdown format. The pdf_path variable specifies the path to the PDF file.

  • markdown_content = result.document.export_to_markdown(): This line extracts the Markdown content from the conversion result.

  • return markdown_content: This line returns the Markdown content.

4. Splitting the Document into Chunks:

The split_into_chunks function is responsible for splitting the Markdown content into smaller chunks. This is necessary because retrieval works best over small, focused passages: each chunk needs to fit comfortably within the embedding model's effective input length, and smaller chunks let the vector database return only the passages that are actually relevant to a query rather than whole documents.

  • chunks = []: This line initializes an empty list to store the chunks.

  • start = 0: This line initializes the start index to 0.

  • while start < len(text): This loop iterates over the text, creating chunks until the end of the text is reached.

  • end = min(start + chunk_size, len(text)): This line calculates the end index for the current chunk. The end index is the minimum of the start index plus the chunk size and the length of the text. This ensures that the chunk does not exceed the chunk size.

  • chunk = text[start:end]: This line creates the current chunk by slicing the text from the start index to the end index.

  • chunks.append(chunk): This line adds the current chunk to the list of chunks.

  • start += chunk_size - chunk_overlap: This line updates the start index for the next chunk. The start index is incremented by the chunk size minus the chunk overlap. This ensures that there is some overlap between the chunks, which can help to preserve context.

  • return chunks: This line returns the list of chunks.

5. Setting Up the Vector Database:

The setup_chroma function is responsible for initializing ChromaDB and setting up the embedding function. This function takes the embedding model name and the persist directory as input and returns the ChromaDB client and collection.

  • sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=embedding_model_name): This line initializes the SentenceTransformer embedding function. The embedding_model_name variable specifies the name of the SentenceTransformer model to use.

  • client = chromadb.PersistentClient(path=persist_directory): This line initializes the ChromaDB client. The persist_directory variable specifies the directory where the ChromaDB data will be stored.

  • collection = client.get_or_create_collection(name="rag_collection", embedding_function=sentence_transformer_ef): This line creates or gets the ChromaDB collection. The name variable specifies the name of the collection. The embedding_function variable specifies the embedding function to use.

  • return client, collection: This line returns the ChromaDB client and collection.

6. Generating Embeddings and Storing in the Vector Database:

The embed_and_store function is responsible for generating embeddings for the text chunks and storing them in the ChromaDB collection. This function takes the ChromaDB collection and the list of text chunks as input.

  • ids = [str(i) for i in range(len(chunks))]: This line creates a list of unique IDs for the chunks.

  • collection.add(documents=chunks, ids=ids): This line adds the chunks to the ChromaDB collection. The documents variable specifies the list of text chunks. The ids variable specifies the list of unique IDs for the chunks.

7. Creating the RAG Pipeline:

The query_chroma function is responsible for querying ChromaDB with a given query and returning the most relevant results. This function takes the ChromaDB collection, the query string, and the number of results to return as input.

  • results = collection.query(query_texts=[query], n_results=n_results): This line queries the ChromaDB collection with the given query. The query_texts variable specifies the query string. The n_results variable specifies the number of results to return.

  • return results: This line returns the query results.

8. Putting it All Together:

The main function orchestrates the entire RAG pipeline. This function takes the path to the PDF file and the query string as input.

  • markdown_content = load_and_convert_pdf(pdf_path): This line loads the PDF document and converts it to Markdown format.

  • chunks = split_into_chunks(markdown_content): This line splits the Markdown content into smaller chunks.

  • client, collection = setup_chroma(): This line initializes ChromaDB and sets up the embedding function.

  • embed_and_store(collection, chunks): This line generates embeddings for the text chunks and stores them in the ChromaDB collection.

  • results = query_chroma(collection, query): This line queries ChromaDB with the given query and returns the most relevant results.

  • print("Query:", query): This line prints the query string.

  • for i, (document, distance) in enumerate(zip(results['documents'][0], results['distances'][0])): This loop iterates over the query results.

  • print(f"Result {i+1}:"): This line prints the result number.

  • print(f"Document: {document}"): This line prints the document content.

  • print(f"Distance: {distance}"): This line prints the distance between the query and the document.

  • print("-" * 20): This line prints a separator.

Customization and Further Improvements:

  • Different Embedding Models: Experiment with different SentenceTransformer models to find the one that works best for your data. Some models are better suited for specific types of text.

  • Chunk Size and Overlap: Adjust the chunk_size and chunk_overlap parameters in the split_into_chunks function to tune retrieval quality. Smaller chunks tend to produce more precise matches, while larger chunks preserve more surrounding context in each result.

  • Prompt Engineering: Integrate the retrieved documents into a prompt for a large language model (LLM) to generate more comprehensive and context-aware answers (see the first sketch after this list).

  • Metadata: Add metadata to the chunks before storing them in ChromaDB, such as the page number or section heading (a second sketch follows this list). This can help to improve the accuracy and relevance of the search results.

  • Error Handling: Add error handling to the code to gracefully handle exceptions, such as file not found errors or network errors.
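
As a starting point for the prompt-engineering idea, here is a minimal sketch that reuses query_chroma to build a grounded prompt; the instruction wording is only an example, and the resulting string can be sent to any chat-completion endpoint or local model:

    def build_prompt(collection, question, n_results=3):
        """
        Assembles retrieved chunks into a grounded prompt for an LLM (sketch).
        """
        results = query_chroma(collection, question, n_results=n_results)
        context = "\n\n---\n\n".join(results["documents"][0])
        return (
            "Answer the question using only the context below. "
            "If the answer is not in the context, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )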
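
For the metadata idea, collection.add() accepts a metadatas list parallel to documents and ids. A sketch might look like the following; the source name and chunk index are illustrative fields, since the simple character-based splitter above does not track page numbers:

    def embed_and_store_with_metadata(collection, chunks, source_name):
        """
        Stores chunks together with simple metadata for filtering and citation (sketch).
        """
        ids = [f"{source_name}-{i}" for i in range(len(chunks))]
        metadatas = [{"source": source_name, "chunk_index": i} for i in range(len(chunks))]
        collection.add(documents=chunks, ids=ids, metadatas=metadatas)

    # Metadata can later be used to restrict a query, for example:
    # collection.query(query_texts=[query], n_results=5, where={"source": source_name})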

By following this tutorial, you can build a RAG system that leverages Docling to process PDF documents and ChromaDB to store and retrieve embeddings. This system can be used to answer questions about the content of the PDF documents, providing a valuable tool for information retrieval and knowledge management.

Conclusion

Red Hat introduced Red Hat Enterprise Linux AI (RHEL AI) and InstructLab in 2024 to bring these same capabilities to enterprise customers. RHEL AI is an enterprise-focused foundation model platform that integrates open-source GenAI capabilities and is optimized for deployment across hybrid cloud environments. It combines IBM’s open-source Granite LLMs with Red Hat’s InstructLab tools, enabling domain experts (not just data scientists) to fine-tune models with industry-specific knowledge, aligning them with unique organizational needs and data. Those models can then be deployed and managed across a hybrid cloud environment that spans from enterprise data centers to public clouds and edge environments.

