Data Ingestion

S3 PDF Folder Loader

Loads and extracts text from multiple PDF files stored in an S3-compatible bucket (like Cloudian HyperStore) with support for page batching.

Overview

This component allows you to point Langflow at a folder (prefix) in an S3 bucket and automatically extract text from every PDF found within. It is designed to work with any S3-compatible storage, including Cloudian HyperStore.

Key features:

  • Batching: Process specific page ranges to manage memory or LLM context limits.
  • Metadata: Automatically attaches the bucket name, object key, and page number to each record.
  • Compatibility: Works with AWS S3, Cloudian, MinIO, and other S3-compatible APIs.
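
As an illustrative sketch (all values here are hypothetical), a single extracted page record carries metadata like this:

```python
# Hypothetical example of the metadata attached to one extracted page.
# "source" is set by PyPDFLoader (the temporary local path); the bucket,
# key, endpoint, and 1-based page number are added by this component.
record_metadata = {
    "source": "/tmp/tmpabc123/report.pdf",
    "bucket": "my-documents",
    "key": "documents/raw/report.pdf",
    "endpoint": "https://s3.example.com",
    "page": 3,
}
```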

Python Code

python
import os
import tempfile
import boto3
try:
    # LangChain >= 0.2 moved the loaders to the langchain-community package
    from langchain_community.document_loaders import PyPDFLoader
except ImportError:  # fall back for older LangChain versions
    from langchain.document_loaders import PyPDFLoader
from langflow.custom.custom_component.component import Component
from langflow.io import MessageTextInput, SecretStrInput, IntInput, Output
from langflow.schema.data import Data
from langflow.schema.dataframe import DataFrame
 
class CloudianS3LoadPDFs(Component):
    display_name = "S3 Load PDFs from Folder"
    description = "Loads and extracts text from PDF files stored in an S3-compatible bucket, with page batching."
    icon = "database-2-line"
    name = "CloudianS3LoadPDFs"
 
    inputs = [
        MessageTextInput(name="s3_endpoint", display_name="S3 Endpoint", required=True),
        MessageTextInput(name="access_key", display_name="Access Key", required=True),
        SecretStrInput(name="secret_key", display_name="Secret Key", required=True),
        MessageTextInput(name="bucket_name", display_name="Bucket Name", required=True),
        MessageTextInput(name="folder_prefix", display_name="Folder / Prefix", required=True),
        IntInput(name="start_page", display_name="Start Page (1-based)", value=1),
        IntInput(name="pages_per_batch", display_name="Pages per Batch (0 = all)", value=0),
    ]
 
    outputs = [
        Output(display_name="Documents", name="dataframe", method="load_documents")
    ]
 
    def load_documents(self) -> DataFrame:
        s3 = boto3.client(
            "s3",
            endpoint_url=self.s3_endpoint,
            aws_access_key_id=self.access_key,
            aws_secret_access_key=self.secret_key,
        )
 
        # list_objects_v2 returns at most 1000 keys per call, so paginate
        # to cover larger folders.
        pdf_keys = []
        paginator = s3.get_paginator("list_objects_v2")
        for listing in paginator.paginate(
            Bucket=self.bucket_name, Prefix=self.folder_prefix
        ):
            for obj in listing.get("Contents", []):
                if obj["Key"].lower().endswith(".pdf"):
                    pdf_keys.append(obj["Key"])

        data_items = []
        start_index = max(self.start_page - 1, 0)
        max_pages = self.pages_per_batch if self.pages_per_batch > 0 else None

        for key in pdf_keys:
 
            with tempfile.TemporaryDirectory() as tmp:
                local_path = os.path.join(tmp, os.path.basename(key))
                s3.download_file(self.bucket_name, key, local_path)
 
                loader = PyPDFLoader(local_path)
                pages = loader.load()  # one Document per page
 
                if max_pages is not None:
                    pages = pages[start_index : start_index + max_pages]
                else:
                    pages = pages[start_index:]
 
                for page_number, doc in enumerate(pages, start=start_index + 1):
                    if not doc.page_content.strip():
                        continue
 
                    data_items.append(
                        Data(
                            text=doc.page_content,
                            data={
                                **doc.metadata,
                                "bucket": self.bucket_name,
                                "key": key,
                                "endpoint": self.s3_endpoint,
                                "page": page_number,
                            },
                        )
                    )
 
        return DataFrame(data_items)
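
The PDF-filtering step in the listing loop can be factored out and sanity-checked as a plain function (the name pick_pdf_keys is mine, not part of the component):

```python
def pick_pdf_keys(keys):
    """Keep only object keys that refer to PDFs (case-insensitive)."""
    return [k for k in keys if k.lower().endswith(".pdf")]

keys = ["documents/raw/a.pdf", "documents/raw/b.PDF", "documents/raw/notes.txt"]
print(pick_pdf_keys(keys))  # → ['documents/raw/a.pdf', 'documents/raw/b.PDF']
```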

Setup Requirements

If you run Langflow inside a Docker container, the component's dependencies must be installed inside that container.

Option 1: Temporary Install (Quick Test)

Run this command in your terminal while the container is running:

bash
docker exec -it langflow pip install boto3 langchain langchain-community pypdf

Option 2: Persistent Install (Recommended)

To ensure the libraries are always available, update your docker-compose.yml to use a custom command for the Langflow service:

yaml
  langflow:
    image: langflowai/langflow:latest
    # ... other settings ...
    command: >
      bash -c "pip install boto3 langchain langchain-community pypdf && langflow run --host 0.0.0.0 --port 7860"

Then restart your stack:

bash
docker compose up -d

Configuration

  1. S3 Endpoint: The full URL of your storage (e.g., https://s3-region.example.com).
  2. Access/Secret Keys: Your S3 credentials.
  3. Bucket Name: The bucket that holds your PDFs.
  4. Folder / Prefix: The path inside the bucket (e.g., documents/raw/).
  5. Start Page (1-based): The first page to extract from each document (default 1).
  6. Pages per Batch: Set to 0 to load the entire document, or a specific number (e.g., 5) to process documents in smaller chunks.
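
To make the batching arithmetic concrete: with Start Page = 6 and Pages per Batch = 5, pages 6 through 10 are extracted. The component's slicing can be mirrored as a small standalone function (a sketch, not part of the component itself):

```python
def select_pages(total_pages, start_page, pages_per_batch):
    """Mirror the component's slicing: 1-based start page, 0 = all remaining."""
    page_numbers = list(range(1, total_pages + 1))  # stands in for one Document per page
    start_index = max(start_page - 1, 0)
    if pages_per_batch > 0:
        return page_numbers[start_index:start_index + pages_per_batch]
    return page_numbers[start_index:]

print(select_pages(20, 6, 5))   # → [6, 7, 8, 9, 10]
print(select_pages(20, 18, 5))  # → [18, 19, 20] (short final batch)
```

Note that a batch near the end of a document may come back shorter than Pages per Batch, and blank pages are skipped at extraction time.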