Data Ingestion
S3 PDF Folder Loader
Loads and extracts text from multiple PDF files stored in an S3-compatible bucket (like Cloudian HyperStore) with support for page batching.
Overview
This component allows you to point Langflow at a folder (prefix) in an S3 bucket and automatically extract text from every PDF found within. It is designed to work with any S3-compatible storage, including Cloudian HyperStore.
Key features:
- Batching: Process specific page ranges to manage memory or LLM context limits.
- Metadata: Automatically attaches the bucket name, object key, and page number to each record.
- Compatibility: Works with AWS S3, Cloudian, MinIO, and other S3-compatible APIs.
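The batching inputs map onto a simple list slice: a 1-based start page and a batch size where `0` means "no limit". A standalone sketch of that logic (the helper name `select_pages` is illustrative, not part of the component):

```python
def select_pages(pages, start_page=1, pages_per_batch=0):
    """Mirror the component's batching: 1-based start page, 0 = load all remaining pages."""
    start_index = max(start_page - 1, 0)
    if pages_per_batch > 0:
        return pages[start_index : start_index + pages_per_batch]
    return pages[start_index:]

# A stand-in for a 10-page document (one entry per page).
pages = [f"p{n}" for n in range(1, 11)]
print(select_pages(pages, start_page=3, pages_per_batch=4))
# → ['p3', 'p4', 'p5', 'p6']
```

With `start_page=3` and `pages_per_batch=4`, pages 3 through 6 are loaded; a second run with `start_page=7` would pick up where the first batch left off.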
Python Code
```python
import os
import tempfile

import boto3
from langchain_community.document_loaders import PyPDFLoader
from langflow.custom.custom_component.component import Component
from langflow.io import IntInput, MessageTextInput, Output, SecretStrInput
from langflow.schema.data import Data
from langflow.schema.dataframe import DataFrame


class CloudianS3LoadPDFs(Component):
    display_name = "S3 Load PDFs from Folder"
    description = (
        "Loads and extracts text from PDF files stored in an "
        "S3-compatible bucket, with page batching."
    )
    icon = "database-2-line"
    name = "CloudianS3LoadPDFs"

    inputs = [
        MessageTextInput(name="s3_endpoint", display_name="S3 Endpoint", required=True),
        MessageTextInput(name="access_key", display_name="Access Key", required=True),
        SecretStrInput(name="secret_key", display_name="Secret Key", required=True),
        MessageTextInput(name="bucket_name", display_name="Bucket Name", required=True),
        MessageTextInput(name="folder_prefix", display_name="Folder / Prefix", required=True),
        IntInput(name="start_page", display_name="Start Page (1-based)", value=1),
        IntInput(name="pages_per_batch", display_name="Pages per Batch (0 = all)", value=0),
    ]

    outputs = [
        Output(display_name="Documents", name="dataframe", method="load_documents"),
    ]

    def load_documents(self) -> DataFrame:
        s3 = boto3.client(
            "s3",
            endpoint_url=self.s3_endpoint,
            aws_access_key_id=self.access_key,
            aws_secret_access_key=self.secret_key,
        )

        data_items = []
        start_index = max(self.start_page - 1, 0)
        max_pages = self.pages_per_batch if self.pages_per_batch > 0 else None

        # Paginate so buckets with more than 1000 matching objects are fully listed.
        paginator = s3.get_paginator("list_objects_v2")
        for listing in paginator.paginate(Bucket=self.bucket_name, Prefix=self.folder_prefix):
            for obj in listing.get("Contents", []):
                key = obj["Key"]
                if not key.lower().endswith(".pdf"):
                    continue

                with tempfile.TemporaryDirectory() as tmp:
                    local_path = os.path.join(tmp, os.path.basename(key))
                    s3.download_file(self.bucket_name, key, local_path)

                    loader = PyPDFLoader(local_path)
                    pages = loader.load()  # one Document per page
                    if max_pages is not None:
                        pages = pages[start_index : start_index + max_pages]
                    else:
                        pages = pages[start_index:]

                    for page_number, doc in enumerate(pages, start=start_index + 1):
                        if not doc.page_content.strip():
                            continue  # skip blank pages
                        data_items.append(
                            Data(
                                text=doc.page_content,
                                data={
                                    **doc.metadata,
                                    "bucket": self.bucket_name,
                                    "key": key,
                                    "endpoint": self.s3_endpoint,
                                    "page": page_number,
                                },
                            )
                        )

        return DataFrame(data_items)
```

Setup Requirements
Because Langflow runs inside a Docker container in this setup, the Python dependencies must be installed inside that container.
Option 1: Temporary Install (Quick Test)
Run this command in your terminal while the container is running:
```bash
docker exec -it langflow pip install boto3 langchain langchain-community pypdf
```

Option 2: Persistent Install (Recommended)
To ensure the libraries are always available, update your docker-compose.yml to use a custom command for the Langflow service:
```yaml
langflow:
  image: langflowai/langflow:latest
  # ... other settings ...
  command: >
    bash -c "pip install boto3 langchain langchain-community pypdf
    && langflow run --host 0.0.0.0 --port 7860"
```

Then restart your stack:
```bash
docker compose up -d
```

Configuration
- S3 Endpoint: The full URL of your storage (e.g., `https://s3-region.example.com`).
- Access/Secret Keys: Your S3 credentials.
- Folder / Prefix: The path inside the bucket (e.g., `documents/raw/`).
- Pages per Batch: Set to `0` to load the entire document, or a specific number (e.g., `5`) to process documents in smaller chunks.
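Each record the component emits pairs the extracted page text with provenance metadata (bucket, key, endpoint, 1-based page number, plus whatever PyPDFLoader reports). A sketch of one record's payload as a plain dict, with all values illustrative:

```python
# Hypothetical example of a single emitted record's contents.
record = {
    "text": "Quarterly revenue grew 12%...",  # extracted page text
    "data": {
        "source": "/tmp/tmpabc123/report.pdf",        # added by PyPDFLoader
        "bucket": "analytics-docs",                    # hypothetical bucket
        "key": "documents/raw/report.pdf",             # object key in the bucket
        "endpoint": "https://s3-region.example.com",   # configured S3 endpoint
        "page": 3,                                     # 1-based page number
    },
}

# Downstream components can filter or group on these fields,
# e.g. keep only pages from a given object key:
same_doc = record["data"]["key"] == "documents/raw/report.pdf"
print(same_doc)
# → True
```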