Data Processing
Custom PDF Parser
A specialized component for parsing complex PDF layouts with tables and multi-column text.
Overview
This component uses the pypdf library to extract text from PDF files while maintaining structural integrity. It is particularly useful for documents with complex layouts that standard parsers might struggle with.
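For layouts with columns or tables, pypdf's `extract_text` accepts an `extraction_mode` argument, and `"layout"` mode tries to keep text roughly where it sits on the page. This appears to be available from pypdf 3.17 onward; treat the version as an assumption. The sketch below is not part of the component: `extract_page_text` is a hypothetical helper, and the fallback to plain extraction on older versions is also an assumption.

```python
def extract_page_text(page, layout: bool = True) -> str:
    """Extract text from one pypdf page object, optionally in layout mode."""
    if layout:
        try:
            # "layout" mode approximates the visual arrangement of the page.
            return page.extract_text(extraction_mode="layout") or ""
        except TypeError:
            # Assumed fallback: older pypdf versions reject this keyword.
            pass
    # Plain mode: text in content-stream order, no positional padding.
    return page.extract_text() or ""
```

Layout mode tends to produce wider, space-padded lines, so downstream steps that split on whitespace may need adjusting.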
Python Code
```python
from langflow.custom import CustomComponent
from langflow.schema import Record
from pypdf import PdfReader


class PDFParserComponent(CustomComponent):
    display_name = "Custom PDF Parser"
    description = "Extracts text from PDF files with layout awareness."

    def build_config(self):
        # Field definitions shown in the Langflow UI.
        return {
            "file_path": {"display_name": "File Path", "file_types": ["pdf"]},
            "extract_images": {"display_name": "Extract Images", "value": False},
        }

    def build(self, file_path: str, extract_images: bool = False) -> Record:
        # Note: extract_images is accepted but not used; only text is extracted.
        reader = PdfReader(file_path)
        page_texts = []
        for page in reader.pages:
            # Image-only pages may yield no text, so fall back to "".
            page_texts.append(page.extract_text() or "")
        # Join with newlines so words at page boundaries do not run together.
        text = "\n".join(page_texts)
        return Record(text=text, data={"pages": len(reader.pages)})
```
How to use
- Install the dependency:
  pip install pypdf
- Create a Custom Component in Langflow.
- Paste the code above into the code editor.
- Connect a file path to the input.
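Before wiring the component into a flow, it can help to check the extraction logic on its own, outside Langflow. The following is a minimal sketch under stated assumptions: `join_page_text` and `extract_pdf_text` are hypothetical helper names, and the blank-line separator between pages is an arbitrary choice rather than something the component requires.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, such as a pypdf page object."""

    def extract_text(self) -> str: ...


def join_page_text(pages: Iterable[PageLike], separator: str = "\n\n") -> str:
    """Concatenate per-page text, skipping pages that yield no text."""
    chunks = []
    for page in pages:
        # Guard against pages with no extractable text (e.g. scanned images).
        page_text = page.extract_text() or ""
        if page_text.strip():
            chunks.append(page_text.strip())
    return separator.join(chunks)


def extract_pdf_text(file_path: str) -> tuple[str, int]:
    """Return (text, page_count) for the PDF at file_path."""
    # Imported here so join_page_text can be exercised without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    return join_page_text(reader.pages), len(reader.pages)
```

Calling `extract_pdf_text` on a local PDF and printing the first few hundred characters gives a quick check of whether the document has a usable text layer before it goes through the component.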