Data Processing

Custom PDF Parser

A specialized component for parsing complex PDF layouts with tables and multi-column text.

Overview

This component uses the pypdf library to read a PDF page by page, extract the text of each page, and return the combined text together with the page count. It is intended for documents with complex layouts that standard parsers might struggle with.

Python Code

python

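from langflow.custom import CustomComponent
from langflow.schema import Record
from pypdf import PdfReader


class PDFParserComponent(CustomComponent):
    display_name = "Custom PDF Parser"
    description = "Extracts text from PDF files with layout awareness."

    def build_config(self):
        # Field definitions shown in the Langflow UI.
        return {
            "file_path": {"display_name": "File Path", "file_types": ["pdf"]},
            "extract_images": {"display_name": "Extract Images", "value": False},
        }

    def build(self, file_path: str, extract_images: bool = False) -> Record:
        # extract_images is accepted for the UI toggle but is not used yet.
        reader = PdfReader(file_path)
        # extract_text() can return an empty string for image-only pages;
        # the "or" fallback also guards against a None result.
        page_texts = [page.extract_text() or "" for page in reader.pages]
        # Join with newlines so the last line of one page does not run
        # into the first line of the next.
        text = "\n".join(page_texts)

        return Record(text=text, data={"pages": len(reader.pages)})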
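
For multi-column pages and tables, the per-page extraction step could be factored into a small helper that optionally requests pypdf's layout mode. The following is a minimal sketch, not part of the component above: `join_page_text` is a hypothetical helper name, and it assumes a pypdf release recent enough that `extract_text` accepts the `extraction_mode` argument. The helper only relies on each page exposing `extract_text`, so it can be tried with stand-in objects as well as with `reader.pages`.

```python
from typing import Any, Iterable


def join_page_text(pages: Iterable[Any], layout: bool = False) -> str:
    """Concatenate text from page-like objects, one block per page."""
    # Assumption: pages accept extract_text(extraction_mode=...),
    # as pypdf page objects do in recent releases.
    mode = "layout" if layout else "plain"
    chunks = []
    for page in pages:
        text = page.extract_text(extraction_mode=mode) or ""
        if text.strip():
            # Strip only trailing whitespace: layout mode uses leading
            # spaces to position text in columns.
            chunks.append(text.rstrip())
    # A blank line between pages keeps page boundaries visible.
    return "\n\n".join(chunks)
```

Inside `build`, this might be called as `join_page_text(reader.pages, layout=True)`. Whether layout mode actually improves a given document is something to check on real samples.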

How to use

Install the dependency: pip install pypdf
Create a Custom Component in Langflow.
Paste the code above into the code editor.
Connect a PDF file path to the File Path input. The component returns a Record containing the extracted text and the page count.
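
Before wiring a path into the component, it can help to confirm that the file exists and really is a PDF. The sketch below is an optional pre-check using only the standard library; `looks_like_pdf` is a hypothetical helper name, not part of Langflow or pypdf. It tests for the `%PDF-` marker at the very start of the file, which is stricter than some tolerant readers that accept the marker slightly later.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF marker."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        # PDF files begin with "%PDF-" followed by the version number.
        return handle.read(5) == b"%PDF-"
```

A failed check is a hint to inspect the file rather than proof that it is unreadable.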