Build Your Own: Python PDF to Text

Jerry Ng

Aug 1, 2023 • 6 min read

Recently, I found myself facing the need to convert my personal PDF files to text. While there were many existing PDF to text converters available for this task, I couldn't shake the feeling of unease about uploading my private documents to unknown servers. Who knows what could happen to them or how secure they truly are?

I decided to build my own PDF to text converter (demo). Right from the get-go, I knew this was possible by using only Python.

Getting Started

Before that, I must confess, I'm not particularly skilled in frontend development. So, I opted to use PyWebIO, sparing me the hassle of dealing with HTML and CSS.

Main Dependencies

To begin, I set up a new project directory named pypdf2txt and initialized it with Git and Poetry (my go-to virtual environment and package manager for Python).

💡

See how to install Poetry here. Alternatively, you can always use good ol’ pip and venv.

I included the necessary dependencies — pdfminer.six for working with PDFs and pywebio for building the web application.

Lastly, I created a main.py file to hold the core functionality of the converter.

mkdir pypdf2txt
cd pypdf2txt
git init
poetry init
poetry add pdfminer.six pywebio
touch main.py

Optional: Dev Dependencies

I added autopep8 and ruff (instead of flake8) as development dependencies using Poetry. These will help with automatic code formatting and linting:

poetry add -D autopep8 ruff

Converting PDF to Text

NOTE: Throughout this entire walkthrough, we only have one main.py to work with.

Adding Site Description

Let’s start with something a little simpler, shall we? To render any output to the browser, we can use the pywebio.output module (reference).

# main.py

from functools import partial
from io import BytesIO, StringIO
from pathlib import Path

from pdfminer.high_level import extract_text_to_fp
from pywebio import config, session, start_server
from pywebio.input import file_upload
from pywebio.output import clear, put_buttons, put_code, put_markdown, toast, use_scope
from pywebio.session import download, run_js

def render_description():
    description = """
    # Pypdf2txt: Convert PDF to Text

    A simple Python web service that allows you to convert your PDF documents to text.
    Extract text from PDF files without compromising **privacy**, **security**, and **ownership**.

    ## Features

    -   Converts PDF documents to text
    -   Simple and easy-to-use web interface
    -   Fast and efficient text extraction
    """

    put_markdown(description)

In the render_description function, I define the app's description using a markdown-formatted string.

Then, I use put_markdown from the pywebio.output module to display the description in the web interface.
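Markdown normally treats lines indented by four spaces as a code block, so it is fair to wonder why the indented string above renders as headings and lists. As far as I can tell, put_markdown strips the shared leading whitespace by default (its lstrip parameter). The standard library's textwrap.dedent does something similar, so the sketch below uses it as a stand-in to show the effect; it is not PyWebIO's actual implementation.

```python
import textwrap

# Same shape as the description string: every line is indented by 4 spaces.
description = """
    # Title

    Some **bold** text.

    -   Item one
    """

# Remove the indentation shared by all lines, then trim the blank edges.
cleaned = textwrap.dedent(description).strip()
print(cleaned)
```

After dedenting, the heading starts at column zero, which is what a Markdown renderer needs to treat it as a heading.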

Setting Up the Web App

This is the main function that brings everything together:

# ... (rest of the code)

def main():
    session.run_js(
        # Reload the page 4 seconds after the session closes
        'WebIO._state.CurrentSession.on_session_close(()=>{setTimeout(()=>location.reload(), 4000)})',
    )

    render_description()
    # process_pdf() #  TODO: create & uncomment later 

if __name__ == "__main__":
    start_server(main, port=8080)

The main function runs a JavaScript snippet that reloads the page a few seconds after the session closes, renders the app's description, and (once uncommented) calls process_pdf to handle the PDF to text conversion. The start_server call at the bottom then serves main on port 8080.

Now, when you run the app, it will render the description at the start:

poetry run python3 main.py

File Upload

Next, I needed to upload the PDF file through the web interface. Let’s create a function named process_pdf that uses the file_upload API. I found a handy example of file upload in the PyWebIO documentation, and with a bit of copy-pasting, I was well on my way:

# ... (rest of the code)

def process_pdf():
    put_markdown(
        """
        ## Convert PDF To Text
        """,
    )

    while True:
        pdf_file = file_upload(
            "Select PDF",
            accept="application/pdf",
            max_size="10M",
            multiple=False,
            help_text="sample.pdf",
        )

        text_output = extract_text_from_pdf(pdf_file) # TODO: work on this later
        text_filename = f"{Path(pdf_file['filename']).stem}.txt"

# ... (rest of the code)

First, I added a "Convert PDF To Text" heading to the UI, making it clear what the app was designed to do.

💡

Because PyWebIO's input functions are blocking and the input form is destroyed after a successful submission, everything sits inside a while True loop so that the input is recreated after each upload. (reference)

The file_upload function handled the file selection, ensuring that only PDFs were accepted and imposing a reasonable file size limit of 10MB — for now.

In the future, we could configure this based on the deployment server's available RAM.
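One way this could look is a small helper that turns a RAM figure into the size string that file_upload's max_size accepts. This is only a sketch: the name upload_limit_from_ram, the 1% budget, and the 1 MB and 50 MB bounds are made-up values for illustration, and how you obtain the total RAM depends on your platform.

```python
MIB = 1024 * 1024


def upload_limit_from_ram(total_ram_bytes: int, floor_mb: int = 1, cap_mb: int = 50) -> str:
    """Return a size string such as "10M" for file_upload's max_size.

    Hypothetical policy: allow roughly 1% of total RAM per upload,
    clamped between floor_mb and cap_mb.
    """
    budget_mb = total_ram_bytes // 100 // MIB
    limited_mb = max(floor_mb, min(budget_mb, cap_mb))
    return f"{limited_mb}M"


# Example: a server with 1 GiB of RAM ends up with the 10M limit used above.
print(upload_limit_from_ram(1024 * MIB))
```

The result could then be passed as max_size=upload_limit_from_ram(...) in place of the hard-coded "10M".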

Extracting Text from PDF

Once the user uploaded the PDF file, the extract_text_from_pdf function swung into action. Here's the code snippet for extracting text from the uploaded PDF:

# ... (rest of the code)

from pdfminer.high_level import extract_text_to_fp

def extract_text_from_pdf(pdf_file):
    pdf_buffer = BytesIO(pdf_file['content'])
    text_buffer = StringIO()
    extract_text_to_fp(pdf_buffer, text_buffer)
    return text_buffer.getvalue()

# ... (rest of the code)

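Here's where the magic happens. I used the extract_text_to_fp API from pdfminer.six: it reads the PDF from a binary file-like object (the BytesIO wrapping the uploaded bytes) and writes the extracted text into the output buffer (the StringIO). The concise PDFMiner documentation guided me through the process.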
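The accept="application/pdf" setting only filters what the browser's file picker offers, so a cheap server-side check before calling the extractor may be worthwhile. The helper below is a sketch under that assumption; looks_like_pdf is a hypothetical name and is not part of the original app.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of the data."""
    # Well-formed PDFs start with "%PDF-"; scanning the first 1024 bytes
    # tolerates files that carry a little junk before the header.
    return b"%PDF-" in content[:1024]


# Usage idea inside process_pdf (not wired in here):
#     if not looks_like_pdf(pdf_file["content"]):
#         toast("That doesn't look like a PDF")
#         continue
print(looks_like_pdf(b"%PDF-1.7\n..."))
print(looks_like_pdf(b"plain text"))
```

This does not prove the file is a valid PDF; it only rejects obvious mismatches before pdfminer has to parse them.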

Keeping Output Area Clean

To maintain a clean user experience, I encapsulated the text output and download button in a "scope" using PyWebIO's use_scope function (reference).

This allowed me to clear the specified output area every time a new file was uploaded, ensuring that each conversion was separate and didn't clutter the interface.

# ... (rest of the code)

def process_pdf():
    put_markdown(
        """
        ## Convert PDF To Text
        """,
    )

    while True:
        pdf_file = file_upload(
            "Select PDF",
            accept="application/pdf",
            max_size="10M",
            multiple=False,
            help_text="sample.pdf",
        )
        clear('text-output-area') # NOTE: to clear previous text output

        text_output = extract_text_from_pdf(pdf_file)
        text_filename = f"{Path(pdf_file['filename']).stem}.txt"

        with use_scope('text-output-area'):
            put_markdown(
                """
                ### Text Output
                """,
            )

            put_code(text_output, rows=10)
            put_buttons(
                [
                    "Copy to Clipboard",
                    'Click to Download',
                ],
                onclick=[
                    partial(
                        copy_to_clipboard, # TODO: create later
                        text=text_output,
                    ),
                    partial(
                        click_to_download, # TODO: create later
                        filename=text_filename,
                        text=text_output.encode(),
                    ),
                ],
            )

# ... (rest of the code)

Download Text Output

Implementing a “Click to Download” feature for our app is super straightforward. In fact, you can choose between using put_file (reference) or download (reference).

💡

See PyWebIO’s demo on download() usage.

I chose the latter because I wanted it to be a button instead of a link:

# ... (rest of the code)

def click_to_download(filename: str, text: bytes):
    download(
        filename,
        text,
    )
    toast('Text file downloaded')

# ... (rest of the code)

With this, users can download the output text with a simple click of a button.
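The filename handed to click_to_download comes from the uploaded file's name via Path.stem, and the text is encoded to bytes because download expects binary content. A minimal sketch of the naming step on its own follows; text_filename and the "output" fallback are names I made up for illustration.

```python
from pathlib import Path


def text_filename(pdf_name: str, default_stem: str = "output") -> str:
    """Swap the final extension of the uploaded file's name for .txt."""
    # Path.stem drops only the last suffix, so "notes.v2.pdf" -> "notes.v2".
    stem = Path(pdf_name).stem or default_stem
    return f"{stem}.txt"


print(text_filename("report.pdf"))
print(text_filename("notes.v2.pdf"))
```

The fallback only matters if the browser sends an empty filename, which would otherwise produce a file named ".txt".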

Copy to Clipboard

Lastly, let’s work on our “Copy to Clipboard” feature. This is where things get a little trickier.

I learned that PyWebIO doesn't natively support this copy-to-clipboard function. Yet, the demo above showcased its presence, sparking my determination to uncover its implementation.

A thorough search of the official documentation turned up nothing. Searching the source code on Sourcegraph, however, led me to the demo's implementation. Oh, the joy of discovery!

After taking inspiration from PyWebIO's demo example, here’s what I arrived at:

# ... (rest of the code)

def copy_to_clipboard(text: str):
    clipboard_setup = """
    window.writeText = function(text) {
        const input = document.createElement('textarea');
        input.style.opacity  = 0;
        input.style.position = 'absolute';
        input.style.left = '-100000px';
        document.body.appendChild(input);

        input.value = text;
        input.select();
        input.setSelectionRange(0, text.length);
        document.execCommand('copy');
        document.body.removeChild(input);
        return true;
    }
    """
    run_js(clipboard_setup)
    run_js("writeText(text)", text=text)
    toast('Text copied to the clipboard')

# ... (rest of the code)

When the user clicks the "Copy to Clipboard" button, the JavaScript code within copy_to_clipboard sets up a temporary textarea, copies the text to it, selects the content, and executes the copy command.

As a result, the extracted text is now ready for pasting elsewhere, and a toast notification confirms the successful copy operation. Yay!
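document.execCommand('copy') is marked as deprecated in current browser documentation, so it may be worth trying the asynchronous Clipboard API first. The variant below is a sketch, not what the PyWebIO demo does: the name copy_to_clipboard_modern is mine, navigator.clipboard is only available in secure contexts (HTTPS or localhost), and I haven't verified how every browser treats the user-gesture requirement after a round trip to the server.

```python
# JavaScript run in the browser; `text` is supplied by run_js as a variable.
CLIPBOARD_JS = """
if (navigator.clipboard && window.isSecureContext) {
    navigator.clipboard.writeText(text);
} else if (window.writeText) {
    // Fall back to the execCommand-based helper if it was registered earlier.
    window.writeText(text);
}
"""


def copy_to_clipboard_modern(text: str) -> None:
    # Imported inside the function so the sketch stays importable on its own.
    from pywebio.output import toast
    from pywebio.session import run_js

    run_js(CLIPBOARD_JS, text=text)
    toast("Text copied to the clipboard")
```

Because writeText returns a promise, a failed copy is not reported back to Python here; the toast is shown either way.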

Full Code

You may find the full code example here:

Closing Thoughts

Congratulations! You now have the power of converting PDF to text at your fingertips, without sacrificing privacy, security, or ownership.

I remember the first time I stumbled upon PyWebIO, and I was immediately intrigued by its promise of creating web apps with just a few lines of Python code. While it might not suit complex projects, for simple tools, it's amazing.

In the past, I would have reached for something like PyQt to build a GUI application. Today, I’d say no more GUI frameworks; a browser app is all you need.

💡

Looking for alternatives? Check out Streamlit and Anvil.

Happy building!