Lukas Holzner 6b18868ca6 Add initial implementation of PDF to Excel table extractor
- Create main script for extracting tables from PDF files and saving to Excel format.
- Add dependency checks for required libraries and Ghostscript.
- Implement functions for extracting tables and saving them to Excel.
- Update README with usage instructions and examples.
- Add devcontainer configuration for development environment.
- Include .gitignore to exclude PDF and Excel files from version control.
- Specify required packages in requirements.txt.
2025-12-02 09:22:42 +00:00
.devcontainer
.gitignore
pdf_to_excel.py
README.md
requirements.txt

PDF to Excel Table Extractor

A command-line tool to extract tables from PDF files and save them to Excel format.

Features

  • Extract tables from single or multiple pages
  • Detect tables with and without visible borders automatically
  • Save each table to a separate sheet or combine them into one
  • Preserve table headers

Prerequisites

This tool requires Ghostscript to be installed on your system.

Install Ghostscript

Ubuntu/Debian:

sudo apt-get install ghostscript

Fedora/RHEL:

sudo dnf install ghostscript

macOS (using Homebrew):

brew install ghostscript

Windows: download and install from https://www.ghostscript.com/releases/gsdnld.html

Installation

  1. Clone or download this repository, then change into its directory:

    cd pdf-to-excel
    
  2. Create a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install the required Python packages:

    pip install -r requirements.txt
    

Usage

Basic Usage

Extract all tables from a PDF:

python pdf_to_excel.py input.pdf output.xlsx

Options

Option          Description
-p, --pages     Pages to extract from: a single page (1), a comma-separated list (1,2,3), or a range (1-5). Default: all
-c, --combine   Combine all tables into a single sheet
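
For orientation, the options above could be declared with argparse roughly as follows. This is a minimal sketch, not the actual contents of pdf_to_excel.py: the function name build_parser, the argument names input_pdf and output_xlsx, and the help texts are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the documented options (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_excel.py",
        description="Extract tables from a PDF file and save them to Excel.",
    )
    # Positional arguments, in the order used by the usage examples.
    parser.add_argument("input_pdf", help="path to the source PDF")
    parser.add_argument("output_xlsx", help="path of the Excel file to write")
    # Documented options; "all" is the documented default for --pages.
    parser.add_argument("-p", "--pages", default="all",
                        help="pages to extract from, e.g. 1, 1,2,3 or 1-5")
    parser.add_argument("-c", "--combine", action="store_true",
                        help="combine all tables into a single sheet")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Keeping --pages as a plain string leaves its interpretation to whatever performs the extraction.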

Examples

Extract tables from all pages:

python pdf_to_excel.py document.pdf tables.xlsx

Extract tables from specific pages:

python pdf_to_excel.py document.pdf tables.xlsx --pages 1,2,3

Extract tables from a page range:

python pdf_to_excel.py document.pdf tables.xlsx --pages 1-5

Extract from a single page:

python pdf_to_excel.py document.pdf tables.xlsx --pages 1

Combine all tables into one sheet:

python pdf_to_excel.py document.pdf tables.xlsx --combine
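
The --pages values used in the examples above (a single page, a comma-separated list, a range) can be validated before extraction starts. The helper below is a hypothetical sketch, not part of the tool: the name expand_pages and the choice to return None for "all" are assumptions, and the actual script may simply pass the string through to the extraction library.

```python
from typing import List, Optional


def expand_pages(spec: str) -> Optional[List[int]]:
    """Expand "1", "1,2,3" or "1-5" into page numbers; None means all pages."""
    if spec.strip().lower() == "all":
        return None
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "1-5" includes both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page selection: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Malformed input such as "5-1" or "a" raises ValueError, so a typo can be reported before the PDF is opened.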

Output

By default, each extracted table is saved to a separate sheet in the Excel file:

  • Table_1 - First table found
  • Table_2 - Second table found
  • etc.

When using the --combine flag, all tables are merged into a single sheet called Combined_Tables.
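
The sheet naming described above can be sketched as follows. This is an illustration under assumptions, not the tool's own code: plan_sheets and write_workbook are hypothetical names, tables are represented as plain lists of rows, and the combined sheet simply stacks the tables one after another (how the actual tool merges them, for example with repeated headers, is not stated here).

```python
from typing import Dict, List

Row = List[str]


def plan_sheets(tables: List[List[Row]], combine: bool = False) -> Dict[str, List[Row]]:
    """Map sheet names to rows: Table_1, Table_2, ... or one Combined_Tables sheet."""
    if combine:
        merged: List[Row] = []
        for table in tables:
            merged.extend(table)  # assumption: tables are stacked in order
        return {"Combined_Tables": merged}
    return {f"Table_{i}": table for i, table in enumerate(tables, start=1)}


def write_workbook(sheets: Dict[str, List[Row]], path: str) -> None:
    """Write the planned sheets with openpyxl (imported lazily)."""
    from openpyxl import Workbook

    if not sheets:
        raise ValueError("no tables to write")
    workbook = Workbook()
    workbook.remove(workbook.active)  # drop the default empty sheet
    for name, rows in sheets.items():
        sheet = workbook.create_sheet(title=name)
        for row in rows:
            sheet.append(row)
    workbook.save(path)
```

For example, write_workbook(plan_sheets(tables, combine=True), "tables.xlsx") would produce a workbook with a single Combined_Tables sheet.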

Troubleshooting

"No tables found"

  • The PDF might contain images of tables rather than actual table data. This tool works best with text-based PDFs.
  • Try restricting extraction to the pages that contain tables with the --pages option.

Ghostscript errors

  • Make sure Ghostscript is installed and available in your system PATH.
  • On Windows, you may need to restart your terminal after installing Ghostscript.
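
To check whether Ghostscript is reachable from the current terminal, a short script like the one below can help. It is a minimal sketch and not necessarily how pdf_to_excel.py performs its own dependency check; find_ghostscript is a hypothetical helper, and the executable names listed are the usual ones (gs on Linux and macOS, gswin64c or gswin32c on Windows).

```python
import shutil
from typing import Callable, Optional

# Usual Ghostscript executable names across platforms.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first Ghostscript executable found on PATH, else None."""
    for name in GHOSTSCRIPT_NAMES:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_ghostscript() or "Ghostscript not found on PATH")
```

If this prints a path in a fresh terminal but the tool still reports an error, the terminal running the tool likely has a stale PATH.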

Poor table extraction quality

  • The tool automatically tries two extraction methods (lattice and stream).
  • For tables with clear borders, the lattice method is used.
  • For tables without borders, the stream method is used.
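
One way to combine the two methods is to try lattice first and fall back to stream when it finds nothing. The sketch below uses camelot's read_pdf function with its flavor parameter. The function name extract_tables, the injectable read_pdf argument, and the "fall back when lattice returns no tables" rule are assumptions for illustration; the tool may decide between the methods differently.

```python
from typing import Any, Callable, Optional, Sequence


def extract_tables(
    pdf_path: str,
    pages: str = "all",
    read_pdf: Optional[Callable[..., Sequence[Any]]] = None,
) -> Sequence[Any]:
    """Try lattice (bordered tables) first, then stream (borderless tables)."""
    if read_pdf is None:
        # Imported lazily so this sketch loads even without camelot-py installed.
        import camelot

        read_pdf = camelot.read_pdf
    for flavor in ("lattice", "stream"):
        tables = read_pdf(pdf_path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return tables
    return []
```

With camelot, each returned table exposes its contents as a pandas DataFrame through the df attribute, which is convenient for inspecting a poor extraction result.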

Dependencies

  • camelot-py: PDF table extraction library
  • pandas: Data manipulation and Excel export
  • openpyxl: Excel file writing
  • opencv-python: Image processing for table detection
  • ghostscript: PDF rendering (system dependency)

License

MIT License