Lukas Holzner 6b18868ca6 Add initial implementation of PDF to Excel table extractor - Create main script for extracting tables from PDF files and saving to Excel format. - Add dependency checks for required libraries and Ghostscript. - Implement functions for extracting tables and saving them to Excel. - Update README with usage instructions and examples. - Add devcontainer configuration for development environment. - Include .gitignore to exclude PDF and Excel files from version control. - Specify required packages in requirements.txt.		2025-12-02 09:22:42 +00:00

PDF to Excel Table Extractor

A command-line tool to extract tables from PDF files and save them to Excel format.

Prerequisites

This tool requires Ghostscript to be installed on your system.

Ubuntu/Debian:

sudo apt-get install ghostscript

Fedora/RHEL:

sudo dnf install ghostscript

macOS (using Homebrew):

brew install ghostscript

Windows: Download and install from: https://www.ghostscript.com/releases/gsdnld.html
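
The commit notes mention a dependency check for Ghostscript. A minimal sketch of such a check is shown below; it is not necessarily how pdf_to_excel.py does it. The function name `find_ghostscript` and the injectable `which` parameter are hypothetical, and the executable names are the ones Ghostscript installs commonly use (`gs` on Linux/macOS, `gswin64c` or `gswin32c` on Windows).

```python
import shutil
from typing import Callable, Optional

# Executable names commonly used by Ghostscript installs:
# `gs` on Linux/macOS, `gswin64c` / `gswin32c` on Windows.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first Ghostscript executable on PATH, or None.

    `which` is a hypothetical hook so the lookup can be replaced in tests;
    by default it is the standard library's shutil.which.
    """
    for name in GHOSTSCRIPT_NAMES:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_ghostscript()
    if location is None:
        print("Ghostscript not found; install it before running the tool.")
    else:
        print(f"Ghostscript found at {location}")
```

Running a check like this before extraction lets the tool fail early with a clear message instead of an error from deep inside a library.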

Create a virtual environment (recommended):

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the required packages:

pip install -r requirements.txt

Extract all tables from a PDF:

python pdf_to_excel.py input.pdf output.xlsx

| Option | Description |
| --- | --- |
| `-p, --pages` | Specify pages to extract from. Default: `all` |
| `-c, --combine` | Combine all tables into a single sheet |
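
For readers who want to see how these options could be declared, here is a rough `argparse` sketch. Only the flags and the `all` default come from the table above. The function name `build_parser`, the positional argument names, and the help strings are assumptions, and the actual script may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command line (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_excel.py",
        description="Extract tables from a PDF file and save them to Excel.",
    )
    # Positional names are hypothetical; the README only shows their order.
    parser.add_argument("input_pdf", help="path to the source PDF")
    parser.add_argument("output_xlsx", help="path of the Excel file to write")
    parser.add_argument(
        "-p", "--pages", default="all",
        help="pages to extract from (default: all)",
    )
    parser.add_argument(
        "-c", "--combine", action="store_true",
        help="combine all tables into a single sheet",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["document.pdf", "tables.xlsx", "--pages", "1-5"])
print(args.pages, args.combine)
```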

Extract tables from all pages:

python pdf_to_excel.py document.pdf tables.xlsx

Extract tables from specific pages:

python pdf_to_excel.py document.pdf tables.xlsx --pages 1,2,3

Extract tables from a page range:

python pdf_to_excel.py document.pdf tables.xlsx --pages 1-5

Extract from a single page:

python pdf_to_excel.py document.pdf tables.xlsx --pages 1
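
The three `--pages` forms shown above (a comma-separated list, a range, and a single page) can be expanded into explicit page numbers roughly as follows. This is only a sketch: `parse_pages` is a hypothetical helper, and the real script may simply pass the string through to its extraction library.

```python
from typing import List, Optional


def parse_pages(spec: str) -> Optional[List[int]]:
    """Expand a page spec such as "1,2,3", "1-5", "1" or "all".

    Returns None for "all" (meaning no page filter), otherwise a sorted
    list of unique 1-based page numbers. Raises ValueError on bad input.
    """
    spec = spec.strip().lower()
    if spec == "all":
        return None
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        if "-" in part:
            # A range like "1-5" includes both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_pages("1-5"))
```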

Combine all tables into one sheet:

python pdf_to_excel.py document.pdf tables.xlsx --combine

By default, each extracted table is saved to a separate sheet in the Excel file.

When using the --combine flag, all tables are merged into a single sheet called Combined_Tables.
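
The difference between the two output modes can be sketched in plain Python, with each table represented as a list of rows. The helper `plan_sheets` and the `Table_<n>` sheet names are hypothetical, since the README does not show the per-table sheet names. Only `Combined_Tables` comes from the text above.

```python
from typing import Dict, List

Row = List[str]
Table = List[Row]


def plan_sheets(tables: List[Table], combine: bool = False) -> Dict[str, Table]:
    """Map sheet names to the rows that would be written to each sheet."""
    if combine:
        # --combine: append every table's rows into one sheet.
        merged: Table = []
        for table in tables:
            merged.extend(table)
        return {"Combined_Tables": merged}
    # Default: one sheet per table. "Table_<n>" is an assumed naming scheme.
    return {f"Table_{i}": table for i, table in enumerate(tables, start=1)}


tables = [[["a", "b"], ["1", "2"]], [["c", "d"], ["3", "4"]]]
print(list(plan_sheets(tables)))
print(list(plan_sheets(tables, combine=True)))
```

A writer built on a spreadsheet library would then iterate over the returned mapping and write one sheet per entry.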

If no tables are found, the PDF might contain images of tables rather than actual table data. This tool works best with text-based PDFs.
Try limiting extraction to specific pages with the --pages option.

MIT License