PDF to Excel Table Extractor
A command-line tool to extract tables from PDF files and save them to Excel format.
Features
- Extract tables from single or multiple pages
- Automatically detect tables with and without visible borders
- Save each table to a separate sheet or combine them into one
- Preserve table headers
Prerequisites
This tool requires Ghostscript to be installed on your system.
Install Ghostscript
Ubuntu/Debian:
sudo apt-get install ghostscript
Fedora/RHEL:
sudo dnf install ghostscript
macOS (using Homebrew):
brew install ghostscript
Windows: Download and install from: https://www.ghostscript.com/releases/gsdnld.html
Installation
- Clone or download this repository:
  cd pdf-to-excel
- Create a virtual environment (recommended):
  python3 -m venv venv
  source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the required Python packages:
  pip install -r requirements.txt
Usage
Basic Usage
Extract all tables from a PDF:
python pdf_to_excel.py input.pdf output.xlsx
Options
| Option | Description |
|---|---|
| -p, --pages | Specify pages to extract from. Default: all |
| -c, --combine | Combine all tables into a single sheet |
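The options above map naturally onto Python's argparse module. The following is a minimal sketch of how such a command line could be declared, not the actual code in pdf_to_excel.py; the function name build_parser and the argument names input_pdf and output_xlsx are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_excel.py",
        description="Extract tables from a PDF file and save them to Excel.",
    )
    # Positional arguments: names are assumptions, the README only shows their order.
    parser.add_argument("input_pdf", help="path to the source PDF")
    parser.add_argument("output_xlsx", help="path of the Excel file to write")
    parser.add_argument(
        "-p", "--pages", default="all",
        help="pages to extract from (default: all)",
    )
    parser.add_argument(
        "-c", "--combine", action="store_true",
        help="combine all tables into a single sheet",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["document.pdf", "tables.xlsx", "--pages", "1-5", "-c"])
print(args.pages, args.combine)
```

Using store_true for --combine keeps the flag optional and off by default, which matches the behaviour described in the table.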
Examples
Extract tables from all pages:
python pdf_to_excel.py document.pdf tables.xlsx
Extract tables from specific pages:
python pdf_to_excel.py document.pdf tables.xlsx --pages 1,2,3
Extract tables from a page range:
python pdf_to_excel.py document.pdf tables.xlsx --pages 1-5
Extract from a single page:
python pdf_to_excel.py document.pdf tables.xlsx --pages 1
Combine all tables into one sheet:
python pdf_to_excel.py document.pdf tables.xlsx --combine
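The --pages values used in these examples (a single page, a comma-separated list, a range, or the default all) can be validated before any extraction starts. The sketch below shows one way to expand such a value into page numbers. The helper parse_pages is hypothetical; the real script may simply pass the string through to the extraction library.

```python
def parse_pages(spec):
    """Expand a page spec such as "all", "1", "1,2,3" or "1-5".

    Returns None for "all", otherwise a sorted list of 1-based page numbers.
    Raises ValueError for malformed input.
    """
    spec = spec.strip().lower()
    if spec == "all":
        return None

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "1-5" is inclusive at both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_pages("1-5"))
print(parse_pages("1,2,3"))
```

Rejecting bad input early gives a clearer error message than a failure deep inside the PDF library.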
Output
By default, each extracted table is saved to a separate sheet in the Excel file:
- Table_1 - First table found
- Table_2 - Second table found
- etc.
When using the --combine flag, all tables are merged into a single sheet called Combined_Tables.
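The sheet layout described here can be sketched with pandas and openpyxl, both listed under Dependencies. This is an illustrative sketch rather than the script's actual code: the function names sheet_names and save_tables are assumptions, and the tables are assumed to arrive as pandas DataFrames.

```python
def sheet_names(table_count, combine=False):
    """Return the sheet names for a given number of extracted tables."""
    if table_count == 0:
        return []
    if combine:
        return ["Combined_Tables"]
    return [f"Table_{i}" for i in range(1, table_count + 1)]


def save_tables(frames, output_path, combine=False):
    """Write a list of pandas DataFrames to an Excel workbook.

    Requires pandas and openpyxl; pandas is imported lazily so that
    sheet_names() stays usable without it.
    """
    import pandas as pd

    if not frames:
        raise ValueError("no tables to save")
    if combine:
        # Stack all tables vertically into one frame with a fresh row index.
        frames = [pd.concat(frames, ignore_index=True)]
    names = sheet_names(len(frames), combine)
    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        for name, frame in zip(names, frames):
            frame.to_excel(writer, sheet_name=name, index=False)


print(sheet_names(2))
print(sheet_names(2, combine=True))
```

Stacking with pd.concat is only one possible meaning of "merged". Tables with different column layouts would produce empty cells where columns do not line up.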
Troubleshooting
"No tables found"
- The PDF might contain images of tables rather than actual table data. This tool works best with text-based PDFs.
- Try specifying specific pages with the --pages option.
Ghostscript errors
- Make sure Ghostscript is installed and available in your system PATH.
- On Windows, you may need to restart your terminal after installing Ghostscript.
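To confirm that Ghostscript is reachable from your PATH, a short check along the following lines can help. It is a sketch that assumes the executable is named gs on Linux and macOS and gswin64c or gswin32c on Windows. The helper find_ghostscript is hypothetical and is not necessarily how the script performs its own dependency check.

```python
import shutil

# Executable names Ghostscript is commonly installed under (assumption:
# "gs" on Linux/macOS, "gswin64c" or "gswin32c" on Windows).
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(which=shutil.which):
    """Return the path of the first Ghostscript executable on PATH, or None."""
    for name in GHOSTSCRIPT_NAMES:
        path = which(name)
        if path:
            return path
    return None


location = find_ghostscript()
print(location or "Ghostscript not found on PATH")
```

If this prints a path in a new terminal but not in the one where you ran the installer, the old terminal is still using the previous PATH.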
Poor table extraction quality
- The tool automatically tries two extraction methods (lattice and stream).
- For tables with clear borders, the lattice method is used.
- For tables without borders, the stream method is used.
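The two methods correspond to the lattice and stream flavors of camelot's read_pdf function. One plausible way to combine them, sketched below, is to try lattice first and fall back to stream when no tables are found. The exact selection logic in pdf_to_excel.py may differ, and the function name extract_tables and its read_pdf parameter are assumptions added so the fallback can be exercised without a PDF.

```python
def extract_tables(pdf_path, pages="all", read_pdf=None):
    """Try the lattice flavor first, then fall back to stream.

    Returns (flavor_used, list_of_dataframes), or (None, []) if neither
    flavor finds a table. read_pdf defaults to camelot.read_pdf.
    """
    if read_pdf is None:
        import camelot  # imported lazily; requires camelot-py and Ghostscript
        read_pdf = camelot.read_pdf

    for flavor in ("lattice", "stream"):
        tables = read_pdf(pdf_path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            # Each camelot table exposes its contents as a pandas DataFrame via .df
            return flavor, [table.df for table in tables]
    return None, []
```

A call such as extract_tables("document.pdf", pages="1-5") would then report which flavor produced the result, which is useful when diagnosing poor extraction quality.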
Dependencies
- camelot-py: PDF table extraction library
- pandas: Data manipulation and Excel export
- openpyxl: Excel file writing
- opencv-python: Image processing for table detection
- ghostscript: PDF rendering (system dependency)
License
MIT License