# PDF to Excel Table Extractor

A command-line tool that extracts tables from PDF files and saves them in Excel format.

## Features

- Extract tables from single or multiple pages
- Automatically detect tables with and without visible borders
- Save each table to a separate sheet or combine them into one
- Preserve table headers

## Prerequisites

This tool requires **Ghostscript** to be installed on your system.

### Install Ghostscript

**Ubuntu/Debian:**

```bash
sudo apt-get install ghostscript
```

**Fedora/RHEL:**

```bash
sudo dnf install ghostscript
```

**macOS (using Homebrew):**

```bash
brew install ghostscript
```

**Windows:**

Download and install from: https://www.ghostscript.com/releases/gsdnld.html

## Installation

1. Clone or download this repository, then change into its directory:

   ```bash
   cd pdf-to-excel
   ```

2. Create a virtual environment (recommended):

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

### Basic Usage

Extract all tables from a PDF:

```bash
python pdf_to_excel.py input.pdf output.xlsx
```

### Options

| Option | Description |
|--------|-------------|
| `-p, --pages` | Specify pages to extract from. Default: `all` |
| `-c, --combine` | Combine all tables into a single sheet |

### Examples

Extract tables from all pages:

```bash
python pdf_to_excel.py document.pdf tables.xlsx
```

Extract tables from specific pages:

```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1,2,3
```

Extract tables from a page range:

```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1-5
```

Extract from a single page:

```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1
```

Combine all tables into one sheet:

```bash
python pdf_to_excel.py document.pdf tables.xlsx --combine
```

## Output

By default, each extracted table is saved to a separate sheet in the Excel file:

- `Table_1` - First table found
- `Table_2` - Second table found
- etc.

When using the `--combine` flag, all tables are merged into a single sheet called `Combined_Tables`.

## Troubleshooting

### "No tables found"

- The PDF might contain images of tables rather than actual table data. This tool works best with text-based PDFs.
- Try limiting extraction to the pages that contain tables with the `--pages` option.

### Ghostscript errors

- Make sure Ghostscript is installed and available in your system PATH.
- On Windows, you may need to restart your terminal after installing Ghostscript.

### Poor table extraction quality

- The tool automatically tries two extraction methods (lattice and stream).
- For tables with clear borders, the lattice method is used.
- For tables without borders, the stream method is used.

## Dependencies

- **camelot-py**: PDF table extraction library
- **pandas**: Data manipulation and Excel export
- **openpyxl**: Excel file writing
- **opencv-python**: Image processing for table detection
- **ghostscript**: PDF rendering (system dependency)

## License

MIT License
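
## Appendix: Sketch of the command-line interface

The options documented above (`-p, --pages` with a default of `all`, and `-c, --combine`) could be declared with Python's standard `argparse` module roughly as follows. This is a minimal sketch, not the actual contents of `pdf_to_excel.py`: the function name `build_parser` and the argument names `input_pdf` and `output_xlsx` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed in this README.

    Hypothetical helper: the real script may define its arguments differently.
    """
    parser = argparse.ArgumentParser(
        prog="pdf_to_excel.py",
        description="Extract tables from a PDF and save them to Excel.",
    )
    # Positional arguments (names are assumptions, not taken from the script).
    parser.add_argument("input_pdf", help="path to the source PDF")
    parser.add_argument("output_xlsx", help="path of the Excel file to write")
    # Page selection is kept as a plain string such as "all", "1", "1,2,3" or "1-5".
    parser.add_argument(
        "-p", "--pages", default="all",
        help="pages to extract from (default: all)",
    )
    # Boolean flag: present means merge every table into one sheet.
    parser.add_argument(
        "-c", "--combine", action="store_true",
        help="combine all tables into a single sheet",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["document.pdf", "tables.xlsx", "--pages", "1-5", "--combine"]
)
print(args.pages, args.combine)  # prints: 1-5 True
```

Keeping `--pages` as an unparsed string means the same value can be handed directly to whatever extraction call the script makes, which is one plausible reason the documented syntax (`1,2,3`, `1-5`, `all`) looks the way it does.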