pdf-to-excel/README.md
Lukas Holzner 6b18868ca6 Add initial implementation of PDF to Excel table extractor
- Create main script for extracting tables from PDF files and saving to Excel format.
- Add dependency checks for required libraries and Ghostscript.
- Implement functions for extracting tables and saving them to Excel.
- Update README with usage instructions and examples.
- Add devcontainer configuration for development environment.
- Include .gitignore to exclude PDF and Excel files from version control.
- Specify required packages in requirements.txt.
2025-12-02 09:22:42 +00:00


# PDF to Excel Table Extractor
A command-line tool to extract tables from PDF files and save them to Excel format.
## Features
- Extract tables from single or multiple pages
- Detect tables with and without visible borders automatically
- Save each table to a separate sheet or combine them all into one
- Preserve table headers
## Prerequisites
This tool requires **Ghostscript** to be installed on your system.
### Install Ghostscript
**Ubuntu/Debian:**
```bash
sudo apt-get install ghostscript
```
**Fedora/RHEL:**
```bash
sudo dnf install ghostscript
```
**macOS (using Homebrew):**
```bash
brew install ghostscript
```
**Windows:**
Download and install from: https://www.ghostscript.com/releases/gsdnld.html
## Installation
1. Clone or download this repository, then change into the project directory:
```bash
cd pdf-to-excel
```
2. Create a virtual environment (recommended):
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. Install the required Python packages:
```bash
pip install -r requirements.txt
```
## Usage
### Basic Usage
Extract all tables from a PDF:
```bash
python pdf_to_excel.py input.pdf output.xlsx
```
### Options
| Option | Description |
|--------|-------------|
| `-p, --pages` | Pages to extract from: a comma-separated list (`1,2,3`), a range (`1-5`), or a single page (`1`). Default: `all` |
| `-c, --combine` | Combine all tables into a single sheet |
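
The options above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual source: the positional argument names `input_pdf` and `output_xlsx` and the helper `build_parser` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_excel.py",
        description="Extract tables from a PDF file and save them to Excel.",
    )
    # Positional names are assumptions; the README only shows their order.
    parser.add_argument("input_pdf", help="path to the source PDF")
    parser.add_argument("output_xlsx", help="path of the Excel file to write")
    parser.add_argument(
        "-p", "--pages", default="all",
        help="pages to extract from, e.g. 1,2,3 or 1-5 (default: all)",
    )
    parser.add_argument(
        "-c", "--combine", action="store_true",
        help="combine all tables into a single sheet",
    )
    return parser


# Parse an explicit argument list instead of sys.argv to show the result.
args = build_parser().parse_args(["document.pdf", "tables.xlsx", "--pages", "1-5"])
print(args.pages, args.combine)
```

Passing `default="all"` keeps `--pages` optional, and `action="store_true"` makes `--combine` a plain flag that is off unless given.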
### Examples
Extract tables from all pages:
```bash
python pdf_to_excel.py document.pdf tables.xlsx
```
Extract tables from specific pages:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1,2,3
```
Extract tables from a page range:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1-5
```
Extract from a single page:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1
```
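
The three numeric forms of `--pages` shown above (list, range, single page) can be validated before extraction starts. The sketch below uses a hypothetical helper, `expand_pages`, that is not part of the tool; it only illustrates how such a specification maps to page numbers. Camelot itself accepts the string form directly, so expanding it is optional.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1,3-5" into a sorted list of page numbers.

    Hypothetical helper: handles numeric forms only, so the special
    value "all" must be checked by the caller before calling this.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(expand_pages("1,2,3"))  # [1, 2, 3]
print(expand_pages("1-5"))    # [1, 2, 3, 4, 5]
print(expand_pages("1"))      # [1]
```

Malformed input such as `1-` or `abc` raises `ValueError`, which a command-line wrapper could turn into a usage error.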
Combine all tables into one sheet:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --combine
```
## Output
By default, each extracted table is saved to a separate sheet in the Excel file:
- `Table_1` - First table found
- `Table_2` - Second table found
- etc.
When using the `--combine` flag, all tables are merged into a single sheet called `Combined_Tables`.
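
The sheet naming described above can be sketched in plain Python. The helper `plan_sheets` is hypothetical, and the merge strategy (stacking the rows of each table in the order the tables were found) is an assumption; the tool may combine tables differently.

```python
from typing import Dict, List, Sequence

Row = Sequence[object]


def plan_sheets(tables: Sequence[Sequence[Row]], combine: bool = False) -> Dict[str, List[Row]]:
    """Map sheet names to rows, following the naming scheme in this README."""
    if combine:
        # Assumption: combined output is the tables' rows stacked in order.
        merged = [row for table in tables for row in table]
        return {"Combined_Tables": merged}
    # One sheet per table, numbered from 1 in the order the tables were found.
    return {f"Table_{index}": list(table) for index, table in enumerate(tables, start=1)}


tables = [[["Name", "Qty"], ["Bolt", 4]], [["Name", "Qty"], ["Nut", 9]]]
print(list(plan_sheets(tables)))
print(list(plan_sheets(tables, combine=True)))
```

Each entry of the returned mapping could then be written to the workbook, for example by converting the rows to a pandas `DataFrame` and calling `to_excel` with the matching `sheet_name` inside a `pandas.ExcelWriter`.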
## Troubleshooting
### "No tables found"
- The PDF might contain images of tables (for example, scanned pages) rather than actual text. This tool works best with text-based PDFs.
- Try limiting extraction to the pages that contain tables with the `--pages` option.
### Ghostscript errors
- Make sure Ghostscript is installed and available in your system PATH.
- On Windows, you may need to restart your terminal after installing Ghostscript.
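
To confirm that Ghostscript is reachable from Python, a quick check with `shutil.which` can help. This is a sketch rather than the tool's own dependency check: the helper `find_ghostscript` is hypothetical, and the executable names tried are the usual ones (`gs` on Linux and macOS, `gswin64c` or `gswin32c` on Windows).

```python
import shutil
from typing import Iterable, Optional

# Common Ghostscript executable names; adjust if your install differs.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(names: Iterable[str] = GHOSTSCRIPT_NAMES) -> Optional[str]:
    """Return the path of the first Ghostscript executable found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


path = find_ghostscript()
print(path or "Ghostscript not found on PATH")
```

If this prints a path in a new terminal but not in the current one, the current session is probably still using the old `PATH`.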
### Poor table extraction quality
- The tool automatically tries two extraction methods (lattice and stream).
- For tables with clear borders, the lattice method is used.
- For tables without borders, the stream method is used.
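
One way to try both methods is to attempt lattice first and fall back to stream when nothing is found. The sketch below assumes that order, which this README does not spell out, and uses a hypothetical helper, `extract_with_fallback`. The reader function is passed in as a parameter so the logic can be exercised without a PDF; in practice it would be `camelot.read_pdf`, which accepts `pages` and `flavor` keyword arguments.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def extract_with_fallback(
    read_pdf: Callable[..., Sequence[Any]],
    path: str,
    pages: str = "all",
) -> Tuple[Optional[str], Sequence[Any]]:
    """Try lattice, then stream; return the flavor that found tables and the tables.

    Returns (None, <empty result>) when neither method finds a table.
    """
    tables: Sequence[Any] = []
    for flavor in ("lattice", "stream"):
        tables = read_pdf(path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, tables


# Usage with Camelot (requires camelot-py and Ghostscript to be installed):
#   import camelot
#   flavor, tables = extract_with_fallback(camelot.read_pdf, "document.pdf", pages="1-5")
```

Knowing which flavor produced the result makes it easier to judge why a borderless table was split or merged unexpectedly.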
## Dependencies
- **camelot-py**: PDF table extraction library
- **pandas**: Data manipulation and Excel export
- **openpyxl**: Excel file writing
- **opencv-python**: Image processing for table detection
- **ghostscript**: PDF rendering (system dependency)
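
To see which of the Python packages above are missing from the active environment, their import names can be probed without importing them. This is a sketch with a hypothetical helper, `missing_modules`, and is not necessarily how the tool performs its own check. Note that `camelot-py` is imported as `camelot` and `opencv-python` as `cv2`.

```python
import importlib.util
from typing import Iterable, List

# Import names, which differ from the pip package names for two entries.
REQUIRED_MODULES = ("camelot", "pandas", "openpyxl", "cv2")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing Python modules:", ", ".join(missing))
else:
    print("All required Python modules are installed.")
```

Any module reported as missing can be installed with `pip install -r requirements.txt` inside the virtual environment.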
## License
MIT License