# PDF to Excel Table Extractor

A command-line tool to extract tables from PDF files and save them to Excel format.

## Features

- Extract tables from a single page, several pages, or a page range
- Detect tables with and without visible borders automatically
- Save each table to its own sheet, or combine all tables into one sheet
- Preserve table headers

## Prerequisites

This tool requires **Ghostscript** to be installed on your system.

### Install Ghostscript

**Ubuntu/Debian:**
```bash
sudo apt-get install ghostscript
```

**Fedora/RHEL:**
```bash
sudo dnf install ghostscript
```

**macOS (using Homebrew):**
```bash
brew install ghostscript
```

**Windows:**
Download and install from: https://www.ghostscript.com/releases/gsdnld.html

## Installation

1. Clone or download this repository, then change into the project directory:
   ```bash
   cd pdf-to-excel
   ```

2. Create a virtual environment (recommended):
   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the required Python packages:
   ```bash
   pip install -r requirements.txt
   ```

## Usage

### Basic Usage

Extract all tables from a PDF:
```bash
python pdf_to_excel.py input.pdf output.xlsx
```

### Options

| Option | Description |
|--------|-------------|
| `-p, --pages` | Pages to extract from: a single page (`1`), a comma-separated list (`1,2,3`), or a range (`1-5`). Default: `all` |
| `-c, --combine` | Combine all tables into a single sheet |
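
For readers who want to see how these options fit together, here is a minimal `argparse` sketch of a command line with the same shape. It is an illustration only, not the actual contents of `pdf_to_excel.py`; the function name `build_parser` and the argument names `input_pdf` and `output_xlsx` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_excel.py",
        description="Extract tables from a PDF into an Excel workbook.",
    )
    # Positional arguments: names are assumptions made for this sketch.
    parser.add_argument("input_pdf", help="path to the source PDF")
    parser.add_argument("output_xlsx", help="path of the Excel file to write")
    # Documented options from the table above.
    parser.add_argument(
        "-p", "--pages", default="all",
        help="pages to extract from, e.g. 1, 1,2,3 or 1-5 (default: all)",
    )
    parser.add_argument(
        "-c", "--combine", action="store_true",
        help="combine all tables into a single sheet",
    )
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without real files.
    args = build_parser().parse_args(
        ["document.pdf", "tables.xlsx", "--pages", "1-5"]
    )
    print(args.input_pdf, args.output_xlsx, args.pages, args.combine)
```

The page specification is kept as a plain string here, on the assumption that it is passed through to the extraction library unchanged.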

### Examples

Extract tables from all pages:
```bash
python pdf_to_excel.py document.pdf tables.xlsx
```

Extract tables from specific pages:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1,2,3
```

Extract tables from a page range:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1-5
```

Extract from a single page:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1
```

Combine all tables into one sheet:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --combine
```

## Output

By default, each extracted table is saved to a separate sheet in the Excel file:
- `Table_1` - First table found
- `Table_2` - Second table found
- etc.

When using the `--combine` flag, all tables are merged into a single sheet called `Combined_Tables`.
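
The sheet layout described above could be produced with pandas roughly as follows. This is a sketch under stated assumptions rather than the tool's actual code: `sheet_layout` and `write_tables` are hypothetical helper names, the tables are assumed to arrive as a list of pandas DataFrames, and writing without the index column is a choice made for the example.

```python
def sheet_layout(n_tables, combine=False):
    """Return the sheet names a workbook with n_tables tables would contain."""
    if n_tables <= 0:
        return []
    if combine:
        return ["Combined_Tables"]
    # Sheets are numbered from 1 in the order the tables were found.
    return [f"Table_{i}" for i in range(1, n_tables + 1)]


def write_tables(frames, output_path, combine=False):
    """Write DataFrames to an Excel file using the layout above (sketch)."""
    # Imported lazily so sheet_layout stays usable without pandas installed.
    import pandas as pd

    if not frames:
        raise ValueError("no tables to write")
    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        if combine:
            # Stack all tables vertically into one sheet.
            merged = pd.concat(frames, ignore_index=True)
            merged.to_excel(writer, sheet_name="Combined_Tables", index=False)
        else:
            names = sheet_layout(len(frames))
            for name, frame in zip(names, frames):
                frame.to_excel(writer, sheet_name=name, index=False)
```

For example, `sheet_layout(2)` gives `["Table_1", "Table_2"]`, while `sheet_layout(2, combine=True)` gives `["Combined_Tables"]`.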

## Troubleshooting

### "No tables found"
- The PDF might contain images of tables rather than actual table data. This tool works best with text-based PDFs.
- Try specifying specific pages with the `--pages` option.

### Ghostscript errors
- Make sure Ghostscript is installed and available in your system PATH.
- On Windows, you may need to restart your terminal after installing Ghostscript.
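
A quick way to confirm that a Ghostscript executable is visible on your PATH is a small check like the one below. It is a diagnostic sketch, not part of the tool: `find_ghostscript` is a hypothetical helper, the executable names (`gs`, `gswin64c`, `gswin32c`) are the usual ones but may differ on your system, and the extraction library may locate Ghostscript by other means.

```python
import shutil

# Usual executable names: "gs" on Linux/macOS, "gswin64c"/"gswin32c" on Windows.
# This list is an assumption; adjust it if your installation differs.
CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript():
    """Return the path of the first Ghostscript executable on PATH, or None."""
    for name in CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    if found:
        print(f"Ghostscript found at {found}")
    else:
        print("Ghostscript not found on PATH")
```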

### Poor table extraction quality
- The tool automatically tries two extraction methods: lattice and stream.
- Lattice is used for tables with clearly drawn borders.
- Stream is used for tables without borders, where the columns are separated only by whitespace.
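
One plausible way to combine the two methods is to try lattice first and fall back to stream when nothing is found. The sketch below assumes that order, which this README does not confirm. `extract_with_fallback` is a hypothetical name, and the `read_pdf` parameter exists only so the logic can be exercised without a real PDF; by default it uses `camelot.read_pdf` with its `pages` and `flavor` arguments.

```python
def extract_with_fallback(pdf_path, pages="all", read_pdf=None):
    """Try lattice, then stream; return (flavor_used, tables).

    Returns (None, []) if neither method finds a table.
    """
    if read_pdf is None:
        # Imported lazily so the function can be tested with a stand-in reader.
        import camelot
        read_pdf = camelot.read_pdf

    # Assumed order: bordered tables first, then whitespace-separated ones.
    for flavor in ("lattice", "stream"):
        tables = read_pdf(pdf_path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []
```

With Camelot installed, `extract_with_fallback("document.pdf", pages="1-5")` returns the method that succeeded together with the tables it found; each Camelot table exposes its data as a DataFrame through `.df`.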

## Dependencies

- **camelot-py**: PDF table extraction library
- **pandas**: Data manipulation and Excel export
- **openpyxl**: Excel file writing
- **opencv-python**: Image processing for table detection
- **ghostscript**: PDF rendering (system dependency)

## License

MIT License