# PDF to Excel Table Extractor

A command-line tool that extracts tables from PDF files and saves them to Excel format.

## Features

- Extract tables from a single page or multiple pages
- Automatically detect tables with and without visible borders
- Save each table to a separate sheet or combine them into one
- Preserve table headers

## Prerequisites

This tool requires **Ghostscript** to be installed on your system.

### Install Ghostscript

**Ubuntu/Debian:**
```bash
sudo apt-get install ghostscript
```

**Fedora/RHEL:**
```bash
sudo dnf install ghostscript
```

**macOS (using Homebrew):**
```bash
brew install ghostscript
```

**Windows:**
Download and install from: https://www.ghostscript.com/releases/gsdnld.html

## Installation

1. Clone or download this repository, then change into the project directory:
```bash
cd pdf-to-excel
```

2. Create a virtual environment (recommended):
```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install the required Python packages:
```bash
pip install -r requirements.txt
```

## Usage

### Basic Usage

Extract all tables from a PDF:
```bash
python pdf_to_excel.py input.pdf output.xlsx
```

### Options

| Option | Description |
|--------|-------------|
| `-p, --pages` | Pages to extract from: a single page (`1`), a comma-separated list (`1,2,3`), a range (`1-5`), or `all`. Default: `all` |
| `-c, --combine` | Combine all tables into a single sheet |

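The options above map naturally onto Python's `argparse`. The following is a minimal sketch of how such a command line could be declared, not the actual contents of `pdf_to_excel.py`; the function name `build_parser` and the argument names `input_pdf` and `output_xlsx` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a CLI matching the options table (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_excel.py",
        description="Extract tables from a PDF and save them to Excel.",
    )
    # Positional arguments: names are assumptions, not taken from the script.
    parser.add_argument("input_pdf", help="path to the source PDF")
    parser.add_argument("output_xlsx", help="path of the Excel file to write")
    parser.add_argument("-p", "--pages", default="all",
                        help="pages to extract from (default: all)")
    parser.add_argument("-c", "--combine", action="store_true",
                        help="combine all tables into a single sheet")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["document.pdf", "tables.xlsx", "--pages", "1-5"])
print(args.pages, args.combine)  # prints: 1-5 False
```

Keeping `--pages` as a plain string leaves its interpretation to a later step, so the parser itself stays small.
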
### Examples

Extract tables from all pages:
```bash
python pdf_to_excel.py document.pdf tables.xlsx
```

Extract tables from specific pages:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1,2,3
```

Extract tables from a page range:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1-5
```

Extract from a single page:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --pages 1
```

Combine all tables into one sheet:
```bash
python pdf_to_excel.py document.pdf tables.xlsx --combine
```

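The `--pages` formats shown in the examples (single page, comma-separated list, range, `all`) can be read as a small grammar. The sketch below expands such a value into explicit page numbers. The helper `parse_pages` is hypothetical and only illustrates the formats; Camelot's `read_pdf` accepts page strings of this shape directly, so the tool may simply pass the value through unchanged.

```python
from typing import List, Optional


def parse_pages(spec: str) -> Optional[List[int]]:
    """Expand a --pages value into sorted page numbers; None means every page."""
    spec = spec.strip().lower()
    if spec == "all":
        return None
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "1-5" is inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_pages("1,2,3"))  # [1, 2, 3]
print(parse_pages("1-5"))    # [1, 2, 3, 4, 5]
print(parse_pages("all"))    # None
```
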
## Output

By default, each extracted table is saved to a separate sheet in the Excel file:
- `Table_1` - First table found
- `Table_2` - Second table found
- etc.

When using the `--combine` flag, all tables are merged into a single sheet called `Combined_Tables`.

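The sheet naming described above can be sketched in a few lines. This is an illustration only: the helper `plan_sheets` is hypothetical, tables are modeled as plain lists of rows rather than DataFrames, and stacking tables top to bottom in combined mode is an assumption about how the merge might work.

```python
from typing import Dict, List

Row = List[str]
Table = List[Row]


def plan_sheets(tables: List[Table], combine: bool = False) -> Dict[str, Table]:
    """Map sheet names to rows, following the naming described in this section."""
    if combine:
        merged: Table = []
        for table in tables:
            merged.extend(table)  # assumption: tables are stacked top to bottom
        return {"Combined_Tables": merged}
    # Sheets are numbered from 1 in the order the tables were found.
    return {f"Table_{i}": table for i, table in enumerate(tables, start=1)}


tables = [[["a", "b"], ["1", "2"]], [["c", "d"], ["3", "4"]]]
print(list(plan_sheets(tables)))                # ['Table_1', 'Table_2']
print(list(plan_sheets(tables, combine=True)))  # ['Combined_Tables']
```

With pandas, each entry could then be written through `DataFrame.to_excel(writer, sheet_name=name)` inside a `pandas.ExcelWriter` context.
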
## Troubleshooting

### "No tables found"
- The PDF might contain images of tables (for example, a scanned document) rather than actual table data. This tool works best with text-based PDFs.
- Try targeting specific pages with the `--pages` option.

### Ghostscript errors
- Make sure Ghostscript is installed and available in your system PATH.
- On Windows, you may need to restart your terminal after installing Ghostscript.

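A quick way to confirm that Ghostscript is reachable on PATH is to look up its executable from Python. The sketch below is not the tool's own dependency check; the helper `find_ghostscript` is hypothetical, and the executable names (`gs` on Linux and macOS, `gswin64c` or `gswin32c` on Windows) are the names Ghostscript commonly installs under.

```python
import shutil
from typing import Optional, Sequence

# Common Ghostscript executable names: "gs" on Linux/macOS,
# "gswin64c" / "gswin32c" for the Windows console builds.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(names: Sequence[str] = GHOSTSCRIPT_NAMES) -> Optional[str]:
    """Return the full path of the first Ghostscript executable on PATH, else None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    if found:
        print(f"Ghostscript found at {found}")
    else:
        print("Ghostscript not found on PATH")
```

If this reports that Ghostscript is missing right after installation, opening a new terminal so that the updated PATH is picked up is worth trying first.
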
### Poor table extraction quality
- The tool automatically tries two extraction methods (lattice and stream).
- For tables with clear borders, the lattice method is used; it locates cells from the ruling lines.
- For tables without borders, the stream method is used; it infers columns from the whitespace between text.

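One plausible way to try both methods is to run lattice first and fall back to stream only when lattice finds nothing. The sketch below assumes that order and that fallback rule; neither is confirmed by this README. The helper `extract_with_fallback` is hypothetical, and the reader is passed in as a parameter so the logic can be exercised without a PDF.

```python
from typing import Any, Callable, Sequence, Tuple

# A reader is called as read(path, pages=..., flavor=...) and returns a
# sequence of tables; camelot.read_pdf can be called this way.
Reader = Callable[..., Sequence[Any]]


def extract_with_fallback(
    read: Reader, pdf_path: str, pages: str = "all"
) -> Tuple[str, Sequence[Any]]:
    """Try lattice, then stream; return (flavor used, tables found)."""
    tables: Sequence[Any] = []
    for flavor in ("lattice", "stream"):
        tables = read(pdf_path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    # Neither method found a table; report the last flavor tried.
    return "stream", tables


def fake_reader(path: str, pages: str = "all", flavor: str = "lattice") -> list:
    """Stand-in reader: pretends only the stream method finds a table."""
    return ["borderless table"] if flavor == "stream" else []


print(extract_with_fallback(fake_reader, "document.pdf"))
# prints: ('stream', ['borderless table'])
```

With Camelot installed, the same function could be called as `extract_with_fallback(camelot.read_pdf, "document.pdf", pages="1-5")`; each returned table exposes its data as a pandas DataFrame through `.df`.
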
## Dependencies

- **camelot-py**: PDF table extraction library
- **pandas**: Data manipulation and Excel export
- **openpyxl**: Excel file writing
- **opencv-python**: Image processing for table detection
- **ghostscript**: PDF rendering (system dependency)

## License

MIT License