# OCR Sensitive Information Redaction

This project is a Python script that uses Optical Character Recognition (OCR) to redact sensitive information from documents.
It accepts documents in several formats (PDF, DOCX, images), finds sensitive information such as credit card numbers and Hong Kong Identity Card numbers,
and redacts it before saving the result as a DOCX file.
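
As a rough illustration of the detection step, the sketch below masks the two kinds of sensitive values in text that OCR has already extracted. The regular expressions, the Luhn checksum filter, and the names `CARD_RE`, `HKID_RE`, `luhn_ok`, and `redact_text` are assumptions made for this example; they are not taken from `pythontransform.py`, which may detect and redact differently (for instance by drawing boxes over image regions).

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# Assumed pattern: 1-2 capital letters, 6 digits, check digit (0-9 or A),
# with optional parentheses around the check digit, e.g. A123456(7).
HKID_RE = re.compile(r"\b[A-Z]{1,2}\d{6}\(?[0-9A]\)?")


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def redact_text(text: str, mask: str = "[REDACTED]") -> str:
    """Mask card numbers that pass the Luhn check, then HKID-like tokens."""
    text = CARD_RE.sub(
        lambda m: mask if luhn_ok(m.group()) else m.group(), text
    )
    return HKID_RE.sub(mask, text)


if __name__ == "__main__":
    print(redact_text("Card 4111 1111 1111 1111, HKID A123456(7)"))
```

The Luhn filter is a design choice to reduce false positives on long digit runs such as invoice numbers; OCR errors can still cause real card numbers to be missed, so the output should be reviewed.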

## Installation

1. Copy `pythontransform.py` to your local machine.

2. Install the required Python libraries: `opencv-python`, `PyMuPDF` (for PDF processing), `python-docx` (for DOCX processing),
and `pytesseract` (for OCR). `pytesseract` is only a wrapper, so the Tesseract OCR engine must also be installed on the system.
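
To confirm the installation before running the script, a small check like the one below can help. It is a sketch, not part of the project: the mapping from package names to import names (`cv2`, `fitz`, `docx`, `pytesseract`) reflects how these libraries are usually imported, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util
import shutil

# Package name on PyPI -> module name used in `import` statements.
REQUIRED = {
    "opencv-python": "cv2",
    "PyMuPDF": "fitz",
    "python-docx": "docx",
    "pytesseract": "pytesseract",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    for package in missing_packages():
        print(f"Missing Python package: {package}")
    # pytesseract calls the external `tesseract` executable.
    if shutil.which("tesseract") is None:
        print("Tesseract OCR engine not found on PATH")
```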

## Usage

To run the script, use the following command:

    python pythontransform.py <input_file> <output_file>

Replace `<input_file>` with the path to the input document you want to redact, and `<output_file>` with the desired path for the redacted document.

For example:

    python pythontransform.py input_document.pdf redacted_document.docx

This will redact sensitive information from the input PDF file `input_document.pdf` and save the redacted document as `redacted_document.docx`.
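
The command line above takes two positional arguments. One way such an interface could be parsed is sketched here with the standard `argparse` module; the function name `parse_args` and the help texts are assumptions, and the actual script may read its arguments differently (for example directly from `sys.argv`).

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments described in the Usage section."""
    parser = argparse.ArgumentParser(
        prog="pythontransform.py",
        description="Redact sensitive information from a document.",
    )
    parser.add_argument("input_file", help="path to the document to redact")
    parser.add_argument("output_file", help="path for the redacted document")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"Would redact {args.input_file} and write {args.output_file}")
```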

## Supported Formats

The script supports input documents in the following formats:
- PDF
- DOCX
- Images (PNG, JPEG, etc.)

The redacted document is always saved in DOCX format.
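
Handling several input formats usually starts with deciding which kind of file was given. The sketch below does this by file extension. The helper name `classify_input` and the exact set of image extensions are assumptions; the list above only names PNG and JPEG explicitly, so other image types are left out here.

```python
from pathlib import Path

# Assumed image extensions; the README lists "PNG, JPEG, etc." without
# specifying the full set.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def classify_input(path: str) -> str:
    """Return 'pdf', 'docx', or 'image' based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Unsupported input format: {suffix or '(none)'}")


if __name__ == "__main__":
    for name in ("input_document.pdf", "scan.PNG", "letter.docx"):
        print(name, "->", classify_input(name))
```

Checking the extension is simple but trusts the file name; inspecting the file contents would be more robust for misnamed files.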