# OCR Sensitive Information Redaction

This project is a Python script that redacts sensitive information from documents using Optical Character Recognition (OCR). It takes documents in various formats (PDF, DOCX, images) that contain sensitive information such as credit card numbers and Hong Kong Identity Card numbers, redacts that information, and saves the redacted document as a DOCX file.

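To make the two kinds of sensitive data concrete, here is a minimal sketch of how text recovered by OCR could be scanned for them. The regular expressions, the Luhn checksum filter, and the names `CARD_RE`, `HKID_RE`, `luhn_valid`, and `find_sensitive` are assumptions for illustration; the patterns actually used by `pythontransform.py` may differ.

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Assumed pattern: one or two capital letters, six digits, then a check
# character (0-9 or A), optionally wrapped in parentheses.
HKID_RE = re.compile(r"\b[A-Z]{1,2}\d{6}\(?[0-9A]\)?")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def find_sensitive(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for card numbers and HKID numbers."""
    hits = []
    for m in CARD_RE.finditer(text):
        # The Luhn check filters out digit runs that are not card numbers.
        if luhn_valid(m.group()):
            hits.append(("credit_card", m.group()))
    for m in HKID_RE.finditer(text):
        hits.append(("hkid", m.group()))
    return hits


sample = "Card 4111 1111 1111 1111 and HKID A123456(7)"
for kind, value in find_sensitive(sample):
    print(kind, value)
```

The sketch does not verify the HKID check character, so it may flag strings that only look like identity card numbers. For redaction, over-matching is usually the safer failure mode.
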
## Installation

1. Copy `pythontransform.py` to your local machine.
2. Install the required Python libraries, including `opencv-python`, `PyMuPDF` (for PDF processing), `python-docx` (for DOCX processing), and `pytesseract` (for OCR).

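After installing, a quick way to confirm that the libraries are available is to check whether their import modules can be found. The helper below is a hypothetical sketch and not part of `pythontransform.py`; it relies on the fact that the PyPI package names differ from the module names you import (`cv2`, `fitz`, `docx`).

```python
import importlib.util

# PyPI package name -> module name used in import statements.
REQUIRED = {
    "opencv-python": "cv2",
    "PyMuPDF": "fitz",
    "python-docx": "docx",
    "pytesseract": "pytesseract",
}


def missing_packages(required: dict[str, str] = REQUIRED) -> list[str]:
    """Return the PyPI names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required Python libraries are importable.")
```

Note that `pytesseract` is only a wrapper: the Tesseract OCR engine itself has to be installed separately and be reachable on your `PATH`.
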
## Usage

To run the script, use the following command:

    python pythontransform.py <input_file> <output_file>

Replace `<input_file>` with the path to the input document you want to redact, and `<output_file>` with the desired path for the redacted document.

For example:

    python pythontransform.py input_document.pdf redacted_document.docx

This will redact sensitive information from the input PDF file `input_document.pdf` and save the redacted document as `redacted_document.docx`.

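The two positional arguments shown above could be read with `argparse` along the following lines. This is only a sketch: how `pythontransform.py` actually parses its command line is not documented here, and the function name `parse_args` is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments described in the usage section."""
    parser = argparse.ArgumentParser(
        prog="pythontransform.py",
        description="Redact sensitive information from a document.",
    )
    parser.add_argument("input_file", help="path to the document to redact")
    parser.add_argument("output_file", help="path for the redacted document")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


# Equivalent to: python pythontransform.py input_document.pdf redacted_document.docx
args = parse_args(["input_document.pdf", "redacted_document.docx"])
print(args.input_file, "->", args.output_file)
```
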
## Supported Formats

The script supports input documents in the following formats:

- PDF
- DOCX
- Images (PNG, JPEG, etc.)

The redacted document is saved in DOCX format.

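One way to map a file path onto the formats listed above is to look at its extension. The sketch below assumes an image extension set of `.png`, `.jpg`, and `.jpeg`; the script may accept other image types, and the names `input_kind` and `check_output` are hypothetical.

```python
from pathlib import Path

# Assumed image extensions; the script may accept more or fewer.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def input_kind(path: str) -> str:
    """Classify an input path as 'pdf', 'docx', or 'image' by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    raise ValueError(f"Unsupported input format: {path!r}")


def check_output(path: str) -> None:
    """Raise ValueError unless the output path ends in .docx."""
    if Path(path).suffix.lower() != ".docx":
        raise ValueError(f"Output should be a .docx file: {path!r}")


print(input_kind("input_document.pdf"))
check_output("redacted_document.docx")
```

Checking the extension before any OCR work starts lets an unsupported file fail fast with a clear message.
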