update,
38
goodtastesmartie/task1-deal-broken/ocr/readme.txt
Normal file
@@ -0,0 +1,38 @@
# OCR Sensitive Information Redaction

This project is a Python script that uses Optical Character Recognition (OCR) to redact sensitive information from documents.
It accepts input documents in several formats (PDF, DOCX, images), detects sensitive information such as credit card numbers and Hong Kong Identity Card numbers,
and redacts it before saving the redacted document in DOCX format.

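The detection step described above can be sketched with regular expressions applied to the text that OCR returns. This is only an illustration and not necessarily how `pythontransform.py` works: the patterns `CARD_RE` and `HKID_RE`, the Luhn checksum filter, and the function names are all assumptions. The HKID pattern checks the format only (one or two letters, six digits, a bracketed check character) and does not verify the check digit.

```python
import re

# Assumed patterns for illustration; the real script may use different rules.
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# One or two letters, six digits, then a check character in parentheses.
HKID_RE = re.compile(r"\b[A-Z]{1,2}\d{6}\s?\([0-9A]\)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive_spans(text: str):
    """Return sorted (start, end, kind) tuples for each suspected match."""
    spans = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        # The Luhn check filters out most digit runs that are not card numbers.
        if luhn_valid(digits):
            spans.append((m.start(), m.end(), "credit_card"))
    for m in HKID_RE.finditer(text):
        spans.append((m.start(), m.end(), "hkid"))
    return sorted(spans)


def redact(text: str, mask: str = "*") -> str:
    """Replace every detected span with mask characters of the same length."""
    out = list(text)
    for start, end, _kind in find_sensitive_spans(text):
        out[start:end] = mask * (end - start)
    return "".join(out)
```

In an OCR pipeline the same spans would be mapped back to word bounding boxes so that the matching image regions can be blacked out; that mapping is omitted here.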
## Installation

1. Copy `pythontransform.py` to your local machine.

2. Install the required Python libraries: `opencv-python`, `PyMuPDF` (for PDF processing), `python-docx` (for DOCX processing),
and `pytesseract` (for OCR). `pytesseract` is a wrapper around the Tesseract OCR engine, so Tesseract itself must also be installed.

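To confirm that the libraries from step 2 are available before running the script, a small check like the following may help. It is a sketch, not part of the project: the helper name `missing_packages` is hypothetical, and the mapping relies on the usual import names for these packages (`cv2`, `fitz`, `docx`, `pytesseract`).

```python
import importlib.util

# pip package name -> module name used in "import ..."
REQUIRED = {
    "opencv-python": "cv2",
    "PyMuPDF": "fitz",
    "python-docx": "docx",
    "pytesseract": "pytesseract",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks the Python packages. It does not detect whether the Tesseract engine itself is installed.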
## Usage

To run the script, use the following command:

    python pythontransform.py <input_file> <output_file>

Replace `<input_file>` with the path to the document you want to redact, and `<output_file>` with the path where the redacted document should be saved.

For example:

    python pythontransform.py input_document.pdf redacted_document.docx

This redacts sensitive information from the PDF file `input_document.pdf` and saves the result as `redacted_document.docx`.

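The two positional arguments shown above could be read with `argparse` roughly as follows. This is a minimal sketch under the assumption that the script takes exactly these two arguments; the actual argument handling in `pythontransform.py` may differ (it might read `sys.argv` directly, for instance), and `parse_args` is a hypothetical helper name.

```python
import argparse


def parse_args(argv=None):
    """Parse the <input_file> and <output_file> command-line arguments."""
    parser = argparse.ArgumentParser(
        prog="pythontransform.py",
        description="Redact sensitive information from a document using OCR.",
    )
    parser.add_argument("input_file", help="path to the document to redact")
    parser.add_argument("output_file", help="path for the redacted document")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"Redacting {args.input_file} -> {args.output_file}")
```

Passing `argv` explicitly, as in `parse_args(["input_document.pdf", "redacted_document.docx"])`, makes the function easy to exercise without a real command line.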
## Supported Formats

The script supports input documents in the following formats:

- PDF
- DOCX
- Images (PNG, JPEG, etc.)

The redacted document is saved in DOCX format.

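One way to enforce the format rules above is to classify files by extension before any processing starts. The sketch below is an assumption about how this could be done, not the script's actual logic: `classify_input` and `check_output` are hypothetical names, and only the image extensions named in this README (PNG, JPEG) are listed, so other image types would need to be added explicitly.

```python
from pathlib import Path

# Extensions grouped by input kind. Only formats named in the README are
# included; the image set is an assumption and can be extended.
INPUT_KINDS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
}


def classify_input(path: str) -> str:
    """Return 'pdf', 'docx', or 'image' based on the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return INPUT_KINDS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported input format: {suffix or '(none)'}") from None


def check_output(path: str) -> None:
    """Raise ValueError unless the output path ends in .docx."""
    if Path(path).suffix.lower() != ".docx":
        raise ValueError("The redacted document must be saved as a .docx file")
```

Checking the extension is a cheap first filter. It does not verify that the file contents really match the extension.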