update,

2025-01-31 19:51:04 +08:00
parent 4c9568fd60
commit 90bb565f91
17 changed files with 372 additions and 0 deletions
--- a/goodtastesmartie/task1-deal-broken/ocr/readme.txt
+++ b/goodtastesmartie/task1-deal-broken/ocr/readme.txt
@@ -0,0 +1,38 @@
+
+# OCR Sensitive Information Redaction
+
+This project is a Python script for redacting sensitive information from documents using Optical Character Recognition (OCR).
+It accepts documents in several formats (PDF, DOCX, images), detects sensitive information such as credit card numbers and Hong Kong Identity Card numbers,
+and redacts that information before saving the redacted document to the output path you specify.
+
+## Installation
+
+1. Copy `pythontransform.py` to your local machine.
+
+2. Install the required Python libraries, including `opencv-python`, `PyMuPDF` (for PDF processing), `python-docx` (for DOCX processing),
+and `pytesseract` (for OCR; it also requires the Tesseract OCR engine to be installed on your system).
+
+## Usage
+
+To run the script, use the following command:
+
+python pythontransform.py <input_file> <output_file>
+
+Replace `<input_file>` with the path to the input document you want to redact, and `<output_file>` with the desired path for the redacted document.
+
+For example:
+
+python pythontransform.py input_document.pdf redacted_document.docx
+
+This will redact sensitive information from the input PDF file `input_document.pdf` and save the redacted document as `redacted_document.docx`.
+
+## Supported Formats
+
+The script supports input documents in the following formats:
+- PDF
+- DOCX
+- Images (PNG, JPEG, etc.)
+
+The redacted document is saved in DOCX format.
+
+