Quick Start • How It Works • API Docs • Coverage • Roadmap
Aegis-CDR is not a virus scanner; it is a Content Disarm & Reconstruction system. It deconstructs every PDF and DOCX file into atomic components, strips all active content, and rebuilds a visually identical document from the safe parts. Powered by Groq LLaMA 3.3-70B for AI-driven threat intelligence.
Aegis CDR detecting 14 threats in a malicious DOCX: Risk Score 100/CRITICAL, Groq AI analysis active
- What is CDR?
- How Aegis Works
- Architecture
- Features
- Tech Stack
- Quick Start
- Project Structure
- API Reference
- Threat Detection Coverage
- Risk Scoring Engine
- Frontend Interface
- Testing
- Roadmap
- Contributing
- Disclaimer
Content Disarm & Reconstruction (CDR) is a cybersecurity technique that goes beyond traditional antivirus. Instead of asking "Is this file malicious?", a question that fails against zero-days, CDR assumes every file is potentially dangerous and treats it accordingly.
┌─────────────────────────────────────────────────────────────┐
│                     THE CDR PHILOSOPHY                      │
├──────────────────────┬──────────────────────────────────────┤
│ Traditional AV       │ Aegis CDR                            │
├──────────────────────┼──────────────────────────────────────┤
│ File → Scan          │ File → Decompose                     │
│ "Is this bad?"       │ "What is active content?"            │
│ ALLOW or BLOCK       │ Strip ALL active content             │
│ Fails on zero-days   │ Rebuild clean from safe parts        │
│ ~99% detection       │ 100%: no active content can exist    │
└──────────────────────┴──────────────────────────────────────┘
The reconstructed document looks identical to the original (all text, images, and formatting are preserved), but every active-content element has been removed, so nothing executable remains in it.
Aegis-CDR processes every document through 4 hardened security layers:
┌─────────────────────────────────────────────────────────┐
│                   AEGIS-CDR PIPELINE                    │
└─────────────────────────────────────────────────────────┘
                UNTRUSTED FILE (PDF / DOCX)
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 1 ── INGESTION & FINGERPRINTING                   │
│                                                         │
│ • Reads first 8 bytes → true magic number detection     │
│ • Detects real MIME type, ignores file extension        │
│ • Blocks MZ (PE .exe), ELF, shell scripts in disguise   │
│ • For ZIP-based files: inspects [Content_Types].xml     │
│   to confirm genuine OOXML structure                    │
└─────────────────────────────────────────────────────────┘
                             │
                    PASS  /  🚨 BLOCKED
                             ▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 2 ── DECOMPOSITION ENGINE                         │
│                                                         │
│ PDF:  Iterates every xref object in the PDF tree        │
│       Scans dictionary keys for threat signatures       │
│ DOCX: Unzips OPC package, maps all XML parts            │
│       Reads all relationship files (.rels)              │
│ • Builds complete threat surface map                    │
│ • Records all active content locations                  │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 3 ── SANITIZATION (DISARM)                        │
│                                                         │
│ PDF:  xref_set_key() nulls dangerous dict entries       │
│       Removes /JavaScript /OpenAction /AA /Launch       │
│       /EmbeddedFile /RichMedia /Sound /Movie            │
│ DOCX: Deletes vbaProject.bin, customXml/, activeX/      │
│       Strips attachedTemplate from .rels files          │
│       Scrubs DDEAUTO, MACROBUTTON fields in XML         │
│       Neutralizes external hyperlinks → "#"             │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 4 ── RECONSTRUCTION + AI SENTRY                   │
│                                                         │
│ PDF:  Incremental save: appends only delta bytes        │
│       Output size ≈ input size (no re-encoding)         │
│ DOCX: Re-zips clean package with sanitized XML          │
│ AI:   Groq LLaMA 3.3-70B analyzes threat report         │
│       Generates natural-language security summary       │
│       Risk score 0-100 + visual integrity check         │
│                                                         │
│ CLEAN FILE + JSON THREAT INTELLIGENCE REPORT            │
└─────────────────────────────────────────────────────────┘
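The Layer 1 checks can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the project's `SafeTypeValidator`: the function names `sniff_true_type` and `is_blocked`, the signature table, and the ELF and shell-script MIME strings are all assumptions made for the example.

```python
import io
import zipfile

# Illustrative signature table (an assumption, not the project's actual list).
BLOCKED_SIGNATURES = {
    b"MZ": "application/x-msdownload",  # Windows PE executable
    b"\x7fELF": "application/x-elf",    # ELF binary (MIME string is illustrative)
    b"#!": "text/x-shellscript",        # shell script shebang
}

DOCX_MIME = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)


def sniff_true_type(data: bytes) -> str:
    """Guess a MIME type from file content only; the filename is never used."""
    head = data[:8]
    for magic, mime in BLOCKED_SIGNATURES.items():
        if head.startswith(magic):
            return mime
    if head.startswith(b"%PDF-"):
        return "application/pdf"
    if head.startswith(b"PK\x03\x04"):
        # ZIP container: accept it as DOCX only if the OOXML manifest says so.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                manifest = zf.read("[Content_Types].xml").decode("utf-8", "replace")
        except (zipfile.BadZipFile, KeyError):
            return "application/zip"
        if "wordprocessingml" in manifest:
            return DOCX_MIME
        return "application/zip"
    return "application/octet-stream"


def is_blocked(data: bytes) -> bool:
    """True when the content matches one of the blocked executable signatures."""
    return sniff_true_type(data) in BLOCKED_SIGNATURES.values()
```

Because the decision is made from the bytes alone, a PE file renamed to `.pdf` is reported as `application/x-msdownload` regardless of its extension.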
Aegis uses a 3-tier fallback strategy for PDF processing:
| Mode | When Used | Output Quality | Size Impact |
|---|---|---|---|
| Scrub + Incremental Save | Default: all clean PDFs | Perfect fidelity | ≈ Same as input |
| Full Reconstruction | When scrub mode fails | High fidelity | +10–15% |
| Pixel-Only Fallback | When risk score ≥ 75 or reconstruction fails | Rasterized (not searchable) | Variable |
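One way to read the table is as a chain of attempts, cheapest first. The sketch below assumes each mode is available as a callable that takes and returns PDF bytes; `sanitize_with_fallback` and its parameters are hypothetical names, and only the threshold of 75 comes from the table.

```python
from typing import Callable

PIXEL_FALLBACK_THRESHOLD = 75  # from the table: pixel-only when risk score >= 75

Step = Callable[[bytes], bytes]


def sanitize_with_fallback(
    data: bytes,
    risk_score: int,
    scrub: Step,
    rebuild: Step,
    rasterize: Step,
) -> tuple[str, bytes]:
    """Return (mode_used, sanitized_bytes), falling back toward rasterization."""
    # High-risk documents skip straight to the pixel-only mode.
    if risk_score >= PIXEL_FALLBACK_THRESHOLD:
        return "pixel_only", rasterize(data)
    for mode, step in (("scrub", scrub), ("reconstruct", rebuild)):
        try:
            return mode, step(data)
        except Exception:
            continue  # this tier failed; try the next one
    # Both structural modes failed, so rasterize as a last resort.
    return "pixel_only", rasterize(data)
```

Returning the mode name alongside the bytes lets the caller report which tier produced the output, which matters because the pixel-only result is no longer searchable.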
aegis-cdr/
├── api/
│   └── main.py               # FastAPI REST API + static frontend serving
│
├── core/
│   ├── pdf/
│   │   └── sanitizer.py      # 4-mode PDF CDR engine
│   ├── docx/
│   │   └── sanitizer.py      # DOCX ZIP/XML surgical disarming
│   └── ai/
│       └── sentry.py         # Groq LLM threat intelligence layer
│
├── utils/
│   └── validator.py          # SafeTypeValidator: magic byte fingerprinting
│
├── rules/
│   └── aegis_rules.yar       # 13 YARA detection patterns
│
├── static/
│   └── index.html            # Complete frontend: zero npm, zero dependencies
│
├── create_test_files.py      # Generates malicious test files for validation
├── aegis_standalone.py       # CLI: test without starting API
├── .env.example              # Environment configuration template
└── requirements.txt          # Python dependencies
Security Engine
| Feature | Detail |
|---|---|
| Magic Byte Validation | Reads true binary signature; extension is irrelevant |
| Extension Spoof Detection | Catches .exe renamed to .pdf, PE headers in .docx |
| PDF JavaScript Removal | Strips /JavaScript, /JS from all xref objects |
| OpenAction Disarming | Removes auto-execute triggers from document catalog |
| Additional Actions (AA) | Strips page-level and field-level event handlers |
| Launch Action Blocking | Removes shell command execution annotations |
| EmbeddedFile Extraction | Detects and removes attached file payloads |
| Rich Media Stripping | Removes Flash/video embedding (historic exploit vector) |
| Widget Annotation Removal | Strips interactive form fields with action triggers |
| PostScript XObject Detection | Flags /XObject + /PS, used in CVE exploitation |
| VBA Macro Removal | Deletes vbaProject.bin (verified by OLE2 magic bytes) |
| Remote Template Blocking | Strips attachedTemplate external DOTM injection |
| DDE Field Stripping | Removes DDEAUTO, DDE, MACROBUTTON fields |
| OLE Object Blocking | Removes embedded OLE2 executable objects |
| ActiveX Removal | Deletes word/activeX/ directory entirely |
| External Link Neutralization | Replaces tracking/phishing URLs with # |
| Custom XML Blocking | Removes customXml/ parts (data injection vector) |
| Pixel-Only Fallback | Emergency rasterization: output is pure images, zero attack surface |
AI Intelligence Layer
| Feature | Detail |
|---|---|
| Groq LLaMA 3.3-70B | State-of-the-art LLM for threat narrative generation |
| Natural Language Reports | Plain-English explanation of every threat found |
| Contextual Risk Reasoning | AI understands why each threat is dangerous |
| Risk Score 0–100 | Weighted cumulative scoring across all threat types |
| 5 Risk Levels | CLEAN / LOW / MEDIUM / HIGH / CRITICAL |
| Threat Categorization | Groups into: Scripts, Macros, Links, Embedded Objects, Auto-Execute |
| Visual Integrity Check | Compares original vs clean page count |
| Model Selection | Configurable: llama-3.3-70b / llama-3.1-8b / mixtral-8x7b |
| Rule-Based Fallback | Fully deterministic scoring; works with no API key |
Frontend Interface
| Feature | Detail |
|---|---|
| Zero Dependencies | Single HTML file: no npm, no Node.js, no build step |
| Drag & Drop Upload | Drop PDF or DOCX directly, with animated feedback |
| Processing Animation | 5-step pipeline visualization while scanning |
| Animated Risk Gauge | SVG ring meter animates from 0 to threat score |
| Color-Coded Risk Level | Green → Cyan → Gold → Orange → Red based on score |
| Threat Breakdown Bars | Animated category bars with per-category counts |
| Threat Item List | Individual threat descriptions for ≤15 threats |
| Stats Dashboard | Threats found, original size, clean size, processing time, pages |
| Groq AI Report | Full natural-language analysis with model attribution |
| One-Click Download | Download sanitized file directly from results |
| Cyberpunk Aesthetic | Dark theme, grid background, scan-line animation, glowing accents |
- Python 3.10+
- A Groq API key (get one free at console.groq.com)
- Git
# Clone the repository
git clone https://github.com/Nauman123-coder/aegis-cdr.git
cd aegis-cdr
# Create virtual environment
python -m venv .venv
# Activate (Windows Git Bash)
source .venv/Scripts/activate
# Activate (Linux / macOS)
source .venv/bin/activate
# Install all dependencies
pip install -r requirements.txt
# Copy the environment template
cp .env.example .env
Open .env and add your key:
# Required
GROQ_API_KEY=gsk_your_groq_api_key_here
# Optional: choose your model
GROQ_MODEL=llama-3.3-70b-versatile
Model Options:
| Model | Speed | Quality | Best For |
|---|---|---|---|
| llama-3.1-8b-instant | ⚡⚡⚡ | ★★★ | High-volume scanning |
| llama-3.3-70b-versatile | ⚡⚡ | ★★★★★ | Recommended |
| mixtral-8x7b-32768 | ⚡⚡ | ★★★★ | Long documents |
uvicorn api.main:app --reload
That's it. Open http://localhost:8000 and the full drag-and-drop UI loads.
No npm. No Node.js. No second terminal. No build step.
aegis-cdr/
│
├── api/
│   └── main.py                 ← FastAPI application
│         POST /api/sanitize          → main CDR endpoint
│         GET /api/download/{token}   → file download
│         GET /api/health             → Groq status check
│         GET /                       → serves the frontend
│
├── core/
│   ├── pdf/
│   │   └── sanitizer.py        ← PDF CDR Engine
│   │         _scan_for_threats()     → xref/annotation scanner
│   │         _scrub_inplace()        → surgical key removal
│   │         _pixel_only_fallback()  → emergency rasterizer
│   │
│   ├── docx/
│   │   └── sanitizer.py        ← DOCX CDR Engine
│   │         Unzips OPC package
│   │         Strips .rels, deletes vbaProject.bin
│   │         Scrubs document.xml fields
│   │         Re-zips clean package
│   │
│   └── ai/
│       └── sentry.py           ← AI Sentry (Groq Integration)
│             summarize()           → Groq LLM narrative
│             risk_score()          → weighted 0-100 scoring
│             categorize_threats()  → grouping engine
│
├── utils/
│   └── validator.py            ← SafeTypeValidator
│         detect_true_type()             → magic byte detection
│         validate_extension_matches()   → spoof check
│
├── rules/
│   └── aegis_rules.yar         ← 13 YARA Detection Rules
│         PDF_Embedded_JavaScript
│         PDF_OpenAction_AutoLaunch
│         PDF_Heap_Spray_Pattern
│         DOCX_VBA_Macro_Present
│         DOCX_External_Template_Injection (T1221)
│         DOCX_DDE_Injection (T1559.002)
│         Generic_Suspicious_PowerShell
│         Generic_Base64_Shellcode
│         + 5 more
│
├── static/
│   └── index.html              ← Complete Frontend (620 lines, zero deps)
│         Drag-drop upload zone
│         5-step processing animation
│         Animated SVG risk gauge
│         Threat breakdown bar charts
│         Groq AI analysis panel
│         Download button
│
├── create_test_files.py        ← Test File Generator
│         malicious_test.pdf    (5 threat types)
│         malicious_test.docx   (8 threat types)
│         spoofed_exe.pdf       (MZ magic in .pdf)
│
├── aegis_standalone.py         ← CLI Interface
├── .env.example                ← Environment template
└── requirements.txt            ← Python dependencies
POST /api/sanitize: upload a PDF or DOCX and receive a full threat intelligence report and a download token.
Request
Content-Type: multipart/form-data
Body: file=<binary>
Response
{
"status": "sanitized",
"original_filename": "invoice.pdf",
"sanitized_filename": "SAFE_invoice.pdf",
"true_mime_type": "application/pdf",
"file_size_original": 816384,
"file_size_sanitized": 798720,
"processing_time_ms": 1247,
"page_count_original": 15,
"page_count_sanitized": 15,
"items_removed_count": 13,
"threat_categories": [
{
"name": "Scripts & JavaScript",
"items": [
"Document Catalog xref 1: /JavaScript detected and stripped",
"Threat in xref 1: /JS",
"Threat in xref 3: /JavaScript"
],
"icon": "⚡"
},
{
"name": "Auto-Execute Actions",
"items": [
"Document Catalog xref 1: /OpenAction detected and stripped",
"Threat in xref 1: /AA"
],
"icon": "π"
}
],
"risk": {
"score": 100,
"level": "CRITICAL",
"color": "#ff1a1a",
"rationale": "Embedded JavaScript; Auto-execute on open; Shell launch command"
},
"ai_summary": "A thorough analysis of invoice.pdf revealed critical threats including JavaScript and OpenAction exploits, which could have allowed arbitrary code execution and unauthorized system access if not neutralized. The removal of 13 malicious items has mitigated the risk of these exploits being used to compromise system security. The document is now safe for use, with all identified threats stripped and visual integrity confirmed at 15 pages.",
"groq_powered": true,
"fallback_used": false,
"download_token": "aegis_1709123456_SAFE_invoice.pdf"
}
Error Responses
| Status | Error Code | Description |
|---|---|---|
| 415 | FILE_BLOCKED | Magic bytes indicate dangerous file type |
| 415 | UNSUPPORTED_TYPE | Not PDF or DOCX |
| 500 | SANITIZATION_FAILED | Internal processing error |
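A client usually needs only a few fields from the sanitize response. The sketch below shows one way to reduce the JSON report to a download URL and a short summary; `summarize_report` is a hypothetical helper, and the base URL assumes the default local `uvicorn` address used in the Quick Start.

```python
import json

BASE_URL = "http://localhost:8000"  # assumed default address of the local server


def summarize_report(raw_json: str, base_url: str = BASE_URL) -> dict:
    """Extract the fields a client typically acts on from a sanitize response."""
    report = json.loads(raw_json)
    # Flatten the per-category item lists into one list of threat descriptions.
    threats = [
        item
        for category in report.get("threat_categories", [])
        for item in category.get("items", [])
    ]
    return {
        "download_url": f"{base_url}/api/download/{report['download_token']}",
        "risk": f"{report['risk']['level']} ({report['risk']['score']})",
        "threat_items": len(threats),
        # Mirrors the visual integrity check: page counts should match.
        "pages_match": report["page_count_original"] == report["page_count_sanitized"],
    }
```

A caller would pass the response body as text and then fetch `download_url` to retrieve the sanitized file.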
GET /api/download/{token}: download the sanitized file by token.
curl http://localhost:8000/api/download/aegis_1709123456_SAFE_invoice.pdf \
--output SAFE_invoice.pdf
GET /api/health: check server status and Groq configuration.
{
"status": "operational",
"version": "2.0.0",
"groq": {
"configured": true,
"model": "llama-3.3-70b-versatile"
},
"ui": "http://localhost:8000",
"supported_formats": ["PDF", "DOCX"]
}
| Threat | PDF Key | Points | Impact |
|---|---|---|---|
| Embedded JavaScript | /JavaScript, /JS | +40 | Arbitrary code execution on open |
| Auto-Execute Action | /OpenAction | +30 | Triggers immediately when PDF opens |
| Additional Actions | /AA | +25 | Page/annotation/field event triggers |
| Shell Launch | /Launch | +50 | Spawns external process (cmd.exe, bash) |
| Embedded File | /EmbeddedFile | +20 | Attached payload (exe, dll, bat) |
| Rich Media | /RichMedia | +30 | Flash/video execution context |
| Form Widget | /Widget | +20 | Interactive field with action trigger |
| PostScript XObject | /XObject + /PS | +35 | PostScript code injection |
| External URI | /URI | +10 | Tracking pixel / SSRF / phishing |
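To get a feel for what these keys look like in a file, here is a deliberately crude byte-level scan. It is not how the engine works (the pipeline walks xref objects), and it misses names inside compressed object streams or written with `#xx` hex escapes. The helper name `find_pdf_threat_keys` is hypothetical, and the combined `/XObject` + `/PS` rule is left out because it needs object context.

```python
import re

# Key names taken from the PDF threat table; /XObject + /PS is omitted here.
PDF_THREAT_KEYS = (
    "/JavaScript", "/JS", "/OpenAction", "/AA", "/Launch",
    "/EmbeddedFile", "/RichMedia", "/Widget", "/URI",
)


def find_pdf_threat_keys(raw: bytes) -> dict[str, int]:
    """Count occurrences of each threat key in raw, uncompressed PDF bytes."""
    counts: dict[str, int] = {}
    for key in PDF_THREAT_KEYS:
        # A PDF name ends at whitespace or a delimiter, so the lookahead keeps
        # /JS from matching inside a longer name such as /JSFoo.
        pattern = re.escape(key.encode("ascii")) + rb"(?=[\s/<>\[\]()]|$)"
        hits = len(re.findall(pattern, raw))
        if hits:
            counts[key] = hits
    return counts
```

The result maps each key found to its number of occurrences, which is enough to cross-check a threat report by eye on simple test files.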
| Threat | Location | Points | Impact |
|---|---|---|---|
| VBA Macros | vbaProject.bin | +45 | AutoOpen/AutoExec code execution |
| Remote Template | attachedTemplate rel | +30 | Loads macro payload from remote URL |
| DDE Field | DDEAUTO in instrText | +35 | Dynamic Data Exchange cmd.exe execution |
| Macro Button | MACROBUTTON field | +40 | Click-triggered macro execution |
| OLE Object | word/embeddings/ | +35 | Embedded executable object |
| ActiveX Control | word/activeX/ | +40 | Script-executable browser control |
| External Hyperlink | word/_rels/ | +10 | Tracking/phishing/SSRF link |
| Custom XML | customXml/ | +10 | Schema-based data injection |
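The part-level removals in this table can be illustrated with the standard `zipfile` module: copy the package and skip the dangerous parts. This is a minimal sketch under that assumption, not the project's sanitizer. `strip_docx_parts` is a hypothetical name, and a complete implementation would also clean `[Content_Types].xml`, the `.rels` files, and the field codes inside `document.xml`, which this example does not touch.

```python
import io
import zipfile

# Locations taken from the DOCX threat table.
BLOCKED_PREFIXES = ("customXml/", "word/activeX/", "word/embeddings/")
BLOCKED_NAMES = ("vbaProject.bin",)


def strip_docx_parts(data: bytes) -> tuple[bytes, list[str]]:
    """Copy a DOCX package, skipping parts that carry active content."""
    removed: list[str] = []
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            name = info.filename
            if name.startswith(BLOCKED_PREFIXES) or name.endswith(BLOCKED_NAMES):
                removed.append(name)  # record what was dropped for the report
                continue
            dst.writestr(info, src.read(name))
    return out.getvalue(), removed
```

Writing a fresh archive rather than editing in place means nothing from the original container survives unless it was explicitly copied.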
- PDF_Embedded_JavaScript: /JS and /JavaScript in PDF streams
- PDF_OpenAction_AutoLaunch: /OpenAction trigger detection
- PDF_Heap_Spray_Pattern: large repeated NOP sled patterns
- PDF_Suspicious_URI: encoded/obfuscated URI actions
- DOCX_VBA_Macro_Present: OLE2 vbaProject.bin signature
- DOCX_External_Template_Injection: MITRE ATT&CK T1221
- DOCX_DDE_Injection: MITRE ATT&CK T1559.002
- DOCX_Macro_Auto_Execute: AutoOpen/AutoExec triggers
- Generic_Suspicious_PowerShell: encoded PowerShell download cradles
- Generic_Base64_Shellcode: Base64-encoded executable payloads
- Generic_URL_Obfuscation: hex/percent-encoded malicious URLs
- Generic_PE_In_Document: MZ magic bytes inside document stream
- Generic_OLE_Embedded: OLE2 compound document signature
Aegis computes a cumulative risk score based on all threats found:
Score = Σ(threat_points), capped at 100
| Score | Level | Color | Indicator |
|---|---|---|---|
| 0 | ✅ CLEAN | #00ff9d | No active content |
| 1–19 | 🔵 LOW | #00c9ff | Tracking links or custom XML only |
| 20–39 | 🟡 MEDIUM | #ffd700 | Embedded files or form widgets |
| 40–69 | 🟠 HIGH | #ff6b35 | VBA macros, DDE injection, OLE objects |
| 70–100 | 🔴 CRITICAL | #ff1a1a | JavaScript, LaunchAction, or pixel fallback |
Example Scoring:
Document with VBA macro (+45) + DDE injection (+35) + 2 hyperlinks (+20) = 100 → CRITICAL
Document with 3 tracking hyperlinks only (+30) = 30 → MEDIUM
Clean research paper = 0 → CLEAN ✅
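The scoring rule and level bands are simple enough to express directly. The sketch below reproduces the three worked examples; `capped_score` and `risk_level` are hypothetical names, not the functions in `sentry.py`.

```python
# Lower bound of each band, highest first (taken from the risk level table).
RISK_BANDS = [
    (70, "CRITICAL"),
    (40, "HIGH"),
    (20, "MEDIUM"),
    (1, "LOW"),
    (0, "CLEAN"),
]


def capped_score(points: list[int]) -> int:
    """Sum the per-threat points and cap the total at 100."""
    return min(100, sum(points))


def risk_level(score: int) -> str:
    """Map a 0-100 score to its risk level."""
    for floor, level in RISK_BANDS:
        if score >= floor:
            return level
    raise ValueError(f"score must be non-negative, got {score}")


# The worked examples from the text:
macro_doc = capped_score([45, 35, 10, 10])  # VBA + DDE + 2 hyperlinks
links_doc = capped_score([10, 10, 10])      # 3 tracking hyperlinks
clean_doc = capped_score([])                # no threats found
print(macro_doc, risk_level(macro_doc))
print(links_doc, risk_level(links_doc))
print(clean_doc, risk_level(clean_doc))
```

Because the total is capped, any combination of threats summing past 100 reports the same score, so the per-category breakdown remains the place to see what was actually found.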
The entire frontend is a single self-contained HTML file (static/index.html) served directly by FastAPI. No npm, no Node.js, no build tools required.
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  DROP ZONE   │───▶│  PROCESSING  │───▶│   RESULTS    │───▶│   DOWNLOAD   │
│              │    │              │    │              │    │              │
│ Drag & Drop  │    │ Step-by-step │    │ Risk Gauge   │    │ SAFE_*.pdf   │
│ or Browse    │    │ animation    │    │ Threat Bars  │    │ or .docx     │
│              │    │ 5 stages     │    │ AI Report    │    │              │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
- Fingerprinting magic bytes
- Decomposing component tree
- Disarming active content
- Reconstructing clean document
- Groq AI Sentry analysis
Background: #060a10 (near-black)
Surface: #0c1220 / #111827
Accent: #00d4ff (cyan)
Success: #00ff9d (green)
Danger: #ff3c3c (red)
Warning: #ffd700 (gold)
Font: Share Tech Mono + Rajdhani (Google Fonts)
Effects: Animated grid, corner glows, scan-line, SVG ring meter
python create_test_files.py
Produces test_files/ with three validation files:
malicious_test.pdf: hand-crafted PDF with 5 real threat structures:
- `/OpenAction` with `/JavaScript` in catalog (auto-runs on open)
- `/AA` Additional Actions on catalog and page
- `/Launch` annotation pointing to `cmd.exe /c calc.exe`
- `/EmbeddedFile` attachment (`malware_payload.exe`)
- `/URI` external tracking link

→ Expected: Risk CRITICAL (100) | Threats: 13 | Groq names JS/OpenAction/Launch
malicious_test.docx: full-threat DOCX with 8 attack vectors:
- `vbaProject.bin` with real OLE2 magic bytes + `AutoOpen` + `Shell` macro
- External template injection via `attachedTemplate`
- `DDEAUTO` field with `cmd.exe` + `PowerShell -EncodedCommand`
- `MACROBUTTON` field
- 2× external hyperlinks (phishing + tracking pixel)
- ActiveX control with CLSID
- 3× `customXml` parts with base64 payload

→ Expected: Risk CRITICAL (100) | Threats: 14 | All 5 categories populated
spoofed_exe.pdf: Windows PE executable with .pdf extension:
- MZ magic bytes: `4D 5A 90 00 03 00...`
- Claimed extension: `.pdf`
- True type: `application/x-msdownload`

→ Expected: 🚨 BLOCKED at Layer 1, never reaches sanitization
# Sanitize a single file
python aegis_standalone.py --file document.pdf
# Enable pixel fallback for high-risk files
python aegis_standalone.py --file risky.pdf --pixel-fallback
# Run built-in demo (creates and sanitizes test files automatically)
python aegis_standalone.py --demo
# API health check
curl http://localhost:8000/api/health
- PDF CDR engine with incremental save
- DOCX CDR engine with full XML sanitization
- Groq LLM threat intelligence
- Magic byte fingerprinting
- Zero-dependency frontend
- Risk scoring engine
- YARA rule integration
- XLSX support: spreadsheet CDR (macro-enabled workbooks)
- PPTX support: presentation CDR
- RTF support: Rich Text Format
- Email pipeline: `.eml`/`.msg` attachment scanning
- Batch API: async queue for multiple files
- Docker container: one-command deployment
- Webhook callbacks: POST results to external URL
- Audit log: persistent scan history with search
- Password-protected PDF: decrypt before sanitize
- Report export: PDF/HTML threat report download
Contributions are warmly welcome!
# Fork and clone
git clone https://github.com/YOUR_USERNAME/aegis-cdr.git
cd aegis-cdr
# Set up development environment
python -m venv .venv
source .venv/Scripts/activate # Windows
source .venv/bin/activate # Linux/macOS
pip install -r requirements.txt
cp .env.example .env # Add your GROQ_API_KEY
# Create a feature branch
git checkout -b feature/your-feature-name
# Make changes, then commit
git add .
git commit -m "feat: add XLSX support"
# Push and open PR
git push origin feature/your-feature-name
| Area | Description |
|---|---|
| New file formats | Add XLSX, PPTX, RTF, EML engines |
| Detection rules | New YARA rules, threat signatures |
| AI improvements | Better prompts, structured Groq output |
| Frontend | UI improvements, dark/light theme |
| Deployment | Docker, CI/CD, cloud deployment guides |
| Documentation | Examples, tutorials, threat research |
Aegis-CDR is a security research and document sanitization tool.
- The test files in `create_test_files.py` contain simulated threat structures: no working exploits or actual malware
- Always run in an isolated environment when processing real-world untrusted files
- Aegis-CDR is a defence tool; do not use it to craft malicious documents
- The pixel fallback mode produces rasterized output, so text will not be searchable or copyable
This project is licensed under the MIT License; see the LICENSE file for details.
Built with 🛡️ by Nauman Ahmad
Detect. Neutralize. Reconstruct.
⭐ Star this repo if Aegis-CDR helped secure your documents!