Nauman123-coder/aegis-cdr


🚀 Quick Start • ⚙️ How It Works • 📡 API Docs • 🎯 Coverage • 🗺️ Roadmap


🛡️ Live Demo


Aegis-CDR is not a virus scanner; it is a Content Disarm & Reconstruction system. It breaks every PDF and DOCX file into its component parts, strips the active content it finds, and rebuilds a document that preserves the original's appearance. Groq LLaMA 3.3-70B supplies the AI-driven threat analysis.


Screenshot: Aegis CDR detecting 14 threats in malicious_test.docx (Risk Score 100, CRITICAL) with Groq AI analysis active.

🔍 What is CDR?

Content Disarm & Reconstruction (CDR) is a cybersecurity technique that takes a different approach from traditional antivirus. Instead of asking "Is this file malicious?", a question that detection-based scanners cannot answer reliably for zero-days, CDR assumes every file is potentially dangerous and treats it accordingly.

┌──────────────────────────────────────────────────────────────┐
│                    THE CDR PHILOSOPHY                        │
├──────────────────────┬───────────────────────────────────────┤
│  Traditional AV      │  Aegis CDR                            │
├──────────────────────┼───────────────────────────────────────┤
│  File → Scan         │  File → Decompose                     │
│  "Is this bad?"      │  "What is active content?"            │
│  ALLOW or BLOCK      │  Strip ALL active content             │
│  Fails on zero-days  │  Rebuild clean from safe parts        │
│  ~99% detection      │  100%: no active content can exist    │
└──────────────────────┴───────────────────────────────────────┘

The reconstructed document looks identical to the original, with all text, images, and formatting preserved, but the structures that carry executable content have been removed.


⚙️ How Aegis Works

Aegis-CDR processes every document through 4 hardened security layers:

  ╔══════════════════════════════════════════════════════════╗
  ║                   AEGIS-CDR PIPELINE                     ║
  ╚══════════════════════════════════════════════════════════╝

  📥  UNTRUSTED FILE (PDF / DOCX)
       │
       ▼
  ╔═══════════════════════════════════════════════════════╗
  ║  LAYER 1 ── INGESTION & FINGERPRINTING                ║
  ║                                                       ║
  ║  • Reads first 8 bytes: true magic number detection   ║
  ║  • Detects real MIME type, ignores file extension     ║
  ║  • Blocks MZ (PE .exe), ELF, shell scripts in         ║
  ║    disguise                                           ║
  ║  • For ZIP-based files: inspects [Content_Types].xml  ║
  ║    to confirm genuine OOXML structure                 ║
  ╚═══════════════════════════════════════════════════════╝
       │  ✅ PASS / 🚨 BLOCKED
       ▼
  ╔═══════════════════════════════════════════════════════╗
  ║  LAYER 2 ── DECOMPOSITION ENGINE                      ║
  ║                                                       ║
  ║  PDF:  Iterates every xref object in the PDF tree     ║
  ║        Scans dictionary keys for threat signatures    ║
  ║  DOCX: Unzips OPC package, maps all XML parts         ║
  ║        Reads all relationship files (.rels)           ║
  ║  • Builds complete threat surface map                 ║
  ║  • Records all active content locations               ║
  ╚═══════════════════════════════════════════════════════╝
       │
       ▼
  ╔═══════════════════════════════════════════════════════╗
  ║  LAYER 3 ── SANITIZATION (DISARM)                     ║
  ║                                                       ║
  ║  PDF:  xref_set_key() nulls dangerous dict entries    ║
  ║        Removes /JavaScript /OpenAction /AA /Launch    ║
  ║        /EmbeddedFile /RichMedia /Sound /Movie         ║
  ║  DOCX: Deletes vbaProject.bin, customXml/, activeX/   ║
  ║        Strips attachedTemplate from .rels files       ║
  ║        Scrubs DDEAUTO, MACROBUTTON fields in XML      ║
  ║        Neutralizes external hyperlinks → "#"          ║
  ╚═══════════════════════════════════════════════════════╝
       │
       ▼
  ╔═══════════════════════════════════════════════════════╗
  ║  LAYER 4 ── RECONSTRUCTION + AI SENTRY                ║
  ║                                                       ║
  ║  PDF:  Incremental save: appends only delta bytes     ║
  ║        Output size ≈ input size (no re-encoding)      ║
  ║  DOCX: Re-zips clean package with sanitized XML       ║
  ║  AI:   Groq LLaMA 3.3-70B analyzes threat report      ║
  ║        Generates natural-language security summary    ║
  ║        Risk score 0-100 + visual integrity check      ║
  ║  📤 CLEAN FILE + JSON THREAT INTELLIGENCE REPORT      ║
  ╚═══════════════════════════════════════════════════════╝
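
The Layer 1 checks can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in utils/validator.py: the function name `sniff_type`, the MIME labels for blocked types, and the choice to accept a ZIP only when both `[Content_Types].xml` and `word/document.xml` are present are assumptions made for the example.

```python
import io
import zipfile

# Signatures rejected outright. The MIME labels are illustrative assumptions.
BLOCKED_MAGIC = {
    b"MZ": "application/x-msdownload",  # Windows PE executable
    b"\x7fELF": "application/x-elf",    # Linux ELF binary
    b"#!": "text/x-shellscript",        # script with a shebang line
}
DOCX_MIME = (
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document"
)


def sniff_type(data: bytes) -> str:
    """Return the MIME type implied by the content, ignoring any file name."""
    head = data[:8]
    for magic, mime in BLOCKED_MAGIC.items():
        if head.startswith(magic):
            raise ValueError(f"blocked file type: {mime}")
    if head.startswith(b"%PDF-"):
        return "application/pdf"
    if head.startswith(b"PK\x03\x04"):
        # A ZIP container is only accepted if it has the OOXML layout.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = set(zf.namelist())
        except zipfile.BadZipFile:
            raise ValueError("corrupt ZIP container") from None
        if {"[Content_Types].xml", "word/document.xml"} <= names:
            return DOCX_MIME
    raise ValueError("unsupported file type")
```

Because the decision depends only on the bytes, a PE executable renamed to .pdf is rejected before any parser touches it.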

🛡️ PDF Sanitization Modes

Aegis uses a 3-tier fallback strategy for PDF processing:

| Mode | When Used | Output Quality | Size Impact |
|---|---|---|---|
| Scrub + Incremental Save | Default: all clean PDFs | Perfect fidelity | ≈ Same as input |
| Full Reconstruction | When scrub mode fails | High fidelity | +10–15% |
| Pixel-Only Fallback | When risk score ≥ 75 or reconstruction fails | Rasterized (not searchable) | Variable |
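
One way to read the fallback order is as a chain of attempts. The sketch below is an assumption-laden illustration rather than the logic in core/pdf/sanitizer.py: the function name, the three callables passed in, and the decision to check the risk threshold before trying the scrub are all hypothetical.

```python
from typing import Callable, Tuple

Sanitizer = Callable[[bytes], bytes]


def sanitize_with_fallback(
    data: bytes,
    risk_score: int,
    scrub: Sanitizer,
    reconstruct: Sanitizer,
    rasterize: Sanitizer,
    pixel_threshold: int = 75,
) -> Tuple[str, bytes]:
    """Return (mode_used, sanitized_bytes) following the table above."""
    # High-risk files skip straight to rasterization (assumed ordering).
    if risk_score >= pixel_threshold:
        return "pixel_only", rasterize(data)
    try:
        return "scrub", scrub(data)
    except Exception:
        pass  # scrub failed: fall through to full reconstruction
    try:
        return "reconstruct", reconstruct(data)
    except Exception:
        # Last resort: images only, so nothing active can survive.
        return "pixel_only", rasterize(data)
```

Returning the mode alongside the bytes lets the caller report which tier produced the output, which matters because the pixel-only result is no longer searchable.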

🏗️ Architecture

aegis-cdr/
├── 🔌 api/
│   └── main.py                 # FastAPI REST API + static frontend serving
│
├── 🧠 core/
│   ├── pdf/
│   │   └── sanitizer.py        # 4-mode PDF CDR engine
│   ├── docx/
│   │   └── sanitizer.py        # DOCX ZIP/XML surgical disarming
│   └── ai/
│       └── sentry.py           # Groq LLM threat intelligence layer
│
├── 🛠️ utils/
│   └── validator.py            # SafeTypeValidator: magic byte fingerprinting
│
├── 📋 rules/
│   └── aegis_rules.yar         # 13 YARA detection patterns
│
├── 🌐 static/
│   └── index.html              # Complete frontend: zero npm, zero dependencies
│
├── 🧪 create_test_files.py     # Generates malicious test files for validation
├── ⚡ aegis_standalone.py      # CLI: test without starting API
├── 📄 .env.example             # Environment configuration template
└── 📦 requirements.txt         # Python dependencies

✨ Features

🔒 Security Engine

| Feature | Detail |
|---|---|
| Magic Byte Validation | Reads true binary signature; extension is irrelevant |
| Extension Spoof Detection | Catches .exe renamed to .pdf, PE headers in .docx |
| PDF JavaScript Removal | Strips /JavaScript, /JS from all xref objects |
| OpenAction Disarming | Removes auto-execute triggers from document catalog |
| Additional Actions (AA) | Strips page-level and field-level event handlers |
| Launch Action Blocking | Removes shell command execution annotations |
| EmbeddedFile Extraction | Detects and removes attached file payloads |
| Rich Media Stripping | Removes Flash/video embedding (historic exploit vector) |
| Widget Annotation Removal | Strips interactive form fields with action triggers |
| PostScript XObject Detection | Flags /XObject + /PS, used in CVE exploitation |
| VBA Macro Removal | Deletes vbaProject.bin (verified by OLE2 magic bytes) |
| Remote Template Blocking | Strips attachedTemplate external DOTM injection |
| DDE Field Stripping | Removes DDEAUTO, DDE, MACROBUTTON fields |
| OLE Object Blocking | Removes embedded OLE2 executable objects |
| ActiveX Removal | Deletes word/activeX/ directory entirely |
| External Link Neutralization | Replaces tracking/phishing URLs with # |
| Custom XML Blocking | Removes customXml/ parts (data injection vector) |
| Pixel-Only Fallback | Emergency rasterization: output is pure images, zero attack surface |

🤖 AI Intelligence Layer

| Feature | Detail |
|---|---|
| Groq LLaMA 3.3-70B | State-of-the-art LLM for threat narrative generation |
| Natural Language Reports | Plain-English explanation of every threat found |
| Contextual Risk Reasoning | AI understands why each threat is dangerous |
| Risk Score 0–100 | Weighted cumulative scoring across all threat types |
| 5 Risk Levels | CLEAN / LOW / MEDIUM / HIGH / CRITICAL |
| Threat Categorization | Groups into: Scripts, Macros, Links, Embedded Objects, Auto-Execute |
| Visual Integrity Check | Compares original vs clean page count |
| Model Selection | Configurable: llama-3.3-70b / llama-3.1-8b / mixtral-8x7b |
| Rule-Based Fallback | Fully deterministic scoring; works with no API key |

🖥️ Frontend Interface

| Feature | Detail |
|---|---|
| Zero Dependencies | Single HTML file: no npm, no Node.js, no build step |
| Drag & Drop Upload | Drop PDF or DOCX directly, with animated feedback |
| Processing Animation | 5-step pipeline visualization while scanning |
| Animated Risk Gauge | SVG ring meter animates from 0 to threat score |
| Color-Coded Risk Level | Green → Cyan → Gold → Orange → Red based on score |
| Threat Breakdown Bars | Animated category bars with per-category counts |
| Threat Item List | Individual threat descriptions for ≤15 threats |
| Stats Dashboard | Threats found, original size, clean size, processing time, pages |
| Groq AI Report | Full natural-language analysis with model attribution |
| One-Click Download | Download sanitized file directly from results |
| Cyberpunk Aesthetic | Dark theme, grid background, scan-line animation, glowing accents |

🧰 Tech Stack

| Technology | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Core engine language |
| FastAPI | 0.111 | REST API + static file serving |
| PyMuPDF (fitz) | 1.24 | PDF parsing, xref surgery, incremental save |
| python-docx | 1.1 | OOXML ZIP manipulation |
| LangChain | 0.2 | LLM orchestration framework |
| Groq API | - | Ultra-fast LLM inference |
| LLaMA 3.3-70B | - | Threat narrative generation |
| YARA | 4.5 | Pattern-based malware detection |
| Vanilla JS + HTML5 | ES2024 | Zero-dependency browser frontend |
| lxml | 5.2 | XML processing for DOCX parts |

🚀 Quick Start

Prerequisites

Step 1: Clone & Install

# Clone the repository
git clone https://github.com/Nauman123-coder/aegis-cdr.git
cd aegis-cdr

# Create virtual environment
python -m venv .venv

# Activate (Windows Git Bash)
source .venv/Scripts/activate

# Activate (Linux / macOS)
source .venv/bin/activate

# Install all dependencies
pip install -r requirements.txt

Step 2: Configure Groq

# Copy the environment template
cp .env.example .env

Open .env and add your key:

# Required
GROQ_API_KEY=gsk_your_groq_api_key_here

# Optional: choose your model
GROQ_MODEL=llama-3.3-70b-versatile

Model Options:

| Model | Speed | Quality | Best For |
|---|---|---|---|
| llama-3.1-8b-instant | ⚡⚡⚡ | ★★★ | High-volume scanning |
| llama-3.3-70b-versatile | ⚡⚡ | ★★★★★ | Recommended |
| mixtral-8x7b-32768 | ⚡⚡ | ★★★★ | Long documents |

Step 3: Launch

uvicorn api.main:app --reload

That's it. Open http://localhost:8000 and the full drag-and-drop UI loads. No npm. No Node.js. No second terminal. No build step.


📁 Project Structure

aegis-cdr/
│
├── api/
│   └── main.py              ← FastAPI application
│                              POST /api/sanitize - main CDR endpoint
│                              GET  /api/download/{token} - file download
│                              GET  /api/health - Groq status check
│                              GET  / - serves the frontend
│
├── core/
│   ├── pdf/
│   │   └── sanitizer.py     ← PDF CDR Engine
│   │                          _scan_for_threats()  - xref/annotation scanner
│   │                          _scrub_inplace()     - surgical key removal
│   │                          _pixel_only_fallback() - emergency rasterizer
│   │
│   ├── docx/
│   │   └── sanitizer.py     ← DOCX CDR Engine
│   │                          Unzips OPC package
│   │                          Strips .rels, deletes vbaProject.bin
│   │                          Scrubs document.xml fields
│   │                          Re-zips clean package
│   │
│   └── ai/
│       └── sentry.py        ← AI Sentry (Groq Integration)
│                              summarize() - Groq LLM narrative
│                              risk_score() - weighted 0-100 scoring
│                              categorize_threats() - grouping engine
│
├── utils/
│   └── validator.py         ← SafeTypeValidator
│                              detect_true_type() - magic byte detection
│                              validate_extension_matches() - spoof check
│
├── rules/
│   └── aegis_rules.yar      ← 13 YARA Detection Rules
│                              PDF_Embedded_JavaScript
│                              PDF_OpenAction_AutoLaunch
│                              PDF_Heap_Spray_Pattern
│                              DOCX_VBA_Macro_Present
│                              DOCX_External_Template_Injection (T1221)
│                              DOCX_DDE_Injection (T1559.002)
│                              Generic_Suspicious_PowerShell
│                              Generic_Base64_Shellcode
│                              + 5 more
│
├── static/
│   └── index.html           ← Complete Frontend (620 lines, zero deps)
│                              Drag-drop upload zone
│                              5-step processing animation
│                              Animated SVG risk gauge
│                              Threat breakdown bar charts
│                              Groq AI analysis panel
│                              Download button
│
├── create_test_files.py     ← Test File Generator
│                              malicious_test.pdf (5 threat types)
│                              malicious_test.docx (8 threat types)
│                              spoofed_exe.pdf (MZ magic in .pdf)
│
├── aegis_standalone.py      ← CLI Interface
├── .env.example             ← Environment template
└── requirements.txt         ← Python dependencies

📡 API Reference

POST /api/sanitize

Upload a PDF or DOCX and receive a full threat intelligence report and a download token.

Request

Content-Type: multipart/form-data
Body: file=<binary>

Response

{
  "status": "sanitized",
  "original_filename": "invoice.pdf",
  "sanitized_filename": "SAFE_invoice.pdf",
  "true_mime_type": "application/pdf",
  "file_size_original": 816384,
  "file_size_sanitized": 798720,
  "processing_time_ms": 1247,
  "page_count_original": 15,
  "page_count_sanitized": 15,
  "items_removed_count": 13,
  "threat_categories": [
    {
      "name": "Scripts & JavaScript",
      "items": [
        "Document Catalog xref 1: /JavaScript detected and stripped",
        "Threat in xref 1: /JS",
        "Threat in xref 3: /JavaScript"
      ],
      "icon": "⚡"
    },
    {
      "name": "Auto-Execute Actions",
      "items": [
        "Document Catalog xref 1: /OpenAction detected and stripped",
        "Threat in xref 1: /AA"
      ],
      "icon": "🚀"
    }
  ],
  "risk": {
    "score": 100,
    "level": "CRITICAL",
    "color": "#ff1a1a",
    "rationale": "Embedded JavaScript; Auto-execute on open; Shell launch command"
  },
  "ai_summary": "A thorough analysis of invoice.pdf revealed critical threats including JavaScript and OpenAction exploits, which could have allowed arbitrary code execution and unauthorized system access if not neutralized. The removal of 13 malicious items has mitigated the risk of these exploits being used to compromise system security. The document is now safe for use, with all identified threats stripped and visual integrity confirmed at 15 pages.",
  "groq_powered": true,
  "fallback_used": false,
  "download_token": "aegis_1709123456_SAFE_invoice.pdf"
}
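
A client will usually reduce this response to a few fields before acting on it. The helper below is a sketch of that step; `summarize_report` is a hypothetical name, and it assumes only the field names shown in the example response above.

```python
def summarize_report(report: dict) -> dict:
    """Condense a /api/sanitize response into the fields a caller acts on."""
    categories = {
        category["name"]: len(category["items"])
        for category in report.get("threat_categories", [])
    }
    return {
        "file": report["sanitized_filename"],
        "risk": f'{report["risk"]["level"]} ({report["risk"]["score"]})',
        "removed": report["items_removed_count"],
        # Same page count before and after is the visual integrity check.
        "pages_match": (
            report["page_count_original"] == report["page_count_sanitized"]
        ),
        "size_delta_bytes": (
            report["file_size_sanitized"] - report["file_size_original"]
        ),
        "categories": categories,
    }
```

Applied to the response above, `size_delta_bytes` is negative because the sanitized file is smaller than the original, and `pages_match` is true since both counts are 15.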

Error Responses

| Status | Error Code | Description |
|---|---|---|
| 415 | FILE_BLOCKED | Magic bytes indicate dangerous file type |
| 415 | UNSUPPORTED_TYPE | Not PDF or DOCX |
| 500 | SANITIZATION_FAILED | Internal processing error |

GET /api/download/{token}

Download the sanitized file by token.

curl http://localhost:8000/api/download/aegis_1709123456_SAFE_invoice.pdf \
  --output SAFE_invoice.pdf

GET /api/health

Check server status and Groq configuration.

{
  "status": "operational",
  "version": "2.0.0",
  "groq": {
    "configured": true,
    "model": "llama-3.3-70b-versatile"
  },
  "ui": "http://localhost:8000",
  "supported_formats": ["PDF", "DOCX"]
}

🎯 Threat Detection Coverage

PDF Threat Matrix

| Threat | PDF Key | Points | Impact |
|---|---|---|---|
| Embedded JavaScript | /JavaScript, /JS | +40 | Arbitrary code execution on open |
| Auto-Execute Action | /OpenAction | +30 | Triggers immediately when PDF opens |
| Additional Actions | /AA | +25 | Page/annotation/field event triggers |
| Shell Launch | /Launch | +50 | Spawns external process (cmd.exe, bash) |
| Embedded File | /EmbeddedFile | +20 | Attached payload (exe, dll, bat) |
| Rich Media | /RichMedia | +30 | Flash/video execution context |
| Form Widget | /Widget | +20 | Interactive field with action trigger |
| PostScript XObject | /XObject + /PS | +35 | PostScript code injection |
| External URI | /URI | +10 | Tracking pixel / SSRF / phishing |
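
To make the key matching concrete, here is a rough triage scan over raw PDF bytes using only the standard library. It is not the xref-based scanner that Aegis uses through PyMuPDF: it cannot see inside compressed object streams, and the names `PDF_THREAT_KEYS` and `scan_pdf_names` are invented for this example.

```python
import re

# Dictionary keys from the PDF threat matrix. The /XObject + /PS pair is left
# out because it needs dictionary context, not a single name.
PDF_THREAT_KEYS = {
    "/JavaScript", "/JS", "/OpenAction", "/AA", "/Launch",
    "/EmbeddedFile", "/RichMedia", "/Widget", "/URI",
}

# A PDF name runs from "/" to the next whitespace or delimiter character,
# so "/JS" is matched as a whole name and never as a prefix of "/JSON".
_NAME = re.compile(rb"/[^\s()<>\[\]{}/%]+")
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def scan_pdf_names(raw: bytes) -> dict[str, int]:
    """Count occurrences of threat-related name tokens in raw PDF bytes."""
    counts: dict[str, int] = {}
    for match in _NAME.finditer(raw):
        # Undo "#xx" escapes so "/J#61vaScript" is read as "/JavaScript".
        decoded = _HEX_ESCAPE.sub(
            lambda m: bytes([int(m.group(1), 16)]), match.group()
        )
        name = decoded.decode("latin-1")
        if name in PDF_THREAT_KEYS:
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Decoding the `#xx` escapes matters because PDF readers treat `/J#61vaScript` and `/JavaScript` as the same name, so a scanner that compares raw text would miss the obfuscated form.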

DOCX Threat Matrix

| Threat | Location | Points | Impact |
|---|---|---|---|
| VBA Macros | vbaProject.bin | +45 | AutoOpen/AutoExec code execution |
| Remote Template | attachedTemplate rel | +30 | Loads macro payload from remote URL |
| DDE Field | DDEAUTO in instrText | +35 | Dynamic Data Exchange cmd.exe execution |
| Macro Button | MACROBUTTON field | +40 | Click-triggered macro execution |
| OLE Object | word/embeddings/ | +35 | Embedded executable object |
| ActiveX Control | word/activeX/ | +40 | Script-executable browser control |
| External Hyperlink | word/_rels/ | +10 | Tracking/phishing/SSRF link |
| Custom XML | customXml/ | +10 | Schema-based data injection |
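
The delete-and-re-zip part of DOCX disarming can be illustrated with the standard `zipfile` module. This sketch covers only part removal; a full sanitizer would also rewrite the .rels files, `[Content_Types].xml`, and field codes that reference the removed parts. The name `strip_docx_parts` and the exact prefix list are assumptions for the example.

```python
import io
import zipfile

# Package locations taken from the DOCX threat matrix.
STRIP_PREFIXES = ("customXml/", "word/activeX/", "word/embeddings/")
STRIP_BASENAMES = {"vbaProject.bin"}


def strip_docx_parts(docx_bytes: bytes) -> tuple[bytes, list[str]]:
    """Copy a DOCX package, skipping parts that can carry active content."""
    removed: list[str] = []
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            basename = info.filename.rsplit("/", 1)[-1]
            if (basename in STRIP_BASENAMES
                    or info.filename.startswith(STRIP_PREFIXES)):
                removed.append(info.filename)
                continue
            # Passing the ZipInfo keeps the entry's name and compression.
            dst.writestr(info, src.read(info.filename))
    return out.getvalue(), removed
```

Building a new archive instead of editing in place means nothing from a skipped entry, including leftover bytes between ZIP records, carries over into the output.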

YARA Rules (13 Patterns)

PDF_Embedded_JavaScript           - /JS and /JavaScript in PDF streams
PDF_OpenAction_AutoLaunch         - /OpenAction trigger detection
PDF_Heap_Spray_Pattern            - Large repeated NOP sled patterns
PDF_Suspicious_URI                - Encoded/obfuscated URI actions
DOCX_VBA_Macro_Present            - OLE2 vbaProject.bin signature
DOCX_External_Template_Injection  - MITRE ATT&CK T1221
DOCX_DDE_Injection                - MITRE ATT&CK T1559.002
DOCX_Macro_Auto_Execute           - AutoOpen/AutoExec triggers
Generic_Suspicious_PowerShell     - Encoded PowerShell download cradles
Generic_Base64_Shellcode          - Base64-encoded executable payloads
Generic_URL_Obfuscation           - Hex/percent-encoded malicious URLs
Generic_PE_In_Document            - MZ magic bytes inside document stream
Generic_OLE_Embedded              - OLE2 compound document signature

📊 Risk Scoring Engine

Aegis computes a cumulative risk score based on all threats found:

Score = min(100, Σ threat_points)

| Score | Level | Color | Indicator |
|---|---|---|---|
| 0 | ✅ CLEAN | #00ff9d | No active content |
| 1 – 19 | 🔵 LOW | #00c9ff | Tracking links or custom XML only |
| 20 – 39 | 🟡 MEDIUM | #ffd700 | Embedded files or form widgets |
| 40 – 69 | 🟠 HIGH | #ff6b35 | VBA macros, DDE injection, OLE objects |
| 70 – 100 | 🔴 CRITICAL | #ff1a1a | JavaScript, LaunchAction, or pixel fallback |

Example Scoring:

Document with VBA macro (+45) + DDE injection (+35) + 2 hyperlinks (+20) = 100 → CRITICAL
Document with 3 tracking hyperlinks only (+30) = 30 → MEDIUM
Clean research paper = 0 → CLEAN ✅
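
The scoring rule and level bands are small enough to restate as code. The sketch below takes its point values from the two threat matrices and its bands from the table above; the dictionary keys and the function names `score_threats` and `level_for` are made up for illustration and are not the identifiers used in core/ai/sentry.py.

```python
# Points per threat occurrence, copied from the PDF and DOCX threat matrices.
THREAT_POINTS = {
    "javascript": 40, "open_action": 30, "additional_actions": 25,
    "launch": 50, "embedded_file": 20, "rich_media": 30, "widget": 20,
    "postscript_xobject": 35, "uri": 10,
    "vba_macro": 45, "remote_template": 30, "dde_field": 35,
    "macro_button": 40, "ole_object": 35, "activex": 40,
    "external_hyperlink": 10, "custom_xml": 10,
}

# Lower bound of each band, checked from the highest band down.
LEVEL_BANDS = [(70, "CRITICAL"), (40, "HIGH"), (20, "MEDIUM"), (1, "LOW")]


def score_threats(threats: list[str]) -> int:
    """Sum the points for every occurrence and cap the total at 100."""
    return min(100, sum(THREAT_POINTS[t] for t in threats))


def level_for(score: int) -> str:
    for lower_bound, level in LEVEL_BANDS:
        if score >= lower_bound:
            return level
    return "CLEAN"


# The three examples above, restated:
assert score_threats(
    ["vba_macro", "dde_field", "external_hyperlink", "external_hyperlink"]
) == 100
assert level_for(score_threats(["external_hyperlink"] * 3)) == "MEDIUM"
assert level_for(score_threats([])) == "CLEAN"
```

Each occurrence is counted separately, which is why two hyperlinks contribute 20 points in the first example.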

🖥️ Frontend Interface

The entire frontend is a single self-contained HTML file (static/index.html) served directly by FastAPI. No npm, no Node.js, no build tools required.

UI States

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  DROP ZONE   │───▶│  PROCESSING  │───▶│   RESULTS    │───▶│  DOWNLOAD    │
│              │    │              │    │              │    │              │
│ Drag & Drop  │    │ Step-by-step │    │  Risk Gauge  │    │  SAFE_*.pdf  │
│  or Browse   │    │  animation   │    │  Threat Bars │    │  or .docx    │
│              │    │  5 stages    │    │  AI Report   │    │              │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Processing Stages (animated in UI)

  1. 🔍 Fingerprinting magic bytes
  2. 🧩 Decomposing component tree
  3. 💣 Disarming active content
  4. 🏗️ Reconstructing clean document
  5. 🤖 Groq AI Sentry analysis

Design System

Background:  #060a10 (near-black)
Surface:     #0c1220 / #111827
Accent:      #00d4ff (cyan)
Success:     #00ff9d (green)
Danger:      #ff3c3c (red)
Warning:     #ffd700 (gold)
Font:        Share Tech Mono + Rajdhani (Google Fonts)
Effects:     Animated grid, corner glows, scan-line, SVG ring meter

🧪 Testing

Generate Malicious Test Files

python create_test_files.py

Produces test_files/ with three validation files:

malicious_test.pdf is a hand-crafted PDF with 5 threat structures:

  • /OpenAction with /JavaScript in catalog (auto-runs on open)
  • /AA Additional Actions on catalog and page
  • /Launch annotation pointing to cmd.exe /c calc.exe
  • /EmbeddedFile attachment (malware_payload.exe)
  • /URI external tracking link

→ Expected: Risk CRITICAL (100) | Threats: 13 | Groq names JS/OpenAction/Launch


malicious_test.docx is a full-threat DOCX with 8 attack vectors:

  • vbaProject.bin with real OLE2 magic bytes + AutoOpen + Shell macro
  • External template injection via attachedTemplate
  • DDEAUTO field with cmd.exe + PowerShell -EncodedCommand
  • MACROBUTTON field
  • 2× external hyperlinks (phishing + tracking pixel)
  • ActiveX control with CLSID
  • 3× customXml parts with base64 payload

→ Expected: Risk CRITICAL (100) | Threats: 14 | All 5 categories populated


spoofed_exe.pdf is a Windows PE executable with a .pdf extension:

  • MZ magic bytes: 4D 5A 90 00 03 00...
  • Claimed extension: .pdf
  • True type: application/x-msdownload

→ Expected: 🚨 BLOCKED at Layer 1, never reaches sanitization


CLI Testing

# Sanitize a single file
python aegis_standalone.py --file document.pdf

# Enable pixel fallback for high-risk files
python aegis_standalone.py --file risky.pdf --pixel-fallback

# Run built-in demo (creates and sanitizes test files automatically)
python aegis_standalone.py --demo

# API health check
curl http://localhost:8000/api/health

🗺️ Roadmap

  • PDF CDR engine with incremental save
  • DOCX CDR engine with full XML sanitization
  • Groq LLM threat intelligence
  • Magic byte fingerprinting
  • Zero-dependency frontend
  • Risk scoring engine
  • YARA rule integration
  • XLSX support: Spreadsheet CDR (macro-enabled workbooks)
  • PPTX support: Presentation CDR
  • RTF support: Rich Text Format
  • Email pipeline: .eml / .msg attachment scanning
  • Batch API: Async queue for multiple files
  • Docker container: One-command deployment
  • Webhook callbacks: POST results to external URL
  • Audit log: Persistent scan history with search
  • Password-protected PDF: Decrypt before sanitize
  • Report export: PDF/HTML threat report download

🤝 Contributing

Contributions are warmly welcome!

# Fork and clone
git clone https://github.com/YOUR_USERNAME/aegis-cdr.git
cd aegis-cdr

# Set up development environment
python -m venv .venv
source .venv/Scripts/activate   # Windows
source .venv/bin/activate       # Linux/macOS
pip install -r requirements.txt
cp .env.example .env            # Add your GROQ_API_KEY

# Create a feature branch
git checkout -b feature/your-feature-name

# Make changes, then commit
git add .
git commit -m "feat: add XLSX support"

# Push and open PR
git push origin feature/your-feature-name

Contribution Areas

| Area | Description |
|---|---|
| 🆕 New file formats | Add XLSX, PPTX, RTF, EML engines |
| 🔍 Detection rules | New YARA rules, threat signatures |
| 🤖 AI improvements | Better prompts, structured Groq output |
| 🎨 Frontend | UI improvements, dark/light theme |
| 📦 Deployment | Docker, CI/CD, cloud deployment guides |
| 📝 Documentation | Examples, tutorials, threat research |

⚠️ Disclaimer

Aegis-CDR is a security research and document sanitization tool.

  • The test files in create_test_files.py contain simulated threat structures, with no working exploits or actual malware
  • Always run in an isolated environment when processing real-world untrusted files
  • Aegis-CDR is a defence tool; do not use it to craft malicious documents
  • The pixel fallback mode produces rasterized output, so text will not be searchable or copyable

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


Built with 🛡️ by Nauman Ahmad

Detect. Neutralize. Reconstruct.

⭐ Star this repo if Aegis-CDR helped secure your documents!


About

AI-powered Content Disarm & Reconstruction engine for PDF and DOCX files. Detects and strips malicious JavaScript, macros, OLE objects, and embedded threats using PyMuPDF, python-docx, and Groq LLM analysis. Built with FastAPI + vanilla JS frontend.
