Cyber Security Attacks — Multiclass Classification

End-to-end machine learning project for multiclass classification of network attacks using the Cyber Security Attacks dataset from Kaggle. The primary model is a Random Forest classifier; Gradient Boosting and k-NN are trained for comparison. A Streamlit dashboard visualizes exploratory analysis, preprocessing, training, and evaluation.

Features

Dataset: 40,000 instances and 25 attributes (see the Project Overview tab of the dashboard for descriptions).
Target: Attack Type with three nearly balanced classes: DDoS, Malware, Intrusion.
Pipeline (pipeline.py): generates EDA plots; preprocesses the data (drops non-generalizing columns, adds binary flags for sparse fields, applies label encoding and scaling); makes an 80/20 stratified split; trains the Random Forest and the comparison models; and reports 5-fold CV scores, test metrics, a confusion matrix, one-vs-rest (OvR) ROC curves, and feature importances. Artifacts are written to results/; the trained Random Forest is saved as models/random_forest.joblib.
Dashboard (app.py): Tabbed Streamlit UI (overview, EDA, preprocessing, model design, model comparison, detailed RF results, interactive explorer).
Data download (download_data.py): Fetches the official CSV via Kaggle Hub into data/cybersecurity_attacks.csv (cache directory: .kaggle_cache/).
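
The split, training, and cross-validation steps listed for pipeline.py can be sketched as follows. This is a minimal illustration on synthetic data, not the code in pipeline.py: the feature matrix, n_estimators=50, and the random_state values are assumptions chosen for a quick run, and the scaler is wrapped in a scikit-learn pipeline so that it is refit inside each CV fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real dataset (assumption): 600 rows, 5 numeric
# features, and a balanced three-class target named like Attack Type.
X = rng.normal(size=(600, 5))
labels = np.array(["DDoS", "Malware", "Intrusion"] * 200)

# Label-encode the target; LabelEncoder orders classes alphabetically.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# 80/20 stratified split keeps class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling + Random Forest in one estimator, so the scaler only ever
# sees the training portion of each fold.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# 5-fold cross-validation on the training set, then a final fit and test score.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("CV accuracy per fold:", np.round(cv_scores, 3))
print("Test accuracy:", round(test_accuracy, 3))
```

Because the synthetic features carry no signal, both scores should land near chance level for three balanced classes.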

Requirements

Python 3.10–3.13 (tested with 3.12).
Internet access on first run to download the dataset (~5 MB) unless data/cybersecurity_attacks.csv is already present.
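
Both requirements can be checked before anything is installed. The following is a small sketch using only the standard library; the function names are hypothetical and not part of the repository, and treating an empty file as "missing" is an assumption.

```python
import sys
from pathlib import Path

MIN_VERSION = (3, 10)  # lowest supported Python, per the Requirements list
MAX_VERSION = (3, 13)  # highest supported Python, per the Requirements list
DATASET_PATH = Path("data") / "cybersecurity_attacks.csv"


def python_supported(version=None):
    """Return True if (major, minor) lies in the documented 3.10 to 3.13 range."""
    if version is None:
        major_minor = tuple(sys.version_info[:2])
    else:
        major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


def needs_download(path=DATASET_PATH):
    """Return True when the dataset CSV is absent or empty (assumed rule)."""
    path = Path(path)
    return not (path.is_file() and path.stat().st_size > 0)


if __name__ == "__main__":
    print("Python supported:", python_supported())
    print("Download needed:", needs_download())
```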

Quick start (one command)

Linux / macOS

chmod +x start.sh
./start.sh

Windows

Double-click start.bat, or run it from cmd (in PowerShell, use .\start.bat):

start.bat

The script will:

Resolve Python 3.10+.
Create venv/ if needed and run pip install -r requirements.txt.
Run python pipeline.py (downloads data if missing, trains models, writes results/).
Start the Streamlit app at http://localhost:8501 (streamlit run app.py).
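
For readers who want to see the sequence spelled out, the four steps above can be expressed as a list of commands. This is a hypothetical Python rendering, not the contents of start.sh or start.bat; it only builds and prints the commands (a dry run), and it assumes the venv interpreter is used for every step, with Streamlit launched through python -m streamlit.

```python
import sys
from pathlib import Path


def venv_python(venv_dir="venv", windows=False):
    """Path of the interpreter inside the virtual environment."""
    parts = ("Scripts", "python.exe") if windows else ("bin", "python")
    return str(Path(venv_dir, *parts))


def launcher_steps(venv_dir="venv", windows=False, venv_exists=False):
    """Return the commands (as argv lists) that the launcher would run in order."""
    py = venv_python(venv_dir, windows)
    steps = []
    if not venv_exists:
        # Create the virtual environment with the current interpreter.
        steps.append([sys.executable, "-m", "venv", venv_dir])
    steps.append([py, "-m", "pip", "install", "-r", "requirements.txt"])
    steps.append([py, "pipeline.py"])
    steps.append([py, "-m", "streamlit", "run", "app.py"])
    return steps


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for step in launcher_steps(venv_exists=Path("venv").is_dir()):
        print(" ".join(step))
```

To execute the steps instead of printing them, each list could be passed to subprocess.run(step, check=True).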

Stop the server with Ctrl+C.

Manual steps (optional)

python3 -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt
python download_data.py           # optional; pipeline also downloads if needed
python pipeline.py
streamlit run app.py

Project layout

Path	Purpose
`download_data.py`	Download dataset from Kaggle into `data/`
`pipeline.py`	Full ML pipeline and evaluation
`app.py`	Streamlit dashboard
`data/`	Dataset CSV (generated; see `.gitignore`)
`results/`	Metrics, JSON, PNG plots, NumPy confusion matrix (generated)
`models/`	Saved `random_forest.joblib` (generated)
`start.sh` / `start.bat`	One-command setup + pipeline + Streamlit

Academic report (Polish course outline)

For the Sztuczna inteligencja (Artificial Intelligence) course report, map the sections as follows:

Student data — name, program, year, academic year (fill in manually).
Course — Artificial Intelligence (or your exact course title).
Project topic — Multiclass classification of cyber security attacks (Random Forest on Kaggle dataset).
Problem characterization — Supervised multiclass classification; balanced three-class target; network and security features with missing values in several columns.
Number of instances — 40,000.
Attributes — 25; use the table in the Streamlit Project Overview tab and the dataset documentation on Kaggle.
Preprocessing — Summarize steps from the Preprocessing tab / preprocessing_info.json (dropped columns, binary flags, encodings, scaling, split).
Model design — Random Forest (primary), plus Gradient Boosting and k-NN for comparison; hyperparameters as in the Model & Training tab and pipeline.py.
Results — Accuracy, macro F1, precision, recall, ROC AUC, confusion matrix, per-class metrics (metrics.json / dashboard).
Conclusions — Strengths of RF on this task, role of important features, limitations (e.g. IP address columns and free-text fields are dropped rather than encoded).
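
When writing the Results section, it helps to know how the headline numbers relate to the confusion matrix. The sketch below derives accuracy, per-class precision/recall/F1, and macro F1 from a matrix of counts using plain Python. The example matrix is made up for illustration and is not output of this project; the convention that rows are true classes and columns are predictions is an assumption that matches scikit-learn's default.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # missed samples of class k
        fp = sum(confusion[i][k] for i in range(n)) - tp  # other classes predicted as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results


def summarize(confusion):
    """Accuracy plus macro-averaged F1 (unweighted mean over classes)."""
    per_class = per_class_metrics(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    macro_f1 = sum(m["f1"] for m in per_class) / len(per_class)
    return {"accuracy": correct / total, "macro_f1": macro_f1, "per_class": per_class}


if __name__ == "__main__":
    # Illustrative counts only; assumed class order: DDoS, Malware, Intrusion.
    example = [
        [50, 30, 20],
        [25, 45, 30],
        [20, 35, 45],
    ]
    summary = summarize(example)
    print("accuracy:", round(summary["accuracy"], 3))
    print("macro F1:", round(summary["macro_f1"], 3))
```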

Notes

Empirical performance: On this Kaggle release, test accuracy is often near the random baseline (≈1/3) for balanced three-class prediction, and ROC AUC is near 0.5, with all three models behaving similarly. That is a valid finding for your report: after removing identifiers and free text, the remaining tabular features may carry little usable signal for Attack Type in this synthetic split. Near-chance scores do not mean the pipeline or the metrics are wrong; report them as they are in the Wnioski (Conclusions) section.
Git: venv/, .kaggle_cache/, data/cybersecurity_attacks.csv, and the generated results/ and models/ artifacts are listed in .gitignore. Clone the repo and run ./start.sh to regenerate everything.
Kaggle authentication: Public dataset download via kagglehub typically works without extra setup; if you hit auth errors, follow Kaggle's API credentials instructions and either set KAGGLE_USERNAME / KAGGLE_KEY or place kaggle.json in ~/.kaggle/.
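
To judge whether a test accuracy is really "near the random baseline", as the empirical performance note puts it, it can be compared with the range that pure chance would produce. The sketch below uses a normal approximation to the binomial for a chance-level classifier on the 8,000-sample test set (20% of 40,000). The function names and the 95% level (z = 1.96) are assumptions for illustration. With these settings the chance range is roughly 0.323 to 0.344.

```python
import math


def chance_interval(n_samples, n_classes=3, z=1.96):
    """Approximate range of accuracy for a chance-level classifier.

    Assumes balanced classes and independent test samples, so each
    prediction is correct with probability 1 / n_classes.
    """
    p = 1.0 / n_classes
    standard_error = math.sqrt(p * (1.0 - p) / n_samples)
    return p - z * standard_error, p + z * standard_error


def consistent_with_chance(accuracy, n_samples, n_classes=3, z=1.96):
    """True if the observed accuracy falls inside the chance interval."""
    low, high = chance_interval(n_samples, n_classes, z)
    return low <= accuracy <= high


if __name__ == "__main__":
    n_test = 40_000 // 5  # 20% test split of the 40,000 instances
    low, high = chance_interval(n_test)
    print(f"Chance range for n={n_test}: {low:.3f} to {high:.3f}")
    # Hypothetical accuracies, not measured results:
    for acc in (0.335, 0.40):
        print(acc, "consistent with chance:", consistent_with_chance(acc, n_test))
```

An accuracy inside this range gives no evidence that the model learned anything about Attack Type; one clearly above it would.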

License

Dataset usage is subject to the Kaggle dataset license. The code in this repository is provided for educational use.
