Use the official Microsoft Azure packing trace:
- Dataset doc: Azure Trace for Packing 2020
- Direct download: AzurePackingTraceV1.zip
This trace is a good fit for the scheduling stage because it contains:
- VM request ids
- tenant ids
- VM types
- VM priority
- start times
- end times
- normalized resource requirements through the vmType table
The download is practical at roughly 51 MB compressed, and it expands into a single SQLite file with millions of VM requests.
Local paths used during setup:
- data/external/AzurePackingTraceV1.zip
- data/external/azure_packing/packing_trace_zone_a_v1.sqlite
These files are intentionally ignored by Git because they are raw external artifacts.
The SQLite database contains two tables:
- vm
- vmType
The vm table provides:
- vmId
- tenantId
- vmTypeId
- priority
- starttime
- endtime
The vmType table provides:
- vmTypeId
- machineId
- core
- memory
- hdd
- ssd
- nic
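To make the schema concrete, here is a minimal sketch of reading VM requests together with their resource requirements using Python's built-in sqlite3 module. The table and column names come from the lists above; everything else is an assumption made for illustration: the function names load_vm_requests and build_demo_db, the column types in the demo database, and the choice to take the maximum core and memory value per VM type. That last choice is there because vmType carries a machineId column, so a VM type may appear on more than one row; a different aggregation may suit your use case better.

```python
import sqlite3


def load_vm_requests(conn, limit=5):
    """Return VM requests joined with one resource row per VM type.

    Assumption: a vmTypeId can appear on several vmType rows (one per
    machineId), so the subquery collapses them with MAX before joining.
    """
    query = """
        SELECT v.vmId, v.tenantId, v.vmTypeId, v.priority,
               v.starttime, v.endtime, t.core, t.memory
        FROM vm AS v
        JOIN (
            SELECT vmTypeId, MAX(core) AS core, MAX(memory) AS memory
            FROM vmType
            GROUP BY vmTypeId
        ) AS t ON t.vmTypeId = v.vmTypeId
        ORDER BY v.starttime
        LIMIT ?
    """
    cursor = conn.cursor()
    cursor.row_factory = sqlite3.Row  # lets us convert rows to dicts
    return [dict(row) for row in cursor.execute(query, (limit,))]


def build_demo_db():
    """Build a tiny in-memory database with the documented column names.

    The column types and the sample values are made up for this demo.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE vm (
            vmId INTEGER, tenantId INTEGER, vmTypeId INTEGER,
            priority INTEGER, starttime REAL, endtime REAL
        );
        CREATE TABLE vmType (
            vmTypeId INTEGER, machineId INTEGER, core REAL,
            memory REAL, hdd REAL, ssd REAL, nic REAL
        );
        INSERT INTO vm VALUES (1, 10, 100, 0, 0.5, 1.5);
        INSERT INTO vm VALUES (2, 11, 101, 1, 0.25, NULL);
        INSERT INTO vmType VALUES (100, 0, 0.25, 0.125, 0.0, 0.0, 0.0);
        INSERT INTO vmType VALUES (100, 1, 0.5, 0.25, 0.0, 0.0, 0.0);
        INSERT INTO vmType VALUES (101, 0, 0.125, 0.0625, 0.0, 0.0, 0.0);
    """)
    return conn


if __name__ == "__main__":
    for request in load_vm_requests(build_demo_db()):
        print(request)
```

To run the same query against the real trace, pass sqlite3.connect("data/external/azure_packing/packing_trace_zone_a_v1.sqlite") instead of the demo connection.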
The project now includes src/workload_loader.py, which converts the Azure trace into a scheduler-oriented jobs table.
Example:
python3 -m src.workload_loader \
--input-path data/external/azure_packing/packing_trace_zone_a_v1.sqlite \
--output-path data/processed/azure_jobs_sample.csv \
--limit 5000 \
--start-datetime "2019-01-01 00:00:00"
This trace is excellent for workload behavior, but it does not directly include:
- real origin_city
- explicit power_demand in kW
- user-facing latency SLAs
- true geographic placement
Because of that, the current converter uses proxies:
- origin_city is deterministically assigned from tenant id across a city pool
- power_demand is approximated from normalized CPU allocation
- deadline is derived from observed duration plus configurable slack
These assumptions are acceptable for a research scheduler prototype, but they should be stated clearly in any report or presentation.
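The three proxies can be illustrated with a short sketch. This is not the code in src/workload_loader.py, whose internals are not shown here. The function names, the city pool, the 0.4 kW scaling constant, and the slack factor are all hypothetical values chosen for the example. The sketch uses a SHA-256 digest for the city assignment so that the mapping stays stable across runs and machines.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical values for illustration only; not taken from the project.
CITY_POOL = ["Seattle", "Dublin", "Singapore", "Sao Paulo"]
MAX_POWER_KW = 0.4  # assumed power draw of a fully allocated machine


def assign_origin_city(tenant_id, city_pool=CITY_POOL):
    """Map a tenant id to a city deterministically via a stable hash."""
    digest = hashlib.sha256(str(tenant_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(city_pool)
    return city_pool[index]


def estimate_power_demand_kw(normalized_core, max_power_kw=MAX_POWER_KW):
    """Approximate power demand as a linear function of normalized CPU."""
    return normalized_core * max_power_kw


def derive_deadline(start, duration, slack_factor=0.5):
    """Deadline = start + observed duration extended by a slack fraction."""
    return start + duration * (1.0 + slack_factor)


if __name__ == "__main__":
    start = datetime(2019, 1, 1)
    print(assign_origin_city(42))
    print(estimate_power_demand_kw(0.5))
    print(derive_deadline(start, timedelta(hours=2)))
```

A linear power model and a multiplicative slack are the simplest options; either can be replaced without changing the shape of the jobs table, as long as the substitution is documented alongside the other assumptions.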