Reinforcement Learning Technology Development & Air Traffic/Logistics Applications: A Research Report
February 2026 | Focus period: 2018–present (emphasis on 2021–2025)
(1) Executive Summary
- RL is experiencing a "renaissance" driven by convergence with generative AI and LLMs. RLHF, DPO, and GRPO have become standard for LLM alignment, and pure RL can now incentivize emergent reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1 [52][55].
- World models are maturing rapidly. DreamerV3 (Nature, 2025) outperforms specialized methods across 150+ tasks with fixed hyperparameters, becoming the first algorithm to collect diamonds in Minecraft from scratch [125][122].
- Offline RL is production-ready for domains with logged data. Conservative Q-Learning (CQL), Decision Transformer, and Implicit Q-Learning enable policy learning from historical datasets without online exploration—ideal for aviation [47][5].
- Safe RL is formalized. Constrained MDPs (CMDPs), safety filters, and shielding methods are well-established theoretically, with SafeDreamer achieving near-zero constraint violations at ICLR 2024 [64][21].
- Production-grade RL deployments exist in chip design (AlphaChip for Google TPUs) [^109], robotics (AgiBot's real-world RL in manufacturing) [^60], recommendation systems (Netflix, Uber, Google), and energy management [^75].
- In aviation, RL has been applied to conflict resolution, departure metering (22% fuel reduction at Singapore Changi) [^11], autonomous taxiing (29.5% fuel reduction) [^16], ground delay programs, and slot scheduling [^123]—but nearly all remain at academic or prototype stage.
- BlueSky-Gym (SESAR Innovation Days, 2024) provides the first standardized RL benchmark environments for ATC/ATM built on the Gymnasium API [32][35].
- HYPERSOLVER (EU Horizon/SESAR JU) is the first project to holistically merge ATFM and ATC using deep RL, currently in its second year [48][51].
- ADS-B data is increasingly accessible for research—TartanAviation provides 661 days of multi-modal ADS-B data [^65], and ADSBexchange offers historical operations data including runway-level takeoffs/landings [^68].
- Multi-Agent RL (MARL) is the natural paradigm for air traffic operations, with successful applications in conflict resolution using Rainbow DQN, A3C, and MADDPG [23][17][^20].
- The biggest bottleneck for RL in aviation is not algorithms but rather environment fidelity, safety certification, regulatory acceptance, and the sim-to-real gap [14][2].
- Offline RL and world models represent the most viable path for near-term aviation applications, as they can leverage historical operational data without requiring online exploration in safety-critical environments [119][47].
- RL + LLM Agent architectures are emerging for logistics planning (RLTR framework, EMNLP 2025), achieving 8–12% improvement in planning performance [77][80].
- Airport ground handling scheduling remains an under-explored but high-value target for RL research, with a recent TU Delft thesis (2025) being among the first to tackle the complete problem [^95].
- For the user's profile (AI + Logistics PhD, ADS-B/airport ground data, Databricks/MLflow stack), the optimal research direction combines offline RL or model-based RL with airport ground operations, using the BlueSky simulator and real ADS-B data for validation.
(2) RL Technology Map (2018–Present)
Core Algorithm Families
| Family | Key Algorithms | Strengths | Limitations |
|---|---|---|---|
| Value-based | DQN, Double DQN, Dueling DQN, Rainbow DQN, QR-DQN | Discrete action spaces, stable with experience replay | Poor scalability to continuous/high-dim actions |
| Policy-based | REINFORCE, TRPO, PPO | Continuous actions, direct optimization | High variance, sample inefficiency |
| Actor-Critic | A2C/A3C, SAC, DDPG, TD3 | Combines value estimation with policy optimization | Hyperparameter sensitivity |
| Model-based | DreamerV1–V3, MuZero, MBPO, SafeDreamer | Sample efficiency, planning, safety integration | Model error compounding |
| Offline RL | CQL, IQL, Decision Transformer, BCQ, TD3+BC | Learn from logged data, no online exploration | Distribution shift, conservative bias |
| Multi-Agent RL | MADDPG, QMIX, VDN, MAPPO, COMA | Cooperative/competitive multi-agent tasks | Non-stationarity, scalability |
| Hierarchical RL | Option-Critic, HIRO, HAM | Temporal abstraction, sub-goal planning | Reward design complexity |
[7][14]
Key Milestones (2018–2025)
- 2017: PPO published (OpenAI); AlphaZero masters chess/Go/shogi with pure self-play RL [^93]
- 2018: OpenAI Five defeats professional Dota 2 teams [^91]
- 2019: AlphaStar reaches Grandmaster in StarCraft II using multi-agent RL + imitation learning [96][99]
- 2020: MuZero plans in learned latent spaces; AlphaChip introduced for chip floorplanning [^103]
- 2021: Decision Transformer frames RL as sequence modeling; AlphaChip published in Nature and deployed for Google TPU production [^109]
- 2022: ChatGPT launches using RLHF (PPO + reward model); sim-to-real transfer advances [^91]
- 2023: DreamerV3 released (masters Minecraft from scratch); SafeDreamer achieves near-zero-cost safe RL [^64]; L2RPN competitions accelerate RL for power grids [^85]
- 2024: OpenAI o1 (test-time compute scaling via RL); DPO widely adopted as simpler RLHF alternative; BlueSky-Gym launched for aviation RL benchmarking [^32]
- 2025: DeepSeek-R1 demonstrates pure RL can incentivize LLM reasoning via GRPO without SFT [52][55]; DreamerV3 published in Nature [^125]; AgiBot deploys real-world RL in manufacturing [^60]; EMERALD world model surpasses human experts on Crafter [^67]
RL for LLMs: RLHF → DPO → GRPO
| Method | Process | Reward Model | Complexity | Key Use |
|---|---|---|---|---|
| RLHF | Human rankings → reward model → PPO optimization | Separate model required | High | ChatGPT, Claude |
| RLAIF | AI-generated rankings replace human annotators | AI-derived | Medium | Scaling alignment |
| DPO | Preference pairs → direct policy optimization | Implicit (no separate model) | Low | Stable, efficient alignment [^107] |
| GRPO | Group scoring → relative policy optimization | Continuous scoring function | Medium | DeepSeek-R1 reasoning [52][104] |
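The DPO row above corresponds to a one-line loss per preference pair; a minimal pure-Python sketch (beta and the log-probabilities are illustrative inputs, in practice summed token log-probs from the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the implicit reward
    margin, where the implicit reward is the beta-scaled log-ratio of the
    policy to the frozen reference model (the 'implicit reward model')."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log1p(exp(-margin))
    return math.log1p(math.exp(-margin))
```

At zero margin the loss is log 2; it falls as the policy separates chosen from rejected responses faster than the reference model does, which is why no separate reward model is needed.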
Benchmarks & Their Aviation Relevance
| Benchmark | Domain | Aviation Relevance |
|---|---|---|
| Atari / DM Control / MuJoCo | Game / Continuous control | Low – Toy domains, useful for algorithm validation only |
| Procgen / Crafter | Procedural generation / Survival | Low-Medium – Tests generalization, relevant for domain randomization |
| MetaWorld | Multi-task robotic manipulation | Medium – Relevant for ground handling equipment control |
| Safety-Gymnasium | Safe RL tasks | High – Directly relevant for safety-constrained aviation RL [^64] |
| PettingZoo / SMAC | Multi-agent coordination | High – Multi-agent coordination mirrors ATC scenarios |
| BlueSky-Gym | ATC/ATM specific | Very High – Purpose-built for aviation RL research [32][35] |
| D4RL | Offline RL benchmarks | High – Framework for offline methods applicable to logged ADS-B data [^47] |
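Because BlueSky-Gym builds on the Gymnasium API, any agent loop written against the standard reset/step contract transfers to its environments unchanged. A self-contained toy illustration of that contract (the conflict dynamics and all numbers here are invented for the sketch, not taken from BlueSky-Gym):

```python
import random

class TinyConflictEnv:
    """Toy environment exposing the Gymnasium reset/step contract that
    BlueSky-Gym environments follow. Dynamics are illustrative only:
    one intruder closes head-on; action 0 holds heading (separation
    shrinks), action 1 turns away (separation grows)."""

    MAX_STEPS = 100

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.separation = 10.0          # nautical miles to the intruder
        self.t = 0
        return [self.separation], {}    # (observation, info)

    def step(self, action):
        self.t += 1
        self.separation += 1.5 if action == 1 else -2.0
        terminated = self.separation <= 0.0 or self.separation >= 15.0
        truncated = self.t >= self.MAX_STEPS
        reward = -10.0 if self.separation <= 0.0 else 0.1
        return [self.separation], reward, terminated, truncated, {}

# A rollout loop identical to how an agent would drive a BlueSky-Gym env
env = TinyConflictEnv()
obs, info = env.reset(seed=0)
done, ret = False, 0.0
while not done:
    obs, reward, terminated, truncated, info = env.step(env.rng.choice([0, 1]))
    ret += reward
    done = terminated or truncated
```

Swapping `TinyConflictEnv` for a `gymnasium.make(...)` call on a BlueSky-Gym environment leaves the rollout loop untouched, which is exactly what makes the benchmark useful for algorithm comparison.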
(3) RL Engineering & Deployment
Data & Simulation
- Sim-to-Real Transfer: Domain randomization and system identification remain critical for bridging simulation-to-reality gaps. In aviation, simulators like BlueSky provide parametric models of aircraft performance, but real-world taxi operations exhibit stochastic delays, weather effects, and human controller variability [^14].
- Digital Twins: Airport digital twins integrating ADS-B feeds, A-SMGCS data, and weather streams enable high-fidelity RL training environments. Singapore Changi's departure metering study used historical A-SMGCS data to construct training scenarios [^11].
- Log Data-Driven Approaches: Offline RL (CQL, IQL, Decision Transformer) can directly leverage historical operational logs without building simulators—particularly attractive for aviation where online exploration is infeasible [5][47].
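What makes CQL "conservative" is a single regularizer: push Q down on all actions the policy might pick (a soft maximum over actions) and back up on the actions the log actually contains. A minimal discrete-action sketch (in practice this term is added to the TD loss with a tunable weight):

```python
import numpy as np

def cql_penalty(q_values, data_actions):
    """CQL regularizer on a batch of logged transitions.

    q_values     : (batch, n_actions) Q estimates for every discrete action
    data_actions : (batch,) indices of the actions actually taken in the log
    """
    # Stable logsumexp over actions: soft upper bound on Q at each state
    m = q_values.max(axis=1, keepdims=True)
    logsumexp = (m + np.log(np.exp(q_values - m).sum(axis=1, keepdims=True))).squeeze(1)
    # Q of the in-distribution (logged) actions
    data_q = q_values[np.arange(len(q_values)), data_actions]
    return (logsumexp - data_q).mean()
```

Minimizing this penalty suppresses optimistic Q estimates on actions the historical logs never took, which is the core defense against distribution shift in offline RL.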
Training Infrastructure & Experiment Tracking
| Framework | Best For | MARL Support | Scalability | Ease of Use |
|---|---|---|---|---|
| Stable Baselines3 | Research prototyping, single-agent | No | Single machine | High [^31] |
| RLlib (Ray) | Production, distributed, MARL | Yes | Multi-node cluster | Medium [^34] |
| CleanRL | Educational, reproducibility | No | Single machine | High [^31] |
| Tianshou | Flexibility, performance | Limited | Single machine | Medium [^40] |
| ACME (DeepMind) | Research-grade components | Yes | Multi-node | Low [^40] |
For experiment tracking and reproducibility, MLflow (already in the user's stack) is well-suited for logging RL training runs, hyperparameter sweeps, reward curves, and model artifacts. Integration with RLlib via Ray Tune provides scalable hyperparameter optimization.
Safe RL & Constraint Handling
Safe RL methods formalize safety as Constrained Markov Decision Processes (CMDPs), where the agent maximizes reward subject to cost constraints [21][24]:
- Lagrangian/primal-dual methods: Convert constraints into adaptive penalty terms via learned multipliers (PDO); trust-region variants (CPO, PCPO) enforce the constraint at each policy update. Most popular in practice.
- Safety filters/shielding: Hard constraint enforcement at every step via control barrier functions
- SafeDreamer: Integrates Lagrangian constraints into Dreamer's world model planning, achieving near-zero violations on Safety-Gymnasium (ICLR 2024) [64][70]
- SafeMARL: Extends safe RL to multi-agent settings—critical for cooperative ATC applications [^21]
For aviation applications, the regulatory requirement for explainability and verifiability makes safety-constrained approaches mandatory. The ACAS X collision-avoidance system, which optimizes POMDP policies offline and pairs them with formal verification, is an early aviation-specific deployment of learned decision-making [^14].
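Most Lagrangian-style methods share the same dual-ascent core: after each batch of episodes, the multiplier rises while measured cost exceeds the budget and decays toward zero otherwise. A sketch (learning rate and cost limit are placeholders):

```python
def lagrangian_step(lmbda, avg_episode_cost, cost_limit, lr=0.01):
    """One dual-ascent update of the Lagrange multiplier in a CMDP.

    The policy is trained on the penalized reward r - lmbda * c, so a growing
    lmbda steers it toward constraint satisfaction; the projection at 0 keeps
    the multiplier non-negative."""
    return max(0.0, lmbda + lr * (avg_episode_cost - cost_limit))
```

SafeDreamer applies the same idea inside world-model planning, penalizing predicted rather than observed cost, which is how it drives violations toward zero during training.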
(4) Industrial Application Landscape
| Industry | RL Application | RL Type | Maturity | Why It Works / Challenges |
|---|---|---|---|---|
| Gaming | AlphaStar (StarCraft II), OpenAI Five (Dota 2) | MARL, Self-play, PPO | Production | Simulated env = unlimited data; no safety constraints [^99] |
| Chip Design | AlphaChip (Google TPU floorplanning) | Policy gradient + GNN | Production | Clear reward (PPA metrics); graph-structured inputs [^109] |
| LLM Alignment | ChatGPT (RLHF), DeepSeek-R1 (GRPO) | PPO, DPO, GRPO | Production | Verifiable rewards (math, code); massive compute available [^52] |
| Robotics | AgiBot (manufacturing), sim-to-real manipulation | PPO, SAC, Real-world RL | Pilot Production | Fast sim training; still limited task generality [^60] |
| Recommendation | Netflix personalization, ad placement | DQN, Contextual bandits | Production | Abundant online feedback; A/B testing infrastructure [75][78] |
| Energy/Grid | Microgrid management, L2RPN competitions | PPO, TD3, DQN | Prototype → Production | Well-defined physics; PPO outperforms MILP/PSO for real-time control [79][85] |
| Finance | Market making, algorithmic trading | DQN, DDPG, PPO | Prototype | High noise, non-stationarity; implementation quality > algorithm choice [^92] |
| Supply Chain | Amazon/DHL warehouse ops, inventory management | DQN, MARL | Prototype → Production | Combinatorial action spaces; RL adapts to demand variability [33][30] |
| Autonomous Driving | Wayve, Waymo (decision-making modules) | PPO, SAC, Offline RL | Pilot Production | Massive sim training (CARLA/nuPlan); safety-critical certification barrier [^4] |
| Air Traffic Mgmt | Conflict resolution, departure metering, slot scheduling | DQN, PPO, MARL | Academic / Prototype | Safety certification gap; limited sim fidelity; regulatory barriers [14][2] |
Key insight: RL succeeds in production when (a) a high-fidelity simulator or abundant online feedback exists, (b) reward signals are clear and dense, and (c) safety constraints are manageable or the domain tolerates exploration errors. Aviation fails on all three counts for online RL, making offline RL and world model-based approaches the most viable path forward.
(5) RL in Air Logistics / Air Traffic: Deep Dive
Task-by-Task Analysis
5.1 Airport Ground Taxiing & Departure Metering
Representative Works:
- Tran, Pham & Alam (2023): DRL-based autonomous taxiing agent using PPO at Singapore Changi. State includes planning features (distance to target, expected time of arrival) and surrounding traffic; action is continuous acceleration; reward balances fuel burn, taxi time, and delay. Result: 97.8% on-time performance within a [-20, +5] s window and 29.5% fuel reduction [16][19].
- Pham et al. (IEEE TITS, 2022): Deep RL for airport departure metering under spatial-temporal airside interactions at Singapore Changi. Used A-SMGCS data. Result: 22% fuel reduction, contained taxiway congestion without significant throughput loss [^11].
- MARL for Autonomous Taxiing (2025): Multi-agent approach optimizing taxi time, fuel consumption, and emissions simultaneously [^27].
Environment: Custom simulation built on airport taxiway graph networks with aircraft performance models. Simulator: typically custom-built (not BlueSky), using real A-SMGCS/ADS-B data for scenario generation. Multi-agent: inherently, though most work treats the problem as single-agent with surrounding traffic folded into the environment; MARL formulations are emerging.
Maturity: Academic → Early Prototype
5.2 Runway & Slot Allocation
Representative Works:
- Nguyen-Duy & Pham (NTU): RL for collaborative multi-airport slot re-allocation under the A-CDM framework. Tested on a Hong Kong–Singapore–Bangkok hub network using OAG 2018 data. The RL agent significantly outperformed the nearest-slot heuristic under heavily reduced capacity (total delay: 84 vs. 107) [^118].
- RL for Strategic Airport Slot Scheduling (IEEE CAI, 2024): Formulated as an MDP and solved with DQN and PPO. Key finding: adding positive intermediate reward signals is what enables convergence. DQN outperformed PPO, with an average displacement of 1.44/1.99 slots per request on medium-/high-density instances [123][120].
MDP Design: State = current slot assignment status + demand profile; Action = assign/defer/displace request; Reward = negative delay + penalty for unaccommodated requests.
Maturity: Academic
5.3 Arrival/Departure Sequencing & Flow Management
Representative Works:
- DRL for Ground Delay Programs (UC Berkeley, 2024): Behavioral Cloning and Conservative Q-Learning (CQL) for optimizing Ground Delay Program rates (the controlled airport acceptance rate) at Newark (EWR). Trained on full-year 2019 data. The SAGDPENV simulation environment uses queuing diagrams driven by real operational data [5][22].
- Air Traffic Flow Management is surveyed extensively in the RL-in-Aviation survey, with multiple approaches using DQN and PPO for demand-capacity balancing [^14].
Key insight: Offline RL (CQL) is particularly well-suited here, as GDP decisions can be learned from historical operational data without online exploration.
Maturity: Academic → Prototype
5.4 Conflict Detection & Resolution
Representative Works:
- Chen et al. (Transportation Research Part C, 2023): General MARL with an adaptive manoeuvre strategy using Rainbow DQN. Agents output flight intentions (e.g., increase speed, turn left) rather than specific manoeuvre parameters; partial observations are based on imminent-threat detection sectors. Generalization validated across arbitrary scenarios [23][28].
- Self-Prioritizing MARL (TU Delft, 2025): A novel learning-based priority mechanism minimizes the number of ATC directives issued while resolving conflicts, over a continuous action space. Well-suited to centralized ATC where instruction bandwidth is limited [^20].
- Deniz & Wang (AIAA, 2024): MARL for UAM conflict resolution using A3C with centralized learning, decentralized execution. Resolves nearly all conflicts in high-density eVTOL scenarios [^17].
- DRL Conflict Resolution (Aviation journal, 2023): DQN agent using altitude/speed/heading adjustments modeled as MDP. Reward function incorporates ATC regulations [^25].
Simulators: Most works use custom environments, with BlueSky increasingly adopted; BlueSky-Gym provides standardized conflict resolution environments [32][35].
Maturity: Academic (extensive literature)
5.5 Ground Service Resource Scheduling
Representative Works:
- DRL for Multi-Objective Airport Ground Handling (TU Delft, 2025): First comprehensive RL formulation tackling the complete airport ground handling scheduling problem including tow tractors, baggage vehicles, refueling, catering, and cleaning—rather than simplified sub-problems [^95].
- ULD Build-Up Scheduling (ZIB Berlin, 2021): Multi-commodity network design model for cargo terminal ULD scheduling. While not RL-based, provides the problem formulation (workstation assignment, batching, break-down/build-up sequencing) that is highly amenable to RL treatment [^90].
- RL for Airline Maintenance Scheduling (2023): Adaptive RL for maintenance task scheduling with rescheduling capability. Demonstrated efficacy with ground time and time slack KPIs [^98].
Gap: This is the least explored area for RL in aviation, despite being a rich combinatorial optimization problem with clear operational value. Cargo terminal operations (ULD scheduling, equipment dispatching) and ground handling resource allocation represent major research opportunities.
Maturity: Early Academic / Unexplored
5.6 EU-Funded Projects & Institutional Resources
- HYPERSOLVER (SESAR JU, 2023–2025): First project to holistically merge ATFM and ATC using deep RL. Partners include ENAC and Eurocontrol. Uses continuous reassessment and dynamic updates across the full conflict management timeline [48][51].
- SESAR Exploratory Research: Includes projects on improved airport surface movements using surveillance technologies [^45].
- Eurocontrol A-CDM: Provides the operational framework (TSAT, TOBT, TTOT) and data-sharing protocols that any RL system for airport operations must integrate with [114][108].
- Eurocontrol DDR2 Dataset: Demand Data Repository with flight plans, sector configurations, and traffic counts [^14].
5.7 ADS-B Data & Simulation Resources
| Resource | Type | Coverage | Aviation RL Relevance |
|---|---|---|---|
| TartanAviation | Open dataset | 661 days ADS-B + 3.1M images + 3374h ATC audio, 2 US airports | High — multi-modal, trajectory + communications [^65] |
| ADSBexchange | Historical data service | Global, includes runway-level operations | High — takeoff/landing detection with runway ID [^68] |
| Eurocontrol DDR2 | Flight plan data | European airspace | High — demand/capacity for ATFM [^14] |
| OpenSky Network | Open ADS-B archive | Global, academic access | High — large-scale trajectory data |
| BlueSky Simulator | Open-source ATC sim | Configurable, Gymnasium API via BlueSky-Gym | Very High — standardized RL environments [^32] |
| AirSim | Drone simulation | UAV focused | Medium — useful for UAM scenarios [^38] |
| JSBSim | Flight dynamics | Fixed/rotary wing | Medium — flight control, not ground ops [^14] |
| SUMO | Traffic simulation | Road network, adaptable | Medium — can model ground vehicle movements |
| AnyLogic | Discrete-event + agent-based | General-purpose | Medium — good for cargo terminal simulation |
(6) Research Topics for Your Portfolio
Topic 1: Offline RL for Airport Taxi-Out Time Optimization
Problem Definition: Given an aircraft's pushback request, assigned runway, and current airport surface state, learn an optimal gate-hold / departure metering policy that minimizes total taxi-out time and fuel burn across all departures, using only historical operational logs.
Business Goal: Reduce taxi fuel burn by 15–25% and total surface delays by 10–20% without building a full online simulation.
RL Formalization: Episodic MDP where each episode is a departure bank. State = (aircraft queue at each taxiway segment, active runway config, weather category, time-of-day). Action = hold at gate for Δt ∈ {0, 1, …, 5} min. Reward = −(taxi_time + α·fuel_burn + β·delay_penalty).
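A minimal sketch of the transition records an offline learner (CQL/IQL/Decision Transformer) would consume for this MDP; the feature layout and the weights α, β are hypothetical, and exactly the quantities the Experiment Design below marks for ablation:

```python
from dataclasses import dataclass

ALPHA, BETA = 0.1, 0.5   # reward weights, hypothetical values subject to ablation

@dataclass
class Transition:
    state: tuple        # (queue per taxiway segment, runway config, wx category, hour)
    action: int         # gate-hold minutes, one of {0, ..., 5}
    reward: float
    next_state: tuple
    done: bool          # end of the departure bank (episode boundary)

def taxi_reward(taxi_time_min, fuel_burn_kg, delay_min):
    """Reward = -(taxi_time + alpha*fuel_burn + beta*delay_penalty)."""
    return -(taxi_time_min + ALPHA * fuel_burn_kg + BETA * max(0.0, delay_min))
```

One such record per pushback decision, replayed from historical logs, is the entire training input; no simulator is needed until evaluation.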
Data Sources:
- ADS-B surface trajectories (ADSBexchange operations data or OpenSky) [^68]
- Airport TOBT/TSAT from A-CDM records (if accessible via Frankfurt Airport/Fraport) [^114]
- Weather data (METAR/TAF from aviation weather services)
Baseline Methods: FCFS gate release, MILP-based departure scheduling, simple heuristic metering rules.
RL Method Route: Offline RL (CQL or IQL) on historical taxi logs → evaluate with importance-weighted off-policy evaluation (OPE) → optional online fine-tuning in BlueSky simulation. Model-based alternative: train a world model (Dreamer-style) on logged transitions, then plan in imagination.
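The OPE step can start from plain weighted importance sampling over logged trajectories; an undiscounted sketch, where pi and beta are callables returning action probabilities under the candidate and behavior policies (behavior probabilities would themselves be estimated from the logs):

```python
def wis_estimate(trajectories, pi, beta):
    """Weighted importance sampling estimate of the target policy's value.

    trajectories : list of [(state, action, reward), ...] logged under the
                   behavior policy
    pi, beta     : callables giving pi(a|s) for target / behavior policy
    """
    weights, returns = [], []
    for traj in trajectories:
        w = 1.0
        for s, a, _ in traj:
            w *= pi(s, a) / beta(s, a)   # product of per-step likelihood ratios
        weights.append(w)
        returns.append(sum(r for _, _, r in traj))
    total_w = sum(weights)
    return sum(w * g for w, g in zip(weights, returns)) / total_w
```

Per-decision and doubly robust estimators reduce the variance this naive product-of-ratios form suffers on long horizons, and would be the next step in a real evaluation pipeline.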
Evaluation Metrics: Average taxi-out time (min), fuel burn (kg), throughput (departures/hour), delay distribution, safety separation violations (must be 0).
Experiment Design:
- Train/val/test split by date (e.g., 8 months train, 2 val, 2 test)
- Ablation: state features, reward weights α/β, CQL α penalty
- OOD test: evaluate on different traffic density days, different runway configs
- Comparison table: FCFS vs. MILP vs. Heuristic vs. CQL vs. IQL vs. DT
Risk & Alternatives: Data quality (ADS-B surface coverage may be incomplete at some airports). Alternative: imitation learning (BC) on well-performing operational days; MPC if a transition model can be fit.
Output: Paper (Transportation Research Part C or JAIR) + GitHub repo with offline RL pipeline + Streamlit/Grafana demo dashboard.
Topic 2: MARL for Airport Ground Resource Dispatching
Problem Definition: Coordinate the dispatching of ground handling vehicles (tow tractors, fuel trucks, baggage carts, de-icing trucks) across an airport to minimize turnaround delays, using a multi-agent framework where each vehicle type is an agent.
Business Goal: Reduce average turnaround time by 10–15% and improve resource utilization rate by 20%.
RL Formalization: Decentralized POMDP / Markov Game. Each agent (vehicle dispatcher) observes local state (own fleet positions, assigned tasks, time to next event) and selects action (assign vehicle X to flight Y). Shared reward = −(total turnaround delay + idle time penalty). Constraint: safety separation on apron.
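The shared reward above is a single scalar broadcast to every agent, which is what makes the game fully cooperative. A minimal sketch with hypothetical field names and weights:

```python
def shared_reward(schedule, deadlines, idle_vehicles, idle_weight=0.1):
    """Shared team reward for one dispatch step.

    schedule      : {flight_id: service completion time} after all agents act
    deadlines     : {flight_id: target off-block time}
    idle_vehicles : count of vehicles left unassigned this step
    Every agent receives this same scalar, so credit assignment is handled
    by the MARL algorithm (e.g. QMIX value decomposition), not the reward.
    """
    delay = sum(max(0.0, schedule[f] - deadlines[f]) for f in schedule)
    return -(delay + idle_weight * idle_vehicles)
```

Centralized training consumes this team signal while each dispatcher's policy still conditions only on its local observation at execution time.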
Data Sources:
- Airport ground event logs (TOBT, AIBT, AOBT from A-CDM) [^108]
- Simulated using AnyLogic or custom discrete-event simulation
- ADS-B for aircraft arrival/departure times; flight schedules from OAG
Baseline Methods: Rule-based dispatching (nearest-vehicle-first), MILP scheduling [^90], manual scheduling.
RL Method Route: MAPPO or QMIX (centralized training, decentralized execution) in simulation → offline evaluation with replayed logs. Start with simplified 2-agent version (tow tractors + baggage), scale to 4–5 agents.
Evaluation Metrics: Average turnaround time, resource utilization %, number of delayed departures, CO₂ from vehicle movements, computation time vs. MILP.
Experiment Design:
- Build gym environment wrapping AnyLogic or custom Python discrete-event sim
- Compare: rule-based → single-agent RL → MAPPO → QMIX
- Ablation on communication mechanism (no comm, shared observations, attention-based message passing)
- Robustness: test with ±20% demand variation, equipment failures
Risk & Alternatives: Simulation fidelity; data availability for ground handling events. Alternative: Hierarchical RL (high-level task assignment + low-level routing). Backup: constraint programming + learned heuristic.
Output: Paper (AAAI / AAMAS / Transportation Science) + GitHub (env + training + evaluation) + animated demo.
Topic 3: World Model-Based RL for Airport Surface Movement Prediction & Control
Problem Definition: Learn a latent-space world model of airport surface dynamics from ADS-B trajectory data, then use it for (a) predicting taxi times and surface congestion, and (b) planning optimal taxi routes / speeds via imagined rollouts.
Business Goal: Enable predictive surface management — anticipate congestion 10–15 min ahead, provide recommended taxi speeds to reduce stop-and-go.
RL Formalization: POMDP with image-like state (airport surface heatmap from ADS-B + runway status + weather). World model learns transition dynamics in latent space (DreamerV3 architecture). Actor-critic trained entirely in imagination.
Data Sources:
- ADS-B surface position reports (1s resolution from TartanAviation or ADSBexchange) [65][68]
- Airport layout graph (OpenStreetMap / Eurocontrol)
- Weather (METAR)
Baseline Methods: Rule-based taxi control, data-driven taxi time prediction (XGBoost/LSTM), MPC with simplified kinematic model.
RL Method Route: DreamerV3 adaptation — encode airport surface state as spatial grid (like image observations), learn RSSM world model, train actor-critic in dreamed trajectories. Can incorporate safety constraints via SafeDreamer's Lagrangian approach [64][125].
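The image-like observation can be produced by rasterizing a window of ADS-B position reports onto a fixed grid over the airport surface; a minimal sketch (grid size, normalization, and bounding-box handling are all illustrative choices):

```python
import numpy as np

def surface_heatmap(positions, bbox, grid=(64, 64)):
    """Rasterize ADS-B surface position reports into an occupancy grid.

    positions : list of (lat, lon) reports within one time window
    bbox      : (lat_min, lat_max, lon_min, lon_max) of the airport surface
    Returns an (H, W) float array usable as an image-like world-model input.
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    h, w = grid
    heat = np.zeros(grid, dtype=np.float32)
    for lat, lon in positions:
        i = int((lat - lat_min) / (lat_max - lat_min) * (h - 1))
        j = int((lon - lon_min) / (lon_max - lon_min) * (w - 1))
        if 0 <= i < h and 0 <= j < w:
            heat[i, j] += 1.0               # count reports per cell
    return heat / max(1.0, float(heat.max()))   # normalize to [0, 1]
```

Stacking several consecutive windows as channels gives the world model short-term motion cues, analogous to frame stacking in visual RL.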
Evaluation Metrics: Taxi time prediction MAE, congestion prediction accuracy (precision/recall for hotspot detection), fuel savings, safety margin maintenance, planning horizon accuracy.
Experiment Design:
- Phase 1: Train world model purely on ADS-B logs (unsupervised)
- Phase 2: Fine-tune with reward signal (taxi time + fuel objective)
- Phase 3: Compare imagined rollouts vs. actual outcomes
- Ablation: latent dimension, prediction horizon, with/without weather encoding
- OOD: Test on different airports / runway configurations
Risk & Alternatives: ADS-B surface resolution may be insufficient for precise taxiway segment modeling. Alternative: encode state as graph (GNN) rather than image. Fallback: use model for prediction only (supervised) without RL control.
Output: Paper (NeurIPS workshop / ICML workshop / IEEE TITS) + open-source world model code + interactive prediction dashboard.
Topic 4: Offline RL for Air Cargo Terminal ULD Scheduling
Problem Definition: Optimize the scheduling of ULD (Unit Load Device) build-up and break-down at an air cargo hub, assigning ULDs to workstations and time slots while respecting cargo availability, flight connections, and workforce constraints.
Business Goal: Reduce average cargo connection time by 15–20%, increase workstation utilization, reduce labor overtime.
RL Formalization: Finite-horizon MDP. State = (current ULD queue, workstation status, inbound cargo availability timeline, outbound flight schedule). Action = (assign ULD X to workstation W at time T, or defer). Reward = −(connection_delay + overtime_cost) + throughput_bonus.
Data Sources:
- Cargo terminal event logs (if available from Fraport/Lufthansa Cargo)
- Simulated scenarios based on published cargo hub models [^90]
- Flight schedules + cargo booking data (synthesized from OAG)
Baseline Methods: FCFS scheduling, priority-based heuristics, MILP formulation from ZIB Berlin [^90].
RL Method Route: Offline RL (CQL/IQL) if historical scheduling logs exist. Otherwise, train online in AnyLogic/SimPy-based cargo terminal simulator, then evaluate with replay. Decision Transformer could work well given the sequential nature of assignment decisions.
Evaluation Metrics: Average ULD dwell time, connection rate (% ULDs making outbound flight), workstation utilization, labor cost, scalability to peak season volumes.
Experiment Design:
- Build discrete-event simulation of cargo hub (break-down → storage → build-up → delivery)
- Train: DQN → CQL → DT, compare with MILP (small instances) and heuristics (all instances)
- Ablation on state representation (raw features vs. learned embeddings)
- Scale test: 50 → 200 → 500 ULDs/day
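The first rung of the comparison ladder above, an FCFS baseline inside a minimal discrete-event core, fits in a few lines of stdlib Python (the service time and the FCFS rule are illustrative; the RL scheduler would replace the assignment rule):

```python
def simulate_buildup(ulds, n_workstations, service_time=45.0):
    """Minimal discrete-event simulation of ULD build-up.

    ulds : list of (uld_id, ready_time, flight_cutoff), all times in minutes.
    FCFS assignment to the earliest-free workstation; returns, per ULD,
    (completion time, whether it made its outbound flight cutoff).
    """
    free_at = [0.0] * n_workstations          # earliest time each station frees up
    results = {}
    for uld_id, ready, cutoff in sorted(ulds, key=lambda u: u[1]):
        k = min(range(n_workstations), key=lambda i: free_at[i])
        start = max(ready, free_at[k])        # wait for cargo and for the station
        done = start + service_time
        free_at[k] = done
        results[uld_id] = (done, done <= cutoff)
    return results
```

An RL agent would replace both the FCFS ordering and the station choice, observing the queue and cutoffs and being rewarded per the formalization above; the event loop itself stays unchanged.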
Risk & Alternatives: Real cargo data is commercially sensitive and hard to obtain. Mitigation: synthetic data generation based on published operational parameters. Alternative: imitation learning from expert scheduler logs; constraint programming for exact solutions on small instances.
Output: Paper (Transportation Research Part E / Computers & Operations Research) + GitHub (simulator + RL training) + Grafana monitoring dashboard.
Topic 5: RL-Enhanced LLM Agent for Flight Delay Prediction & Proactive Rescheduling
Problem Definition: Build an LLM-based agent that combines natural language understanding of NOTAM/weather briefings with RL-based planning to predict flight delays and recommend proactive rescheduling actions (gate changes, stand swaps, slot trades).
Business Goal: Reduce reactionary delays by 20–30% through earlier, smarter intervention.
RL Formalization: Agent observes structured + unstructured data (flight schedule, ADS-B feeds, METAR/TAF text, NOTAMs). LLM parses unstructured inputs into structured features. RL policy (trained with RLTR-style framework) selects actions from a discrete set (hold, swap gate, request slot trade, alert ops). Reward from actual vs. predicted delay reduction [^77].
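The planner's action-selection head can be sketched generically; RLTR's internals are not reproduced here, and the action names, linear scoring, and feature vector are illustrative stand-ins for whatever the LLM extraction step emits:

```python
import math

ACTIONS = ["hold", "swap_gate", "request_slot_trade", "alert_ops"]

def intervention_probs(features, weights, temperature=1.0):
    """Softmax policy over discrete interventions, conditioned on structured
    features the LLM has already extracted (e.g. predicted delay minutes,
    weather severity). An RL planner would adjust `weights` against the
    observed delay-reduction reward."""
    scores = [sum(w * f for w, f in zip(weights[a], features)) / temperature
              for a in ACTIONS]
    m = max(scores)                          # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Keeping the action set discrete and small is what makes the hallucination risk manageable: the LLM only parses inputs, while the trained policy picks among vetted operational moves.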
Data Sources:
- ADS-B trajectory data [^65]
- Historical METAR/TAF, NOTAM text
- Flight schedule + actual performance data (BTS for US; Eurocontrol for EU)
- A-CDM event data [^114]
Baseline Methods: Rule-based delay prediction (EUROCONTROL CODA), ML regression (XGBoost/LSTM), LLM-only zero-shot prediction.
RL Method Route: RL for LLM agent planning (RLTR framework) — train planner with tool-use rewards for accurate delay prediction and appropriate intervention selection [^80]. Offline RL warm-start from historical operations data.
Evaluation Metrics: Delay prediction MAE/RMSE, intervention success rate, reactionary delay reduction, false alarm rate, computational latency.
Experiment Design:
- Phase 1: Build delay prediction baseline (ML + LLM)
- Phase 2: Add RL planner for intervention selection
- Phase 3: Evaluate end-to-end on historical disruption days
- Compare: LLM-only → ML+Rules → RL Agent → RL+LLM Agent
Risk & Alternatives: LLM hallucination on operational decisions; reward shaping for rare events. Alternative: pure offline RL without LLM component; hybrid LLM + constraint optimization.
Output: Paper (KDD / AAAI Applied AI) + GitHub (agent framework) + Streamlit interactive demo.
(7) Future Trend Assessment (12–24 Months)
RL + LLM/Agent Impact on Scheduling & Logistics
The convergence of RL with LLM agents will reshape operational optimization in three ways:
- Natural language interfaces for operations: LLM agents with RL-trained planners will enable controllers and dispatchers to interact with optimization systems using natural language, lowering adoption barriers. The RLTR framework (EMNLP 2025) showing 8–12% planning improvement is an early signal [^77].
- Reasoning-augmented optimization: DeepSeek-R1's demonstration that pure RL can induce chain-of-thought reasoning opens the door to RL agents that can explain their scheduling decisions—a critical requirement for aviation certification [52][55].
- Multi-modal situation awareness: LLM agents processing NOTAM text, weather briefings, and ADS-B data simultaneously will enable more holistic delay prediction and response planning than current siloed systems.
Assessment: High impact within 24 months for decision support tools; 3–5 years for autonomous operational deployment.
Will Offline RL, World Models, and Constrained Optimization Become Mainstream?
Offline RL: Yes, for aviation and other safety-critical domains. The ability to learn from historical logs without online exploration directly addresses the primary barrier to RL adoption in aviation. CQL/IQL + OPE (off-policy evaluation) provide a complete workflow [5][47][^119].
World Models: Likely to become standard. DreamerV3's success with fixed hyperparameters across 150+ tasks (Nature, 2025) suggests readiness for domain-specific applications. SafeDreamer's constraint integration (ICLR 2024) adds the safety layer aviation needs [125][64].
Constrained Optimization Integration: Essential, not optional. Pure RL without hard constraints will never be accepted in aviation. Expect hybrid approaches: RL for exploration/improvement + MPC/MILP for constraint enforcement and safety certification.
Most Likely Near-Term Aviation Deployments
| Application | Likelihood (12–24 months) | Rationale |
|---|---|---|
| Decision support for GDP/ATFM | High | Offline RL on historical data; advisory (non-autonomous); HYPERSOLVER project advancing [^48] |
| Taxi-out time prediction & advisory | High | Predictive (not control); abundant ADS-B data; clear business case (fuel savings) |
| Airport slot scheduling | Medium-High | NP-hard problem; RL shown to generalize to unseen densities [^123] |
| Conflict resolution advisory | Medium | Strong academic foundation; but safety certification barrier remains high [^23] |
| Autonomous surface movement control | Low | Requires human-out-of-loop approval; far from certification |
| Cargo terminal optimization | Medium | Less safety-critical than ATC; clear ROI; but has received comparatively little research attention |
(8) References (Grouped by Topic)
RL Surveys & Algorithms
- Ghasemi et al. "A Comprehensive Survey of Reinforcement Learning" (arXiv, 2024) [^7]
- RL Renaissance Report (emerge.haus, 2025) [^1]
- State of RL in 2025 (datarootlabs, 2025) [^4]
RL for LLMs
- DeepSeek-R1: Incentivizing Reasoning via RL (arXiv, Jan 2025) [52][55]
- Wang et al. "RL Enhanced LLMs: A Survey" (arXiv, 2024) [^10]
- Li et al. "RL for LLM Agent Planning (RLTR)" (EMNLP 2025) [^77]
- RLHF vs DPO vs GRPO comparison [^104]
Offline RL & World Models
- DreamerV3 (Nature, 2025) [125][122]
- SafeDreamer (ICLR 2024) [64][70]
- EMERALD (ICML 2025) [^67]
- CoWorld (NeurIPS 2024) [^9]
- When to prefer DT for Offline RL (ICLR 2024) [^47]
- ADEPT: Offline RL with Diffusion World Models (ICLR 2025) [^6]
- Offline RL vs Imitation Learning (BAIR Blog, 2022) [^124]
Safe RL
- Kushwaha et al. "Survey of Safe RL and Constrained RL" (arXiv, 2025) [21][24]
- Safe RL overview (emergentmind, 2025) [^18]
RL in Aviation (Surveys)
- Razzaghi et al. "Survey on RL in Aviation Applications" (arXiv, 2022) [^14]
- ML in ATM Systematic Literature Review (SSRN, 2025) [^2]
Airport Taxiing & Departure Metering
- Tran, Pham & Alam. "Greener Airport Surface Operations: RL for Autonomous Taxiing" (JJSASS, 2023) [16][19]
- Pham et al. "DRL for Airport Departure Metering" (IEEE TITS, 2022) [^11]
- MARL for Autonomous Aircraft Taxiing (ScienceDirect, 2025) [^27]
Runway & Slot Scheduling
- Nguyen-Duy & Pham. "RL for Multi-Airport Slot Re-Allocation" (IWAC, 2022) [^118]
- "RL for Strategic Airport Slot Scheduling" (IEEE CAI, 2024) [123][120]
Ground Delay Programs & Flow Management
- Liu et al. "DRL for Real-Time GDP Revision" (arXiv, 2024) [5][22]
Conflict Resolution
- Chen et al. "General MARL for Multi-Aircraft Conflict Resolution" (TRC, 2023) [23][28]
- "Self-Prioritizing MARL for Conflict Resolution" (TU Delft, 2025) [^20]
- Deniz & Wang. "MARL for UAM Conflict Resolution" (AIAA, 2024) [^17]
- "DRL Conflict Resolution Strategy for ATM" (Aviation, 2023) [^25]
Ground Operations & Cargo
- "DRL for Multi-Objective Airport Ground Handling" (TU Delft, 2025) [^95]
- Euler et al. "ULD Build-Up Scheduling" (ZIB, 2021) [^90]
- Silva et al. "Adaptive RL for Aircraft Maintenance Task Scheduling" (Sci Rep, 2023) [^98]
Simulators & Benchmarks
- Groot et al. "BlueSky-Gym: RL Environments for ATC" (SESAR Innovation Days, 2024) [32][35]
- AAM-Gym (Brittain, 2022) [^44]
- Brittain et al. RL for multi-agent collision avoidance (cited in survey) [^8]
EU/Institutional Projects
- HYPERSOLVER (SESAR JU / CORDIS) [48][51]
- Eurocontrol A-CDM [114][108]
- SESAR Exploratory Research [^45]
ADS-B Data
- TartanAviation Dataset (CMU, 2024) [^65]
- ADSBexchange Historical Data [^68]
- ADS-B Ground Bit Analysis (2026) [^71]
Production RL
- AlphaChip / Google (Nature, 2021; blog 2024) [109][103]
- AgiBot Real-World RL (2025) [^60]
- RL in Production (RunPod, 2025) [^75]
- AlphaStar (DeepMind, 2019) [96][99]
RL Frameworks
- SB3 vs RLlib vs CleanRL comparison [31][34]
- Framework feature comparison [^40]
Next Steps: 4-Week Execution Plan
Goal: Complete an RL research prototype (Topic 1 or Topic 4) suitable for paper submission and portfolio showcase.
Week 1: Foundation & Data Pipeline
- [ ] Day 1–2: Select primary topic (recommend Topic 1: Offline RL for Taxi-Out Time). Set up GitHub repo with project structure: `env/`, `data/`, `models/`, `evaluation/`, `notebooks/`, `docs/`.
- [ ] Day 2–3: Download and preprocess ADS-B data from ADSBexchange (focus on one major airport, e.g., EDDF Frankfurt or KEWR Newark). Extract taxi-out trajectories using ground-bit transitions.
- [ ] Day 3–4: Build data pipeline in PySpark/Pandas: raw ADS-B → cleaned trajectories → episode-level features (gate → runway taxi time, queue length, weather, time-of-day). Log everything in MLflow.
- [ ] Day 5–7: Implement FCFS baseline and simple heuristic metering. Compute baseline metrics. Write data exploration notebook.
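The core of the Day 2–3 step, extracting taxi-out duration from ground-bit transitions, can be sketched as below. The record layout (time-sorted `(timestamp, on_ground)` samples per flight) is an assumption about an already-cleaned intermediate format, not the raw ADSBexchange schema.

```python
# Illustrative extraction of one taxi-out episode from ADS-B samples using
# the ground bit: taxi-out runs from the first on-ground sample until the
# ground bit flips to airborne (wheels up).

def taxi_out_seconds(points):
    """points: time-sorted list of (unix_ts, on_ground) samples for one flight.
    Returns the taxi-out duration in seconds, or None if no ground->air
    transition is observed in the track."""
    start = None
    for (ts, on_ground), (_, next_on_ground) in zip(points, points[1:]):
        if on_ground and start is None:
            start = ts                      # first on-ground sample: taxi begins
        if on_ground and not next_on_ground and start is not None:
            return ts - start               # ground bit flips: wheels up
    return None

# Toy track: 10 min on the ground, then airborne.
track = [(0, True), (60, True), (600, True), (660, False), (720, False)]
print(taxi_out_seconds(track))  # 600
```

In practice this per-flight function would sit inside a PySpark `groupBy(flight_id)` aggregation, with outlier filtering for tracks whose ground bit chatters (the behavior studied in the ground-bit analysis cited above [^71]).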
Week 2: Environment & Offline RL Training
- [ ] Day 8–9: Construct offline dataset in D4RL-compatible format: (s, a, r, s', done) tuples from historical taxi logs. Define state space, action space, and reward function.
- [ ] Day 10–11: Implement CQL and IQL using the d3rlpy or CORL library. Train on offline dataset. Track experiments with MLflow.
- [ ] Day 12–13: Implement Decision Transformer baseline. Compare learning curves.
- [ ] Day 14: Set up off-policy evaluation (OPE) metrics: FQE (Fitted Q-Evaluation), importance sampling. Evaluate all trained policies.
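Two of the Week-2 steps above, flattening logged episodes into D4RL-style tuples and scoring a candidate policy with importance-sampling OPE, can be sketched in plain Python. The episodes, policies, and probabilities below are toy assumptions for illustration; real runs would use the d3rlpy dataset classes and FQE.

```python
# (1) Flatten logged episodes into D4RL-style (s, a, r, s', done) tuples.
# (2) Estimate a new policy's value from the same logs with ordinary
#     importance sampling (a basic OPE estimator).

def to_transitions(episode):
    """episode: list of (state, action, reward). Returns (s, a, r, s', done)."""
    out = []
    for i, (s, a, r) in enumerate(episode):
        done = i == len(episode) - 1
        s_next = s if done else episode[i + 1][0]
        out.append((s, a, r, s_next, done))
    return out

def is_estimate(episodes, pi_target, pi_behavior):
    """Ordinary importance-sampling estimate of the target policy's return.
    pi_*(a, s) -> probability of taking action a in state s."""
    total = 0.0
    for ep in episodes:
        weight, ret = 1.0, 0.0
        for s, a, r in ep:
            weight *= pi_target(a, s) / pi_behavior(a, s)
            ret += r
        total += weight * ret
    return total / len(episodes)

# Toy logs: two episodes of (state, action, reward) triples.
logs = [[(0, 1, 1.0), (1, 1, 1.0)], [(0, 0, 0.0), (1, 0, 0.0)]]
uniform = lambda a, s: 0.5                    # behavior policy: 50/50
greedy = lambda a, s: 1.0 if a == 1 else 0.0  # candidate: always a=1

print(len(to_transitions(logs[0])))     # 2
print(is_estimate(logs, greedy, uniform))
```

The high variance of plain importance sampling is exactly why the plan also lists FQE: IS weights compound per step, so per-decision or doubly-robust variants are usually needed on long taxi episodes.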
Week 3: BlueSky Simulation & Evaluation
- [ ] Day 15–16: Install BlueSky-Gym. Create custom environment matching your target airport's taxiway layout. Validate simulation against real ADS-B data distributions.
- [ ] Day 17–18: Deploy trained offline RL policies in BlueSky environment. Compare: Offline-trained → Online-evaluated vs. baselines. Run ablation studies (state features, reward weights).
- [ ] Day 19–20: OOD robustness testing: different traffic densities, weather conditions, runway configurations. Generate comparison tables and figures.
- [ ] Day 21: Prepare visualization: animated taxi simulations, Grafana dashboard for live metrics.
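The Day 15–16 validation step, checking that simulated taxi times match the real ADS-B distribution, can be done with a two-sample Kolmogorov–Smirnov statistic. The samples below are made up; in practice the inputs would be BlueSky rollouts and the extracted ADS-B episodes, and `scipy.stats.ks_2samp` would replace this hand-rolled version.

```python
# Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
# empirical CDFs of simulated and observed taxi-out times. A small value
# indicates the simulator reproduces the real distribution.

def ks_statistic(xs, ys):
    """Max absolute difference between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    def cdf(sample, v):
        return sum(1 for x in sample if x <= v) / len(sample)
    return max(abs(cdf(xs, v) - cdf(ys, v)) for v in grid)

real = [540, 600, 620, 700, 730, 810]     # observed taxi-out times (s), toy
sim_ok = [550, 605, 615, 695, 740, 800]   # well-calibrated simulator, toy
sim_bad = [200, 220, 250, 260, 280, 300]  # badly calibrated simulator, toy

print(ks_statistic(real, sim_ok) < ks_statistic(real, sim_bad))  # True
```

A threshold on this statistic (or a formal KS test p-value) gives a concrete acceptance criterion for the custom BlueSky environment before any policy comparison is run on it.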
Week 4: Paper Writing & Portfolio Polish
- [ ] Day 22–23: Write paper draft (target: Transportation Research Part C or IEEE ITSC). Structure: Introduction → Related Work → Problem Formulation → Method → Experiments → Results → Discussion → Conclusion.
- [ ] Day 24–25: Polish GitHub repository: README with architecture diagram, installation instructions, reproducibility guide, model checkpoints in MLflow registry. Create Streamlit/Gradio demo.
- [ ] Day 26–27: Finalize paper: add all figures/tables, write abstract, proofread. Prepare ArXiv preprint.
- [ ] Day 28: Submit to ArXiv. Update portfolio website. Share on LinkedIn with technical summary. Plan next iteration (add MARL or world model component for follow-up paper).
Tools & Stack (Aligned with Your Existing Setup)
| Component | Tool |
|---|---|
| Data processing | PySpark on Databricks / local Spark |
| RL training | d3rlpy (offline RL), SB3 (online), RLlib (MARL) |
| Simulation | BlueSky-Gym + custom env |
| Experiment tracking | MLflow (already in your stack) |
| Visualization | Grafana + Plotly + Streamlit |
| Infrastructure | Docker, MinIO for data storage |
| Version control | GitHub with CI/CD |