Reinforcement Learning Technology Development & Air Traffic/Logistics Applications: A Research Report
February 2026 | Focus period: 2018–present (emphasis on 2021–2025)
(1) Executive Summary
- RL is experiencing a "renaissance" driven by convergence with generative AI and LLMs. RLHF, DPO, and GRPO have become standard for LLM alignment, and pure RL can now incentivize emergent reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1 [52][55].
- World models are maturing rapidly. DreamerV3 (Nature, 2025) outperforms specialized methods across 150+ tasks with fixed hyperparameters, becoming the first algorithm to collect diamonds in Minecraft from scratch [125][122].
- Offline RL is production-ready for domains with logged data. Conservative Q-Learning (CQL), Decision Transformer, and Implicit Q-Learning enable policy learning from historical datasets without online exploration—ideal for aviation [47][5].
- Safe RL is formalized. Constrained MDPs (CMDPs), safety filters, and shielding methods are well-established theoretically, with SafeDreamer achieving near-zero constraint violations at ICLR 2024 [64][21].
- Production-grade RL deployments exist in chip design (AlphaChip for Google TPUs) [^109], robotics (AgiBot's real-world RL in manufacturing) [^60], recommendation systems (Netflix, Uber, Google), and energy management [^75].
- In aviation, RL has been applied to conflict resolution, departure metering (22% fuel reduction at Singapore Changi) [^11], autonomous taxiing (29.5% fuel reduction) [^16], ground delay programs, and slot scheduling [^123]—but nearly all remain at academic or prototype stage.
- BlueSky-Gym (SESAR Innovation Days, 2024) provides the first standardized RL benchmark environments for ATC/ATM built on the Gymnasium API [32][35].
- HYPERSOLVER (EU Horizon/SESAR JU) is the first project to holistically merge ATFM and ATC using deep RL, currently in its second year [48][51].
- ADS-B data is increasingly accessible for research—TartanAviation provides 661 days of multi-modal ADS-B data [^65], and ADSBexchange offers historical operations data including runway-level takeoffs/landings [^68].
- Multi-Agent RL (MARL) is the natural paradigm for air traffic operations, with successful applications in conflict resolution using Rainbow DQN, A3C, and MADDPG [23][17][^20].
- The biggest bottleneck for RL in aviation is not algorithms but rather environment fidelity, safety certification, regulatory acceptance, and the sim-to-real gap [14][2].
- Offline RL and world models represent the most viable path for near-term aviation applications, as they can leverage historical operational data without requiring online exploration in safety-critical environments [119][47].
- RL + LLM Agent architectures are emerging for logistics planning (RLTR framework, EMNLP 2025), achieving 8–12% improvement in planning performance [77][80].
- Airport ground handling scheduling remains an under-explored but high-value target for RL research, with a recent TU Delft thesis (2025) being among the first to tackle the complete problem [^95].
- For the user's profile (AI + Logistics PhD, ADS-B/airport ground data, Databricks/MLflow stack), the optimal research direction combines offline RL or model-based RL with airport ground operations, using the BlueSky simulator and real ADS-B data for validation.
(2) RL Technology Map (2018–Present)
Core Algorithm Families
| Family | Key Algorithms | Strengths | Limitations |
|---|---|---|---|
| Value-based | DQN, Double DQN, Dueling DQN, Rainbow DQN, QR-DQN | Discrete action spaces, stable with experience replay | Poor scalability to continuous/high-dim actions |
| Policy-based | REINFORCE, TRPO, PPO | Continuous actions, direct optimization | High variance, sample inefficiency |
| Actor-Critic | A2C/A3C, SAC, DDPG, TD3 | Combines value estimation with policy optimization | Hyperparameter sensitivity |
| Model-based | DreamerV1–V3, MuZero, MBPO, SafeDreamer | Sample efficiency, planning, safety integration | Model error compounding |
| Offline RL | CQL, IQL, Decision Transformer, BCQ, TD3+BC | Learn from logged data, no online exploration | Distribution shift, conservative bias |
| Multi-Agent RL | MADDPG, QMIX, VDN, MAPPO, COMA | Cooperative/competitive multi-agent tasks | Non-stationarity, scalability |
| Hierarchical RL | Option-Critic, HIRO, HAM | Temporal abstraction, sub-goal planning | Reward design complexity |
[7][14]
Key Milestones (2018–2025)
- 2017: PPO published (OpenAI); AlphaZero masters chess/Go/shogi with pure self-play RL [^93]
- 2018: OpenAI Five defeats professional Dota 2 teams [^91]
- 2019: AlphaStar reaches Grandmaster in StarCraft II using multi-agent RL + imitation learning [96][99]
- 2020: MuZero plans in learned latent spaces; AlphaChip introduced for chip floorplanning [^103]
- 2021: Decision Transformer frames RL as sequence modeling; AlphaChip published in Nature and deployed for Google TPU production [^109]
- 2022: ChatGPT launches using RLHF (PPO + reward model); sim-to-real transfer advances [^91]
- 2023: DreamerV3 released (masters Minecraft from scratch); SafeDreamer achieves near-zero-cost safe RL [^64]; L2RPN competitions accelerate RL for power grids [^85]
- 2024: OpenAI o1 (test-time compute scaling via RL); DPO widely adopted as simpler RLHF alternative; BlueSky-Gym launched for aviation RL benchmarking [^32]
- 2025: DeepSeek-R1 demonstrates pure RL can incentivize LLM reasoning via GRPO without SFT [52][55]; DreamerV3 published in Nature [^125]; AgiBot deploys real-world RL in manufacturing [^60]; EMERALD world model surpasses human experts on Crafter [^67]
RL for LLMs: RLHF → DPO → GRPO
| Method | Process | Reward Model | Complexity | Key Use |
|---|---|---|---|---|
| RLHF | Human rankings → reward model → PPO optimization | Separate model required | High | ChatGPT, Claude |
| RLAIF | AI-generated rankings replace human annotators | AI-derived | Medium | Scaling alignment |
| DPO | Preference pairs → direct policy optimization | Implicit (no separate model) | Low | Stable, efficient alignment [^107] |
| GRPO | Group scoring → relative policy optimization | Continuous scoring function | Medium | DeepSeek-R1 reasoning [52][104] |
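The DPO row above corresponds to a one-line loss per preference pair; a minimal pure-Python sketch (beta and the log-probabilities are illustrative inputs, in practice summed token log-probs from the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the implicit reward
    margin, where the implicit reward is the beta-scaled log-ratio of the
    policy to the frozen reference model (the 'implicit reward model')."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log1p(exp(-margin))
    return math.log1p(math.exp(-margin))
```

At zero margin the loss is log 2; it falls as the policy separates chosen from rejected responses faster than the reference model does, which is why no separate reward model is needed.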
Benchmarks & Their Aviation Relevance
| Benchmark | Domain | Aviation Relevance |
|---|---|---|
| Atari / DM Control / MuJoCo | Game / Continuous control | Low – Toy domains, useful for algorithm validation only |
| Procgen / Crafter | Procedural generation / Survival | Low-Medium – Tests generalization, relevant for domain randomization |
| MetaWorld | Multi-task robotic manipulation | Medium – Relevant for ground handling equipment control |
| Safety-Gymnasium | Safe RL tasks | High – Directly relevant for safety-constrained aviation RL [^64] |
| PettingZoo / SMAC | Multi-agent coordination | High – Multi-agent coordination mirrors ATC scenarios |
| BlueSky-Gym | ATC/ATM specific | Very High – Purpose-built for aviation RL research [32][35] |
| D4RL | Offline RL benchmarks | High – Framework for offline methods applicable to logged ADS-B data [^47] |
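Because BlueSky-Gym builds on the Gymnasium API, any agent loop written against the standard reset/step contract transfers to its environments unchanged. A self-contained toy illustration of that contract (the conflict dynamics and all numbers here are invented for the sketch, not taken from BlueSky-Gym):

```python
import random

class TinyConflictEnv:
    """Toy environment exposing the Gymnasium reset/step contract that
    BlueSky-Gym environments follow. Dynamics are illustrative only:
    one intruder closes head-on; action 0 holds heading (separation
    shrinks), action 1 turns away (separation grows)."""

    MAX_STEPS = 100

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.separation = 10.0          # nautical miles to the intruder
        self.t = 0
        return [self.separation], {}    # (observation, info)

    def step(self, action):
        self.t += 1
        self.separation += 1.5 if action == 1 else -2.0
        terminated = self.separation <= 0.0 or self.separation >= 15.0
        truncated = self.t >= self.MAX_STEPS
        reward = -10.0 if self.separation <= 0.0 else 0.1
        return [self.separation], reward, terminated, truncated, {}

# A rollout loop identical to how an agent would drive a BlueSky-Gym env
env = TinyConflictEnv()
obs, info = env.reset(seed=0)
done, ret = False, 0.0
while not done:
    obs, reward, terminated, truncated, info = env.step(env.rng.choice([0, 1]))
    ret += reward
    done = terminated or truncated
```

Swapping `TinyConflictEnv` for a `gymnasium.make(...)` call on a BlueSky-Gym environment leaves the rollout loop untouched, which is exactly what makes the benchmark useful for algorithm comparison.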
(3) RL Engineering & Deployment
Data & Simulation
- Sim-to-Real Transfer: Domain randomization and system identification remain critical for bridging simulation-to-reality gaps. In aviation, simulators like BlueSky provide parametric models of aircraft performance, but real-world taxi operations exhibit stochastic delays, weather effects, and human controller variability [^14].
- Digital Twins: Airport digital twins integrating ADS-B feeds, A-SMGCS data, and weather streams enable high-fidelity RL training environments. Singapore Changi's departure metering study used historical A-SMGCS data to construct training scenarios [^11].
- Log Data-Driven Approaches: Offline RL (CQL, IQL, Decision Transformer) can directly leverage historical operational logs without building simulators—particularly attractive for aviation where online exploration is infeasible [5][47].
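What makes CQL "conservative" is a single regularizer: push Q down on all actions the policy might pick (a soft maximum over actions) and back up on the actions the log actually contains. A minimal discrete-action sketch (in practice this term is added to the TD loss with a tunable weight):

```python
import numpy as np

def cql_penalty(q_values, data_actions):
    """CQL regularizer on a batch of logged transitions.

    q_values     : (batch, n_actions) Q estimates for every discrete action
    data_actions : (batch,) indices of the actions actually taken in the log
    """
    # Stable logsumexp over actions: soft upper bound on Q at each state
    m = q_values.max(axis=1, keepdims=True)
    logsumexp = (m + np.log(np.exp(q_values - m).sum(axis=1, keepdims=True))).squeeze(1)
    # Q of the in-distribution (logged) actions
    data_q = q_values[np.arange(len(q_values)), data_actions]
    return (logsumexp - data_q).mean()
```

Minimizing this penalty suppresses optimistic Q estimates on actions the historical logs never took, which is the core defense against distribution shift in offline RL.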
Training Infrastructure & Experiment Tracking
| Framework | Best For | MARL Support | Scalability | Ease of Use |
|---|---|---|---|---|
| Stable Baselines3 | Research prototyping, single-agent | No | Single machine | High [^31] |
| RLlib (Ray) | Production, distributed, MARL | Yes | Multi-node cluster | Medium [^34] |
| CleanRL | Educational, reproducibility | No | Single machine | High [^31] |
| Tianshou | Flexibility, performance | Limited | Single machine | Medium [^40] |
| ACME (DeepMind) | Research-grade components | Yes | Multi-node | Low [^40] |
For experiment tracking and reproducibility, MLflow (already in the user's stack) is well-suited for logging RL training runs, hyperparameter sweeps, reward curves, and model artifacts. Integration with RLlib via Ray Tune provides scalable hyperparameter optimization.
Safe RL & Constraint Handling
Safe RL methods formalize safety as Constrained Markov Decision Processes (CMDPs), where the agent maximizes reward subject to cost constraints [21][24]:
- Lagrangian/primal-dual methods: Convert constraints into adaptive penalty terms via learned multipliers (PDO); trust-region variants (CPO, PCPO) enforce the constraint at each policy update. Most popular in practice.
- Safety filters/shielding: Hard constraint enforcement at every step via control barrier functions
- SafeDreamer: Integrates Lagrangian constraints into Dreamer's world model planning, achieving near-zero violations on Safety-Gymnasium (ICLR 2024) [64][70]
- SafeMARL: Extends safe RL to multi-agent settings—critical for cooperative ATC applications [^21]
For aviation applications, the regulatory requirement for explainability and verifiability makes safety-constrained approaches mandatory. The ACAS X collision-avoidance system, which optimizes POMDP policies offline and pairs them with formal verification, is an early aviation-specific deployment of learned decision-making [^14].
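Most Lagrangian-style methods share the same dual-ascent core: after each batch of episodes, the multiplier rises while measured cost exceeds the budget and decays toward zero otherwise. A sketch (learning rate and cost limit are placeholders):

```python
def lagrangian_step(lmbda, avg_episode_cost, cost_limit, lr=0.01):
    """One dual-ascent update of the Lagrange multiplier in a CMDP.

    The policy is trained on the penalized reward r - lmbda * c, so a growing
    lmbda steers it toward constraint satisfaction; the projection at 0 keeps
    the multiplier non-negative."""
    return max(0.0, lmbda + lr * (avg_episode_cost - cost_limit))
```

SafeDreamer applies the same idea inside world-model planning, penalizing predicted rather than observed cost, which is how it drives violations toward zero during training.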
(4) Industrial Application Landscape
| Industry | RL Application | RL Type | Maturity | Why It Works / Challenges |
|---|---|---|---|---|
| Gaming | AlphaStar (StarCraft II), OpenAI Five (Dota 2) | MARL, Self-play, PPO | Production | Simulated env = unlimited data; no safety constraints [^99] |
| Chip Design | AlphaChip (Google TPU floorplanning) | Policy gradient + GNN | Production | Clear reward (PPA metrics); graph-structured inputs [^109] |
| LLM Alignment | ChatGPT (RLHF), DeepSeek-R1 (GRPO) | PPO, DPO, GRPO | Production | Verifiable rewards (math, code); massive compute available [^52] |
| Robotics | AgiBot (manufacturing), sim-to-real manipulation | PPO, SAC, Real-world RL | Pilot Production | Fast sim training; still limited task generality [^60] |
| Recommendation | Netflix personalization, ad placement | DQN, Contextual bandits | Production | Abundant online feedback; A/B testing infrastructure [75][78] |
| Energy/Grid | Microgrid management, L2RPN competitions | PPO, TD3, DQN | Prototype → Production | Well-defined physics; PPO outperforms MILP/PSO for real-time control [79][85] |
| Finance | Market making, algorithmic trading | DQN, DDPG, PPO | Prototype | High noise, non-stationarity; implementation quality > algorithm choice [^92] |
| Supply Chain | Amazon/DHL warehouse ops, inventory management | DQN, MARL | Prototype → Production | Combinatorial action spaces; RL adapts to demand variability [33][30] |
| Autonomous Driving | Wayve, Waymo (decision-making modules) | PPO, SAC, Offline RL | Pilot Production | Massive sim training (CARLA/nuPlan); safety-critical certification barrier [^4] |
| Air Traffic Mgmt | Conflict resolution, departure metering, slot scheduling | DQN, PPO, MARL | Academic / Prototype | Safety certification gap; limited sim fidelity; regulatory barriers [14][2] |
Key insight: RL succeeds in production when (a) a high-fidelity simulator or abundant online feedback exists, (b) reward signals are clear and dense, and (c) safety constraints are manageable or the domain tolerates exploration errors. Aviation fails on all three counts for online RL, making offline RL and world model-based approaches the most viable path forward.
(5) RL in Air Logistics / Air Traffic: Deep Dive
Task-by-Task Analysis
5.1 Airport Ground Taxiing & Departure Metering
Representative Works:
- Tran, Pham & Alam (2023): DRL-based autonomous taxiing agent using PPO at Singapore Changi. State includes planning features (distance to target, expected time of arrival) and surrounding traffic; action is continuous acceleration; reward balances fuel burn, taxi time, and delay. Result: 97.8% on-time performance within a [-20, +5] s window and 29.5% fuel reduction [16][19].
- Pham et al. (IEEE TITS, 2022): Deep RL for airport departure metering under spatial-temporal airside interactions at Singapore Changi. Used A-SMGCS data. Result: 22% fuel reduction, contained taxiway congestion without significant throughput loss [^11].
- MARL for Autonomous Taxiing (2025): Multi-agent approach optimizing taxi time, fuel consumption, and emissions simultaneously [^27].
Environment: Custom simulation built on airport taxiway graph networks with aircraft performance models. Simulator: typically custom-built (not BlueSky), using real A-SMGCS/ADS-B data for scenario generation. Multi-agent: inherently, though most work treats the problem as single-agent with surrounding traffic folded into the environment; MARL formulations are emerging.
Maturity: Academic → Early Prototype
5.2 Runway & Slot Allocation
Representative Works:
- Nguyen-Duy & Pham (NTU): RL for collaborative multi-airport slot re-allocation under the A-CDM framework. Tested on a Hong Kong–Singapore–Bangkok hub network using OAG 2018 data. The RL agent significantly outperformed the nearest-slot heuristic under heavily reduced capacity (total delay: 84 vs. 107) [^118].
- RL for Strategic Airport Slot Scheduling (IEEE CAI, 2024): Formulated as an MDP and solved with DQN and PPO. Key finding: adding positive intermediate reward signals is what enables convergence. DQN outperformed PPO, with an average displacement of 1.44/1.99 slots per request on medium-/high-density instances [123][120].
MDP Design: State = current slot assignment status + demand profile; Action = assign/defer/displace request; Reward = negative delay + penalty for unaccommodated requests.
Maturity: Academic
5.3 Arrival/Departure Sequencing & Flow Management
Representative Works:
- DRL for Ground Delay Programs (UC Berkeley, 2024): Behavioral Cloning and Conservative Q-Learning (CQL) for optimizing Ground Delay Program rates (the controlled airport acceptance rate) at Newark (EWR). Trained on full-year 2019 data. The SAGDPENV simulation environment uses queuing diagrams driven by real operational data [5][22].
- Air Traffic Flow Management is surveyed extensively in the RL-in-Aviation survey, with multiple approaches using DQN and PPO for demand-capacity balancing [^14].
Key insight: Offline RL (CQL) is particularly well-suited here, as GDP decisions can be learned from historical operational data without online exploration.
Maturity: Academic → Prototype
5.4 Conflict Detection & Resolution
Representative Works:
- Chen et al. (Transportation Research Part C, 2023): General MARL with an adaptive manoeuvre strategy using Rainbow DQN. Agents output flight intentions (e.g., increase speed, turn left) rather than specific manoeuvre parameters; partial observations are based on imminent-threat detection sectors. Generalization validated across arbitrary scenarios [23][28].
- Self-Prioritizing MARL (TU Delft, 2025): A novel learning-based priority mechanism minimizes the number of ATC directives issued while resolving conflicts, over a continuous action space. Well-suited to centralized ATC where instruction bandwidth is limited [^20].
- Deniz & Wang (AIAA, 2024): MARL for UAM conflict resolution using A3C with centralized learning, decentralized execution. Resolves nearly all conflicts in high-density eVTOL scenarios [^17].
- DRL Conflict Resolution (Aviation journal, 2023): DQN agent using altitude/speed/heading adjustments modeled as MDP. Reward function incorporates ATC regulations [^25].
Simulators: Most works use custom environments, with BlueSky increasingly adopted; BlueSky-Gym provides standardized conflict resolution environments [32][35].
Maturity: Academic (extensive literature)
5.5 Ground Service Resource Scheduling
Representative Works:
- DRL for Multi-Objective Airport Ground Handling (TU Delft, 2025): First comprehensive RL formulation tackling the complete airport ground handling scheduling problem including tow tractors, baggage vehicles, refueling, catering, and cleaning—rather than simplified sub-problems [^95].
- ULD Build-Up Scheduling (ZIB Berlin, 2021): Multi-commodity network design model for cargo terminal ULD scheduling. While not RL-based, provides the problem formulation (workstation assignment, batching, break-down/build-up sequencing) that is highly amenable to RL treatment [^90].
- RL for Airline Maintenance Scheduling (2023): Adaptive RL for maintenance task scheduling with rescheduling capability. Demonstrated efficacy with ground time and time slack KPIs [^98].
Gap: This is the least explored area for RL in aviation, despite being a rich combinatorial optimization problem with clear operational value. Cargo terminal operations (ULD scheduling, equipment dispatching) and ground handling resource allocation represent major research opportunities.
Maturity: Early Academic / Unexplored
5.6 EU-Funded Projects & Institutional Resources
- HYPERSOLVER (SESAR JU, 2023–2025): First project to holistically merge ATFM and ATC using deep RL. Partners include ENAC and Eurocontrol. Uses continuous reassessment and dynamic updates across the full conflict management timeline [48][51].
- SESAR Exploratory Research: Includes projects on improved airport surface movements using surveillance technologies [^45].
- Eurocontrol A-CDM: Provides the operational framework (TSAT, TOBT, TTOT) and data-sharing protocols that any RL system for airport operations must integrate with [114][108].
- Eurocontrol DDR2 Dataset: Demand Data Repository with flight plans, sector configurations, and traffic counts [^14].
5.7 ADS-B Data & Simulation Resources
| Resource | Type | Coverage | Aviation RL Relevance |
|---|---|---|---|
| TartanAviation | Open dataset | 661 days ADS-B + 3.1M images + 3374h ATC audio, 2 US airports | High — multi-modal, trajectory + communications [^65] |
| ADSBexchange | Historical data service | Global, includes runway-level operations | High — takeoff/landing detection with runway ID [^68] |
| Eurocontrol DDR2 | Flight plan data | European airspace | High — demand/capacity for ATFM [^14] |
| OpenSky Network | Open ADS-B archive | Global, academic access | High — large-scale trajectory data |
| BlueSky Simulator | Open-source ATC sim | Configurable, Gymnasium API via BlueSky-Gym | Very High — standardized RL environments [^32] |
| AirSim | Drone simulation | UAV focused | Medium — useful for UAM scenarios [^38] |
| JSBSim | Flight dynamics | Fixed/rotary wing | Medium — flight control, not ground ops [^14] |
| SUMO | Traffic simulation | Road network, adaptable | Medium — can model ground vehicle movements |
| AnyLogic | Discrete-event + agent-based | General-purpose | Medium — good for cargo terminal simulation |
(6) Research Topics for Your Portfolio
Topic 1: Offline RL for Airport Taxi-Out Time Optimization
Problem Definition: Given an aircraft's pushback request, assigned runway, and current airport surface state, learn an optimal gate-hold / departure metering policy that minimizes total taxi-out time and fuel burn across all departures, using only historical operational logs.
Business Goal: Reduce taxi fuel burn by 15–25% and total surface delays by 10–20% without building a full online simulation.
RL Formalization: Episodic MDP where each episode is a departure bank. State = (aircraft queue at each taxiway segment, active runway config, weather category, time-of-day). Action = hold at gate for Δt ∈ {0, 1, …, 5} min. Reward = −(taxi_time + α·fuel_burn + β·delay_penalty).
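A minimal sketch of the transition records an offline learner (CQL/IQL/Decision Transformer) would consume for this MDP; the feature layout and the weights α, β are hypothetical, and exactly the quantities the Experiment Design below marks for ablation:

```python
from dataclasses import dataclass

ALPHA, BETA = 0.1, 0.5   # reward weights, hypothetical values subject to ablation

@dataclass
class Transition:
    state: tuple        # (queue per taxiway segment, runway config, wx category, hour)
    action: int         # gate-hold minutes, one of {0, ..., 5}
    reward: float
    next_state: tuple
    done: bool          # end of the departure bank (episode boundary)

def taxi_reward(taxi_time_min, fuel_burn_kg, delay_min):
    """Reward = -(taxi_time + alpha*fuel_burn + beta*delay_penalty)."""
    return -(taxi_time_min + ALPHA * fuel_burn_kg + BETA * max(0.0, delay_min))
```

One such record per pushback decision, replayed from historical logs, is the entire training input; no simulator is needed until evaluation.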
Data Sources:
- ADS-B surface trajectories (ADSBexchange operations data or OpenSky) [^68]
- Airport TOBT/TSAT from A-CDM records (if accessible via Frankfurt Airport/Fraport) [^114]
- Weather data (METAR/TAF from aviation weather services)
Baseline Methods: FCFS gate release, MILP-based departure scheduling, simple heuristic metering rules.
RL Method Route: Offline RL (CQL or IQL) on historical taxi logs → evaluate with importance-weighted off-policy evaluation (OPE) → optional online fine-tuning in BlueSky simulation. Model-based alternative: train a world model (Dreamer-style) on logged transitions, then plan in imagination.
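The OPE step can start from plain weighted importance sampling over logged trajectories; an undiscounted sketch, where pi and beta are callables returning action probabilities under the candidate and behavior policies (behavior probabilities would themselves be estimated from the logs):

```python
def wis_estimate(trajectories, pi, beta):
    """Weighted importance sampling estimate of the target policy's value.

    trajectories : list of [(state, action, reward), ...] logged under the
                   behavior policy
    pi, beta     : callables giving pi(a|s) for target / behavior policy
    """
    weights, returns = [], []
    for traj in trajectories:
        w = 1.0
        for s, a, _ in traj:
            w *= pi(s, a) / beta(s, a)   # product of per-step likelihood ratios
        weights.append(w)
        returns.append(sum(r for _, _, r in traj))
    total_w = sum(weights)
    return sum(w * g for w, g in zip(weights, returns)) / total_w
```

Per-decision and doubly robust estimators reduce the variance this naive product-of-ratios form suffers on long horizons, and would be the next step in a real evaluation pipeline.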
Evaluation Metrics: Average taxi-out time (min), fuel burn (kg), throughput (departures/hour), delay distribution, safety separation violations (must be 0).
Experiment Design:
- Train/val/test split by date (e.g., 8 months train, 2 val, 2 test)
- Ablation: state features, reward weights α/β, CQL α penalty
- OOD test: evaluate on different traffic density days, different runway configs
- Comparison table: FCFS vs. MILP vs. Heuristic vs. CQL vs. IQL vs. DT
Risk & Alternatives: Data quality (ADS-B surface coverage may be incomplete at some airports). Alternative: imitation learning (BC) on well-performing operational days; MPC if a transition model can be fit.
Output: Paper (Transportation Research Part C or JAIR) + GitHub repo with offline RL pipeline + Streamlit/Grafana demo dashboard.
Topic 2: MARL for Airport Ground Resource Dispatching
Problem Definition: Coordinate the dispatching of ground handling vehicles (tow tractors, fuel trucks, baggage carts, de-icing trucks) across an airport to minimize turnaround delays, using a multi-agent framework where each vehicle type is an agent.
Business Goal: Reduce average turnaround time by 10–15% and improve resource utilization rate by 20%.
RL Formalization: Decentralized POMDP / Markov Game. Each agent (vehicle dispatcher) observes local state (own fleet positions, assigned tasks, time to next event) and selects action (assign vehicle X to flight Y). Shared reward = −(total turnaround delay + idle time penalty). Constraint: safety separation on apron.
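The shared reward above is a single scalar broadcast to every agent, which is what makes the game fully cooperative. A minimal sketch with hypothetical field names and weights:

```python
def shared_reward(schedule, deadlines, idle_vehicles, idle_weight=0.1):
    """Shared team reward for one dispatch step.

    schedule      : {flight_id: service completion time} after all agents act
    deadlines     : {flight_id: target off-block time}
    idle_vehicles : count of vehicles left unassigned this step
    Every agent receives this same scalar, so credit assignment is handled
    by the MARL algorithm (e.g. QMIX value decomposition), not the reward.
    """
    delay = sum(max(0.0, schedule[f] - deadlines[f]) for f in schedule)
    return -(delay + idle_weight * idle_vehicles)
```

Centralized training consumes this team signal while each dispatcher's policy still conditions only on its local observation at execution time.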
Data Sources:
- Airport ground event logs (TOBT, AIBT, AOBT from A-CDM) [^108]
- Simulated using AnyLogic or custom discrete-event simulation
- ADS-B for aircraft arrival/departure times; flight schedules from OAG
Baseline Methods: Rule-based dispatching (nearest-vehicle-first), MILP scheduling [^90], manual scheduling.
RL Method Route: MAPPO or QMIX (centralized training, decentralized execution) in simulation → offline evaluation with replayed logs. Start with simplified 2-agent version (tow tractors + baggage), scale to 4–5 agents.
Evaluation Metrics: Average turnaround time, resource utilization %, number of delayed departures, CO₂ from vehicle movements, computation time vs. MILP.
Experiment Design:
- Build gym environment wrapping AnyLogic or custom Python discrete-event sim
- Compare: rule-based → single-agent RL → MAPPO → QMIX
- Ablation on communication mechanism (no comm, shared observations, attention-based message passing)
- Robustness: test with ±20% demand variation, equipment failures
Risk & Alternatives: Simulation fidelity; data availability for ground handling events. Alternative: Hierarchical RL (high-level task assignment + low-level routing). Backup: constraint programming + learned heuristic.
Output: Paper (AAAI / AAMAS / Transportation Science) + GitHub (env + training + evaluation) + animated demo.
Topic 3: World Model-Based RL for Airport Surface Movement Prediction & Control
Problem Definition: Learn a latent-space world model of airport surface dynamics from ADS-B trajectory data, then use it for (a) predicting taxi times and surface congestion, and (b) planning optimal taxi routes / speeds via imagined rollouts.
Business Goal: Enable predictive surface management — anticipate congestion 10–15 min ahead, provide recommended taxi speeds to reduce stop-and-go.
RL Formalization: POMDP with image-like state (airport surface heatmap from ADS-B + runway status + weather). World model learns transition dynamics in latent space (DreamerV3 architecture). Actor-critic trained entirely in imagination.
Data Sources:
- ADS-B surface position reports (1s resolution from TartanAviation or ADSBexchange) [65][68]
- Airport layout graph (OpenStreetMap / Eurocontrol)
- Weather (METAR)
Baseline Methods: Rule-based taxi control, data-driven taxi time prediction (XGBoost/LSTM), MPC with simplified kinematic model.
RL Method Route: DreamerV3 adaptation — encode airport surface state as spatial grid (like image observations), learn RSSM world model, train actor-critic in dreamed trajectories. Can incorporate safety constraints via SafeDreamer's Lagrangian approach [64][125].
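The image-like observation can be produced by rasterizing a window of ADS-B position reports onto a fixed grid over the airport surface; a minimal sketch (grid size, normalization, and bounding-box handling are all illustrative choices):

```python
import numpy as np

def surface_heatmap(positions, bbox, grid=(64, 64)):
    """Rasterize ADS-B surface position reports into an occupancy grid.

    positions : list of (lat, lon) reports within one time window
    bbox      : (lat_min, lat_max, lon_min, lon_max) of the airport surface
    Returns an (H, W) float array usable as an image-like world-model input.
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    h, w = grid
    heat = np.zeros(grid, dtype=np.float32)
    for lat, lon in positions:
        i = int((lat - lat_min) / (lat_max - lat_min) * (h - 1))
        j = int((lon - lon_min) / (lon_max - lon_min) * (w - 1))
        if 0 <= i < h and 0 <= j < w:
            heat[i, j] += 1.0               # count reports per cell
    return heat / max(1.0, float(heat.max()))   # normalize to [0, 1]
```

Stacking several consecutive windows as channels gives the world model short-term motion cues, analogous to frame stacking in visual RL.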
Evaluation Metrics: Taxi time prediction MAE, congestion prediction accuracy (precision/recall for hotspot detection), fuel savings, safety margin maintenance, planning horizon accuracy.
Experiment Design:
- Phase 1: Train world model purely on ADS-B logs (unsupervised)
- Phase 2: Fine-tune with reward signal (taxi time + fuel objective)
- Phase 3: Compare imagined rollouts vs. actual outcomes
- Ablation: latent dimension, prediction horizon, with/without weather encoding
- OOD: Test on different airports / runway configurations
Risk & Alternatives: ADS-B surface resolution may be insufficient for precise taxiway segment modeling. Alternative: encode state as graph (GNN) rather than image. Fallback: use model for prediction only (supervised) without RL control.
Output: Paper (NeurIPS workshop / ICML workshop / IEEE TITS) + open-source world model code + interactive prediction dashboard.
Topic 4: Offline RL for Air Cargo Terminal ULD Scheduling
Problem Definition: Optimize the scheduling of ULD (Unit Load Device) build-up and break-down at an air cargo hub, assigning ULDs to workstations and time slots while respecting cargo availability, flight connections, and workforce constraints.
Business Goal: Reduce average cargo connection time by 15–20%, increase workstation utilization, reduce labor overtime.
RL Formalization: Finite-horizon MDP. State = (current ULD queue, workstation status, inbound cargo availability timeline, outbound flight schedule). Action = (assign ULD X to workstation W at time T, or defer). Reward = −(connection_delay + overtime_cost) + throughput_bonus.
Data Sources:
- Cargo terminal event logs (if available from Fraport/Lufthansa Cargo)
- Simulated scenarios based on published cargo hub models [^90]
- Flight schedules + cargo booking data (synthesized from OAG)
Baseline Methods: FCFS scheduling, priority-based heuristics, MILP formulation from ZIB Berlin [^90].
RL Method Route: Offline RL (CQL/IQL) if historical scheduling logs exist. Otherwise, train online in AnyLogic/SimPy-based cargo terminal simulator, then evaluate with replay. Decision Transformer could work well given the sequential nature of assignment decisions.
Evaluation Metrics: Average ULD dwell time, connection rate (% ULDs making outbound flight), workstation utilization, labor cost, scalability to peak season volumes.
Experiment Design:
- Build discrete-event simulation of cargo hub (break-down → storage → build-up → delivery)
- Train: DQN → CQL → DT, compare with MILP (small instances) and heuristics (all instances)
- Ablation on state representation (raw features vs. learned embeddings)
- Scale test: 50 → 200 → 500 ULDs/day
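The first rung of the comparison ladder above, an FCFS baseline inside a minimal discrete-event core, fits in a few lines of stdlib Python (the service time and the FCFS rule are illustrative; the RL scheduler would replace the assignment rule):

```python
def simulate_buildup(ulds, n_workstations, service_time=45.0):
    """Minimal discrete-event simulation of ULD build-up.

    ulds : list of (uld_id, ready_time, flight_cutoff), all times in minutes.
    FCFS assignment to the earliest-free workstation; returns, per ULD,
    (completion time, whether it made its outbound flight cutoff).
    """
    free_at = [0.0] * n_workstations          # earliest time each station frees up
    results = {}
    for uld_id, ready, cutoff in sorted(ulds, key=lambda u: u[1]):
        k = min(range(n_workstations), key=lambda i: free_at[i])
        start = max(ready, free_at[k])        # wait for cargo and for the station
        done = start + service_time
        free_at[k] = done
        results[uld_id] = (done, done <= cutoff)
    return results
```

An RL agent would replace both the FCFS ordering and the station choice, observing the queue and cutoffs and being rewarded per the formalization above; the event loop itself stays unchanged.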
Risk & Alternatives: Real cargo data is commercially sensitive and hard to obtain. Mitigation: synthetic data generation based on published operational parameters. Alternative: imitation learning from expert scheduler logs; constraint programming for exact solutions on small instances.
Output: Paper (Transportation Research Part E / Computers & Operations Research) + GitHub (simulator + RL training) + Grafana monitoring dashboard.
Topic 5: RL-Enhanced LLM Agent for Flight Delay Prediction & Proactive Rescheduling
Problem Definition: Build an LLM-based agent that combines natural language understanding of NOTAM/weather briefings with RL-based planning to predict flight delays and recommend proactive rescheduling actions (gate changes, stand swaps, slot trades).
Business Goal: Reduce reactionary delays by 20–30% through earlier, smarter intervention.
RL Formalization: Agent observes structured + unstructured data (flight schedule, ADS-B feeds, METAR/TAF text, NOTAMs). LLM parses unstructured inputs into structured features. RL policy (trained with RLTR-style framework) selects actions from a discrete set (hold, swap gate, request slot trade, alert ops). Reward from actual vs. predicted delay reduction [^77].
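The planner's action-selection head can be sketched generically; RLTR's internals are not reproduced here, and the action names, linear scoring, and feature vector are illustrative stand-ins for whatever the LLM extraction step emits:

```python
import math

ACTIONS = ["hold", "swap_gate", "request_slot_trade", "alert_ops"]

def intervention_probs(features, weights, temperature=1.0):
    """Softmax policy over discrete interventions, conditioned on structured
    features the LLM has already extracted (e.g. predicted delay minutes,
    weather severity). An RL planner would adjust `weights` against the
    observed delay-reduction reward."""
    scores = [sum(w * f for w, f in zip(weights[a], features)) / temperature
              for a in ACTIONS]
    m = max(scores)                          # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Keeping the action set discrete and small is what makes the hallucination risk manageable: the LLM only parses inputs, while the trained policy picks among vetted operational moves.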
Data Sources:
- ADS-B trajectory data [^65]
- Historical METAR/TAF, NOTAM text
- Flight schedule + actual performance data (BTS for US; Eurocontrol for EU)
- A-CDM event data [^114]
Baseline Methods: Rule-based delay prediction (EUROCONTROL CODA), ML regression (XGBoost/LSTM), LLM-only zero-shot prediction.
RL Method Route: RL for LLM agent planning (RLTR framework) — train planner with tool-use rewards for accurate delay prediction and appropriate intervention selection [^80]. Offline RL warm-start from historical operations data.
Evaluation Metrics: Delay prediction MAE/RMSE, intervention success rate, reactionary delay reduction, false alarm rate, computational latency.
Experiment Design:
- Phase 1: Build delay prediction baseline (ML + LLM)
- Phase 2: Add RL planner for intervention selection
- Phase 3: Evaluate end-to-end on historical disruption days
- Compare: LLM-only → ML+Rules → RL Agent → RL+LLM Agent
Risk & Alternatives: LLM hallucination on operational decisions; reward shaping for rare events. Alternative: pure offline RL without LLM component; hybrid LLM + constraint optimization.
Output: Paper (KDD / AAAI Applied AI) + GitHub (agent framework) + Streamlit interactive demo.
(7) Future Trend Assessment (12–24 Months)
RL + LLM/Agent Impact on Scheduling & Logistics
The convergence of RL with LLM agents will reshape operational optimization in three ways:
- Natural language interfaces for operations: LLM agents with RL-trained planners will enable controllers and dispatchers to interact with optimization systems using natural language, lowering adoption barriers. The RLTR framework (EMNLP 2025) showing 8–12% planning improvement is an early signal [^77].
- Reasoning-augmented optimization: DeepSeek-R1's demonstration that pure RL can induce chain-of-thought reasoning opens the door to RL agents that can explain their scheduling decisions—a critical requirement for aviation certification [52][55].
- Multi-modal situation awareness: LLM agents processing NOTAM text, weather briefings, and ADS-B data simultaneously will enable more holistic delay prediction and response planning than current siloed systems.
Assessment: High impact within 24 months for decision support tools; 3–5 years for autonomous operational deployment.
Will Offline RL, World Models, and Constrained Optimization Become Mainstream?
Offline RL: Yes, for aviation and other safety-critical domains. The ability to learn from historical logs without online exploration directly addresses the primary barrier to RL adoption in aviation. CQL/IQL + OPE (off-policy evaluation) provide a complete workflow [5][47][^119].
World Models: Likely to become standard. DreamerV3's success with fixed hyperparameters across 150+ tasks (Nature, 2025) suggests readiness for domain-specific applications. SafeDreamer's constraint integration (ICLR 2024) adds the safety layer aviation needs [125][64].
Constrained Optimization Integration: Essential, not optional. Pure RL without hard constraints will never be accepted in aviation. Expect hybrid approaches: RL for exploration/improvement + MPC/MILP for constraint enforcement and safety certification.
Most Likely Near-Term Aviation Deployments
| Application | Likelihood (12–24 months) | Rationale |
|---|---|---|
| Decision support for GDP/ATFM | High | Offline RL on historical data; advisory (non-autonomous); HYPERSOLVER project advancing [^48] |
| Taxi-out time prediction & advisory | High | Predictive (not control); abundant ADS-B data; clear business case (fuel savings) |
| Airport slot scheduling | Medium-High | NP-hard problem; RL shown to generalize to unseen densities [^123] |
| Conflict resolution advisory | Medium | Strong academic foundation; but safety certification barrier remains high [^23] |
| Autonomous surface movement control | Low | Requires human-out-of-loop approval; far from certification |
| Cargo terminal optimization | Medium | Less safety-critical than ATC; clear ROI; but has received comparatively little research attention |
(8) References (Grouped by Topic)
RL Surveys & Algorithms
- Ghasemi et al. "A Comprehensive Survey of Reinforcement Learning" (arXiv, 2024) [^7]
- RL Renaissance Report (emerge.haus, 2025) [^1]
- State of RL in 2025 (datarootlabs, 2025) [^4]
RL for LLMs
- DeepSeek-R1: Incentivizing Reasoning via RL (arXiv, Jan 2025) [52][55]
- Wang et al. "RL Enhanced LLMs: A Survey" (arXiv, 2024) [^10]
- Li et al. "RL for LLM Agent Planning (RLTR)" (EMNLP 2025) [^77]
- RLHF vs DPO vs GRPO comparison [^104]
Offline RL & World Models
- DreamerV3 (Nature, 2025) [125][122]
- SafeDreamer (ICLR 2024) [64][70]
- EMERALD (ICML 2025) [^67]
- CoWorld (NeurIPS 2024) [^9]
- When to prefer DT for Offline RL (ICLR 2024) [^47]
- ADEPT: Offline RL with Diffusion World Models (ICLR 2025) [^6]
- Offline RL vs Imitation Learning (BAIR Blog, 2022) [^124]
Safe RL
- Kushwaha et al. "Survey of Safe RL and Constrained RL" (arXiv, 2025) [21][24]
- Safe RL overview (emergentmind, 2025) [^18]
RL in Aviation (Surveys)
- Razzaghi et al. "Survey on RL in Aviation Applications" (arXiv, 2022) [^14]
- ML in ATM Systematic Literature Review (SSRN, 2025) [^2]
Airport Taxiing & Departure Metering
- Tran, Pham & Alam. "Greener Airport Surface Operations: RL for Autonomous Taxiing" (JJSASS, 2023) [16][19]
- Pham et al. "DRL for Airport Departure Metering" (IEEE TITS, 2022) [^11]
- MARL for Autonomous Aircraft Taxiing (ScienceDirect, 2025) [^27]
Runway & Slot Scheduling
- Nguyen-Duy & Pham. "RL for Multi-Airport Slot Re-Allocation" (IWAC, 2022) [^118]
- "RL for Strategic Airport Slot Scheduling" (IEEE CAI, 2024) [123][120]
Ground Delay Programs & Flow Management
- Liu et al. "DRL for Real-Time GDP Revision" (arXiv, 2024) [5][22]
Conflict Resolution
- Chen et al. "General MARL for Multi-Aircraft Conflict Resolution" (TRC, 2023) [23][28]
- "Self-Prioritizing MARL for Conflict Resolution" (TU Delft, 2025) [^20]
- Deniz & Wang. "MARL for UAM Conflict Resolution" (AIAA, 2024) [^17]
- "DRL Conflict Resolution Strategy for ATM" (Aviation, 2023) [^25]
Ground Operations & Cargo
- "DRL for Multi-Objective Airport Ground Handling" (TU Delft, 2025) [^95]
- Euler et al. "ULD Build-Up Scheduling" (ZIB, 2021) [^90]
- Silva et al. "Adaptive RL for Aircraft Maintenance Task Scheduling" (Sci Rep, 2023) [^98]
Simulators & Benchmarks
- Groot et al. "BlueSky-Gym: RL Environments for ATC" (SESAR Innovation Days, 2024) [32][35]
- AAM-Gym (Brittain, 2022) [^44]
- Brittain et al. RL for multi-agent collision avoidance (cited in survey) [^8]
EU/Institutional Projects
- HYPERSOLVER (SESAR JU / CORDIS) [48][51]
- Eurocontrol A-CDM [114][108]
- SESAR Exploratory Research [^45]
ADS-B Data
- TartanAviation Dataset (CMU, 2024) [^65]
- ADSBexchange Historical Data [^68]
- ADS-B Ground Bit Analysis (2026) [^71]
Production RL
- AlphaChip / Google (Nature, 2021; blog 2024) [109][103]
- AgiBot Real-World RL (2025) [^60]
- RL in Production (RunPod, 2025) [^75]
- AlphaStar (DeepMind, 2019) [96][99]
RL Frameworks
- SB3 vs RLlib vs CleanRL comparison [31][34]
- Framework feature comparison [^40]
Next Steps: 4-Week Execution Plan
Goal: Complete an RL research prototype (Topic 1 or Topic 4) suitable for paper submission and portfolio showcase.
Week 1: Foundation & Data Pipeline
- [ ] Day 1–2: Select primary topic (recommend Topic 1: Offline RL for Taxi-Out Time). Set up GitHub repo with project structure: `env/`, `data/`, `models/`, `evaluation/`, `notebooks/`, `docs/`.
- [ ] Day 2–3: Download and preprocess ADS-B data from ADSBexchange (focus on one major airport, e.g., EDDF Frankfurt or KEWR Newark). Extract taxi-out trajectories using ground-bit transitions.
- [ ] Day 3–4: Build data pipeline in PySpark/Pandas: raw ADS-B → cleaned trajectories → episode-level features (gate → runway taxi time, queue length, weather, time-of-day). Log everything in MLflow.
- [ ] Day 5–7: Implement FCFS baseline and simple heuristic metering. Compute baseline metrics. Write data exploration notebook.
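The core of the Day 2–3 step, extracting taxi-out duration from ground-bit transitions, can be sketched as below. The record layout (time-sorted `(timestamp, on_ground)` samples per flight) is an assumption about an already-cleaned intermediate format, not the raw ADSBexchange schema.

```python
# Illustrative extraction of one taxi-out episode from ADS-B samples using
# the ground bit: taxi-out runs from the first on-ground sample until the
# ground bit flips to airborne (wheels up).

def taxi_out_seconds(points):
    """points: time-sorted list of (unix_ts, on_ground) samples for one flight.
    Returns the taxi-out duration in seconds, or None if no ground->air
    transition is observed in the track."""
    start = None
    for (ts, on_ground), (_, next_on_ground) in zip(points, points[1:]):
        if on_ground and start is None:
            start = ts                      # first on-ground sample: taxi begins
        if on_ground and not next_on_ground and start is not None:
            return ts - start               # ground bit flips: wheels up
    return None

# Toy track: 10 min on the ground, then airborne.
track = [(0, True), (60, True), (600, True), (660, False), (720, False)]
print(taxi_out_seconds(track))  # 600
```

In practice this per-flight function would sit inside a PySpark `groupBy(flight_id)` aggregation, with outlier filtering for tracks whose ground bit chatters (the behavior studied in the ground-bit analysis cited above [^71]).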
Week 2: Environment & Offline RL Training
- [ ] Day 8–9: Construct offline dataset in D4RL-compatible format: (s, a, r, s', done) tuples from historical taxi logs. Define state space, action space, and reward function.
- [ ] Day 10–11: Implement CQL and IQL using the d3rlpy or CORL library. Train on offline dataset. Track experiments with MLflow.
- [ ] Day 12–13: Implement Decision Transformer baseline. Compare learning curves.
- [ ] Day 14: Set up off-policy evaluation (OPE) metrics: FQE (Fitted Q-Evaluation), importance sampling. Evaluate all trained policies.
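Two of the Week-2 steps above, flattening logged episodes into D4RL-style tuples and scoring a candidate policy with importance-sampling OPE, can be sketched in plain Python. The episodes, policies, and probabilities below are toy assumptions for illustration; real runs would use the d3rlpy dataset classes and FQE.

```python
# (1) Flatten logged episodes into D4RL-style (s, a, r, s', done) tuples.
# (2) Estimate a new policy's value from the same logs with ordinary
#     importance sampling (a basic OPE estimator).

def to_transitions(episode):
    """episode: list of (state, action, reward). Returns (s, a, r, s', done)."""
    out = []
    for i, (s, a, r) in enumerate(episode):
        done = i == len(episode) - 1
        s_next = s if done else episode[i + 1][0]
        out.append((s, a, r, s_next, done))
    return out

def is_estimate(episodes, pi_target, pi_behavior):
    """Ordinary importance-sampling estimate of the target policy's return.
    pi_*(a, s) -> probability of taking action a in state s."""
    total = 0.0
    for ep in episodes:
        weight, ret = 1.0, 0.0
        for s, a, r in ep:
            weight *= pi_target(a, s) / pi_behavior(a, s)
            ret += r
        total += weight * ret
    return total / len(episodes)

# Toy logs: two episodes of (state, action, reward) triples.
logs = [[(0, 1, 1.0), (1, 1, 1.0)], [(0, 0, 0.0), (1, 0, 0.0)]]
uniform = lambda a, s: 0.5                    # behavior policy: 50/50
greedy = lambda a, s: 1.0 if a == 1 else 0.0  # candidate: always a=1

print(len(to_transitions(logs[0])))     # 2
print(is_estimate(logs, greedy, uniform))
```

The high variance of plain importance sampling is exactly why the plan also lists FQE: IS weights compound per step, so per-decision or doubly-robust variants are usually needed on long taxi episodes.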
Week 3: BlueSky Simulation & Evaluation
- [ ] Day 15–16: Install BlueSky-Gym. Create custom environment matching your target airport's taxiway layout. Validate simulation against real ADS-B data distributions.
- [ ] Day 17–18: Deploy trained offline RL policies in BlueSky environment. Compare: Offline-trained → Online-evaluated vs. baselines. Run ablation studies (state features, reward weights).
- [ ] Day 19–20: OOD robustness testing: different traffic densities, weather conditions, runway configurations. Generate comparison tables and figures.
- [ ] Day 21: Prepare visualization: animated taxi simulations, Grafana dashboard for live metrics.
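The Day 15–16 validation step, checking that simulated taxi times match the real ADS-B distribution, can be done with a two-sample Kolmogorov–Smirnov statistic. The samples below are made up; in practice the inputs would be BlueSky rollouts and the extracted ADS-B episodes, and `scipy.stats.ks_2samp` would replace this hand-rolled version.

```python
# Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
# empirical CDFs of simulated and observed taxi-out times. A small value
# indicates the simulator reproduces the real distribution.

def ks_statistic(xs, ys):
    """Max absolute difference between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    def cdf(sample, v):
        return sum(1 for x in sample if x <= v) / len(sample)
    return max(abs(cdf(xs, v) - cdf(ys, v)) for v in grid)

real = [540, 600, 620, 700, 730, 810]     # observed taxi-out times (s), toy
sim_ok = [550, 605, 615, 695, 740, 800]   # well-calibrated simulator, toy
sim_bad = [200, 220, 250, 260, 280, 300]  # badly calibrated simulator, toy

print(ks_statistic(real, sim_ok) < ks_statistic(real, sim_bad))  # True
```

A threshold on this statistic (or a formal KS test p-value) gives a concrete acceptance criterion for the custom BlueSky environment before any policy comparison is run on it.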
Week 4: Paper Writing & Portfolio Polish
- [ ] Day 22–23: Write paper draft (target: Transportation Research Part C or IEEE ITSC). Structure: Introduction → Related Work → Problem Formulation → Method → Experiments → Results → Discussion → Conclusion.
- [ ] Day 24–25: Polish GitHub repository: README with architecture diagram, installation instructions, reproducibility guide, model checkpoints in MLflow registry. Create Streamlit/Gradio demo.
- [ ] Day 26–27: Finalize paper: add all figures/tables, write abstract, proofread. Prepare ArXiv preprint.
- [ ] Day 28: Submit to ArXiv. Update portfolio website. Share on LinkedIn with technical summary. Plan next iteration (add MARL or world model component for follow-up paper).
Tools & Stack (Aligned with Your Existing Setup)
| Component | Tool |
|---|---|
| Data processing | PySpark on Databricks / local Spark |
| RL training | d3rlpy (offline RL), SB3 (online), RLlib (MARL) |
| Simulation | BlueSky-Gym + custom env |
| Experiment tracking | MLflow (already in your stack) |
| Visualization | Grafana + Plotly + Streamlit |
| Infrastructure | Docker, MinIO for data storage |
| Version control | GitHub with CI/CD |