Airport Taxi Optimization with Offline Reinforcement Learning Using ADS-B Data
A Comprehensive Engineering Research Report
1. Introduction and Problem Statement
Airport surface congestion costs the US aviation industry approximately $900 million annually in excess fuel burn alone [^1]. The inefficiency of ground movement—including prolonged taxi times, departure queue congestion, and suboptimal pushback sequencing—directly impacts fuel consumption, carbon emissions, airline operational costs, and passenger experience. Airport Departure Metering (DM) and taxi optimization represent high-value targets for AI-driven improvement [10][20].
This report provides an engineering-focused blueprint for building a production-grade system that uses historical ADS-B trajectory data and offline reinforcement learning to optimize airport ground movement. The focus is on actionable implementation guidance, concrete algorithm selection, and a realistic deployment pathway.
Why Offline RL?
Aviation is a safety-critical domain where online exploration is infeasible—you cannot randomly experiment with aircraft ground movements. Offline RL learns policies exclusively from pre-collected historical data without any real-time environment interaction [36][48]. This makes it the natural paradigm for airport surface optimization, where decades of operational data exist but experimentation carries unacceptable risk.
2. Problem Formulation: MDP and CMDP Design
2.1 Markov Decision Process Formulation
Airport ground movement is modeled as a Markov Decision Process (MDP) defined by the tuple ((\mathcal{S}, \mathcal{A}, P, R, \gamma)) [4][10]:
- (\mathcal{S}): State space representing the airport surface situation
- (\mathcal{A}): Action space for controlling aircraft movement
- (P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]): Transition probability
- (R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}): Reward function
- (\gamma \in [0,1]): Discount factor
The optimal policy (\pi^*) maximizes the expected cumulative discounted reward:
[
\pi^* = \arg\max_\pi \mathbb{E}_\pi \left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]
]
2.2 State Space Design
The state vector should capture four categories of features, following the framework validated by Tran et al. and Ali et al. at NTU [4][10]:
| Category | Features | Source |
|---|---|---|
| Planning (F1) | Distance to target, expected arrival time, remaining route segments, assigned runway | Flight plan / AMAN |
| Ego Aircraft (F2) | Current speed, heading, position on graph, aircraft type/weight class, fuel state | ADS-B / A-SMGCS |
| Traffic/Environment (F3) | Queue length at runway, number of aircraft on taxiway segment, hotspot occupancy, conflict proximity | ADS-B aggregate |
| Context (F4) | Runway configuration, wind direction/speed, visibility, time-of-day, day-of-week | METAR / Airport ops |
Taxiway hotspot features deserve special attention. Ali et al. demonstrated that encoding spatial-temporal congestion levels at known bottleneck intersections significantly improves DM policy convergence during training [20][10]. Define hotspot occupancy as the count of aircraft within a configurable radius of each critical taxiway junction.
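As a concrete sketch of the hotspot-occupancy feature, the helper below counts aircraft within a configurable radius of each critical junction. The 300 m radius and the coordinates used in any example are illustrative choices, not values from Ali et al.

```python
import math

def hotspot_occupancy(aircraft_positions, hotspots, radius_m=300.0):
    """Count aircraft within radius_m of each hotspot junction.

    aircraft_positions: list of (lat, lon) tuples from ADS-B
    hotspots: dict mapping junction name -> (lat, lon)
    Returns a dict mapping junction name -> occupancy count.
    """
    def haversine_m(p, q):
        # great-circle distance in metres between two (lat, lon) points
        lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371000 * math.asin(math.sqrt(a))

    return {name: sum(1 for p in aircraft_positions
                      if haversine_m(p, centre) <= radius_m)
            for name, centre in hotspots.items()}
```

The resulting per-junction counts slot directly into the F3 traffic feature group above.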
2.3 Action Space
Two formulations are viable depending on the optimization level:
Speed Control (Continuous): The action is the acceleration of the controlled aircraft at each timestep, (a_t \in [-a_{max}, +a_{max}]), where (a_{max}) is estimated from historical A-SMGCS data [^4]. This is suitable for individual aircraft taxi speed optimization.
Pushback Control (Discrete): The action is a binary hold/release decision per departure at each terminal gate. This is the formulation used for departure metering [10][26]. At each decision epoch, the DM agent decides whether to hold or pushback each ready-to-depart aircraft.
Hybrid: A hierarchical approach combining pushback sequencing at the strategic level with speed profile optimization at the tactical level.
2.4 Multi-Objective Reward Design
The reward function must balance competing objectives. Following Tran et al.'s validated design [4][1]:
[
R(s_t, a_t) = w_1 R_{time} + w_2 R_{fuel} + w_3 R_{safety} + w_4 R_{throughput}
]
| Component | Formula Concept | Rationale |
|---|---|---|
| (R_{time}) | Penalty proportional to deviation from reference speed (v_{ref}) to reach target on-time | Encourages timely arrival |
| (R_{fuel}) | Negative fuel burn estimate based on speed and acceleration profile | Minimizes fuel consumption |
| (R_{safety}) | Large negative penalty for conflicts (aircraft separation < threshold) or invalid actions | Hard safety constraint |
| (R_{throughput}) | Bonus for completing taxi within planned window | Maintains schedule conformance |
Reference speed is defined as (v_{ref} = d_{remaining} / t_{remaining}), giving the agent a dynamic target [^4]. The agent learns to track this reference while optimizing fuel and avoiding conflicts.
Practical weight tuning guidance: Start with equal weights and use sensitivity analysis. Tran et al. found that excessively penalizing fuel burn makes aircraft maintain constant speed disregarding traffic, while too-high conflict penalties cause aircraft to freeze in place [^4].
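A minimal per-step sketch of this weighted reward, using the dynamic reference speed (v_{ref} = d_{remaining} / t_{remaining}) from the text. The weights, the speed/acceleration fuel proxy, and the separation threshold are illustrative placeholders, not Tran et al.'s calibrated terms.

```python
def taxi_reward(speed, accel, dist_remaining, t_remaining, min_separation_m,
                w_time=1.0, w_fuel=1.0, w_safety=1.0, sep_threshold_m=60.0):
    """Per-step reward sketch: track v_ref, penalize a fuel proxy,
    and apply a large penalty when separation is violated."""
    v_ref = dist_remaining / max(t_remaining, 1.0)   # dynamic reference speed
    r_time = -abs(speed - v_ref)                     # deviation from reference
    r_fuel = -(0.1 * speed + 0.5 * abs(accel))       # crude fuel-burn proxy
    r_safety = -100.0 if min_separation_m < sep_threshold_m else 0.0
    return w_time * r_time + w_fuel * r_fuel + w_safety * r_safety
```

With equal weights, an on-reference, unaccelerated aircraft pays only the small fuel proxy, while a separation violation dominates the step reward, mirroring the hard-penalty design in the table.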
2.5 Constrained MDP (CMDP) Formulation
For safety-critical deployment, extend the MDP to a CMDP where safety constraints are explicit rather than embedded in reward weights [47][50]:
[
\max_{\pi} J^R(\pi) \quad \text{s.t.} \quad J^{C_i}(\pi) \leq d_i, \quad i = 1, \dots, m
]
where (J^{C_i}(\pi)) are expected cumulative costs (e.g., number of separation violations, runway incursions) and (d_i) are tolerance thresholds. This is solved via Lagrangian relaxation [^47]:
[
\min_{\lambda \geq 0} \max_\pi \left[ J^R(\pi) - \sum_{i=1}^{m} \lambda_i (J^{C_i}(\pi) - d_i) \right]
]
The Lagrange multipliers (\lambda_i) are interpreted as the "price" of violating each constraint and can be updated via gradient ascent or PID-controlled updates during training [^47].
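The projected gradient-ascent update on a single multiplier can be sketched in one line; the learning rate is an illustrative choice.

```python
def dual_ascent_update(lmbda, cost_estimate, threshold, lr=0.01):
    """One projected gradient-ascent step on a Lagrange multiplier:
    lambda <- max(0, lambda + lr * (J^C(pi) - d)).
    The multiplier grows while the constraint is violated and decays
    (down to zero) once the policy satisfies it."""
    return max(0.0, lmbda + lr * (cost_estimate - threshold))
```

Running this update each training iteration raises the "price" of a violated constraint until the policy respects it, which is the behaviour the PID-controlled variant smooths out.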
3. Data Layer: ADS-B Based Pipeline
3.1 ADS-B Ground Trajectory Reconstruction
ADS-B (Automatic Dependent Surveillance–Broadcast) transmits aircraft position, velocity, altitude, and identification at approximately 1 Hz. On the airport surface, ADS-B provides the raw signal for reconstructing taxi trajectories [17][22].
Key data fields from ADS-B surface messages:
| Field | Description | Use |
|---|---|---|
| `icao24` | Aircraft unique identifier | Track identity |
| `latitude`, `longitude` | Position (WGS-84) | Trajectory reconstruction |
| `velocity` | Ground speed (knots) | Speed profile analysis |
| `heading` | Track angle | Direction of movement |
| `on_ground` | Boolean surface flag | Filter ground movements |
| `timestamp` | Unix epoch time | Temporal sequencing |
3.2 Data Sources
| Source | Access | Coverage | Resolution | Cost |
|---|---|---|---|---|
| OpenSky Network | Trino DB (authenticated), REST API | Global, 30k+ receivers | ~1-2 Hz (cooperative) | Free for research [22][28] |
| ADS-B Exchange | API (tiered), Historical archive | Global, unfiltered | ~1 Hz | Paid for historical [^25] |
| A-SMGCS (airport-specific) | Airport authority agreement | Single airport, second-by-second | 1 Hz (cooperative + non-cooperative) | Restricted [10][71] |
| Eurocontrol DDR2 | Application required | European airspace | Flight plan level | Research access |
Recommended starting point: OpenSky Network via the pyopensky library [^35] and the traffic Python library [^38]. These provide immediate access to historical ADS-B state vectors with Trino SQL queries.
```python
# Example: Query surface movements at Frankfurt Airport (EDDF) using pyopensky
from pyopensky.trino import Trino

trino = Trino()
df = trino.query("""
    SELECT time, icao24, lat, lon, velocity, heading, onground, callsign
    FROM state_vectors_data4
    WHERE onground = true
      AND lat BETWEEN 49.95 AND 50.10
      AND lon BETWEEN 8.45 AND 8.65
      AND hour BETWEEN '2024-06-01T00:00:00Z' AND '2024-06-30T23:59:59Z'
""")
```
3.3 Map Matching to Taxiway/Runway Graph
Raw ADS-B positions must be snapped to the airport's taxiway graph. Szymanski et al. (AIAA 2023) developed a validated map-matching algorithm that achieves 97-99% accuracy while processing trajectories of more than 100 points in under 2 seconds [^17].
Algorithm: Hidden Markov Model (HMM) based map-matching:
- Graph construction: Automatically generate a directed graph from OpenStreetMap (OSM) data for any airport. Nodes are taxiway junctions; edges are taxiway segments with attributes (type, name, bearing, distance, speed limit) [^17].
- Emission probability: For each ADS-B position, compute the likelihood of being on each candidate edge based on perpendicular distance.
- Transition probability: Compute likelihood of transitioning between edges based on route plausibility (shortest path distance vs. great-circle distance).
- Viterbi decoding: Find the most likely sequence of edges (the matched trajectory).
Implementation tools:
- `osmnx`: Extract airport taxiway network from OSM
- `networkx`: Graph representation and shortest path computation
- Custom HMM or `hmmlearn`: Viterbi decoding for map matching
- `traffic` library: Built-in airport surface operations support [^38]
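To make the four HMM steps concrete, here is a toy Viterbi decoder over precomputed point-to-edge distances. The Gaussian emission width and the flat edge-switch penalty are simplifying assumptions standing in for the route-plausibility transition model; this is not Szymanski et al.'s calibrated algorithm.

```python
def viterbi_match(point_edge_dists, transition_penalty=1.0, sigma_m=20.0):
    """Toy HMM map-matcher. point_edge_dists[t][e] is the perpendicular
    distance (m) from ADS-B fix t to candidate edge e. Emission log-prob
    is Gaussian in that distance; a flat log-penalty discourages edge
    switches. Returns the most likely edge index for each fix."""
    n_edges = len(point_edge_dists[0])
    emit = lambda d: -0.5 * (d / sigma_m) ** 2

    score = [emit(d) for d in point_edge_dists[0]]
    backptrs = []
    for dists in point_edge_dists[1:]:
        new_score, ptr = [], []
        for e, d in enumerate(dists):
            # best predecessor under the switch penalty
            prev = max(range(n_edges),
                       key=lambda p: score[p] - (0.0 if p == e else transition_penalty))
            ptr.append(prev)
            new_score.append(score[prev]
                             - (0.0 if prev == e else transition_penalty) + emit(d))
        score = new_score
        backptrs.append(ptr)

    # backtrack the highest-scoring edge sequence
    path = [max(range(n_edges), key=lambda e: score[e])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]
```

The switch penalty is what makes the decoder robust to GPS noise: a single fix that momentarily looks closer to a parallel taxiway does not flip the matched edge.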
3.4 Feature Engineering for Ground Operations
After map-matching, engineer the following features per aircraft per timestep:
Spatial features:
- Edge ID (current taxiway segment)
- Distance traveled on current edge, distance remaining to target
- Number of intersections remaining on route
- Euclidean distance to runway threshold
Traffic features:
- Count of aircraft on same taxiway segment
- Count of aircraft within 200m, 500m, 1000m radii
- Queue length at assigned runway (aircraft between holding point and threshold)
- Hotspot density: aircraft count at each of the top-K congested junctions [^20]
- Nearest conflicting aircraft: distance, relative speed, relative heading
Temporal features:
- Scheduled departure time minus current time
- Historical average taxi time for this route
- Time since pushback
Environmental features:
- Active runway configuration (extracted from METAR/ATIS)
- Wind speed and direction
- Visibility category (CAT I/II/III)
- Hour of day, day of week (one-hot or cyclical encoding)
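The cyclical encoding mentioned above keeps 23:00 and 00:00 adjacent in feature space, which one-hot encoding does not; a minimal helper:

```python
import math

def cyclical_encode(value, period):
    """Encode a cyclic feature (hour of day, day of week) as (sin, cos)
    so that values near the period boundary end up close together."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)
```

Used as `cyclical_encode(hour, 24)` and `cyclical_encode(weekday, 7)`, this yields two floats per feature instead of 24 or 7 one-hot columns.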
3.5 Handling Data Quality Issues
| Issue | Mitigation |
|---|---|
| Irregular sampling (0.5-5 Hz) | Resample to fixed 1 Hz using linear interpolation for position, zero-order hold for discrete states |
| GPS noise (±10-50m) | Kalman filter before map-matching; the HMM map-matching itself acts as a spatial filter [^17] |
| Missing data (gaps > 30s) | Segment trajectories at gaps; discard segments shorter than minimum taxi time |
| Mixed ground/airborne | Filter using on_ground flag combined with altitude < 200 ft AGL and speed < 80 knots |
| Non-aircraft targets | Filter by known aircraft icao24 identifiers; cross-reference with flight schedules |
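A stdlib-only sketch of the first and third mitigations combined: segment a track at reporting gaps, then resample each surviving segment to 1 Hz with linear position interpolation. The 30 s gap threshold is the table's value; dropping single-fix fragments stands in for the minimum-taxi-time filter.

```python
import math

def segment_and_resample(fixes, max_gap_s=30.0):
    """Split a time-sorted (timestamp, lat, lon) track at gaps longer than
    max_gap_s, then resample each segment to 1 Hz by linear interpolation."""
    # 1) segment the track at reporting gaps
    segments, current = [], [fixes[0]]
    for prev, cur in zip(fixes, fixes[1:]):
        if cur[0] - prev[0] > max_gap_s:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)

    # 2) resample each segment at integer-second timestamps
    def resample(seg):
        out, i = [], 0
        t = math.ceil(seg[0][0])
        while t <= seg[-1][0]:
            while seg[i + 1][0] < t:          # advance to the bracketing pair
                i += 1
            (t0, la0, lo0), (t1, la1, lo1) = seg[i], seg[i + 1]
            w = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
            out.append((t, la0 + w * (la1 - la0), lo0 + w * (lo1 - lo0)))
            t += 1
        return out

    # discard fragments too short to be a usable taxi segment
    return [resample(s) for s in segments if len(s) > 1]
```

Discrete states (edge ID, runway configuration) should instead use zero-order hold, per the table.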
3.6 Offline Dataset Construction
The final offline RL dataset follows the format (\mathcal{D} = \{(s_t, a_t, r_t, s_{t+1}, d_t)\}_{t=1}^{N}) where (d_t) indicates episode termination.
Episode definition: One episode = one taxi operation (gate to runway for departures, runway to gate for arrivals). Each episode begins at pushback/touchdown and ends at runway entry/gate arrival.
Action labeling from historical data: Since historical data reflects what controllers/pilots actually did (the behavior policy (\pi_\beta)), actions are extracted as:
- Speed control: (a_t = (v_{t+1} - v_t) / \Delta t) (observed acceleration)
- Pushback control: Binary hold/release derived from actual pushback times vs. scheduled times
Reward labeling: Compute reward retrospectively using the multi-objective function above, with fuel burn estimated using the ICAO emission model (fuel flow rate × engine time × number of engines) [77][86].
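A sketch of this labeling step on a 1 Hz speed profile, covering the observed-acceleration action and the per-step fuel term of the reward. The 0.11 kg/s per-engine taxi fuel flow is a ballpark narrow-body figure used for illustration, not a value looked up in the ICAO engine emissions databank.

```python
def label_transitions(speeds, dt=1.0, fuel_flow_kg_s=0.11, n_engines=2):
    """Derive behaviour-policy actions (observed acceleration) and a
    per-step fuel-burn reward term from a 1 Hz taxi speed profile.
    F = f * N * T from the ICAO model, applied one timestep at a time."""
    actions = [(v1 - v0) / dt for v0, v1 in zip(speeds, speeds[1:])]
    fuel_per_step = fuel_flow_kg_s * n_engines * dt
    rewards = [-fuel_per_step] * len(actions)
    return actions, rewards
```

The time, safety, and throughput reward components from Section 2.4 would be added to each element of `rewards` before writing the transitions to the dataset.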
4. Environment and Simulation
4.1 BlueSky and BlueSky-Gym
BlueSky is an open-source air traffic simulator developed at TU Delft, capable of fast-time simulation of air traffic including ground movements [^12]. BlueSky-Gym wraps BlueSky with a Gymnasium-compatible API for RL research [3][6][^15].
Capabilities:
- Aircraft performance models (OpenAP database)
- TrafScript command language for traffic control
- Fast-time simulation (100x+ real-time)
- Plugin architecture for custom extensions
- 7 built-in RL environments (descent, conflict resolution, merging, waypoint planning) [^6]
Limitations for airport surface simulation:
- No built-in taxiway network modeling (airside graph must be custom-built)
- No pushback/gate dynamics
- No ground vehicle interactions
- Limited surface conflict detection
- No weather impact on surface operations
Recommendation: Use BlueSky-Gym as the foundation but build a custom airport surface plugin that adds taxiway graph navigation, pushback queuing, and surface conflict detection.
4.2 Custom Graph-Based Airport Simulator
For departure metering and taxi optimization, a purpose-built simulator is more practical. Ali et al. built a representative airside simulator for Singapore Changi Airport with the following components [^10]:
- Airside network graph: Nodes (gates, taxiway junctions, runway thresholds), edges (taxiway segments with capacity and speed constraints)
- Traffic flow model: Aircraft move along assigned routes with stochastic speed profiles calibrated from A-SMGCS data
- Episode generation: Sample departure schedules from historical data, stochastically perturb pushback-ready times
- Conflict model: Check separation constraints at each timestep; enforce first-come-first-served at intersections
```python
# Pseudocode: Minimal Airport Surface Simulator
import gymnasium

class AirportSurfaceEnv(gymnasium.Env):
    def __init__(self, airport_graph, schedule):
        self.graph = airport_graph    # networkx DiGraph
        self.schedule = schedule      # list of (callsign, gate, runway, time)
        self.aircraft = {}            # active aircraft states

    def step(self, action):
        # action: dict of {callsign: acceleration} or {callsign: hold/release}
        for ac in self.aircraft.values():
            ac.update_position(action, self.graph)
        conflicts = self.detect_conflicts()
        reward = self.compute_reward(conflicts)
        obs = self.get_observation()
        terminated = all(ac.reached_target for ac in self.aircraft.values())
        # Gymnasium API: (obs, reward, terminated, truncated, info)
        return obs, reward, terminated, False, {"conflicts": conflicts}

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # sample a new episode from the schedule
        episode = self.sample_episode()
        self.aircraft = self.initialize_aircraft(episode)
        return self.get_observation(), {}
```
4.3 SUMO as an Alternative
Eclipse SUMO (Simulation of Urban Mobility) is a microscopic traffic simulator that can model individual vehicle movements on a network [78][84]. While designed for road traffic, it can be adapted for airport surface movement by treating taxiways as road segments. However, aircraft dynamics (acceleration profiles, minimum turn radii, wake turbulence) differ substantially from road vehicles, making a custom simulator preferable.
4.4 Creating Offline Datasets for RL Training
Two approaches for generating the offline training dataset:
Approach A: Direct from historical data (recommended for initial development)
- Extract trajectories from OpenSky/A-SMGCS
- Map-match to taxiway graph
- Label states, actions, rewards as described in Section 3.6
- This gives a "behavior policy" dataset reflecting actual controller decisions
Approach B: Simulator-generated (recommended for augmentation)
- Run the custom simulator with rule-based policies (FIFO, shortest path, speed-optimal)
- Collect diverse trajectories covering various traffic densities
- Mix with historical data for broader state-action coverage
Dataset quality indicators (from D4RL best practices [61][70]):
- Coverage: Trajectories should span low, medium, and high traffic densities
- Diversity: Include multiple behavior policies (expert, suboptimal, mixed)
- Sufficiency: Aim for >100,000 episodes across at least 6 months of operations
4.5 Sim-to-Real Considerations
Key gaps between simulation and real operations:
- Communication delays: Real ATC instructions have latency not modeled in simulation
- Pilot compliance variance: Pilots may not follow speed advisories precisely
- Non-modeled factors: Ground vehicles, construction, engine startup delays
- Weather dynamics: Surface conditions change continuously
Mitigation: Use domain randomization during training (randomize speed compliance, add noise to transition dynamics) and conservative policy constraints during deployment.
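Domain randomization of the transition dynamics can be as simple as perturbing the commanded action before the simulator applies it; the compliance range and noise level below are illustrative tuning knobs, not measured values.

```python
import random

def randomize_compliance(commanded_accel, rng=None,
                         compliance_range=(0.8, 1.0), noise_std=0.05):
    """Domain randomization for sim-to-real transfer: scale the commanded
    acceleration by a random pilot-compliance factor and add transition
    noise, so the policy never trains on perfectly executed commands."""
    rng = rng or random.Random()
    compliance = rng.uniform(*compliance_range)
    return compliance * commanded_accel + rng.gauss(0.0, noise_std)
```

Calling this inside the simulator's `step()` exposes the policy to the compliance variance and unmodeled dynamics listed above.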
5. Offline RL Algorithms
5.1 Algorithm Landscape
Offline RL addresses the fundamental challenge that standard RL algorithms fail when trained on fixed datasets due to distributional shift—the learned policy encounters states not in the training data, leading to catastrophic overestimation of Q-values [36][48].
| Algorithm | Type | Key Mechanism | Complexity | Best For |
|---|---|---|---|---|
| CQL | Value regularization | Learns conservative Q-function that lower-bounds true value [36][30] | Medium | High stochasticity, dense rewards |
| IQL | In-sample learning | Never queries OOD actions; uses expectile regression on V-function [48][45] | Low | General offline RL; fine-tuning |
| BCQ | Policy constraint | Constrains policy to behavior data support via VAE [46][58] | High | Continuous actions, narrow data |
| TD3+BC | Policy regularization | Adds BC regularization term to TD3 objective [49][52] | Low | Simple baseline, good default |
| Decision Transformer | Sequence modeling | Casts RL as conditional sequence generation via GPT [60][66][^69] | High | Sparse rewards, long horizons |
5.2 Conservative Q-Learning (CQL)
CQL augments the standard Bellman error objective with a regularizer that pushes down Q-values for out-of-distribution actions while pushing up Q-values for in-distribution actions [36][33]:
[
\min_Q \alpha \left( \mathbb{E}_{s \sim \mathcal{D}, a \sim \mu}[Q(s,a)] - \mathbb{E}_{s,a \sim \mathcal{D}}[Q(s,a)] \right) + \frac{1}{2} \mathbb{E}_{s,a,s' \sim \mathcal{D}} \left[ (Q(s,a) - \hat{\mathcal{B}}^\pi Q(s,a))^2 \right]
]
where (\mu) is a broad distribution (e.g., uniform) and (\alpha) controls conservatism. CQL can be implemented in less than 20 lines of code on top of standard Q-learning [^33].
Aviation suitability: Strong for departure metering where state transitions are stochastic and reward is dense (per-step taxi delay). The conservative lower bound aligns with the safety-first aviation culture.
5.3 Implicit Q-Learning (IQL)
IQL avoids evaluating any out-of-distribution actions entirely by using expectile regression to estimate an upper expectile of the value function [45][48]:
[
L_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ L_2^\tau (Q_{\hat{\theta}}(s,a) - V_\psi(s)) \right]
]
where (L_2^\tau(u) = |\tau - \mathbf{1}(u < 0)| u^2) is the asymmetric squared loss with expectile parameter (\tau). Policy extraction uses advantage-weighted behavioral cloning [^48].
Aviation suitability: Excellent for taxi speed optimization where you want to improve over the average historical behavior without risk of extrapolation. IQL's state-of-the-art D4RL performance and strong online fine-tuning capability make it the recommended primary algorithm.
5.4 Decision Transformer (DT)
DT frames RL as autoregressive sequence generation, conditioning on desired return-to-go (\hat{R}_t), past states, and actions to predict optimal future actions [66][69]:
[
a_t = \text{Transformer}(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_t, s_t)
]
Aviation suitability: Best for long-horizon taxi operations with sparse terminal rewards (e.g., total taxi time only known at episode end). Meta's empirical study found DT is substantially better than CQL with sparse rewards and low-quality data, but requires more data overall [^21].
5.5 Recommended Algorithm Strategy
| Phase | Algorithm | Rationale |
|---|---|---|
| Baseline | Behavioral Cloning (BC) | Simple supervised learning on expert demonstrations; establishes floor |
| Primary | IQL | Best balance of performance, simplicity, and fine-tuning potential [^48] |
| Conservative | CQL | Safety-oriented lower-bound estimation; cross-validate against IQL [^36] |
| Long-horizon | Decision Transformer | If terminal rewards dominate; leverage Transformer scaling [^69] |
| Simple alternative | TD3+BC | Minimal modification to standard TD3; surprisingly effective [^49] |
Implementation library: d3rlpy [93][99]—the most mature offline RL library, supporting CQL, IQL, BCQ, TD3+BC, and Decision Transformer with a scikit-learn-compatible API and PyTorch backend.
```python
import d3rlpy

# Load airport taxi dataset (custom format)
dataset = d3rlpy.dataset.MDPDataset(
    observations=states,   # (N, state_dim)
    actions=actions,       # (N, action_dim)
    rewards=rewards,       # (N,)
    terminals=terminals,   # (N,) boolean
)

# Configure IQL
iql = d3rlpy.algos.IQLConfig(
    actor_learning_rate=3e-4,
    critic_learning_rate=3e-4,
    expectile=0.7,      # tau parameter
    weight_temp=3.0,    # advantage weighting temperature
    batch_size=256,
).create(device="cuda:0")

# Train offline
iql.fit(
    dataset,
    n_steps=500000,
    evaluators={"td_error": d3rlpy.metrics.TDErrorEvaluator(episodes=test_episodes)},
)
```
5.6 Safety-Constrained Offline RL
For deployment in aviation, standard offline RL is insufficient—explicit safety constraints are mandatory.
C2IQL (Constraint-Conditioned Implicit Q-Learning) extends IQL to the constrained setting by expanding the implicit value function update to handle cost constraints, then reconstructing non-discounted cumulative costs from discounted values [^16]. This is the most natural extension of IQL for safety-critical aviation use.
Lagrangian methods remain the workhorse for CMDP solving. The multiplier (\lambda) can be updated via gradient ascent: (\lambda_{k+1} = [\lambda_k + \eta \cdot (J^C(\pi_{\theta_k}) - d)]_+) or via PID control for smoother convergence [^47]. Empirical evidence shows automated GA updates can even exceed the performance of fixed optimal (\lambda^*) in complex tasks [^47].
Shielding provides a hard safety guarantee by constructing a shield from formal specifications (e.g., Linear Temporal Logic) that overrides unsafe actions before execution [^50]. For airport operations, a shield would enforce: "No two aircraft may occupy the same taxiway segment simultaneously" and "No aircraft may enter an active runway without clearance."
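A minimal action shield enforcing those two rules might look like the following; the action dictionary, occupancy map, and clearance set are hypothetical data structures, and a production shield would be derived from formal LTL specifications rather than hand-written checks.

```python
def shield(proposed_action, callsign, occupancy, clearances, runway_edges):
    """Override an unsafe action before execution.

    occupancy: dict mapping taxiway-segment ID -> aircraft count
    clearances: set of callsigns cleared to enter an active runway
    runway_edges: set of segment IDs that belong to active runways
    """
    target_edge = proposed_action["target_edge"]
    if occupancy.get(target_edge, 0) > 0:
        return {"type": "hold", "target_edge": None}   # segment already occupied
    if target_edge in runway_edges and callsign not in clearances:
        return {"type": "hold", "target_edge": None}   # no runway clearance
    return proposed_action                             # action is safe as-is
```

Because the shield only ever replaces an action with a provably safe "hold", the learned policy's performance is preserved whenever its proposals are already safe.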
5.7 Handling Distribution Shift
| Technique | How It Works | Used By |
|---|---|---|
| Conservative value estimation | Lower-bound Q-values for OOD actions | CQL [^36] |
| In-sample learning | Never query Q for unseen actions | IQL [^48] |
| Policy constraint | Restrict policy to data support | BCQ [^46], TD3+BC [^49] |
| Ensemble disagreement | Use Q-function ensemble variance as uncertainty | CQL-ensemble |
| Return conditioning | Condition on achievable returns only | Decision Transformer [^69] |
6. Baselines and Comparisons
6.1 Rule-Based Heuristics
- FIFO (First-In-First-Out): Aircraft are pushed back and routed in scheduled order. This is the simplest and most common real-world baseline.
- Shortest Path First: Route each aircraft via the shortest taxiway path (Dijkstra/A* on the airport graph).
- Time-Based Metering: Hold departures at gates until estimated taxi time matches the target (e.g., N-Control—release when taxiway aircraft count drops below N) [^34].
6.2 Queueing Theory Models
Model the runway as a server with aircraft as customers. M/M/1 or M/G/1 queue models provide analytical estimates of taxi delay as a function of arrival rate and service rate. Useful for validating whether RL policies achieve theoretically plausible improvements.
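For the M/M/1 case the expected time in queue (excluding service) is (W_q = \rho / (\mu - \lambda)) with (\rho = \lambda / \mu); a small helper makes the plausibility check concrete.

```python
def mm1_expected_wait(arrival_rate, service_rate):
    """Expected time in queue (excluding service) for an M/M/1 runway
    model: W_q = rho / (mu - lambda), rho = lambda / mu.
    Rates must share a time unit; requires lambda < mu for stability."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = arrival_rate / service_rate
    return rho / (service_rate - arrival_rate)
```

For example, 30 departures/hour against a 40 movements/hour runway gives an expected queueing delay of 0.075 h (4.5 min); an RL policy claiming to beat this bound by a wide margin warrants scrutiny.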
6.3 MILP / Optimization Scheduling
Mixed-Integer Linear Programming models optimize taxiway and runway schedules simultaneously. Lee (MIT, 2014) developed integrated MILP models for both taxiway and runway scheduling at Detroit Airport [^34]. MILP approaches can achieve 1-3% emissions reduction with 15-second replanning cycles [^31].
MILP limitations: Computational cost scales exponentially with traffic density; difficulty handling stochastic events; requires complete information about all future aircraft.
6.4 Supervised Learning Predictors
Train models to predict taxi time, congestion, or delay from features—but without prescriptive optimization. These serve as useful feature extractors or for comparison to demonstrate the value of the RL optimization component.
6.5 Fair Benchmarking Protocol
| Requirement | Implementation |
|---|---|
| Same evaluation scenarios | Fix 100+ test episodes spanning low/medium/high traffic |
| Same metrics computation | Standardize taxi time, fuel, conflicts measurement |
| Multiple seeds | Report mean ± std over ≥5 random seeds |
| Statistical tests | Wilcoxon signed-rank test for pairwise algorithm comparison |
| OPE before deployment | Use Fitted Q-Evaluation (FQE) for off-policy evaluation of policies before any online testing [^67] |
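For the pairwise test, `scipy.stats.wilcoxon` is the production choice; the self-contained sketch below shows the computation with the usual normal approximation, which is adequate for the 100+ paired test episodes recommended above.

```python
import math

def wilcoxon_signed_rank(xs, ys):
    """Paired Wilcoxon signed-rank test, normal approximation.
    Returns (W+, two-sided p). Tied |differences| get average ranks;
    zero differences are dropped, as in the standard procedure."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    ranked = sorted(diffs, key=abs)

    # assign average ranks to groups of tied |d|
    ranks, i = {}, 0
    while i < n:
        j = i
        while j < n and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        avg = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg
        i = j

    w_plus = sum(ranks[k] for k in range(n) if ranked[k] > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

Applied to paired per-episode taxi times (RL policy vs. FIFO) over the fixed test set, a p-value below 0.05 supports claiming a genuine improvement rather than seed noise.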
7. Evaluation Metrics
| Metric | Definition | Target |
|---|---|---|
| Average taxi time (min) | Mean time from pushback to runway entry (departures) or touchdown to gate (arrivals) | ≥10% reduction vs. FIFO |
| Fuel burn (kg/flight) | Estimated using ICAO fuel flow model: (F = T \times f \times N) where (T) = taxi time, (f) = fuel flow per engine, (N) = number of engines [77][86] | ≥15% reduction |
| CO₂ emissions (kg/flight) | 3.16 × fuel burn (kg) | Proportional to fuel |
| Throughput (movements/hour) | Number of departures + arrivals per unit time | ≥5% improvement |
| Taxi delay (min) | Actual taxi time minus unimpeded reference time (20th percentile) [^86] | ≥25-44% reduction [^20] |
| Conflict rate (events/hour) | Number of separation violations or near-misses per operational hour | Zero-tolerance target |
| Schedule conformance (%) | Percentage of flights arriving at runway within ±20 seconds of planned time | ≥97.8% (matching Tran et al. [^4]) |
| Gate hold time (min) | Additional time aircraft spend at gate due to metering | Should not increase disproportionately |
Robustness evaluation: Test the learned policy on at least 3 different airport layouts (e.g., EDDF Frankfurt, WSSS Singapore Changi, KJFK New York JFK) to assess transfer learning potential.
8. System Architecture
8.1 End-to-End Engineering Architecture
The system follows a standard ML platform architecture with aviation-specific components:
```
┌─────────────────────────────────────────────────────────────────┐
│                      DATA INGESTION LAYER                       │
│  OpenSky API → Kafka/Pulsar → Bronze (Raw ADS-B Parquet)        │
│  METAR Feed  → Bronze (Weather JSON)                            │
│  Flight Schedule → Bronze (Schedule CSV)                        │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                      PROCESSING LAYER                           │
│  Bronze → Silver: Dedup, filter, resample, map-match            │
│  Silver → Gold: Feature engineering, episode construction       │
│  Orchestration: Airflow / Dagster / Prefect                     │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                      ML TRAINING LAYER                          │
│  Feature Store (Feast/Hopsworks) ← Gold tables                  │
│  d3rlpy Training Pipeline (IQL/CQL/DT)                          │
│  MLflow Experiment Tracking + Model Registry                    │
│  Safety Evaluation Module (CMDP constraint checking)            │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                      SERVING LAYER                              │
│  Batch Inference: Daily policy evaluation on new data           │
│  Decision Support Dashboard (Streamlit/Grafana)                 │
│  Advisory Output: Recommended pushback times / speed profiles   │
└─────────────────────────────────────────────────────────────────┘
```
8.2 Lakehouse Data Stack (Bronze/Silver/Gold)
Following the Databricks medallion architecture [92][101][^104]:
Bronze Layer — Raw, immutable, append-only:
- `bronze.adsb_state_vectors`: Raw ADS-B messages (Parquet, partitioned by date/hour)
- `bronze.metar_weather`: Raw METAR observations
- `bronze.flight_schedules`: Airline schedule data
- `bronze.airport_config`: Runway configuration logs
Silver Layer — Cleaned, validated, deduplicated:
- `silver.surface_trajectories`: Map-matched trajectories with taxiway segment IDs
- `silver.taxi_episodes`: Complete gate-to-runway/runway-to-gate episodes
- `silver.traffic_state`: Per-timestep airport-wide traffic snapshot
- `silver.weather_interpolated`: Time-interpolated weather at airport
Gold Layer — Feature-engineered, ML-ready:
- `gold.rl_transitions`: ((s_t, a_t, r_t, s_{t+1}, d_t)) tuples for offline RL
- `gold.episode_metrics`: Per-episode summary (taxi time, fuel, delays)
- `gold.feature_store_sync`: Materialized features for serving
- `gold.evaluation_results`: Model performance metrics over time
Storage format: Delta Lake on object storage (MinIO for self-hosted, S3/ADLS for cloud). Delta provides ACID transactions, schema evolution, and time travel—essential for reproducible ML experiments.
8.3 Feature Store
Use Feast (open-source) or Hopsworks to serve features consistently between training and inference:
- Offline store: Parquet files in the Gold layer for historical feature retrieval during training
- Online store: Redis/DynamoDB for low-latency feature serving during batch inference
- Feature groups: `aircraft_state_features`, `traffic_context_features`, `weather_features`, `planning_features`
8.4 MLflow Experiment Tracking
MLflow provides the experiment management backbone [91][97][^100]:
- Experiments: Organize by algorithm (IQL, CQL, DT) and airport
- Runs: Each training configuration (hyperparameters, dataset version, random seed)
- Metrics: Log training loss, OPE scores (FQE), evaluation metrics (taxi time reduction, fuel savings)
- Artifacts: Save trained model checkpoints, evaluation plots, dataset metadata
- Model Registry: Version control for production-ready policies; stage transitions (Staging → Production)
```python
import mlflow

mlflow.set_experiment("airport-taxi-optimization/IQL")
with mlflow.start_run(run_name="iql_eddf_v3"):
    mlflow.log_params({
        "algorithm": "IQL",
        "expectile": 0.7,
        "batch_size": 256,
        "n_steps": 500000,
        "airport": "EDDF",
        "dataset_version": "2024Q2",
    })
    # Training loop with d3rlpy
    iql.fit(dataset, n_steps=500000)
    mlflow.log_metrics({
        "avg_taxi_time_reduction_pct": 12.3,
        "avg_fuel_reduction_pct": 18.7,
        "conflict_rate": 0.0,
        "schedule_conformance_pct": 98.1,
    })
    # d3rlpy policies are saved as .d3 files and logged as artifacts
    iql.save("model.d3")
    mlflow.log_artifact("model.d3")
```
8.5 Batch Inference Pipeline
The system operates in advisory mode (not autonomous control):
- Daily batch: Process previous day's operations through Bronze→Silver→Gold pipeline
- Policy evaluation: Run trained policy on reconstructed episodes to compute counterfactual metrics ("what would the RL policy have recommended?")
- Prospective advisory: For next-day schedule, generate recommended pushback times and speed profiles
- Dashboard update: Visualize comparison of actual vs. recommended operations
8.6 Decision Support Dashboard
Build with Streamlit or Grafana:
- Real-time airport surface map: Aircraft positions, congestion heatmap
- Recommended actions: Pushback hold times per terminal, speed advisories per taxiway
- Performance metrics: Rolling 7-day taxi time, fuel burn, delay comparisons (actual vs. RL-recommended)
- Historical analysis: Drill-down into individual episodes showing actual vs. optimal trajectories
9. Literature Review
9.1 Airport Taxi Optimization with RL
| Paper | Year | Method | Airport | Key Result |
|---|---|---|---|---|
| Tran et al. [1][4] | 2023 | PPO (speed control) | Singapore Changi | 29.5% fuel reduction, 97.8% schedule conformance |
| Ali et al. [10][20] | 2022 | DRL (departure metering) | Singapore Changi | 44% taxi delay reduction (medium traffic), hotspot features improve convergence |
| Watteau et al. [^2] | 2024 | PPO (multi-agent routing) | Montreal, Toronto | Shortest routes with on-time arrival via graph-based agent |
| Szymanski et al. [^13] | 2023 | Single/Multi-agent RL | Generic | RL routing with genetic algorithm route assignment |
| Ma et al. [^68] | 2022 | Data-driven + optimization | Beijing PEK | 5.1 min taxi-in reduction, 3.7 min taxi-out reduction via ADS-B analysis |
9.2 Departure Metering and Pushback Control
Ali et al.'s work at NTU on DRL-based departure metering remains the benchmark. Their MDP framework models spot metering at each terminal with a centralized DM agent. The introduction of taxiway hotspot features to capture spatial-temporal congestion evolution was a key innovation that significantly improved convergence [10][26]. Murca et al. (2017) provided a robust optimization approach to departure metering as an optimization baseline [^29].
9.3 Offline RL in Transportation
While offline RL has been extensively applied in robotics and autonomous driving [^18], its application to aviation ground operations remains nascent. The safety-critical nature of aviation aligns naturally with offline RL's non-exploratory paradigm. C2IQL represents the latest advance in safe offline RL, handling dynamic safety constraints via constraint conditioning [^16].
9.4 ADS-B Data Processing for Surface Operations
Szymanski et al.'s HMM-based map-matching algorithm (AIAA 2023) demonstrated 97-99% accuracy for reconstructing airport surface trajectories from ADS-B data [^17]. Schlosser et al. (JOAS 2024) extended surface trajectory analysis to include stochastic pavement roughness modeling using OpenSky Network data [^19]. The traffic Python library by Xavier Olive provides production-grade ADS-B trajectory processing with built-in airport surface support [^38].
9.5 Digital Twins for Aviation
NREL developed a digital twin framework for DFW Airport that combines SARIMA-based traffic demand forecasting with microscopic traffic simulation [^37]. This "digital twin intelligence platform" helps airport operations staff explore policy changes and infrastructure scenarios. Lu et al. (2025) applied digital twin technology to aircraft turnaround operations [^43].
10. Research Gaps and Opportunities
10.1 What Is Still Unsolved
- Offline RL specifically for airport surface operations: No published work applies CQL, IQL, or Decision Transformer to airport taxi optimization. All existing RL work uses online PPO/DRL [1][10].
- Multi-airport generalization: All existing approaches train per-airport. Transfer learning of taxi policies across airports with different layouts is unexplored.
- Joint optimization: Simultaneous optimization of pushback timing, taxi routing, and speed control within a single RL framework (current work addresses each in isolation).
- Human-in-the-loop evaluation: No study has evaluated RL taxi advisories with real ATCOs or pilots.
10.2 Where Offline RL Has Clear Advantages
- Data abundance: Decades of A-SMGCS/ADS-B data exist for major airports—far more than what's needed for offline RL training.
- Safety-first: No online exploration risk; policies can be validated offline before deployment.
- Counterfactual analysis: Offline RL naturally supports "what-if" analysis on historical operations.
- Incremental deployment: Start with advisory (decision support) mode before any autonomous control.
10.3 Publishable Contribution Opportunities
- "Offline RL for Airport Departure Metering: A CQL/IQL Approach with ADS-B Data" — First application of modern offline RL algorithms to airport surface optimization. Compare CQL, IQL, DT against PPO-based baselines from Tran et al. and Ali et al.
- "Safety-Constrained Offline RL for Airport Surface Movement via CMDP" — Apply C2IQL or Lagrangian-constrained IQL with explicit separation and runway incursion constraints.
- "ADS-B to MDP: An Open Pipeline for Airport Surface RL Datasets" — Release an open-source pipeline converting OpenSky Network ADS-B data into standardized offline RL datasets for multiple airports.
- "Transfer Learning of Taxi Policies Across Airport Layouts" — Investigate whether offline RL policies trained at one airport generalize to others via graph neural network state representations.
- "Decision Transformer for Long-Horizon Airport Surface Scheduling" — Leverage Transformer's sequence modeling for full-day airport surface scheduling with sparse end-of-day throughput rewards.
11. Implementation Roadmap
Phase 1: Data Foundation (Months 1-3)
| Milestone | Deliverable | Tools |
|---|---|---|
| M1.1 Airport graph construction | OSM-based taxiway graph for target airport (e.g., EDDF) | osmnx, networkx |
| M1.2 ADS-B data pipeline | Bronze layer ingestion from OpenSky (6+ months of data) | pyopensky, traffic, Delta Lake |
| M1.3 Map-matching | HMM map-matching module achieving >95% accuracy | hmmlearn, custom HMM |
| M1.4 Episode construction | Silver→Gold pipeline producing RL episodes with labeled (s,a,r,s',d) | PySpark/Polars, Airflow |
| M1.5 EDA and validation | Analysis notebook confirming data quality and feature distributions | Jupyter, Matplotlib |
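The map-matching milestone (M1.3) hinges on a Viterbi decode in which hidden states are taxiway segments, emissions score how close each ADS-B fix lies to a segment, and transitions only allow staying on or moving to an adjacent segment. The sketch below illustrates the idea on a made-up three-segment layout (segment names `N`, `N2`, `L` and all distances are hypothetical); a real implementation would use geodesic point-to-segment distances on the OSM taxiway graph:

```python
# Toy HMM map-matching: decode the most likely taxiway segment sequence
# from per-fix distances. Layout and distances are illustrative only.

SEGMENTS = ["N", "N2", "L"]  # hypothetical taxiway segments
ADJACENT = {"N": {"N", "N2"}, "N2": {"N", "N2", "L"}, "L": {"N2", "L"}}

def emission_logp(dist_m, sigma=10.0):
    """Gaussian emission: fixes closer to a segment are more likely on it."""
    return -0.5 * (dist_m / sigma) ** 2

def viterbi(dists_per_fix):
    """dists_per_fix[t][seg] = distance (m) of ADS-B fix t to each segment."""
    logp = {s: emission_logp(dists_per_fix[0][s]) for s in SEGMENTS}
    back = []
    for obs in dists_per_fix[1:]:
        new_logp, ptr = {}, {}
        for s in SEGMENTS:
            # Best predecessor among segments from which s is reachable.
            prev, best = max(
                ((p, logp[p]) for p in SEGMENTS if s in ADJACENT[p]),
                key=lambda kv: kv[1],
            )
            new_logp[s] = best + emission_logp(obs[s])
            ptr[s] = prev
        logp, back = new_logp, back + [ptr]
    # Backtrack the most likely segment sequence.
    last = max(logp, key=logp.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

fixes = [
    {"N": 3.0, "N2": 40.0, "L": 80.0},
    {"N": 25.0, "N2": 5.0, "L": 45.0},
    {"N": 70.0, "N2": 30.0, "L": 4.0},
]
path = viterbi(fixes)
```

The adjacency constraint is what lifts this above nearest-segment snapping: a noisy fix near a parallel taxiway cannot cause a physically impossible jump.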
Phase 2: Environment and Baselines (Months 3-5)
| Milestone | Deliverable | Tools |
|---|---|---|
| M2.1 Custom airport simulator | Gymnasium-compatible env with graph navigation, conflict detection | Gymnasium, NetworkX |
| M2.2 Simulator calibration | Validate simulator dynamics against historical data (speed profiles, taxi times) | Statistical testing |
| M2.3 FIFO baseline | Implement and evaluate FIFO pushback policy | Custom |
| M2.4 MILP baseline | Implement simplified MILP scheduler for comparison | PuLP/Gurobi |
| M2.5 BC baseline | Train behavioral cloning on historical data | d3rlpy |
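The FIFO baseline (M2.3) is worth pinning down precisely, since every RL result is reported against it. A minimal sketch, assuming a fixed 60-second minimum separation between consecutive pushbacks (an illustrative value, not a validated one):

```python
# FIFO pushback baseline: release aircraft in ready-time order, enforcing a
# minimum separation between consecutive pushbacks. Flight IDs are made up.

def fifo_pushback(ready_times, min_sep_s=60):
    """Return approved pushback times (s), FIFO over ready times."""
    schedule = {}
    next_free = 0.0
    for flight, t_ready in sorted(ready_times.items(), key=lambda kv: kv[1]):
        t_push = max(t_ready, next_free)  # wait until both ready and slot free
        schedule[flight] = t_push
        next_free = t_push + min_sep_s
    return schedule

ready = {"AFR101": 0.0, "DLH400": 10.0, "BAW902": 200.0}
sched = fifo_pushback(ready)
# AFR101 pushes at 0; DLH400 is held to 60 by separation; BAW902 at 200.
```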
Phase 3: Offline RL Training (Months 5-8)
| Milestone | Deliverable | Tools |
|---|---|---|
| M3.1 IQL training | Train IQL on offline dataset with hyperparameter sweep | d3rlpy, MLflow [93][97] |
| M3.2 CQL training | Train CQL for conservative comparison | d3rlpy |
| M3.3 DT training | Train Decision Transformer for long-horizon variant | HuggingFace Transformers |
| M3.4 Safety constraints | Implement C2IQL or Lagrangian IQL with CMDP constraints | Custom on d3rlpy |
| M3.5 OPE evaluation | Off-policy evaluation of all policies using FQE | d3rlpy OPE, SCOPE-RL [^82] |
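For M3.1, d3rlpy implements IQL end-to-end; the distinguishing ingredient is the expectile regression loss on the value function, L(u) = |τ − 1{u < 0}| · u², with u = Q(s, a) − V(s). With τ > 0.5 the loss penalizes underestimation of V more than overestimation, pushing V toward an upper expectile of Q without ever querying out-of-distribution actions. A pure-Python illustration (τ = 0.7 is an illustrative choice, not a tuned hyperparameter):

```python
# The asymmetric expectile loss at the core of IQL's value update.

def expectile_loss(u, tau=0.7):
    """L(u) = |tau - 1{u < 0}| * u^2, with u = Q(s, a) - V(s)."""
    weight = tau if u > 0 else (1.0 - tau)
    return weight * u * u

# V underestimating Q (u = +1) costs more than overestimating it (u = -1),
# which is what drags V(s) toward the upper expectile of Q(s, a):
low = expectile_loss(-1.0)   # weight 0.3
high = expectile_loss(1.0)   # weight 0.7
```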
Phase 4: Evaluation and Analysis (Months 8-10)
| Milestone | Deliverable | Tools |
|---|---|---|
| M4.1 Comparative analysis | Head-to-head evaluation of all methods on test scenarios | Custom evaluation framework |
| M4.2 Robustness testing | Evaluate on unseen traffic densities and weather conditions | Simulator perturbation |
| M4.3 Multi-airport test | Transfer evaluation to 2nd airport (e.g., LFPG Paris CDG) | Same pipeline, new data |
| M4.4 Paper draft | Write research paper for AIAA or IEEE TITS | LaTeX |
Phase 5: Deployment Demo (Months 10-12)
| Milestone | Deliverable | Tools |
|---|---|---|
| M5.1 Feature store | Production feature pipeline with Feast | Feast, Redis |
| M5.2 Batch inference | Daily batch prediction pipeline | Airflow, d3rlpy |
| M5.3 Dashboard | Interactive decision support dashboard | Streamlit, Plotly |
| M5.4 Documentation | Technical documentation and user guide | MkDocs |
| M5.5 Open-source release | Release data pipeline and baseline models | GitHub |
12. Key Libraries and Tools Reference
| Category | Tool | Purpose |
|---|---|---|
| ADS-B Data | pyopensky [^35] | OpenSky Network data access |
| Trajectory Processing | traffic [^38] | ADS-B trajectory analysis |
| Airport Graph | osmnx, networkx | Taxiway graph from OSM |
| Offline RL | d3rlpy [93][99] | CQL, IQL, BCQ, TD3+BC, DT |
| RL Environments | gymnasium, BlueSky-Gym [^15] | Environment interface |
| Experiment Tracking | MLflow [97][100] | Metrics, models, artifacts |
| Feature Store | Feast | Online/offline feature serving |
| Data Processing | PySpark, Polars, Delta Lake | Lakehouse ETL |
| Orchestration | Airflow / Dagster | Pipeline scheduling |
| Visualization | Streamlit, Grafana, Plotly | Dashboards |
| Simulation | BlueSky [^12], custom | Airport surface simulation |
| Fuel Estimation | OpenAP, ICAO Engine DB [77][86] | Fuel burn calculation |
| OPE | SCOPE-RL [^82], d3rlpy OPE | Off-policy evaluation |
| Safety RL | OmniSafe, custom CMDP | Constrained RL [^47] |
13. Concrete Recommendations
- Start with IQL on OpenSky data for Frankfurt Airport (EDDF). Your proximity and domain knowledge of EDDF make it an ideal first target. Use pyopensky to extract 12 months of surface trajectories.
- Build the map-matching pipeline first. This is the highest-risk component—without accurate trajectory reconstruction, all downstream RL is unreliable. Validate against known taxiway routes.
- Use d3rlpy as the RL backbone. It provides all needed algorithms with minimal code, MLflow integration, and active maintenance [^93].
- Implement the CMDP/safety layer before any deployment claims. Aviation regulators and stakeholders will immediately ask about safety guarantees. C2IQL or Lagrangian IQL provides the formal framework [16][47].
- Publish the data pipeline as an open-source contribution. There is no standard open dataset for airport surface RL. Creating one—even for a single airport—would be a significant community contribution and attract citations.
- Target IEEE Transactions on Intelligent Transportation Systems or AIAA Journal for publication. Ali et al.'s departure metering work was published in IEEE TITS [^20]; Tran et al.'s in the Transactions of the Japan Society for Aeronautical and Space Sciences [^1].
- Connect this to your PhD in AI + Logistics. Airport surface movement is a logistics optimization problem—aircraft routing on a constrained network with time windows, capacity constraints, and stochastic disruptions. Frame it as "logistics optimization via offline RL" for maximum relevance to your dissertation.
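The Lagrangian-constrained approach recommended above reduces, at its core, to a dual-ascent update on the multiplier: λ rises while the estimated constraint cost (e.g. separation violations per episode) exceeds its budget d, and decays back toward zero once the policy is safe. A minimal sketch (learning rate, budget, and the cost sequence are illustrative assumptions):

```python
# Dual ascent on the Lagrange multiplier for a single CMDP constraint:
#   lambda <- max(0, lambda + lr * (Jc - d))
# where Jc is the estimated constraint cost and d the allowed budget.

def lagrangian_update(lmbda, cost_estimate, budget, lr=0.1):
    """One dual-ascent step; lambda is clipped at zero."""
    return max(0.0, lmbda + lr * (cost_estimate - budget))

lmbda = 0.0
for cost in [2.0, 1.5, 1.0, 0.4, 0.4]:  # hypothetical per-epoch costs
    lmbda = lagrangian_update(lmbda, cost, budget=0.5)
```

In the full algorithm this update interleaves with policy training, where the reward is penalized by λ times the constraint cost, so persistent violations make safety progressively more expensive for the policy.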
References
- Towards Greener Airport Surface Operations: A Reinforcement Learning Approach for Autonomous Taxiing
- Optimizing Airport Ground Movements Using Multi-Agents Reinforcement Learning. AIAA Aviation Forum and ASCEND Co-located Conference Proceedings.
- Delft University of Technology
- Towards Greener Airport Surface Operations: A Reinforcement Learning Approach for Autonomous Taxiing
- Groot, D. J., et al. BlueSky-Gym: Reinforcement Learning Environments Built upon the Gymnasium API and the BlueSky Air Traffic Simulator.
- Ali, H., et al. A Deep Reinforcement Learning Approach for Airport Departure Metering.
- BlueSky ATC Simulator Project: An Open Data and Open Source Approach.
- Szymanski, M., et al. (2023). Single and Multi-Agent Reinforcement Learning Approach for Aircraft Routing.
- TUDelft-CNS-ATM/bluesky-gym: A Gymnasium-style environment for standardized reinforcement learning research in air traffic management (GitHub repository).
- C2IQL: Constraint-Conditioned Implicit Q-Learning for Safe Offline Reinforcement Learning.
- Development of a Map-Matching Algorithm for the Analysis of Aircraft Ground Trajectories Using ADS-B Data. AIAA AVIATION Forum. https://doi.org/10.2514/6.2023-3758
- Khaitan, S., et al. (2023). Exploring Reinforcement Learning Approaches for Safety.
- Journal of Open Aviation Science (2024), Vol. 2.
- A Deep Reinforcement Learning Approach for Airport Departure Metering Under Spatial–Temporal Airside Interactions.
- When Should We Prefer Decision Transformers for Offline Reinforcement Learning?
- OpenSky Network Data.
- ADS-B Exchange: Serving the Flight Tracking Enthusiast.
- Deep Reinforcement Learning Based Airport Departure Metering.
- OpenSky Network.
- Murça, M. C. R. (2017). A Robust Optimization Approach for Airport Departure Metering.
- Kumar, A., Zhou, A., Tucker, G., Levine, S. Conservative Q-Learning for Offline Reinforcement Learning. UC Berkeley and Google Research, Brain Team.
- Real-Time Airport Surface Movement Planning: Minimizing Aircraft Emissions.
- Conservative Q-Learning for Offline Reinforcement Learning.
- Airport Surface Traffic Optimization and Simulation. Ph.D. thesis, Massachusetts Institute of Technology, Department of Aeronautics and Astronautics.
- open-aviation/pyopensky: The Python interface for the OpenSky database (GitHub repository).
- Conservative Q-Learning for Offline Reinforcement Learning. arXiv.
- Airport Surface Transportation Digital Twin Framework.
- xoolive/traffic: A toolbox for processing and analysing air traffic data (GitHub repository).
- Lu, J., et al. (2025). Harnessing Digital Twin Technology for Enhanced Aircraft Turnaround Operations.
- Offline Reinforcement Learning with Implicit Q-Learning. OpenReview.
- Batch-Constrained Q-Learning (BCQ), Offline RL.
- An Empirical Study of Lagrangian Methods in Safe Reinforcement Learning.
- Offline Reinforcement Learning with Implicit Q-Learning. arXiv.
- Improving TD3-BC: Relaxed Policy Constraint for Offline Reinforcement Learning.
- A Survey of Safe Reinforcement Learning and Constrained MDPs.
- Offline Reinforcement Learning (blog post covering BCQ).
- sfujim/BCQ: Author's PyTorch implementation of BCQ for continuous and discrete actions (GitHub repository).
- Decision Transformer: Reinforcement Learning via Sequence Modeling.
- D4RL Benchmark: Offline RL Evaluation.
- Decision Transformer: Reinforcement Learning via Sequence Modeling (Semantic Scholar).
- Off-Policy Evaluation. Farama-Foundation/D4RL: A collection of reference environments for offline reinforcement learning.
- Ma, J., et al. (2022). Data-Driven Trajectory-Based Analysis and Optimization of Airport Surface Operations.
- Decision Transformer: Reinforcement Learning via Sequence Modeling.
- D4RL: Building Better Benchmarks for Offline Reinforcement Learning.
- Advanced Surface Movement Guidance and Control System.
- Khadilkar, H., et al. Estimation of Aircraft Taxi-Out Fuel Burn Using Flight Data Recorder Archives.
- Simulation of Urban MObility (Wikipedia).
- Eclipse SUMO: Simulation of Urban MObility, an open-source, portable, microscopic and continuous multi-modal traffic simulation package.
- Fuel Estimation for Operational Performance Benchmarking: Model.
- Track Model Development Using MLflow (Azure Databricks documentation).
- Building the Data Lake (Bronze, Silver, Gold Architecture).
- takuseno/d3rlpy: An offline deep reinforcement learning library (GitHub repository).
- MLflow Tracking (documentation).
- Seno, T. (2022). d3rlpy: An Offline Deep Reinforcement Learning Library.
- MLflow Model Registry (documentation).
- What Is the Medallion Lakehouse Architecture?
- What Is a Medallion Architecture?