Airport Taxi Optimization with Offline Reinforcement Learning Using ADS-B Data
A Comprehensive Engineering Research Report
1. Introduction and Problem Statement
Airport surface congestion costs the US aviation industry approximately $900 million annually in excess fuel burn alone [^1]. The inefficiency of ground movement—including prolonged taxi times, departure queue congestion, and suboptimal pushback sequencing—directly impacts fuel consumption, carbon emissions, airline operational costs, and passenger experience. Airport Departure Metering (DM) and taxi optimization represent high-value targets for AI-driven improvement [10][20].
This report provides an engineering-focused blueprint for building a production-grade system that uses historical ADS-B trajectory data and offline reinforcement learning to optimize airport ground movement. The focus is on actionable implementation guidance, concrete algorithm selection, and a realistic deployment pathway.
Why Offline RL?
Aviation is a safety-critical domain where online exploration is infeasible—you cannot randomly experiment with aircraft ground movements. Offline RL learns policies exclusively from pre-collected historical data without any real-time environment interaction [36][48]. This makes it the natural paradigm for airport surface optimization, where decades of operational data exist but experimentation carries unacceptable risk.
2. Problem Formulation: MDP and CMDP Design
2.1 Markov Decision Process Formulation
Airport ground movement is modeled as a Markov Decision Process (MDP) defined by the tuple ((\mathcal{S}, \mathcal{A}, P, R, \gamma)) [4][10]:
- (\mathcal{S}): State space representing the airport surface situation
- (\mathcal{A}): Action space for controlling aircraft movement
- (P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]): Transition probability
- (R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}): Reward function
- (\gamma \in [0,1]): Discount factor
The optimal policy (\pi^*) maximizes the expected cumulative discounted reward:
[
\pi^* = \arg\max_\pi \mathbb{E}_\pi \left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]
]
2.2 State Space Design
The state vector should capture four categories of features, following the framework validated by Tran et al. and Ali et al. at NTU [4][10]:
| Category | Features | Source |
|---|---|---|
| Planning (F1) | Distance to target, expected arrival time, remaining route segments, assigned runway | Flight plan / AMAN |
| Ego Aircraft (F2) | Current speed, heading, position on graph, aircraft type/weight class, fuel state | ADS-B / A-SMGCS |
| Traffic/Environment (F3) | Queue length at runway, number of aircraft on taxiway segment, hotspot occupancy, conflict proximity | ADS-B aggregate |
| Context (F4) | Runway configuration, wind direction/speed, visibility, time-of-day, day-of-week | METAR / Airport ops |
Taxiway hotspot features deserve special attention. Ali et al. demonstrated that encoding spatial-temporal congestion levels at known bottleneck intersections significantly improves DM policy convergence during training [20][10]. Define hotspot occupancy as the count of aircraft within a configurable radius of each critical taxiway junction.
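As a concrete sketch of the hotspot-occupancy feature, the helper below counts aircraft within a configurable radius of each critical junction. The 300 m radius and the coordinates used in any example are illustrative choices, not values from Ali et al.

```python
import math

def hotspot_occupancy(aircraft_positions, hotspots, radius_m=300.0):
    """Count aircraft within radius_m of each hotspot junction.

    aircraft_positions: list of (lat, lon) tuples from ADS-B
    hotspots: dict mapping junction name -> (lat, lon)
    Returns a dict mapping junction name -> occupancy count.
    """
    def haversine_m(p, q):
        # great-circle distance in metres between two (lat, lon) points
        lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371000 * math.asin(math.sqrt(a))

    return {name: sum(1 for p in aircraft_positions
                      if haversine_m(p, centre) <= radius_m)
            for name, centre in hotspots.items()}
```

The resulting per-junction counts slot directly into the F3 traffic feature group above.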
2.3 Action Space
Two formulations are viable depending on the optimization level:
Speed Control (Continuous): The action is the acceleration of the controlled aircraft at each timestep, (a_t \in [-a_{max}, +a_{max}]), where (a_{max}) is estimated from historical A-SMGCS data [^4]. This is suitable for individual aircraft taxi speed optimization.
Pushback Control (Discrete): The action is a binary hold/release decision per departure at each terminal gate. This is the formulation used for departure metering [10][26]. At each decision epoch, the DM agent decides whether to hold or pushback each ready-to-depart aircraft.
Hybrid: A hierarchical approach combining pushback sequencing at the strategic level with speed profile optimization at the tactical level.
2.4 Multi-Objective Reward Design
The reward function must balance competing objectives. Following Tran et al.'s validated design [4][1]:
[
R(s_t, a_t) = w_1 R_{time} + w_2 R_{fuel} + w_3 R_{safety} + w_4 R_{throughput}
]
| Component | Formula Concept | Rationale |
|---|---|---|
| (R_{time}) | Penalty proportional to deviation from reference speed (v_{ref}) to reach target on-time | Encourages timely arrival |
| (R_{fuel}) | Negative fuel burn estimate based on speed and acceleration profile | Minimizes fuel consumption |
| (R_{safety}) | Large negative penalty for conflicts (aircraft separation < threshold) or invalid actions | Hard safety constraint |
| (R_{throughput}) | Bonus for completing taxi within planned window | Maintains schedule conformance |
Reference speed is defined as (v_{ref} = d_{remaining} / t_{remaining}), giving the agent a dynamic target [^4]. The agent learns to track this reference while optimizing fuel and avoiding conflicts.
Practical weight tuning guidance: Start with equal weights and use sensitivity analysis. Tran et al. found that excessively penalizing fuel burn makes aircraft maintain constant speed disregarding traffic, while too-high conflict penalties cause aircraft to freeze in place [^4].
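A minimal per-step sketch of this weighted reward, using the dynamic reference speed (v_{ref} = d_{remaining} / t_{remaining}) from the text. The weights, the speed/acceleration fuel proxy, and the separation threshold are illustrative placeholders, not Tran et al.'s calibrated terms.

```python
def taxi_reward(speed, accel, dist_remaining, t_remaining, min_separation_m,
                w_time=1.0, w_fuel=1.0, w_safety=1.0, sep_threshold_m=60.0):
    """Per-step reward sketch: track v_ref, penalize a fuel proxy,
    and apply a large penalty when separation is violated."""
    v_ref = dist_remaining / max(t_remaining, 1.0)   # dynamic reference speed
    r_time = -abs(speed - v_ref)                     # deviation from reference
    r_fuel = -(0.1 * speed + 0.5 * abs(accel))       # crude fuel-burn proxy
    r_safety = -100.0 if min_separation_m < sep_threshold_m else 0.0
    return w_time * r_time + w_fuel * r_fuel + w_safety * r_safety
```

With equal weights, an on-reference, unaccelerated aircraft pays only the small fuel proxy, while a separation violation dominates the step reward, mirroring the hard-penalty design in the table.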
2.5 Constrained MDP (CMDP) Formulation
For safety-critical deployment, extend the MDP to a CMDP where safety constraints are explicit rather than embedded in reward weights [47][50]:
[
\max_{\pi} J^R(\pi) \quad \text{s.t.} \quad J^{C_i}(\pi) \leq d_i, \quad i = 1, \dots, m
]
where (J^{C_i}(\pi)) are expected cumulative costs (e.g., number of separation violations, runway incursions) and (d_i) are tolerance thresholds. This is solved via Lagrangian relaxation [^47]:
[
\min_{\lambda \geq 0} \max_\pi \left[ J^R(\pi) - \sum_{i=1}^{m} \lambda_i (J^{C_i}(\pi) - d_i) \right]
]
The Lagrange multipliers (\lambda_i) are interpreted as the "price" of violating each constraint and can be updated via gradient ascent or PID-controlled updates during training [^47].
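The projected gradient-ascent update on a single multiplier can be sketched in one line; the learning rate is an illustrative choice.

```python
def dual_ascent_update(lmbda, cost_estimate, threshold, lr=0.01):
    """One projected gradient-ascent step on a Lagrange multiplier:
    lambda <- max(0, lambda + lr * (J^C(pi) - d)).
    The multiplier grows while the constraint is violated and decays
    (down to zero) once the policy satisfies it."""
    return max(0.0, lmbda + lr * (cost_estimate - threshold))
```

Running this update each training iteration raises the "price" of a violated constraint until the policy respects it, which is the behaviour the PID-controlled variant smooths out.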
3. Data Layer: ADS-B Based Pipeline
3.1 ADS-B Ground Trajectory Reconstruction
ADS-B (Automatic Dependent Surveillance–Broadcast) transmits aircraft position, velocity, altitude, and identification at approximately 1 Hz. On the airport surface, ADS-B provides the raw signal for reconstructing taxi trajectories [17][22].
Key data fields from ADS-B surface messages:
| Field | Description | Use |
|---|---|---|
| `icao24` | Aircraft unique identifier | Track identity |
| `latitude`, `longitude` | Position (WGS-84) | Trajectory reconstruction |
| `velocity` | Ground speed (knots) | Speed profile analysis |
| `heading` | Track angle | Direction of movement |
| `on_ground` | Boolean surface flag | Filter ground movements |
| `timestamp` | Unix epoch time | Temporal sequencing |
3.2 Data Sources
| Source | Access | Coverage | Resolution | Cost |
|---|---|---|---|---|
| OpenSky Network | Trino DB (authenticated), REST API | Global, 30k+ receivers | ~1-2 Hz (cooperative) | Free for research [22][28] |
| ADS-B Exchange | API (tiered), Historical archive | Global, unfiltered | ~1 Hz | Paid for historical [^25] |
| A-SMGCS (airport-specific) | Airport authority agreement | Single airport, second-by-second | 1 Hz (cooperative + non-cooperative) | Restricted [10][71] |
| Eurocontrol DDR2 | Application required | European airspace | Flight plan level | Research access |
Recommended starting point: OpenSky Network via the pyopensky library [^35] and the traffic Python library [^38]. These provide immediate access to historical ADS-B state vectors with Trino SQL queries.
```python
# Example: Query surface movements at Frankfurt Airport (EDDF) using pyopensky
from pyopensky.trino import Trino

trino = Trino()
df = trino.query("""
    SELECT time, icao24, lat, lon, velocity, heading, onground, callsign
    FROM state_vectors_data4
    WHERE onground = true
      AND lat BETWEEN 49.95 AND 50.10
      AND lon BETWEEN 8.45 AND 8.65
      AND hour BETWEEN '2024-06-01T00:00:00Z' AND '2024-06-30T23:59:59Z'
""")
```
3.3 Map Matching to Taxiway/Runway Graph
Raw ADS-B positions must be snapped to the airport's taxiway graph. Szymanski et al. (AIAA 2023) developed a validated map-matching algorithm that achieves 97-99% accuracy while processing trajectories of more than 100 points in under 2 seconds [^17].
Algorithm: Hidden Markov Model (HMM) based map-matching:
- Graph construction: Automatically generate a directed graph from OpenStreetMap (OSM) data for any airport. Nodes are taxiway junctions; edges are taxiway segments with attributes (type, name, bearing, distance, speed limit) [^17].
- Emission probability: For each ADS-B position, compute the likelihood of being on each candidate edge based on perpendicular distance.
- Transition probability: Compute likelihood of transitioning between edges based on route plausibility (shortest path distance vs. great-circle distance).
- Viterbi decoding: Find the most likely sequence of edges (the matched trajectory).
Implementation tools:
- `osmnx`: Extract airport taxiway network from OSM
- `networkx`: Graph representation and shortest path computation
- Custom HMM or `hmmlearn`: Viterbi decoding for map matching
- `traffic` library: Built-in airport surface operations support [^38]
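To make the four HMM steps concrete, here is a toy Viterbi decoder over precomputed point-to-edge distances. The Gaussian emission width and the flat edge-switch penalty are simplifying assumptions standing in for the route-plausibility transition model; this is not Szymanski et al.'s calibrated algorithm.

```python
def viterbi_match(point_edge_dists, transition_penalty=1.0, sigma_m=20.0):
    """Toy HMM map-matcher. point_edge_dists[t][e] is the perpendicular
    distance (m) from ADS-B fix t to candidate edge e. Emission log-prob
    is Gaussian in that distance; a flat log-penalty discourages edge
    switches. Returns the most likely edge index for each fix."""
    n_edges = len(point_edge_dists[0])
    emit = lambda d: -0.5 * (d / sigma_m) ** 2

    score = [emit(d) for d in point_edge_dists[0]]
    backptrs = []
    for dists in point_edge_dists[1:]:
        new_score, ptr = [], []
        for e, d in enumerate(dists):
            # best predecessor under the switch penalty
            prev = max(range(n_edges),
                       key=lambda p: score[p] - (0.0 if p == e else transition_penalty))
            ptr.append(prev)
            new_score.append(score[prev]
                             - (0.0 if prev == e else transition_penalty) + emit(d))
        score = new_score
        backptrs.append(ptr)

    # backtrack the highest-scoring edge sequence
    path = [max(range(n_edges), key=lambda e: score[e])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]
```

The switch penalty is what makes the decoder robust to GPS noise: a single fix that momentarily looks closer to a parallel taxiway does not flip the matched edge.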
3.4 Feature Engineering for Ground Operations
After map-matching, engineer the following features per aircraft per timestep:
Spatial features:
- Edge ID (current taxiway segment)
- Distance traveled on current edge, distance remaining to target
- Number of intersections remaining on route
- Euclidean distance to runway threshold
Traffic features:
- Count of aircraft on same taxiway segment
- Count of aircraft within 200m, 500m, 1000m radii
- Queue length at assigned runway (aircraft between holding point and threshold)
- Hotspot density: aircraft count at each of the top-K congested junctions [^20]
- Nearest conflicting aircraft: distance, relative speed, relative heading
Temporal features:
- Scheduled departure time minus current time
- Historical average taxi time for this route
- Time since pushback
Environmental features:
- Active runway configuration (extracted from METAR/ATIS)
- Wind speed and direction
- Visibility category (CAT I/II/III)
- Hour of day, day of week (one-hot or cyclical encoding)
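The cyclical encoding mentioned above keeps 23:00 and 00:00 adjacent in feature space, which one-hot encoding does not; a minimal helper:

```python
import math

def cyclical_encode(value, period):
    """Encode a cyclic feature (hour of day, day of week) as (sin, cos)
    so that values near the period boundary end up close together."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)
```

Used as `cyclical_encode(hour, 24)` and `cyclical_encode(weekday, 7)`, this yields two floats per feature instead of 24 or 7 one-hot columns.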
3.5 Handling Data Quality Issues
| Issue | Mitigation |
|---|---|
| Irregular sampling (0.5-5 Hz) | Resample to fixed 1 Hz using linear interpolation for position, zero-order hold for discrete states |
| GPS noise (±10-50m) | Kalman filter before map-matching; the HMM map-matching itself acts as a spatial filter [^17] |
| Missing data (gaps > 30s) | Segment trajectories at gaps; discard segments shorter than minimum taxi time |
| Mixed ground/airborne | Filter using on_ground flag combined with altitude < 200 ft AGL and speed < 80 knots |
| Non-aircraft targets | Filter by known aircraft icao24 identifiers; cross-reference with flight schedules |
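A stdlib-only sketch of the first and third mitigations combined: segment a track at reporting gaps, then resample each surviving segment to 1 Hz with linear position interpolation. The 30 s gap threshold is the table's value; dropping single-fix fragments stands in for the minimum-taxi-time filter.

```python
import math

def segment_and_resample(fixes, max_gap_s=30.0):
    """Split a time-sorted (timestamp, lat, lon) track at gaps longer than
    max_gap_s, then resample each segment to 1 Hz by linear interpolation."""
    # 1) segment the track at reporting gaps
    segments, current = [], [fixes[0]]
    for prev, cur in zip(fixes, fixes[1:]):
        if cur[0] - prev[0] > max_gap_s:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)

    # 2) resample each segment at integer-second timestamps
    def resample(seg):
        out, i = [], 0
        t = math.ceil(seg[0][0])
        while t <= seg[-1][0]:
            while seg[i + 1][0] < t:          # advance to the bracketing pair
                i += 1
            (t0, la0, lo0), (t1, la1, lo1) = seg[i], seg[i + 1]
            w = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
            out.append((t, la0 + w * (la1 - la0), lo0 + w * (lo1 - lo0)))
            t += 1
        return out

    # discard fragments too short to be a usable taxi segment
    return [resample(s) for s in segments if len(s) > 1]
```

Discrete states (edge ID, runway configuration) should instead use zero-order hold, per the table.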
3.6 Offline Dataset Construction
The final offline RL dataset follows the format (\mathcal{D} = \{(s_t, a_t, r_t, s_{t+1}, d_t)\}_{t=1}^{N}) where (d_t) indicates episode termination.
Episode definition: One episode = one taxi operation (gate to runway for departures, runway to gate for arrivals). Each episode begins at pushback/touchdown and ends at runway entry/gate arrival.
Action labeling from historical data: Since historical data reflects what controllers/pilots actually did (the behavior policy (\pi_\beta)), actions are extracted as:
- Speed control: (a_t = (v_{t+1} - v_t) / \Delta t) (observed acceleration)
- Pushback control: Binary hold/release derived from actual pushback times vs. scheduled times
Reward labeling: Compute reward retrospectively using the multi-objective function above, with fuel burn estimated using the ICAO emission model (fuel flow rate × engine time × number of engines) [77][86].
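A sketch of this labeling step on a 1 Hz speed profile, covering the observed-acceleration action and the per-step fuel term of the reward. The 0.11 kg/s per-engine taxi fuel flow is a ballpark narrow-body figure used for illustration, not a value looked up in the ICAO engine emissions databank.

```python
def label_transitions(speeds, dt=1.0, fuel_flow_kg_s=0.11, n_engines=2):
    """Derive behaviour-policy actions (observed acceleration) and a
    per-step fuel-burn reward term from a 1 Hz taxi speed profile.
    F = f * N * T from the ICAO model, applied one timestep at a time."""
    actions = [(v1 - v0) / dt for v0, v1 in zip(speeds, speeds[1:])]
    fuel_per_step = fuel_flow_kg_s * n_engines * dt
    rewards = [-fuel_per_step] * len(actions)
    return actions, rewards
```

The time, safety, and throughput reward components from Section 2.4 would be added to each element of `rewards` before writing the transitions to the dataset.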
4. Environment and Simulation
4.1 BlueSky and BlueSky-Gym
BlueSky is an open-source air traffic simulator developed at TU Delft, capable of fast-time simulation of air traffic including ground movements [^12]. BlueSky-Gym wraps BlueSky with a Gymnasium-compatible API for RL research [3][6][^15].
Capabilities:
- Aircraft performance models (OpenAP database)
- TrafScript command language for traffic control
- Fast-time simulation (100x+ real-time)
- Plugin architecture for custom extensions
- 7 built-in RL environments (descent, conflict resolution, merging, waypoint planning) [^6]
Limitations for airport surface simulation:
- No built-in taxiway network modeling (airside graph must be custom-built)
- No pushback/gate dynamics
- No ground vehicle interactions
- Limited surface conflict detection
- No weather impact on surface operations
Recommendation: Use BlueSky-Gym as the foundation but build a custom airport surface plugin that adds taxiway graph navigation, pushback queuing, and surface conflict detection.
4.2 Custom Graph-Based Airport Simulator
For departure metering and taxi optimization, a purpose-built simulator is more practical. Ali et al. built a representative airside simulator for Singapore Changi Airport with the following components [^10]:
- Airside network graph: Nodes (gates, taxiway junctions, runway thresholds), edges (taxiway segments with capacity and speed constraints)
- Traffic flow model: Aircraft move along assigned routes with stochastic speed profiles calibrated from A-SMGCS data
- Episode generation: Sample departure schedules from historical data, stochastically perturb pushback-ready times
- Conflict model: Check separation constraints at each timestep; enforce first-come-first-served at intersections
```python
# Pseudocode: Minimal Airport Surface Simulator
import gymnasium

class AirportSurfaceEnv(gymnasium.Env):
    def __init__(self, airport_graph, schedule):
        self.graph = airport_graph    # networkx DiGraph
        self.schedule = schedule      # list of (callsign, gate, runway, time)
        self.aircraft = {}            # active aircraft states

    def step(self, action):
        # action: dict of {callsign: acceleration} or {callsign: hold/release}
        for ac in self.aircraft.values():
            ac.update_position(action, self.graph)
        conflicts = self.detect_conflicts()
        reward = self.compute_reward(conflicts)
        obs = self.get_observation()
        terminated = all(ac.reached_target for ac in self.aircraft.values())
        # Gymnasium API: (obs, reward, terminated, truncated, info)
        return obs, reward, terminated, False, {"conflicts": conflicts}

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # sample a new episode from the schedule
        episode = self.sample_episode()
        self.aircraft = self.initialize_aircraft(episode)
        return self.get_observation(), {}
```
4.3 SUMO as an Alternative
Eclipse SUMO (Simulation of Urban Mobility) is a microscopic traffic simulator that can model individual vehicle movements on a network [78][84]. While designed for road traffic, it can be adapted for airport surface movement by treating taxiways as road segments. However, aircraft dynamics (acceleration profiles, minimum turn radii, wake turbulence) differ substantially from road vehicles, making a custom simulator preferable.
4.4 Creating Offline Datasets for RL Training
Two approaches for generating the offline training dataset:
Approach A: Direct from historical data (recommended for initial development)
- Extract trajectories from OpenSky/A-SMGCS
- Map-match to taxiway graph
- Label states, actions, rewards as described in Section 3.6
- This gives a "behavior policy" dataset reflecting actual controller decisions
Approach B: Simulator-generated (recommended for augmentation)
- Run the custom simulator with rule-based policies (FIFO, shortest path, speed-optimal)
- Collect diverse trajectories covering various traffic densities
- Mix with historical data for broader state-action coverage
Dataset quality indicators (from D4RL best practices [61][70]):
- Coverage: Trajectories should span low, medium, and high traffic densities
- Diversity: Include multiple behavior policies (expert, suboptimal, mixed)
- Sufficiency: Aim for >100,000 episodes across at least 6 months of operations
4.5 Sim-to-Real Considerations
Key gaps between simulation and real operations:
- Communication delays: Real ATC instructions have latency not modeled in simulation
- Pilot compliance variance: Pilots may not follow speed advisories precisely
- Non-modeled factors: Ground vehicles, construction, engine startup delays
- Weather dynamics: Surface conditions change continuously
Mitigation: Use domain randomization during training (randomize speed compliance, add noise to transition dynamics) and conservative policy constraints during deployment.
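Domain randomization of the transition dynamics can be as simple as perturbing the commanded action before the simulator applies it; the compliance range and noise level below are illustrative tuning knobs, not measured values.

```python
import random

def randomize_compliance(commanded_accel, rng=None,
                         compliance_range=(0.8, 1.0), noise_std=0.05):
    """Domain randomization for sim-to-real transfer: scale the commanded
    acceleration by a random pilot-compliance factor and add transition
    noise, so the policy never trains on perfectly executed commands."""
    rng = rng or random.Random()
    compliance = rng.uniform(*compliance_range)
    return compliance * commanded_accel + rng.gauss(0.0, noise_std)
```

Calling this inside the simulator's `step()` exposes the policy to the compliance variance and unmodeled dynamics listed above.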
5. Offline RL Algorithms
5.1 Algorithm Landscape
Offline RL addresses the fundamental challenge that standard RL algorithms fail when trained on fixed datasets due to distributional shift—the learned policy encounters states not in the training data, leading to catastrophic overestimation of Q-values [36][48].
| Algorithm | Type | Key Mechanism | Complexity | Best For |
|---|---|---|---|---|
| CQL | Value regularization | Learns conservative Q-function that lower-bounds true value [36][30] | Medium | High stochasticity, dense rewards |
| IQL | In-sample learning | Never queries OOD actions; uses expectile regression on V-function [48][45] | Low | General offline RL; fine-tuning |
| BCQ | Policy constraint | Constrains policy to behavior data support via VAE [46][58] | High | Continuous actions, narrow data |
| TD3+BC | Policy regularization | Adds BC regularization term to TD3 objective [49][52] | Low | Simple baseline, good default |
| Decision Transformer | Sequence modeling | Casts RL as conditional sequence generation via GPT [60][66][^69] | High | Sparse rewards, long horizons |
5.2 Conservative Q-Learning (CQL)
CQL augments the standard Bellman error objective with a regularizer that pushes down Q-values for out-of-distribution actions while pushing up Q-values for in-distribution actions [36][33]:
[
\min_Q \alpha \left( \mathbb{E}_{s \sim \mathcal{D}, a \sim \mu}[Q(s,a)] - \mathbb{E}_{s,a \sim \mathcal{D}}[Q(s,a)] \right) + \frac{1}{2} \mathbb{E}_{s,a,s' \sim \mathcal{D}} \left[ (Q(s,a) - \hat{\mathcal{B}}^\pi Q(s,a))^2 \right]
]
where (\mu) is a broad distribution (e.g., uniform) and (\alpha) controls conservatism. CQL can be implemented in less than 20 lines of code on top of standard Q-learning [^33].
Aviation suitability: Strong for departure metering where state transitions are stochastic and reward is dense (per-step taxi delay). The conservative lower bound aligns with the safety-first aviation culture.
5.3 Implicit Q-Learning (IQL)
IQL avoids evaluating any out-of-distribution actions entirely by using expectile regression to estimate an upper expectile of the value function [45][48]:
[
L_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ L_2^\tau (Q_{\hat{\theta}}(s,a) - V_\psi(s)) \right]
]
where (L_2^\tau(u) = |\tau - \mathbf{1}(u < 0)| u^2) is the asymmetric squared loss with expectile parameter (\tau). Policy extraction uses advantage-weighted behavioral cloning [^48].
Aviation suitability: Excellent for taxi speed optimization where you want to improve over the average historical behavior without risk of extrapolation. IQL's state-of-the-art D4RL performance and strong online fine-tuning capability make it the recommended primary algorithm.
5.4 Decision Transformer (DT)
DT frames RL as autoregressive sequence generation, conditioning on desired return-to-go (\hat{R}_t), past states, and actions to predict optimal future actions [66][69]:
[
a_t = \text{Transformer}(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_t, s_t)
]
Aviation suitability: Best for long-horizon taxi operations with sparse terminal rewards (e.g., total taxi time only known at episode end). Meta's empirical study found DT is substantially better than CQL with sparse rewards and low-quality data, but requires more data overall [^21].
5.5 Recommended Algorithm Strategy
| Phase | Algorithm | Rationale |
|---|---|---|
| Baseline | Behavioral Cloning (BC) | Simple supervised learning on expert demonstrations; establishes floor |
| Primary | IQL | Best balance of performance, simplicity, and fine-tuning potential [^48] |
| Conservative | CQL | Safety-oriented lower-bound estimation; cross-validate against IQL [^36] |
| Long-horizon | Decision Transformer | If terminal rewards dominate; leverage Transformer scaling [^69] |
| Simple alternative | TD3+BC | Minimal modification to standard TD3; surprisingly effective [^49] |
Implementation library: d3rlpy [93][99]—the most mature offline RL library, supporting CQL, IQL, BCQ, TD3+BC, and Decision Transformer with a scikit-learn-compatible API and PyTorch backend.
```python
import d3rlpy

# Load airport taxi dataset (custom format)
dataset = d3rlpy.dataset.MDPDataset(
    observations=states,   # (N, state_dim)
    actions=actions,       # (N, action_dim)
    rewards=rewards,       # (N,)
    terminals=terminals,   # (N,) boolean
)

# Configure IQL
iql = d3rlpy.algos.IQLConfig(
    actor_learning_rate=3e-4,
    critic_learning_rate=3e-4,
    expectile=0.7,      # tau parameter
    weight_temp=3.0,    # advantage weighting temperature
    batch_size=256,
).create(device="cuda:0")

# Train offline
iql.fit(
    dataset,
    n_steps=500000,
    evaluators={"td_error": d3rlpy.metrics.TDErrorEvaluator(episodes=test_episodes)},
)
```
5.6 Safety-Constrained Offline RL
For deployment in aviation, standard offline RL is insufficient—explicit safety constraints are mandatory.
C2IQL (Constraint-Conditioned Implicit Q-Learning) extends IQL to the constrained setting by expanding the implicit value function update to handle cost constraints, then reconstructing non-discounted cumulative costs from discounted values [^16]. This is the most natural extension of IQL for safety-critical aviation use.
Lagrangian methods remain the workhorse for CMDP solving. The multiplier (\lambda) can be updated via gradient ascent: (\lambda_{k+1} = [\lambda_k + \eta \cdot (J^C(\pi_{\theta_k}) - d)]_+) or via PID control for smoother convergence [^47]. Empirical evidence shows automated GA updates can even exceed the performance of fixed optimal (\lambda^*) in complex tasks [^47].
Shielding provides a hard safety guarantee by constructing a shield from formal specifications (e.g., Linear Temporal Logic) that overrides unsafe actions before execution [^50]. For airport operations, a shield would enforce: "No two aircraft may occupy the same taxiway segment simultaneously" and "No aircraft may enter an active runway without clearance."
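A minimal action shield enforcing those two rules might look like the following; the action dictionary, occupancy map, and clearance set are hypothetical data structures, and a production shield would be derived from formal LTL specifications rather than hand-written checks.

```python
def shield(proposed_action, callsign, occupancy, clearances, runway_edges):
    """Override an unsafe action before execution.

    occupancy: dict mapping taxiway-segment ID -> aircraft count
    clearances: set of callsigns cleared to enter an active runway
    runway_edges: set of segment IDs that belong to active runways
    """
    target_edge = proposed_action["target_edge"]
    if occupancy.get(target_edge, 0) > 0:
        return {"type": "hold", "target_edge": None}   # segment already occupied
    if target_edge in runway_edges and callsign not in clearances:
        return {"type": "hold", "target_edge": None}   # no runway clearance
    return proposed_action                             # action is safe as-is
```

Because the shield only ever replaces an action with a provably safe "hold", the learned policy's performance is preserved whenever its proposals are already safe.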
5.7 Handling Distribution Shift
| Technique | How It Works | Used By |
|---|---|---|
| Conservative value estimation | Lower-bound Q-values for OOD actions | CQL [^36] |
| In-sample learning | Never query Q for unseen actions | IQL [^48] |
| Policy constraint | Restrict policy to data support | BCQ [^46], TD3+BC [^49] |
| Ensemble disagreement | Use Q-function ensemble variance as uncertainty | CQL-ensemble |
| Return conditioning | Condition on achievable returns only | Decision Transformer [^69] |
6. Baselines and Comparisons
6.1 Rule-Based Heuristics
- FIFO (First-In-First-Out): Aircraft are pushed back and routed in scheduled order. This is the simplest and most common real-world baseline.
- Shortest Path First: Route each aircraft via the shortest taxiway path (Dijkstra/A* on the airport graph).
- Time-Based Metering: Hold departures at gates until estimated taxi time matches the target (e.g., N-Control—release when taxiway aircraft count drops below N) [^34].
6.2 Queueing Theory Models
Model the runway as a server with aircraft as customers. M/M/1 or M/G/1 queue models provide analytical estimates of taxi delay as a function of arrival rate and service rate. Useful for validating whether RL policies achieve theoretically plausible improvements.
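For the M/M/1 case the expected time in queue (excluding service) is (W_q = \rho / (\mu - \lambda)) with (\rho = \lambda / \mu); a small helper makes the plausibility check concrete.

```python
def mm1_expected_wait(arrival_rate, service_rate):
    """Expected time in queue (excluding service) for an M/M/1 runway
    model: W_q = rho / (mu - lambda), rho = lambda / mu.
    Rates must share a time unit; requires lambda < mu for stability."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = arrival_rate / service_rate
    return rho / (service_rate - arrival_rate)
```

For example, 30 departures/hour against a 40 movements/hour runway gives an expected queueing delay of 0.075 h (4.5 min); an RL policy claiming to beat this bound by a wide margin warrants scrutiny.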
6.3 MILP / Optimization Scheduling
Mixed-Integer Linear Programming models optimize taxiway and runway schedules simultaneously. Lee (MIT, 2014) developed integrated MILP models for both taxiway and runway scheduling at Detroit Airport [^34]. MILP approaches can achieve 1-3% emissions reduction with 15-second replanning cycles [^31].
MILP limitations: Computational cost scales exponentially with traffic density; difficulty handling stochastic events; requires complete information about all future aircraft.
6.4 Supervised Learning Predictors
Train models to predict taxi time, congestion, or delay from features—but without prescriptive optimization. These serve as useful feature extractors or for comparison to demonstrate the value of the RL optimization component.
6.5 Fair Benchmarking Protocol
| Requirement | Implementation |
|---|---|
| Same evaluation scenarios | Fix 100+ test episodes spanning low/medium/high traffic |
| Same metrics computation | Standardize taxi time, fuel, conflicts measurement |
| Multiple seeds | Report mean ± std over ≥5 random seeds |
| Statistical tests | Wilcoxon signed-rank test for pairwise algorithm comparison |
| OPE before deployment | Use Fitted Q-Evaluation (FQE) for off-policy evaluation of policies before any online testing [^67] |
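For the pairwise test, `scipy.stats.wilcoxon` is the production choice; the self-contained sketch below shows the computation with the usual normal approximation, which is adequate for the 100+ paired test episodes recommended above.

```python
import math

def wilcoxon_signed_rank(xs, ys):
    """Paired Wilcoxon signed-rank test, normal approximation.
    Returns (W+, two-sided p). Tied |differences| get average ranks;
    zero differences are dropped, as in the standard procedure."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    ranked = sorted(diffs, key=abs)

    # assign average ranks to groups of tied |d|
    ranks, i = {}, 0
    while i < n:
        j = i
        while j < n and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        avg = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg
        i = j

    w_plus = sum(ranks[k] for k in range(n) if ranked[k] > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

Applied to paired per-episode taxi times (RL policy vs. FIFO) over the fixed test set, a p-value below 0.05 supports claiming a genuine improvement rather than seed noise.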
7. Evaluation Metrics
| Metric | Definition | Target |
|---|---|---|
| Average taxi time (min) | Mean time from pushback to runway entry (departures) or touchdown to gate (arrivals) | ≥10% reduction vs. FIFO |
| Fuel burn (kg/flight) | Estimated using ICAO fuel flow model: (F = T \times f \times N) where (T) = taxi time, (f) = fuel flow per engine, (N) = number of engines [77][86] | ≥15% reduction |
| CO₂ emissions (kg/flight) | 3.16 × fuel burn (kg) | Proportional to fuel |
| Throughput (movements/hour) | Number of departures + arrivals per unit time | ≥5% improvement |
| Taxi delay (min) | Actual taxi time minus unimpeded reference time (20th percentile) [^86] | ≥25-44% reduction [^20] |
| Conflict rate (events/hour) | Number of separation violations or near-misses per operational hour | Zero-tolerance target |
| Schedule conformance (%) | Percentage of flights arriving at runway within ±20 seconds of planned time | ≥97.8% (matching Tran et al. [^4]) |
| Gate hold time (min) | Additional time aircraft spend at gate due to metering | Should not increase disproportionately |
Robustness evaluation: Test the learned policy on at least 3 different airport layouts (e.g., EDDF Frankfurt, WSSS Singapore Changi, KJFK New York JFK) to assess transfer learning potential.
8. System Architecture
8.1 End-to-End Engineering Architecture
The system follows a standard ML platform architecture with aviation-specific components:
```
┌─────────────────────────────────────────────────────────────────┐
│                      DATA INGESTION LAYER                       │
│  OpenSky API → Kafka/Pulsar → Bronze (Raw ADS-B Parquet)        │
│  METAR Feed  → Bronze (Weather JSON)                            │
│  Flight Schedule → Bronze (Schedule CSV)                        │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                      PROCESSING LAYER                           │
│  Bronze → Silver: Dedup, filter, resample, map-match            │
│  Silver → Gold: Feature engineering, episode construction       │
│  Orchestration: Airflow / Dagster / Prefect                     │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                      ML TRAINING LAYER                          │
│  Feature Store (Feast/Hopsworks) ← Gold tables                  │
│  d3rlpy Training Pipeline (IQL/CQL/DT)                          │
│  MLflow Experiment Tracking + Model Registry                    │
│  Safety Evaluation Module (CMDP constraint checking)            │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                      SERVING LAYER                              │
│  Batch Inference: Daily policy evaluation on new data           │
│  Decision Support Dashboard (Streamlit/Grafana)                 │
│  Advisory Output: Recommended pushback times / speed profiles   │
└─────────────────────────────────────────────────────────────────┘
```
8.2 Lakehouse Data Stack (Bronze/Silver/Gold)
Following the Databricks medallion architecture [92][101][^104]:
Bronze Layer — Raw, immutable, append-only:
- `bronze.adsb_state_vectors`: Raw ADS-B messages (Parquet, partitioned by date/hour)
- `bronze.metar_weather`: Raw METAR observations
- `bronze.flight_schedules`: Airline schedule data
- `bronze.airport_config`: Runway configuration logs
Silver Layer — Cleaned, validated, deduplicated:
- `silver.surface_trajectories`: Map-matched trajectories with taxiway segment IDs
- `silver.taxi_episodes`: Complete gate-to-runway/runway-to-gate episodes
- `silver.traffic_state`: Per-timestep airport-wide traffic snapshot
- `silver.weather_interpolated`: Time-interpolated weather at airport
Gold Layer — Feature-engineered, ML-ready:
- `gold.rl_transitions`: ((s_t, a_t, r_t, s_{t+1}, d_t)) tuples for offline RL
- `gold.episode_metrics`: Per-episode summary (taxi time, fuel, delays)
- `gold.feature_store_sync`: Materialized features for serving
- `gold.evaluation_results`: Model performance metrics over time
Storage format: Delta Lake on object storage (MinIO for self-hosted, S3/ADLS for cloud). Delta provides ACID transactions, schema evolution, and time travel—essential for reproducible ML experiments.
8.3 Feature Store
Use Feast (open-source) or Hopsworks to serve features consistently between training and inference:
- Offline store: Parquet files in the Gold layer for historical feature retrieval during training
- Online store: Redis/DynamoDB for low-latency feature serving during batch inference
- Feature groups: `aircraft_state_features`, `traffic_context_features`, `weather_features`, `planning_features`
8.4 MLflow Experiment Tracking
MLflow provides the experiment management backbone [91][97][^100]:
- Experiments: Organize by algorithm (IQL, CQL, DT) and airport
- Runs: Each training configuration (hyperparameters, dataset version, random seed)
- Metrics: Log training loss, OPE scores (FQE), evaluation metrics (taxi time reduction, fuel savings)
- Artifacts: Save trained model checkpoints, evaluation plots, dataset metadata
- Model Registry: Version control for production-ready policies; stage transitions (Staging → Production)
```python
import mlflow

mlflow.set_experiment("airport-taxi-optimization/IQL")
with mlflow.start_run(run_name="iql_eddf_v3"):
    mlflow.log_params({
        "algorithm": "IQL",
        "expectile": 0.7,
        "batch_size": 256,
        "n_steps": 500000,
        "airport": "EDDF",
        "dataset_version": "2024Q2",
    })
    # Training loop with d3rlpy
    iql.fit(dataset, n_steps=500000)
    mlflow.log_metrics({
        "avg_taxi_time_reduction_pct": 12.3,
        "avg_fuel_reduction_pct": 18.7,
        "conflict_rate": 0.0,
        "schedule_conformance_pct": 98.1,
    })
    # d3rlpy policies are saved as .d3 files and logged as artifacts
    iql.save("model.d3")
    mlflow.log_artifact("model.d3")
```
8.5 Batch Inference Pipeline
The system operates in advisory mode (not autonomous control):
- Daily batch: Process previous day's operations through Bronze→Silver→Gold pipeline
- Policy evaluation: Run trained policy on reconstructed episodes to compute counterfactual metrics ("what would the RL policy have recommended?")
- Prospective advisory: For next-day schedule, generate recommended pushback times and speed profiles
- Dashboard update: Visualize comparison of actual vs. recommended operations
8.6 Decision Support Dashboard
Build with Streamlit or Grafana:
- Real-time airport surface map: Aircraft positions, congestion heatmap
- Recommended actions: Pushback hold times per terminal, speed advisories per taxiway
- Performance metrics: Rolling 7-day taxi time, fuel burn, delay comparisons (actual vs. RL-recommended)
- Historical analysis: Drill-down into individual episodes showing actual vs. optimal trajectories
9. Literature Review
9.1 Airport Taxi Optimization with RL
| Paper | Year | Method | Airport | Key Result |
|---|---|---|---|---|
| Tran et al. [1][4] | 2023 | PPO (speed control) | Singapore Changi | 29.5% fuel reduction, 97.8% schedule conformance |
| Ali et al. [10][20] | 2022 | DRL (departure metering) | Singapore Changi | 44% taxi delay reduction (medium traffic), hotspot features improve convergence |
| Watteau et al. [^2] | 2024 | PPO (multi-agent routing) | Montreal, Toronto | Shortest routes with on-time arrival via graph-based agent |
| Szymanski et al. [^13] | 2023 | Single/Multi-agent RL | Generic | RL routing with genetic algorithm route assignment |
| Ma et al. [^68] | 2022 | Data-driven + optimization | Beijing PEK | 5.1 min taxi-in reduction, 3.7 min taxi-out reduction via ADS-B analysis |
9.2 Departure Metering and Pushback Control
Ali et al.'s work at NTU on DRL-based departure metering remains the benchmark. Their MDP framework models spot metering at each terminal with a centralized DM agent. The introduction of taxiway hotspot features to capture spatial-temporal congestion evolution was a key innovation that significantly improved convergence [10][26]. Murca et al. (2017) provided a robust optimization approach to departure metering as an optimization baseline [^29].
9.3 Offline RL in Transportation
While offline RL has been extensively applied in robotics and autonomous driving [^18], its application to aviation ground operations remains nascent. The safety-critical nature of aviation aligns naturally with offline RL's non-exploratory paradigm. C2IQL represents the latest advance in safe offline RL, handling dynamic safety constraints via constraint conditioning [^16].
9.4 ADS-B Data Processing for Surface Operations
Szymanski et al.'s HMM-based map-matching algorithm (AIAA 2023) demonstrated 97-99% accuracy for reconstructing airport surface trajectories from ADS-B data [^17]. Schlosser et al. (JOAS 2024) extended surface trajectory analysis to include stochastic pavement roughness modeling using OpenSky Network data [^19]. The traffic Python library by Xavier Olive provides production-grade ADS-B trajectory processing with built-in airport surface support [^38].
9.5 Digital Twins for Aviation
NREL developed a digital twin framework for DFW Airport that combines SARIMA-based traffic demand forecasting with microscopic traffic simulation [^37]. This "digital twin intelligence platform" helps airport operations staff explore policy changes and infrastructure scenarios. Lu et al. (2025) applied digital twin technology to aircraft turnaround operations [^43].
10. Research Gaps and Opportunities
10.1 What Is Still Unsolved
- Offline RL specifically for airport surface operations: No published work applies CQL, IQL, or Decision Transformer to airport taxi optimization. All existing RL work uses online PPO/DRL [1][10].
- Multi-airport generalization: All existing approaches train per-airport. Transfer learning of taxi policies across airports with different layouts is unexplored.
- Joint optimization: Simultaneous optimization of pushback timing, taxi routing, and speed control within a single RL framework (current work addresses each in isolation).
- Human-in-the-loop evaluation: No study has evaluated RL taxi advisories with real ATCOs or pilots.
10.2 Where Offline RL Has Clear Advantages
- Data abundance: Decades of A-SMGCS/ADS-B data exist for major airports—far more than what's needed for offline RL training.
- Safety-first: No online exploration risk; policies can be validated offline before deployment.
- Counterfactual analysis: Offline RL naturally supports "what-if" analysis on historical operations.
- Incremental deployment: Start with advisory (decision support) mode before any autonomous control.
10.3 Publishable Contribution Opportunities
- "Offline RL for Airport Departure Metering: A CQL/IQL Approach with ADS-B Data" — First application of modern offline RL algorithms to airport surface optimization. Compare CQL, IQL, DT against PPO-based baselines from Tran et al. and Ali et al.
- "Safety-Constrained Offline RL for Airport Surface Movement via CMDP" — Apply C2IQL or Lagrangian-constrained IQL with explicit separation and runway incursion constraints.
- "ADS-B to MDP: An Open Pipeline for Airport Surface RL Datasets" — Release an open-source pipeline converting OpenSky Network ADS-B data into standardized offline RL datasets for multiple airports.
- "Transfer Learning of Taxi Policies Across Airport Layouts" — Investigate whether offline RL policies trained at one airport generalize to others via graph neural network state representations.
- "Decision Transformer for Long-Horizon Airport Surface Scheduling" — Leverage Transformer's sequence modeling for full-day airport surface scheduling with sparse end-of-day throughput rewards.
11. Implementation Roadmap
Phase 1: Data Foundation (Months 1-3)
| Milestone | Deliverable | Tools |
|---|---|---|
| M1.1 Airport graph construction | OSM-based taxiway graph for target airport (e.g., EDDF) | osmnx, networkx |
| M1.2 ADS-B data pipeline | Bronze layer ingestion from OpenSky (6+ months of data) | pyopensky, traffic, Delta Lake |
| M1.3 Map-matching | HMM map-matching module achieving >95% accuracy | hmmlearn, custom HMM |
| M1.4 Episode construction | Silver→Gold pipeline producing RL episodes with labeled (s,a,r,s',d) | PySpark/Polars, Airflow |
| M1.5 EDA and validation | Analysis notebook confirming data quality and feature distributions | Jupyter, Matplotlib |
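The map-matching milestone (M1.3) hinges on a Viterbi decode in which hidden states are taxiway segments, emissions score how close each ADS-B fix lies to a segment, and transitions only allow staying on or moving to an adjacent segment. The sketch below illustrates the idea on a made-up three-segment layout (segment names `N`, `N2`, `L` and all distances are hypothetical); a real implementation would use geodesic point-to-segment distances on the OSM taxiway graph:

```python
# Toy HMM map-matching: decode the most likely taxiway segment sequence
# from per-fix distances. Layout and distances are illustrative only.

SEGMENTS = ["N", "N2", "L"]  # hypothetical taxiway segments
ADJACENT = {"N": {"N", "N2"}, "N2": {"N", "N2", "L"}, "L": {"N2", "L"}}

def emission_logp(dist_m, sigma=10.0):
    """Gaussian emission: fixes closer to a segment are more likely on it."""
    return -0.5 * (dist_m / sigma) ** 2

def viterbi(dists_per_fix):
    """dists_per_fix[t][seg] = distance (m) of ADS-B fix t to each segment."""
    logp = {s: emission_logp(dists_per_fix[0][s]) for s in SEGMENTS}
    back = []
    for obs in dists_per_fix[1:]:
        new_logp, ptr = {}, {}
        for s in SEGMENTS:
            # Best predecessor among segments from which s is reachable.
            prev, best = max(
                ((p, logp[p]) for p in SEGMENTS if s in ADJACENT[p]),
                key=lambda kv: kv[1],
            )
            new_logp[s] = best + emission_logp(obs[s])
            ptr[s] = prev
        logp, back = new_logp, back + [ptr]
    # Backtrack the most likely segment sequence.
    last = max(logp, key=logp.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

fixes = [
    {"N": 3.0, "N2": 40.0, "L": 80.0},
    {"N": 25.0, "N2": 5.0, "L": 45.0},
    {"N": 70.0, "N2": 30.0, "L": 4.0},
]
path = viterbi(fixes)
```

The adjacency constraint is what lifts this above nearest-segment snapping: a noisy fix near a parallel taxiway cannot cause a physically impossible jump.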
Phase 2: Environment and Baselines (Months 3-5)
| Milestone | Deliverable | Tools |
|---|---|---|
| M2.1 Custom airport simulator | Gymnasium-compatible env with graph navigation, conflict detection | Gymnasium, NetworkX |
| M2.2 Simulator calibration | Validate simulator dynamics against historical data (speed profiles, taxi times) | Statistical testing |
| M2.3 FIFO baseline | Implement and evaluate FIFO pushback policy | Custom |
| M2.4 MILP baseline | Implement simplified MILP scheduler for comparison | PuLP/Gurobi |
| M2.5 BC baseline | Train behavioral cloning on historical data | d3rlpy |
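The FIFO baseline (M2.3) is worth pinning down precisely, since every RL result is reported against it. A minimal sketch, assuming a fixed 60-second minimum separation between consecutive pushbacks (an illustrative value, not a validated one):

```python
# FIFO pushback baseline: release aircraft in ready-time order, enforcing a
# minimum separation between consecutive pushbacks. Flight IDs are made up.

def fifo_pushback(ready_times, min_sep_s=60):
    """Return approved pushback times (s), FIFO over ready times."""
    schedule = {}
    next_free = 0.0
    for flight, t_ready in sorted(ready_times.items(), key=lambda kv: kv[1]):
        t_push = max(t_ready, next_free)  # wait until both ready and slot free
        schedule[flight] = t_push
        next_free = t_push + min_sep_s
    return schedule

ready = {"AFR101": 0.0, "DLH400": 10.0, "BAW902": 200.0}
sched = fifo_pushback(ready)
# AFR101 pushes at 0; DLH400 is held to 60 by separation; BAW902 at 200.
```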
Phase 3: Offline RL Training (Months 5-8)
| Milestone | Deliverable | Tools |
|---|---|---|
| M3.1 IQL training | Train IQL on offline dataset with hyperparameter sweep | d3rlpy, MLflow [93][97] |
| M3.2 CQL training | Train CQL for conservative comparison | d3rlpy |
| M3.3 DT training | Train Decision Transformer for long-horizon variant | HuggingFace Transformers |
| M3.4 Safety constraints | Implement C2IQL or Lagrangian IQL with CMDP constraints | Custom on d3rlpy |
| M3.5 OPE evaluation | Off-policy evaluation of all policies using FQE | d3rlpy OPE, SCOPE-RL [^82] |
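For M3.1, d3rlpy implements IQL end-to-end; the distinguishing ingredient is the expectile regression loss on the value function, L(u) = |τ − 1{u < 0}| · u², with u = Q(s, a) − V(s). With τ > 0.5 the loss penalizes underestimation of V more than overestimation, pushing V toward an upper expectile of Q without ever querying out-of-distribution actions. A pure-Python illustration (τ = 0.7 is an illustrative choice, not a tuned hyperparameter):

```python
# The asymmetric expectile loss at the core of IQL's value update.

def expectile_loss(u, tau=0.7):
    """L(u) = |tau - 1{u < 0}| * u^2, with u = Q(s, a) - V(s)."""
    weight = tau if u > 0 else (1.0 - tau)
    return weight * u * u

# V underestimating Q (u = +1) costs more than overestimating it (u = -1),
# which is what drags V(s) toward the upper expectile of Q(s, a):
low = expectile_loss(-1.0)   # weight 0.3
high = expectile_loss(1.0)   # weight 0.7
```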
Phase 4: Evaluation and Analysis (Months 8-10)
| Milestone | Deliverable | Tools |
|---|---|---|
| M4.1 Comparative analysis | Head-to-head evaluation of all methods on test scenarios | Custom evaluation framework |
| M4.2 Robustness testing | Evaluate on unseen traffic densities and weather conditions | Simulator perturbation |
| M4.3 Multi-airport test | Transfer evaluation to 2nd airport (e.g., LFPG Paris CDG) | Same pipeline, new data |
| M4.4 Paper draft | Write research paper for AIAA or IEEE TITS | LaTeX |
Phase 5: Deployment Demo (Months 10-12)
| Milestone | Deliverable | Tools |
|---|---|---|
| M5.1 Feature store | Production feature pipeline with Feast | Feast, Redis |
| M5.2 Batch inference | Daily batch prediction pipeline | Airflow, d3rlpy |
| M5.3 Dashboard | Interactive decision support dashboard | Streamlit, Plotly |
| M5.4 Documentation | Technical documentation and user guide | MkDocs |
| M5.5 Open-source release | Release data pipeline and baseline models | GitHub |
12. Key Libraries and Tools Reference
| Category | Tool | Purpose |
|---|---|---|
| ADS-B Data | pyopensky [^35] | OpenSky Network data access |
| Trajectory Processing | traffic [^38] | ADS-B trajectory analysis |
| Airport Graph | osmnx, networkx | Taxiway graph from OSM |
| Offline RL | d3rlpy [93][99] | CQL, IQL, BCQ, TD3+BC, DT |
| RL Environments | gymnasium, BlueSky-Gym [^15] | Environment interface |
| Experiment Tracking | MLflow [97][100] | Metrics, models, artifacts |
| Feature Store | Feast | Online/offline feature serving |
| Data Processing | PySpark, Polars, Delta Lake | Lakehouse ETL |
| Orchestration | Airflow / Dagster | Pipeline scheduling |
| Visualization | Streamlit, Grafana, Plotly | Dashboards |
| Simulation | BlueSky [^12], custom | Airport surface simulation |
| Fuel Estimation | OpenAP, ICAO Engine DB [77][86] | Fuel burn calculation |
| OPE | SCOPE-RL [^82], d3rlpy OPE | Off-policy evaluation |
| Safety RL | OmniSafe, custom CMDP | Constrained RL [^47] |
13. Concrete Recommendations
- Start with IQL on OpenSky data for Frankfurt Airport (EDDF). Your proximity and domain knowledge of EDDF make it an ideal first target. Use pyopensky to extract 12 months of surface trajectories.
- Build the map-matching pipeline first. This is the highest-risk component—without accurate trajectory reconstruction, all downstream RL is unreliable. Validate against known taxiway routes.
- Use d3rlpy as the RL backbone. It provides all needed algorithms with minimal code, MLflow integration, and active maintenance [^93].
- Implement the CMDP/safety layer before any deployment claims. Aviation regulators and stakeholders will immediately ask about safety guarantees. C2IQL or Lagrangian IQL provides the formal framework [16][47].
- Publish the data pipeline as an open-source contribution. There is no standard open dataset for airport surface RL. Creating one—even for a single airport—would be a significant community contribution and attract citations.
- Target IEEE Transactions on Intelligent Transportation Systems or AIAA Journal for publication. Ali et al.'s departure metering work was published in IEEE TITS [^20]; Tran et al.'s in the Transactions of the Japan Society for Aeronautical and Space Sciences [^1].
- Connect this to your PhD in AI + Logistics. Airport surface movement is a logistics optimization problem—aircraft routing on a constrained network with time windows, capacity constraints, and stochastic disruptions. Frame it as "logistics optimization via offline RL" for maximum relevance to your dissertation.
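The Lagrangian-constrained approach recommended above reduces, at its core, to a dual-ascent update on the multiplier: λ rises while the estimated constraint cost (e.g. separation violations per episode) exceeds its budget d, and decays back toward zero once the policy is safe. A minimal sketch (learning rate, budget, and the cost sequence are illustrative assumptions):

```python
# Dual ascent on the Lagrange multiplier for a single CMDP constraint:
#   lambda <- max(0, lambda + lr * (Jc - d))
# where Jc is the estimated constraint cost and d the allowed budget.

def lagrangian_update(lmbda, cost_estimate, budget, lr=0.1):
    """One dual-ascent step; lambda is clipped at zero."""
    return max(0.0, lmbda + lr * (cost_estimate - budget))

lmbda = 0.0
for cost in [2.0, 1.5, 1.0, 0.4, 0.4]:  # hypothetical per-epoch costs
    lmbda = lagrangian_update(lmbda, cost, budget=0.5)
```

In the full algorithm this update interleaves with policy training, where the reward is penalized by λ times the constraint cost, so persistent violations make safety progressively more expensive for the policy.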
References
- Towards Greener Airport Surface Operations: A Reinforcement Learning Approach for Autonomous Taxiing
- Optimizing Airport Ground Movements Using Multi-Agents Reinforcement Learning. AIAA Aviation Forum and ASCEND Co-located Conference Proceedings.
- Delft University of Technology
- Towards Greener Airport Surface Operations: A Reinforcement Learning Approach for Autonomous Taxiing
- Groot, D. J., et al. BlueSky-Gym: Reinforcement Learning Environments Built upon the Gymnasium API and the BlueSky Air Traffic Simulator.
- Ali, H., et al. A Deep Reinforcement Learning Approach for Airport Departure Metering.
- BlueSky ATC Simulator Project: An Open Data and Open Source Approach.
- Szymanski, M., et al. (2023). Single and Multi-Agent Reinforcement Learning Approach for Aircraft Routing.
- TUDelft-CNS-ATM/bluesky-gym: A Gymnasium-style environment for standardized reinforcement learning research in air traffic management (GitHub repository).
- C2IQL: Constraint-Conditioned Implicit Q-Learning for Safe Offline Reinforcement Learning.
- Development of a Map-Matching Algorithm for the Analysis of Aircraft Ground Trajectories Using ADS-B Data. AIAA AVIATION Forum. https://doi.org/10.2514/6.2023-3758
- Khaitan, S., et al. (2023). Exploring Reinforcement Learning Approaches for Safety.
- Journal of Open Aviation Science (2024), Vol. 2.
- A Deep Reinforcement Learning Approach for Airport Departure Metering Under Spatial–Temporal Airside Interactions.
- When Should We Prefer Decision Transformers for Offline Reinforcement Learning?
- OpenSky Network Data.
- ADS-B Exchange: Serving the Flight Tracking Enthusiast.
- Deep Reinforcement Learning Based Airport Departure Metering.
- OpenSky Network.
- Murça, M. C. R. (2017). A Robust Optimization Approach for Airport Departure Metering.
- Kumar, A., Zhou, A., Tucker, G., Levine, S. Conservative Q-Learning for Offline Reinforcement Learning. UC Berkeley and Google Research, Brain Team.
- Real-Time Airport Surface Movement Planning: Minimizing Aircraft Emissions.
- Conservative Q-Learning for Offline Reinforcement Learning.
- Airport Surface Traffic Optimization and Simulation. Ph.D. thesis, Massachusetts Institute of Technology, Department of Aeronautics and Astronautics.
- open-aviation/pyopensky: The Python interface for the OpenSky database (GitHub repository).
- Conservative Q-Learning for Offline Reinforcement Learning. arXiv.
- Airport Surface Transportation Digital Twin Framework.
- xoolive/traffic: A toolbox for processing and analysing air traffic data (GitHub repository).
- Lu, J., et al. (2025). Harnessing Digital Twin Technology for Enhanced Aircraft Turnaround Operations.
- Offline Reinforcement Learning with Implicit Q-Learning. OpenReview.
- Batch-Constrained Q-Learning (BCQ), Offline RL.
- An Empirical Study of Lagrangian Methods in Safe Reinforcement Learning.
- Offline Reinforcement Learning with Implicit Q-Learning. arXiv.
- Improving TD3-BC: Relaxed Policy Constraint for Offline Reinforcement Learning.
- A Survey of Safe Reinforcement Learning and Constrained MDPs.
- Offline Reinforcement Learning (blog post covering BCQ).
- sfujim/BCQ: Author's PyTorch implementation of BCQ for continuous and discrete actions (GitHub repository).
- Decision Transformer: Reinforcement Learning via Sequence Modeling.
- D4RL Benchmark: Offline RL Evaluation.
- Decision Transformer: Reinforcement Learning via Sequence Modeling (Semantic Scholar).
- Off-Policy Evaluation. Farama-Foundation/D4RL: A collection of reference environments for offline reinforcement learning.
- Ma, J., et al. (2022). Data-Driven Trajectory-Based Analysis and Optimization of Airport Surface Operations.
- Decision Transformer: Reinforcement Learning via Sequence Modeling.
- D4RL: Building Better Benchmarks for Offline Reinforcement Learning.
- Advanced Surface Movement Guidance and Control System.
- Khadilkar, H., et al. Estimation of Aircraft Taxi-Out Fuel Burn Using Flight Data Recorder Archives.
- Simulation of Urban MObility (Wikipedia).
- Eclipse SUMO: Simulation of Urban MObility, an open-source, portable, microscopic and continuous multi-modal traffic simulation package.
- Fuel Estimation for Operational Performance Benchmarking: Model.
- Track Model Development Using MLflow (Azure Databricks documentation).
- Building the Data Lake (Bronze, Silver, Gold Architecture).
- takuseno/d3rlpy: An offline deep reinforcement learning library (GitHub repository).
- MLflow Tracking (documentation).
- Seno, T. (2022). d3rlpy: An Offline Deep Reinforcement Learning Library.
- MLflow Model Registry (documentation).
- What Is the Medallion Lakehouse Architecture?
- What Is a Medallion Architecture?