Time-Discounted GAE
In semi-MDPs, each step has an associated duration. Instead of the usual value equation
\begin{equation} V(s_1) = r_1 + \gamma r_2 + \gamma^2 r_3 + ... \end{equation}
one discounts based on step duration:
\begin{equation} V_{\Delta t}(s_1) = \gamma^{\Delta t_1} r_1 + \gamma^{\Delta t_1 + \Delta t_2} r_2 + \gamma^{\Delta t_1 + \Delta t_2 + \Delta t_3} r_3 + ... \end{equation}
using the convention that reward is given at the end of a step.
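As a minimal sketch of the time-discounted value above, each reward can be weighted by gamma raised to the cumulative time elapsed at the end of its step. The function name `time_discounted_return` and the sample rewards and durations are illustrative assumptions, not part of bsk_rl.

```python
import numpy as np


def time_discounted_return(rewards, durations, gamma):
    """Discounted return with each reward credited at the end of its step."""
    rewards = np.asarray(rewards, dtype=float)
    durations = np.asarray(durations, dtype=float)
    # Elapsed time at the end of each step: dt_1, dt_1 + dt_2, ...
    elapsed = np.cumsum(durations)
    return float(np.sum(gamma**elapsed * rewards))


# Hypothetical three-step episode with unequal step durations (seconds)
rewards = [1.0, 0.5, 2.0]
durations = [180.0, 60.0, 120.0]
print(time_discounted_return(rewards, durations, gamma=0.999))
```

With all durations equal to one, this reduces to per-step discounting in which the first reward is already discounted once, which is what the end-of-step convention implies.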
The generalized advantage estimator can be rewritten accordingly. In our implementation, the exponential decay lambda is applied once per step, regardless of the step's duration, while the discount gamma is applied per unit of time.
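One way to write such an estimator is sketched below. It assumes the end-of-step reward convention, so the temporal-difference residual is gamma^dt_t * (r_t + V(s_{t+1})) - V(s_t), and it accumulates residuals backward with a factor of lambda * gamma^dt_t per step. The function name `time_discounted_gae` and its arguments are hypothetical; this is not the code used by the bsk_rl learner.

```python
import numpy as np


def time_discounted_gae(rewards, values, last_value, durations, gamma, lambda_):
    """Advantages with time-based discounting and per-step lambda decay.

    values[t] estimates V(s_t); last_value estimates the value of the state
    reached after the final step (use 0.0 if the episode terminated).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    durations = np.asarray(durations, dtype=float)
    next_values = np.append(values[1:], last_value)
    # Discount applied across each step depends on that step's duration
    step_discount = gamma**durations
    # TD residuals with the reward credited at the end of the step
    deltas = step_discount * (rewards + next_values) - values
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        # lambda decays once per step; gamma decays with elapsed time
        running = deltas[t] + lambda_ * step_discount[t] * running
        advantages[t] = running
    return advantages


# Hypothetical two-step rollout
print(
    time_discounted_gae(
        rewards=[1.0, 2.0],
        values=[0.3, 0.7],
        last_value=0.0,
        durations=[180.0, 60.0],
        gamma=0.999,
        lambda_=0.95,
    )
)
```

Setting `lambda_=1.0` recovers the time-discounted return minus the value estimate, and `lambda_=0.0` recovers the one-step residuals.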
RLlib Version
RLlib is actively developed and can change significantly from version to version. For this script, the following version is used:
[1]:
from importlib.metadata import version
version("ray") # Parent package of RLlib
[1]:
'2.35.0'
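Since behavior may differ between releases, a small guard such as the following could be used to compare the installed version against the one shown above. The helper name `describe_version` is an assumption for illustration; only the standard-library `importlib.metadata` API is used.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_RAY_VERSION = "2.35.0"  # Version used for this script


def describe_version(package: str, expected: str) -> str:
    """Report whether the installed package matches the expected version."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed"
    if installed != expected:
        return f"{package} {installed} is installed; this script used {expected}"
    return f"{package} {installed} matches the version used for this script"


print(describe_version("ray", EXPECTED_RAY_VERSION))
```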
Define the Environment
A simple single-satellite environment is defined, as in the RLlib training example (examples/rllib_training).
[2]:
import numpy as np
from bsk_rl import act, data, obs, sats, scene
from bsk_rl.sim import dyn, fsw


class ScanningDownlinkDynModel(
    dyn.ContinuousImagingDynModel, dyn.GroundStationDynModel
):
    # Define some custom properties to be accessed in the state
    @property
    def instrument_pointing_error(self) -> float:
        r_BN_P_unit = self.r_BN_P / np.linalg.norm(self.r_BN_P)
        c_hat_P = self.satellite.fsw.c_hat_P
        return np.arccos(np.dot(-r_BN_P_unit, c_hat_P))

    @property
    def solar_pointing_error(self) -> float:
        a = (
            self.world.gravFactory.spiceObject.planetStateOutMsgs[self.world.sun_index]
            .read()
            .PositionVector
        )
        a_hat_N = a / np.linalg.norm(a)
        nHat_B = self.satellite.sat_args["nHat_B"]
        NB = np.transpose(self.BN)
        nHat_N = NB @ nHat_B
        return np.arccos(np.dot(nHat_N, a_hat_N))


class ScanningSatellite(sats.AccessSatellite):
    observation_spec = [
        obs.SatProperties(
            dict(prop="storage_level_fraction"),
            dict(prop="battery_charge_fraction"),
            dict(prop="wheel_speeds_fraction"),
            dict(prop="instrument_pointing_error", norm=np.pi),
            dict(prop="solar_pointing_error", norm=np.pi),
        ),
        obs.OpportunityProperties(
            dict(prop="opportunity_open", norm=5700),
            dict(prop="opportunity_close", norm=5700),
            type="ground_station",
            n_ahead_observe=1,
        ),
        obs.Eclipse(norm=5700),
    ]
    action_spec = [
        act.Scan(duration=180.0),
        act.Charge(duration=120.0),
        act.Downlink(duration=60.0),
        act.Desat(duration=60.0),
    ]
    dyn_type = ScanningDownlinkDynModel
    fsw_type = fsw.ContinuousImagingFSWModel


sat = ScanningSatellite(
    "Scanner-1",
    sat_args=dict(
        # Data
        dataStorageCapacity=5000 * 8e6,  # bits
        storageInit=lambda: np.random.uniform(0.0, 0.8) * 5000 * 8e6,
        instrumentBaudRate=0.5 * 8e6,
        transmitterBaudRate=-50 * 8e6,
        # Power
        batteryStorageCapacity=200 * 3600,  # W*s
        storedCharge_Init=lambda: np.random.uniform(0.3, 1.0) * 200 * 3600,
        basePowerDraw=-10.0,  # W
        instrumentPowerDraw=-30.0,  # W
        transmitterPowerDraw=-25.0,  # W
        thrusterPowerDraw=-80.0,  # W
        panelArea=0.25,
        # Attitude
        imageAttErrorRequirement=0.1,
        imageRateErrorRequirement=0.1,
        disturbance_vector=lambda: np.random.normal(scale=0.0001, size=3),  # N*m
        maxWheelSpeed=6000.0,  # RPM
        wheelSpeeds=lambda: np.random.uniform(-3000, 3000, 3),
        desatAttitude="nadir",
    ),
)

duration = 5 * 5700.0  # About 5 orbits
env_args = dict(
    satellite=sat,
    scenario=scene.UniformNadirScanning(value_per_second=1 / duration),
    rewarder=data.ScanningTimeReward(),
    time_limit=duration,
    failure_penalty=-1.0,
    terminate_on_time_limit=True,
)
RLlib Configuration
The configuration is mostly the same as in the standard example.
[3]:
import bsk_rl.utils.rllib  # noqa To access "SatelliteTasking-RLlib"
from ray.rllib.algorithms.ppo import PPOConfig

N_CPUS = 3

training_args = dict(
    lr=0.00003,
    gamma=0.999,
    train_batch_size=250,
    num_sgd_iter=10,
    model=dict(fcnet_hiddens=[512, 512], vf_share_layers=False),
    lambda_=0.95,
    use_kl_loss=False,
    clip_param=0.1,
    grad_clip=0.5,
    reward_time="step_end",
)

config = (
    PPOConfig()
    .env_runners(num_env_runners=N_CPUS - 1, sample_timeout_s=1000.0)
    .environment(
        env="SatelliteTasking-RLlib",
        env_config=env_args,
    )
    .reporting(
        metrics_num_episodes_for_smoothing=1,
        metrics_episode_collection_timeout_s=180,
    )
    .checkpointing(export_native_model_files=True)
    .framework(framework="torch")
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
)
Rewards can also be distributed at the start of the step by setting reward_time="step_start".
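The effect of the two conventions on the discounted return can be illustrated with a short sketch. It assumes that "step_start" credits each reward at the time its step begins, so the first reward is undiscounted, while "step_end" credits it once the step's duration has elapsed. The function `discounted_return` is hypothetical and only mirrors the option names; it is not the bsk_rl implementation.

```python
import numpy as np


def discounted_return(rewards, durations, gamma, reward_time="step_end"):
    """Time-discounted return under either reward timing convention."""
    rewards = np.asarray(rewards, dtype=float)
    durations = np.asarray(durations, dtype=float)
    end_times = np.cumsum(durations)
    if reward_time == "step_end":
        reward_times = end_times
    elif reward_time == "step_start":
        # Start of each step is the end time minus the step's own duration
        reward_times = end_times - durations
    else:
        raise ValueError(f"Unknown reward_time: {reward_time}")
    return float(np.sum(gamma**reward_times * rewards))


# Hypothetical rewards and step durations (seconds)
rewards = [1.0, 1.0]
durations = [180.0, 60.0]
for option in ("step_end", "step_start"):
    print(option, discounted_return(rewards, durations, 0.999, option))
```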
The one additional setting that must be configured is the learner class. This learner uses the d_ts key from the info dict to discount by the step length, not just the step count.
[4]:
from bsk_rl.utils.rllib.discounting import TimeDiscountedGAEPPOTorchLearner
config.training(learner_class=TimeDiscountedGAEPPOTorchLearner)
[4]:
<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x7f179571cbd0>
Training can then proceed as normal.
[5]:
import ray
from ray import tune

ray.init(
    ignore_reinit_error=True,
    num_cpus=N_CPUS,
    object_store_memory=2_000_000_000,  # 2 GB
)

# Run the training
tune.run(
    "PPO",
    config=config.to_dict(),
    stop={"training_iteration": 2},  # Adjust the number of iterations as needed
)

# Shutdown Ray
ray.shutdown()
2026-02-03 17:27:56,373 INFO worker.py:1783 -- Started a local Ray instance.
2026-02-03 17:28:00,005 INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/gymnasium/spaces/box.py:130: UserWarning: WARN: Box bound precision lowered by casting to float32
gym.logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/gymnasium/utils/passive_env_checker.py:164: UserWarning: WARN: The obs returned by the `reset()` method was expecting numpy array dtype to be float32, actual type: float64
logger.warn(
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/gymnasium/utils/passive_env_checker.py:188: UserWarning: WARN: The obs returned by the `reset()` method is not within the observation space.
logger.warn(f"{pre} is not within the observation space.")
Tune Status
| Current time: | 2026-02-03 17:29:10 |
| Running for: | 00:01:10.09 |
| Memory: | 4.7/15.6 GiB |
System Info
Using FIFO scheduling algorithm.
Logical resource usage: 3.0/3 CPUs, 0/0 GPUs
Trial Status
| Trial name | status | loc | iter | total time (s) | num_env_steps_sampled_lifetime | num_episodes_lifetime | num_env_steps_trained_lifetime |
|---|---|---|---|---|---|---|---|
| PPO_SatelliteTasking-RLlib_afd0a_00000 | TERMINATED | 10.1.0.194:5886 | 2 | 53.7951 | 8000 | 40 | 8000 |
(PPO pid=5886) Install gputil for GPU system monitoring.
(SingleAgentEnvRunner pid=5933) 2026-02-03 17:28:17,713 sats.satellite.Scanner-1 WARNING <15240.00> Scanner-1: failed battery_valid check
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (9.64e+10 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (1.26e+10 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (1.12e+12 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (1.46e+11 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (1.10e+12 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (1.44e+11 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (1.10e+12 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (1.44e+11 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (1.10e+12 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (1.44e+11 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) 2026-02-03 17:28:23,931 sats.satellite.Scanner-1 WARNING <17400.00> Scanner-1: failed battery_valid check [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(SingleAgentEnvRunner pid=5933) 2026-02-03 17:28:30,293 sats.satellite.Scanner-1 WARNING <23280.00> Scanner-1: failed battery_valid check [repeated 6x across cluster]
(SingleAgentEnvRunner pid=5933) 2026-02-03 17:28:35,701 sats.satellite.Scanner-1 WARNING <8160.00> Scanner-1: failed battery_valid check [repeated 2x across cluster]
Trial Progress
| Trial name | env_runners | fault_tolerance | learners | num_agent_steps_sampled_lifetime | num_env_steps_sampled_lifetime | num_env_steps_trained_lifetime | num_episodes_lifetime | perf | timers |
|---|---|---|---|---|---|---|---|---|---|
| PPO_SatelliteTasking-RLlib_afd0a_00000 | {'num_agent_steps_sampled': {'default_agent': 4000}, 'episode_return_mean': 0.36961403508771945, 'episode_return_min': 0.36094736842105274, 'num_module_steps_sampled': {'default_policy': 4000}, 'episode_duration_sec_mean': 2.593059128499931, 'episode_len_max': 268, 'agent_episode_returns_mean': {'default_agent': 0.36961403508771945}, 'num_env_steps_sampled': 4000, 'num_episodes': 20, 'episode_len_mean': 264.0, 'module_episode_returns_mean': {'default_policy': 0.36961403508771945}, 'episode_len_min': 260, 'episode_return_max': 0.37828070175438616, 'num_agent_steps_sampled_lifetime': {'default_agent': 12000}, 'num_module_steps_sampled_lifetime': {'default_policy': 12000}, 'sample': np.float64(22.002958464301607), 'num_env_steps_sampled_lifetime': 16000, 'time_between_sampling': np.float64(6.277671662999978)} | {'num_healthy_workers': 2, 'num_in_flight_async_reqs': 0, 'num_remote_worker_restarts': 0} | {'default_policy': {'vf_explained_var': -1.0, 'mean_kl_loss': 0.010992036201059818, 'curr_entropy_coeff': 0.0, 'default_optimizer_learning_rate': 5e-05, 'num_trainable_parameters': 139013.0, 'policy_loss': 0.05150565132498741, 'num_non_trainable_parameters': 0.0, 'vf_loss': 1.8135100617655553e-05, 'vf_loss_unclipped': 1.8135100617655553e-05, 'entropy': 1.3599399328231812, 'total_loss': 0.05372219160199165, 'curr_kl_coeff': 0.20000000298023224, 'num_module_steps_trained': 4000}, '__all_modules__': {'num_module_steps_trained': 4000, 'total_loss': 0.05372219160199165, 'num_trainable_parameters': 139013.0, 'num_non_trainable_parameters': 0.0, 'num_env_steps_trained': 4000}} | {'default_agent': 8000} | 8000 | 8000 | 40 | {'cpu_util_percent': np.float64(46.94722222222222), 'ram_util_percent': np.float64(29.802777777777774)} | {'env_runner_sampling_timer': 23.379724504690028, 'learner_update_timer': 4.841205271960023, 'synch_weights': 0.005529624630014496, 'synch_env_connectors': 0.006047840560032682} |
(SingleAgentEnvRunner pid=5934) 2026-02-03 17:28:46,470 sats.satellite.Scanner-1 WARNING <11820.00> Scanner-1: failed battery_valid check [repeated 2x across cluster]
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (7.69e+218 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (8.16e+203 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (4.10e+218 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (8.90e+219 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (9.45e+204 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (4.75e+219 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (8.78e+219 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (9.35e+204 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (4.70e+219 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (8.79e+219 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (9.35e+204 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (4.70e+219 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (8.79e+219 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (9.35e+204 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) BSK_WARNING: Excessive reaction wheel acceleration detected (4.70e+219 rad/s^2). This may be caused by using unlimited torque (useMaxTorque=False) with a small spacecraft inertia. Consider using torque limits or increasing spacecraft inertia.
(SingleAgentEnvRunner pid=5933) 2026-02-03 17:28:51,798 sats.satellite.Scanner-1 WARNING <12420.00> Scanner-1: failed battery_valid check [repeated 4x across cluster]
(SingleAgentEnvRunner pid=5934) 2026-02-03 17:28:58,956 sats.satellite.Scanner-1 WARNING <18480.00> Scanner-1: failed battery_valid check
(SingleAgentEnvRunner pid=5933) 2026-02-03 17:29:02,485 sats.satellite.Scanner-1 WARNING <21540.00> Scanner-1: failed battery_valid check
2026-02-03 17:29:10,129 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/runner/ray_results/PPO_2026-02-03_17-28-00' in 0.0159s.
2026-02-03 17:29:10,262 INFO tune.py:1041 -- Total run time: 70.26 seconds (70.07 seconds for the tuning loop).