Training with Curriculum Learning

This example script demonstrates how to use curriculum learning (CL) and domain randomization (DR) during training with RLlib.

Three different example curricula are shown, as well as the DR case. This example is part of the paper Improving Robustness of Autonomous Spacecraft Scheduling Using Curriculum Learning and of a future publication. In CL, a sequence of tasks of increasing difficulty is presented to the agent during training. Each task is seen as a different Markov decision process (MDP). For this problem, each task is characterized by a satellite with a different battery capacity that is exposed to different external torques, which leads to different transition probabilities in the MDP.
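
As a minimal sketch of the parameterization used throughout this example, each parameter moves linearly from an easy value at difficulty 0 to a hard value at difficulty 1 (this mirrors the CL branch of capacity_fn defined below):

def linear_task(init_val, final_val, difficulty):
    # Linear interpolation between the easy (init) and hard (final) values
    return init_val - (init_val - final_val) * difficulty

print(linear_task(400 * 3600, 0.4 * 400 * 3600, 0.5))  # battery capacity at mid-difficulty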

Load Modules

[1]:
import numpy as np
from bsk_rl import act, data, obs, scene, sats
from bsk_rl.sim import dyn, fsw, world
from bsk_rl.gym import SatelliteTasking
from typing import Any, Callable, Optional, TypeVar
from bsk_rl.utils.rllib.callbacks import WrappedEpisodeDataCallbacks, EpisodeDataWrapper
from ray.rllib.algorithms.ppo import PPOConfig
import time
import ray
from ray import tune
from bsk_rl.sats import Satellite
from ray.tune.registry import register_env
from Basilisk.architecture import bskLogging

bskLogging.setDefaultLogLevel(bskLogging.BSK_WARNING)

SatObs = TypeVar("SatObs")
MultiSatObs = tuple[SatObs, ...]
SatArgRandomizer = Callable[[list[Satellite]], dict[Satellite, dict[str, Any]]]

Creating an Environment with CL

In this example, the SatelliteTasking environment is modified to allow changes to the spacecraft parameters during training. Two extra methods, set_task and get_task, are introduced to set and get the difficulty of the environment. Additionally, update_sat_params changes specific spacecraft arguments as a function of the difficulty and is called before each environment reset.

[2]:
class SatelliteTaskingCL(SatelliteTasking):

    def __init__(
        self,
        satellite: Satellite,
        *args,
        difficulty=0.0,
        CL_params=None,  # avoid a mutable default argument
        **kwargs,
    ):

        super().__init__(
            satellite,
            *args,
            **kwargs,
        )

        self.difficulty = difficulty
        self.CL_params = CL_params

    def reset(
        self,
        seed: Optional[int] = None,
        options=None,
    ) -> tuple[MultiSatObs, dict[str, Any]]:

        self.update_sat_params()  # Update satellite parameters based on difficulty before resetting
        obs, info = super().reset(seed=seed, options=options)
        return obs, info

    def update_sat_params(self):
        """
        Update the satellite parameters based on the difficulty level.
        """
        if self.CL_params is not None:
            for satellite in self.satellites:
                for key, value in self.CL_params.items():
                    if key in satellite.sat_args_generator:
                        satellite.sat_args_generator[key] = value(self.difficulty)
                    else:
                        setattr(self, key, round(value(self.get_task())))

    def set_task(self, task):
        """
        Set the difficulty level.
        """
        self.difficulty = task

    def get_task(self):
        """
        Get the current difficulty level.
        """
        return self.difficulty
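
To illustrate the CL_params contract in isolation (a toy dictionary, not the configuration used later), each entry maps a satellite argument name to a callable of the difficulty that returns the value for the next reset:

demo_params = {"batteryStorageCapacity": lambda d: 400 * 3600 * (1.0 - 0.6 * d)}
for d in (0.0, 0.5, 1.0):
    print(d, demo_params["batteryStorageCapacity"](d))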

Registering the Custom Environment

Since a custom environment was created, it needs to be registered to make it compatible with RLlib.

[3]:
def _satellite_tasking_env_creator(env_config):
    """
    Create an environment compatible with RLlib.
    """

    if "episode_data_callback" in env_config:
        episode_data_callback = env_config.pop("episode_data_callback")
    else:
        episode_data_callback = None
    if "satellite_data_callback" in env_config:
        satellite_data_callback = env_config.pop("satellite_data_callback")
    else:
        satellite_data_callback = None

    return EpisodeDataWrapper(
        SatelliteTaskingCL(**env_config),
        episode_data_callback=episode_data_callback,
        satellite_data_callback=satellite_data_callback,
    )


register_env("SatelliteTaskingCL-RLlib", _satellite_tasking_env_creator)

Creating the Scanning Satellite

A nadir scanning satellite is created with customized observation space properties, including the angle between the solar panels and the sun and the angle between the instrument and nadir. A custom dynamics model is introduced to combine “GroundStationDynModel” and “ContinuousImagingDynModel”, allowing for both scanning and downlink actions.

[4]:
def attitude_error_norm(sat) -> float:
    # Angle between the instrument boresight unit vector (c_hat_P) and the nadir
    # direction (the negative of the unit position vector r_BN_P), normalized by pi
    r_BN_P_unit = sat.dynamics.r_BN_P / np.linalg.norm(sat.dynamics.r_BN_P)
    c_hat_P = sat.dynamics.satellite.fsw.c_hat_P  # Instrument unit vector in ECEF frame
    error_angle = np.arccos(np.dot(-r_BN_P_unit, c_hat_P))

    return error_angle / np.pi


def solar_angle_norm(sat) -> float:
    a = (
        sat.dynamics.world.gravFactory.spiceObject.planetStateOutMsgs[
            sat.dynamics.world.sun_index
        ]
        .read()
        .PositionVector
    )
    a_hat = a / np.linalg.norm(a)
    b = np.array([0, 0, -1])  # Solar panel opposite to instrument
    mat = np.transpose(sat.dynamics.BN)
    b_N = np.matmul(mat, b)
    error_angle = np.arccos(np.dot(b_N, a_hat))

    return error_angle / np.pi
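
# Normalization check (illustrative): both properties above divide the angle
# by pi, mapping the error to [0, 1]; antiparallel unit vectors give exactly 1.0.
assert np.isclose(np.arccos(np.dot([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])) / np.pi, 1.0)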


class CustomDynamics(dyn.GroundStationDynModel, dyn.ContinuousImagingDynModel):
    pass


class ScanningSatellite(sats.AccessSatellite):
    observation_spec = [
        obs.SatProperties(
            dict(prop="wheel_speeds_fraction"),
            dict(prop="battery_charge_fraction"),
            dict(prop="storage_level_fraction"),
            dict(prop="attitude_error_norm", fn=attitude_error_norm),
            dict(prop="solar_angle_norm", fn=solar_angle_norm),
        ),
        obs.Eclipse(norm=5700.0),
        obs.OpportunityProperties(
            dict(prop="opportunity_open", norm=5700.0),
            dict(prop="opportunity_close", norm=5700.0),
            type="ground_station",
            n_ahead_observe=1,
        ),
    ]
    action_spec = [
        act.Scan(duration=180.0),  # Scan for 3 minutes
        act.Charge(duration=180.0),  # Charge for 3 minutes
        act.Downlink(duration=180.0),  # Downlink for 3 minutes
        act.Desat(duration=180.0),  # Desaturate for 3 minutes
    ]
    dyn_type = CustomDynamics
    fsw_type = fsw.ContinuousImagingFSWModel

Defining Curriculum Functions

The following functions are used to define how the satellite parameters vary as a function of the difficulty during training. For these cases, the difficulty is assumed to be between 0 and 1. Direct, inverse, and constant curricula can be defined based on the initial and final levels.

[5]:
def capacity_fn(time_seed, init_val, final_val, difficulty):
    """
    Function to calculate the capacity of a given satellite property (e.g., battery, storage) based on the difficulty level.

    Args:
        time_seed (float, optional): Seed for random number generation. If None, CL will be used. Otherwise, DR will be used.
        init_val (float): Initial value of the capacity.
        final_val (float): Final value of the capacity.
        difficulty (float): Difficulty level.

    Returns:
        float: Capacity of the satellite.
    """

    if time_seed is not None:
        random_generator = np.random.default_rng(
            seed=int(time_seed * 100) * int(difficulty * 10000)
        )
        return random_generator.uniform(init_val, final_val)
    else:
        return init_val - (init_val - final_val) * difficulty


def capacity_init_fn(time_seed, init_val, final_val, difficulty, max_init, min_init):
    """
    Function to calculate the initial level of a given satellite property (e.g., battery, storage) based on the difficulty level.
    This function is necessary since the capacity is not constant and changes with the difficulty level.

    Args:
        time_seed (float, optional): Seed for random number generation. If None, CL will be used. Otherwise, DR will be used.
        init_val (float): Initial value of the capacity.
        final_val (float): Final value of the capacity.
        difficulty (float): Difficulty level.
        max_init (float): Maximum initial value of the capacity.
        min_init (float): Minimum initial value of the capacity.
    Returns:
        float: Initial level of the given satellite property.
    """

    if time_seed is not None:
        random_generator = np.random.default_rng(
            seed=int(time_seed * 100) * int(difficulty * 10000)
        )
        capacity = random_generator.uniform(init_val, final_val)
        return np.random.uniform(min_init, max_init) * capacity
    else:
        capacity = init_val - (init_val - final_val) * difficulty
        return np.random.uniform(min_init, max_init) * capacity
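
# Illustrative example (not part of the training setup): the initial charge at
# difficulty 0.5 with the nominal init range [0.375, 0.625] is a random
# fraction of the difficulty-scaled capacity, so it varies between calls.
_example_init_charge = capacity_init_fn(None, 400 * 3600, 0.4 * 400 * 3600, 0.5, 0.625, 0.375)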


def random_disturbance_vector(magnitude_disturbance, seed=None):
    """
    Function to generate a random disturbance vector with a given magnitude.

    Args:
        magnitude_disturbance (float): Magnitude of the disturbance vector.
        seed (int, optional): Seed for random number generation. Defaults to None.
    Returns:
        np.ndarray: Random disturbance vector with the given magnitude.
    """

    rng = np.random.default_rng(seed)  # honor the optional seed argument
    disturbance_rand_vector = rng.normal(size=3)
    disturbance_rand_unit_vector = disturbance_rand_vector / np.linalg.norm(
        disturbance_rand_vector
    )
    disturbance_vector = disturbance_rand_unit_vector * magnitude_disturbance
    return disturbance_vector


def external_disturbance_fn(time_seed, init_val, final_val, difficulty):
    """
    Function to calculate the external disturbance vector based on the difficulty level.

    Args:
        time_seed (float, optional): Seed for random number generation. If None, CL will be used. Otherwise, DR will be used.
        init_val (float): Initial value of the disturbance vector.
        final_val (float): Final value of the disturbance vector.
        difficulty (float): Difficulty level.
    Returns:
        np.ndarray: External disturbance vector.
    """

    if time_seed is not None:
        random_generator = np.random.default_rng(
            seed=int(time_seed * 100) * int(difficulty * 10000)
        )
        disturbance_mag = random_generator.uniform(init_val, final_val)
        return random_disturbance_vector(disturbance_mag)
    else:
        disturbance_mag = init_val - (init_val - final_val) * difficulty
        return random_disturbance_vector(disturbance_mag)
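
As a quick sanity check of the two branches (illustrative, using the battery values chosen later), the CL branch of capacity_fn interpolates deterministically while the DR branch draws randomly from the same interval:

cap = 400 * 3600
for d in (0.0, 0.5, 1.0):
    print(d, capacity_fn(None, cap, 0.4 * cap, d))  # CL: deterministic ramp
print(capacity_fn(time.time(), cap, 0.4 * cap, 0.5))  # DR: uniform random draw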

Custom Callback to Enable CL

A custom callback class is required to enable CL. CLCallbacks reads the number of environment steps sampled so far from the metrics logger and sets the task (difficulty) accordingly. Here, functions other than a linear ramp could be used to implement more complex curricula, such as spring-mass dynamics.
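
For instance, a minimal sketch of an exponential alternative to the linear ramp (the time constant tau is an arbitrary illustration, not a value from the paper):

def exponential_schedule(n_steps, tau=1_000_000):
    # Approaches difficulty 1.0 asymptotically as training progresses
    return float(1.0 - np.exp(-n_steps / tau))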

A custom episode_data_callback is also defined to collect information about the agent and the curriculum during training.

[6]:
class CLCallbacks(WrappedEpisodeDataCallbacks):

    def on_episode_start(
        self,
        *,
        episode,
        worker=None,
        env_runner=None,
        metrics_logger=None,
        base_env=None,
        env=None,
        policies=None,
        rl_module=None,
        env_index,
        **kwargs,
    ) -> None:

        try:
            n_steps = metrics_logger.peek("num_env_steps_sampled_lifetime")
            if n_steps is None:
                task = 0.0
            else:
                task = n_steps / 5_000_000  # 5M steps = 1.0 difficulty
        except KeyError:
            task = 0.0

        env.envs[env_index].unwrapped.set_task(task)


def episode_data_callback(env):
    reward = env.rewarder.cum_reward
    reward = sum(reward.values()) / len(reward)
    orbits = env.simulator.sim_time / (95 * 60)

    data_log = dict(
        reward=reward,
        # Are satellites dying, and how and when?
        alive=float(env.satellites[0].is_alive()),
        rw_status_valid=float(env.satellites[0].dynamics.rw_speeds_valid()),
        battery_status_valid=float(env.satellites[0].dynamics.battery_valid()),
        orbits_complete=orbits,
        # Is CL working? How is it varying during training?
        difficulty=env.get_task(),
        battery_capacity=env.satellites[0].dynamics.powerMonitor.storageCapacity,
        external_torque=np.linalg.norm(
            env.satellites[0].dynamics.extForceTorqueObject.extTorquePntB_B
        ),
    )
    if orbits > 0:
        data_log["reward_per_orbit"] = reward / orbits
    if not env.satellites[0].is_alive():
        data_log["orbits_complete_partial_only"] = orbits

    return data_log

Defining Satellite, Environment, and CL Options

Two different environment configurations are defined, standard_90 and degraded_90, which can be used for training and testing. Additionally, different initialization ranges can be defined for the parameters at reset. Here, nominal corresponds to parameters being initialized in a range near their nominal operating values; in wide, parameters can vary from 0% to 100%.

Different CL and DR levels are also defined to be chosen from. Each case can include several different parameters from the spacecraft, each with different CL levels.

[7]:
sat_config = dict(
    standard_90=dict(
        # Nominal env parameters
        intervals=90,
        batteryStorageCapacity=400 * 3600,  # in Ws
        disturbance_vector_mag=0.0002,
        panelEfficiency=0.2,
    ),
    degraded_90=dict(
        # Degraded env parameters
        intervals=90,
        batteryStorageCapacity=400 * 3600 * 0.5,  # in Ws
        disturbance_vector_mag=0.0002 * 3,
        panelEfficiency=0.2 * 0.75,
    ),
    # Other sat parameters common to all
    sat_params=dict(
        imageAttErrorRequirement=0.1,  # norm of MRP ~ 20 degrees
        imageRateErrorRequirement=0.1,  # norm of angular velocity (rad/s)
        dataStorageCapacity=5000 * 8e6,  # in bits
        instrumentPowerDraw=-30.0,  # in Watts
        instrumentBaudRate=0.5e6,  # bits per second
        transmitterPowerDraw=-25.0,  # in Watts
        transmitterBaudRate=-112.0e6,  # bits per second; sized to downlink all data in one opportunity
        rwMechToElecEfficiency=0.0,
        rwElecToMechEfficiency=0.5,
        thrusterPowerDraw=-80.0,
        rwBasePower=10.0,
        maxWheelSpeed=6000,  # RPM
        desatAttitude="nadir",
        K=3.5,  # Proportional control gain (attitude)
        Ki=-1,  # Integral gain (turned off)
        P=17.5,  # Derivative control gain (rate)
    ),
)

init_range_options = dict(
    nominal=dict(
        battery_init_range=[0.375, 0.625],
        data_storage_init_range=[0, 1],
        reaction_wheel_init_range=[-4000, 4000],  # RPM
    ),
    wide=dict(
        battery_init_range=[0, 1],
        data_storage_init_range=[0, 1],
        reaction_wheel_init_range=[-6000, 6000],  # RPM
    ),
)

CL_options = dict(
    constant_BT_high=dict(
        battery={
            "name": "batteryStorageCapacity",
            "init_val": 0.40,
            "final_val": 0.40,
            "init_range_config": "battery_init_range",
            "name_init": "storedCharge_Init",
            "domain_randomization": False,
        },
        torque={
            "name": "disturbance_vector_mag",
            "var_name": "disturbance_vector",
            "init_val": 8.0,
            "final_val": 8.0,
            "domain_randomization": False,
        },
    ),
    direct_BT_high=dict(
        battery={
            "name": "batteryStorageCapacity",
            "init_val": 1.00,
            "final_val": 0.40,
            "init_range_config": "battery_init_range",
            "name_init": "storedCharge_Init",
            "domain_randomization": False,
        },
        torque={
            "name": "disturbance_vector_mag",
            "var_name": "disturbance_vector",
            "init_val": 1.0,
            "final_val": 8.0,
            "domain_randomization": False,
        },
    ),
    inverse_BT_high=dict(
        battery={
            "name": "batteryStorageCapacity",
            "init_val": 0.40,
            "final_val": 1.0,
            "init_range_config": "battery_init_range",
            "name_init": "storedCharge_Init",
            "domain_randomization": False,
        },
        torque={
            "name": "disturbance_vector_mag",
            "var_name": "disturbance_vector",
            "init_val": 8.0,
            "final_val": 1.0,
            "domain_randomization": False,
        },
    ),
    DR_BT_high=dict(
        battery={
            "name": "batteryStorageCapacity",
            "init_val": 0.40,
            "final_val": 1.00,
            "init_range_config": "battery_init_range",
            "name_init": "storedCharge_Init",
            "domain_randomization": True,
        },
        torque={
            "name": "disturbance_vector_mag",
            "var_name": "disturbance_vector",
            "init_val": 1.0,
            "final_val": 8.0,
            "domain_randomization": True,
        },
    ),
)
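
For a quick look at what one option encodes, the initial and final multipliers of each parameter can be printed (for direct_BT_high, the battery shrinks from 100% to 40% of nominal while the disturbance grows from 1x to 8x):

for param, spec in CL_options["direct_BT_high"].items():
    print(param, spec["init_val"], "->", spec["final_val"])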

Choosing Curriculum for Training

Here, the direct_BT_high curriculum is selected, with the nominal initialization range and the standard environment, each episode lasting at most 90 steps.

[8]:
CL_params = {}
CL_enabled = True
CL_case = "direct_BT_high"
initialization_range = "nominal"
environment_mode = "standard_90"

sat = ScanningSatellite(
    "Scanner-1",
    sat_args=dict(
        **sat_config["sat_params"],
        batteryStorageCapacity=sat_config[environment_mode]["batteryStorageCapacity"],
        disturbance_vector=lambda: random_disturbance_vector(
            sat_config[environment_mode]["disturbance_vector_mag"]
        ),
        panelEfficiency=sat_config[environment_mode]["panelEfficiency"],
    ),
)

duration = (
    sat_config[environment_mode]["intervals"] * 180
)  # intervals of 180 seconds (3 minutes)

Assigning Curriculum Functions

After selecting the curriculum, the code below populates the CL_params dictionary with functions specifying how each parameter varies during training.

[9]:
if CL_enabled:
    for key in CL_options[CL_case].keys():
        # A time-based seed selects the DR branch; None selects the CL branch
        if CL_options[CL_case][key]["domain_randomization"]:
            current_time = time.time()
        else:
            current_time = None

        if key == "torque":
            capacity = sat_config[environment_mode][CL_options[CL_case][key]["name"]]
            init_val = CL_options[CL_case][key]["init_val"]
            final_val = CL_options[CL_case][key]["final_val"]
            CL_params[CL_options[CL_case][key]["var_name"]] = (
                lambda difficulty, capacity=capacity, init_val=init_val, final_val=final_val, time_seed=current_time: external_disturbance_fn(
                    time_seed,
                    capacity * init_val,
                    capacity * final_val,
                    difficulty,
                )
            )

        else:
            capacity = sat_config[environment_mode][CL_options[CL_case][key]["name"]]
            init_val = CL_options[CL_case][key]["init_val"]
            final_val = CL_options[CL_case][key]["final_val"]
            if "var_name" in CL_options[CL_case][key].keys():
                temp_name = CL_options[CL_case][key]["var_name"]
            else:
                temp_name = CL_options[CL_case][key]["name"]
            CL_params[temp_name] = (
                lambda difficulty, capacity=capacity, init_val=init_val, final_val=final_val, time_seed=current_time: capacity_fn(
                    time_seed,
                    capacity * init_val,
                    capacity * final_val,
                    difficulty,
                )
            )
            if "name_init" in CL_options[CL_case][key]:
                init_range = init_range_options[initialization_range][
                    CL_options[CL_case][key]["init_range_config"]
                ]
                init_val = CL_options[CL_case][key]["init_val"]
                final_val = CL_options[CL_case][key]["final_val"]
                CL_params[CL_options[CL_case][key]["name_init"]] = (
                    lambda difficulty, capacity=capacity, init_val=init_val, final_val=final_val, init_range=init_range, time_seed=current_time: capacity_init_fn(
                        time_seed,
                        capacity * init_val,
                        capacity * final_val,
                        difficulty,
                        init_range[1],
                        init_range[0],
                    )
                )
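
As a sanity check, the assembled functions can be evaluated directly; with the direct curriculum, the battery capacity shrinks and the disturbance magnitude grows as the difficulty increases:

for d in (0.0, 1.0):
    print(
        d,
        CL_params["batteryStorageCapacity"](d),
        np.linalg.norm(CL_params["disturbance_vector"](d)),
    )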

Training

Training is performed using Ray Tune. Usually, the num_env_steps_sampled_lifetime stopping criterion should be set to match the step count used to scale the difficulty in CLCallbacks (5M in this example, reduced below for demonstration). Originally, the paper Improving Robustness of Autonomous Spacecraft Scheduling Using Curriculum Learning used the APPO algorithm with generalized advantage estimation instead of PPO.
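
The example below uses PPO for simplicity; a minimal sketch of swapping in APPO instead (the hyperparameters shown are placeholders, not the published settings):

from ray.rllib.algorithms.appo import APPOConfig

appo_config = APPOConfig().training(lr=0.00003, gamma=0.999)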

[10]:
N_CPUS = 3

env_args = dict(
    satellite=sat,
    scenario=scene.UniformNadirScanning(value_per_second=1 / duration),
    rewarder=data.ScanningTimeReward(),
    world_type=world.GroundStationWorldModel,
    time_limit=duration,
    failure_penalty=-1.0,
    difficulty=0.0,
    CL_params=CL_params,
)

training_args = dict(
    lr=0.00003,
    gamma=0.999,
    train_batch_size=250,  # originally 10,000
    num_sgd_iter=50,
    model=dict(fcnet_hiddens=[512, 512], vf_share_layers=False),
    lambda_=0.95,
    use_kl_loss=False,
    entropy_coeff=0.0,
    clip_param=0.2,
    grad_clip=0.5,
)

config = (
    PPOConfig()
    .training(**training_args)
    .env_runners(num_env_runners=N_CPUS - 1, sample_timeout_s=1000.0)
    .environment(
        env="SatelliteTaskingCL-RLlib",
        env_config=dict(**env_args, episode_data_callback=episode_data_callback),
    )
    .reporting(
        metrics_num_episodes_for_smoothing=1,
        metrics_episode_collection_timeout_s=180,
    )
    .checkpointing(export_native_model_files=True)
    .framework(framework="torch")
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .callbacks(CLCallbacks)
    # An evaluation environment with parameters different from the training
    # environment can be configured by specifying the `nominal_env_args`
    # argument, which is useful for evaluating the agent in a different
    # environment than the one it was trained in:
    # .evaluation(
    #     evaluation_interval=10,
    #     evaluation_duration=1,
    #     evaluation_parallel_to_training=True,
    #     evaluation_config={
    #         "env": unpack_config(env_class),
    #         "env_config": nominal_env_args,
    #         "explore": False,
    #     },
    #     evaluation_num_workers=1,
    #     always_attach_evaluation_results=True,
    # )
)

ray.init(
    ignore_reinit_error=True,
    num_cpus=N_CPUS,
    object_store_memory=2_000_000_000,  # 2 GB
)

# Run the training
results = tune.run(
    "PPO",
    config=config.to_dict(),
    stop={
        "num_env_steps_sampled_lifetime": 750
    },  # Total number of steps to train the model. Originally 5M
    checkpoint_freq=10,
    checkpoint_at_end=True,
)

# Shutdown Ray
ray.shutdown()
2025-05-09 15:45:12,845 INFO worker.py:1783 -- Started a local Ray instance.
2025-05-09 15:45:13,658 INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/gymnasium/spaces/box.py:130: UserWarning: WARN: Box bound precision lowered by casting to float32
  gym.logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/gymnasium/utils/passive_env_checker.py:164: UserWarning: WARN: The obs returned by the `reset()` method was expecting numpy array dtype to be float32, actual type: float64
  logger.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/gymnasium/utils/passive_env_checker.py:188: UserWarning: WARN: The obs returned by the `reset()` method is not within the observation space.
  logger.warn(f"{pre} is not within the observation space.")

Tune Status

Current time: 2025-05-09 15:45:37
Running for: 00:00:24.24
Memory: 4.2/15.6 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 3.0/3 CPUs, 0/0 GPUs

Trial Status

Trial name                                  status      loc              iter  total time (s)  num_env_steps_sampled_lifetime  num_episodes_lifetime  num_env_steps_trained_lifetime
PPO_SatelliteTaskingCL-RLlib_98d98_00000    TERMINATED  10.1.0.43:4291      3         13.5327                             750                      8                             750
(PPO pid=4291) Install gputil for GPU system monitoring.

Trial Progress

Trial name: PPO_SatelliteTaskingCL-RLlib_98d98_00000
num_env_steps_sampled_lifetime: 750
num_env_steps_trained_lifetime: 750
num_agent_steps_sampled_lifetime: {'default_agent': 750}
num_episodes_lifetime: 8
env_runners: {'orbits_complete': np.float64(2.842105263157895), 'num_env_steps_sampled': 250, 'num_agent_steps_sampled': {'default_agent': 250}, 'module_episode_returns_mean': {'default_policy': 0.3127777777777777}, 'episode_return_mean': 0.3127777777777777, 'battery_capacity': np.float64(1439999.568), 'num_episodes': 4, 'num_env_steps_sampled_lifetime': 2250, 'sample': np.float64(4.053987212547473), 'rw_status_valid': np.float64(1.0), 'num_module_steps_sampled': {'default_policy': 250}, 'num_agent_steps_sampled_lifetime': {'default_agent': 1500}, 'num_module_steps_sampled_lifetime': {'default_policy': 1500}, 'agent_episode_returns_mean': {'default_agent': 0.3127777777777777}, 'alive': np.float64(1.0), 'episode_len_max': 90, 'episode_duration_sec_mean': 2.7957597205000013, 'episode_return_min': 0.29518518518518516, 'difficulty': np.float64(5.05e-05), 'battery_status_valid': np.float64(1.0), 'episode_len_mean': 90.0, 'episode_return_max': 0.3303703703703703, 'reward_per_orbit': np.float64(0.10497695473251026), 'episode_len_min': 90, 'external_torque': np.float64(0.00020000070000000003), 'reward': np.float64(0.29835555555555554), 'time_between_sampling': np.float64(0.576623813287938)}
fault_tolerance: {'num_healthy_workers': 2, 'num_in_flight_async_reqs': 0, 'num_remote_worker_restarts': 0}
learners: {'default_policy': {'mean_kl_loss': 0.0, 'vf_loss_unclipped': 0.0006976892473176122, 'total_loss': 0.0024407142773270607, 'num_module_steps_trained': 250, 'vf_explained_var': 0.01143866777420044, 'gradients_default_optimizer_global_norm': 0.08669506758451462, 'num_trainable_parameters': 139013.0, 'vf_loss': 0.0006976892473176122, 'curr_entropy_coeff': 0.0, 'num_non_trainable_parameters': 0.0, 'policy_loss': 0.0017430232837796211, 'default_optimizer_learning_rate': 3e-05, 'entropy': 1.2671451568603516}, '__all_modules__': {'total_loss': 0.0024407142773270607, 'num_module_steps_trained': 250, 'num_trainable_parameters': 139013.0, 'num_env_steps_trained': 250, 'num_non_trainable_parameters': 0.0}}
perf: {'cpu_util_percent': np.float64(47.357142857142854), 'ram_util_percent': np.float64(26.800000000000004)}
timers: {'env_runner_sampling_timer': 4.080110184687766, 'learner_update_timer': 0.5082685389585274, 'synch_weights': 0.006149941521093824, 'synch_env_connectors': 0.0062135315313983485}
(PPO pid=4291) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/runner/ray_results/PPO_2025-05-09_15-45-13/PPO_SatelliteTaskingCL-RLlib_98d98_00000_0_2025-05-09_15-45-13/checkpoint_000000)
2025-05-09 15:45:37,931 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/runner/ray_results/PPO_2025-05-09_15-45-13' in 0.0321s.
2025-05-09 15:45:38,306 INFO tune.py:1041 -- Total run time: 24.65 seconds (24.21 seconds for the tuning loop).

Checking Difficulty Over Training

After a few training steps, the difficulty has started to increase.

[11]:
results.results[list(results.results.keys())[0]]["env_runners"]["difficulty"]
[11]:
np.float64(5.05e-05)
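
This value is consistent with the linear schedule in CLCallbacks, where difficulty = num_env_steps_sampled_lifetime / 5,000,000: roughly 250 environment steps sampled before the episode started give 250 / 5,000,000 = 5e-05, matching the logged value above.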