Data & Reward

Data collection and reward calculation are provided in bsk_rl.data.

Reward System Components

The reward system has three main components: GlobalReward, DataStore, and Data.

GlobalReward acts as a global critic for the environment, rewarding each agent’s performance. It has full knowledge of the environment and can assign rewards based on the global state, even when an agent does not have access to that information. For example, you may not want to reward an agent for imaging a target that another agent has already imaged, even if the first agent does not know that. Reward is generally calculated by the GlobalReward.calculate_reward method, which processes the dictionary of new Data generated by each satellite at each step.

The DataStore handles each satellite’s local knowledge of the scenario and the data it generates. The data store gains data in three ways:

On environment reset, the GlobalReward calls initial_data to provide the initial knowledge of the scenario for each satellite. This may be empty or may contain some a priori knowledge, such as a list of targets that are desired to be imaged.
At the end of each step, the result of get_log_state is compared to the previous step’s result via compare_log_states. A unit of Data is returned. For example, the log state may be the level of each target’s buffer partition in the storage unit, so a change in a certain buffer level leads to a unit of data that indicates the corresponding target has been imaged.
At the end of each step, satellites communicate based on the Communication system being used. Satellites merge the contents of their data stores with any other satellite’s data store that they have communicated with.
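
The second and third mechanisms can be illustrated with a small standalone sketch. It does not import bsk_rl: ToyData, ToyStore, update_from_logs, merge, and the buffer-level log state are hypothetical stand-ins chosen to mirror the description above, not the library's classes or signatures.

```python
from dataclasses import dataclass, field


@dataclass
class ToyData:
    """Hypothetical unit of data: the set of targets known to be imaged."""

    imaged: set[str] = field(default_factory=set)

    def __add__(self, other: "ToyData") -> "ToyData":
        # Combining two units of data is a set union in this toy model.
        return ToyData(self.imaged | other.imaged)


class ToyStore:
    """Hypothetical stand-in for a per-satellite data store."""

    def __init__(self, initial: ToyData) -> None:
        self.data = initial
        self.prev_log: dict[str, float] = {}

    def compare_log_states(
        self, old: dict[str, float], new: dict[str, float]
    ) -> ToyData:
        # A buffer level that increased means the target was imaged this step.
        return ToyData({t for t, level in new.items() if level > old.get(t, 0.0)})

    def update_from_logs(self, log_state: dict[str, float]) -> ToyData:
        # Compare against the previous step, remember the new log state,
        # and fold the new data into the store.
        new_data = self.compare_log_states(self.prev_log, log_state)
        self.prev_log = dict(log_state)
        self.data = self.data + new_data
        return new_data

    def merge(self, other: "ToyStore") -> None:
        # After communication, adopt everything the other satellite knows.
        self.data = self.data + other.data


store_a = ToyStore(ToyData())
store_b = ToyStore(ToyData())
new_a = store_a.update_from_logs({"tgt-1": 1.0, "tgt-2": 0.0})
new_b = store_b.update_from_logs({"tgt-1": 0.0, "tgt-2": 2.5})
store_a.merge(store_b)  # one-way merge: only store_a learns from store_b
```

In this sketch the per-step return value (new_a, new_b) plays the role of the new Data handed to the reward calculation, while the merged store holds the satellite's accumulated knowledge.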

Finally, Data can represent data generated by the satellite towards some goal (e.g. images of targets, time spent in a desirable mode, etc.) as well as information about the environment that is useful for completing its mission (e.g. desired targets to image, which targets have already been imaged, etc.).

Implementing Data & Reward Types

See Base Data for full documentation of the reward system components when implementing custom data and reward types.

Reward System Types

A variety of reward systems are available for use in the environment. The following table provides a summary of the available reward systems:

Type	Purpose	Compatibility
`NoReward`	Returns zero reward for every agent at every step.
`UniqueImageReward`	Returns reward corresponding to target priority the first time a target is imaged by any agent. Causes satellites to filter targets that are known to have been imaged already.	Should be used with `ImagingSatellite` and a `Target`-based scenario.
`ScanningTimeReward`	Returns reward based on time spent in the nadir-pointing scanning mode.	Should be used with the `UniformNadirScanning` scenario.

To select a reward system to use, pass an instance of GlobalReward to the data field of the environment constructor:

env = ConstellationTasking(
    ...,
    data=ScanningTimeReward(),
    ...
)

Multiple reward systems can be added to the environment by instead passing an iterable of reward systems to the data field of the environment constructor:

env = ConstellationTasking(
    ...,
    data=(ScanningTimeReward(), SomeOtherReward()),
    ...
)

On the backend, this creates a ComposedDataStore that handles the combination of multiple reward systems.
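
This page does not state how the rewards from several systems are combined. The sketch below assumes a simple per-satellite sum, purely for illustration; sum_rewards is a hypothetical helper, not part of bsk_rl, and the actual combination rule should be checked against the library.

```python
def sum_rewards(*reward_dicts: dict[str, float]) -> dict[str, float]:
    """Add per-satellite rewards from several reward systems (assumed rule)."""
    total: dict[str, float] = {}
    for rewards in reward_dicts:
        for sat_id, value in rewards.items():
            total[sat_id] = total.get(sat_id, 0.0) + value
    return total


# Illustrative per-step outputs of two reward systems.
scanning = {"sat-1_id": 0.5, "sat-2_id": 0.0}
other = {"sat-1_id": 0.25, "sat-2_id": 1.0}
combined = sum_rewards(scanning, other)
```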

class GlobalReward[source]

Bases: ABC, Resetable

Base class for simulation-wide data management and rewarding.

The method calculate_reward must be overridden by subclasses. Other methods may be extended as necessary for housekeeping.

datastore_type: type[DataStore]

scenario: Scenario

link_scenario(scenario: Scenario) → None[source]

Link the data manager to the scenario.

Parameters: scenario (Scenario) – The scenario that the data manager is being used with.
Return type: None

reset_overwrite_previous() → None[source]

Overwrite attributes from previous episode.

Return type: None

initial_data(satellite: Satellite) → Data[source]

Furnish the DataStore with initial data.

Parameters: satellite (Satellite)
Return type: Data

create_data_store(satellite: Satellite, **data_store_kwargs) → None[source]

Create a data store for a satellite.

Parameters:

satellite (Satellite) – Satellite to create a data store for.
data_store_kwargs – Additional keyword arguments to pass to the data store

Return type:

None

abstract calculate_reward(new_data_dict: dict[str, Data]) → dict[str, float][source]

Calculate step reward based on all satellite data from a step.

Returns a dictionary of rewards for each satellite based on the new data generated by each satellite during the previous step, in the form:

{"sat-1_id": 0.23, "sat-2_id": 0.0, ...}

Parameters:

new_data_dict (dict[str, Data]) –

A dictionary of new data generated by each satellite, in the form:

{"sat-1_id": data1, "sat-2_id": data2, ...}

Return type:

dict[str, float]

reward(new_data_dict: dict[str, Data]) → dict[str, float][source]

Call calculate_reward and log cumulative reward.

Parameters: new_data_dict (dict[str, Data])
Return type: dict[str, float]
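
The relationship between calculate_reward and reward can be sketched without importing bsk_rl. ToyGlobalReward below is a hypothetical stand-in that mirrors only the two methods documented above; the cum_reward attribute and the CountReward rule are assumptions made for illustration, and the real base class has further responsibilities (scenario linking, data store creation) that are omitted here.

```python
from abc import ABC, abstractmethod


class ToyGlobalReward(ABC):
    """Hypothetical stand-in mirroring the interface documented above."""

    def __init__(self) -> None:
        # Running total of reward per satellite (attribute name is assumed).
        self.cum_reward: dict[str, float] = {}

    @abstractmethod
    def calculate_reward(
        self, new_data_dict: dict[str, list[str]]
    ) -> dict[str, float]:
        """Return a reward for each satellite ID."""

    def reward(self, new_data_dict: dict[str, list[str]]) -> dict[str, float]:
        # Delegate to the subclass, then log the cumulative reward.
        step_reward = self.calculate_reward(new_data_dict)
        for sat_id, value in step_reward.items():
            self.cum_reward[sat_id] = self.cum_reward.get(sat_id, 0.0) + value
        return step_reward


class CountReward(ToyGlobalReward):
    """Toy rule: one unit of reward per new data item."""

    def calculate_reward(self, new_data_dict):
        return {sat_id: float(len(items)) for sat_id, items in new_data_dict.items()}


rewarder = CountReward()
first = rewarder.reward({"sat-1_id": ["a", "b"], "sat-2_id": []})
second = rewarder.reward({"sat-1_id": ["c"], "sat-2_id": ["d"]})
```

A custom reward type would follow the same shape: override calculate_reward and leave the wrapper that logs totals untouched.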

class NoReward(*args, **kwargs)[source]

Bases: GlobalReward

Returns zero reward at every step.

This reward system is useful for debugging environments, but is not useful for training, since reward is always zero for every satellite.

datastore_type: alias of NoDataStore

calculate_reward(new_data_dict)[source]: Reward nothing.

class UniqueImageReward(reward_fn: Callable = <function UniqueImageReward.<lambda>>)[source]

Bases: GlobalReward

GlobalReward for rewarding unique images.

This data system should be used with the ImagingSatellite and a scenario that generates targets, such as UniformTargets or CityTargets.

The satellites all start with complete knowledge of the targets in the scenario. Each target can only give one satellite a reward once; if any satellite has imaged a target, reward will never again be given for that target. The satellites filter known imaged targets from consideration for imaging to prevent duplicates. Communication can transmit information about what targets have been imaged in order to prevent reimaging.

Parameters:

scenario – GlobalReward.scenario
reward_fn (Callable) – Reward as a function of target priority.

datastore_type: alias of UniqueImageStore

initial_data(satellite: Satellite) → UniqueImageData[source]

Furnish data to the scenario.

Currently, it is assumed that all targets are known a priori, so the initial data given to the data store is the list of all targets.

Parameters: satellite (Satellite)
Return type: UniqueImageData

create_data_store(satellite: Satellite) → None[source]

Override the access filter in addition to creating the data store.

Parameters: satellite (Satellite)
Return type: None

calculate_reward(new_data_dict: dict[str, UniqueImageData]) → dict[str, float][source]

Reward each new unique image once.

Reward is evaluated based on self.reward_fn(target.priority).

Parameters: new_data_dict (dict[str, UniqueImageData]) – Record of new images for each satellite
Returns: Reward for each satellite for one step
Return type: dict[str, float]
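
The "reward once per target" rule can be sketched in plain Python. This is a minimal illustration, not the library's implementation: unique_image_rewards, the (target_id, priority) tuples, and the identity default for reward_fn are all assumptions. When two satellites image the same new target in one step, this sketch credits whichever is processed first, which is a simplification.

```python
from typing import Callable


def unique_image_rewards(
    new_images: dict[str, list[tuple[str, float]]],
    already_imaged: set[str],
    reward_fn: Callable[[float], float] = lambda priority: priority,
) -> dict[str, float]:
    """Reward each (target_id, priority) pair the first time the target is seen.

    already_imaged is updated in place so later steps skip rewarded targets.
    """
    rewards = {sat_id: 0.0 for sat_id in new_images}
    for sat_id, images in new_images.items():
        for target_id, priority in images:
            if target_id in already_imaged:
                continue  # no reward for re-imaging a known target
            already_imaged.add(target_id)
            rewards[sat_id] += reward_fn(priority)
    return rewards


imaged = {"tgt-3"}  # tgt-3 was imaged in an earlier step
step_1 = unique_image_rewards(
    {"sat-1_id": [("tgt-1", 0.5)], "sat-2_id": [("tgt-3", 1.0)]}, imaged
)
step_2 = unique_image_rewards(
    {"sat-1_id": [], "sat-2_id": [("tgt-1", 0.5), ("tgt-2", 0.25)]}, imaged
)
```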

class ScanningTimeReward(reward_fn: Callable | None = None)[source]

Bases: GlobalReward

GlobalReward for rewarding time spent scanning nadir.

This class should be used with the UniformNadirScanning scenario and a satellite with ContinuousImagingFSWModel and the Scan action.

Time is computed based on the amount of data in the satellite’s buffer. In the basic configuration, this is the amount of time that the Scan action is enabled and pointing thresholds are met. However, if other models are used to prevent the accumulation of data, the satellite will not be rewarded for those times.

Parameters: reward_fn (Callable | None) – Reward as a function of time spent pointing nadir. By default, it is the time spent scanning multiplied by scenario.value_per_second.

datastore_type: alias of ScanningTimeStore

calculate_reward(new_data_dict: dict[str, ScanningTime]) → dict[str, float][source]

Calculate reward based on reward_fn.

Parameters: new_data_dict (dict[str, ScanningTime])
Return type: dict[str, float]
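
The default behavior described for reward_fn (scanning time multiplied by value_per_second) can be sketched as follows. The function scanning_time_rewards and its plain-float inputs are hypothetical; in the library the time comes from ScanningTime data and the rate from the scenario.

```python
from typing import Callable, Optional


def scanning_time_rewards(
    scan_seconds: dict[str, float],
    value_per_second: float,
    reward_fn: Optional[Callable[[float], float]] = None,
) -> dict[str, float]:
    """Map each satellite's new scanning time to a reward."""
    # Fall back to the linear rule when no custom function is supplied.
    fn = reward_fn if reward_fn is not None else (lambda s: s * value_per_second)
    return {sat_id: fn(seconds) for sat_id, seconds in scan_seconds.items()}


# Default: reward grows linearly with scanning time.
default = scanning_time_rewards(
    {"sat-1_id": 60.0, "sat-2_id": 0.0}, value_per_second=0.5
)

# Custom: saturate the reward after 30 seconds of scanning.
capped = scanning_time_rewards(
    {"sat-1_id": 60.0},
    value_per_second=0.5,
    reward_fn=lambda s: min(s, 30.0) / 30.0,
)
```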