Data & Reward
Data collection and reward calculation are provided by bsk_rl.data.
Reward System Components
The reward system has three main components: GlobalReward, DataStore, and Data.
GlobalReward acts as a global critic for the environment, rewarding each agent's performance. It has full knowledge of the environment and can provide rewards based on the global state of the environment, even if the agent does not have access to that information. For example, you may not want to reward an agent for imaging a target that has already been imaged by another agent, even if the agent does not know that the target has been imaged. Reward is generally calculated by processing the per-satellite dictionary of new Data generated at each step with the GlobalReward.calculate_reward method.
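The global-critic idea can be illustrated with a minimal sketch that does not use bsk_rl itself. The class name FirstImageCritic, the unit reward per target, and the use of plain lists of target names in place of Data are assumptions made for illustration only:

```python
class FirstImageCritic:
    """Hypothetical critic: rewards a target only the first time any agent images it."""

    def __init__(self):
        # Global record of imaged targets; individual agents may not know its contents.
        self.imaged = set()

    def calculate_reward(self, new_data_dict):
        rewards = {}
        for sat_id, targets in new_data_dict.items():
            # Only targets nobody has imaged before earn reward.
            new_targets = set(targets) - self.imaged
            rewards[sat_id] = float(len(new_targets))
            self.imaged |= new_targets
        return rewards


critic = FirstImageCritic()
step_1 = critic.calculate_reward({"sat-1": ["tgt-A"], "sat-2": []})
# sat-2 re-images tgt-A without knowing sat-1 already did; only tgt-B is rewarded.
step_2 = critic.calculate_reward({"sat-1": [], "sat-2": ["tgt-A", "tgt-B"]})
```

In this simplified version, if two satellites image the same target within a single step, whichever appears first in the dictionary receives the reward; a real reward system may resolve such ties differently.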
The DataStore handles each satellite's local knowledge of the scenario and the data it generates. The data store gains data in three ways:
- On environment reset, the GlobalReward calls initial_data to provide the initial knowledge of the scenario for each satellite. This may be empty or may contain some a priori knowledge, such as a list of targets that are desired to be imaged.
- At the end of each step, the result of get_log_state is compared to the previous step's result via compare_log_states. A unit of Data is returned. For example, the log state may be the level of each target's buffer partition in the storage unit, so a change in a certain buffer level leads to a unit of data that indicates the corresponding target has been imaged.
- At the end of each step, satellites communicate based on the Communication system being used. Satellites merge the contents of their data stores with any other satellite's data store that they have communicated with.
Finally, Data can represent data generated by the satellite towards some goal (e.g. images of targets, time spent in a desirable mode, etc.) as well as information about the environment that is useful toward completing its mission (e.g. desired targets to image, what targets have already been imaged, etc.).
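As a rough illustration of a unit of data carrying both kinds of content, the sketch below pairs generated data (imaged targets) with environment information (known targets). The class SketchImageData, its field names, the use of + for combining units, and the remaining helper are hypothetical and are not taken from the bsk_rl API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchImageData:
    """Hypothetical unit of data for an imaging mission."""

    imaged: frozenset = frozenset()  # data generated toward the goal
    known: frozenset = frozenset()   # information about the environment

    def __add__(self, other):
        # Combining two units keeps everything either one contains.
        return SketchImageData(
            imaged=self.imaged | other.imaged,
            known=self.known | other.known,
        )

    def remaining(self):
        # Known targets that have not yet been imaged.
        return self.known - self.imaged


a_priori = SketchImageData(known=frozenset({"tgt-A", "tgt-B", "tgt-C"}))
new_image = SketchImageData(imaged=frozenset({"tgt-B"}))
combined = a_priori + new_image
```

Here combined.remaining() yields the targets still worth imaging, which is the kind of filtering a satellite could apply using its local knowledge.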
Implementing Data & Reward Types
See Base Data for full documentation of the reward system components to use when implementing custom data and reward types.
Reward System Types
A variety of reward systems are available for use in the environment. The following table provides a summary of the available reward systems:
Type | Purpose | Compatibility
NoReward | Returns zero reward for every agent at every step. |
UniqueImageReward | Returns reward corresponding to target priority the first time a target is imaged by any agent. Causes satellites to filter targets that are known to have been imaged already. | Should be used with ImagingSatellite.
ScanningTimeReward | Returns reward based on time spent in the nadir-pointing scanning mode. | Should be used with the UniformNadirScanning scenario.
To select a reward system to use, pass an instance of GlobalReward to the data field of the environment constructor:
env = ConstellationTasking(
    ...,
    data=ScanningTimeReward(),
    ...
)
- class GlobalReward
Bases: ABC, Resetable
Base class for simulation-wide data management and rewarding.
The method calculate_reward must be overridden by subclasses. Other methods may be extended as necessary for housekeeping.
- link_scenario(scenario: Scenario) → None
Link the data manager to the scenario.
- Parameters:
scenario (Scenario) – The scenario that the data manager is being used with.
- Return type:
None
- reset_overwrite_previous() → None
Overwrite attributes from previous episode.
- Return type:
None
- create_data_store(satellite: Satellite, **data_store_kwargs) → None
Create a data store for a satellite.
- Parameters:
satellite (Satellite) – Satellite to create a data store for.
data_store_kwargs – Additional keyword arguments to pass to the data store
- Return type:
None
- abstract calculate_reward(new_data_dict: dict[str, Data]) → dict[str, float]
Calculate step reward based on all satellite data from a step.
Returns a dictionary of rewards for each satellite based on the new data generated by each satellite during the previous step, in the form:
{"sat-1_id": 0.23, "sat-2_id": 0.0, ...}
- Parameters:
new_data_dict (dict[str, Data]) –
A dictionary of new data generated by each satellite, in the form:
{"sat-1_id": data1, "sat-2_id": data2, ...}
- Return type:
dict[str, float]
- reward(new_data_dict: dict[str, Data]) → dict[str, float]
Call calculate_reward and log cumulative reward.
- Parameters:
new_data_dict (dict[str, Data])
- Return type:
dict[str, float]
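The relationship between the abstract calculate_reward and the reward wrapper can be sketched as follows. This is a simplified stand-in written for illustration, not the bsk_rl implementation: the class names, the cum_reward attribute, and the constant 0.5 reward are assumptions.

```python
from abc import ABC, abstractmethod


class SketchGlobalReward(ABC):
    """Hypothetical base class mirroring the calculate_reward / reward split."""

    def __init__(self):
        self.cum_reward = {}  # running total per satellite (assumed attribute name)

    @abstractmethod
    def calculate_reward(self, new_data_dict: dict) -> dict[str, float]:
        """Subclasses map each satellite's new data to a step reward."""

    def reward(self, new_data_dict: dict) -> dict[str, float]:
        # Delegate to the subclass, then log the cumulative reward.
        step_reward = self.calculate_reward(new_data_dict)
        for sat_id, value in step_reward.items():
            self.cum_reward[sat_id] = self.cum_reward.get(sat_id, 0.0) + value
        return step_reward


class ConstantReward(SketchGlobalReward):
    """Toy subclass: every satellite earns 0.5 per step, regardless of data."""

    def calculate_reward(self, new_data_dict):
        return {sat_id: 0.5 for sat_id in new_data_dict}


critic = ConstantReward()
first = critic.reward({"sat-1": None, "sat-2": None})
second = critic.reward({"sat-1": None, "sat-2": None})
```

Keeping the bookkeeping in reward means a subclass only needs to implement the per-step calculation.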
- class NoReward(*args, **kwargs)
Bases: GlobalReward
Returns zero reward at every step.
This reward system is useful for debugging environments, but is not useful for training, since reward is always zero for every satellite.
- datastore_type
alias of NoDataStore
- class UniqueImageReward(reward_fn: Callable = <function UniqueImageReward.<lambda>>)
Bases: GlobalReward
GlobalReward for rewarding unique images.
This data system should be used with the ImagingSatellite and a scenario that generates targets, such as UniformTargets or CityTargets.
The satellites all start with complete knowledge of the targets in the scenario. Each target can only give one satellite a reward once; if any satellite has imaged a target, reward will never again be given for that target. The satellites filter known imaged targets from consideration for imaging to prevent duplicates. Communication can transmit information about what targets have been imaged in order to prevent reimaging.
- Parameters:
scenario – GlobalReward.scenario
reward_fn (Callable) – Reward as function of priority.
- datastore_type
alias of UniqueImageStore
- initial_data(satellite: Satellite) → UniqueImageData
Furnish data to the scenario.
Currently, it is assumed that all targets are known a priori, so the initial data given to the data store is the list of all targets.
- Parameters:
satellite (Satellite)
- Return type:
UniqueImageData
- create_data_store(satellite: Satellite) → None
Override the access filter in addition to creating the data store.
- Parameters:
satellite (Satellite)
- Return type:
None
- calculate_reward(new_data_dict: dict[str, UniqueImageData]) → dict[str, float]
Reward each new unique image once.
Reward is evaluated based on self.reward_fn(target.priority).
- Parameters:
new_data_dict (dict[str, UniqueImageData]) – Record of new images for each satellite
- Returns:
Cumulative reward across satellites for one step
- Return type:
dict[str, float]
- class ScanningTimeReward(reward_fn: Callable | None = None)
Bases: GlobalReward
GlobalReward for rewarding time spent scanning nadir.
This class should be used with the UniformNadirScanning scenario and a satellite with ContinuousImagingFSWModel and the Scan action.
Time is computed based on the amount of data in the satellite's buffer. In the basic configuration, this is the amount of time that the Scan action is enabled and pointing thresholds are met. However, if other models are used to prevent the accumulation of data, the satellite will not be rewarded for those times.
- Parameters:
reward_fn (Callable | None) – Reward as a function of time spent pointing nadir. By default, the reward is the time spent scanning multiplied by scenario.value_per_second.
- datastore_type
alias of ScanningTimeStore
- calculate_reward(new_data_dict: dict[str, ScanningTime]) → dict[str, float]
Calculate reward based on reward_fn.
- Parameters:
new_data_dict (dict[str, ScanningTime])
- Return type:
dict[str, float]
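The default behavior described for reward_fn, scanning time multiplied by a per-second value, can be sketched without bsk_rl. The function make_scanning_reward and the use of plain floats for scanning time are assumptions for illustration; the real class operates on ScanningTime data and reads value_per_second from the scenario.

```python
def make_scanning_reward(value_per_second, reward_fn=None):
    """Build a hypothetical per-step reward calculator for scanning time."""
    if reward_fn is None:
        # Assumed default: reward is proportional to the time spent scanning.
        def reward_fn(scan_time):
            return scan_time * value_per_second

    def calculate_reward(new_data_dict):
        # new_data_dict maps satellite id -> seconds of scanning this step.
        return {
            sat_id: reward_fn(scan_time)
            for sat_id, scan_time in new_data_dict.items()
        }

    return calculate_reward


default_reward = make_scanning_reward(value_per_second=0.5)
default_result = default_reward({"sat-1": 10.0, "sat-2": 0.0})

# A custom reward_fn, here capping the rewarded time at 60 seconds per step.
capped_reward = make_scanning_reward(0.5, reward_fn=lambda t: min(t, 60.0))
capped_result = capped_reward({"sat-1": 90.0})
```

Passing a custom reward_fn replaces the proportional default entirely, so any scaling by the per-second value would need to be included in the custom function.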