The normal agents are rewarded based on the least distance of any of the agents to the landmark, and penalized based on the distance between the adversary and the target landmark. Under this reward structure, the agents cooperate to spread out across all the landmarks, so as to deceive the adversary. The framework of centralized training with decentralized execution has been adopted in this paper. This means that some extra information is used to ease training, but that information is not used at test time. This framework can be naturally implemented with an actor-critic algorithm. Let me explain why. During training, the critic for each agent uses extra information, such as the states observed and actions taken by all the other agents. As for the actor, you’ll notice that there is one for each agent. Each actor has access only to its own agent’s observations and actions. During execution time, only the actors are present, and hence only the agents’ own observations and actions are used. Learning a separate critic for each agent allows us to use a different reward structure for each one. Hence, the algorithm can be used in all cooperative, competitive, and mixed scenarios.
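The asymmetry between what the critics and the actors see can be made concrete with a small sketch. This is a minimal illustration of the information flow only, assuming hypothetical dimensions and random linear layers in place of trained networks (none of these sizes or names come from the paper): each per-agent critic consumes the concatenation of all agents' observations and actions, while each actor consumes only its own observation.

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS = 3
OBS_DIM, ACT_DIM = 4, 2  # hypothetical per-agent sizes

def make_linear(in_dim, out_dim):
    """A random linear map standing in for a trained network."""
    W = rng.normal(size=(out_dim, in_dim))
    return lambda x: W @ x

# One actor per agent: sees ONLY its own observation.
actors = [make_linear(OBS_DIM, ACT_DIM) for _ in range(N_AGENTS)]

# One critic per agent: sees ALL observations and ALL actions --
# the extra information that is available only during training.
critic_in_dim = N_AGENTS * (OBS_DIM + ACT_DIM)
critics = [make_linear(critic_in_dim, 1) for _ in range(N_AGENTS)]

obs = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]

# Execution time: each agent acts from its own observation alone.
acts = [actor(o) for actor, o in zip(actors, obs)]

# Training time: each critic scores the joint observation-action vector,
# so each agent can be trained against its own reward signal.
joint = np.concatenate(obs + acts)
q_values = [float(critic(joint)) for critic in critics]

print(len(acts), acts[0].shape, len(q_values))  # 3 (2,) 3
```

At test time the critics are simply discarded, which is why nothing beyond each agent's own observation is ever needed for execution.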