
Allow overwriting action by environment #56

Open

mryellow wants to merge 7 commits into master

Conversation

mryellow (Contributor) commented Sep 2, 2016

No description provided.

-reward, terminal, state = self:takeAction(action)
+reward, terminal, state, taken = self:takeAction(action)
+if taken and taken ~= action then
+  self.actions[self.batchIdx] = taken
mryellow (Contributor Author) commented:

Could do it this way around, looking more like OneStepQAgent.

if taken and taken ~= action then
  action = taken
end
self.actions[self.batchIdx] = action

Kaixhin (Owner) commented Sep 2, 2016

Code looks fine, but what is this trying to achieve? If the chosen action may not be deterministically executed in the environment, the agent should still treat it as if the chosen action were taken; otherwise it will not adapt to the real stochasticity in the application of its actions. For an ALE example, see Increasing the Action Gap.

mryellow (Contributor Author) commented Sep 2, 2016

Expert instruction is one application: discard whatever the net came up with and learn from experiences that have been overwritten with expert play, or with some hand-crafted strategy.

Last time I used it was to force the robot to repeat turning-on-the-spot actions at random intervals once epsilon dropped below a threshold, to escape corners and gather some better experiences again.
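Roughly the shape I have in mind on the environment side; names here (ExpertOverrideEnv, wantsControl, act, inner) are made up for illustration and aren't part of the rlenvs API:

-- Hypothetical wrapper environment that can substitute an expert action
local ExpertOverrideEnv = {}

function ExpertOverrideEnv:step(action)
  local taken = action
  if self.expert and self.expert:wantsControl() then
    taken = self.expert:act() -- e.g. teleop, or a hand-crafted escape-the-corner policy
  end
  local reward, terminal, state = self.inner:step(taken)
  -- Extra return value reports the action actually executed
  return reward, terminal, state, taken
end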

Kaixhin (Owner) commented Sep 2, 2016

Got it - can you add a short 2nd paragraph to the custom docs to make people aware of this modification from the rlenvs API, along with a use-case as above?

mryellow (Contributor Author) commented Sep 2, 2016

Yeah no problem.

Describing it just now, my method has always been a hack: discarding the agent's selection after doing the unneeded work. Could instead pass a variable back towards agent.observe(), much like terminal is used at the start of the loop but defined at the end, something like next_action. The environment doesn't overwrite the current action but can specify the next one, and the forward pass is skipped if it's defined (rough sketch below).
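Rough shape of the loop I'm imagining; the signatures (env:start, agent:observe, agent:takeAction) are illustrative only, not the actual Atari/rlenvs API:

-- Hypothetical main loop with a next_action hand-back from the environment
local state = env:start()
local reward, terminal, nextAction = 0, false, nil

while not terminal do
  local action
  if nextAction then
    action = nextAction -- environment dictated the next action: skip the forward pass
  else
    action = agent:observe(reward, state, terminal) -- usual selection via a forward pass
  end
  reward, terminal, state, nextAction = agent:takeAction(action)
end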

Doesn't really matter for my robot use case, as I'm throwing away CPU cycles to run in real time (10Hz) rather than as fast as possible; guess it matters when it comes to battery drain on hardware.

For real-time expert play, I wonder if that next_action approach would introduce a little too much lag, with the teleop really needing its action to apply to the current cycle.

mryellow (Contributor Author) commented Sep 2, 2016

> lag

Not a lot happens before we're back at the start of the loop again. next_action might be as good as "do it 'now', given what I see this frame". It could work either way.

mryellow (Contributor Author) commented Sep 2, 2016

Closing in favour of #57

Avoids doing work only to then discard it.

mryellow closed this Sep 2, 2016
mryellow reopened this Sep 3, 2016
mryellow (Contributor Author) commented Sep 3, 2016

Was thinking of adding the extra return to the validation agents too, allowing the environment to pass back any change during validation as well. However, nothing is done with the action there apart from acting.

Could add it to the signature so the variable is already there if it's needed in future...

So far prioritised replay hasn't crashed with the code this way around; firing off the observe methods must play into setting up the distribution strata at some point.
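Something like this on the validation side, just to reserve the slot (sketch only; self:act here is a stand-in for the existing step logic, not a real method):

-- Hypothetical: give ValidationAgent the same four-value signature,
-- returning the action unchanged for now so the variable exists if needed later
function ValidationAgent:takeAction(action)
  local reward, terminal, state = self:act(action)
  return reward, terminal, state, action
end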

@@ -91,7 +91,7 @@ function AsyncAgent:takeAction(action)
     self.stateBuffer:push(observation)
   end
 
-  return reward, terminal, self.stateBuffer:readAll()
+  return reward, terminal, self.stateBuffer:readAll(), actionTaken
mryellow (Contributor Author) commented Sep 3, 2016

Could add the offset actionOffset before returning, so it doesn't need to be added before the actionTaken ~= action comparison.

edit: Done

mryellow (Contributor Author) commented:

Think I had some additional documentation changes to do on this. Been distracted working on hardware and getting the dev environment fixed up; once that's outta the way I'll get this cleaned up.

mryellow (Contributor Author) commented Sep 28, 2016

> In the readme can you add a line about how this doesn’t affect validation async/non-async?

Will get to finishing this off soonish.

https://gitter.im/Kaixhin/Atari?at=57cbda0bd52261ec345029ba
