Computing Return-to-go
Let us load the OAR dataframe from the previous guide:
>>> import pandas as pd
>>> df = pd.read_pickle("docs/source/golf_oar.pkl")
Return-to-go (rtg) is the discounted sum of rewards from the current date until the episode’s end. It is an important signal since it is what actions should maximize.
However, computing rtg for truncated episodes (as is the case here) is a bit tricky since such episodes never actually reach their end. By default, our rtg computation assumes truncated episodes are infinite and scales the result so that the rtg at the beginning of the episode and the rtg near the truncation point are comparable.
>>> from sara.oar import enrich_rtg
>>> df = enrich_rtg(df, discount=0.75)
>>> df.head()
signal obs0 act0 rew1 rtg0
key position move distance cumulative distance
episodes date
0 2000-01-01 10:00:00 0.176277 0.030472 -0.176277 -0.631686
2000-01-01 10:01:00 0.206749 -0.103998 -0.206749 -0.605225
2000-01-01 10:02:00 0.102750 0.075045 -0.102750 -0.523078
2000-01-01 10:03:00 0.177795 0.094056 -0.177795 -0.566191
2000-01-01 10:04:00 0.271852 -0.195104 -0.271852 -0.507397
As you can see, enrich_rtg simply adds a new “rtg0, cumulative distance” column to the dataframe.
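To make the truncation handling concrete, here is a minimal sketch of one plausible scheme, not necessarily the one enrich_rtg implements: compute the plain discounted tail sums by reverse accumulation, then rescale each entry by the finite geometric mass remaining, so a constant reward stream yields the same rtg at every step, as it would in an infinite episode. The function name and the exact rescaling are assumptions for illustration.

```python
import numpy as np

def rtg_truncated(rewards, discount):
    """Discounted return-to-go for a truncated episode (illustrative sketch).

    Plain rtg[t] = sum_{k>=t} discount**(k-t) * rewards[k] over the
    observed steps, then divided by (1 - discount**n), where n is the
    number of remaining steps, so early and late entries are comparable.
    """
    rewards = np.asarray(rewards, dtype=float)
    rtg = np.zeros_like(rewards)
    acc = 0.0
    # Reverse accumulation: rtg[t] = rewards[t] + discount * rtg[t + 1]
    for t in range(len(rewards) - 1, -1, -1):
        acc = rewards[t] + discount * acc
        rtg[t] = acc
    # Remaining steps (including the current one) at each date
    n = np.arange(len(rewards), 0, -1)
    # Rescale so a constant reward r gives r / (1 - discount) everywhere
    return rtg / (1.0 - discount ** n)
```

For a constant reward of -1 and a discount of 0.75, every entry comes out to -1 / (1 - 0.75) = -4, regardless of how close it is to the truncation point.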
Because it is not easy to select a discount by hand, we provide a simple helper function that returns a discount for a given horizon, chosen so that the rtg tail beyond that horizon is relatively negligible.
>>> from sara.oar import discount_from_horizon
>>> discount_from_horizon(10)
0.7411344491069477
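The printed value is consistent with choosing the discount so that contributions beyond the horizon are weighted at 5% or less of the immediate reward, i.e. discount ** horizon == 0.05. Whether sara uses exactly this tail fraction is an assumption; the sketch below just illustrates the relationship.

```python
def discount_from_horizon(horizon, tail=0.05):
    # Illustrative sketch (the `tail` fraction is an assumption):
    # pick the discount gamma such that gamma ** horizon == tail,
    # so rewards more than `horizon` steps away contribute at most
    # a `tail` fraction of an immediate reward to the rtg.
    return tail ** (1.0 / horizon)
```

With the default 5% tail, `discount_from_horizon(10)` reproduces the 0.7411… value shown above.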
Congrats, you now have a dataframe enriched with rtg. Let’s save it for the next guide.
>>> df.to_pickle("docs/source/golf_enriched.pkl")