Computing Return-to-go

Let us load the OAR dataframe from the previous guide:

>>> import pandas as pd
>>> df = pd.read_pickle("docs/source/golf_oar.pkl")

Return-to-go (rtg) is the discounted sum of rewards from the current date until the episode's end. It is an important signal, since it is what actions should maximize.
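Concretely, for rewards r_t and discount γ, rtg_t = Σ_{k≥t} γ^(k−t) r_k, which can be computed efficiently with the backward recursion rtg_t = r_t + γ · rtg_{t+1}. A minimal NumPy sketch of this recursion, independent of sara and for illustration only:

```python
import numpy as np

def return_to_go(rewards, discount):
    """Discounted return-to-go via the backward recursion
    rtg[t] = rewards[t] + discount * rtg[t + 1]."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + discount * running
        rtg[t] = running
    return rtg

return_to_go(np.array([1.0, 1.0, 1.0]), 0.5)  # → [1.75, 1.5, 1.0]
```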

However, computing rtg for truncated episodes (as is the case here) is a bit tricky, since episodes never actually reach their end. By default, our rtg computation assumes such episodes are infinite and scales the values so that the rtg at the beginning of an episode and the rtg near the truncation point remain comparable.
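To see why scaling matters: near the truncation point the discounted sum covers only a few steps, so its magnitude shrinks even when rewards do not. One simple way to compensate is to divide the rtg at each step by the total discount weight of the remaining steps, (1 − γ^n)/(1 − γ), turning rtg into a discounted average reward. This is an illustration of the idea only, not necessarily the exact scaling that enrich_rtg applies:

```python
import numpy as np

def scaled_rtg(rewards, discount):
    """Return-to-go divided by the discounted weight of the remaining steps,
    so values near the truncation point stay on the same scale as early ones.
    (Illustrative sketch; not necessarily the scaling used by enrich_rtg.)"""
    n = len(rewards)
    rtg = np.zeros(n)
    running = 0.0
    for t in range(n - 1, -1, -1):
        running = rewards[t] + discount * running
        rtg[t] = running
    remaining = n - np.arange(n)  # number of steps left at each t
    weight = (1 - discount ** remaining) / (1 - discount)
    return rtg / weight

# With a constant reward, the scaled rtg is flat across the episode:
scaled_rtg(np.full(5, -0.2), 0.75)  # → [-0.2, -0.2, -0.2, -0.2, -0.2]
```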

>>> from sara.oar import enrich_rtg
>>> df = enrich_rtg(df, discount=0.75)
>>> df.head()
signal                            obs0      act0      rew1                rtg0
key                           position      move  distance cumulative distance
episodes date
0        2000-01-01 10:00:00  0.176277  0.030472 -0.176277           -0.631686
         2000-01-01 10:01:00  0.206749 -0.103998 -0.206749           -0.605225
         2000-01-01 10:02:00  0.102750  0.075045 -0.102750           -0.523078
         2000-01-01 10:03:00  0.177795  0.094056 -0.177795           -0.566191
         2000-01-01 10:04:00  0.271852 -0.195104 -0.271852           -0.507397

As you can see, enrich_rtg simply adds a new “rtg0, cumulative distance” column to the dataframe.

Because it is not easy to choose a discount directly, we provide a simple helper function that returns the discount corresponding to a given horizon, such that the rtg tail beyond this horizon is relatively negligible.

>>> from sara.oar import discount_from_horizon
>>> discount_from_horizon(10)
0.7411344491069477
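For intuition, the returned value appears consistent with choosing a discount γ such that γ^horizon has decayed to a small tail fraction (here 5%), i.e. γ = 0.05**(1/horizon). This is a hypothetical reconstruction based on the number above, not a claim about the actual implementation of discount_from_horizon:

```python
# Hypothetical reconstruction: pick the discount whose influence beyond
# `horizon` steps has decayed to a small fraction `tail` of its initial weight.
def discount_from_horizon_sketch(horizon, tail=0.05):
    return tail ** (1.0 / horizon)

discount_from_horizon_sketch(10)  # ≈ 0.74113, matching the value above
```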

Congratulations, you now have a dataframe enriched with rtg. Let's save it for the next guide.

>>> df.to_pickle("docs/source/golf_enriched.pkl")