Reinforcement Learning 

On a basic level, Reinforcement Learning involves the iterative interplay between an Agent and an Environment, built from the following elements (a toy sketch of the loop follows the list):

  • Learning Controller - coordinates execution of the learning processes

  • Environment - the system the Agent interacts with and acts upon

  • Agent - the learner whose Actions change the Environment

  • Actions - processes to alter the Environment

  • States - numeric quantifiers of Environment aspects

  • Rewards - numeric feedback from the Environment scoring the results of Actions
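
This interplay can be sketched as a simple loop. Below is a toy illustration; the ToyEnvironment and ToyAgent classes, the reward scheme, and the method names are invented for demonstration (they mirror the method names of the Tensorforce example later) and are not from any specific library:

import random

class ToyEnvironment:
    """Toy Environment: the State is a position on a number line."""
    def reset(self):
        self.position = 0
        return self.position

    def execute(self, action):
        # The Action alters the Environment.
        self.position += 1 if action == 1 else -1
        # The Reward scores the resulting State.
        reward = 1.0 if abs(self.position) < 3 else -1.0
        terminal = abs(self.position) >= 3
        return self.position, terminal, reward

class ToyAgent:
    """Toy Agent with a random Policy (no learning)."""
    def act(self, states):
        return random.choice([0, 1])

    def observe(self, terminal, reward):
        pass  # a learning Agent would update its Policy here

environment = ToyEnvironment()
agent = ToyAgent()
states = environment.reset()
terminal = False
while not terminal:
    actions = agent.act(states=states)
    states, terminal, reward = environment.execute(actions)
    agent.observe(terminal=terminal, reward=reward)
    print(states, reward)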

A More Detailed Architecture

The diagram below shows the Reinforcement Learning architecture at a more detailed level.

Key elements include (a structural skeleton of the training loop follows the list):

  • Learning Controller - coordinates execution of the learning processes

  • Episode - the processing of one set of timesteps

  • Episode Control - iterates through episodes

  • Timestep - the processing of one cycle of Agent-Environment interaction

  • Timestep Control - iterates through timesteps

  • Environment - the system the Agent interacts with and acts upon

  • Agent - the learner whose Actions change the Environment

  • Episode Timesteps - the running count of timesteps within the current episode

  • Timestep Terminate - the flag that ends the current episode's timestep loop

  • Actions - processes to alter the Environment

  • States - numeric quantifiers of Environment aspects

  • Rewards - numeric feedback from the Environment scoring the results of Actions

  • Observer - receives the resulting States and Rewards and feeds them to the Algorithm

  • Policy - defines Agent Actions for specific States

  • Algorithm - methodology for updating Policy

  • Update - changes to Policy
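
These elements map directly onto a training loop. The skeleton below is a structural sketch only; the agent and environment objects are hypothetical placeholders (using the same method names as the Tensorforce example later):

def learning_controller(agent, environment, number_of_episodes):
    # Episode Control: iterate through episodes.
    for episode in range(number_of_episodes):
        states = environment.reset()   # initial States
        episode_timesteps = 0          # Episode Timesteps
        timestep_terminate = False     # Timestep Terminate
        # Timestep Control: iterate through timesteps.
        while not timestep_terminate:
            episode_timesteps += 1
            # The Policy selects Actions for the current States.
            actions = agent.act(states=states)
            # The Actions alter the Environment, yielding new States and a Reward.
            states, timestep_terminate, reward = environment.execute(actions=actions)
            # The Observer passes the Reward and terminal flag to the Algorithm,
            # which applies Updates to the Policy.
            agent.observe(terminal=timestep_terminate, reward=reward)

The complete Tensorforce example below follows exactly this structure.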

In mathematical terms, the interaction follows the standard Markov Decision Process formulation:
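
At each timestep $t$, the Agent observes the State $S_t$, the Policy $\pi$ selects an Action $A_t \sim \pi(a \mid S_t)$, and the Environment returns a Reward $R_{t+1}$ and the next State $S_{t+1}$. The Algorithm applies Updates to $\pi$ to maximize the expected discounted return

    $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, with discount factor $0 \le \gamma \le 1$,

where $\gamma$ weights immediate Rewards more heavily than future ones.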

Python Example

The example uses the OpenAI Gym CartPole environment, whose state consists of 4 variables (a snippet for inspecting their ranges follows the list):

  • Cart Position

  • Cart Velocity

  • Pole Angle

  • Pole Velocity at Tip

Values of these state variables are shown below the code.

The objective is to maximize the absolute distance of the Cart Position from the zero point while preventing the pole from falling.
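
The bounds of the 4 state variables can be inspected directly from the Gym environment. A small sketch, assuming the gym package that Tensorforce wraps is installed; 'CartPole-v1' is used here as the concrete level name:

import gym

# Print the low/high bounds of the 4 CartPole state variables.
env = gym.make('CartPole-v1')
print(env.observation_space.low)
print(env.observation_space.high)
env.close()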



"""
reinforcement_learning_using_tensorforce.py
trains a model using reinforcement learning
"""

# Import needed libraries.
# TensorFlow and Tensorforce must be installed.
from tensorforce import Agent, Environment

# Define parameters.
environment_type = 'gym'
environment_level = 'CartPole'
environment_max_episode_timesteps = 40
agent_name = 'tensorforce'
agent_memory = 10000
agent_update_unit = 'timesteps'
agent_update_batch_size = 64
agent_optimizer_type = 'adam'
agent_optimizer_learning_rate = 3e-4
agent_policy_network = 'auto'
agent_objective = 'policy_gradient'
agent_reward_estimation_horizon = 20
number_of_episodes = 10

# Instantiate the training environment.
environment = Environment.create(
    environment=environment_type,
    level=environment_level,
    max_episode_timesteps=environment_max_episode_timesteps
)

# Instantiate an agent for training.
agent = Agent.create(
    agent=agent_name,
    environment=environment,  # alternatively: states, actions, (max_episode_timesteps)
    memory=agent_memory,
    update=dict(unit=agent_update_unit, batch_size=agent_update_batch_size),
    optimizer=dict(type=agent_optimizer_type, learning_rate=agent_optimizer_learning_rate),
    policy=dict(network=agent_policy_network),
    objective=agent_objective,
    reward_estimation=dict(horizon=agent_reward_estimation_horizon)
)

# Train the agent by running timesteps within episodes.
best_cart_distance = 0
for episode in range(number_of_episodes):

    # Print the episode number.
    print('states within episode: ' + str(episode))

    # Initialize the environment episode.
    states = environment.reset()
    print(states)

    # Process episode timesteps.
    timesteps = 0
    terminate_timesteps = False
    while not terminate_timesteps:

        # Process an individual timestep.
        timesteps += 1
        actions = agent.act(states=states)
        states, terminate_timesteps, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminate_timesteps, reward=reward)
        print(states)

    # Print the episode summary data.
    print('timesteps: ' + str(timesteps))
    cart_distance = abs(states[0])
    best_cart_distance = max(best_cart_distance, cart_distance)
    print('best cart final timesteps distance of all episodes: ' + str(best_cart_distance))
    print(' ')

# Close the environment and agent.
agent.close()
environment.close()

Results are below:

states within episode: 0
[-0.00026061 -0.01103473  0.04581026 -0.00464354]
[-0.0004813  -0.20678271  0.04571739  0.3021339 ]
[-0.00461695 -0.40252542  0.05176006  0.6088774 ]
[-0.01266746 -0.59833129  0.06393761  0.91740354]
[-0.02463409 -0.79425672  0.08228568  1.22947602]
[-0.04051922 -0.99033531  0.1068752   1.54676344]
[-0.06032593 -1.18656599  0.13781047  1.87079154]
[-0.08405725 -1.38289924  0.1752263   2.20288746]
[-0.11171523 -1.57922116  0.21928405  2.54411426]
timesteps: 8
best cart final timesteps distance of all episodes: 0.11171523316788975

states within episode: 1
[-0.0228279   0.02221037  0.0176807  -0.03375597]
[-0.0223837  -0.17316061  0.01700558  0.26445254]
[-0.02584691 -0.3685211   0.02229463  0.56245031]
[-0.03321733 -0.5639487   0.03354363  0.86207293]
[-0.04449631 -0.75951094  0.05078509  1.16511126]
[-0.05968652 -0.95525584  0.07408732  1.47327445]
[-0.07879164 -1.15120113  0.10355281  1.78814786]
[-0.10181566 -1.3473218   0.13931576  2.11114314]
[-0.1287621  -1.54353531  0.18153863  2.44343828]
[-0.15963281 -1.73968452  0.23040739  2.78590681]
timesteps: 9
best cart final timesteps distance of all episodes: 0.1596328054527629

states within episode: 2
[ 0.04953899 -0.03442587  0.02211645 -0.00488439]
[ 0.04885047  0.16037203  0.02201876 -0.29050808]
[ 0.05205792  0.3551732   0.0162086  -0.57616602]
[ 0.05916138  0.55006424  0.00468528 -0.86369906]
[ 0.07016266  0.7451221  -0.0125887  -1.15490516]
[ 0.08506511  0.94040593 -0.0356868  -1.45150867]
[ 0.10387323  1.13594769 -0.06471698 -1.75512426]
[ 0.12659218  1.33174091 -0.09981946 -2.06721278]
[ 0.153227    1.52772705 -0.14116372 -2.38902682]
[ 0.18378154  1.72377931 -0.18894425 -2.72154443]
[ 0.21825712  1.91968406 -0.24337514 -3.06539148]
timesteps: 10
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 3
[-0.03586108 -0.01346205 -0.02185208  0.01710818]
[-0.03613032 -0.20826391 -0.02150992  0.30281721]
[-0.0402956  -0.4030728  -0.01545357  0.58863952]
[-0.04835705 -0.59797498 -0.00368078  0.87641471]
[-0.06031655 -0.79304671  0.01384751  1.16793817]
[-0.07617749 -0.98834606  0.03720628  1.46493015]
[-0.09594441 -1.18390343  0.06650488  1.76899932]
[-0.11962248 -1.37971019  0.10188487  2.08159819]
[-0.14721668 -1.57570491  0.14351683  2.40396805]
[-0.17873078 -1.77175712  0.19159619  2.73707224]
[-0.21416592 -1.96764853  0.24633763  3.08151786]
timesteps: 10
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 4
[ 0.00462236  0.04730914  0.02341448 -0.02013573]
[ 0.00556854 -0.14814064  0.02301176  0.2798418 ]
[ 0.00260573 -0.34358317  0.0286086   0.57969284]
[-0.00426594 -0.5390941   0.04020245  0.88124901]
[-0.01504782 -0.73473842  0.05782743  1.1862947 ]
[-0.02974259 -0.93056064  0.08155333  1.49652884]
[-0.0483538  -1.12657383  0.1114839   1.81352152]
[-0.07088528 -1.32274677  0.14775433  2.13866262]
[-0.09734021 -1.5189889   0.19052759  2.47310036]
[-0.12771999 -1.71513291  0.23998959  2.81766921]
timesteps: 9
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 5
[-0.01108862  0.01700961 -0.0420929  -0.03565462]
[-0.01074843  0.21270912 -0.04280599 -0.34131552]
[-0.00649424  0.40841313 -0.0496323  -0.64718376]
[ 0.00167402  0.60419018 -0.06257598 -0.95507361]
[ 0.01375782  0.80009551 -0.08167745 -1.26674179]
[ 0.02975973  0.99616042 -0.10701229 -1.58384518]
[ 0.04968294  1.1923802  -0.13868919 -1.90789279]
[ 0.07353055  1.38870027 -0.17684705 -2.24018934]
[ 0.10130455  1.58500002 -0.22165083 -2.58176896]
timesteps: 8
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 6
[0.00122252 0.00248242 0.02053107 0.0383462 ]
[ 0.00127217 -0.19292784  0.021298    0.33743553]
[-0.00258638 -0.38834629  0.02804671  0.63675786]
[-0.01035331 -0.58384792  0.04078186  0.93813962]
[-0.02203027 -0.77949529  0.05954466  1.24335321]
[-0.03762017 -0.97532868  0.08441172  1.55407849]
[-0.05712675 -1.17135486  0.11549329  1.8718584 ]
[-0.08055385 -1.36753392  0.15293046  2.19804618]
[-0.10790452 -1.56376384  0.19689138  2.53374224]
[-0.1391798  -1.75986277  0.24756623  2.87972029]
timesteps: 9
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 7
[ 0.00587895 -0.04401325 -0.03611816 -0.00374312]
[ 0.00499869  0.15160758 -0.03619302 -0.30759942]
[ 0.00803084  0.34722605 -0.04234501 -0.61147339]
[ 0.01497536  0.54291347 -0.05457447 -0.9171871 ]
[ 0.02583363  0.73872914 -0.07291822 -1.22651024]
[ 0.04060821  0.93471011 -0.09744842 -1.54111947]
[ 0.05930242  1.13085948 -0.12827081 -1.86255214]
[ 0.0819196   1.32713296 -0.16552185 -2.19215195]
[ 0.10846226  1.52342322 -0.20936489 -2.53100466]
[ 0.13893073  1.71954197 -0.25998499 -2.87986343]
timesteps: 9
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 8
[-0.04839189 -0.01577572  0.04301474  0.02458942]
[-0.0487074  -0.2114873   0.04350652  0.33052768]
[-0.05293715 -0.40720068  0.05011708  0.63660685]
[-0.06108116 -0.60298442  0.06284921  0.94464197]
[-0.07314085 -0.79889413  0.08174205  1.25639184]
[-0.08911873 -0.99496177  0.10686989  1.57351671]
[-0.10901797 -1.19118373  0.13834022  1.89753041]
[-0.13284164 -1.38750688  0.17629083  2.22974412]
[-0.16059178 -1.58381257  0.22088571  2.57120028]
timesteps: 8
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 9
[-0.02113959  0.01149438 -0.02149935  0.03416971]
[-0.0209097  -0.18331277 -0.02081595  0.31999259]
[-0.02457595 -0.37813218 -0.0144161   0.60603895]
[-0.0321386  -0.57304962 -0.00229532  0.89414653]
[-0.04359959 -0.76814037  0.01558761  1.18610705]
[-0.0589624  -0.96346096  0.03930975  1.48363493]
[-0.07823162 -1.1590396   0.06898245  1.78833033]
[-0.10141241 -1.35486458  0.10474905  2.10163396]
[-0.1285097  -1.55087048  0.14678173  2.42477123]
[-0.15952711 -1.7469216   0.19527716  2.75868472]
[-0.19446554 -1.94279316  0.25045085  3.10395523]
timesteps: 10
best cart final timesteps distance of all episodes: 0.21825712417700763
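
Once training finishes, the agent could also be evaluated without exploration or further Policy Updates. The loop below is a minimal sketch adapted from the Tensorforce quickstart; it would need to run before the agent.close() and environment.close() calls above:

# Evaluate the trained agent deterministically for 10 episodes.
sum_rewards = 0.0
for _ in range(10):
    states = environment.reset()
    internals = agent.initial_internals()
    terminal = False
    while not terminal:
        actions, internals = agent.act(
            states=states, internals=internals,
            independent=True, deterministic=True
        )
        states, terminal, reward = environment.execute(actions=actions)
        sum_rewards += reward
print('mean episode reward: ' + str(sum_rewards / 10))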

References