Reinforcement Learning 

On a basic level, Reinforcement Learning involves the iterative interplay between an Agent and an Environment, built from the following elements (a toy sketch of the loop follows the list):

  • Learning Controller - coordinates execution of the learning processes

  • Environment - the system the Agent interacts with and acts upon

  • Agent - the learner whose Actions change the Environment

  • Actions - processes to alter the Environment

  • States - numeric quantifiers of Environment aspects

  • Rewards - numeric feedback from the Environment scoring the results of Actions
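
This interplay can be sketched as a simple loop. Below is a toy illustration; the ToyEnvironment and ToyAgent classes, the reward scheme, and the method names are invented for demonstration (they mirror the method names of the Tensorforce example later) and are not from any specific library:

import random

class ToyEnvironment:
    """Toy Environment: the State is a position on a number line."""
    def reset(self):
        self.position = 0
        return self.position

    def execute(self, action):
        # The Action alters the Environment.
        self.position += 1 if action == 1 else -1
        # The Reward scores the resulting State.
        reward = 1.0 if abs(self.position) < 3 else -1.0
        terminal = abs(self.position) >= 3
        return self.position, terminal, reward

class ToyAgent:
    """Toy Agent with a random Policy (no learning)."""
    def act(self, states):
        return random.choice([0, 1])

    def observe(self, terminal, reward):
        pass  # a learning Agent would update its Policy here

environment = ToyEnvironment()
agent = ToyAgent()
states = environment.reset()
terminal = False
while not terminal:
    actions = agent.act(states=states)
    states, terminal, reward = environment.execute(actions)
    agent.observe(terminal=terminal, reward=reward)
    print(states, reward)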

A More Detailed Architecture

The diagram below shows the Reinforcement Learning architecture at a more detailed level.

Key elements include (a structural skeleton of the training loop follows the list):

  • Learning Controller - coordinates execution of the learning processes

  • Episode - the processing of one set of timesteps

  • Episode Control - iterates through episodes

  • Timestep - the processing of one cycle of Agent-Environment interaction

  • Timestep Control - iterates through timesteps

  • Environment - the system the Agent interacts with and acts upon

  • Agent - the learner whose Actions change the Environment

  • Episode Timesteps - the running count of timesteps within the current episode

  • Timestep Terminate - the flag that ends the current episode's timestep loop

  • Actions - processes to alter the Environment

  • States - numeric quantifiers of Environment aspects

  • Rewards - numeric feedback from the Environment scoring the results of Actions

  • Observer - receives the resulting States and Rewards and feeds them to the Algorithm

  • Policy - defines Agent Actions for specific States

  • Algorithm - methodology for updating Policy

  • Update - changes to Policy
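
These elements map directly onto a training loop. The skeleton below is a structural sketch only; the agent and environment objects are hypothetical placeholders (using the same method names as the Tensorforce example later):

def learning_controller(agent, environment, number_of_episodes):
    # Episode Control: iterate through episodes.
    for episode in range(number_of_episodes):
        states = environment.reset()   # initial States
        episode_timesteps = 0          # Episode Timesteps
        timestep_terminate = False     # Timestep Terminate
        # Timestep Control: iterate through timesteps.
        while not timestep_terminate:
            episode_timesteps += 1
            # The Policy selects Actions for the current States.
            actions = agent.act(states=states)
            # The Actions alter the Environment, yielding new States and a Reward.
            states, timestep_terminate, reward = environment.execute(actions=actions)
            # The Observer passes the Reward and terminal flag to the Algorithm,
            # which applies Updates to the Policy.
            agent.observe(terminal=timestep_terminate, reward=reward)

The complete Tensorforce example below follows exactly this structure.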

In mathematical terms, the interaction follows the standard Markov Decision Process formulation:
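
At each timestep $t$, the Agent observes the State $S_t$, the Policy $\pi$ selects an Action $A_t \sim \pi(a \mid S_t)$, and the Environment returns a Reward $R_{t+1}$ and the next State $S_{t+1}$. The Algorithm applies Updates to $\pi$ to maximize the expected discounted return

    $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, with discount factor $0 \le \gamma \le 1$,

where $\gamma$ weights immediate Rewards more heavily than future ones.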

Python Example

The example uses the OpenAI Gym CartPole environment, whose state consists of 4 variables (a snippet for inspecting their ranges follows the list):

  • Cart Position

  • Cart Velocity

  • Pole Angle

  • Pole Velocity at Tip

Values of these state variables are shown below the code.

The objective is to maximize the absolute distance of the Cart Position from the zero point while preventing the pole from falling.
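
The bounds of the 4 state variables can be inspected directly from the Gym environment. A small sketch, assuming the gym package that Tensorforce wraps is installed; 'CartPole-v1' is used here as the concrete level name:

import gym

# Print the low/high bounds of the 4 CartPole state variables.
env = gym.make('CartPole-v1')
print(env.observation_space.low)
print(env.observation_space.high)
env.close()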



"""
reinforcement_learning_using_tensorforce.py
trains a model using reinforcement learning
"""

# Import needed libraries.
# TensorFlow and Tensorforce must be installed.
from tensorforce import Agent, Environment

# Define parameters.
environment_type = 'gym'
environment_level = 'CartPole'
environment_max_episode_timesteps = 40
agent_name = 'tensorforce'
agent_memory = 10000
agent_update_unit = 'timesteps'
agent_update_batch_size = 64
agent_optimizer_type = 'adam'
agent_optimizer_learning_rate = 3e-4
agent_policy_network = 'auto'
agent_objective = 'policy_gradient'
agent_reward_estimation_horizon = 20
number_of_episodes = 10

# Instantiate the training environment.
environment = Environment.create(
    environment=environment_type,
    level=environment_level,
    max_episode_timesteps=environment_max_episode_timesteps
)

# Instantiate an agent for training.
agent = Agent.create(
    agent=agent_name,
    environment=environment,  # alternatively: states, actions, (max_episode_timesteps)
    memory=agent_memory,
    update=dict(unit=agent_update_unit, batch_size=agent_update_batch_size),
    optimizer=dict(type=agent_optimizer_type, learning_rate=agent_optimizer_learning_rate),
    policy=dict(network=agent_policy_network),
    objective=agent_objective,
    reward_estimation=dict(horizon=agent_reward_estimation_horizon)
)

# Train the agent by running timesteps within episodes.
best_cart_distance = 0
for episode in range(number_of_episodes):

    # Print the episode number.
    print('states within episode: ' + str(episode))

    # Initialize the environment episode.
    states = environment.reset()
    print(states)

    # Process episode timesteps.
    timesteps = 0
    terminate_timesteps = False
    while not terminate_timesteps:

        # Process an individual timestep.
        timesteps += 1
        actions = agent.act(states=states)
        states, terminate_timesteps, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminate_timesteps, reward=reward)
        print(states)

    # Print the episode summary data.
    print('timesteps: ' + str(timesteps))
    cart_distance = abs(states[0])
    best_cart_distance = max(best_cart_distance, cart_distance)
    print('best cart final timesteps distance of all episodes: ' + str(best_cart_distance))
    print(' ')

# Close the environment and agent.
agent.close()
environment.close()

Results are below:

states within episode: 0
[-0.00026061 -0.01103473  0.04581026 -0.00464354]
[-0.0004813  -0.20678271  0.04571739  0.3021339 ]
[-0.00461695 -0.40252542  0.05176006  0.6088774 ]
[-0.01266746 -0.59833129  0.06393761  0.91740354]
[-0.02463409 -0.79425672  0.08228568  1.22947602]
[-0.04051922 -0.99033531  0.1068752   1.54676344]
[-0.06032593 -1.18656599  0.13781047  1.87079154]
[-0.08405725 -1.38289924  0.1752263   2.20288746]
[-0.11171523 -1.57922116  0.21928405  2.54411426]
timesteps: 8
best cart final timesteps distance of all episodes: 0.11171523316788975

states within episode: 1
[-0.0228279   0.02221037  0.0176807  -0.03375597]
[-0.0223837  -0.17316061  0.01700558  0.26445254]
[-0.02584691 -0.3685211   0.02229463  0.56245031]
[-0.03321733 -0.5639487   0.03354363  0.86207293]
[-0.04449631 -0.75951094  0.05078509  1.16511126]
[-0.05968652 -0.95525584  0.07408732  1.47327445]
[-0.07879164 -1.15120113  0.10355281  1.78814786]
[-0.10181566 -1.3473218   0.13931576  2.11114314]
[-0.1287621  -1.54353531  0.18153863  2.44343828]
[-0.15963281 -1.73968452  0.23040739  2.78590681]
timesteps: 9
best cart final timesteps distance of all episodes: 0.1596328054527629

states within episode: 2
[ 0.04953899 -0.03442587  0.02211645 -0.00488439]
[ 0.04885047  0.16037203  0.02201876 -0.29050808]
[ 0.05205792  0.3551732   0.0162086  -0.57616602]
[ 0.05916138  0.55006424  0.00468528 -0.86369906]
[ 0.07016266  0.7451221  -0.0125887  -1.15490516]
[ 0.08506511  0.94040593 -0.0356868  -1.45150867]
[ 0.10387323  1.13594769 -0.06471698 -1.75512426]
[ 0.12659218  1.33174091 -0.09981946 -2.06721278]
[ 0.153227    1.52772705 -0.14116372 -2.38902682]
[ 0.18378154  1.72377931 -0.18894425 -2.72154443]
[ 0.21825712  1.91968406 -0.24337514 -3.06539148]
timesteps: 10
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 3
[-0.03586108 -0.01346205 -0.02185208  0.01710818]
[-0.03613032 -0.20826391 -0.02150992  0.30281721]
[-0.0402956  -0.4030728  -0.01545357  0.58863952]
[-0.04835705 -0.59797498 -0.00368078  0.87641471]
[-0.06031655 -0.79304671  0.01384751  1.16793817]
[-0.07617749 -0.98834606  0.03720628  1.46493015]
[-0.09594441 -1.18390343  0.06650488  1.76899932]
[-0.11962248 -1.37971019  0.10188487  2.08159819]
[-0.14721668 -1.57570491  0.14351683  2.40396805]
[-0.17873078 -1.77175712  0.19159619  2.73707224]
[-0.21416592 -1.96764853  0.24633763  3.08151786]
timesteps: 10
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 4
[ 0.00462236  0.04730914  0.02341448 -0.02013573]
[ 0.00556854 -0.14814064  0.02301176  0.2798418 ]
[ 0.00260573 -0.34358317  0.0286086   0.57969284]
[-0.00426594 -0.5390941   0.04020245  0.88124901]
[-0.01504782 -0.73473842  0.05782743  1.1862947 ]
[-0.02974259 -0.93056064  0.08155333  1.49652884]
[-0.0483538  -1.12657383  0.1114839   1.81352152]
[-0.07088528 -1.32274677  0.14775433  2.13866262]
[-0.09734021 -1.5189889   0.19052759  2.47310036]
[-0.12771999 -1.71513291  0.23998959  2.81766921]
timesteps: 9
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 5
[-0.01108862  0.01700961 -0.0420929  -0.03565462]
[-0.01074843  0.21270912 -0.04280599 -0.34131552]
[-0.00649424  0.40841313 -0.0496323  -0.64718376]
[ 0.00167402  0.60419018 -0.06257598 -0.95507361]
[ 0.01375782  0.80009551 -0.08167745 -1.26674179]
[ 0.02975973  0.99616042 -0.10701229 -1.58384518]
[ 0.04968294  1.1923802  -0.13868919 -1.90789279]
[ 0.07353055  1.38870027 -0.17684705 -2.24018934]
[ 0.10130455  1.58500002 -0.22165083 -2.58176896]
timesteps: 8
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 6
[0.00122252 0.00248242 0.02053107 0.0383462 ]
[ 0.00127217 -0.19292784  0.021298    0.33743553]
[-0.00258638 -0.38834629  0.02804671  0.63675786]
[-0.01035331 -0.58384792  0.04078186  0.93813962]
[-0.02203027 -0.77949529  0.05954466  1.24335321]
[-0.03762017 -0.97532868  0.08441172  1.55407849]
[-0.05712675 -1.17135486  0.11549329  1.8718584 ]
[-0.08055385 -1.36753392  0.15293046  2.19804618]
[-0.10790452 -1.56376384  0.19689138  2.53374224]
[-0.1391798  -1.75986277  0.24756623  2.87972029]
timesteps: 9
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 7
[ 0.00587895 -0.04401325 -0.03611816 -0.00374312]
[ 0.00499869  0.15160758 -0.03619302 -0.30759942]
[ 0.00803084  0.34722605 -0.04234501 -0.61147339]
[ 0.01497536  0.54291347 -0.05457447 -0.9171871 ]
[ 0.02583363  0.73872914 -0.07291822 -1.22651024]
[ 0.04060821  0.93471011 -0.09744842 -1.54111947]
[ 0.05930242  1.13085948 -0.12827081 -1.86255214]
[ 0.0819196   1.32713296 -0.16552185 -2.19215195]
[ 0.10846226  1.52342322 -0.20936489 -2.53100466]
[ 0.13893073  1.71954197 -0.25998499 -2.87986343]
timesteps: 9
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 8
[-0.04839189 -0.01577572  0.04301474  0.02458942]
[-0.0487074  -0.2114873   0.04350652  0.33052768]
[-0.05293715 -0.40720068  0.05011708  0.63660685]
[-0.06108116 -0.60298442  0.06284921  0.94464197]
[-0.07314085 -0.79889413  0.08174205  1.25639184]
[-0.08911873 -0.99496177  0.10686989  1.57351671]
[-0.10901797 -1.19118373  0.13834022  1.89753041]
[-0.13284164 -1.38750688  0.17629083  2.22974412]
[-0.16059178 -1.58381257  0.22088571  2.57120028]
timesteps: 8
best cart final timesteps distance of all episodes: 0.21825712417700763

states within episode: 9
[-0.02113959  0.01149438 -0.02149935  0.03416971]
[-0.0209097  -0.18331277 -0.02081595  0.31999259]
[-0.02457595 -0.37813218 -0.0144161   0.60603895]
[-0.0321386  -0.57304962 -0.00229532  0.89414653]
[-0.04359959 -0.76814037  0.01558761  1.18610705]
[-0.0589624  -0.96346096  0.03930975  1.48363493]
[-0.07823162 -1.1590396   0.06898245  1.78833033]
[-0.10141241 -1.35486458  0.10474905  2.10163396]
[-0.1285097  -1.55087048  0.14678173  2.42477123]
[-0.15952711 -1.7469216   0.19527716  2.75868472]
[-0.19446554 -1.94279316  0.25045085  3.10395523]
timesteps: 10
best cart final timesteps distance of all episodes: 0.21825712417700763
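
Once training finishes, the agent could also be evaluated without exploration or further Policy Updates. The loop below is a minimal sketch adapted from the Tensorforce quickstart; it would need to run before the agent.close() and environment.close() calls above:

# Evaluate the trained agent deterministically for 10 episodes.
sum_rewards = 0.0
for _ in range(10):
    states = environment.reset()
    internals = agent.initial_internals()
    terminal = False
    while not terminal:
        actions, internals = agent.act(
            states=states, internals=internals,
            independent=True, deterministic=True
        )
        states, terminal, reward = environment.execute(actions=actions)
        sum_rewards += reward
print('mean episode reward: ' + str(sum_rewards / 10))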

References