Reinforcement Learning Meets Transformers
Imagine a world where artificial intelligence not only understands your commands but also adapts and evolves with each interaction: this is the frontier of reinforcement learning for transformers. In this cutting-edge realm, we are pushing the boundaries of machine learning, transforming static models into dynamic, self-improving systems that learn from their environment in real time. Dive into how this revolutionary approach is set to redefine the possibilities of AI.
Table of Contents
- Introduction
- System requirements
- Reinforcement learning
- Important techniques in PyTorch for RL
- Stable Baselines3
- Transformer for reinforcement learning
Introduction
Reinforcement Learning is a subfield of machine learning that
focuses on how an agent can learn to behave in an environment by taking
actions that maximize some notion of cumulative reward. It is fundamentally
about learning to make decisions based on the consequences of previous
actions. Traditionally, reinforcement learning has been intertwined with
various types of algorithms and neural network architectures like
Convolutional Neural Networks and Recurrent Neural
Networks. These approaches have had considerable success in
fields like robotics, game theory and sequential decision making tasks. Recently, transformer architectures have been adapted to reinforcement
learning tasks. One such model is the decision transformer, which frames
reinforcement learning as a conditional sequence modeling problem, thus shifting the
focus from traditional action-value-based methods to directly predicting the
actions that achieve a desired return. Another emerging model is the trajectory
transformer, which leverages the ability of transformers to understand
sequence data, hence enhancing the efficiency of reinforcement learning
with its power to predict the entire sequence of future states, actions and
rewards.
System Requirements
Setting Up Environment
- Install Anaconda on the local machine.
- Create a virtual environment.
- Install necessary packages in the virtual environment.
- Configure and Start Jupyter Notebook.
- Connect Google Colab with your local runtime environment.
Installing Anaconda On Local System
- Go to the Anaconda download page: https://www.anaconda.com/products/distribution
- Download the appropriate version for your computer.
- Follow the instructions provided by the installer.
- If the installer prompts you to add Anaconda to the system’s PATH variable, please do it. This enables you to seamlessly use Anaconda’s features from the command line.
- Check whether the installation was successful by typing the following command in the terminal.
conda --version
Creating a Virtual Environment
To create a virtual environment in Anaconda via the terminal, follow these steps.
- Open the terminal on your local machine.
- Type the following command and press Enter to create a new virtual environment. In the command below, the virtual environment name is torch_learn and the Python version is 3.11.
conda create --name torch_learn python=3.11
- Once the environment has been created, activate it by typing the following command.
conda activate torch_learn
- Install the necessary packages in your environment. The following are the requirements for this section; install the packages each section requires as you work through it.
pip3 install transformers
pip3 install datasets
pip3 install git+https://github.com/huggingface/diffusers
pip3 install accelerate
pip3 install ftfy
pip3 install tensorboard
pip3 install Jinja2
pip install gym
pip install pandas
pip install yfinance
pip install stable-baselines3
pip install shimmy
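After installation, a quick sanity check is to import the key packages from the list above and print their versions; if any import fails, revisit the corresponding install command. This snippet is only a minimal check, not part of the project code.
import gym
import pandas
import stable_baselines3
import transformers
import yfinance

print("transformers:", transformers.__version__)
print("gym:", gym.__version__)
print("stable-baselines3:", stable_baselines3.__version__)
print("pandas:", pandas.__version__)
print("yfinance:", yfinance.__version__)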
Reinforcement Learning
RL is a type of machine learning where an agent learns how to behave in an
environment by performing certain actions and observing the results or
feedback from those actions. Let us illustrate through the example of stock
portfolio management.
Imagine you are managing a stock portfolio. In this situation, you, as the portfolio manager (the agent), interact with the complex world of the stock market (the environment) by making choices like buying, selling, or holding onto stocks. This environment is filled with different types of information: technical data, fundamental data, recent news and overall market trends. Based on the state of the environment, if a choice (action) leads to a good result, like making money from a stock sale or earning a dividend, it is considered a good choice and should be repeated in similar situations later. However, if a choice results in a bad outcome, like a big loss in a stock’s value or a missed chance to make a profit, it is seen as a bad choice and should be avoided in the future. Reinforcement learning is the tool that helps learn the best strategy (policy) for making decisions, depending on the state of the environment, to earn the most rewards.
The reinforcement in reinforcement learning is the feedback, or the rewards
and punishments, from the environment. Positive rewards reinforce the
actions that led to them, encouraging the agent to repeat those actions in the
future. Negative rewards or punishments discourage the actions that led to
them. Over time, through a lot of trial and error, the agent learns the best
strategy or policy to perform well in the environment.
In more technical terms, reinforcement learning involves several key
components.
- Agent — The learner or decision maker.
- Environment — The context or world where the agent operates.
- Actions — The set of all possible moves the agent can make.
- States — The situation the agent finds itself in. It is a consequence of the previous actions.
- Reward — The feedback that the agent gets for each action. The agent’s objective is to learn a policy that maximizes the cumulative reward over time.
So, in reinforcement learning, the agent learns a policy, which is a mapping
from states to actions that maximize the expected sum of rewards.
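To make these components concrete, the following sketch wires them together in plain Python. The one-dimensional toy environment, its reward function and the epsilon-greedy policy are illustrative assumptions rather than part of any library; the point is only to show an agent improving its policy from reward feedback.
import random

# A made-up toy environment: the state is a single integer position,
# and the agent chooses between two actions at each step.
ACTIONS = ["left", "right"]

def step(state, action):
    # Environment dynamics: "right" moves toward the goal, "left" moves away.
    new_state = state + 1 if action == "right" else state - 1
    reward = 1.0 if new_state > state else -1.0  # positive reward for progress
    return new_state, reward

# A simple tabular policy: an estimated value for each action in each state.
values = {}

def choose_action(state, epsilon=0.2):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if state not in values or random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(values[state], key=values[state].get)

for episode in range(50):
    state, total_reward = 0, 0.0
    for t in range(10):
        action = choose_action(state)
        new_state, reward = step(state, action)
        # Reinforcement: nudge the estimated value of the chosen action toward the reward.
        values.setdefault(state, {a: 0.0 for a in ACTIONS})
        values[state][action] += 0.1 * (reward - values[state][action])
        state, total_reward = new_state, total_reward + reward
    print(f"episode {episode}: cumulative reward = {total_reward}")
Over the episodes, the cumulative reward climbs as the agent learns to prefer the rewarded action, which is exactly the trial-and-error loop described above.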
Important Techniques in PyTorch for RL
Some important techniques in PyTorch for reinforcement learning are discussed in the following sections.
Stable Baselines3
Stable Baselines3 is an open-source library that provides high-quality
implementations of state-of-the-art RL algorithms in PyTorch. It is the
successor of Stable Baselines, which was built with TensorFlow.
The goal of Stable Baselines3 is to collect reliable implementations of RL
algorithms in one place with unified structure and standardized code. The
algorithms are made accessible via a common interface, making it easier to
both use and understand them.
The library includes implementations of many popular reinforcement
learning algorithms, such as Proximal Policy Optimization, Soft
Actor-Critic, Advantage Actor-Critic and Twin Delayed
DDPG (TD3).
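As a brief illustration of this common interface, the sketch below trains a PPO agent on the classic CartPole task using the packages installed earlier (recent Stable Baselines3 releases rely on the shimmy package to wrap old-style Gym environments). The environment id and the small number of timesteps are illustrative choices, not recommendations.
import gym
from stable_baselines3 import PPO

# Create a standard environment and a PPO agent with a multilayer-perceptron policy.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)

# Train for a small number of timesteps (illustrative only).
model.learn(total_timesteps=10_000)

# Run the learned policy through the vectorized environment wrapper SB3 maintains.
vec_env = model.get_env()
obs = vec_env.reset()
for _ in range(100):
    action, _ = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = vec_env.step(action)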
Gymnasium
Gymnasium is a maintained fork of OpenAI’s original Gym library. It is a
freely available Python library that allows for
the development and comparison of reinforcement learning algorithms. It
establishes a standard API for communication between learning
algorithms and environments, and offers a set of environments that comply
with this API.
It provides a wide variety of pre-defined environments for training and
testing reinforcement learning agents, including simulations of robotics,
classic control tasks, computer games and more. Here is a breakdown of the main components.
- Environments — OpenAI Gym provides a large set of environments that simulate a variety of problems an RL agent needs to solve. These environments adhere to a unified API, making it easier to develop generic algorithms that can be applied across a range of scenarios. The environments range from simple tasks like balancing a pole (CartPole) or controlling a mountain car, to playing Atari video games, navigating 2D and 3D mazes and even playing board games like Go and chess.
- Spaces — Every environment comes with an action_space and an observation_space. These spaces define the form of the agent’s actions and observations. For example, in the CartPole environment, the observation space represents the position and velocity of the cart and pole, while the action space represents the possible forces applied to the cart.
- Steps — In each environment, an agent takes a step by calling the step() function, which advances the environment by one step. In the Gymnasium API, this function returns five values: the new observation, the reward, a terminated flag, a truncated flag and an info dictionary that can be used for debugging. (The older Gym API returned four values, with a single done flag.)
- Tasks — Each environment encapsulates a task, or a goal that an agent needs to achieve. For instance, in the CartPole environment, the task is to balance a pole on a cart for as long as possible.
- Benchmarking — OpenAI Gym also provides tools for benchmarking, which allow you to compare the performance of different algorithms on the same tasks.
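The snippet below shows these pieces on the CartPole environment: the observation and action spaces, reset(), and the five values returned by step(). It assumes the gymnasium package is available (recent Stable Baselines3 releases install it as a dependency; otherwise pip install gymnasium); with the older gym package, step() returns four values instead.
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # Box of 4 values: cart position/velocity and pole angle/velocity
print(env.action_space)       # Discrete(2): push the cart left or right

observation, info = env.reset(seed=42)
episode_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # a random agent, purely for illustration
    observation, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
    if terminated or truncated:         # the pole fell or the time limit was reached
        observation, info = env.reset()
env.close()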
Project — Stock Market Trading With RL
Here, we will illustrate how to use Gym and Stable Baselines3 to conduct
reinforcement learning.
- Objective — Development of a day trading model utilizing reinforcement learning.
- Tools — Gym, Stable-Baselines3 and yfinance.
- Methodology — The environment incorporates Apple’s stock price over the past 6 days. Our task is to develop a policy to:
- Decide when to buy, hold or sell the stock.
- Determine the quantity to buy or sell.
- Reward — The reward is the subsequent value of the portfolio after the action has been taken.
- Solution — The attached notebook provides a complete solution, covering model development and inference; a minimal environment sketch also follows this list.
- Exercise — Enhance the model to factor in multiple stocks. Establish a policy to:
- Determine which stocks to buy, sell or hold.
- Decide on the exact amount of each stock to buy or sell.
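The full implementation lives in the notebook; as an orientation, here is a minimal sketch of what a custom single-stock trading environment might look like. The class name, observation layout, action encoding and reward are illustrative assumptions and do not reproduce the notebook’s exact design.
import gym
import numpy as np
import yfinance as yf
from gym import spaces

class StockTradingEnv(gym.Env):
    # Hypothetical single-stock trading environment (illustrative sketch only).

    def __init__(self, prices, window=6, initial_cash=10_000.0):
        super().__init__()
        self.prices = prices              # 1-D array of daily closing prices
        self.window = window              # look-back over the past 6 days, as above
        self.initial_cash = initial_cash
        # Action: target fraction of the portfolio held in stock (0 = all cash, 1 = all stock).
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
        # Observation: the last `window` closing prices plus current cash and shares.
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(window + 2,), dtype=np.float32)

    def _observation(self):
        recent = self.prices[self.t - self.window:self.t]
        return np.concatenate([recent, [self.cash, self.shares]]).astype(np.float32)

    def reset(self):
        self.t = self.window
        self.cash, self.shares = self.initial_cash, 0.0
        return self._observation()

    def step(self, action):
        price = self.prices[self.t]
        portfolio = self.cash + self.shares * price
        # Rebalance toward the requested stock fraction (this covers buy, hold and sell).
        target_shares = float(action[0]) * portfolio / price
        self.cash += (self.shares - target_shares) * price
        self.shares = target_shares
        self.t += 1
        # Reward: the value of the portfolio after the action has been taken.
        reward = self.cash + self.shares * self.prices[self.t]
        done = self.t >= len(self.prices) - 1
        return self._observation(), reward, done, {}

# Illustrative usage with Apple's daily closes downloaded via yfinance.
prices = yf.download("AAPL", period="6mo")["Close"].to_numpy().ravel()
env = StockTradingEnv(prices)
An environment of this shape can then be passed to a Stable Baselines3 algorithm such as PPO, exactly as in the CartPole example earlier.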
Transformer For Reinforcement Learning
There are two major transformer architectures for reinforcement learning:
- The decision transformer
- The trajectory transformer
Decision Transformers
At its heart, the decision transformer
uses a different approach compared to
usual RL methods. Instead of teaching a system how to choose the best
action to get the most reward (something called a value function), the
decision transformer reformulates the problem as a sequence modeling
problem. Given a certain goal (the desired return) and information about
past actions and states, it tries to predict what actions should come next to
reach that goal. We will start with an examination of the decision
transformer, as illustrated in Figure 1.1. Here are its primary elements.
- Input — The decision transformer’s input consists of Return (Rt), State (St) and Action (at) tuples. The most recent K-step RSA (Return, State, Action) tuples are presented as a sequence and embedded to transform them into a continuous vector representation.
- Positional encoding — A process called positional encoding is applied to capture the relative positions of the RSA elements within the input sequence.
- Transformer layers — A GPT-2 model processes the input in an autoregressive manner.
- Linear layer for output — The culmination of the decision transformer structure is a linear layer. This layer maps the final decoder layer of the transformer into the action space, subsequently producing a sequence of actions to achieve the intended outcome.
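The Hugging Face transformers library (installed earlier) ships a DecisionTransformerModel that follows this layout. The sketch below runs a single forward pass on random tensors just to expose the Return-State-Action inputs and the per-step action predictions; the state and action dimensions are arbitrary placeholders, and a real agent would feed in actual trajectories and a desired return.
import torch
from transformers import DecisionTransformerConfig, DecisionTransformerModel

# Arbitrary placeholder dimensions for illustration.
state_dim, act_dim, seq_len = 17, 6, 20

config = DecisionTransformerConfig(state_dim=state_dim, act_dim=act_dim)
model = DecisionTransformerModel(config)

# The most recent K-step (Return, State, Action) tuples, plus timestep indices.
states = torch.randn(1, seq_len, state_dim)
actions = torch.randn(1, seq_len, act_dim)
rewards = torch.randn(1, seq_len, 1)
returns_to_go = torch.randn(1, seq_len, 1)
timesteps = torch.arange(seq_len).unsqueeze(0)
attention_mask = torch.ones(1, seq_len, dtype=torch.long)

with torch.no_grad():
    outputs = model(states=states, actions=actions, rewards=rewards,
                    returns_to_go=returns_to_go, timesteps=timesteps,
                    attention_mask=attention_mask)

# The final linear layer maps into the action space: one predicted action per step.
print(outputs.action_preds.shape)  # (1, seq_len, act_dim)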
Trajectory Transformer
The trajectory transformer shares a similarity with the decision transformer
in that they both approach the reinforcement learning task as a sequence
learning problem. However, there are important distinctions, particularly in
how the sequence represents the action, reward and state. Let us delve into
the details of the trajectory transformer. The architecture of the trajectory
transformer is illustrated in Figure 1.2.
Input
A trajectory, represented by τ, is a sequence comprising T steps of states, actions and individual rewards. This sequence can be expressed as:
τ = (s_1^1, ..., s_1^N, a_1^1, ..., a_1^M, r_1, ..., s_T^1, ..., s_T^N, a_T^1, ..., a_T^M, r_T)
Here, states and actions are discretized independently. Given states of N
dimensions and actions of M dimensions, the trajectory τ is transformed into
a sequence of length T(N + M + 1).
In this context, each token’s subscript represents the timestep, while the
superscripts on states and actions denote their respective dimensions. For
instance, at a given step, the states span N dimensions, denoted as s^1, ..., s^N,
and the actions occupy an M-dimensional space.
The transformer model uses a GPT-like structure with four decoder layers.
Output
The model is autoregressive and outputs the sequence of states, actions and
rewards.
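To make the flattening concrete, the sketch below discretizes a small synthetic trajectory and lays it out as one long token sequence of length T(N + M + 1). The dimensions, value ranges and bin count are arbitrary illustrative choices; the actual trajectory transformer implementation handles discretization differently in detail.
import numpy as np

T, N, M, num_bins = 4, 3, 2, 100  # timesteps, state dims, action dims, bins per dimension

# A synthetic continuous trajectory: per step, N state dims, M action dims and one reward.
states = np.random.uniform(-1, 1, size=(T, N))
actions = np.random.uniform(-1, 1, size=(T, M))
rewards = np.random.uniform(0, 1, size=(T, 1))

def discretize(x, low, high):
    # Map each continuous value independently onto one of `num_bins` integer tokens.
    return np.clip(((x - low) / (high - low) * num_bins).astype(int), 0, num_bins - 1)

tokens = []
for t in range(T):
    # Each timestep contributes s_t^1 ... s_t^N, a_t^1 ... a_t^M and r_t, in that order.
    tokens.extend(discretize(states[t], -1.0, 1.0))
    tokens.extend(discretize(actions[t], -1.0, 1.0))
    tokens.extend(discretize(rewards[t], 0.0, 1.0))

print(len(tokens), T * (N + M + 1))  # both equal 24 here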
For a deeper understanding of trajectory transformers, it is recommended to
explore their GitHub page.
Conclusion
As we stand at the crossroads of AI innovation, the integration of reinforcement learning with transformers heralds a new era of possibilities. From understanding complex system requirements and mastering essential PyTorch techniques to leveraging Stable Baselines3 and Gymnasium, the journey through stock market trading with RL showcases its transformative impact. The fusion of decision transformers and trajectory transformers exemplifies the power of adaptive learning, reshaping how models interpret and interact with their environments. With these advanced tools and methodologies, the future of intelligent systems promises not only remarkable accuracy but also an unprecedented capacity for autonomous evolution. As we continue to explore and refine these technologies, the potential for groundbreaking advancements in AI remains limitless.