Meta-Learning, OpenAI, and Sonic The Hedgehog


The goal of this research activity is to survey a resurging field within reinforcement learning called deep meta-reinforcement learning (meta-RL). OpenAI has just released a new research platform called Gym Retro, which is being proposed as a new benchmark for meta-RL research. For the last five years, the Atari games of the Arcade Learning Environment, now available within OpenAI’s Gym, have served as the proving ground for RL algorithms. Gym Retro utilizes a more modern console than the Atari, the SEGA Genesis. The Genesis has 500 times more RAM than the Atari console, allowing for a much greater range of possible control inputs and better graphics. The Genesis games consist of more complex and challenging levels, but their mechanics are more similar to one another than those of the Atari domain. This poses a challenge for algorithms that can “learn how to learn” over multiple levels or environments. In typical RL research, one architecture with the same hyperparameters is trained individually on each environment. The agent is then evaluated in the same environment where it was trained, which often favors algorithms that are good at memorization and are loaded with hyperparameters. Gym Retro provides a proving ground for meta-learning-augmented RL agents that can take advantage of the top level of the knowledge management pyramid: meta-knowledge, or wisdom.

Introduction & Background

The concept of meta-learning is actually quite simple, and whether you are aware of it or not, you most likely employ a strong meta-learning process either in school or when learning new skills. Meta-learning is defined as being aware of and taking control of one’s own learning process. For a familiar example, Bill, a student at Cal Poly, took a math test and completely failed it. After receiving his results, he was determined to do much better on the next test. So Bill examined his learning process and found that a couple of missed homework assignments and a failure to show up at office hours had left gaps in his knowledge for the exam. By analyzing his learning process and actively making changes to it, Bill was able to ace the next math test he took.

Out of all the biological systems on this planet, humans excel in having higher-order awareness of their learning process and applying it to achieve long-term goals. Meta-learning enables humans to constantly improve and to learn new skills more efficiently, so it is important to programmatically instill meta-learning capabilities in machine learning agents.

Reinforcement Learning (RL) is a subset of machine learning primarily concerned with optimal decision making. An RL problem consists of an agent and an environment, where the agent learns to operate optimally in that environment through interaction and trial and error. This is very similar to how a biological system would approach a new environment without any guidance. A familiar example is a young child left in a kitchen without any guidance. The child may be intrigued by the glowing red stovetop, touch the burner, and experience a painful burn. This negative experience reinforces the child’s knowledge to avoid touching glowing hot objects. However, on entering the pantry, the child finds a box of cookies and eats one. The sugar from the cookie triggers dopamine signals in the brain, positively reinforcing the act of finding and eating a cookie. These experiences will continue to shape that child’s future interactions with the kitchen.

Agent-Environment Model for RL
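
To make the agent-environment loop concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `KitchenEnv` is a hypothetical five-cell kitchen with the stove at one end and the cookie at the other, and the agent uses plain tabular Q-learning, which is only one of many RL algorithms and is not tied to any method discussed in this article.

```python
import random


class KitchenEnv:
    """Hypothetical 1-D kitchen: stove at the left end, cookie at the right end."""

    def __init__(self, size=5):
        self.size = size
        self.state = size // 2

    def reset(self):
        # Every episode starts in the middle of the kitchen.
        self.state = self.size // 2
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right.
        self.state += 1 if action == 1 else -1
        if self.state == 0:
            return self.state, -1.0, True   # touched the stove: negative reward
        if self.state == self.size - 1:
            return self.state, 1.0, True    # found the cookie: positive reward
        return self.state, 0.0, False


def train(env, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(env.size)]
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore with probability epsilon, or when both actions look equal.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env.step(action)
            # Move the estimate toward reward plus discounted future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Best known action per state (1 = right, 0 = left)."""
    return [0 if left > right else 1 for left, right in q]


q_table = train(KitchenEnv())
print(greedy_policy(q_table)[1:4])  # actions for the three non-terminal cells
```

After training, the greedy policy for the interior cells should favor moving right, toward the cookie and away from the stove, which mirrors how the child's experiences shape later behavior.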

The field of RL has undergone rapid advancements since neural network function approximation was applied to it in the early 2010s. Super-human performance has been achieved across a growing number of problems once considered too difficult. Although still in its infancy, RL can be applied to anything that can be modeled as a sequential decision-making problem, which ends up being a lot of problems!

Applications of RL

  • Manufacturing
  • Autonomous Vehicle Control
  • Command, Control, and Communication (C3) Systems
  • Public Utility Management
  • Inventory Management
  • Delivery Management
  • Finance Sector
  • Advertising
  • etc…

Meta-Learning Motivation

Above, I highlighted the strengths of RL. This section pinpoints serious limitations of RL and how meta-learning can help address them.

A problem with RL is that agents are very brittle, meaning that an agent must be trained on one environment and evaluated on that same environment. An agent trained on an Atari game cannot start managing the power supply of a neighborhood. Once training is complete, if the agent’s model is not saved, then all of that training and experience is lost. If the environment changes even slightly toward the end of training, the agent may start failing, because RL agents are prone to overfitting and memorization. Ultimately, the biggest limitation of current RL approaches is that they lack an awareness of the learning process.

RL agents augmented with meta-learning capabilities could overcome many of the limitations listed above. If an agent is aware of its learning process, it can build up experience across different environments. From this collection of experience, the agent can begin to build a base of wisdom that does not reset after training. This capability will help the agent be mindful of what has happened in the past and how that is affecting its current performance. The agent may even choose to completely change its learning process, perhaps by dynamically altering some hyperparameters or swapping in a different exploration strategy, in order to improve learning and maximize reward. A promising meta-learning capability is transfer learning, which I will cover in more depth later on. Transfer learning is an ability that humans utilize very well. When learning a new skill, like playing a new video game, we size up the problem and think, “OK, this is a shooter game, this button jumps in most shooter games I’ve played, this stick should help me move left and right, and I should probably avoid bullets and dying.” This transfer of knowledge from previous domains to the current one helps accelerate the learning process.
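
As a rough illustration of an agent adjusting a hyperparameter based on its own learning progress, consider the sketch below. The rule and every name in it (`adapt_epsilon`, `window`, `step`) are hypothetical and chosen only for this example; it is a simple heuristic, not a published meta-RL method. It lowers the exploration rate while recent returns are improving and raises it when they stall.

```python
def adapt_epsilon(epsilon, returns, window=5, step=0.05, low=0.01, high=1.0):
    """Hypothetical rule: explore less while returns improve, more when they stall."""
    if len(returns) < 2 * window:
        return epsilon  # not enough history to judge a trend yet
    # Compare the mean of the latest window with the window before it.
    older = sum(returns[-2 * window:-window]) / window
    newer = sum(returns[-window:]) / window
    epsilon = epsilon - step if newer > older else epsilon + step
    # Keep the exploration rate inside sensible bounds.
    return min(high, max(low, epsilon))


improving = [0.0] * 5 + [1.0] * 5   # made-up episode returns that got better
stalled = [1.0] * 10                # made-up episode returns that stopped changing
print(adapt_epsilon(0.5, improving), adapt_epsilon(0.5, stalled))
```

The point is only that the agent's training history becomes an input to how it learns next, which is the kind of awareness of the learning process described above.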

Meta-RL hopes to preserve the knowledge and wisdom gained by an agent across multiple environments. This would increase the efficiency and learning speed of the agent and help enable improvement of the learning process itself. Overall, meta-RL is a step toward overcoming the brittleness of RL and greatly expanding its possible applications.

Modern Meta-Learning & Knowledge Management

Meta-RL is a field of research concerned with programmatically instilling wisdom into an agent. Below is a description of the knowledge management pyramid with respect to Meta-RL.

  • Data is generated by an agent interacting with an environment.
  • Information is generated by organizing and giving context to the data, such as labelling actions and rewards and organizing sequences of history (a1,r1,s1,a2,r2,s2,…).
  • Knowledge is extracted by understanding and learning patterns in the information. The agent improves by accruing knowledge about what the optimal action is given its current information about the environment.
  • Wisdom is distilled from a base of knowledge, which is inherently very general. Thus, how does one go about programmatically instilling wisdom in an RL agent?

Knowledge Management Pyramid
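
The lower three tiers of the pyramid can be pictured with a small Python sketch. The record layout, the `Step` type, and the choice of "mean reward per action" as a stand-in for knowledge are all assumptions made for illustration; a real agent would learn far richer patterns.

```python
from collections import defaultdict
from typing import NamedTuple


class Step(NamedTuple):
    """One labelled entry of the history (a, r, s)."""
    action: str
    reward: float
    state: int


def to_information(raw_log):
    """Data -> information: give each raw record named fields, keeping the order."""
    return [Step(*record) for record in raw_log]


def to_knowledge(history):
    """Information -> knowledge: the mean reward observed for each action."""
    totals, counts = defaultdict(float), defaultdict(int)
    for step in history:
        totals[step.action] += step.reward
        counts[step.action] += 1
    return {action: totals[action] / counts[action] for action in totals}


# Data: made-up raw tuples produced by an agent interacting with an environment.
raw_log = [("jump", 1.0, 3), ("left", 0.0, 2), ("jump", 3.0, 4)]
history = to_information(raw_log)
print(to_knowledge(history))
```

Wisdom, the top tier, is deliberately missing from the sketch: how to distill it across many environments is the open question the rest of this section is concerned with.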

The idea of transfer learning is that you can take a skill you already know and use that knowledge as a baseline when learning a new skill.


For example, if someone already knows how to snowboard and they start to learn how to surf, chances are they will learn to surf more quickly than someone with no snowboarding knowledge. This is because, as humans, we don’t start learning every skill from scratch. For every new ‘environment’ we approach, there is an immense amount of preconceived knowledge or similar experience that we use as a baseline for the new skill. From all the time spent on the slopes, the snowboarder learning to surf is very familiar with shifting body weight on a board to influence the next moment’s position. The snowboarder may also have a stronger sense of balance and body control, learned from snowboarding, that carries over to surfing.
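
One simple way to express this idea in code is to initialize the value estimates for a new task from those learned on an old one. The sketch below is deliberately tiny and entirely hypothetical: the state names, the action meanings, and the numbers in `snowboard_q` are invented, and copying table entries is only one of many forms transfer learning can take.

```python
def warm_start(source_q, target_states, n_actions=2):
    """Build a Q-table for the target task, copying values for states both tasks share."""
    return {
        state: list(source_q.get(state, [0.0] * n_actions))
        for state in target_states
    }


def greedy_action(q_table, state):
    """Index of the action with the highest estimated value in this state."""
    values = q_table[state]
    return values.index(max(values))


# Made-up values "learned" while snowboarding; actions: 0 = lean back, 1 = lean forward.
snowboard_q = {
    "board_tilting_back": [0.1, 0.9],
    "board_tilting_forward": [0.8, 0.2],
}
# Surfing shares the balance states but adds one the snowboarder has never seen.
surf_states = ["board_tilting_back", "board_tilting_forward", "paddling_out"]

surf_q = warm_start(snowboard_q, surf_states)
print(greedy_action(surf_q, "board_tilting_back"))
```

The surfer starts with sensible behavior in the shared balance states and only has to learn the unfamiliar `paddling_out` state from scratch.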


OpenAI demonstrated transfer learning in RL agents in a 3-D physics simulation environment called MuJoCo. In the first 30 seconds of the video below, sumo wrestler agents are trained to keep themselves in the ring while forcing their opponents out. At 1:34–2:0, the “Sumo” agent is tested in a new environment without an opponent: unknown forces are applied to the agent, and the wrestler displays its ability to keep itself balanced and in the ring even against a completely new challenge.

OpenAI Competitive Self-Play & Transfer Learning Video

Retro Competition & Results

In this section I will go over the Retro Competition in more detail and discuss the results of the competition.

OpenAI, as a research institution, has been doing fantastic work in developing infrastructure to accelerate RL research. Famously, OpenAI Gym provides access to 100+ RL environments behind a common API, so that different RL algorithms can be plugged into these environments seamlessly. The Atari 2600 environments, now available within the Gym, have served as a general benchmark for RL since late 2013. The most recent development by OpenAI is Gym Retro and the Retro Competition, hosted from April 5th to June 5th, 2018.

Sonic 2 Loading Screen

This Gym Retro benchmark is much more demanding for RL agents than the Atari games, in terms of both action space (number of actions) and state space (number of possible states / pixel combinations), but it also has very similar mechanics between levels, i.e., the player is always controlling Sonic. In Atari, by contrast, the player goes from controlling a paddle to a spaceship to a dungeon explorer to a chicken, all with very different mechanics. This similarity of game mechanics is not an accident. OpenAI selected Sonic for Gym Retro and the competition because it is the perfect proving ground for testing a meta-RL algorithm.

This graph shows score over time for five different RL algorithms. The red dotted line denotes the human average. The graph shows that state-of-the-art RL is far from human-level performance in Sonic.

OpenAI Retro SEGA — Sonic Competition Rules

  • Competitors train a meta-RL agent on a training set consisting of 30 levels.
  • OpenAI evaluates the agent on three unseen test levels for one million time steps.
  • The average of mean scores for all test levels is used to measure aggregate performance.

Final standings, top three teams (team name, score):

  1. Dharmaraja, 6284.90
  2. Students of Plato, 5815.51
  3. Mistake, 5554.85
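
The scoring rule in the competition rules, the average of the mean scores over the test levels, can be sketched as follows. The level names and episode scores are made up, and the sketch assumes each level contributes a list of episode scores; how OpenAI's evaluator actually collects those scores is not described here.

```python
from statistics import mean


def aggregate_score(scores_by_level):
    """Average of the per-level mean scores, so every level gets equal weight."""
    level_means = [mean(scores) for scores in scores_by_level.values()]
    return mean(level_means)


# Hypothetical episode scores on three held-out test levels (invented numbers).
test_scores = {
    "test_level_a": [2000, 4000],
    "test_level_b": [6000],
    "test_level_c": [3000, 3000, 3000],
}
print(aggregate_score(test_scores))
```

With these numbers the per-level means are 3000, 6000, and 3000, so the aggregate is 4000. Pooling all six episodes instead would give 3500, which shows why averaging per level first keeps a level with many short episodes from dominating the result.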

The competition ended this past Tuesday (6–5–18), and the top three teams are listed above. The scores of all three teams are much higher than the current state of the art for RL or any meta-RL approaches, but just under human-level performance. As of today (6–10–18), there is still no update from OpenAI on the competition winners or their approaches, but stay tuned to this article: I will post an update section on how these teams implemented meta-learning capabilities in their Sonic agents.

Conclusion and Future work

In summary, meta-RL will:

  • Enhance the generality of reinforcement learning agents
  • Enable continuous improvement of the agent’s learning process
  • Improve learning speed through transfer learning
  • Help solve multi-objective problems

Most of this article was a theoretical overview of meta-RL, along with coverage of a competition meant to spur innovation in the field. Future work to move the field closer to meta-RL lies in designing hierarchical RL architectures that accommodate modular components implementing meta-learning functionality, and in integrating those modules into an RL agent’s workflow. Neuroscientists and biology researchers need to continue studying how humans learn, discovering more about our biological learning process and how we can imitate it to improve RL agents. Finally, the study of reward signals and environment engineering for machine learning agents is crucial to building agents that can generalize their learning and become familiar with a diverse set of environments.

Eclectic Software Engineer
