Friday, December 27, 2019

The Machine Learning Reproducibility Checklist

https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf


For all models and algorithms presented, check if you include:

  • A clear description of the mathematical setting, algorithm, and/or model
  • An analysis of the complexity (time, space, sample size) of any algorithm.
  • A link to a downloadable source code, with specification of all dependencies, including external libraries.
For any theoretical claim, check if you include:
  • A statement of the result.
  • A clear explanation of any assumptions.
  • A complete proof of the claim.  
For all figures and tables that present empirical results, check if you include: 

  • A complete description of the data collection process, including sample size.  
  • A link to a downloadable version of the dataset or simulation environment.
  • An explanation of any data that were excluded, and a description of any pre-processing steps.
  • An explanation of how samples were allocated for training / validation / testing. 
  • The range of hyper-parameters considered, method to select the best hyper-parameter configuration, and specification of all hyper-parameters used to generate results. 
  • The exact number of evaluation runs. 
  • A description of how experiments were run. 
  • A clear definition of the specific measure or statistics used to report results. 
  • Clearly defined error bars. 
  • A description of results with central tendency (e.g. mean) & variation (e.g. std dev).
  • A description of the computing infrastructure used.
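As a minimal illustration of the last two reporting items, central tendency and variation over evaluation runs can be computed and stated like this (the scores, metric name, and run count below are invented for the example):

```python
import statistics

# Hypothetical accuracy scores from five independent evaluation runs.
# The values and the "accuracy" metric are illustrative, not from the checklist.
run_scores = [0.912, 0.905, 0.921, 0.899, 0.917]

mean = statistics.mean(run_scores)
std_dev = statistics.stdev(run_scores)  # sample standard deviation

print(f"accuracy over {len(run_scores)} runs: "
      f"{mean:.3f} +/- {std_dev:.3f} (mean +/- std dev)")
```

Reporting the exact number of runs alongside the mean and standard deviation satisfies three checklist items at once.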

Friday, November 8, 2019

Uber’s Self-Driving Car Didn’t Know Pedestrians Could Jaywalk

https://www.wired.com/story/ubers-self-driving-car-didnt-know-pedestrians-could-jaywalk/


The software inside the Uber self-driving SUV that killed an Arizona woman last year was not designed to detect pedestrians outside of a crosswalk, according to new documents released as part of a federal investigation into the incident. That’s the most damning revelation in a trove of new documents related to the crash, but other details indicate that, in a variety of ways, Uber’s self-driving tech failed to consider how humans actually operate.
The National Transportation Safety Board, an independent government safety panel that more often probes airplane crashes and large truck incidents, posted documents on Tuesday regarding its 20-month investigation into the Uber crash. The panel will release a final report on the incident in two weeks. More than 40 of the documents, spanning hundreds of pages, dive into the particulars of the March 18, 2018 incident, in which the Uber testing vehicle, with 44-year-old Rafaela Vasquez in the driver's seat, killed a 49-year-old woman named Elaine Herzberg as she crossed a darkened road in the city of Tempe, Arizona. At the time, only one driver monitored the experimental car’s operation and software as it drove around Arizona. Video footage published in the weeks after the crash showed Vasquez reacting with shock during the moments just before the collision.
The new documents indicate that some mistakes were clearly related to Uber’s internal structure, what experts call “safety culture.” For one, the self-driving program didn’t include an operational safety division or safety manager.



The most glaring mistakes were software-related. Uber’s system was not equipped to identify or deal with pedestrians walking outside of a crosswalk. Uber engineers also appear to have been so worried about false alarms that they built in an automated one-second delay between a crash detection and action. In addition, the company chose to turn off a built-in Volvo braking system that the automaker later concluded might have dramatically reduced the speed at which the car hit Herzberg, or perhaps avoided the collision altogether. (Experts say the decision to turn off the Volvo system while Uber’s software did its work did make technical sense, because it would be unsafe for the car to have two software “masters.”)
Much of that explains why, despite the fact that the car detected Herzberg with more than enough time to stop, it was traveling at 43.5 mph when it struck her and threw her 75 feet. When the car first detected her presence, 5.6 seconds before impact, it classified her as a vehicle. Then it changed its mind to “other,” then to vehicle again, back to “other,” then to bicycle, then to “other” again, and finally back to bicycle.
It never guessed Herzberg was on foot for a simple, galling reason: Uber didn’t tell its car to look for pedestrians outside of crosswalks. “The system design did not include a consideration for jaywalking pedestrians,” the NTSB’s Vehicle Automation Report reads. Every time it tried a new guess, it restarted the process of predicting where the mysterious object—Herzberg—was headed. It wasn’t until 1.2 seconds before the impact that the system recognized that the SUV was going to hit Herzberg, that it couldn’t steer around her, and that it needed to slam on the brakes.
That triggered what Uber called “action suppression,” in which the system held off braking for one second while it verified “the nature of the detected hazard”—a second during which the safety operator, Uber’s most important and last line of defense, could have taken control of the car and hit the brakes herself. But Vasquez wasn’t looking at the road during that second. So with 0.2 seconds left before impact, the car sounded an audio alarm, and Vasquez took the steering wheel, disengaging the autonomous system. Nearly a full second after striking Herzberg, Vasquez hit the brakes.

In a statement, an Uber spokesperson said that the company “regrets the 2018 crash,” and emphasized that its Advanced Technologies Group has made changes to its safety program. According to Uber documents submitted to the NTSB as part of the investigation, Uber has changed its safety driver training in the 20 months since, and now puts two safety operators in each car. (Today, Uber tests self-driving cars in Pittsburgh, and will launch testing in Dallas this month.) The company has also changed the structure of its safety team and created a system where workers can anonymously report safety issues. “We deeply value the thoroughness of the NTSB’s investigation,” the spokesperson added.
Another factor in the crash was the Tempe road structure itself. Herzberg, wheeling a bicycle, crossed the street near a pathway that appeared purpose-built for walkers, but was 360 feet from the nearest crosswalk.
On November 19, the NTSB will hold a meeting on the incident in Washington, DC. Investigators will then release a comprehensive report on the crash, detailing what happened and who or what was at fault. Investigators will also make recommendations to federal regulators and to companies like Uber building the tech, outlining how to prevent crashes like this in the future.
For Herzberg, of course, it’s too late. Her family settled a lawsuit with Uber just 11 days after the crash.






Thursday, September 26, 2019

At Tech’s Leading Edge, Worry About a Concentration of Power

https://www.nytimes.com/2019/09/26/technology/ai-computer-expense.html

By Steve Lohr

Sept. 26, 2019, 3:00 a.m. ET

Each big step of progress in computing — from mainframe to personal computer to internet to smartphone — has opened opportunities for more people to invent on the digital frontier.

But there is growing concern that trend is being reversed at tech’s new leading edge, artificial intelligence.

Computer scientists say A.I. research is becoming increasingly expensive, requiring complex calculations done by giant data centers, leaving fewer people with easy access to the computing firepower necessary to develop the technology behind futuristic products like self-driving cars or digital assistants that can see, talk and reason.

The danger, they say, is that pioneering artificial intelligence research will be a field of haves and have-nots. And the haves will be mainly a few big tech companies like Google, Microsoft, Amazon and Facebook, which each spend billions a year building out their data centers.

In the have-not camp, they warn, will be university labs, which have traditionally been a wellspring of innovations that eventually power new products and services.

“The huge computing resources these companies have pose a threat — the universities cannot compete,” said Craig Knoblock, executive director of the Information Sciences Institute, a research lab at the University of Southern California.

The research scientists’ warnings come amid rising concern about the power of the big tech companies. Most of the focus has been on the current generation of technology — search, online advertising, social media and e-commerce. But the scientists are worried about a barrier to exploring the technological future, when that requires staggering amounts of computing.

The modern data centers of the big tech companies are sprawling and secretive. The buildings are the size of a football field, or larger, housing rack upon rack with hundreds of thousands of computers. The doors are bulletproof. The walls are fireproof. Outsiders are rarely allowed in.

These are the engine rooms of cloud computing. They help deliver a cornucopia of entertainment and information to smartphones and laptops, and they enable millions of developers to write cloud-based software applications.

But artificial intelligence researchers, outside the big tech companies, see a worrying trend in their field. A recent report from the Allen Institute for Artificial Intelligence observed that the volume of calculations needed to be a leader in A.I. tasks like language understanding, game playing and common-sense reasoning has soared an estimated 300,000 times in the last six years.

All that computing fuel is needed to turbocharge so-called deep-learning software models, whose performance improves with more calculations and more data. Deep learning has been the primary driver of A.I. breakthroughs in recent years.

“When it’s successful, there is a huge benefit,” said Oren Etzioni, chief executive of the Allen Institute, founded in 2014 by Paul Allen, the billionaire co-founder of Microsoft. “But the cost of doing research is getting exponentially higher. As a society and an economy, we suffer if there are only a handful of places where you can be on the cutting edge.”

The evolution of one artificial intelligence lab, OpenAI, shows the changing economics, as well as the promise of deep-learning A.I. technology.

Founded in 2015, with backing from Elon Musk, OpenAI began as a nonprofit research lab. Its ambition was to develop technology at the frontier of artificial intelligence and share the benefits with the wider world. It was a vision that suggested the computing tradition of an inspired programmer, working alone on a laptop, coming up with a big idea.

This spring, OpenAI used its technology to defeat the world champion team of human players at a complex video game called Dota 2. Its software learned the game by constant trial and error over months, the equivalent of more than 45,000 years of game play.

The OpenAI scientists have realized they are engaged in an endeavor more like particle physics or weather simulation, fields demanding huge computing resources. Winning at Dota 2, for example, required spending millions of dollars renting access to tens of thousands of computer chips inside the cloud computing data centers run by companies like Google and Microsoft.
[Image] “As a society and an economy, we suffer if there are only a handful of places where you can be on the cutting edge,” said Oren Etzioni, the chief executive of the Allen Institute. Credit: Kyle Johnson for The New York Times

Earlier this year, OpenAI morphed into a for-profit company to attract financing and, in July, announced that Microsoft was making a $1 billion investment. Most of the money, OpenAI said, would be spent on the computing power it needed to pursue its goals, which still include widely sharing the benefits of A.I., after paying off their investors.

As part of OpenAI’s agreement with Microsoft, the software giant will eventually become the lab’s sole source of computing.

“If you don’t have enough compute, you can’t make a breakthrough,” said Ilya Sutskever, chief scientist of OpenAI.

Academics are also raising concerns about the power consumed by advanced A.I. software. Training a large, deep-learning model can generate the same carbon footprint as the lifetime of five American cars, including gas, three computer scientists at the University of Massachusetts, Amherst, estimated in a recent research paper. (The big tech companies say they buy as much renewable energy as they can, reducing the environmental impact of their data centers.)

Mr. Etzioni and his co-authors at the Allen Institute say that perhaps both concerns — about power use and the cost of computing — could be at least partially addressed by changing how success in A.I. technology is measured.

The field’s single-minded focus on accuracy, they say, skews research along too narrow a path.

Efficiency should also be considered. They suggest that researchers report the “computational price tag” for achieving a result in a project as well.

Since their “Green A.I.” paper was published in July, their message has resonated with many in the research community.

Henry Kautz, a professor of computer science at the University of Rochester, noted that accuracy is “really only one dimension we care about in theory and in practice.” Others, he said, include how much energy is used, how much data is required and how much skilled human effort is needed for A.I. technology to work.

A more multidimensional view, Mr. Kautz added, could help level the playing field between academic researchers and computer scientists at the big tech companies, if research projects relied less on raw computing firepower.

Big tech companies are pursuing greater efficiency in their data centers and their artificial intelligence software, which they say will make computing power more available to the outside developers and academics.

John Platt, a distinguished scientist in Google’s artificial intelligence division, points to its recent development of deep-learning models, EfficientNets, which are 10 times smaller and faster than conventional ones. “That democratizes use,” he said. “We want these models to be trainable and accessible by as many people as possible.”

The big tech companies have given universities many millions over the years in grants and donations, but some computer scientists say they should do more to close the gap between the A.I. research haves and have-nots. Today, they say, the relationship that tech giants have to universities is largely as a buyer, hiring away professors, graduate students and even undergraduates.

The companies would be wise to also provide substantial support for academic research including much greater access to their wealth of computing — so the competition for ideas and breakthroughs extends beyond corporate walls, said Ed Lazowska, a professor at the University of Washington.

A more supportive relationship, Mr. Lazowska argues, would be in their corporate self-interest. Otherwise, he said, “We’ll see a significant dilution of the ability of the academic community to produce the next generation of computer scientists who will power these companies.”

At the Allen Institute in Seattle, Mr. Etzioni said, the team will pursue techniques to improve the efficiency of artificial intelligence technology. “This is a big push for us,” he said.

But Mr. Etzioni emphasized that what he was calling green A.I. should be seen as “an opportunity for additional ingenuity, not a restraint” — or a replacement for deep learning, which relies on vast computing power, and which he calls red A.I.

Indeed, the Allen Institute has just reached an A.I. milestone by correctly answering more than 90 percent of the questions on a standard eighth-grade science test. That feat was achieved with the red A.I. tools of deep learning.

Steve Lohr covers technology and economics. He was a foreign correspondent for a decade, and in 2013, he was part of the team awarded the Pulitzer Prize for Explanatory Reporting. @SteveLohr

Thursday, September 12, 2019

Teaching AI to plan using language in a new open-source strategy game

https://ai.facebook.com/blog/-teaching-ai-to-plan-using-language-in-a-new-open-source-strategy-game/



When humans face a complex challenge, we create a plan composed of individual, related steps. Often, these plans are formed as natural language sentences. This approach enables us to achieve our goal and also adapt to new challenges, because we can leverage elements of previous plans to tackle new tasks, rather than starting from scratch each time.
Facebook AI has developed a new method of teaching AI to plan effectively, using natural language to break down complex problems into high-level plans and lower-level actions. Our system innovates by using two AI models — one that gives instructions in natural language and one that interprets and executes them — and it takes advantage of the structure in natural language in order to address unfamiliar tasks and situations. We’ve tested our approach using a new real-time strategy game called MiniRTSv2, and found it outperforms AI systems that simply try to directly imitate human gameplay.
We’re now sharing our results which will be presented at NeurIPS 2019 later this year, and open-sourcing MiniRTSv2 so other researchers can use it to build and test their own imitation and reinforcement learning systems.
Previously, the AI research community has found it challenging to bring this hierarchical decision-making process to AI systems. Doing so has meant that researchers had to manually specify how to break down a problem into macro-actions, which is difficult to scale and requires expertise. Alternatively, if the AI system has been trained to focus on the end task, it is likely to learn how to achieve success through a single composite action rather than a hierarchy of steps. Our work with MiniRTSv2 shows that a different natural language-based method can make progress against these challenges.
While this is foundational research, it suggests that by using language to represent plans, these systems can more efficiently generalize to a variety of tasks and adapt to new circumstances. We believe this can bring us closer to our long-term goal of building AI that can adapt and generalize in real-world settings.

Building MiniRTSv2, an open source, NLP-ready game environment

MiniRTSv2 is a streamlined strategy game designed specifically for AI research. In the game, a player commands archers, dragons, and other units in order to defeat an opponent.
In this sample MiniRTSv2 gameplay — recorded directly from the tool’s interface — all the instructions that appear below the map field are generated by an instructor model, while the corresponding in-game actions, such as building and attacking units, are carried out by a separate executor model.
Though MiniRTSv2 is intentionally simpler and easier to learn than commercial games such as DOTA 2 and StarCraft, it still allows for complex strategies that must account for large state and action spaces, imperfect information (areas of the map are hidden when friendly units aren’t nearby), and the need to adapt strategies to the opponent’s actions. Used as a training tool for AI, the game can help agents learn effective planning skills, whether through NLP-based techniques or other kinds of training, such as reinforcement and imitation learning.

Using language to generate high-level plans and assign low-level instructions

We used MiniRTSv2 to train AI agents to first express a high-level strategic plan as natural language instructions and then to act on that plan with the appropriate sequence of low-level actions in the game environment. This approach leverages natural language’s built-in benefits for learning to generalize to new tasks. Those include the expressive nature of language — different combinations of words can represent virtually any concept or action — as well as its compositional structure, which allows people to combine and rearrange words to create new sentences that others can then understand. We applied these features to the entire process of planning and execution, from the generation of strategy and instructions to the interface that bridges the different parts of the system’s hierarchical structure.
Our AI agent plays a real-time strategy game using two models. The instructor creates plans based on continually observing the game state and issues instructions in natural language to the executor. The executor grounds these instructions as actions, based on the current state of the game.
The AI agent that we built to test this approach consists of a two-level hierarchy — an instructor model that decides on a course of action and issues commands, and an executor model that carries out those instructions. We trained both models using a data set collected from human participants playing MiniRTSv2.
Those participants worked in instructor-executor pairs, with designated instructors issuing orders in the form of written text, and executors accessing the game’s controls to carry those orders out. The commands ranged from clear-cut directives, such as “build 1 dragon,” to general instructions, such as “attack.” We used these natural language interactions between players to generate a data set of 76,000 pairs of instructions and executions across 5,392 games.
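The instructor/executor split described above can be sketched roughly as follows. The toy instruction set, game state, and keyword matching are illustrative stand-ins; in the real system both roles are filled by trained neural models:

```python
# Illustrative sketch of the two-model hierarchy: an instructor that emits
# natural language instructions, and an executor that grounds them in actions.
# All names and the decision rules here are invented for the example.

INSTRUCTION_SET = ["build 1 dragon", "build 1 archer", "attack"]

def instructor(state):
    """Pick a natural language instruction given the observed game state."""
    if state["enemy_visible"]:
        return "attack"
    return "build 1 dragon" if state["minerals"] >= 100 else "build 1 archer"

def executor(instruction, state):
    """Ground an instruction into a low-level game action."""
    if instruction == "attack":
        return {"action": "attack", "target": "enemy_base"}
    unit = instruction.split()[-1]          # e.g. "dragon"
    return {"action": "build", "unit": unit}

state = {"minerals": 120, "enemy_visible": False}
cmd = instructor(state)
act = executor(cmd, state)
print(cmd, "->", act)  # build 1 dragon -> {'action': 'build', 'unit': 'dragon'}
```

The key structural point the sketch preserves is that the two components communicate only through natural language strings, which is what lets the executor generalize to instructions it has not seen paired with this exact state.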

Leveraging the versatility of natural language to learn more generalized plans

Though MiniRTSv2 isn’t designed solely for NLP-related work, the game environment’s text interface allows us to explore ambiguous and context-dependent linguistic features that are relevant to building more versatile AI. For example, given the instruction “make two more cavalry and send them over with the other ones,” the executor model has to grasp that “the other ones” are existing cavalry, an inference that’s simple for most humans, but potentially challenging for AI. The agent also has to account for the kind of potentially confusing nuances that are common in natural language. The specific command “send idle peasant to mine mineral” should lead to the same action as the comparatively vague “back to mine,” which doesn’t specify which units should be moved.
At each time step within a given MiniRTSv2 game, our system relies on three encoders to turn inputs into feature vectors that the model can use. The observation encoder focuses on spatial inputs (where game objects appear on the map) and nonspatial inputs (such as the type of unit or building that a given game object represents); the instruction encoder generates vectors from a recent list of natural language instructions; and the auxiliary encoder learns vectors for the remaining global game attributes (such as the total amount of resources a player has).
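The three-encoder pipeline can be sketched with toy stand-ins; the real encoders are learned networks producing dense vectors, whereas these hand-written functions only show the data flow (all unit types, vocabulary, and attributes below are invented):

```python
# Toy sketch of the three encoders described above. Each one maps a
# different slice of the game state to a feature vector; the vectors are
# then concatenated into a single representation for the model.

def observation_encoder(game_objects):
    """Spatial/nonspatial inputs -> counts per object type (illustrative)."""
    types = ["peasant", "archer", "dragon"]
    return [sum(1 for o in game_objects if o["type"] == t) for t in types]

def instruction_encoder(recent_instructions):
    """Recent natural language instructions -> crude bag-of-words vector."""
    vocab = ["build", "attack", "mine"]
    text = " ".join(recent_instructions)
    return [text.count(w) for w in vocab]

def auxiliary_encoder(globals_):
    """Remaining global attributes (e.g. total resources) -> vector."""
    return [globals_["minerals"], globals_["num_units"]]

objs = [{"type": "peasant"}, {"type": "peasant"}, {"type": "dragon"}]
features = (observation_encoder(objs)
            + instruction_encoder(["build 1 dragon", "attack"])
            + auxiliary_encoder({"minerals": 120, "num_units": 3}))
print(features)  # [2, 0, 1, 1, 1, 0, 120, 3]
```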
But rather than clarifying phrasing or eliminating redundant permutations of the same order, we intentionally leave the human instruction examples (and corresponding executor actions) as they were delivered. The instructor model can’t formulate original sentences and has to select from examples from human play-throughs. This forces the agent to develop pragmatic inference, learning how to plan and execute based on natural language as humans actually use it, even when that usage is imprecise.
Training our system to not only generate latent language commands but also understand the context of those instructions resulted in a significant boost in performance over more traditional agents. Using MiniRTSv2, we pitted a number of different agents against an AI opponent that was trained to directly imitate human actions, without taking language into account. The results from these experiments showed that language consistently improved agents’ win rates. For example, our most sophisticated NLP-based agent, which uses a recurrent neural network (RNN) encoder to help differentiate similar orders, beat the non-language-based AI opponent 57.9 percent of the time. That’s substantially better than the imitation-based agent’s 41.2 percent win rate.
This is the first model to show improvements in planning by generating and executing latent natural language instructions. And though we employed a video game to evaluate our agents, the implications of this work go far beyond boosting the skills of game-playing AI bots, suggesting the long-term potential of employing language to improve generalization. Our evaluations showed that performance gains for NLP-based agents increased with larger instruction sets, as the models were able to use the compositional structure within language to better generalize across a wide range of examples.
And in addition to improving generalization, this approach has the significant side benefit of demonstrating how decision-making AI systems can be simultaneously high performance, versatile, and more interpretable. If an agent’s planning process is based on natural language, with sentences mapped directly to actions, understanding how a system arrived at a given action could be as simple as reading its internal transcript. The ability to quickly vet an AI’s behavior could be particularly useful for AI assistants, potentially allowing a user to fine-tune the system’s future actions.

Building language-based AI assistants through open science and collaboration

While our results have focused on using language as an aid for hierarchical decision-making, improving the ability of AI systems to utilize and understand natural language could pave the way for an even wider range of potential long-term benefits, such as assistants that are better at adapting to unfamiliar tasks and surroundings. Progress in this area might also yield systems that respond better to spoken or written commands, making devices and platforms more accessible to people who aren’t able to operate a touchscreen or mouse.
As promising as our results have been, the experimental task that we’re presenting, the NLP-based data set that we’ve created, and the MiniRTSv2 environment that we’ve updated are all novel contributions to the field. Exploring their full potential will require a substantial collective effort, which is why we’re inviting the wider AI community to use them. And these resources aren’t limited to one task — for example, since the MiniRTSv2 interface makes it easy to isolate the language activity from the recorded games, our data set of sample commands could be valuable for researchers training NLP systems, even if their work is unrelated to game performance or hierarchical decision-making. We look forward to seeing the results and insights that other researchers generate using these tools, as we continue to advance the application of language to improve the quality, versatility, and transparency of AI decision-making.

Thursday, August 29, 2019

Deep Reinforcement Learning

https://blog.dominodatalab.com/deep-reinforcement-learning/


Introduction

Recent feats in machine learning, like developing a program to defeat a human in the game of Go, have been powered by reinforcement learning. Reinforcement learning is the process of training a program to attain a goal through trial and error by incentivizing it with a combination of rewards and penalties. An agent works in the confines of an environment to maximize its rewards. From games to simulating evolution, reinforcement learning has been used as a tool to explore emergent behavior. As Domino seeks to help data scientists accelerate their work, we reached out to AWP Pearson for permission to excerpt the chapter “Deep Reinforcement Learning” from the book, Deep Learning Illustrated: A Visual, Interactive Guide to Artificial Intelligence by Krohn, Beyleveld, and Bassens. Many thanks to AWP Pearson for providing the permissions to excerpt the work and enabling us to provide a complementary Domino project.
This project works through the classic Cart-Pole problem, that is, balancing a pole attached to a movable cart. The contents will cover:
  • Using Keras and OpenAI Gym to create a deep learning model and provide the environment to train
  • Defining the agent
  • Modeling interactions between the agent and the environment

Chapter Introduction: Deep Reinforcement Learning

In Chapter 4 [in the book], we introduced the paradigm of reinforcement learning (as distinct from supervised and unsupervised learning), in which an agent (e.g., an algorithm) takes sequential actions within an environment. The environments—whether they be simulated or real world—can be extremely complex and rapidly changing, requiring sophisticated agents that can adapt appropriately in order to succeed at fulfilling their objective. Today, many of the most prolific reinforcement learning agents involve an artificial neural network, making them deep reinforcement learning algorithms.
In this chapter, we will
  • Cover the essential theory of reinforcement learning in general and, in particular, a deep reinforcement learning model called deep Q-learning
  • Use Keras to construct a deep Q-learning network that learns how to excel within simulated, video game environments
  • Discuss approaches for optimizing the performance of deep reinforcement learning agents
  • Introduce families of deep RL agents beyond deep Q-learning

Essential Theory of Reinforcement Learning

Recall from Chapter 4 (specifically, Figure 4.3)[in the book] that reinforcement learning is a machine learning paradigm involving:
  • An agent taking an action within an environment (let’s say the action is taken at some timestep t).
  • The environment returning two types of information to the agent:
    • Reward: This is a scalar value that provides quantitative feedback on the action that the agent took at timestep t. This could, for example, be 100 points as a reward for acquiring cherries in the video game Pac-Man. The agent’s objective is to maximize the rewards it accumulates, and so rewards are what reinforce productive behaviors that the agent discovers under particular environmental conditions.
    • State: This is how the environment changes in response to an agent’s action. During the forthcoming timestep (t + 1), these will be the conditions for the agent to choose an action in.
  • Repeating the above two steps in a loop until reaching some terminal state. This terminal state could be reached by, for example, attaining the maximum possible reward, attaining some specific desired outcome (such as a self-driving car reaching its programmed destination), running out of allotted time, using up the maximum number of permitted moves in a game, or the agent dying in a game.
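The loop described in the bullets above can be sketched with a toy environment. This is not OpenAI Gym; the environment, action set, and reward rule are invented for illustration, but the step/reward/terminal-state structure matches the Gym convention:

```python
import random

# Minimal agent/environment loop: the agent acts, the environment returns
# a new state, a scalar reward, and a done flag; the loop ends at a
# terminal state. Here the goal is simply to reach position 3.

random.seed(0)  # reproducible run length

class ToyEnv:
    def __init__(self):
        self.pos, self.goal = 0, 3

    def step(self, action):                 # action: -1 (left) or +1 (right)
        self.pos += action
        reward = 1 if action == 1 else -1   # reward the productive move
        done = self.pos >= self.goal        # terminal state reached?
        return self.pos, reward, done       # new state, reward, done flag

env = ToyEnv()
state, total, done = 0, 0, False
while not done:                             # loop until terminal state
    action = random.choice([-1, 1])         # a (very naive) random agent
    state, reward, done = env.step(action)
    total += reward

print("episode finished at state", state, "with total reward", total)
# prints: episode finished at state 3 with total reward 3
```

Note that the cumulative reward here always equals the final position, since each reward mirrors the displacement; a learning agent would use that reward signal to prefer the +1 action.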
Reinforcement learning problems are sequential decision-making problems. In Chapter 4 [in the book], we discussed a number of particular examples of these, including:
  • Atari video games, such as Pac-Man, Pong, and Breakout
  • Autonomous vehicles, such as self-driving cars and aerial drones
  • Board games, such as Go, chess, and shogi
  • Robot-arm manipulation tasks, such as removing a nail with a hammer

The Cart-Pole Game

In this chapter, we will use OpenAI Gym—a popular library of reinforcement learning environments (examples provided in Figure 4.13) [in the book]—to train an agent to play Cart-Pole, a classic problem among academics working in the field of control theory. In the Cart-Pole game:
  • The objective is to balance a pole on top of a cart. The pole is connected to the cart at a purple dot, which functions as a pin that permits the pole to rotate along the horizontal axis, as illustrated in Figure 13.1. [Note: An actual screen capture of the Cart-Pole game is provided in Figure 4.13a.]

  • The cart itself can only move horizontally, either to the left or to the right. At any given moment—at any given timestep—the cart must be moved to the left or to the right; it can’t remain stationary.
  • Each episode of the game begins with the cart positioned at a random point near the center of the screen and with the pole at a random angle near vertical.

  • As shown in Figure 13.2, an episode ends when either
    • The pole is no longer balanced on the cart—that is, when the angle of the
      pole moves too far away from vertical toward horizontal
    • The cart touches the boundaries—the far right or far left of the screen
  • In the version of the game that you’ll play in this chapter, the maximum number of timesteps in an episode is 200. So, if the episode does not end early (due to losing pole balance or navigating off the screen), then the game will end after 200 timesteps.
  • One point of reward is provided for every timestep that the episode lasts, so the maximum possible reward is 200 points.
The Cart-Pole game is a popular introductory reinforcement learning problem because it’s so simple. With a self-driving car, there are effectively an infinite number of possible environmental states: As it moves along a road, its myriad sensors—cameras, radar, lidar, accelerometers, microphones, and so on—stream in broad swaths of state information from the world around the vehicle, on the order of a gigabyte of data per second. [Note: Lidar works on the same principle as radar but uses lasers instead of radio waves. For the gigabyte-per-second estimate, see bit.ly/GBpersec.] The Cart-Pole game, in stark contrast, has merely four pieces of state information:
  1. The position of the cart along the one-dimensional horizontal axis
  2. The cart’s velocity
  3. The angle of the pole
  4. The pole’s angular velocity
Likewise, a number of fairly nuanced actions are possible with a self-driving car, such as accelerating, braking, and steering right or left. In the Cart-Pole game, at any given timestep t, exactly one action can be taken from only two possible actions: move left or move right.
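As a sketch, the Cart-Pole state can be pictured as a four-element NumPy array; the numbers below are invented for illustration:

```python
import numpy as np

# A hypothetical Cart-Pole observation, with the four state variables in order
state = np.array([0.02,    # 1. cart position along the horizontal axis
                  -0.01,   # 2. cart velocity
                  0.03,    # 3. pole angle
                  0.05])   # 4. pole angular velocity

ACTIONS = (0, 1)  # the two possible actions: 0 = move left, 1 = move right

print(state.shape[0], len(ACTIONS))  # 4 2
```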

Markov Decision Processes

Reinforcement learning problems can be defined mathematically as something called a Markov decision process. MDPs feature the so-called Markov property—an assumption that the current timestep contains all of the pertinent information about the state of the environment from previous timesteps. With respect to the Cart-Pole game, this means that our agent would elect to move right or left at a given timestep t by considering only the attributes of the cart (e.g., its location) and the pole (e.g., its angle) at that particular timestep t. [Note: The Markov property is assumed in many financial-trading strategies. As an example, a trading strategy might take into account the price of all the stocks listed on a given exchange at the end of a given trading day, while it does not consider the price of the stocks on any previous day.]

As summarized in Figure 13.3, the MDP is defined by five components:
  1. S is the set of all possible states. Following set-theory convention, each individual possible state (i.e., a particular combination of cart position, cart velocity, pole angle, and angular velocity) is represented by the lowercase s. Even when we consider the relatively simple Cart-Pole game, the number of possible recombinations of its four state dimensions is enormous. To give a couple of coarse examples, the cart could be moving slowly near the far-right of the screen with the pole balanced vertically, or the cart could be moving rapidly toward the left edge of the screen with the pole at a wide angle turning clockwise with pace.
  2. A is the set of all possible actions. In the Cart-Pole game, this set contains only two elements (left and right); other environments have many more. Each individual possible action is denoted as a.
  3. R is the distribution of reward given a state-action pair—some particular state paired with some particular action—denoted as (s, a). It’s a distribution in the sense of being a probability distribution: The exact same state-action pair (s, a) might randomly result in different amounts of reward r on different occasions. [Note: Although this is true in reinforcement learning in general, the Cart-Pole game in particular is a relatively simple environment that is fully deterministic. In the Cart-Pole game, the exact same state-action pair (s, a) will in fact result in the same reward every time. For the purposes of illustrating the principles of reinforcement learning in general, we use examples in this section that imply the Cart-Pole game is less deterministic than it really is.] The details of the reward distribution R—its shape, including its mean and variance—are hidden from the agent but can be glimpsed by taking actions within the environment. For example, in Figure 13.1, you can see that the cart is centered within the screen and the pole is angled slightly to the left. [Note: For the sake of simplicity, let’s ignore cart velocity and pole angular velocity for this example, because we can’t infer these state aspects from this static image.] We’d expect that pairing the action of moving left with this state s would, on average, correspond to a higher expected reward r relative to pairing the action of moving right with this state: Moving left in this state s should cause the pole to stand more upright, increasing the number of timesteps that the pole is kept balanced for, thereby tending to lead to a higher reward r. On the other hand, the move right in this state s would increase the probability that the pole would fall toward horizontal, thereby tending toward an early end to the game and a smaller reward r.
  4. P, like R, is also a probability distribution. In this case, it represents the probability of the next state (i.e., st+1) given a particular state-action pair (s, a) in the current timestep t. Like R, the P distribution is hidden from the agent, but again aspects of it can be inferred by taking actions within the environment. For example, in the Cart-Pole game, it would be relatively straightforward for the agent to learn that the left action corresponds directly to the cart moving leftward. [Note: As with all of the other artificial neural networks in this book, the ANNs within deep reinforcement learning agents are initialized with random starting parameters. This means that, prior to any learning (via, say, playing episodes of the Cart-Pole game), the agent has no awareness of even the simplest relationships between some state-action pair (s, a) and the next state st+1. For example, although it may be intuitive and obvious to a human player of the Cart-Pole game that the action left should cause the cart to move leftward, nothing is “intuitive” or “obvious” to a randomly initialized neural net, and so all relationships must be learned through gameplay.] More-complex relationships—for example, that the left action in the state s captured in Figure 13.1 tends to correspond to a more vertically oriented pole in the next state st+1—would be more difficult to learn and so would require more gameplay.
  5. γ (gamma) is a hyperparameter called the discount factor (also known as decay). To explain its significance, let’s move away from the Cart-Pole game for a moment and back to Pac-Man. The eponymous Pac-Man character explores a two-dimensional surface, gaining reward points for collecting fruit and dying if he gets caught by one of the ghosts that’s chasing him. As illustrated by Figure 13.4, when the agent considers the value of a prospective reward, it should value a reward that can be attained immediately (say, 100 points for acquiring cherries that are only one pixel’s distance away from Pac-Man) more highly than an equivalent reward that would require more timesteps to attain (100 points for cherries that are a distance of 20 pixels away). Immediate reward is more valuable than some distant reward, because we can’t bank on the distant reward: A ghost or some other hazard could get in Pac-Man’s way. [Note: The γ discount factor is analogous to the discounted cash flow calculations that are common in accounting: Prospective income a year from now is discounted relative to income expected today. Later in this chapter, we introduce concepts called value functions (V) and Q-value functions (Q). Both V and Q incorporate γ because it prevents them from becoming unbounded (and thus computationally impossible) in games with an infinite number of possible future timesteps.] If we were to set γ = 0.9, then cherries one timestep away would be considered to be worth 90 points, [Note: 100 × γ^t = 100 × 0.9^1 = 90] whereas cherries 20 timesteps away would be considered to be worth only 12.2 points. [Note: 100 × γ^t = 100 × 0.9^20 ≈ 12.16]
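The cherry arithmetic in the notes above takes only a couple of lines to verify:

```python
gamma = 0.9   # discount factor
reward = 100  # points for acquiring cherries

near = reward * gamma ** 1   # cherries one timestep away
far = reward * gamma ** 20   # cherries twenty timesteps away

print(round(near, 2), round(far, 2))  # 90.0 12.16
```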

The Optimal Policy

The ultimate objective with an MDP is to find a function that enables an agent to take an appropriate action a (from the set of all possible actions A) when it encounters any particular state s from the set of all possible environmental states S. In other words, we’d like our agent to learn a function that enables it to map S to A. As shown in Figure 13.5, such a function is denoted by π and we call it the policy function.
The high-level idea of the policy function π, using vernacular language, is this: Regardless of the particular circumstance the agent finds itself in, what is the policy it should follow that will enable it to maximize its reward? For a more concrete definition of this reward-maximization idea, you are welcome to pore over this:

π∗ = argmax_π J(π), where J(π) = E[ Σ_{t>0} γ^t r_t ]   (Equation 13.1)

In this equation:

  • J(π) is called an objective function. This is a function that we can apply machine learning techniques to in order to maximize reward. [Note: The cost functions (a.k.a. loss functions) referred to throughout this book are examples of objective functions. Whereas cost functions return some cost value C, the objective function J(π) returns some reward value r. With cost functions, our objective is to minimize cost, so we apply gradient descent to them (as depicted by the valley- descending trilobite back in Figure 8.2). With the function J(π), in contrast, our objective is to maximize reward, and so we technically apply gradient ascent to it (conjuring up Figure 8.2 imagery, imagine a trilobite hiking to identify the peak of a mountain) even though the mathematics are the same as with gradient descent.]
  • π represents any policy function that maps S to A.
  • π∗ represents a particular, optimal policy (out of all the potential π policies) for mapping S to A. That is, π∗ is a function that—fed any state s—will return an action a that will lead to the agent attaining the maximum possible discounted future reward.
  • To calculate the discounted future reward, Σ_{t>0} γ^t r_t, over all future timesteps (i.e., t > 0), we do the following:
    • Multiply the reward that can be attained in any given future timestep (r_t) by the discount factor of that timestep (γ^t).
    • Accumulate these individual discounted future rewards by summing them all up (using Σ).
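Those two steps—discount each future reward, then sum—can be expressed directly; the reward sequence here is made up:

```python
gamma = 0.9
future_rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical rewards r_t for timesteps t = 1..4

# Multiply each r_t by gamma**t, then accumulate with a sum
discounted_return = sum(gamma ** t * r
                        for t, r in enumerate(future_rewards, start=1))

print(round(discounted_return, 4))  # 3.0951
```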

Essential Theory of Deep Q-Learning Networks

In the preceding section, we defined reinforcement learning as a Markov decision process. At the end of the section, we indicated that as part of an MDP, we’d like our agent—when it encounters any given state s at any given timestep t—to follow some optimal policy π∗ that will enable it to select an action a that maximizes the discounted future reward it can obtain. The issue is that—even with a rather simple reinforcement learning problem like the Cart-Pole game—it is computationally intractable (or, at least, extremely computationally inefficient) to definitively calculate the maximum cumulative discounted future reward, Σ_{t>0} γ^t r_t.
Because of all the possible future states S and all the possible actions A that could be taken in those future states, there are way too many possible future outcomes to take into consideration. Thus, as a computational shortcut, we’ll describe the Q-learning approach for estimating what the optimal action a in a given situation might be.

Value Functions

The story of Q-learning is most easily described by beginning with an explanation of value functions. The value function is defined by V π (s). It provides us with an indication of how valuable a given state s is if our agent follows its policy π from that state onward.
As a simple example, consider yet again the state s captured in Figure 13.1. [Note: As we did earlier in this chapter, let’s consider cart position and pole position only, because we can’t speculate on cart velocity or pole angular velocity from this still image.] Assuming our agent already has some reasonably sensible policy π for balancing the pole, then the cumulative discounted future reward that we’d expect it to obtain in this state is probably fairly large because the pole is near vertical. The value Vπ(s), then, of this particular state s is high.
On the other hand, if we imagine a state sh where the pole angle is approaching horizontal, the value of it—V π(sh)—is lower, because our agent has already lost control of the pole and so the episode is likely to terminate within the next few timesteps.

Q-Value Functions

The Q-value function builds on the value function by taking into account not only the state: It considers the utility of a particular action when that action is paired with a given state—that is, it rehashes our old friend, the state-action pair symbolized by (s, a). Thus, where the value function is defined by Vπ(s), the Q-value function is defined by Qπ(s, a). [Note: The “Q” in Q-value stands for quality, but you seldom hear practitioners calling these “quality-value functions.”]
Let’s return once more to Figure 13.1. Pairing the action left (let’s call this aL) with this state s and then following a pole-balancing policy π from there should generally correspond to a high cumulative discounted future reward. Therefore, the Q-value of this state-action pair (s, aL) is high.
In comparison, let’s consider pairing the action right (we can call it aR) with the state s from Figure 13.1 and then following a pole-balancing policy π from there. Although this might not turn out to be an egregious error, the cumulative discounted future reward would nevertheless probably be somewhat lower relative to taking the left action. In this state s, the left action should generally cause the pole to become more vertically oriented (enabling the pole to be better controlled and better balanced), whereas the rightward action should generally cause it to become somewhat more horizontally oriented—thus, less controlled, and the episode somewhat more likely to end early. All in all, we would expect the Q-value of (s, aL ) to be higher than the Q-value of (s, aR).

Estimating an Optimal Q-Value

When our agent confronts some state s, we would then like it to be able to calculate the optimal Q-value, denoted as Q∗ (s, a). We could consider all possible actions, and the action with the highest Q-value—the highest cumulative discounted future reward— would be the best choice.
In the same way that it is computationally intractable to definitively calculate the optimal policy π∗ (Equation 13.1) even with relatively simple reinforcement learning problems, so too is it typically computationally intractable to definitively calculate an optimal Q-value, Q∗(s, a). With the approach of deep Q-learning (as introduced in Chapter 4; see Figure 4.5), however, we can leverage an artificial neural network to estimate what the optimal Q-value might be. These deep Q-learning networks (DQNs for short) rely on this equation:

Q∗(s, a) ≈ Q(s, a; θ)   (Equation 13.2)
In this equation:
  • The optimal Q-value (Q∗ (s, a)) is being approximated.
  • The Q-value approximation function incorporates neural network model parameters (denoted by the Greek letter theta, θ) in addition to its usual state s and action a inputs. These parameters are the usual artificial neuron weights and biases that we have become familiar with since Chapter 6.
In the context of the Cart-Pole game, a DQN agent armed with Equation 13.2 can, upon encountering a particular state s, calculate whether pairing an action a (left or right) with this state corresponds to a higher predicted cumulative discounted future reward. If, say, left is predicted to be associated with a higher cumulative discounted future reward, then this is the action that should be taken. In the next section, we’ll code up a DQN agent that incorporates a Keras-built dense neural net to illustrate hands-on how this is done.
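To make the action-selection step concrete: given Q-value estimates for each action (the numbers here are invented, standing in for a trained network’s output), the agent simply takes the argmax:

```python
import numpy as np

# Hypothetical DQN output for one state: index 0 = left, index 1 = right
q_values = np.array([[102.6, 98.1]])

# Choose the action with the higher predicted cumulative discounted future reward
action = int(np.argmax(q_values[0]))
print(action)  # 0, i.e., move left
```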
For a thorough introduction to the theory of reinforcement learning, including deep Q-learning networks, we recommend the recent edition of Richard Sutton (Figure 13.6) and Andrew Barto’s Reinforcement Learning: An Introduction, which is available free of charge at bit.ly/SuttonBarto.

Defining a DQN Agent

Our code for defining a DQN agent that learns how to act in an environment—in this particular case, it happens to be the Cart-Pole game from the OpenAI Gym library of environments—is provided within our Cartpole DQN Jupyter notebook. [Note: Our DQN agent is based directly on Keon Kim’s, which is available at his GitHub repository at bit.ly/keonDQN.] Its dependencies are as follows:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import os
The most significant new addition to the list is gym, the OpenAI Gym itself. As usual, we discuss each dependency in more detail as we apply it. The hyperparameters that we set at the top of the notebook are provided in Example 13.1.
Example 13.1 Cart-Pole DQN hyperparameters
env = gym.make('CartPole-v0')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
batch_size = 32
n_episodes = 1000
output_dir = 'model_output/cartpole/'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
Let’s look at this code line by line:
  • We use the OpenAI Gym make() method to specify the particular environment that we’d like our agent to interact with. The environment we choose is version zero (v0) of the Cart-Pole game, and we assign it to the variable env. On your own time, you’re welcome to select an alternative OpenAI Gym environment, such as one of those presented in Figure 4.13.
  • From the environment, we extract two parameters:
    • state_size: the number of types of state information, which for the Cart-Pole game is 4 (recall that these are cart position, cart velocity, pole angle, and pole angular velocity).
    • action_size: the number of possible actions, which for Cart-Pole is 2 (left and right).
  • We set our mini-batch size for training our neural net to 32.
  • We set the number of episodes (rounds of the game) to 1000. As you’ll soon see, this is about the right number of episodes it will take for our agent to excel regularly at the Cart-Pole game. For more-complex environments, you’d likely need to increase this hyperparameter so that the agent has more rounds of gameplay to learn in.
  • We define a unique directory name ('model_output/cartpole/') into which we’ll output our neural network’s parameters at regular intervals. If the directory doesn’t yet exist, we use os.makedirs() to make it.
The rather large chunk of code for creating a DQN agent Python class—called DQNAgent—is provided in Example 13.2.
Example 13.2 A deep Q-learning agent
class DQNAgent:

    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        model.add(Dense(32, activation='relu',
                        input_dim=self.state_size))
        model.add(Dense(32, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse',
                      optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action,
                            reward, next_state, done))

    def train(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward  # if done
            if not done:
                target = (reward +
                          self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def save(self, name):
        self.model.save_weights(name)

Initialization Parameters

We begin Example 13.2 by initializing the class with a number of parameters:
  • state_size and action_size are environment-specific, but in the case of the Cart-Pole game are 4 and 2, respectively, as mentioned earlier.
  • memory is for storing memories that can subsequently be replayed in order to train our DQN’s neural net. The memories are stored as elements of a data structure called a deque (pronounced “deck”), which is the same as a list except that—because we specified maxlen=2000—it only retains the 2,000 most recent memories. That is, whenever we attempt to append a 2,001st element onto the deque, its first element is removed, always leaving us with a list that contains no more than 2,000 elements.
  • gamma is the discount factor (a.k.a. decay rate) γ that we introduced earlier in this chapter (see Figure 13.4). This agent hyperparameter discounts prospective rewards in future timesteps. Effective γ values typically approach 1 (for example, 0.9, 0.95, 0.98, and 0.99). The closer to 1, the less we’re discounting future reward. [Note: Indeed, if you were to set γ = 1 (which we don’t recommend) you wouldn’t be discounting future reward at all.] Tuning the hyperparameters of reinforcement learning models such as γ can be a fiddly process; near the end of this chapter, we discuss a tool called SLM Lab for carrying it out effectively.
  • epsilon—symbolized by the Greek letter ε—is another reinforcement learning hyperparameter called exploration rate. It represents the proportion of our agent’s actions that are random (enabling it to explore the impact of such actions on the next state st+1 and the reward r returned by the environment) relative to how often we allow its actions to exploit the existing “knowledge” its neural net has accumulated through gameplay. Prior to having played any episodes, agents have no gameplay experience to exploit, so it is the most common practice to start it off exploring 100 percent of the time; this is why we set epsilon = 1.0.
  • As the agent gains gameplay experience, we very slowly decay its exploration rate so that it can gradually exploit the information it has learned (hopefully enabling it to attain more reward, as illustrated in Figure 13.7). That is, at the end of each episode the agent plays, we multiply its ε by epsilon_decay. Common options for this hyperparameter are 0.990, 0.995, and 0.999. [Note: Analogous to setting γ = 1, setting epsilon_decay = 1 would mean ε would not be decayed at all—that is, exploring at a continuous rate. This would be an unusual choice for this hyperparameter.]
  • epsilon_min is a floor (a minimum) on how low the exploration rate ε can decay to. This hyperparameter is typically set to a near-zero value such as 0.001, 0.01, or 0.02. We set it equal to 0.01, meaning that after ε has decayed to 0.01 (as it will in our case by the 911th episode), our agent will explore on only 1 percent of the actions it takes—exploiting its gameplay experience the other 99 percent of the time. [Note: If at this stage this exploration rate concept is somewhat unclear, it should become clearer as we examine our agent’s episode-by-episode results later on.]
  • learning_rate is the same stochastic gradient descent hyperparameter that we covered in Chapter 8.
  • Finally, the leading underscore in _build_model() suggests that it is a private method. This means that the method is recommended for use “internally” only—that is, solely by instances of the class DQNAgent.
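The deque’s rolling-window behavior described above is easy to verify in isolation:

```python
from collections import deque

memory = deque(maxlen=2000)
for i in range(2001):  # append one more element than the deque can hold
    memory.append(i)

# The deque silently dropped its oldest element (0) to stay at 2,000 entries
print(len(memory), memory[0])  # 2000 1
```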

Building the Agent’s Neural Network Model

The _build_model() method of Example 13.2 is dedicated to constructing and compiling a Keras-specified neural network that maps an environment’s state s to the agent’s Q-value for each available action a. Once trained via gameplay, the agent will then be able to use the predicted Q-values to select the particular action it should take, given a particular environmental state it encounters. Within the method, there is nothing you haven’t seen before in this book:
  • We specify a sequential model.
  • We add to the model the following layers of neurons.
  • The first hidden layer is dense, consisting of 32 ReLU neurons. Using the input_dim argument, we specify the shape of the network’s input layer, which is the dimensionality of the environment’s state information s. In the case of the Cart-Pole environment, this value is an array of length 4, with one element each for cart position, cart velocity, pole angle, and pole angular velocity. [Note: In environments other than Cart-Pole, the state information might be much more complex. For example, with an Atari video game environment like Pac-Man, state s would consist of pixels on a screen, which would be a two- or three-dimensional input (for monochromatic or full-color, respectively). In a case such as this, a better choice of first hidden layer would be a convolutional layer such as Conv2D (see Chapter 10).]
  • The second hidden layer is also dense, with 32 ReLU neurons. As mentioned earlier, we’ll explore hyperparameter selection—including how we home in on a particular model architecture—by discussing the SLM Lab tool later on in this chapter.
    • The output layer has dimensionality corresponding to the number of possible actions. [Note: Any previous models in this book with only two outcomes (as in Chapters 11 and 12) used a single sigmoid neuron. Here, we specify separate neurons for each of the outcomes, because we would like our code to generalize beyond the Cart-Pole game. While Cart-Pole has only two actions, many environments have more than two.] In the case of the Cart-Pole game, this is an array of length 2, with one element for left and the other for right. As with a regression model (see Example 9.8), with DQNs the z values are output directly from the neural net instead of being converted into a probability between 0 and 1. To do this, we specify the linear activation function instead of the sigmoid or softmax functions that have otherwise dominated this book.
  • As indicated when we compiled our regression model (Example 9.9), mean squared error is an appropriate choice of cost function when we use linear activation in the output layer, so we set the compile() method’s loss argument to mse. We return to our routine optimizer choice, Adam.

Remembering Gameplay

At any given timestep t—that is, during any given iteration of the reinforcement learning loop (refer back to Figure 13.3)—the DQN agent’s remember() method is run in order to append a memory to the end of its memory deque. Each memory in this deque consists of five pieces of information about timestep t:
  1. The state st that the agent encountered
  2. The action at that the agent took
  3. The reward rt that the environment returned to the agent
  4. The next_state st+1 that the environment also returned to the agent
  5. A Boolean flag done that is true if timestep t was the final iteration of the episode, and false otherwise
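A single memory, then, is just a five-element tuple appended to the deque; the state values below are invented for illustration:

```python
from collections import deque

memory = deque(maxlen=2000)

state = [0.02, -0.01, 0.03, 0.05]        # 1. the state s_t the agent encountered
action = 0                               # 2. the action a_t taken (left)
reward = 1.0                             # 3. the reward r_t the environment returned
next_state = [0.01, -0.20, 0.04, 0.31]   # 4. the next state s_t+1
done = False                             # 5. whether this was the episode's final timestep

memory.append((state, action, reward, next_state, done))
print(len(memory[0]))  # 5
```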

Training via Memory Replay

The DQN agent’s neural net model is trained by replaying memories of gameplay, as shown within the train() method of Example 13.2. The process begins by randomly sampling a minibatch of 32 (as per the agent’s batch_size parameter) memories from the memory deque (which holds up to 2,000 memories). Sampling a small subset of memories from a much larger set of the agent’s experiences makes model-training more efficient: If we were instead to use, say, the 32 most recent memories to train our model, many of the states across those memories would be very similar. To illustrate this point, consider a timestep t where the cart is at some particular location and the pole is near vertical. The adjacent timesteps (e.g., t − 1, t + 1, t + 2) are also likely to be at nearly the same location with the pole in a near-vertical orientation. By sampling from across a broad range of memories instead of temporally proximal ones, the model will be provided with a richer cornucopia of experiences to learn from during each round of training.
For each of the 32 sampled memories, we carry out a round of model training as follows: If done is True—that is, if the memory was of the final timestep of an episode— then we know definitively that the highest possible reward that could be attained from this timestep is equal to the reward rt. Thus, we can just set our target reward equal to reward.
Otherwise (i.e., if done is False) then we try to estimate what the target reward— the maximum discounted future reward—might be. We perform this estimation by starting with the known reward rt and adding to it the discounted [Note: That is, multiplied by gamma, the discount factor γ.] maximum future Q-value. Possible future Q-values are estimated by passing the next (i.e., future) state st+1 into the model’s predict() method. Doing this in the context of the Cart-Pole game returns two outputs: one output for the action left and the other for the action right. Whichever of these two outputs is higher (as determined by the NumPy amax function) is the maximum predicted future Q-value.
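The target computation just described can be sketched with made-up numbers standing in for the model’s predict() output:

```python
import numpy as np

gamma = 0.95   # discount factor
reward = 1.0   # known reward r_t from the sampled memory
done = False   # this memory was not the episode's final timestep

# Hypothetical model.predict(next_state) output: Q-values for left and right
predicted_next_q = np.array([[101.3, 97.8]])

target = reward  # if done, the known reward is the whole target
if not done:
    # Otherwise, add the discounted maximum predicted future Q-value
    target = reward + gamma * np.amax(predicted_next_q[0])

print(round(target, 3))  # 97.235
```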
Whether target is known definitively (because the timestep was the final one in an episode) or it’s estimated using the maximum future Q-value calculation, we continue onward within the train() method’s for loop:
  • We run the predict() method again, passing in the current state st. As before, in the context of the Cart-Pole game this returns two outputs: one for the left action and one for the right. We store these two outputs in the variable target_f.
  • Whichever action at the agent actually took in this memory, we use target_f[0][action] = target to replace that target_f output with the target reward. [Note: We do this because we can only train the Q-value estimate based on actions that were actually taken by the agent: We estimated target based on next_state st+1 and we only know what st+1 was for the action at that was actually taken by the agent at timestep t. We don’t know what next state st+1 the environment might have returned had the agent taken a different action than it actually took.]
We train our model by calling the fit() method.
  • The model input is the current state st and its output is target_f, which incorporates our approximation of the maximum future discounted reward. By tuning the model’s parameters (represented by θ in Equation 13.2), we thus improve its capacity to accurately predict the action that is more likely to be associated with maximizing future reward in any given state.
  • In many reinforcement learning problems, epochs can be set to 1. Instead of recycling an existing training dataset multiple times, we can cheaply engage in more episodes of the Cart-Pole game (for example) to generate as much fresh training data as we fancy.
  • We set verbose=0 because we don’t need any model-fitting outputs at this stage to monitor the progress of model training. As we demonstrate shortly, we’ll instead monitor agent performance on an episode-by-episode basis.
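The logic just described can be sketched as a standalone function—a hypothetical refactoring of the train() method in which the model and minibatch are passed in as arguments rather than accessed via self:

```python
import numpy as np

def replay_train(model, minibatch, gamma=0.95):
    """One pass of memory replay. Assumes `model` has Keras-style
    predict()/fit() methods and each memory is a
    (state, action, reward, next_state, done) tuple."""
    for state, action, reward, next_state, done in minibatch:
        if done:
            # Final timestep of the episode: the target is the known reward.
            target = reward
        else:
            # Otherwise, add the discounted maximum future Q-value,
            # estimated by predicting on the next state.
            target = reward + gamma * np.amax(model.predict(next_state)[0])
        # Predict Q-values for the current state, then overwrite only the
        # output for the action actually taken with the target reward.
        target_f = model.predict(state)
        target_f[0][action] = target
        # A single epoch suffices: fresh gameplay supplies new data cheaply.
        model.fit(state, target_f, epochs=1, verbose=0)
```

This is a sketch of the training step only; the book’s actual method additionally decays ε after each replay.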

Selecting an Action to Take

To select a particular action at to take at a given timestep t, we use the agent’s act() method. Within this method, the NumPy rand function is used to sample a random value between 0 and 1 that we’ll call v. In conjunction with our agent’s epsilon, epsilon_decay, and epsilon_min hyperparameters, this v value will determine for us whether the agent takes an exploratory action or an exploitative one:
  • If the random value v is less than or equal to the exploration rate ε, then a random exploratory action is selected using the randrange function. In early episodes, when ε is high, most of the actions will be exploratory. In later episodes, as ε decays further and further (according to the epsilon_decay hyperparameter), the agent will take fewer and fewer exploratory actions.
  • Otherwise—that is, if the random value v is greater than ε—the agent selects an action that exploits the “knowledge” the model has learned via memory replay. To exploit this knowledge, the state st is passed in to the model’s predict() method, which returns an activation output for each of the possible actions the agent could theoretically take. We use the NumPy argmax function to select the action at associated with the largest activation output. [Note: Recall that the activation is linear, and thus the output is not a probability; instead, it is the discounted future reward for that action.]
[Note: We introduced the exploratory and exploitative modes of action when discussing the initialization parameters for our DQNAgent class earlier, and they’re illustrated playfully in Figure 13.7.]
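This exploration/exploitation logic can be sketched as a standalone function—a hypothetical version of act() in which the agent’s epsilon, action_size, and model attributes are assumed rather than taken from the full class definition:

```python
import random
import numpy as np

def act(agent, state):
    """Epsilon-greedy action selection. Assumes `agent` carries
    `epsilon`, `action_size`, and a Keras-style `model`."""
    v = np.random.rand()  # random value between 0 and 1
    if v <= agent.epsilon:
        # Explore: pick an action uniformly at random.
        return random.randrange(agent.action_size)
    # Exploit: pick the action with the largest predicted Q-value.
    act_values = agent.model.predict(state)
    return int(np.argmax(act_values[0]))
```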

Saving and Loading Model Parameters

Finally, the save() and load() methods are one-liners that enable us to save and load the parameters of the model. Particularly with respect to complex environments, agent performance can be flaky: For long stretches, the agent may perform very well in a given environment, and then later appear to lose its capabilities entirely. Because of this flakiness, it’s wise to save our model parameters at regular intervals. Then, if the agent’s performance drops off in later episodes, the higher-performing parameters from some earlier episode can be loaded back up.
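These one-liners might look as follows inside the agent class, assuming (as elsewhere in this chapter) a Keras model stored in self.model:

```python
def save(self, name):
    # Keras one-liner: write the model's parameters to an HDF5 file
    self.model.save_weights(name)

def load(self, name):
    # Restore parameters saved at an earlier, higher-performing episode
    self.model.load_weights(name)
```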

Interacting with an OpenAI Gym Environment

Having created our DQN agent class, we can initialize an instance of the class—which we name agent—with this line of code:
agent = DQNAgent(state_size, action_size)
The code in Example 13.3 enables our agent to interact with an OpenAI Gym environment, which in our particular case is the Cart-Pole game.
Example 13.3 DQN agent interacting with an OpenAI Gym environment
for e in range(n_episodes):
    state = env.reset()
    state = np.reshape(state, [1, state_size])

    done = False
    time = 0
    while not done:
        # env.render()
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, n_episodes-1, time, agent.epsilon))
        time += 1
    if len(agent.memory) > batch_size:
        agent.train(batch_size)
    if e % 50 == 0:
        agent.save(output_dir + "weights_"
                   + '{:04d}'.format(e) + ".hdf5")
Recalling that we had set the hyperparameter n_episodes to 1000, Example 13.3 consists of a big for loop that allows our agent to engage in these 1,000 rounds of gameplay. Each episode of gameplay is counted by the variable e and involves:
  • We use env.reset() to begin the episode with a random state st. For the purposes of passing state into our Keras neural network in the orientation the model is expecting, we use reshape to convert it from a column into a row. [Note: We previously performed this transposition for the same reason back in Example 9.11.]
Nested within our thousand-episode loop is a while loop that iterates over the timesteps of a given episode. Until the episode ends (i.e., until done equals True), in each timestep t (represented by the variable time), we do the following.
  • The env.render() line is commented out because if you are running this code via a Jupyter notebook within a Docker container, this line will cause an error. If, however, you happen to be running the code via some other means (e.g., in a Jupyter notebook without using Docker) then you can try uncommenting this line. If an error isn’t thrown, then a pop-up window should appear that renders the environment graphically. This enables you to watch your DQN agent as it plays the Cart-Pole game in real time, episode by episode. It’s fun to watch, but it’s by no means essential: It certainly has no impact on how the agent learns!
  • We pass the state st into the agent’s act() method, and this returns the agent’s action at, which is either 0 (representing left) or 1 (right).
  • The action at is provided to the environment’s step() method, which returns the next_state st+1, the current reward rt, and an update to the Boolean flag done.
  • If the episode is done (i.e., done equals True), then we set reward to a negative value (-10). This provides a strong disincentive for the agent to end an episode early by losing control of balancing the pole or navigating off the screen. If the episode is not done (i.e., done is False), then reward is +1 for each additional timestep of gameplay.
  • In the same way that we needed to reorient state to be a row at the start of the episode, we use reshape to reorient next_state to a row here.
  • We use our agent’s remember() method to save all the aspects of this timestep (the state st, the action at that was taken, the reward rt, the next state st+1, and the flag done) to memory.
  • We set state equal to next_state in preparation for the next iteration of the loop, which will be timestep t + 1.
  • If the episode ends, then we print summary metrics on the episode (see Figures 13.8 and 13.9 for example outputs).
  • We add 1 to our timestep counter time.
Back within the outer for loop, at the end of each episode:
  • If the length of the agent’s memory deque is larger than the batch size, we use the agent’s train() method to train its neural net parameters by replaying its memories of gameplay. [Note: You can optionally move this training step up so that it’s inside the while loop. Each episode will take a lot longer because you’ll be training the agent much more often, but your agent will tend to solve the Cart-Pole game in far fewer episodes.]
  • Every 50 episodes, we use the agent’s save() method to store the neural net model’s parameters.
As shown in Figure 13.8, during our agent’s first 10 episodes of the Cart-Pole game, the scores were low. It didn’t manage to keep the game going for more than 42 timesteps (i.e., a score of 41).

During these initial episodes, the exploration rate ε began at 100 percent. By the 10th episode, ε had decayed to 96 percent, meaning that the agent was in exploitative mode (refer back to Figure 13.7) on about 4 percent of timesteps. At this early stage of training, however, most of these exploitative actions would probably have been effectively random anyway.
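As a rough check of that figure, here is how ε compounds under a hypothetical schedule in which a decay multiplier of 0.995 is applied once per episode (the exact trajectory depends on the epsilon_decay value used and on how often train() is actually called):

```python
epsilon, epsilon_decay, epsilon_min = 1.0, 0.995, 0.01

# Apply one multiplicative decay step per episode for 10 episodes.
for episode in range(10):
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

print(round(epsilon, 3))  # → 0.951
```

Under these assumptions ε lands near the roughly 96 percent reported; a slightly slower decay rate or fewer decay steps would match it exactly.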
As shown in Figure 13.9, by the 991st episode our agent had mastered the Cart-Pole game.

It attained a perfect score of 199 in all of the final 10 episodes by keeping the game going for 200 timesteps in each one. By the 911th episode, the exploration rate ε had reached its minimum of 1 percent, so during all of these final episodes the agent was in exploitative mode in about 99 percent of timesteps. From the perfect performance in these final episodes, it’s clear that these exploitative actions were guided by a neural net well trained by its gameplay experience from previous episodes.
As mentioned earlier in this chapter, deep reinforcement learning agents often display finicky behavior. When you train your DQN agent to play the Cart-Pole game, you might find that it performs very well during some later episodes (attaining many consecutive 200-timestep episodes around, say, the 850th or 900th episode) but then it performs poorly around the final (1,000th) episode. If this ends up being the case, you can use the load() method to restore model parameters from an earlier, higher-performing phase.

Hyperparameter Optimization with SLM Lab

At a number of points in this chapter, in one breath we’d introduce a hyperparameter and then in the next breath we’d indicate that we’d later introduce a tool called SLM Lab for tuning that hyperparameter. Well, that moment has arrived! [Note: “SLM” is an abbreviation of strange loop machine, with the strange loop concept being related to ideas about the experience of human consciousness. See Hofstadter, D. (1979). Gödel, Escher, Bach. New York: Basic Books.]
SLM Lab is a deep reinforcement learning framework developed by Wah Loon Keng and Laura Graesser, who are California-based software engineers (at the mobile-gaming firm MZ and within the Google Brain team, respectively). The framework is available at github.com/kengz/SLM-Lab and has a broad range of implementations and functionality related to deep reinforcement learning:
  • It enables the use of many types of deep reinforcement learning agents, including DQN and others (forthcoming in this chapter).
  • It provides modular agent components, allowing you to dream up your own novel categories of deep RL agents.
  • You can straightforwardly drop agents into environments from a number of different environment libraries, such as OpenAI Gym and Unity (see Chapter 4).
  • Agents can be trained in multiple environments simultaneously. For example, a single DQN agent can at the same time solve the OpenAI Gym Cart-Pole game and the Unity ball-balancing game Ball2D.
  • You can benchmark your agent’s performance in a given environment against others’ efforts.
Critically, for our purposes, the SLM Lab also provides a painless way to experiment with various agent hyperparameters to assess their impact on an agent’s performance in a given environment. Consider, for example, the experiment graph shown in Figure 13.10. In this particular experiment, a DQN agent was trained to play the Cart-Pole game during a number of distinct trials. Each trial is an instance of an agent with particular, distinct hyperparameters trained for many episodes. Some of the hyperparameters varied between trials were as follows.

  • Dense net model architecture
    • [32]: a single hidden layer, with 32 neurons
    • [64]: also a single hidden layer, this time with 64 neurons
    • [32, 16]: two hidden layers; the first with 32 neurons and the second with 16
    • [64, 32]: also with two hidden layers, this time with 64 neurons in the first hidden layer and 32 in the second
  • Activation function across all hidden layers
    • Sigmoid
    • Tanh
    • ReLU
  • Optimizer learning rate (η), which ranged from zero up to 0.2
  • Exploration rate (ε) annealing, which ranged from 0 to 100 [Note: Annealing is an alternative to ε decay that serves the same purpose. With the epsilon and epsilon_min hyperparameters set to fixed values (say, 1.0 and 0.01, respectively), variations in annealing will adjust epsilon_decay such that an ε of 0.01 will be reached by a specified episode. If, for example, annealing is set to 25 then ε will decay at a rate such that it lowers uniformly from 1.0 in the first episode to 0.01 after 25 episodes. If annealing is set to 50 then ε will decay at a rate such that it lowers uniformly from 1.0 in the first episode to 0.01 after 50 episodes.]
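The annealing scheme described in that note can be sketched as a small function (the name annealed_epsilon and its signature are our own; the values 1.0 and 0.01 follow the note):

```python
def annealed_epsilon(episode, anneal_episodes, eps_start=1.0, eps_min=0.01):
    """Linear annealing: epsilon falls uniformly from eps_start to
    eps_min over `anneal_episodes` episodes, then stays at eps_min."""
    if episode >= anneal_episodes:
        return eps_min
    fraction = episode / anneal_episodes
    return eps_start + fraction * (eps_min - eps_start)
```

With annealing set to 25, for instance, this schedule returns 1.0 at episode 0 and reaches 0.01 at episode 25.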
SLM Lab provides a number of metrics for evaluating model performance (some of which can be seen along the vertical axis of Figure 13.10):
  1. Strength: This is a measure of the cumulative reward attained by the agent.
  2. Speed: This is how quickly (i.e., over how many episodes) the agent was able to reach its strength.
  3. Stability: After the agent solved how to perform well in the environment, this is a measure of how well it retained its solution over subsequent episodes.
  4. Consistency: This is a metric of how reproducible the performance of the agent was across trials that had identical hyperparameter settings.
  5. Fitness: An overall summary metric that takes into account the above four metrics simultaneously.
Using the fitness metric in the experiment captured by Figure 13.10, it appears that the following hyperparameter settings are optimal for this DQN agent playing the Cart-Pole game:
  • A single-hidden-layer neural net architecture, with 64 neurons in that single layer outperforming the 32-neuron model.
  • The tanh activation function for the hidden layer neurons.
  • A low learning rate (η) of ~0.02.
  • Trials with an exploration rate (ε) that anneals over 10 episodes outperform trials that anneal over 50 or 100 episodes.
Details of running SLM Lab are beyond the scope of our book, but the library is well documented at kengz.gitbooks.io/slm-lab.

Agents Beyond DQN

In the world of deep reinforcement learning, deep Q-learning networks like the one we built in this chapter are relatively simple. To their credit, not only are DQNs (comparatively) simple, but—relative to many other deep RL agents—they also make efficient use of the training samples that are available to them. That said, DQN agents do have drawbacks. Most notable are:
  1. If the possible number of state-action pairs is large in a given environment, then the Q-function can become extremely complicated, and so it becomes intractable to estimate the optimal Q-value, Q∗.
  2. Even in situations where finding Q∗ is computationally tractable, DQNs are not great at exploring relative to some other approaches, and so a DQN may not converge on Q∗ anyway.
Thus, even though DQNs are sample efficient, they aren’t applicable to solving all problems.
To wrap up this deep reinforcement learning chapter, let’s briefly introduce the types of agents beyond DQNs. The main categories of deep RL agents, as shown in Figure 13.11, are:
  • Value optimization: These include DQN agents and their derivatives (e.g., double DQN, dueling DQN) as well as other types of agents that solve reinforcement learning problems by optimizing value functions (including Q-value functions).
  • Imitation learning: The agents in this category (e.g., behavioral cloning and conditional imitation learning algorithms) are designed to mimic behaviors that are taught to them through demonstration, by—for example—showing them how to place dinner plates on a dish rack or how to pour water into a cup. Although imitation learning is a fascinating approach, its range of applications is relatively small and we don’t discuss it further in this book.
  • Model optimization: Agents in this category learn to predict future states based on (s, a) at a given timestep. An example of one such algorithm is Monte Carlo tree search (MCTS), which we introduced with respect to AlphaGo in Chapter 4.
  • Policy optimization: Agents in this category learn policies directly, that is, they directly learn the policy function π shown in Figure 13.5. We’ll cover these in further detail in the next section.

Policy Gradients and the REINFORCE Algorithm

Recall from Figure 13.5 that the purpose of a reinforcement learning agent is to learn some policy function π that maps the state space S to the action space A. With DQNs, and indeed with any other value optimization agent, π is learned indirectly by estimating a value function such as the optimal Q-value, Q∗. With policy optimization agents, π is learned directly instead.
Policy gradient (PG) algorithms, which can perform gradient ascent on π directly, are exemplified by a particularly well-known reinforcement learning algorithm called REINFORCE. The advantage of PG algorithms like REINFORCE is that they are likely to converge on a fairly optimal solution, so they’re more widely applicable than value optimization algorithms like DQN. The trade-off is that PGs have low consistency. That is, they have higher variance in their performance relative to value optimization approaches like DQN, and so PGs tend to require a larger number of training samples.
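Although a full REINFORCE implementation is beyond our scope here, its central quantity—the discounted return that weights each policy-gradient update—is simple to compute. A minimal sketch (the function name and signature are our own):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every timestep of an
    episode, working backward from the final reward."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```

In REINFORCE, each timestep’s log-probability of the chosen action is scaled by its return G_t, so actions followed by high cumulative reward are made more probable.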

The Actor-Critic Algorithm

As suggested by Figure 13.11, the actor-critic algorithm is an RL agent that combines the value optimization and policy optimization approaches. More specifically, as depicted in Figure 13.12, the actor-critic combines the Q-learning and PG algorithms. At a high level, the resulting algorithm involves a loop that alternates between:
  • Actor: a PG algorithm that decides on an action to take.
  • Critic: a Q-learning algorithm that critiques the action that the actor selected, providing feedback on how to adjust. It can take advantage of efficiency tricks in Q-learning, such as memory replay.

In a broad sense, the actor-critic algorithm is reminiscent of the generative adversarial networks of Chapter 12. GANs have a generator network in a loop with a discriminator network, with the former creating fake images that are evaluated by the latter. The actor-critic algorithm has an actor in a loop with a critic, with the former taking actions that are evaluated by the latter.
The advantage of the actor-critic algorithm is that it can solve a broader range of problems than DQN, while it has lower variance in performance relative to REINFORCE. That said, because of the presence of the PG algorithm within it, the actor-critic is still somewhat sample inefficient.
While implementing REINFORCE and the actor-critic algorithm is beyond the scope of this book, you can use SLM Lab to apply them yourself, as well as to examine their underlying code.

Summary

In this chapter, we covered the essential theory of reinforcement learning, including Markov decision processes. We leveraged that information to build a deep Q-learning agent that solved the Cart-Pole environment. To wrap up, we introduced deep RL algorithms beyond DQN such as REINFORCE and actor-critic. We also described SLM Lab—a deep RL framework with existing algorithm implementations as well as tools for optimizing agent hyperparameters.
This chapter brings an end to Part III of this book, which provided hands-on applications of machine vision (Chapter 10), natural language processing (Chapter 11), art-generating models (Chapter 12), and sequential decision-making agents. In Part IV, the final part of the book, we will provide you with loose guidance on adapting these applications to your own projects and inclinations.