Chapter 1 Basic Concepts

A grid-world example

Pasted image 20241028153903.png
The definition of "good" depends on the task.

State

Pasted image 20241028154016.png
Here the state space is simply a set!

Action

Pasted image 20241028154138.png
The action space is also a set.
Note that the action space is a function of the state!
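
To make the "action space is a function of the state" remark concrete, here is a minimal sketch. It assumes a hypothetical 3x3 grid with states numbered 1 to 9 row by row, and five actions a1 to a5 meaning up, right, down, left, and stay; the grid size, the action meanings, and the choice to drop actions that would leave the grid are all illustrative assumptions, not taken from the slides.

```python
# Hypothetical 3x3 grid world: states 1..9, numbered row by row (state 1 is top-left).
N = 3

# Assumed action meanings as (row delta, column delta):
# a1 = up, a2 = right, a3 = down, a4 = left, a5 = stay.
ACTIONS = {
    "a1": (-1, 0),
    "a2": (0, 1),
    "a3": (1, 0),
    "a4": (0, -1),
    "a5": (0, 0),
}


def action_space(state):
    """Return A(s): the set of actions that keep the agent inside the grid."""
    row, col = divmod(state - 1, N)
    return {
        name
        for name, (d_row, d_col) in ACTIONS.items()
        if 0 <= row + d_row < N and 0 <= col + d_col < N
    }


print(sorted(action_space(1)))  # corner state: fewer actions
print(sorted(action_space(5)))  # centre state: all five actions
```

An alternative design keeps the same five actions in every state and lets the state transition handle moves that hit the boundary; both are valid ways to define the environment.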

State transition

Pasted image 20241028154419.png
In a game (simulated) environment, the state transitions can be defined however we like.
Pasted image 20241028154534.png
Case 1 and case 2 are two different definitions of the forbidden area. We study the first one, because it is more complex and may lead to more interesting behavior.
Pasted image 20241028154642.png
Tabular representation of the state transitions.
Pasted image 20241028154715.png
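Reading the first line of the mathematical expression: when the agent is in state s1 and takes action a2, it moves to s2 with probability 1, i.e. p(s2 | s1, a2) = 1.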

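The example in the slides is deterministic, but using probabilities also lets us describe stochastic behavior. For example, if a gust of wind can blow the agent off course, then from s1 under action a2 the probability of reaching s2 might be only 0.9.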
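
A stochastic transition like the wind example can be sketched as a lookup table of conditional probabilities. The notes only give the 0.9 for reaching s2; where the remaining 0.1 goes is not stated, so the sketch assumes (purely for illustration) that the agent stays in s1. The names `P` and `step` are hypothetical.

```python
import random

# P[(state, action)] maps each possible next state to p(next_state | state, action).
# The 0.9 comes from the wind example; sending the remaining 0.1 back to s1
# is an assumption made only so that the distribution sums to 1.
P = {
    ("s1", "a2"): {"s2": 0.9, "s1": 0.1},
}


def step(state, action, rng):
    """Sample the next state from p(. | state, action)."""
    dist = P[(state, action)]
    next_states = list(dist)
    weights = list(dist.values())
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [step("s1", "a2", rng) for _ in range(10_000)]
frac_s2 = samples.count("s2") / len(samples)
print(frac_s2)  # empirical frequency of reaching s2, close to 0.9
```

The deterministic case from the slides is the special case where one next state has probability 1.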

Policy

Pasted image 20241028155359.png
Pasted image 20241028155601.png
A policy gives the probability of taking each action in a given state.
Pasted image 20241028155649.png
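
A tabular policy can be sketched as one probability distribution over actions per state. The probabilities below are made-up illustrative values, and `is_valid` and `sample_action` are hypothetical helper names.

```python
import random

# policy[state][action] = pi(action | state). Values are illustrative assumptions.
policy = {
    "s1": {"a2": 0.5, "a3": 0.5},  # stochastic: two actions, equally likely
    "s2": {"a3": 1.0},             # deterministic: one action with probability 1
}


def is_valid(policy, tol=1e-9):
    """Check that every state's action probabilities are non-negative and sum to 1."""
    return all(
        min(dist.values()) >= 0 and abs(sum(dist.values()) - 1.0) < tol
        for dist in policy.values()
    )


def sample_action(policy, state, rng):
    """Draw an action according to pi(. | state)."""
    dist = policy[state]
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


rng = random.Random(0)
print(is_valid(policy))
print(sample_action(policy, "s2", rng))  # always a3, since pi(a3 | s2) = 1
```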

Reward

Pasted image 20241028161027.png

Pasted image 20241028161236.png

Pasted image 20241028161249.png

Pasted image 20241028161354.png

Trajectory and return

Pasted image 20241028161602.png

Pasted image 20241028161643.png

Pasted image 20241028161754.png

Discounted return

Pasted image 20241028161849.png

Pasted image 20241028162028.png
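
As a small numerical check of the discounted return, the sketch below computes r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list. The reward sequence (three zeros followed by ones) and gamma = 0.9 are assumed values for illustration; for that pattern continued forever, the geometric series gives gamma^3 / (1 - gamma).

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.9  # assumed discount rate

# Finite trajectory with assumed rewards 0, 0, 0, 1.
short = discounted_return([0, 0, 0, 1], gamma)

# Long truncation of the sequence 0, 0, 0, 1, 1, 1, ... to approximate the infinite sum.
long_run = discounted_return([0, 0, 0] + [1] * 1000, gamma)
closed_form = gamma**3 / (1 - gamma)

print(short, long_run, closed_form)
```

With gamma = 1 the same function gives the undiscounted return, which grows without bound for this reward pattern as the trajectory gets longer; that is the motivation for discounting.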

Episode

Pasted image 20241028162245.png
An episode is a complete trajectory that terminates after a finite number of steps.
Pasted image 20241028162513.png

Markov decision process (MDP)

Pasted image 20241028162746.png
M: Markov property
D: Decision, i.e. the policy
P: Process, i.e. the sets and probability distributions
Pasted image 20241028163013.png
Once the policy is fixed there is no decision (D: policy) left to make, so the Markov decision process reduces to a Markov process.
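
The reduction can be sketched numerically: combining a transition model p(s' | s, a) with a fixed policy pi(a | s) gives state-to-state probabilities p_pi(s' | s) = sum over a of pi(a | s) * p(s' | s, a), which no longer involve actions. The two-state model and all probabilities below are hypothetical values chosen for illustration.

```python
# Hypothetical transition model: p[(state, action)][next_state] = p(next_state | state, action).
p = {
    ("s1", "a2"): {"s2": 1.0},
    ("s1", "a5"): {"s1": 1.0},
    ("s2", "a5"): {"s2": 1.0},
}

# Hypothetical fixed policy: pi[state][action] = pi(action | state).
pi = {
    "s1": {"a2": 0.5, "a5": 0.5},
    "s2": {"a5": 1.0},
}


def induced_chain(p, pi):
    """Return chain[s][s_next] = sum over a of pi(a | s) * p(s_next | s, a)."""
    chain = {}
    for s, action_probs in pi.items():
        row = {}
        for a, prob_a in action_probs.items():
            for s_next, prob_next in p[(s, a)].items():
                row[s_next] = row.get(s_next, 0.0) + prob_a * prob_next
        chain[s] = row
    return chain


chain = induced_chain(p, pi)
print(chain)  # each row is a distribution over next states, with actions marginalised out
```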