PPO x Family: An Introductory Open Course on Decision Intelligence

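Welcome to PPO x Family, an introductory open course series on decision intelligence. The series takes an in-depth look at the deep reinforcement learning algorithm PPO and shows how to flexibly apply this single algorithm to nearly all common decision intelligence applications. It is meant to help anyone interested in deep reinforcement learning build application prototypes quickly and efficiently, and to get to know the powerful, easy-to-use PPO Family.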

Note: If you are passing by, please give the repository a star. It has been continuously updated since December 2022.

News

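2023.06.07: PPO x Family Chapter 8 (Breaking Through the Ultimate Limits of Agents) and the course final project will officially launch in late October
2023.06.01: [Bilibili] PPO x Family Chapter 7 (Digging into Advanced Tricks) officially launched
2023.04.06: [Bilibili] PPO x Family Chapter 6 (Coordinating Multiple Agents) officially launched
2023.03.09: [Bilibili] PPO x Family Chapter 5 (Exploring Temporal Modeling) officially launched
2023.02.23: [Bilibili] PPO x Family Chapter 4 (Decoding Sparse Reward Spaces) officially launched
2023.01.16: [Bilibili] PPO x Family Chapter 3 (Representing Multi-Modal Observation Spaces) officially launched
2022.12.23: [Bilibili] PPO x Family Chapter 2 (Deconstructing Complex Action Spaces) officially launched
2022.12.23: PPO x Family "algorithm-code" annotated documentation website launched (portal link)
2022.12.08: [Bilibili] PPO x Family Chapter 1 (Starting the Journey of Decision AI Exploration) officially launched
2022.12.06: [Bilibili] PPO x Family Chapter 1 micro-lecture video: a 4-minute quick start to the master key of reinforcement learning
2022.12.05: [PaperWeekly] One PPO × Family course to hold up the entire decision AI universe
2022.12.01: [Bilibili] PPO x Family course brand promotional video
2022.11.30: [机器之心 (Synced)] Focus on one point, evolve without limit: the PPO × Family introductory open course on decision intelligence starts today
2022.11.30: [China Computer Federation (CCF)] [CCF Science Popularization Stars Program] The introductory open course on decision intelligence has started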

Course Outline

# Content Navigation

| Chapter (video lecture) | Theory materials | Supplementary materials | Homework | Code examples | Application examples |
|------|-----|----------|-------|-----|---|
| [Chapter 1: Starting the Journey of Decision AI Exploration](https://www.bilibili.com/video/BV1cG4y137dJ) | [Lecture slides (PPT)](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_lecture.pdf) <br> [Lecture manuscript](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_manuscript.pdf) | [Micro-lecture video](https://www.bilibili.com/video/BV1e841157Um) <br> [Policy gradient](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_supp_pg.pdf) <br> [A2C](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_supp_a2c.pdf) <br> [TRPO](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_supp_trpo.pdf) <br> [Notation table](https://github.com/opendilab/PPOxFamily/blob/main/common/notation.pdf) <br> [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_qa.pdf) | [Homework](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_homework.pdf) <br> [Homework solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_hw_solution.pdf) | [PG example](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/pg_zh.py) <br> [A2C example](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/a2c_zh.py) <br> [PPO example](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/ppo_zh.py) | [Application montage](https://www.bilibili.com/video/BV1vW4y1M7cH/?spm_id_from=333.337.search-card.all.click) |
| [Chapter 2: Deconstructing Complex Action Spaces](https://www.bilibili.com/video/BV1wv4y167w2) | [Lecture slides (PPT)](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_lecture.pdf) <br> [Lecture manuscript](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_manuscript.pdf) | [Reparameterization](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_supp_reparameterization.pdf) <br> [PPO vs. DDPG](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_supp_ppovsddpg.pdf) <br> [HyAR](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_supp_hyar.pdf) <br> [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_qa.pdf) | [Homework](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_homework.pdf) <br> [Homework solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_hw_solution.pdf) | [Discrete action example](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/discrete_tutorial_zh.py) <br> [Continuous action example](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/continuous_tutorial_zh.py) <br> [Hybrid action example](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/hybrid_tutorial_zh.py) <br> [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_application_demo.py) | [Rocket recovery, etc.](https://github.com/opendilab/PPOxFamily/issues/4) |
| [Chapter 3: Representing Multi-Modal Observation Spaces](https://www.bilibili.com/video/BV1rK411r7Kg) | [Lecture slides (PPT)](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_lecture.pdf) <br> [Lecture manuscript](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_manuscript.pdf) | [Representation learning](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_supp_representation.pdf) <br> [PPG](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_supp_ppg.pdf) <br> [Invariance](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_supp_invariance.pdf) <br> [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_qa.pdf) | [Homework](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_homework.pdf) <br> [Homework solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_hw_solution.pdf) | [Encoding methods example](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/encoding.py) <br> [Wrapper example](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/mario_wrapper.py) <br> [Computation graph example](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/gradient.py) <br> [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_application_demo.py) | [Soft robots, etc.](https://github.com/opendilab/PPOxFamily/issues/8) |
| [Chapter 4: Decoding Sparse Reward Spaces](https://www.bilibili.com/video/BV15j411F7ni) | [Lecture slides (PPT)](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_lecture.pdf) <br> [Lecture manuscript](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_manuscript.pdf) | [Inverse reinforcement learning](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_supp_irl.pdf) <br> [Behavior cloning (BC)](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_supp_bc.pdf) <br> [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_qa.pdf) | [Homework](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_homework.pdf) <br> [Homework solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_hw_solution.pdf) | [ICM curiosity reward](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/curiosity_icm.py) <br> [RND curiosity reward](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/curiosity_rnd.py) <br> [Pop-Art example](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/popart.py) <br> [Value rescaling](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/value_rescale.py) <br> [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_application_demo.py) | [Autonomous driving, etc.](https://github.com/opendilab/PPOxFamily/issues/44) |
| [Chapter 5: Exploring Temporal Modeling](https://www.bilibili.com/video/BV1Uj411u7GA) | [Lecture slides (PPT)](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_lecture.pdf) | [Stochastic policy](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_supp_sto_det.pdf) <br> [RWKV](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_supp_rwkv.pdf) <br> [Belief MDP](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_supp_belief.pdf) <br> [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_qa.pdf) | [Homework](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_homework.pdf) <br> [Homework solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_hw_solution.pdf) | [LSTM example](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/lstm.py) <br> [GTrXL example](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/gtrxl.py) <br> [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_application_demo.py) | [Memory-based decision making](https://github.com/opendilab/PPOxFamily/issues/48) |
| [Chapter 6: Coordinating Multiple Agents](https://www.bilibili.com/video/BV1dg4y1g7BC) | [Lecture slides (PPT)](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_lecture.pdf) | [HAPPO](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/chapter6_supp_happo.pdf) <br> [ACE](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_supp_ace.pdf) <br> [Value decomposition](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/chapter6_supp_value_dec.pdf) <br> [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_qa.pdf) | [Homework](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_homework.pdf) <br> [Homework solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_hw_solution.pdf) | [Independent policy gradient](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/independentpg.py) <br> [Multi-agent policy gradient](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/mapg.py) <br> [Multi-agent PPO](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/mappo.py) <br> [HAPPO] <br> [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_application_demo.py) | [Multi-agent cooperation](https://github.com/opendilab/PPOxFamily/issues/62) |
| [Chapter 7: Digging into Advanced Tricks](https://www.bilibili.com/video/BV1ou4y1o7qY) | [Lecture slides (PPT)](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_lecture.pdf) | [Advantage function estimation](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_supp_adv.pdf) <br> [Off-policy PPO](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_supp_ppo_offpolicy.pdf) <br> [Entropy](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_supp_entropy.pdf) <br> [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_qa.pdf) | [Homework](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_homework.pdf) <br> [Homework solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_hw_solution.pdf) | [Generalized advantage estimation (GAE)](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/gae.py) <br> [Recompute](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/recompute.py) <br> [Gradient clipping](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/grad_clip_norm.py) <br> [Orthogonal initialization](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/orthogonal_init.py) <br> [Dual clip](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/dual_clip.py) <br> [Value clip](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/value_clip.py) <br> [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_application_demo.py) | [Academic benchmark environments](https://github.com/opendilab/PPOxFamily/issues/79) |
| Chapter 8: Breaking Through the Ultimate Limits | | Reinforcement learning from human feedback (RLHF) for large language models | | [Language model RL environment](https://github.com/opendilab/PPOxFamily/blob/main/chapter8_large/lm_env.py) | |

# Course Features

One algorithm for countless applications (video link)

One-to-one correspondence between algorithm theory and code implementation (website link)

Project Structure

.
├── LICENSE
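├── assets                       --> Related image assets (please credit the source when reposting)
├── chapter2_action              --> Chapter 2 course materials
└── chapter1_overview            --> Chapter 1 course materials
    ├── chapter1_manuscript.pdf  --> Chapter 1 lecture manuscript (supplementary notes to the slides)
    ├── chapter1_lecture.pdf     --> Chapter 1 lecture slides (PPT)
    ├── chapter1_qa.pdf          --> Chapter 1 Q&A notes
    ├── chapter1_homework.pdf    --> Chapter 1 homework
    ├── chapter1_hw_solution.pdf --> Chapter 1 homework solutions
    ├── chapter1_supp_trpo.pdf   --> Chapter 1 supplementary material (algorithm theory derivations, etc.)
    └── chapter1_demo_code.py    --> Chapter 1 code implementation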