Diffusion 中用到的 SDE/ODE 基础数学推导汇总 Summary of Basic SDE/ODE Math Derivations in Diffusion Models

不知不觉做生成方向已经几年过去了,近期刚好有一些需要,整理了一个 diffusion 中用到的 SDE/ODE 基础数学推导的汇总。扩散模型虽然诞生已久,但仍热度不减,持续有新人加入。希望这些总结可以帮助到有需要的大家。

SDE 作为在校课程

属于哪门课

SDE 全称 Stochastic Differential Equations,是「随机微分方程」这门课。

有哪些前置课程

前置课程包括:

  • 基础课:数学分析(微积分)、线性代数
  • 概率方向:概率论、随机过程
  • 方程相关:常微分方程(ODE)、偏微分方程(PDE)
  • 物理课(作为类比):非平衡态统计力学。全部掌握确实难,好在我们可以边用边查。

SDE 经典教材 & 大致讲了什么

Evans 的 An Introduction to Stochastic Differential Equations(PDF 链接)非常不错,在讲透 SDE 的前提下大大简化了对前置知识的依赖,对工业界的 SDE 使用者来说,可以当做权威参考资料使用。大概内容包括:

  • 布朗运动的定义、性质、构造
  • 随机积分、Ito 公式
  • 随机微分方程的定义、解的存在性和唯一性。大致经典的内容就包含这些,是一些自圆其说的基础理论,没有涉及到 SDE 在具体领域的应用或更深刻的性质。

教材里讲的够我们用么?需要补充什么

显然教材不够用。知道「SDE 是什么」只是一个开始,仍无法回答「怎么把 SDE 应用在生成模型」这个问题。从时间上来看,布朗运动是 Einstein 在 1905 年推导、Wiener 在 1920 年归纳成数学理论的;SDE 理论最重要的贡献者 伊藤清 (Kiyoshi Ito) 已经于 2008 年去世;扩散模型的概念提出是在 2015 年,普及于 2020 年。除了经典教材之外,还有很多额外的知识点需要补充。下面举一些例子:

  • Langevin Dynamics(1908 年)
  • Fokker-Planck Equation(1917 年)
  • Reverse Ito SDE(1982 年)
  • Score Matching(2005 年)
  • ... ... 当然还有很多的现代的知识,比如 DDPM、DDIM、Flow Matching,这些就是我们这个年代的了。

后面我们一点点展开来看吧。

Forward SDE(正向 modeling)

一般我们研究的方程是这样的形式:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x},t)dt + \mathbf{G}(\mathbf{x},t)d\mathbf{w}$$

其中 $\mathbf{w}$ 是维纳过程,$\mathbf{f}: \mathbb{R}^N \times [0, T] \to \mathbb{R}^N$,$\mathbf{G}: \mathbb{R}^N \times [0, T] \to \mathbb{R}^{N \times N}$。注意 $\mathbf{x}$ 也是关于 $t$ 变化的,是一个连续的随机变量序列 $\mathbf{x}(t)$,或者记为 $\mathbf{x}_t$,为了方便,通常就写成 $\mathbf{x}$ 就好。通常,在 $t \in [0, T]$ 的区间两端点上,$\mathbf{x}_0 \sim p_0(\mathbf{x}) = p_{data}(\mathbf{x})$ 和 $\mathbf{x}_T \sim p_T(\mathbf{x})$ 分别是样本分布和纯噪声分布(有些 paper 记号不是这样的,比如 Flow Matching,但是我们在本文档语境下就强行统一记号吧)

将研究对象定位为这个形式的依据:

  1. 在 Evans Page 77 关于 SDE 的定义是这个形式
  2. 在 经典论文 Reverse SDE (1982) 的 Eq-3.3 关于 SDE 的定义也是这个形式

所以,尽管常见的关于扩散模型的理论都采用了更简单的形式,比如 $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + g(t)d\mathbf{w}$,甚至 $d\mathbf{x} = \mathbf{v}(\mathbf{x}, t)dt$,我们还是研究一种更通用的形式。这乍一看是变麻烦了,但实则是变简单了,可以方便从经典的推断中借鉴我们需要的手法。

边缘分布:Fokker-Planck 方程

随机微分方程 $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$ 描述了随机变量 $\mathbf{x}$ 关于时间 $t$ 的关系。我们先介绍一个经典方程,称作 Fokker-Planck 方程。假设 $\mathbf{x}(t)$ 的概率密度函数为 $p_t(\mathbf{x})$,这通常称作边缘密度 (marginal density)。那么由上述 SDE 引导出的关于边缘密度 $p_t(\mathbf{x})$ 的方程为:

$$\frac{\partial p}{\partial t} = -\nabla \cdot [\mathbf{f}(\mathbf{x}, t)p(\mathbf{x}, t)] + \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T p(\mathbf{x}, t))]$$

这个方程就叫做 Fokker-Planck 方程。

这个化归的意义在于,把随机变量 $\mathbf{x}_t$ 随时间的演变,转化成边缘密度 $p_t$ 随时间的演变。为什么要做这个转变呢?因为我们通常采样都是以随机变量的形式,但是背后的数学原理很多是以概率密度函数的形式来进行描述。所以这个转变是必要的。

证明:

取一个任意的标量测试函数 $\phi(\mathbf{x}): \mathbb{R}^N \to \mathbb{R}$,不显式包含 $t$,且满足:光滑、支撑集有界。

然后我们关注函数 $\phi(\mathbf{x})$,下面将用两种方式计算它的「$\mathbb{E}$ & $\frac{\partial}{\partial t}$」的组合。

(1) 先计算期望,再求导数

其期望为 $\mathbb{E}[\phi(\mathbf{x}(t))] = \int_{\mathbb{R}^N} \phi(\mathbf{x})p(\mathbf{x}, t)d\mathbf{x}$

再求导得 $\frac{d}{dt}\mathbb{E}[\phi(\mathbf{x}(t))] = \int_{\mathbb{R}^N} \phi(\mathbf{x})\frac{\partial p(\mathbf{x}, t)}{\partial t}d\mathbf{x}$

(2) 先求导数,再求期望

计算 $d\phi$ 需要用到 Ito's 公式,见 Evans 的 Page 72 中间的那个公式

我们有:$d\phi = \frac{\partial \phi}{\partial t}dt + (\nabla \phi)^T d\mathbf{x} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right)dt$

因为 $\phi(\mathbf{x})$ 不显式包含 $t$,所以第一项为 0,进而

$$d\phi = (\nabla \phi)^T d\mathbf{x} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right)dt$$

将 Forward SDE $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$ 带进去,得到

$$\begin{aligned} d\phi &= (\nabla \phi)^T (\mathbf{f}dt + \mathbf{G}d\mathbf{w}) + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right)dt \\ &= \left( (\nabla \phi)^T\mathbf{f} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \right)dt + (\nabla \phi)^T\mathbf{G}d\mathbf{w} \end{aligned}$$

$$\mathbb{E}[d\phi] = \mathbb{E}\left[ \left( (\nabla \phi)^T\mathbf{f} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \right)dt \right] + \mathbb{E}[(\nabla \phi)^T\mathbf{G}d\mathbf{w}]$$

根据布朗运动性质,第二项为 0,所以

$$\mathbb{E}[d\phi] = \mathbb{E}\left[ \left( (\nabla \phi)^T\mathbf{f} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \right)dt \right]$$

$$\begin{aligned} \mathbb{E}\left[ \frac{d\phi}{dt} \right] &= \mathbb{E}\left[ (\nabla \phi)^T\mathbf{f} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \right] \\ &= \int_{\mathbb{R}^N} (\nabla \phi)^T\mathbf{f} \cdot p(\mathbf{x}, t)d\mathbf{x} + \frac{1}{2}\int_{\mathbb{R}^N} \text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \cdot p(\mathbf{x}, t)d\mathbf{x} \end{aligned}$$

(3) 考虑到前面两步的计算结果,我们有:

$$\int_{\mathbb{R}^N} \phi(\mathbf{x})\frac{\partial p(\mathbf{x}, t)}{\partial t}d\mathbf{x} = \int_{\mathbb{R}^N} (\nabla \phi)^T\mathbf{f} \cdot p(\mathbf{x}, t)d\mathbf{x} + \frac{1}{2}\int_{\mathbb{R}^N} \text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \cdot p(\mathbf{x}, t)d\mathbf{x}$$

作为积分的核函数,我们不希望表达式里有 $\phi$ 的导数,所以通过分部积分法,将导数转移到 $p$ 上去

$$\int_{\mathbb{R}^N} (\nabla \phi)^T\mathbf{f} \cdot p(\mathbf{x}, t)d\mathbf{x} = - \int_{\mathbb{R}^N} \phi(\mathbf{x})(\nabla \cdot (p\mathbf{f}))d\mathbf{x}$$

$$\int_{\mathbb{R}^N} \text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \cdot p(\mathbf{x}, t)d\mathbf{x} = \int_{\mathbb{R}^N} \phi(\mathbf{x})\nabla \cdot (\nabla \cdot (p\mathbf{G}\mathbf{G}^T))d\mathbf{x}$$

所以我们得到

$$\begin{aligned} \int_{\mathbb{R}^N} \phi(\mathbf{x})\frac{\partial p(\mathbf{x}, t)}{\partial t}d\mathbf{x} &= - \int_{\mathbb{R}^N} \phi(\mathbf{x})(\nabla \cdot (p\mathbf{f}))d\mathbf{x} + \frac{1}{2}\int_{\mathbb{R}^N} \phi(\mathbf{x})\nabla \cdot (\nabla \cdot (p\mathbf{G}\mathbf{G}^T))d\mathbf{x} \\ &= \int_{\mathbb{R}^N} \phi(\mathbf{x}) \left\{ -\nabla \cdot (p\mathbf{f}) + \frac{1}{2}\nabla \cdot (\nabla \cdot (p\mathbf{G}\mathbf{G}^T)) \right\} d\mathbf{x} \end{aligned}$$

(4) 考虑到上式对任意的测试函数 $\phi(\mathbf{x}): \mathbb{R}^N \to \mathbb{R}$(不显式包含 $t$,且满足:光滑、支撑集有界)均成立,那么其实可以把积分号拆掉,把 $\phi(\mathbf{x})$ 也拿掉,于是就得到了我们想要的 Fokker-Planck 方程:

$$\frac{\partial p(\mathbf{x}, t)}{\partial t} = -\nabla \cdot (p\mathbf{f}) + \frac{1}{2}\nabla \cdot (\nabla \cdot (p\mathbf{G}\mathbf{G}^T))$$

证毕。
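As a numerical sanity check of the Fokker-Planck equation (a minimal NumPy sketch; the OU process and all constants below are illustrative, not from the text): for the 1-D SDE $dx = -x\,dt + \sqrt{2}\,dw$, plugging a Gaussian ansatz into the Fokker-Planck equation yields closed-form moments, which an Euler-Maruyama simulation should reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D OU process: dx = -x dt + sqrt(2) dw.
# Plugging f = -x, G = sqrt(2) into the Fokker-Planck equation with a Gaussian
# ansatz gives  dm/dt = -m  and  dv/dt = -2v + 2, i.e.
#   m(t) = m0 e^{-t},   v(t) = 1 + (v0 - 1) e^{-2t}
m0, v0 = 2.0, 0.25
T, n_steps, n_paths = 1.0, 1000, 100_000
dt = T / n_steps

x = m0 + np.sqrt(v0) * rng.standard_normal(n_paths)
for _ in range(n_steps):
    x += -x * dt + np.sqrt(2 * dt) * rng.standard_normal(n_paths)

m_pred = m0 * np.exp(-T)                 # ≈ 0.7358
v_pred = 1 + (v0 - 1) * np.exp(-2 * T)   # ≈ 0.8985
# the empirical mean / variance of x should match m_pred / v_pred
```

The simulated moments agree with the Fokker-Planck prediction up to Monte Carlo and discretization error.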

Reverse SDE(反向 modeling)

将随机微分方程 $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$ 变成所谓的 reverse 形式,是扩散模型最重要的一件任务。

它的逻辑是这样的:

上述正向方程的反向方程为:

$$d\mathbf{x} = \left[ \mathbf{f}(\mathbf{x}, t) - \nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] dt + \mathbf{G}(\mathbf{x}, t) d\mathbf{\bar{w}}$$

时间范围依然是 $t \in [0, T]$。另外注意到,反向 SDE 必须要用到 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$,所以是需要借助预知的数据分布信息的,这就要求我们必须用网络来预测 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$,否则 reverse 就肯定无从谈起。

需要重点注意的事情是:能做到精准 reverse 的方程不是唯一的。接着往下看,后面我们可以看到,能成功将噪声分布 reverse 回数据分布的方程有无穷多个。只是上面这个方程,比较经典,也比较常用。

这里我们证明一下上面的方程是一个 reverse SDE(依赖于 Fokker-Planck 方程):

证明:

这个验证过程比较绕,需要将时间反转。具体地,令 $s = T - t$,

在上述关于 $t$ 的方程

$$d\mathbf{x}_t = \left[ \mathbf{f}(\mathbf{x}, t) - \nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] dt + \mathbf{G}(\mathbf{x}, t) d\mathbf{\bar{w}}$$

中,我们令

$$\mathbf{m}(\mathbf{x}, t) = -\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$

则上式变为

$$d\mathbf{x}_t = \left[ \mathbf{f}(\mathbf{x}, t) + \mathbf{m}(\mathbf{x}, t) \right] dt + \mathbf{G}(\mathbf{x}, t) d\mathbf{\bar{w}}$$

将上式转化为关于 $s$ 的方程(在原式中改写:$t \to T - s$,$dt \to -ds$,注意 $-ds$ 的负号不要忘了):

$$\begin{aligned} d\mathbf{x}_s &= \left[ \mathbf{f}(\mathbf{x}, T - s) + \mathbf{m}(\mathbf{x}, T - s) \right] (-ds) + \mathbf{G}(\mathbf{x}, T - s) d\mathbf{\bar{w}} \\ &= \left[ -\mathbf{f}(\mathbf{x}, T - s) - \mathbf{m}(\mathbf{x}, T - s) \right] ds + \mathbf{G}(\mathbf{x}, T - s) d\mathbf{\bar{w}} \end{aligned}$$

为了简便,记为:

$$d\mathbf{x} = \mathbf{h}(\mathbf{x}, s)ds + \mathbf{K}(\mathbf{x}, s) d\mathbf{\bar{w}}$$

其中

$$\mathbf{h}(\mathbf{x}, s) = -\mathbf{f}(\mathbf{x}, T - s) - \mathbf{m}(\mathbf{x}, T - s)$$

$$\mathbf{K}(\mathbf{x}, s) = \mathbf{G}(\mathbf{x}, T - s)$$

此外,记 reverse SDE 的边缘分布为 $q(\mathbf{x}, s)$。

我们将从以下关系开始化归,最终证明上述的 $\mathbf{m}(\mathbf{x}, t)$ 是下述方程的一个解。对 $\forall \lambda \in [0, T]$:

$$q(\mathbf{x}, \lambda) = p(\mathbf{x}, T - \lambda)$$

$$\left. \frac{\partial p}{\partial t} \right|_{t=\lambda} = - \left. \frac{\partial q}{\partial s} \right|_{s=T-\lambda}$$

首先展开等号两侧的偏导项。

1. 左侧,根据 Fokker-Planck 方程,并令 $\mathbf{D} = \mathbf{G}(\mathbf{x}, \lambda)\mathbf{G}(\mathbf{x}, \lambda)^T$:

$$\begin{aligned} \left. \frac{\partial p}{\partial t} \right|_{t=\lambda} &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] \\ &\quad + \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{G}(\mathbf{x}, \lambda)\mathbf{G}(\mathbf{x}, \lambda)^T p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] + \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \end{aligned}$$

2. 右侧,同样根据 Fokker-Planck 方程:

$$\begin{aligned} - \left. \frac{\partial q}{\partial s} \right|_{s=T-\lambda} &= \nabla \cdot [\mathbf{h}(\mathbf{x}, T - \lambda)q(\mathbf{x}, T - \lambda)] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{K}(\mathbf{x}, T - \lambda)\mathbf{K}(\mathbf{x}, T - \lambda)^T q(\mathbf{x}, T - \lambda))] \\ &= \nabla \cdot [\mathbf{h}(\mathbf{x}, T - \lambda)q(\mathbf{x}, T - \lambda)] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{G}(\mathbf{x}, \lambda)\mathbf{G}(\mathbf{x}, \lambda)^T q(\mathbf{x}, T - \lambda))] \\ &= \nabla \cdot [\mathbf{h}(\mathbf{x}, T - \lambda)p(\mathbf{x}, \lambda)] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{G}(\mathbf{x}, \lambda)\mathbf{G}(\mathbf{x}, \lambda)^T p(\mathbf{x}, \lambda))] \\ &= \nabla \cdot [\mathbf{h}(\mathbf{x}, T - \lambda)p(\mathbf{x}, \lambda)] - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= \nabla \cdot [[-\mathbf{f}(\mathbf{x}, \lambda) - \mathbf{m}(\mathbf{x}, \lambda)]p(\mathbf{x}, \lambda)] - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \end{aligned}$$

3. 检验 $\mathbf{m}(\mathbf{x}, t) = -\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ 时 $\left. \frac{\partial p}{\partial t} \right|_{t=\lambda} = - \left. \frac{\partial q}{\partial s} \right|_{s=T-\lambda}$ 是否成立。此时:

$$\begin{aligned} - \left. \frac{\partial q}{\partial s} \right|_{s=T-\lambda} &= \nabla \cdot [[-\mathbf{f}(\mathbf{x}, \lambda) - \mathbf{m}(\mathbf{x}, \lambda)]p(\mathbf{x}, \lambda)] - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] - \nabla \cdot [\mathbf{m}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] - \nabla \cdot \left[ \left( -\nabla \cdot \mathbf{D} - \mathbf{D}\frac{\nabla p}{p} \right) p(\mathbf{x}, \lambda) \right] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] - \nabla \cdot \left[ -(\nabla \cdot \mathbf{D})p - \mathbf{D}(\nabla p) \right] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] + \nabla \cdot \left[ (\nabla \cdot \mathbf{D})p + \mathbf{D}(\nabla p) \right] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] + \nabla \cdot (\nabla \cdot (\mathbf{D}p)) - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] + \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= \left. \frac{\partial p}{\partial t} \right|_{t=\lambda} \end{aligned}$$

可以看到,此时条件是满足的。再结合端点处的初始条件 $q(\mathbf{x}, 0) = p(\mathbf{x}, T)$,即可推出对任意 $\lambda$ 有 $q(\mathbf{x}, \lambda) = p(\mathbf{x}, T - \lambda)$。

证毕。
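The reverse SDE can also be checked numerically without any learned network, because for an OU forward process the marginals, and hence the true score, are available in closed form (a hedged NumPy sketch; all constants are illustrative): starting from samples of $p_T$ and integrating the reverse SDE should recover the moments of $p_0$.

```python
import numpy as np

rng = np.random.default_rng(1)
m0, v0, T = 2.0, 0.25, 1.0               # p_0 = N(2, 0.25)
n_steps, n_paths = 2000, 100_000
dt = T / n_steps

# forward OU process: dx = -x dt + sqrt(2) dw, so G G^T = 2 and ∇·(G G^T) = 0;
# its marginals are N(m(t), v(t)) with the closed forms below
m = lambda t: m0 * np.exp(-t)
v = lambda t: 1 + (v0 - 1) * np.exp(-2 * t)
score = lambda x, t: -(x - m(t)) / v(t)  # true ∇ log p_t(x)

# start from the terminal marginal p_T and step the reverse SDE backward in time:
#   x_{t-dt} = x_t - [f - G G^T score] dt + G sqrt(dt) z
x = m(T) + np.sqrt(v(T)) * rng.standard_normal(n_paths)
for i in range(n_steps):
    t = T - i * dt
    x += (x + 2 * score(x, t)) * dt + np.sqrt(2 * dt) * rng.standard_normal(n_paths)

# the endpoint should again look like p_0 = N(2, 0.25)
```

This is exactly the reverse equation above specialized to constant $\mathbf{G}$, where the $\nabla \cdot (\mathbf{G}\mathbf{G}^T)$ term vanishes.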

训练网络输出准确的 Score (Loss 推导)

在上面 Reverse SDE 中,我们看到一个量 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$。有个问题需要注意,就是这个东西在真正做 reverse sampling 的时候,其实是未知的!这就导致了一个尴尬的死循环:

不知道 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$,所以没有办法做 reverse sampling
做不了 reverse sampling,那就依然不知道 $p_t(\mathbf{x})$,顺带也不知道 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$
依然不知道 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$,那就依然没有办法做 reverse sampling
... ... 死循环

所以,我们必须要用网络来估计 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$,否则这个 reverse 是铁定完不成的。

预测对象的选取

但是,我们为什么不直接预测 $p_t(\mathbf{x})$,而是要绕一下,去估计 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ 呢?原因如下:

  1. $p_t(\mathbf{x})$ 需要满足一个「强条件」,就是 $\int_{\mathbf{x}} p_t(\mathbf{x})d\mathbf{x} \equiv 1$。这是一个作用在整个空间上的全局归一化约束,神经网络的输出几乎不可能精确满足它。
  2. 即便是取个 $\log$,记 $r_t(\mathbf{x}) = \log p_t(\mathbf{x})$,也不行。毕竟 $\int_{\mathbf{x}} \exp\{r_t(\mathbf{x})\}d\mathbf{x} \equiv 1$ 也是同样强的全局约束。
  3. 但是 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ 就没这个问题,它不需要满足任何全局归一化条件。所以任凭网络自由预测也不会违背什么。

定义 score function:

$$s(\mathbf{x}, t) = \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$

同时我们的目标也很明确了,让网络去预测一个 $s_\theta(\mathbf{x}, t)$,使得 $s_\theta(\mathbf{x}, t) \approx s(\mathbf{x}, t)$。

没有 Ground Truth 怎么办

另一个超大问题,其实是,$\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ 这东西的 Ground Truth 其实是没有的。

在训练中,我们有样本点 $\mathbf{x}_0$,但是并没有 $p_0(\mathbf{x})$。所以尽管我们按照已知的公式对 $\mathbf{x}_0$ 加噪得到 $\mathbf{x}_t$,我们还是不知道 $p_t(\mathbf{x})$。

但是!幸运地!

1. 我们知道条件分布 $p_t(\mathbf{x}_t | \mathbf{x}_0)$

一般是 $p_t(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; a\mathbf{x}_0, b\mathbf{I})$,对应于训练中的 add noise 过程。注意到 $\nabla_{\mathbf{x}} \log \{p_t(\mathbf{x}_t | \mathbf{x}_0)\} = -\frac{\mathbf{x}_t - a\mathbf{x}_0}{b}$,这个东西真的能写出来具体的表达式,所以我们可以在 loss 里面大胆使用。

2. 另一个重要的等式是:

$$\mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right] = \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right] + Const$$

其中 $Const$ 表示和 $s_\theta(\mathbf{x}_t, t)$ 无关的量。这个证明我们稍晚点在下面给出。

经过上面的化归,我们可以安心地让网络去估计 $\nabla_{\mathbf{x}} \log \{p_t(\mathbf{x}_t | \mathbf{x}_0)\}$ 了:

$$\theta^* = \arg\min_\theta \mathbb{E}_t \left\{ \lambda(t) \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}} \log \{p_t(\mathbf{x}_t | \mathbf{x}_0)\} \|_2^2 \right] \right\}$$

其中 $\lambda(t) > 0$,通常不怎么重要,随便设置大概都能训出来。

3. 其实 diffusion 训练框架的 loss 始终等价于这个形式,包括 DDPM、SMLD 等。

下面我们来证明一下上面那个等式:

$$\mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right] = \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right] + Const$$

证明过程很简单,只需要展开积分算一算就行了:

证明:

规定凡和 $s_\theta(\mathbf{x}_t, t)$ 无关的量一律记为 $Const$。此外需要用到边缘积分:$p_t(\mathbf{x}_t) = \int_{\mathbf{x}_0} p_0(\mathbf{x}_0)p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)d\mathbf{x}_0$

从左侧开始,展开算一下:

$$\begin{aligned} &\mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right] \\ &= \mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) \|_2^2 - 2 s_\theta(\mathbf{x}_t, t) \cdot \nabla \log p_t(\mathbf{x}_t) + \| \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right] \\ &= \int_{\mathbf{x}_t} \left( \| s_\theta(\mathbf{x}_t, t) \|_2^2 - 2 s_\theta(\mathbf{x}_t, t) \cdot \nabla \log p_t(\mathbf{x}_t) + \| \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right) p_t(\mathbf{x}_t)d\mathbf{x}_t \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t + \int_{\mathbf{x}_t} \| \nabla \log p_t(\mathbf{x}_t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} [s_\theta(\mathbf{x}_t, t) \cdot \nabla \log p_t(\mathbf{x}_t)] p_t(\mathbf{x}_t)d\mathbf{x}_t \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} [s_\theta(\mathbf{x}_t, t) \cdot \nabla \log p_t(\mathbf{x}_t)] p_t(\mathbf{x}_t)d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} s_\theta(\mathbf{x}_t, t) \cdot \nabla p_t(\mathbf{x}_t)d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} s_\theta(\mathbf{x}_t, t) \cdot \nabla \left[ \int_{\mathbf{x}_0} p_0(\mathbf{x}_0)p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)d\mathbf{x}_0 \right] d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} s_\theta(\mathbf{x}_t, t) \cdot \left[ \int_{\mathbf{x}_0} p_0(\mathbf{x}_0)\nabla p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)d\mathbf{x}_0 \right] d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} s_\theta(\mathbf{x}_t, t) \cdot \left[ \int_{\mathbf{x}_0} 
p_0(\mathbf{x}_0)[\nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)]p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)d\mathbf{x}_0 \right] d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} \int_{\mathbf{x}_0} p_0(\mathbf{x}_0)p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \cdot s_\theta(\mathbf{x}_t, t) d\mathbf{x}_0 d\mathbf{x}_t + Const \\ &= \mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) \|_2^2 \right] - 2 \cdot \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \cdot s_\theta(\mathbf{x}_t, t) \right] + Const \\ &= \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) \|_2^2 \right] - 2 \cdot \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \cdot s_\theta(\mathbf{x}_t, t) \right] + \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right] + Const \\ &= \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left\{ \| s_\theta(\mathbf{x}_t, t) \|_2^2 - 2 \cdot [\nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \cdot s_\theta(\mathbf{x}_t, t)] + \| \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right\} + Const \\ &= \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right] + Const \end{aligned}$$

证毕。
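The equality just proved can also be watched in action: because the loss is quadratic in $s_\theta$, fitting even a linear "network" $s_\theta(x) = wx + c$ by least squares on the conditional targets recovers the true marginal score. Below is a hedged NumPy sketch where $a$, $b$ are chosen so that the marginal $p_t$ is exactly $\mathcal{N}(0, 1)$, whose score is $-x$; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
a, b = 0.5, 0.75        # chosen so the marginal of x_t is N(0, a^2 + b) = N(0, 1)

x0 = rng.standard_normal(n)            # data distribution p_0 = N(0, 1)
eps = rng.standard_normal(n)
xt = a * x0 + np.sqrt(b) * eps         # add-noise: x_t ~ p_t(x_t | x_0)

# denoising score matching target: ∇ log p_t(x_t | x_0) = -(x_t - a x_0) / b
target = -(xt - a * x0) / b

# fit a linear "network" s_theta(x) = w x + c by least squares on the DSM loss
A = np.stack([xt, np.ones(n)], axis=1)
sol, *_ = np.linalg.lstsq(A, target, rcond=None)
w, c = sol

# the minimizer matches the TRUE marginal score of N(0, 1), namely -x:
# w ≈ -1, c ≈ 0, even though no target ever used ∇ log p_t(x_t) itself
```

The fitted slope approaches $-1$ and the intercept $0$: minimizing the conditional (DSM) loss lands on the marginal score, exactly as the $+\,Const$ identity promises.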

Tip:Score 和常见推导中 $\epsilon$ 的关系

若记加噪过程为 $\mathbf{x}_t = a\mathbf{x}_0 + \sqrt{b}\,\boldsymbol{\epsilon}$,其中 $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$,则 $\nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t | \mathbf{x}_0) = -\frac{\mathbf{x}_t - a\mathbf{x}_0}{b} = -\frac{\boldsymbol{\epsilon}}{\sqrt{b}}$,即 score 与 $\boldsymbol{\epsilon}$ 只差一个已知系数 $-\frac{1}{\sqrt{b}}$。

上述 tip 其实说明了,DDPM 训练中的「预测 $\boldsymbol{\epsilon}$」本质上就是在预测 score function,二者只差一个已知的缩放因子。
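The scale relation between the conditional score and the noise $\epsilon$ is a one-liner to verify numerically (a NumPy sketch with illustrative constants):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.8, 0.36                      # illustrative coefficients of p_t(x_t | x_0)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
xt = a * x0 + np.sqrt(b) * eps        # the add-noise step

cond_score = -(xt - a * x0) / b       # ∇ log p_t(x_t | x_0) for N(a x0, b I)
# elementwise identical to -eps / sqrt(b): the score is epsilon times a fixed negative scale
```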

正反方程的离散化 (Sampling)

SDE 描述了随机变量随时间的连续演化。但是在实际应用中,我们无法在连续时间上进行计算,因此需要将 SDE 离散化为一系列小的时间步长。Euler-Maruyama 方法是最简单也是最常用的 SDE 离散化方法。

在离散化之前,需要注意一个事实:

$$d\mathbf{w} \sim \mathcal{N}(\mathbf{0}, |dt| \cdot \mathbf{I})$$

所以可以认为:

$$d\mathbf{w} = \sqrt{|dt|} \cdot \mathbf{z} \text{,其中 } \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Forward SDE 的 Euler-Maruyama 离散化

取 $dt$ 为 $1$,取两个时间点为 $\{t, t+1\}$,则:

$$\begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t + \mathbf{f}(\mathbf{x}_t, t) \cdot 1 + \mathbf{G}(\mathbf{x}_t, t)\sqrt{1}\mathbf{z}_t \\ &= \mathbf{x}_t + \mathbf{f}(\mathbf{x}_t, t) + \mathbf{G}(\mathbf{x}_t, t)\mathbf{z}_t \end{aligned}$$

Reverse SDE 的 Euler-Maruyama 离散化

取 $dt$ 为 $-1$,取两个时间点为 $\{t, t-1\}$,则 Reverse SDE 可以离散化为:

$$\begin{aligned} \mathbf{x}_{t-1} &= \mathbf{x}_t + \left[ \mathbf{f}(\mathbf{x}_t, t) - \nabla \cdot (\mathbf{G}(\mathbf{x}_t, t)\mathbf{G}(\mathbf{x}_t, t)^T) - \mathbf{G}(\mathbf{x}_t, t)\mathbf{G}(\mathbf{x}_t, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t) \right] \cdot (-1) + \mathbf{G}(\mathbf{x}_t, t)\sqrt{|-1|}\mathbf{z}_t \\ &= \mathbf{x}_t - \left[ \mathbf{f}(\mathbf{x}_t, t) - \nabla \cdot (\mathbf{G}(\mathbf{x}_t, t)\mathbf{G}(\mathbf{x}_t, t)^T) - \mathbf{G}(\mathbf{x}_t, t)\mathbf{G}(\mathbf{x}_t, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t) \right] + \mathbf{G}(\mathbf{x}_t, t)\mathbf{z}_t \end{aligned}$$

当然还有很多其他的离散化方式,只要在极限形式下等价于 SDE,那应该就都是有意义的。

SDE 实战案例

简洁推导 DDPM

取 $\mathbf{f}(\mathbf{x}, t) = (\sqrt{1 - \beta_t} - 1)\mathbf{x}$ 以及 $\mathbf{G}(\mathbf{x}, t) = \sqrt{\beta_t}\,\mathbf{I}$

则 forward SDE 为:

$$d\mathbf{x} = (\sqrt{1 - \beta_t} - 1)\mathbf{x}dt + \sqrt{\beta_t}d\mathbf{w}$$

离散化后为:

$$\mathbf{x}_{t+1} = \sqrt{1 - \beta_t}\mathbf{x}_t + \sqrt{\beta_t}\mathbf{z}_t$$
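The discrete recursion above composes into the familiar closed form $\mathbf{x}_T = \sqrt{\bar\alpha_T}\mathbf{x}_0 + \sqrt{1-\bar\alpha_T}\,\boldsymbol{\epsilon}$ with $\bar\alpha_T = \prod_t (1-\beta_t)$, which a simulation confirms (NumPy sketch; the schedule and data distribution are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1000, 100_000
betas = np.linspace(1e-4, 0.02, T)           # illustrative linear schedule
alpha_bar = np.prod(1 - betas)               # \bar{alpha}_T

mu0, sigma0 = 3.0, 0.5
x = mu0 + sigma0 * rng.standard_normal(n)    # x_0 ~ N(3, 0.25)
for beta in betas:                           # x_{t+1} = sqrt(1-b) x_t + sqrt(b) z_t
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)

# closed form after T steps: x_T = sqrt(alpha_bar) x_0 + sqrt(1 - alpha_bar) eps
m_pred = np.sqrt(alpha_bar) * mu0                    # ≈ 0 (data is destroyed)
v_pred = alpha_bar * sigma0**2 + (1 - alpha_bar)     # ≈ 1 (pure noise)
```

Note the composition is exact in distribution, so the only mismatch comes from Monte Carlo noise.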

随之 reverse SDE 为:

$$d\mathbf{x} = (\sqrt{1 - \beta_t}\mathbf{x} - \mathbf{x} - \beta_t \mathbf{s}_\theta(\mathbf{x}, t))dt + \sqrt{\beta_t}d\mathbf{\bar{w}}$$

离散化后为:

$$\mathbf{x}_{t-1} = (2 - \sqrt{1 - \beta_t})\mathbf{x}_t + \beta_t \mathbf{s}_\theta(\mathbf{x}_t, t) + \sqrt{\beta_t}\mathbf{z}_t$$

注意,这个公式和 DDPM 原版的公式不一样,但是:

  1. 在 $\beta_t \ll 1$ 的时候二者是等价的(每步差异是 $\beta_t$ 的高阶小量)
  2. 两种 dynamics 的边缘分布都逼近同一个 $p_t(\mathbf{x})$

至此,对于离散化的理解和总结就差不多了。当然离散化的方法非常多,性质也不同,这会引入非常多的 sampler,在此不展开。
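To see the reverse update actually transport noise back to data, here is a hedged NumPy sketch on a 1-D Gaussian toy problem, where the true score is available in closed form so no network is needed; the schedule, data distribution, and step count are all illustrative, and the result is only approximate since the discretization error vanishes only as $\beta_t \to 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 4000, 50_000
betas = np.linspace(2.5e-5, 0.005, T)        # small betas: discretization error shrinks
alpha_bars = np.cumprod(1 - betas)           # alpha_bars[t] = \bar{alpha}_{t+1}

mu0, sigma0 = 3.0, 0.5                       # toy data distribution N(3, 0.25)

def score(x, t):
    # analytic ∇ log p_t for Gaussian data: p_t = N(sqrt(ab) mu0, ab sigma0^2 + 1 - ab)
    ab = alpha_bars[t]
    return -(x - np.sqrt(ab) * mu0) / (ab * sigma0**2 + 1 - ab)

# start from the exact terminal marginal (≈ N(0, 1)), then apply the reverse update
#   x_{t-1} = (2 - sqrt(1 - b_t)) x_t + b_t * score + sqrt(b_t) z_t
ab_T = alpha_bars[-1]
x = np.sqrt(ab_T) * mu0 + np.sqrt(ab_T * sigma0**2 + 1 - ab_T) * rng.standard_normal(n)
for t in range(T - 1, -1, -1):
    b = betas[t]
    x = (2 - np.sqrt(1 - b)) * x + b * score(x, t) + np.sqrt(b) * rng.standard_normal(n)

# x should now be approximately distributed as the data N(3, 0.25)
```

Swapping in the original DDPM update gives nearly the same endpoint, consistent with the two dynamics being equivalent for small $\beta_t$.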

ODE 框架

随机变量 ODE 一般形式

我们并不是要研究所有的、通用的 ODE。在当前的语境下,我们要研究的是在 SDE 中满足 $\mathbf{G}(\mathbf{x}, t) = \mathbf{0}$ 的方程:

$$d\mathbf{x} = \mathbf{v}(\mathbf{x}, t)dt$$

在我们的语境里面,它本质上还是在建模随机变量随时间的变化,相应地我们还是可以对它做这些事情:

  • 推导 Fokker-Planck 方程,此时它退化为如下的 Liouville 方程(即连续性方程):

    $$\frac{\partial p}{\partial t} = -\nabla \cdot [\mathbf{v}(\mathbf{x}, t)p(\mathbf{x}, t)]$$

  • 计算边缘分布(通过 Liouville 方程即可)

只不过对于 ODE 来说,没有正反方程之分了,因为轨迹是确定的。

我们在哪些语境下会见到 ODE?

Flow matching

众所周知 flow matching 的 forward 过程是个 ODE:

$$\frac{d\mathbf{x}}{dt} = \mathbf{v}(\mathbf{x}, t)$$

因为在整个 forward 的过程中,没有布朗运动的参与,只有漂移,没有扩散。非常典型的 ODE Forward。
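As a concrete toy instance (a hedged NumPy sketch; the straight-line interpolation and Gaussian endpoints are illustrative, not from the text): for $\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$ with independent Gaussian source and target, the marginal velocity $\mathbf{v}(x,t) = \mathbb{E}[\mathbf{x}_1 - \mathbf{x}_0 \mid \mathbf{x}_t = x]$ is linear in $x$ and can be written down directly; Euler-integrating this ODE from source samples should land on the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5            # "data" N(2, 0.25); source is N(0, 1)
n, n_steps = 50_000, 2000
dt = 1.0 / n_steps

# straight-line interpolation x_t = (1 - t) x0 + t x1, x0 ~ N(0,1) independent of x1;
# for this Gaussian pair, the marginal velocity E[x1 - x0 | x_t = x] is linear in x
def v(x, t):
    var_t = (1 - t) ** 2 + t**2 * sigma**2       # Var(x_t)
    cov = t * sigma**2 - (1 - t)                 # Cov(x1 - x0, x_t)
    return mu + cov * (x - t * mu) / var_t

x = rng.standard_normal(n)                       # source samples at t = 0
for i in range(n_steps):
    x += v(x, i * dt) * dt                       # plain Euler: an ODE has no dw term

# x should now be (approximately) distributed as the target N(2, 0.25)
```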

SDE 转 ODE(概率流 ODE)

假如是 SDE,形如 $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$,其实我们也可以将其转化为一个 ODE 来进行 forward,称为其概率流 ODE(Probability Flow ODE):

$$d\mathbf{x} = \{\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \frac{1}{2}\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla \log p_t(\mathbf{x})\}dt$$

首先这个公式没有布朗运动,其次它的边缘密度和原 SDE 是一模一样的。证明如下(利用 Liouville 公式):

证明:

记概率流 ODE 的边缘密度为 $q(\mathbf{x}, t)$,此外我们记 $\mathbf{D} = \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T$。只需验证原 SDE 的边缘密度 $p(\mathbf{x}, t)$ 本身满足概率流 ODE 的 Liouville 方程,再结合相同的初始条件 $q(\cdot, 0) = p(\cdot, 0)$,即得 $q \equiv p$。根据 Liouville 公式(右端用 $p$ 代入检验):

$$\begin{aligned} \frac{\partial q}{\partial t} &= -\nabla \cdot \left[ [\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}\nabla \cdot \mathbf{D} - \frac{1}{2}\mathbf{D}\nabla \log p_t(\mathbf{x})] p(\mathbf{x}, t) \right] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, t)p(\mathbf{x}, t)] + \frac{1}{2}\nabla \cdot [(\nabla \cdot \mathbf{D})p] + \frac{1}{2}\nabla \cdot (\mathbf{D}\nabla p(\mathbf{x}, t)) \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, t)p(\mathbf{x}, t)] + \frac{1}{2}\nabla \cdot (\nabla \cdot (\mathbf{D}p)) \\ &= \frac{\partial p}{\partial t} \end{aligned}$$

证毕。

注意:假如我们将 $\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \frac{1}{2}\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla \log p_t(\mathbf{x})$ 就看做 $\mathbf{v}(\mathbf{x}, t)$,那其实就是变成了 flow matching 的 forward,没有区别。
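The marginal-preserving property of the probability flow ODE can be checked numerically on a simple OU example (a minimal NumPy sketch; the process and constants are illustrative), where the score is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
m0, v0, T = 2.0, 0.25, 1.0
n, n_steps = 50_000, 2000
dt = T / n_steps

# OU forward SDE: dx = -x dt + sqrt(2) dw, with marginals N(m(t), v(t)):
m = lambda t: m0 * np.exp(-t)
v = lambda t: 1 + (v0 - 1) * np.exp(-2 * t)

# probability flow ODE: dx/dt = f - (1/2) G G^T ∇ log p = -x + (x - m(t)) / v(t)
# (the ∇·(G G^T) term is 0 here because G is constant)
x = m0 + np.sqrt(v0) * rng.standard_normal(n)
for i in range(n_steps):
    t = i * dt
    x += (-x + (x - m(t)) / v(t)) * dt           # deterministic update, no noise

# despite being noise-free, the marginals track the SDE's N(m(t), v(t))
```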

ODE 转 SDE

还有一点需要注意:ODE 甚至也可以转成 SDE,同时保持边缘密度不变。

给方程 $\frac{d\mathbf{x}}{dt} = \mathbf{v}(\mathbf{x}, t)$ 配上一个扩散项 $\mathbf{G}(\mathbf{x}, t)$,随之给出相应的 SDE:

$$d\mathbf{x} = \{\mathbf{v}(\mathbf{x}, t) + \frac{1}{2}\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) + \frac{1}{2}\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla \log p_t(\mathbf{x})\}dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$$

证明一句话就说完了:$\frac{d\mathbf{x}}{dt} = \mathbf{v}(\mathbf{x}, t)$ 其实就是上述方程的概率流 ODE,所以自然就有相同的边缘分布。

ODE 采样

当用 ODE forward 时,自然可用 ODE 本身做 sampling,无任何障碍,只需做一下 Euler 离散化(即 Euler-Maruyama 在 $\mathbf{G} = \mathbf{0}$ 时的退化形式)就行了。

值得注意的是,当用 SDE forward,我们甚至也可以用 ODE sampling,只要改用对应的概率流 ODE 就行了。

比如 DDIM 就是一个例子,用 SDE 训练,但是用 ODE 做 sampling。

ODE 实战案例

简洁推导 Flow-GRPO

对于 Flow-GRPO 来说,采样想加入随机性,就必须得将 reverse 过程改造成 SDE。

参考原文:

Substituting Eq. 15 to Eq. 14, we arrive at the drift coefficients of the target forward SDE:

$$f_{\text{SDE}} = \boldsymbol{v}_t(\mathbf{x}_t, t) + \frac{\sigma_t^2}{2} \nabla \log p_t(\mathbf{x})$$

Hence, we can rewrite the forward SDE in Eq. 11 as:

$$d\mathbf{x}_t = \left( \boldsymbol{v}_t(\mathbf{x}_t) + \frac{\sigma_t^2}{2} \nabla \log p_t(\mathbf{x}_t) \right) dt + \sigma_t d\mathbf{w}_t$$

非常简单可以看出这就是上面「ODE 转 SDE」结论的简化形式:这里 $\mathbf{G}(\mathbf{x}, t) = \sigma_t \mathbf{I}$ 与 $\mathbf{x}$ 无关,因此 $\nabla \cdot (\mathbf{G}\mathbf{G}^T) = \mathbf{0}$,漂移项里就只剩下 $\frac{\sigma_t^2}{2}\nabla \log p_t(\mathbf{x})$ 这一项修正。

以这样的视角来看,很快就理解了。

结语

经过这些梳理之后,我们发现:上述知识是理解生成模型的钥匙;缺了这些,就很难真正读懂相关的理论和论文。

引用这篇文章 Cite this article

如果您觉得这篇文章对您有帮助,欢迎引用: If you find this article helpful, please consider citing it:

@article{feng2025sde,
  title   = {Summary of Basic SDE/ODE Math Derivations in Diffusion Models},
  author  = {Feng, Wanquan},
  year    = {2025},
  url     = {https://wanquanf.github.io/blog/posts/20250912_sde_derivation.html}
}

It has already been several years since I started working on generative models. Recently, out of necessity, I put together a summary of the fundamental mathematical derivations for SDEs and ODEs used in diffusion models. Although diffusion models have been around for a while, they remain highly popular, with newcomers constantly joining the field. I hope this summary proves helpful to those who need it.

SDE as a University Course

Which Course Does It Belong To?

SDE stands for Stochastic Differential Equations, which is typically taught as a course under the same name.

What Are the Prerequisite Courses?

The prerequisites include:

  • Foundational Courses: Mathematical Analysis (Calculus), Linear Algebra
  • Probability: Probability Theory, Stochastic Processes
  • Differential Equations: Ordinary Differential Equations (ODEs), Partial Differential Equations (PDEs)
  • Physics (for intuition): Non-equilibrium Statistical Mechanics. Mastering all of these is undoubtedly challenging, but fortunately, we can learn and reference them as needed.

Classic SDE Textbooks & Their Coverage

Evans's An Introduction to Stochastic Differential Equations (PDF link) is excellent. It thoroughly explains SDEs while greatly minimizing the reliance on prerequisite knowledge. For industry practitioners using SDEs, it serves as an authoritative reference. The contents primarily cover:

  • The definition, properties, and construction of Brownian motion
  • Stochastic integration and Ito's formula
  • The definition of stochastic differential equations, as well as the existence and uniqueness of their solutions. These classic textbooks cover the self-consistent foundational theories, but they generally do not delve into the application of SDEs in specific fields or their more advanced properties.

Are Textbooks Enough? What Else Do We Need?

Obviously, textbooks alone are not enough. Knowing "what an SDE is" is merely the beginning; it doesn't answer "how to apply SDEs to generative models." Chronologically, Brownian motion was derived by Einstein in 1905 and formalized into a mathematical theory by Wiener in 1920. Kiyoshi Ito, the most prominent contributor to SDE theory, passed away in 2008. Meanwhile, the concept of diffusion models was proposed in 2015 and widely popularized around 2020. Therefore, beyond classic textbooks, much modern knowledge must be supplemented. Here are a few examples:

  • Langevin Dynamics (1908)
  • Fokker-Planck Equation (1917)
  • Reverse Ito SDE (1982)
  • Score Matching (2005)
  • ... And, of course, there is a wealth of modern knowledge, such as DDPM, DDIM, and Flow Matching, which belong entirely to the current era of generative AI.

Let's unfold these concepts step by step.

Forward SDE (Forward Modeling)

Generally, the equation we study takes the following form:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x},t)dt + \mathbf{G}(\mathbf{x},t)d\mathbf{w}$$

where $\mathbf{w}$ is the Wiener process, $\mathbf{f}: \mathbb{R}^N \times [0, T] \to \mathbb{R}^N$, and $\mathbf{G}: \mathbb{R}^N \times [0, T] \to \mathbb{R}^{N \times N}$. Note that $\mathbf{x}$ also varies with $t$, forming a continuous sequence of random variables $\mathbf{x}(t)$, often denoted as $\mathbf{x}_t$. For convenience, we will simply write it as $\mathbf{x}$. Typically, at the endpoints of the interval $t \in [0, T]$, $\mathbf{x}_0 \sim p_0(\mathbf{x}) = p_{data}(\mathbf{x})$ and $\mathbf{x}_T \sim p_T(\mathbf{x})$ represent the data distribution and the pure noise distribution, respectively. (Some papers, such as those on Flow Matching, use different notations, but we will standardize the notation throughout this document.)

The rationale for focusing on this specific form is twofold:

  1. The definition of an SDE on Page 77 of Evans's book uses this form.
  2. The definition of an SDE in Eq. 3.3 of the classic paper Reverse SDE (1982) also adopts this form.

Therefore, although popular theories on diffusion models often adopt simpler forms, such as $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + g(t)d\mathbf{w}$, or even $d\mathbf{x} = \mathbf{v}(\mathbf{x}, t)dt$, we will study the more general form. While it might seem more complicated at first glance, it actually simplifies the process by allowing us to easily borrow essential techniques from classic mathematical derivations.

Marginal Distribution: Fokker-Planck Equation

The stochastic differential equation $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$ describes the evolution of the random variable $\mathbf{x}$ over time $t$. We first introduce a classic equation known as the Fokker-Planck equation. Suppose the probability density function of $\mathbf{x}(t)$ is $p_t(\mathbf{x})$, typically referred to as the marginal density. The evolution of this marginal density $p_t(\mathbf{x})$, as induced by the aforementioned SDE, is governed by:

$$\frac{\partial p}{\partial t} = -\nabla \cdot [\mathbf{f}(\mathbf{x}, t)p(\mathbf{x}, t)] + \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T p(\mathbf{x}, t))]$$

This is known as the Fokker-Planck equation.

The significance of this formulation lies in transforming the temporal evolution of the random variable $\mathbf{x}_t$ into the temporal evolution of its marginal density $p_t$. Why is this transformation necessary? In practice, sampling operates on random variables (i.e., on realizations), whereas the underlying mathematical principles are largely stated in terms of probability density functions. Thus, this bridge is essential.

Proof:

Let us choose an arbitrary scalar test function $\phi(\mathbf{x}): \mathbb{R}^N \to \mathbb{R}$ that does not explicitly depend on $t$, and satisfies the conditions of being smooth and having compact support.

Focusing on the function $\phi(\mathbf{x})$, we will calculate the combination of its expectation $\mathbb{E}$ and time derivative $\frac{\partial}{\partial t}$ in two different ways.

(1) First take the expectation, then compute the derivative

Its expectation is given by: $\mathbb{E}[\phi(\mathbf{x}(t))] = \int_{\mathbb{R}^N} \phi(\mathbf{x})p(\mathbf{x}, t)d\mathbf{x}$

Taking the derivative yields: $\frac{d}{dt}\mathbb{E}[\phi(\mathbf{x}(t))] = \int_{\mathbb{R}^N} \phi(\mathbf{x})\frac{\partial p(\mathbf{x}, t)}{\partial t}d\mathbf{x}$

(2) First compute the derivative, then take the expectation

To calculate $d\phi$, we need to apply Ito's formula (refer to the equation in the middle of Page 72 in Evans's book).

We have: $d\phi = \frac{\partial \phi}{\partial t}dt + (\nabla \phi)^T d\mathbf{x} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right)dt$

Since $\phi(\mathbf{x})$ does not explicitly contain $t$, the first term is 0, which leads to:

$$d\phi = (\nabla \phi)^T d\mathbf{x} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right)dt$$

Substituting the Forward SDE $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$ into the equation, we get:

$$\begin{aligned} d\phi &= (\nabla \phi)^T (\mathbf{f}dt + \mathbf{G}d\mathbf{w}) + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right)dt \\ &= \left( (\nabla \phi)^T\mathbf{f} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \right)dt + (\nabla \phi)^T\mathbf{G}d\mathbf{w} \end{aligned}$$

$$\mathbb{E}[d\phi] = \mathbb{E}\left[ \left( (\nabla \phi)^T\mathbf{f} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \right)dt \right] + \mathbb{E}[(\nabla \phi)^T\mathbf{G}d\mathbf{w}]$$

Because the Ito integral is a martingale (Brownian increments have zero mean and are independent of the past), the second term evaluates to 0, hence:

$$\mathbb{E}[d\phi] = \mathbb{E}\left[ \left( (\nabla \phi)^T\mathbf{f} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \right)dt \right]$$

Therefore,

$$\begin{aligned} \mathbb{E}\left[ \frac{d\phi}{dt} \right] &= \mathbb{E}\left[ (\nabla \phi)^T\mathbf{f} + \frac{1}{2}\text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \right] \\ &= \int_{\mathbb{R}^N} (\nabla \phi)^T\mathbf{f} \cdot p(\mathbf{x}, t)d\mathbf{x} + \frac{1}{2}\int_{\mathbb{R}^N} \text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \cdot p(\mathbf{x}, t)d\mathbf{x} \end{aligned}$$

(3) Combining the results from the previous two steps, we have:

$$\int_{\mathbb{R}^N} \phi(\mathbf{x})\frac{\partial p(\mathbf{x}, t)}{\partial t}d\mathbf{x} = \int_{\mathbb{R}^N} (\nabla \phi)^T\mathbf{f} \cdot p(\mathbf{x}, t)d\mathbf{x} + \frac{1}{2}\int_{\mathbb{R}^N} \text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \cdot p(\mathbf{x}, t)d\mathbf{x}$$

We prefer not to have derivatives of $\phi$ in the integrand. Thus, we use integration by parts to transfer the derivatives onto $p$ (the compact support of $\phi$ makes the boundary terms vanish):

$$\int_{\mathbb{R}^N} (\nabla \phi)^T\mathbf{f} \cdot p(\mathbf{x}, t)d\mathbf{x} = - \int_{\mathbb{R}^N} \phi(\mathbf{x})(\nabla \cdot (p\mathbf{f}))d\mathbf{x}$$

$$\int_{\mathbb{R}^N} \text{Tr}\left(\mathbf{G}\mathbf{G}^T \nabla\nabla^T \phi\right) \cdot p(\mathbf{x}, t)d\mathbf{x} = \int_{\mathbb{R}^N} \phi(\mathbf{x})\nabla \cdot (\nabla \cdot (p\mathbf{G}\mathbf{G}^T))d\mathbf{x}$$

So we obtain:

$$\begin{aligned} \int_{\mathbb{R}^N} \phi(\mathbf{x})\frac{\partial p(\mathbf{x}, t)}{\partial t}d\mathbf{x} &= - \int_{\mathbb{R}^N} \phi(\mathbf{x})(\nabla \cdot (p\mathbf{f}))d\mathbf{x} + \frac{1}{2}\int_{\mathbb{R}^N} \phi(\mathbf{x})\nabla \cdot (\nabla \cdot (p\mathbf{G}\mathbf{G}^T))d\mathbf{x} \\ &= \int_{\mathbb{R}^N} \phi(\mathbf{x}) \left\{ -\nabla \cdot (p\mathbf{f}) + \frac{1}{2}\nabla \cdot (\nabla \cdot (p\mathbf{G}\mathbf{G}^T)) \right\} d\mathbf{x} \end{aligned}$$

(4) Given that the above equation holds for any test function $\phi(\mathbf{x}): \mathbb{R}^N \to \mathbb{R}$ (which does not explicitly contain $t$, is smooth, and has compact support), the integrands themselves must be equal, so we can remove the integral signs and the function $\phi(\mathbf{x})$. This yields the desired Fokker-Planck equation:

$$\frac{\partial p(\mathbf{x}, t)}{\partial t} = -\nabla \cdot (p\mathbf{f}) + \frac{1}{2}\nabla \cdot (\nabla \cdot (p\mathbf{G}\mathbf{G}^T))$$

Q.E.D.
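To make the statement concrete, here is a minimal numerical sanity check of the Fokker-Planck equation on a 1-D Ornstein-Uhlenbeck process (the choice $\mathbf{f}(x) = -x$, $\mathbf{G} = \sqrt{2}$, and the step sizes are illustrative assumptions, not part of the proof): we simulate the SDE with Euler-Maruyama and compare the empirical variance against the Gaussian solution of the corresponding Fokker-Planck equation.

```python
import numpy as np

# Sanity check of the Fokker-Planck equation on a 1-D Ornstein-Uhlenbeck
# process: dx = -x dt + sqrt(2) dw, i.e. f(x) = -x and G = sqrt(2).
# The Fokker-Planck equation dp/dt = d/dx(x p) + d^2 p / dx^2 has a Gaussian
# solution whose variance evolves as v(t) = 1 + (v0 - 1) e^{-2t}.
rng = np.random.default_rng(0)
n_paths, dt, T = 50_000, 1e-3, 1.0
x = np.zeros(n_paths)  # start at x0 = 0, so v0 = 0

for _ in range(int(T / dt)):
    # Euler-Maruyama step: dx = f dt + G sqrt(dt) z
    x += -x * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n_paths)

v_empirical = x.var()
v_fokker_planck = 1.0 - np.exp(-2.0 * T)  # Gaussian FP solution at t = T
print(v_empirical, v_fokker_planck)
```

The two variances agree up to Monte Carlo and discretization error, which is exactly the "random variable evolution vs. density evolution" correspondence described above.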

Reverse SDE (Reverse Modeling)

Transforming the stochastic differential equation $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$ into its so-called reverse form is one of the most critical tasks in diffusion models.

The logic proceeds as follows:

The reverse equation of the aforementioned forward equation is:

$$d\mathbf{x} = \left[ \mathbf{f}(\mathbf{x}, t) - \nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] dt + \mathbf{G}(\mathbf{x}, t) d\mathbf{\bar{w}}$$

The time domain remains $t \in [0, T]$. Note that the reverse SDE inherently requires the term $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, which means it relies on prior knowledge of the data distribution. This dictates that we must use a neural network to predict $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$; otherwise, the reverse process would be impossible.

An important point to emphasize: the equation that achieves an exact reverse process is not unique. As we will see later, there are infinitely many equations capable of carrying the noise distribution back to the data distribution. The equation presented above is simply the most classic and commonly used one.

Here, we provide a proof that the above equation indeed constitutes a valid reverse SDE (relying on the Fokker-Planck equation):

Proof:

This verification process is somewhat convoluted as it involves time reversal. Specifically, let $s = T - t$.

In the aforementioned equation with respect to $t$:

$$d\mathbf{x}_t = \left[ \mathbf{f}(\mathbf{x}, t) - \nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] dt + \mathbf{G}(\mathbf{x}, t) d\mathbf{\bar{w}}$$

We define

$$\mathbf{m}(\mathbf{x}, t) = -\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$

Thus, the equation simplifies to:

$$d\mathbf{x}_t = \left[ \mathbf{f}(\mathbf{x}, t) + \mathbf{m}(\mathbf{x}, t) \right] dt + \mathbf{G}(\mathbf{x}, t) d\mathbf{\bar{w}}$$

Next, we transform this into an equation with respect to $s$ (In the original equation, substitute: $t \to T - s, dt \to -ds$. Be careful not to omit the negative sign of $-ds$):

$$\begin{aligned} d\mathbf{x}_s &= \left[ \mathbf{f}(\mathbf{x}, T - s) + \mathbf{m}(\mathbf{x}, T - s) \right] (-ds) + \mathbf{G}(\mathbf{x}, T - s) d\mathbf{\bar{w}} \\ &= \left[ -\mathbf{f}(\mathbf{x}, T - s) - \mathbf{m}(\mathbf{x}, T - s) \right] ds + \mathbf{G}(\mathbf{x}, T - s) d\mathbf{\bar{w}} \end{aligned}$$

For convenience, we denote this as:

$$d\mathbf{x} = \mathbf{h}(\mathbf{x}, s)ds + \mathbf{K}(\mathbf{x}, s) d\mathbf{\bar{w}}$$

Where

$$\mathbf{h}(\mathbf{x}, s) = -\mathbf{f}(\mathbf{x}, T - s) - \mathbf{m}(\mathbf{x}, T - s)$$

$$\mathbf{K}(\mathbf{x}, s) = \mathbf{G}(\mathbf{x}, T - s)$$

Additionally, let the marginal distribution of the reverse SDE be $q(\mathbf{x}, s)$.

We will verify the following relationship, ultimately proving that the aforementioned $\mathbf{m}(\mathbf{x}, t)$ makes it hold. For all $\lambda \in [0, T]$:

$$q(\mathbf{x}, \lambda) = p(\mathbf{x}, T - \lambda)$$

$$\left. \frac{\partial p}{\partial t} \right|_{t=\lambda} = - \left. \frac{\partial q}{\partial s} \right|_{s=T-\lambda}$$

First, we expand the partial derivative terms on both sides of the equation.

1. Left side: According to the Fokker-Planck equation, and letting $\mathbf{D} = \mathbf{G}(\mathbf{x}, \lambda)\mathbf{G}(\mathbf{x}, \lambda)^T$:

$$\begin{aligned} \left. \frac{\partial p}{\partial t} \right|_{t=\lambda} &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] \\ &\quad + \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{G}(\mathbf{x}, \lambda)\mathbf{G}(\mathbf{x}, \lambda)^T p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] + \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \end{aligned}$$

2. Right side: Similarly, using the Fokker-Planck equation:

$$\begin{aligned} - \left. \frac{\partial q}{\partial s} \right|_{s=T-\lambda} &= \nabla \cdot [\mathbf{h}(\mathbf{x}, T - \lambda)q(\mathbf{x}, T - \lambda)] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{K}(\mathbf{x}, T - \lambda)\mathbf{K}(\mathbf{x}, T - \lambda)^T q(\mathbf{x}, T - \lambda))] \\ &= \nabla \cdot [\mathbf{h}(\mathbf{x}, T - \lambda)q(\mathbf{x}, T - \lambda)] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{G}(\mathbf{x}, \lambda)\mathbf{G}(\mathbf{x}, \lambda)^T q(\mathbf{x}, T - \lambda))] \\ &= \nabla \cdot [\mathbf{h}(\mathbf{x}, T - \lambda)p(\mathbf{x}, \lambda)] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{G}(\mathbf{x}, \lambda)\mathbf{G}(\mathbf{x}, \lambda)^T p(\mathbf{x}, \lambda))] \\ &= \nabla \cdot [\mathbf{h}(\mathbf{x}, T - \lambda)p(\mathbf{x}, \lambda)] - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= \nabla \cdot [[-\mathbf{f}(\mathbf{x}, \lambda) - \mathbf{m}(\mathbf{x}, \lambda)]p(\mathbf{x}, \lambda)] - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \end{aligned}$$

3. Verification: We check if $\left. \frac{\partial p}{\partial t} \right|_{t=\lambda} = - \left. \frac{\partial q}{\partial s} \right|_{s=T-\lambda}$ holds true when $\mathbf{m}(\mathbf{x}, t) = -\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$. Substituting this in:

$$\begin{aligned} - \left. \frac{\partial q}{\partial s} \right|_{s=T-\lambda} &= \nabla \cdot [[-\mathbf{f}(\mathbf{x}, \lambda) - \mathbf{m}(\mathbf{x}, \lambda)]p(\mathbf{x}, \lambda)] - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] - \nabla \cdot [\mathbf{m}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] - \nabla \cdot \left[ \left( -\nabla \cdot \mathbf{D} - \mathbf{D}\frac{\nabla p}{p} \right) p(\mathbf{x}, \lambda) \right] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] - \nabla \cdot \left[ -(\nabla \cdot \mathbf{D})p - \mathbf{D}(\nabla p) \right] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] + \nabla \cdot \left[ (\nabla \cdot \mathbf{D})p + \mathbf{D}(\nabla p) \right] \\ &\quad - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] + \nabla \cdot (\nabla \cdot (\mathbf{D}p)) - \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, \lambda)p(\mathbf{x}, \lambda)] + \frac{1}{2}\nabla \cdot [\nabla \cdot (\mathbf{D}p(\mathbf{x}, \lambda))] \\ &= \left. \frac{\partial p}{\partial t} \right|_{t=\lambda} \end{aligned}$$

We can see that the condition is indeed satisfied: both densities obey the same Fokker-Planck dynamics. Combined with the matching boundary condition $q(\mathbf{x}, 0) = p(\mathbf{x}, T)$, it follows that $q(\mathbf{x}, \lambda) = p(\mathbf{x}, T - \lambda)$.

Q.E.D.
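As a concrete check of the reverse SDE, the following sketch simulates it for a 1-D case where the score is available in closed form (the Gaussian data distribution, the constant coefficients, and the step size are all illustrative assumptions, not part of the derivation above):

```python
import numpy as np

# Forward SDE: dx = -0.5 x dt + dw  (f = -x/2, G = 1), data x0 ~ N(2, 0.25).
# The forward marginal stays Gaussian:
#   mean(t) = 2 e^{-t/2},  var(t) = 0.25 e^{-t} + (1 - e^{-t}),
# so the exact score is s(x, t) = -(x - mean(t)) / var(t).
rng = np.random.default_rng(1)
T, dt, n = 5.0, 0.005, 50_000

def mean_var(t):
    return 2.0 * np.exp(-0.5 * t), 0.25 * np.exp(-t) + (1.0 - np.exp(-t))

# Start the reverse process from the exact terminal marginal p_T.
m_T, v_T = mean_var(T)
x = m_T + np.sqrt(v_T) * rng.standard_normal(n)

# Reverse SDE in the s = T - t variable (G is constant, so div(GG^T) = 0):
#   dx = [-f(x, T - s) + G G^T score(x, T - s)] ds + G dw
for i in range(int(T / dt)):
    t = T - i * dt
    m_t, v_t = mean_var(t)
    score = -(x - m_t) / v_t
    x += (0.5 * x + score) * dt + np.sqrt(dt) * rng.standard_normal(n)

print(x.mean(), x.var())  # should recover the data distribution N(2, 0.25)
```

Running the reverse dynamics with the exact score carries samples from $p_T$ back to the data distribution, which is precisely what the theorem guarantees.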

Training the Network to Output Accurate Scores (Loss Derivation)

In the Reverse SDE above, we encounter the term $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$. A crucial issue arises here: during actual reverse sampling, this term is unknown! This leads to an awkward circular dependency:

Without knowing $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, we cannot perform reverse sampling.
Without performing reverse sampling, we still don't know $p_t(\mathbf{x})$, and consequently, we don't know $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$.
Still not knowing $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, we still cannot perform reverse sampling.
... ... An infinite loop.

Therefore, we must use a neural network to estimate $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$; otherwise, the reverse process is absolutely impossible to complete.

Choosing the Prediction Target

However, why don't we predict $p_t(\mathbf{x})$ directly, but instead take a detour to estimate $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$? The reasons are as follows:

  1. $p_t(\mathbf{x})$ must satisfy a "strong condition", namely $\int_{\mathbf{x}} p_t(\mathbf{x})d\mathbf{x} \equiv 1$. A neural network cannot satisfy this constraint by construction: it is a global condition over all of $\mathbb{R}^N$, and enforcing it would require an intractable normalization integral.
  2. Even taking the logarithm, denoting $r_t(\mathbf{x}) = \log p_t(\mathbf{x})$, doesn't work. After all, $\int_{\mathbf{x}} \exp\{r_t(\mathbf{x})\}d\mathbf{x} \equiv 1$ is still a very strong constraint.
  3. But $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ does not have this problem. It does not need to satisfy any global normalization conditions. Thus, the network can predict it freely without violating any constraints.

Definition of the score function:

$$s(\mathbf{x}, t) = \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$

Our goal is now very clear: train a network to predict a term $s_\theta(\mathbf{x}, t)$ such that $s_\theta(\mathbf{x}, t) \approx s(\mathbf{x}, t)$.
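To build intuition for the score function, here is a small sketch on a fixed 1-D Gaussian mixture used as a stand-in for some $p_t$ (the mixture itself is an illustrative assumption): the analytic score is a posterior-weighted average of the component scores, and it matches a finite-difference derivative of $\log p$.

```python
import numpy as np

# Score of a 1-D Gaussian mixture p(x) = 0.5 N(-2, 1) + 0.5 N(2, 1):
# s(x) = p'(x)/p(x) is a density-weighted average of the component scores.
mu, sigma, w = np.array([-2.0, 2.0]), 1.0, np.array([0.5, 0.5])

def density(x):
    comps = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(w * comps)

def score_analytic(x):
    # Normalizing constants cancel in the ratio (all components share sigma).
    comps = w * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return np.sum(comps * (mu - x) / sigma**2) / np.sum(comps)

# Compare against a central finite difference of log p.
x0, h = 0.5, 1e-5
score_fd = (np.log(density(x0 + h)) - np.log(density(x0 - h))) / (2 * h)
print(score_analytic(x0), score_fd)
```

Note that the score is computed without ever knowing the normalization of $p$, which is exactly why it is the convenient prediction target.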

What to do without Ground Truth?

Another massive problem is that there is actually no Ground Truth for $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$.

During training, we have data samples $\mathbf{x}_0$, but we do not have $p_0(\mathbf{x})$. So even if we add noise to $\mathbf{x}_0$ to get $\mathbf{x}_t$ according to known formulas, we still don't know $p_t(\mathbf{x})$.

But! Fortunately!

1. We know the conditional distribution $p_t(\mathbf{x}_t | \mathbf{x}_0)$

Typically, $p_t(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; a\mathbf{x}_0, b\mathbf{I})$, corresponding to the noise-injection process during training. Notice that $\nabla_{\mathbf{x}} \log \{p_t(\mathbf{x}_t | \mathbf{x}_0)\} = -\frac{\mathbf{x}_t - a\mathbf{x}_0}{b}$. Since we can write out this expression explicitly, we can confidently use it in the loss function.

2. Another important equation is:

$$\mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right] = \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right] + Const$$

where $Const$ denotes terms independent of $s_\theta(\mathbf{x}_t, t)$. We will provide the proof for this shortly below.

Through this reduction, we can confidently let the network estimate $\nabla_{\mathbf{x}} \log \{p_t(\mathbf{x}_t | \mathbf{x}_0)\}$:

$$\theta^* = \arg\min_\theta \mathbb{E}_t \left\{ \lambda(t) \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}} \log \{p_t(\mathbf{x}_t | \mathbf{x}_0)\} \|_2^2 \right] \right\}$$

where $\lambda(t) > 0$, which is usually not very critical; setting it arbitrarily will likely still allow the model to train successfully.

3. In fact, the loss of diffusion training frameworks is always equivalent to this form, including DDPM, SMLD, etc.

Now let's prove the equation mentioned above:

$$\mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right] = \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right] + Const$$

The proof process is very simple; it only requires expanding the integrals and calculating:

Proof:

We stipulate that any quantity independent of $s_\theta(\mathbf{x}_t, t)$ is uniformly denoted as $Const$. In addition, we need to use the marginal integral: $p_t(\mathbf{x}_t) = \int_{\mathbf{x}_0} p_0(\mathbf{x}_0)p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)d\mathbf{x}_0$

Starting from the left side, expand and calculate:

$$\begin{aligned} &\mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right] \\ &= \mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) \|_2^2 - 2 s_\theta(\mathbf{x}_t, t) \cdot \nabla \log p_t(\mathbf{x}_t) + \| \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right] \\ &= \int_{\mathbf{x}_t} \left( \| s_\theta(\mathbf{x}_t, t) \|_2^2 - 2 s_\theta(\mathbf{x}_t, t) \cdot \nabla \log p_t(\mathbf{x}_t) + \| \nabla \log p_t(\mathbf{x}_t) \|_2^2 \right) p_t(\mathbf{x}_t)d\mathbf{x}_t \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t + \int_{\mathbf{x}_t} \| \nabla \log p_t(\mathbf{x}_t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} [s_\theta(\mathbf{x}_t, t) \cdot \nabla \log p_t(\mathbf{x}_t)] p_t(\mathbf{x}_t)d\mathbf{x}_t \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} [s_\theta(\mathbf{x}_t, t) \cdot \nabla \log p_t(\mathbf{x}_t)] p_t(\mathbf{x}_t)d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} s_\theta(\mathbf{x}_t, t) \cdot \nabla p_t(\mathbf{x}_t)d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} s_\theta(\mathbf{x}_t, t) \cdot \nabla \left[ \int_{\mathbf{x}_0} p_0(\mathbf{x}_0)p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)d\mathbf{x}_0 \right] d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} s_\theta(\mathbf{x}_t, t) \cdot \left[ \int_{\mathbf{x}_0} p_0(\mathbf{x}_0)\nabla p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)d\mathbf{x}_0 \right] d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} s_\theta(\mathbf{x}_t, t) \cdot \left[ \int_{\mathbf{x}_0} 
p_0(\mathbf{x}_0)[\nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)]p_{t|0}(\mathbf{x}_t | \mathbf{x}_0)d\mathbf{x}_0 \right] d\mathbf{x}_t + Const \\ &= \int_{\mathbf{x}_t} \| s_\theta(\mathbf{x}_t, t) \|_2^2 p_t(\mathbf{x}_t)d\mathbf{x}_t - 2 \int_{\mathbf{x}_t} \int_{\mathbf{x}_0} p_0(\mathbf{x}_0)p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \cdot s_\theta(\mathbf{x}_t, t) d\mathbf{x}_0 d\mathbf{x}_t + Const \\ &= \mathbb{E}_{\mathbf{x}_t} \left[ \| s_\theta(\mathbf{x}_t, t) \|_2^2 \right] - 2 \cdot \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \cdot s_\theta(\mathbf{x}_t, t) \right] + Const \\ &= \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) \|_2^2 \right] - 2 \cdot \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \cdot s_\theta(\mathbf{x}_t, t) \right] + \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right] + Const \\ &= \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left\{ \| s_\theta(\mathbf{x}_t, t) \|_2^2 - 2 \cdot [\nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \cdot s_\theta(\mathbf{x}_t, t)] + \| \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right\} + Const \\ &= \mathbb{E}_{\mathbf{x}_0}\mathbb{E}_{\mathbf{x}_t|\mathbf{x}_0} \left[ \| s_\theta(\mathbf{x}_t, t) - \nabla \log p_{t|0}(\mathbf{x}_t | \mathbf{x}_0) \|_2^2 \right] + Const \end{aligned}$$

Q.E.D.
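The equivalence just proved can be exercised numerically. In the following sketch (the Gaussian data distribution, the noising coefficients $a = 0.8$, $b = 0.36$, and the linear model class are all illustrative assumptions), minimizing the denoising objective with a linear score model recovers the true marginal score exactly, because the true score of a Gaussian marginal happens to be linear:

```python
import numpy as np

# Denoising score matching on a toy problem where the minimizer is known.
# Data: x0 ~ N(2, 1).  Noising kernel: x_t = a x0 + sqrt(b) eps, a = 0.8,
# b = 0.36, eps ~ N(0, 1), so the marginal is x_t ~ N(1.6, 1.0) and the true
# marginal score is -(x_t - 1.6).  The DSM regression target is the
# conditional score grad log p(x_t | x0) = -(x_t - a x0) / b.
rng = np.random.default_rng(2)
a, b, n = 0.8, 0.36, 200_000

x0 = 2.0 + rng.standard_normal(n)
eps = rng.standard_normal(n)
xt = a * x0 + np.sqrt(b) * eps
target = -(xt - a * x0) / b  # conditional score, known in closed form

# Least-squares fit of a linear score model s(x) = c1 * x + c0 to the target.
A = np.stack([xt, np.ones(n)], axis=1)
(c1, c0), *_ = np.linalg.lstsq(A, target, rcond=None)
print(c1, c0)  # expected ≈ -1.0 and ≈ 1.6, matching -(x - 1.6)
```

Even though each individual target $-(\mathbf{x}_t - a\mathbf{x}_0)/b$ is noisy, the least-squares minimizer is the conditional expectation, which equals the marginal score, exactly as the proof above asserts.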

Tip: The Relationship Between Score and Epsilon in Common Derivations

Stating the conclusion directly: the score is proportional to $-\epsilon$. With the Gaussian conditional $p_t(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; a\mathbf{x}_0, b\mathbf{I})$ from above, writing $\mathbf{x}_t = a\mathbf{x}_0 + \sqrt{b}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ gives

$$\nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t | \mathbf{x}_0) = -\frac{\mathbf{x}_t - a\mathbf{x}_0}{b} = -\frac{\boldsymbol{\epsilon}}{\sqrt{b}}$$

This tip illustrates that the training objective of DDPM (predicting $\boldsymbol{\epsilon}$) is fundamentally equivalent to predicting the score function, up to a scale factor.

Discretization of Forward and Reverse Equations (Sampling)

SDE describes the continuous evolution of random variables over time. However, in practical applications, we cannot perform calculations in continuous time, so we need to discretize the SDE into a series of small time steps. The Euler-Maruyama method is the simplest and most commonly used SDE discretization method.

Before discretizing, it is important to note a fact:

$$d\mathbf{w} \sim \mathcal{N}(\mathbf{0}, |dt| \cdot \mathbf{I})$$

Therefore, we can consider:

$$d\mathbf{w} = \sqrt{|dt|} \cdot \mathbf{z} \text{, where } \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Euler-Maruyama Discretization of Forward SDE

Let $dt$ be $1$, and take two time points as $\{t, t+1\}$, then:

$$\begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t + \mathbf{f}(\mathbf{x}_t, t) \cdot 1 + \mathbf{G}(\mathbf{x}_t, t)\sqrt{1}\mathbf{z}_t \\ &= \mathbf{x}_t + \mathbf{f}(\mathbf{x}_t, t) + \mathbf{G}(\mathbf{x}_t, t)\mathbf{z}_t \end{aligned}$$

Euler-Maruyama Discretization of Reverse SDE

Let $dt$ be $-1$, and take two time points as $\{t, t-1\}$, then the Reverse SDE can be discretized as:

$$\begin{aligned} \mathbf{x}_{t-1} &= \mathbf{x}_t + \left[ \mathbf{f}(\mathbf{x}_t, t) - \nabla \cdot (\mathbf{G}(\mathbf{x}_t, t)\mathbf{G}(\mathbf{x}_t, t)^T) - \mathbf{G}(\mathbf{x}_t, t)\mathbf{G}(\mathbf{x}_t, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] \cdot (-1) + \mathbf{G}(\mathbf{x}_t, t)\sqrt{|-1|}\mathbf{z}_t \\ &= \mathbf{x}_t - \left[ \mathbf{f}(\mathbf{x}_t, t) - \nabla \cdot (\mathbf{G}(\mathbf{x}_t, t)\mathbf{G}(\mathbf{x}_t, t)^T) - \mathbf{G}(\mathbf{x}_t, t)\mathbf{G}(\mathbf{x}_t, t)^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] + \mathbf{G}(\mathbf{x}_t, t)\mathbf{z}_t \end{aligned}$$

Of course, there are many other discretization methods; as long as they are equivalent to the SDE in their limit form, they should all be meaningful.
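The Euler-Maruyama update above can be sketched as a generic step function (the helper name `euler_maruyama` and the toy drift/diffusion in the example are placeholders, not from any particular library):

```python
import numpy as np

# One generic Euler-Maruyama step for  dx = f(x, t) dt + G(x, t) dw,
# using  dw = sqrt(|dt|) * z  with z ~ N(0, I).
def euler_maruyama(x, t, dt, f, G, rng):
    z = rng.standard_normal(x.shape)
    return x + f(x, t) * dt + G(x, t) * np.sqrt(abs(dt)) * z

# Example: one forward step of dx = -x dt + dw from x = 1 with dt = 0.01.
rng = np.random.default_rng(3)
x = euler_maruyama(np.ones(10_000), 0.0, 0.01,
                   lambda x, t: -x, lambda x, t: 1.0, rng)
print(x.mean())  # mean moves from 1 toward 1 - 0.01 = 0.99
```

The same step with a negative `dt` (and the reverse drift) implements the reverse-SDE discretization shown above.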

Practical SDE Case Study

Concise Derivation of DDPM

Let $\mathbf{f}(\mathbf{x}, t) = (\sqrt{1 - \beta_t} - 1)\mathbf{x}$ and $\mathbf{G}(\mathbf{x}, t) = \sqrt{\beta_t}\,\mathbf{I}$

Then the forward SDE is:

$$d\mathbf{x} = (\sqrt{1 - \beta_t} - 1)\mathbf{x}dt + \sqrt{\beta_t}d\mathbf{w}$$

After discretization, it becomes:

$$\mathbf{x}_{t+1} = \sqrt{1 - \beta_t}\mathbf{x}_t + \sqrt{\beta_t}\mathbf{z}_t$$

Consequently, the reverse SDE is:

$$d\mathbf{x} = (\sqrt{1 - \beta_t}\mathbf{x} - \mathbf{x} - \beta_t \mathbf{s}_\theta(\mathbf{x}, t))dt + \sqrt{\beta_t}d\mathbf{\bar{w}}$$

After discretization, it becomes:

$$\mathbf{x}_{t-1} = (2 - \sqrt{1 - \beta_t})\mathbf{x}_t + \beta_t \mathbf{s}_\theta(\mathbf{x}_t, t) + \sqrt{\beta_t}\mathbf{z}_t$$

Note that this formula is different from the original DDPM formula, but:

  1. They are equivalent when $\beta_t \ll 1$.
  2. Both dynamics converge to $p_t(\mathbf{x})$.
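The discretized forward recursion above can be checked against the well-known closed-form marginal $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}$ with $\bar{\alpha}_t = \prod_s (1 - \beta_s)$ (the constant schedule $\beta = 0.02$ here is an illustrative assumption, not the DDPM schedule):

```python
import numpy as np

# DDPM forward step  x_{t+1} = sqrt(1 - beta) x_t + sqrt(beta) z,
# checked against the closed-form marginal: with x_0 = 0, after T steps the
# variance is  1 - alpha_bar_T  where  alpha_bar_T = prod_t (1 - beta_t).
rng = np.random.default_rng(4)
beta, T, n = 0.02, 100, 100_000

x = np.zeros(n)
for _ in range(T):
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)

alpha_bar = (1.0 - beta) ** T
print(x.var(), 1.0 - alpha_bar)  # empirical vs closed-form variance
```

Iterating the one-step recursion and applying the closed-form expression give the same marginal, which is why training can sample $\mathbf{x}_t$ in one shot rather than step by step.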

This concludes our understanding and summary of discretization. Naturally, there are countless discretization methods with different properties, which introduces a vast array of samplers, but we won't expand on that here.

ODE Framework

General Form of Random Variable ODE

We are not aiming to study all general ODEs. In our current context, we study equations within the SDE framework where $\mathbf{G}(\mathbf{x}, t) = \mathbf{0}$:

$$d\mathbf{x} = \mathbf{v}(\mathbf{x}, t)dt$$

In our context, this essentially still models the evolution of a random variable over time. Accordingly, we can still perform the following:

  • Derive the Fokker-Planck equation, which simplifies to the Liouville equation as follows:

    $$\frac{\partial p}{\partial t} = -\nabla \cdot [\mathbf{v}(\mathbf{x}, t)p(\mathbf{x}, t)]$$

  • Calculate the marginal distribution (via the Liouville equation).

Note that for ODEs there is no separate forward/reverse derivation: the trajectory is deterministic, so reversing time amounts to running the same equation with the sign of the velocity flipped.

In What Contexts Do We Encounter ODEs?

Flow matching

As is well known, the forward process of flow matching is an ODE:

$$\frac{d\mathbf{x}}{dt} = \mathbf{v}(\mathbf{x}, t)$$

Because throughout the entire forward process there is no Brownian motion involved; there is only drift and no diffusion. This is a very typical ODE forward process.

Converting SDE to ODE (Probability Flow ODE)

Suppose we have an SDE of the form $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$. We can actually convert it into an ODE for the forward process, known as its Probability Flow ODE:

$$d\mathbf{x} = \{\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \frac{1}{2}\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla \log p_t(\mathbf{x})\}dt$$

First, this formula has no Brownian motion; second, its marginal density is exactly the same as the original SDE's. The proof is as follows (using the Liouville equation):

Proof:

Let the marginal density of the probability flow ODE be $q(\mathbf{x}, t)$, and let $\mathbf{D} = \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T$. We verify that the SDE marginal $p(\mathbf{x}, t)$ also solves the Liouville equation of the probability flow ODE; since $q$ and $p$ share the same initial condition, this implies $q = p$. Substituting $p$ into the right-hand side of the Liouville equation:

$$\begin{aligned} -\nabla \cdot \left[ \left[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}\nabla \cdot \mathbf{D} - \frac{1}{2}\mathbf{D}\nabla \log p_t(\mathbf{x})\right] p(\mathbf{x}, t) \right] &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, t)p(\mathbf{x}, t)] + \frac{1}{2}\nabla \cdot [(\nabla \cdot \mathbf{D})p] + \frac{1}{2}\nabla \cdot (\mathbf{D}\nabla p(\mathbf{x}, t)) \\ &= -\nabla \cdot [\mathbf{f}(\mathbf{x}, t)p(\mathbf{x}, t)] + \frac{1}{2}\nabla \cdot (\nabla \cdot (\mathbf{D}p)) \\ &= \frac{\partial p}{\partial t} \end{aligned}$$

The last equality is exactly the Fokker-Planck equation of the original SDE, so $p$ solves the Liouville equation as claimed.

Q.E.D.
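A quick numerical illustration of this equivalence (the Ornstein-Uhlenbeck setup and step sizes are illustrative assumptions): integrating the probability flow ODE with the analytic score plugged in reproduces the same marginal variance as the SDE's Fokker-Planck solution, even though no noise is injected.

```python
import numpy as np

# SDE: dx = -x dt + sqrt(2) dw  (f = -x, GG^T = 2), with x0 ~ N(0, 4).
# Marginal variance: v(t) = 1 + 3 e^{-2t}; score s(x, t) = -x / v(t).
# Probability flow ODE: dx = [f - (1/2) GG^T s] dt = [-x + x / v(t)] dt.
rng = np.random.default_rng(5)
n, dt, T = 50_000, 1e-3, 1.0

def v(t):
    return 1.0 + 3.0 * np.exp(-2.0 * t)

x = 2.0 * rng.standard_normal(n)  # samples from N(0, 4)
for i in range(int(T / dt)):
    x += (-x + x / v(i * dt)) * dt  # deterministic Euler step, no noise

print(x.var(), v(T))  # ODE marginal variance vs SDE marginal variance
```

Each trajectory is deterministic, yet the ensemble of trajectories transports the initial density along exactly the same family of marginals as the noisy SDE.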

Note: If we treat $\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) - \frac{1}{2}\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla \log p_t(\mathbf{x})$ as $\mathbf{v}(\mathbf{x}, t)$, it essentially becomes the forward process of flow matching without any difference.

Converting ODE to SDE

Another point to note is that an ODE can even be converted into an SDE while maintaining the same marginal density.

By adding a diffusion term $\mathbf{G}(\mathbf{x}, t)$ to the equation $\frac{d\mathbf{x}}{dt} = \mathbf{v}(\mathbf{x}, t)$, we can derive the corresponding SDE:

$$d\mathbf{x} = \{\mathbf{v}(\mathbf{x}, t) + \frac{1}{2}\nabla \cdot (\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T) + \frac{1}{2}\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^T \nabla \log p_t(\mathbf{x})\}dt + \mathbf{G}(\mathbf{x}, t)d\mathbf{w}$$

The proof can be stated in one sentence: $\frac{d\mathbf{x}}{dt} = \mathbf{v}(\mathbf{x}, t)$ is actually the probability flow ODE of the above equation, so naturally, they have the same marginal distribution.

ODE Sampling

When using an ODE forward process, we can naturally use the ODE itself for sampling without any obstacles; simply discretizing it with the Euler method (Euler-Maruyama minus the noise term) is sufficient.

It is worth noting that when using an SDE forward process, we can even use ODE sampling, as long as we formulate it via the probability flow ODE.

For example, DDIM is an instance of this: it is trained using an SDE, but sampling is performed using an ODE.

Practical ODE Case Study

Concise Derivation of Flow-GRPO

For Flow-GRPO, if we want to introduce randomness into sampling, we must transform the reverse process into an SDE.

Referring to the original text:

Substituting Eq. 15 to Eq. 14, we arrive at the drift coefficients of the target forward SDE:

$$f_{\text{SDE}} = \boldsymbol{v}_t(\mathbf{x}_t, t) + \frac{\sigma_t^2}{2} \nabla \log p_t(\mathbf{x})$$

Hence, we can rewrite the forward SDE in Eq. 11 as:

$$d\mathbf{x}_t = \left( \boldsymbol{v}_t(\mathbf{x}_t) + \frac{\sigma_t^2}{2} \nabla \log p_t(\mathbf{x}_t) \right) dt + \sigma_t d\mathbf{w}_t$$

It is very easy to see that this is the simplified form of converting ODE to SDE as mentioned above, where $\mathbf{G}(\mathbf{x}, t) = \sigma_t$.

Viewed from this perspective, the result is immediate.

Conclusion

After this review, we find that the aforementioned knowledge is the key to understanding generative models. Without this foundation, it is impossible to comprehend the related theories and papers.

Cite this article

If you find this article helpful, please consider citing it:

@article{feng2025sde,
  title   = {Summary of Basic SDE/ODE Math Derivations in Diffusion Models},
  author  = {Feng, Wanquan},
  year    = {2025},
  url     = {https://wanquanf.github.io/blog/posts/20250912_sde_derivation.html}
}