
As far as I know, there is no *very* simple proof of the convergence of temporal-difference algorithms. The proofs of convergence of TD algorithms are typically based on stochastic approximation theory (Q-learning, for example, can be viewed as a stochastic process) and on the work of Robbins and Monro; in fact, the Robbins-Monro conditions on the step sizes are usually assumed in the theorems and proofs.
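Concretely, the Robbins-Monro conditions constrain the step sizes (learning rates) $\alpha_t$ used in the updates:

$$\sum_{t=0}^{\infty} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty$$

For example, $\alpha_t = \frac{1}{t+1}$ satisfies both conditions, while a constant learning rate satisfies the first but violates the second.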

The proofs of convergence of Q-learning (a TD(0) algorithm) and SARSA (another TD(0) algorithm), when the value functions are represented in tabular form (as opposed to being approximated with e.g. a neural network), can be found in different research papers.

For example, the **proof of convergence of tabular Q-learning** can be found in the paper *Convergence of Stochastic Iterative Dynamic Programming Algorithms* (1994) by Tommi Jaakkola et al. The **proof of convergence of tabular SARSA** can be found in the paper *Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms* (2000) by Satinder Singh et al.
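To see the Robbins-Monro step-size condition in action, here is a minimal sketch of tabular Q-learning on a toy two-state, two-action MDP (the dynamics, rewards, and all names are invented for illustration). It uses the per-pair step size $\alpha = 1/N(s, a)$, where $N(s, a)$ counts the visits to the pair $(s, a)$, which satisfies both conditions above. This is only an illustration of the setup assumed in the cited proofs, not a proof of convergence.

```python
import random

# Hypothetical 2-state, 2-action MDP (invented for illustration):
# action 0 stays in the current state, action 1 switches state,
# and a reward of 1 is given only for taking action 1 in state 0.
N_STATES, N_ACTIONS, GAMMA = 2, 2, 0.9

def step(s, a):
    r = 1.0 if (s == 0 and a == 1) else 0.0
    s_next = s if a == 0 else 1 - s
    return r, s_next

random.seed(0)
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
visits = [[0] * N_ACTIONS for _ in range(N_STATES)]

s = 0
for t in range(50000):
    # Uniform random exploration, so every (s, a) pair is visited
    # infinitely often in the limit (another assumption of the proofs).
    a = random.randrange(N_ACTIONS)
    r, s_next = step(s, a)
    visits[s][a] += 1
    alpha = 1.0 / visits[s][a]  # Robbins-Monro step size: 1/N(s, a)
    td_target = r + GAMMA * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    s = s_next

print(Q)  # Q-values approach the fixed point of the Bellman optimality equation
```

With $\alpha = 1/N(s, a)$, each Q-value is exactly the running average of its TD targets; averaging suppresses the noise (second condition) while still allowing the estimate to move arbitrarily far from its initialization (first condition), which is the intuition behind requiring these conditions in the theorems.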

See also the related question *How to show temporal difference methods converge to MLE?*