# Gradient Descent on pWFAs

24 Jul 2019

Joshua Moerman

Given a parametric weighted automaton, we would like to choose good values for the parameters.
For example, we want to satisfy a property like βthe system eventually reaches state $x$ with a probability of 0.75 or more.β
Can we find values for our parameter $p$ such that this property holds?
This problem is called *parameter synthesis.*

As shown in previous posts, the semantics of parametric weighted automata are polynomials in $p$. In order to compute a reachability probability, we need to consider fixpoints (as automata have loops) and consequently we work with rational functions. So in principle, we can readily solve the parameter synthesis problem by solving an equation of the form $\frac{f(p)}{g(p)} \geq 0.75$.

However, this exact method is often infeasible. The rational function can be a huge formula, and constructing it can already take a lot of time. And once we have the equations, actually solving it is an even harder problem.

Today, I want to show how we can solve this problem with a gradient descent method, trying to avoid the difficulties of the exact method. I am not aware of this being done in the literature, and I am pretty convinced that this is currently not implemented in any of the probabilistic model checkers.

## Derivatives

Recall (from a previous post) that the rational function we are interested in is $v(i) \in \mathbb{R}(p)$, where $i$ is the initial state and $v$ a linear function assigning to each state the reachability propability. For our gradient descent, we want to consider the derivative with respect to $p$:

$\frac{\partial}{\partial p} v(i) .$

With several parameters $p_1, \ldots, p_k$, we can consider the partial derivative in each direction. This would give us a gradient for a gradient descent solver. I stick to one parameter for the moment, and so I simply write $\partial$ instead of $\frac{\partial}{\partial p}$. Note that the above expression makes sense, as we are working in a derivation ring (meaning a ring with an $\mathbb{R}$-linear map $\partial$ such that the Lebniz rule holds: $\partial(a \cdot b) = (\partial a) \cdot b + a \cdot (\partial b)$).

Of course, simply writing down that quantity wonβt help us. We want to avoid calculating $v(i)$ in the first place. In the end, we only want to compute $\partial v(i)$ at a specific point $p \in \mathbb{R}$. Can we do that efficiently?

Letβs unfold the (coinductive) equation for $v$ on an arbitrary state $q$:

$v(q) = o(q) + \sum_a \sum_{q'} v(q') \cdot \delta_a(q, q') ,$ take the derivative (using linearity): $\partial v(q) = \partial o(q) + \sum_a \sum_{q'} \partial (v(q') \cdot \delta_a(q, q')) ,$ and apply the Leibniz rule: $\partial v(q) = \partial o(q) + \sum_a \sum_{q'} (\partial v(q')) \cdot \delta_a(q, q') + v(q') \cdot (\partial \delta_a(q, q')) ,$

So we see that the derivative depends on the derivative of the output, and in a particular way on the derivatives of the successor states. But it also depends on the actual (non-derived) value, by the Leibniz rule. As it currently stands, this is not an easy equation to solve.

**Side note:**
In the previous post I defined the output to be of type $o \colon Q \to \mathbb{R}$, so its derivative will always be zero.
I keep it here in the equation, because we can actually generalise the definition to $o \colon Q \to \mathbb{R}(p)$.

## Solving at a point

Letβs plug in a specific point $p \in \mathbb{R}$. Normally, I would denote the valuation of a polynomial as function application, for example $v(q)(p)$ is the value of the rational function $v(q)$ evaluated at $p$. But that would require lots of parentheses, so I denote the evaluation of polynomials with a subscript $v_p(q)$. By evaluating the above expression at a point $p$, we get:

$\begin{aligned} \partial v_p(q) &= \partial o_p(q) + \sum_a \sum_{q'} (\partial v_p(q')) \cdot \delta_{a p}(q, q') + v_p(q') \cdot (\partial \delta_{a p}(q, q') \\ &= \partial o_p(q) + \sum_{q'} \partial v_p(q') \cdot \left(\sum_a \delta_{a p}(q, q')\right) + \sum_{q'} v_p(q') \cdot \left(\sum_a \partial \delta_{a p}(q, q')\right) \end{aligned}$ In the second line, I have regrouped the sums. This is a big messy expression, but if we squint, we can see the following structure: $\partial v_p = \partial o_p + \partial v_p \cdot M + v_p \cdot M'$

for some matrices $M$ and $M'$ (given by the transition matrix and the derivative of the transition matrix w.r.t. $p$). Now, this is a system of linear equations over $\mathbb{R}$, instead of the field of fractions. Donβt let the symbols confuse you, β$\partial v_p$β is simply a variable in this system of equations! Further, the symbols β$\partial o_p$β, β$M$β and β$M'$β are just constant which depend on the given automaton. Instead of solving this equation exactly, we can do a quick value iteration to get an approximate value. (Note that we also need to approximate the vector $v_p$ at the same time.)

The above system of linear equations has a dimension of $2n$, where $n$ is the number of states. I expect that solving the system (approximately) can be very efficient, especially if the automaton structure is sparse. Of special interest is the value $\partial v_p(i)$ which tells us how $v_p(i)$ increases depending on an increase of $p$. This way, we can gradually move toward a point $p$ where the value $v_p(i)$ is high enough for our property to be satisfied. (Obviously, this comes with the disclaimer that a gradient descent may only find local optima.) This will also work with multiple parameters, although we will have to do the above work in each parameter.

I havenβt actually done the gradient descent, I have only shown how to obtain the gradient. It would be a nice project to implement such a thing in a probabilistic model checker.

Other iterative methods, such as Newtonβs method, to find optimal parameters can also be used. Such methods use the second derivatives (or even higher), which can be computed in a similar way. The system of equations will be bigger and more costly to solve or approximate, but on the other hand, it would require fewer iterations of improving the parameter.

## Automaton construction

The above manipulations can also be carried out to construct the βderivative of the automaton.β I havenβt worked out the details yet, but it gives us a construction $\mathcal{A} \mapsto \partial \mathcal{A}$

with the property that $[\![ \partial \mathcal{A} ]\!] (w) = \partial [\![ \mathcal{A} ]\!] (w)$. This would increase the size automaton by a factor of 2 (I guess), just like the system of equations above was of dimension $2n$. Whether this derived automaton is interesting on its own right, I donβt know.

I used to have a Disqus comments section here. But everybody seems to simply send me emails instead. I have disabled Disqus, in order to avoid third-party trackers.