7.2  Root Finding Methods

Basic concepts

In our study of mathematical models, we frequently encounter equations of the form \(f(x)=0\). The function \(f\) can take many forms, and our goal is to find a solution \(x^*\) that satisfies the equation. For linear systems of the form \(\mathbf{A x}=\mathbf{b}\), solutions with millions of unknowns are routinely computed, for example in finite element simulations. Nonlinear problems require different strategies: we either use pure search methods or linearize the problem into a sequence of more manageable linear systems. In the one-dimensional case, simple solution formulas sometimes exist. A classical example is the quadratic polynomial \(f(x)=x^2+a x+b\), whose solutions are given by the familiar formula

\[ x_{1,2}^*=-\frac{a}{2} \pm \sqrt{\left(\frac{a}{2}\right)^2-b} \]
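As a quick check, the formula is easy to evaluate numerically and to compare against a library root finder. The sketch below uses NumPy with the coefficients \(a=-1\), \(b=-6\), i.e. the quadratic \(x^2-x-6\) that reappears in the examples later in this section:

import numpy as np

a, b = -1.0, -6.0                     # coefficients of x^2 + a x + b
disc = (a / 2)**2 - b                 # term under the square root

roots_formula = (-a / 2 + np.sqrt(disc), -a / 2 - np.sqrt(disc))
roots_numpy = np.roots([1.0, a, b])   # NumPy's polynomial root finder

print(roots_formula)                  # (3.0, -2.0)
print(roots_numpy)                    # 3 and -2 (order may vary)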

For polynomial equations, the solutions are also known as roots. Solution formulas exist for polynomials of degree 3 and 4 (published by Girolamo Cardano; the quartic formula is due to his student Lodovico Ferrari), but they are considerably more complex than the quadratic formula. For polynomials of degree greater than 4, Niels Henrik Abel proved that no general solution formula in radicals exists (the Abel-Ruffini theorem). Beyond polynomials, most functions \(f(x)\) admit no analytic solution formulas whatsoever. In these cases, we must use approximation methods. Moreover, even when exact formulas exist (such as for polynomials of degree 2, 3, or 4), they may be too complex or time-consuming to evaluate in practice. An approximate numerical method can often provide accurate solutions more efficiently. For more historical context, see the video by Veritasium.

For many other equations, such as transcendental equations (involving exponential, logarithmic or trigonometric functions), no exact solution formulas are known. In these cases, numerical methods are the only way to find approximate solutions. One of the most famous examples is Kepler’s equation, which relates the position of a planet in its orbit to time. Kepler’s equation is given by

\[ M = x - e \sin(x) \]

where \(M\) is the mean anomaly, \(x\) is the eccentric anomaly, and \(e\) is the eccentricity of the orbit. Rearranging this equation to the standard root-finding form gives \[ f(x) = x - e \sin(x) - M = 0 \]

This equation cannot be solved analytically for \(x\) in terms of elementary functions, so numerical methods are used to find approximate solutions for given values of \(M\) and \(e\).
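As a glimpse of what such numerical methods provide, a standard bracketing solver already handles Kepler's equation. The sketch below uses SciPy's brentq with illustrative values of \(M\) and \(e\) (these are our own example choices):

import numpy as np
from scipy.optimize import brentq

M, e = 1.0, 0.5                        # illustrative mean anomaly and eccentricity
f = lambda x: x - e * np.sin(x) - M

# f(0) = -M < 0 and f(M + e) = e*(1 - sin(M + e)) >= 0, so [0, M + e] brackets a root
x_star = brentq(f, 0.0, M + e)
print(x_star, f(x_star))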

All methods we present are iterative, beginning with an initial guess \(x_0\) and generating a sequence \(x_0, x_1, \ldots\) of progressively better approximations. The iteration continues until reaching a desired tolerance or until a maximum number of steps is exceeded. Several practical considerations guide the choice and application of iterative methods. Visualizing the function helps identify how many solutions exist and which one the application requires. The distinction between real and complex solutions determines which methods apply. Convergence properties differ dramatically: some methods approach the solution reliably but slowly, while others converge rapidly when started near a root but fail completely for poor initial guesses. Methods that fail to converge are said to diverge. Termination criteria such as \(\left|x_{k+1}-x_k\right|<\varepsilon\) for small \(\varepsilon > 0\) balance accuracy against computational cost.

Iterative methods divide into two broad categories based on whether they require derivatives. Methods exploiting derivative information typically converge faster by using more knowledge about \(f\)’s local behavior. When derivatives are unavailable, expensive to compute, or unreliable, derivative-free methods provide robust alternatives. We begin with bisection, the most reliable derivative-free method.

A sample function, \(f(x)\), with a root shown where the curve intersects the horizontal axis.

The Bisection Method

We begin with the bisection method, one of the most intuitive and robust derivative-free methods. The method applies to any continuous function of one variable and rests on a simple geometric idea: if a continuous function \(f\) has opposite signs at two points, it must cross zero somewhere between them. More precisely, the intermediate value theorem guarantees that if \(f(a)f(b) < 0\) on an interval \([a,b]\), then at least one solution to \(f(x)=0\) exists within that interval.

The method’s robustness comes from its systematic bracketing strategy. If multiple solutions exist in \([a,b]\), the method converges to one of them. Should we approach an unintended root, we simply narrow the interval to bracket the desired solution and restart. Plotting the function beforehand reveals how many roots exist and where to position the initial interval.

The algorithm proceeds iteratively, halving the interval at each step. We first calculate the midpoint \(c=\frac{1}{2}(a+b)\), which divides the interval into two equal subintervals: \([a, c]\) and \([c, b]\). We then determine which subinterval must contain a root.

In the rare event that \(f(c)=0\), we have found the exact solution and terminate immediately. Otherwise, we evaluate the product \(f(a)f(c)\). If \(f(a)f(c) < 0\), then the function has opposite signs at \(a\) and \(c\), so the root must lie in \([a,c]\), and we update the interval accordingly. If \(f(a)f(c) > 0\), the root must lie in \([c,b]\), and we update to that subinterval instead.

We repeat this halving procedure until a stopping criterion is satisfied, such as \(|f(c)|<\delta\) (the function value is sufficiently small) or \(b-a<\varepsilon\) (the interval is sufficiently narrow) for some predefined tolerance. The final midpoint \(c\) serves as our approximation \(x^*\) to the root.
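The procedure translates directly into a short, reusable routine; the sketch below mirrors the stopping criteria just described (the function name and default tolerances are our own choices):

def bisect(f, a, b, eps=1e-8, delta=1e-12, max_iter=200):
    """Bisection for a continuous f with f(a)*f(b) < 0 (a sketch)."""
    if f(a) * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        c = (a + b) / 2                      # midpoint of the current bracket
        if abs(f(c)) < delta or (b - a) < eps:
            return c                         # stopping criterion met
        if f(a) * f(c) < 0:
            b = c                            # root lies in [a, c]
        else:
            a = c                            # root lies in [c, b]
    return (a + b) / 2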

We demonstrate the method on a simple polynomial where the exact solution is known, allowing verification of the numerical approximation.

Example 7.2.1 (Example: Finding the largest root of a quadratic) We now apply the bisection method to find the largest root of \(x^2-x-6=0\) and compare our numerical approximation with the exact analytical solution.

The quadratic function \(f(x) = x^2 - x - 6\) used to demonstrate the bisection method. The goal is to find the largest root at \(x=3\).

We first compute the exact roots using SymPy’s solve() function for comparison:

import sympy as sp

x = sp.symbols('x')
f = x**2 - x - 6  # symbolic form of the quadratic

sp.solve(f, x)
[-2, 3]

We target the root \(x^*=3\) using the bisection method. From the graph, we see that the starting interval \([1,4]\) brackets this root since \(f(1) < 0\) and \(f(4) > 0\). We set the tolerance to \(\varepsilon=0.01\) and expect convergence to this single root within the interval.

a, b = (1, 4) # initial interval

i = 0 # iteration counter
while (b-a) > 0.01: # stopping criterion
    c = (a + b) / 2 # midpoint
    if f.subs(x, c) * f.subs(x, a) < 0: # root in [a,c]
        b = c # update b
    else:
        a = c # root in [c,b], update a
    i += 1 # increment iteration counter
    print(f"Iteration {i}: a = {a:0.3f}, b = {b:0.3f}, c = {c:0.3f}, f(c) = {f.subs(x, c):0.3f}")
    if i > 100: # safeguard against infinite loops
        print("Max iterations reached"); break
Iteration 1: a = 2.500, b = 4.000, c = 2.500, f(c) = -2.250
Iteration 2: a = 2.500, b = 3.250, c = 3.250, f(c) = 1.312
Iteration 3: a = 2.875, b = 3.250, c = 2.875, f(c) = -0.609
Iteration 4: a = 2.875, b = 3.062, c = 3.062, f(c) = 0.316
Iteration 5: a = 2.969, b = 3.062, c = 2.969, f(c) = -0.155
Iteration 6: a = 2.969, b = 3.016, c = 3.016, f(c) = 0.078
Iteration 7: a = 2.992, b = 3.016, c = 2.992, f(c) = -0.039
Iteration 8: a = 2.992, b = 3.004, c = 3.004, f(c) = 0.020
Iteration 9: a = 2.998, b = 3.004, c = 2.998, f(c) = -0.010
Code
from matplotlib.animation import FuncAnimation

# Recreate bisection iterations for animation
a_init, b_init = (1, 4)
tol = 0.01
a_vals = [a_init]
b_vals = [b_init]
c_vals = []
diffs = []

a_temp, b_temp = a_init, b_init
while (b_temp - a_temp) > tol:
    c_temp = (a_temp + b_temp) / 2
    diffs.append(b_temp - a_temp)
    c_vals.append(c_temp)
    if float(f.subs(x, c_temp) * f.subs(x, a_temp)) < 0:
        b_temp = c_temp
    else:
        a_temp = c_temp
    a_vals.append(a_temp)
    b_vals.append(b_temp)

# Generate function values for plotting
x_plot = np.linspace(-1, 5, 400)
f_numpy = sp.lambdify(x, f, 'numpy')
y_plot = f_numpy(x_plot)

# Create animation
fig, ax = plt.subplots(figsize=(8, 5), dpi=150)

def animate(frame):
    ax.clear()

    # Problem info box at top center
    problem_text = r"$f(x) = x^2 - x - 6$"
    ax.text(0.5, 0.98, problem_text,
            transform=ax.transAxes,
            fontsize=11,
            verticalalignment='top',
            horizontalalignment='center',
            bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

    # Plot function
    ax.plot(x_plot, y_plot, 'b-', lw=2, label='$f(x)$')
    ax.axhline(0, color='black', lw=0.5, linestyle='--', alpha=0.5)
    ax.grid(True, alpha=0.3)

    if frame < len(c_vals):
        a_curr = a_vals[frame]
        b_curr = b_vals[frame]
        c_curr = c_vals[frame]

        f_a = float(f.subs(x, a_curr))
        f_b = float(f.subs(x, b_curr))
        f_c = float(f.subs(x, c_curr))

        # Highlight current interval
        ax.axvspan(a_curr, b_curr, alpha=0.2, color='yellow', label='Current Interval')

        # Mark interval endpoints
        ax.plot(a_curr, f_a, 'ro', markersize=8, label='$a$', zorder=5)
        ax.plot(b_curr, f_b, 'go', markersize=8, label='$b$', zorder=5)

        # Mark midpoint
        ax.plot(c_curr, f_c, 'ms', markersize=10, label='$c = \\frac{a+b}{2}$', zorder=5)

        # Vertical lines to x-axis
        ax.plot([a_curr, a_curr], [0, f_a], 'r--', lw=1, alpha=0.5)
        ax.plot([b_curr, b_curr], [0, f_b], 'g--', lw=1, alpha=0.5)
        ax.plot([c_curr, c_curr], [0, f_c], 'm--', lw=1, alpha=0.5)

        # Info text box in lower right
        interval_width = b_curr - a_curr

        # Determine which interval is kept
        if float(f.subs(x, c_curr) * f.subs(x, a_curr)) < 0:
            next_interval = f"$[{a_curr:.3f}, {c_curr:.3f}]$"
            reason = "$f(a) \\cdot f(c) < 0$"
        else:
            next_interval = f"$[{c_curr:.3f}, {b_curr:.3f}]$"
            reason = "$f(c) \\cdot f(b) < 0$"

        info_text = (
            f"Iteration {frame + 1}\n"
            f"$a = {a_curr:.4f}, \\quad f(a) = {f_a:.4f}$\n"
            f"$b = {b_curr:.4f}, \\quad f(b) = {f_b:.4f}$\n"
            f"$c = {c_curr:.4f}, \\quad f(c) = {f_c:.4f}$\n"
            f"Interval width: ${interval_width:.4f}$\n"
            f"{reason}\n"
            f"Next: {next_interval}"
        )

        ax.text(0.98, 0.02, info_text,
                transform=ax.transAxes,
                fontsize=9,
                verticalalignment='bottom',
                horizontalalignment='right',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

    ax.set_xlabel('$x$')
    ax.set_ylabel('$f(x)$')
    ax.set_title(f"Bisection Method: Iteration {frame + 1}, a-b = {diffs[frame-1]:.4f}" if frame > 0 else "Bisection Method: Initial Interval")
    ax.legend(loc='upper left', fontsize=8)
    ax.set_xlim(0, 4.5)
    ax.set_ylim(-8, 10)

anim = FuncAnimation(fig, animate, frames=len(c_vals), interval=1000, repeat=True)
plt.close()

mk.to_responsive_html(anim, container_id='bisection-animation')

Convergence Analysis

Having established the bisection algorithm, we now analyze its convergence properties. The proof of convergence is constructive and proceeds in three stages. First, we show that the interval sequence must converge to a single point. Second, we prove that this point is a root of the function. Finally, we derive the method’s error bounds and convergence rate.

Stage 1: The Interval Converges to a Point

Let \([a_k, b_k]\) denote the interval after \(k\) iterations. The algorithm systematically shrinks this interval. At each step, we select one of the two halves, so the interval length decreases by exactly a factor of two: \[ b_k - a_k = \frac{1}{2}(b_{k-1} - a_{k-1}) \] Applying this relation repeatedly shows that after \(k\) iterations, the interval has shrunk by a factor of \(2^k\): \[ b_k - a_k = 2^{-k}(b_0 - a_0) \] Since \(2^{-k} \to 0\) as \(k \to \infty\), the interval width \(b_k - a_k\) vanishes in the limit.

Furthermore, the endpoint sequences have a special structure. The left endpoints, \(\{a_k\}\), form a non-decreasing sequence (\(a_0 \leq a_1 \leq \ldots\)), while the right endpoints, \(\{b_k\}\), form a non-increasing sequence (\(b_0 \geq b_1 \geq \ldots\)). Both sequences are bounded, since \(a_0 \leq a_k \leq b_k \leq b_0\) for all \(k\). A fundamental theorem from real analysis states that any monotone bounded sequence must converge. Therefore, both sequences converge to limits. Because the interval width between them goes to zero, they must converge to the same limit, which we denote by \(x^*\): \[ \lim_{k \to \infty} a_k = \lim_{k \to \infty} b_k = x^* \] This establishes that the iteration converges to a unique point, \(x^*\).

Stage 2: The Limit Point is a Root

Next, we must show that this limit point \(x^*\) is a root of \(f\). The algorithm maintains the bracketing property at every iteration, which by definition means \(f(a_k)f(b_k) \leq 0\) for all \(k\). In other words, the function values at the endpoints always have opposite signs (or one is zero).

If we take the limit of this inequality as \(k \to \infty\), the continuity of \(f\) allows us to state that: \[ \lim_{k \to \infty} f(a_k)f(b_k) = f(\lim_{k \to \infty} a_k) f(\lim_{k \to \infty} b_k) = f(x^*)f(x^*) = f(x^*)^2 \leq 0 \] Since the square of any real number is non-negative, the only way for \(f(x^*)^2 \leq 0\) to hold is if \(f(x^*)^2 = 0\), which implies \(f(x^*)=0\). We conclude that the limit point \(x^*\) is indeed a root of the function.

Stage 3: Error Bounds and Convergence Rate

Finally, we can quantify the method’s accuracy. Since the true root \(x^*\) lies within the interval \([a_k, b_k]\), and we use the midpoint \(c_k = (a_k + b_k)/2\) as our approximation, the absolute error is bounded by half the interval width: \[ |x^* - c_k| \leq \frac{b_k - a_k}{2} = \frac{b_0 - a_0}{2^{k+1}} \] This formula provides a powerful a priori error bound; we can determine the number of iterations required to achieve a desired tolerance without actually running the method. For example, to guarantee an accuracy of \(10^{-3}\) on an initial interval of length 3 (e.g., \([1, 4]\)), we solve for \(k\): \[ \frac{3}{2^{k+1}} < 10^{-3} \implies 2^{k+1} > 3000 \implies k+1 > \log_2(3000) \approx 11.55 \] Thus, 12 iterations are sufficient to guarantee the desired accuracy. This analysis also reveals that the bisection method exhibits linear convergence, as each iteration reduces the error bound by a constant factor of 0.5. While slower than other methods, this guaranteed convergence and predictable error reduction make bisection exceptionally reliable.
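The bound turns into a one-line estimate in code. The sketch below reproduces the hand calculation; here \(n = k+1\) counts the halvings appearing in the error bound:

import math

a0, b0 = 1.0, 4.0                          # initial interval from the example
tol = 1e-3                                 # desired accuracy

n = math.ceil(math.log2((b0 - a0) / tol))  # smallest n with (b0 - a0)/2**n < tol
print(n)                                   # 12, matching log2(3000) ~ 11.55 rounded up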

See the video explanation: Bisection Method Video

Fixed Point Iteration

While bisection guarantees convergence, its linear rate means that many iterations are required to achieve high accuracy. We now introduce fixed-point iteration, a different approach that can converge faster when properly formulated. The method reformulates the root-finding problem \(f(x) = 0\) as an equivalent fixed-point problem \(x = g(x)\), then iterates this relation directly.

The method begins by choosing a function \(g(x)\) such that a solution to \(f(x) = 0\) corresponds to a fixed point of \(g\), meaning a point where \(x^* = g(x^*)\). Starting from an initial guess \(x_0\), we generate the sequence \(x_1 = g(x_0)\), \(x_2 = g(x_1)\), and in general

\[ \boxed{ x_{k+1}=g\left(x_k\right), \quad k=0,1, \ldots } \tag{7.2.1}\]

A solution \(x^*\) of \(x = g(x)\) is called a fixed point of \(g\), motivating the name of the method, which is also known as Picard iteration (Émile Picard). Such an \(x^*\) is then also a solution to \(f(x)=0\). Several different fixed-point forms \(x = g(x)\) can be obtained from the same equation \(f(x)=0\). The corresponding iterative sequences \(x_0, x_1, x_2, \ldots\) may behave differently, in particular with respect to their speed of convergence; indeed, some of them may not converge at all.

If the natural convergence test \(\left|x_{k+1}-x_k\right|<\varepsilon\) is satisfied for some predefined small parameter \(\varepsilon>0\), then we can choose \(x^*=x_{k+1}\).
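In code, the entire method is only a few lines. The sketch below uses the convergence test just described; the function name and defaults are our own choices:

def fixed_point(g, x0, eps=1e-8, max_iter=100):
    """Iterate x_{k+1} = g(x_k) until |x_{k+1} - x_k| < eps (a sketch)."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < eps:   # natural convergence test
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

# Example 7.2.2 below corresponds to fixed_point(lambda x: (x + 6)**0.5, 3.25)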

Convergence Analysis

The convergence theory rests on two classical results: Brouwer's fixed-point theorem guarantees the existence of a fixed point, and Banach's fixed-point theorem provides conditions under which the iteration actually converges to it.

Theorem 7.2.1 Brouwer fixed point theorem.

Assume that \(g(x)\) is continuous on the closed interval \([a, b]\). Assume that the interval \([a, b]\) is mapped to itself by \(g(x)\), that is, for any \(x \in[a, b], g(x) \in[a, b]\). Then there exists a point \(x^* \in[a, b]\) such that \(x^*=g\left(x^*\right)\). The point \(x^*\) is a fixed point of \(g(x)\). (L. E. J. Brouwer)

Proof. Let \(u(x)=x-g(x)\). Since \(g(a) \in[a, b]\) and also \(g(b) \in[a, b]\), we know that \(u(a)=a-g(a) \leq 0\) and \(u(b)=b-g(b) \geq 0\). Since \(g(x)\) is continuous in \([a, b]\), so is \(u(x)\), and hence according to the intermediate value theorem, there must exist a point \(x^* \in[a, b]\) at which \(u\left(x^*\right)=0\). At this point \(x^*=g\left(x^*\right)\).

An iteration process defined by (7.2.1) is called convergent for a value \(x_0\) if the corresponding sequence \(x_0, x_1, x_2, \ldots\) is convergent. A sufficient condition for convergence is given in the following theorem, which has various practical applications.

Theorem 7.2.2 Banach fixed point theorem.

Let \(x^*\) be a solution of \(x=g(x)\) and suppose that \(g\) has a continuous first derivative in some interval \(I\) containing \(x^*\). Then if \(\left|g^{\prime}(x)\right| \leq L<1\) in \(I\), the iteration process defined by (7.2.1) converges for any \(x_0\) in \(I\), and the limit of the sequence \(\left\{x_k\right\}\) is \(x^*\). (Stefan Banach)

Proof. By the mean value theorem there is a \(t\) between \(x, x^* \in I\), such that

\[ g(x)-g\left(x^*\right)=g^{\prime}(t)\left(x-x^*\right) \]

Since \(x^*=g\left(x^*\right)\) and \(x_k=g\left(x_{k-1}\right)\), we can use the condition \(\left|g^{\prime}(x)\right| \leq L\) to relate the error at step \(k\) to the error at step \(k-1\):

\[ \left|x_k-x^*\right|=\left|g\left(x_{k-1}\right)-g\left(x^*\right)\right|=\left|g^{\prime}(t)\right|\left|x_{k-1}-x^*\right| \leq L\left|x_{k-1}-x^*\right| . \]

This inequality shows that the error is reduced by at least a factor of \(L\) at each step. We can apply this relationship recursively to relate the error at step \(k\) all the way back to the initial error:

\[ \left|x_k-x^*\right| \leq L\left|x_{k-1}-x^*\right| \leq L(L\left|x_{k-2}-x^*\right|) = L^2\left|x_{k-2}-x^*\right| \leq \ldots \leq L^k\left|x_0-x^*\right|. \]

Since we assumed \(L<1\), the term \(L^k\) must go to zero as the number of iterations \(k\) approaches infinity. Consequently, the error \(\left|x_k-x^*\right|\) must also go to zero, which completes the proof.

A function \(g\) satisfying the condition in the theorem is called a contraction because \(|g(x)-g(y)| \leq L|x-y|\) for all \(x, y \in I\), where \(L<1\). Furthermore, \(L\) gives information on the speed of convergence: smaller \(L\) implies faster convergence. The inequality \(|g(x)-g(y)| \leq L|x-y|\) is called the Lipschitz condition (Rudolf Lipschitz), and \(L\) is the Lipschitz constant. The Lipschitz condition gives a quantitative, constructive handle on continuity (every Lipschitz function is continuous, although it need not be differentiable), in contrast with the “usual” Cauchy \(\varepsilon\)-\(\delta\) definitions from standard analysis courses.

The theorem guarantees convergence to a fixed point \(x^*\) of \(g(x)\) if we start the iterations sufficiently close to that point. Starting far from \(x^*\) may or may not lead to convergence. Moreover, since the theorem only concerns a neighborhood of the fixed point, it says nothing about whether other fixed points exist outside that neighborhood.

The behavior of the iteration depends on the magnitude of the derivative \(g'(x^*)\) at the fixed point. If \(|g'(x^*)| < 1\), the iteration converges when started sufficiently close to \(x^*\). If \(|g'(x^*)| > 1\), the iteration diverges, even when starting close to \(x^*\). The absolute value notation captures both positive and negative slopes: what matters for convergence is the magnitude, not the sign. The following plots demonstrate both scenarios.

Code
import numpy as np
import matplotlib.pyplot as plt
import sympy as sp

x = sp.symbols('x')
g_conv = 2.9*x*(1-x)
x1, x2 = (0.58, 0.72)
fixed_point = float(sp.nsolve(x - g_conv, 0.5))
g_conv_num = sp.lambdify(x, g_conv, 'numpy')

fig, ax = plt.subplots()
x_vals = np.linspace(x1, x2, 400)

ax.plot(x_vals, g_conv_num(x_vals), 'b-', label=f'$g(x) = {sp.latex(g_conv)}$')
ax.plot(x_vals, x_vals, 'k--', label='$y=x$', alpha=0.5)
ax.plot(fixed_point, fixed_point, 'ro', label=f'Fixed Point $x^* \\approx {fixed_point:.3f}$')

# Cobweb plot
x_n = 0.7 # Starting point
n_iter = 20
y_n = 0

scale = 0.1
# First arrow from x-axis
g_x_n = g_conv_num(x_n)
ax.arrow(x_n, y_n, 0, g_x_n - y_n, width=0.005*scale, 
head_width=0.02*scale, head_length=0.04*scale, 
fc='g', ec='g', length_includes_head=True)
y_n = g_x_n

for _ in range(n_iter):
    next_x_n = y_n
    ax.arrow(x_n, y_n, next_x_n - x_n, 0, width=0.005*scale,
    head_width=0.02*scale, head_length=0.04*scale, 
    fc='g', ec='g', length_includes_head=True)
    x_n = next_x_n
    
    next_y_n = g_conv_num(x_n)
    ax.arrow(x_n, y_n, 0, next_y_n - y_n, width=0.005*scale,
    head_width=0.02*scale, head_length=0.04*scale, 
    fc='g', ec='g', length_includes_head=True)
    y_n = next_y_n

ax.set_title('Fixed-Point Iteration: Convergence')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.legend(loc='upper center')
ax.grid(True, which='both', linestyle='--', linewidth=0.5)
ax.set_xlim(x1, x2)
ax.set_ylim(x1, x2)
ax.set_aspect('equal', adjustable='box')
plt.show()
import numpy as np
import matplotlib.pyplot as plt
import sympy as sp

x = sp.symbols('x')
# g_div = -1.5*(1-x)+1
# g_div = 3.2*x*(1-x)
g_div = 1/x**2
x1, x2 = (0.6, 1.4)
# Find the fixed point numerically (same as for convergence case)
fixed_point = float(sp.nsolve(x - g_div, 0.5))
g_div_num = sp.lambdify(x, g_div, 'numpy')

fig, ax = plt.subplots()
x_vals = np.linspace(0.01, 3, 400)

ax.plot(x_vals, g_div_num(x_vals), 'b-', label=f'$g(x) = {sp.latex(g_div)}$')
ax.plot(x_vals, x_vals, 'k--', label='$y=x$', alpha=0.5)
ax.plot(fixed_point, fixed_point, 'ro', label=f'Fixed Point $x^* \\approx {fixed_point:.3f}$')

# Cobweb plot
x_n = 1.05 # Starting point
n_iter = 3
y_n = 0

scale = 0.8
# First arrow from x-axis
g_x_n = g_div_num(x_n)
ax.arrow(x_n, y_n, 0, g_x_n - y_n, width=0.005*scale,
head_width=0.02*scale, head_length=0.04*scale, 
fc='g', ec='g', length_includes_head=True)
y_n = g_x_n

for _ in range(n_iter):
    next_x_n = y_n
    if next_x_n <= 0: break  # guard: stop if the iterate leaves the plotted domain
    ax.arrow(x_n, y_n, next_x_n - x_n, 0, width=0.005*scale,
     head_width=0.02*scale, head_length=0.04*scale, fc='g', ec='g', length_includes_head=True)
    x_n = next_x_n
    
    next_y_n = g_div_num(x_n)
    ax.arrow(x_n, y_n, 0, next_y_n - y_n, width=0.005*scale, head_width=0.02*scale, head_length=0.04*scale, fc='g', ec='g', length_includes_head=True)
    y_n = next_y_n

ax.set_title('Fixed-Point Iteration: Divergence')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.legend(loc='upper center')
ax.grid(True, which='both', linestyle='--', linewidth=0.5)
ax.set_xlim(x1, x2)
ax.set_ylim(x1, x2)
ax.set_aspect('equal', adjustable='box')
plt.show()
Figure 7.2.1: Cobweb plot illustrating a converging fixed-point iteration. When \(|g'(x^*)| < 1\), the iterates spiral inwards towards the fixed point, where \(g(x)\) intersects the line \(y=x\).
Figure 7.2.2: Cobweb plot illustrating a diverging fixed-point iteration. When \(|g'(x^*)| > 1\), the iterates spiral outwards, moving away from the unstable fixed point.

Example 7.2.2 We now apply fixed-point iteration to find the largest root of \(x^2-x-6=0\) and compare the result with the exact solution.

The equation has two roots: \(x_1=-2\) and \(x_2=3\). We seek \(x_2 = 3\) using fixed-point iteration. Rewriting the equation \(x^2 - x - 6 = 0\) as \(x^2 = x + 6\) and taking square roots gives the fixed-point form \(x = g(x) = \sqrt{x+6}\). To verify that iteration will converge to \(x^* = 3\), we check the derivative: \(g'(x) = \frac{1}{2\sqrt{x+6}}\). At the root, \(g'(3) = \frac{1}{2\sqrt{9}} = \frac{1}{6} < 1\), satisfying the contraction condition. The Banach fixed-point theorem guarantees convergence if we start sufficiently close to 3.
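The derivative check can also be carried out symbolically; a small SymPy sketch:

import sympy as sp

x = sp.symbols('x')
g = sp.sqrt(x + 6)

g_prime = sp.diff(g, x)       # 1/(2*sqrt(x + 6))
print(g_prime.subs(x, 3))     # 1/6, so the iteration contracts near x* = 3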

import numpy as np

g = lambda x: np.sqrt(x + 6)
x0 = 3.25

xi = x0
for i in range(10):
    xi = g(xi)
    fi = xi**2 - xi - 6
    print(f"Iteration {i+1}: x = {xi:0.6f}, f(x) = {fi:0.6f}")
    if abs(fi) < 1e-4:
        break
Iteration 1: x = 3.041381, f(x) = 0.208619
Iteration 2: x = 3.006889, f(x) = 0.034492
Iteration 3: x = 3.001148, f(x) = 0.005741
Iteration 4: x = 3.000191, f(x) = 0.000957
Iteration 5: x = 3.000032, f(x) = 0.000159
Iteration 6: x = 3.000005, f(x) = 0.000027

Fixed Point Method Video

Kepler’s Problem

Problem: Given the mean anomaly \(M\) and the eccentricity \(e\), find the eccentric anomaly \(E\) such that

\[ M = E - e \sin E \]

To solve this using Kepler’s method, which is a fixed-point iteration, we begin by rewriting the equation as

\[ E = M + e \sin E \]

import numpy as np

M = 19/88*2*np.pi  # example from video
e = 0.21  # example eccentricity

E_hat = M  # initial guess
for i in range(10):
    M_hat = E_hat - e * np.sin(E_hat)
    error = M - M_hat
    E_hat = E_hat + error

    print(f"Iteration {i+1}: E = {E_hat:0.6f}, M_hat = {M_hat:0.6f}, error = {error:0.6f}")
    if abs(error) < 1e-6:
        break
Iteration 1: E = 1.561798, M_hat = 1.151396, error = 0.205201
Iteration 2: E = 1.566588, M_hat = 1.351806, error = 0.004791
Iteration 3: E = 1.566595, M_hat = 1.356590, error = 0.000007
Iteration 4: E = 1.566595, M_hat = 1.356597, error = 0.000000

The algorithm converges rapidly, reaching high accuracy within a few iterations. But why does this work? Let us examine the connection to fixed-point iteration theory and show that it is indeed the same method.

Connection to Fixed-Point Iteration

The code above uses the notation from the video, but we can reveal its structure by simplifying the update step. Starting from:

M_hat = E_hat - e * np.sin(E_hat)
error = M - M_hat
E_hat = E_hat + error

We substitute the expression for error: \[ \text{error} = M - \hat M = M - (\hat E - e\sin \hat E) = M - \hat E + e\sin \hat E \]

Therefore, the update becomes: \[ \hat E \leftarrow \hat E + \text{error} =\hat E + (M - \hat E + e\sin \hat E) = M + e\sin \hat E \]

This is precisely the fixed-point iteration \(E_{k+1} = g(E_k)\) with the iteration function: \[ g(E) = M + e\sin E \]

We can verify that \(E^*\) is a fixed point of \(g\) if and only if it solves Kepler’s equation: \[ E^* = g(E^*) = M + e\sin(E^*) \implies M = E^* - e\sin(E^*) \]

The convergence of this iteration follows from (Theorem 7.2.2). The derivative of the iteration function is: \[ g'(E) = e\cos(E) \]

For typical orbital eccentricities (\(0 \leq e < 1\)), we have \(|g'(E)| = |e\cos(E)| \leq e < 1\) for all \(E\). This satisfies the contraction condition, guaranteeing convergence from any starting point. In our example with \(e = 0.21\), the Lipschitz constant is \(L \leq 0.21\), which explains the rapid convergence we observe.

Let’s verify this by implementing the simplified fixed-point form directly:

# Simplified fixed-point iteration form
M = 19/88*2*np.pi
e = 0.21

E = M  # initial guess
print("Simplified fixed-point iteration: E_{k+1} = M + e*sin(E_k)")
for i in range(10):
    E = M + e * np.sin(E)
    error = M - (E - e * np.sin(E))
    M_hat = E - e * np.sin(E)
    print(f"Iteration {i+1}: E = {E:0.6f}, M_hat = {M_hat:0.6f}, error = {error:0.6f}")
    if abs(error) < 1e-6:
        break
Simplified fixed-point iteration: E_{k+1} = M + e*sin(E_k)
Iteration 1: E = 1.561798, M_hat = 1.351806, error = 0.004791
Iteration 2: E = 1.566588, M_hat = 1.356590, error = 0.000007
Iteration 3: E = 1.566595, M_hat = 1.356597, error = 0.000000

The two implementations produce the same sequence of iterates \(E_k\), confirming that Kepler’s method is indeed a fixed-point iteration in disguise. The video’s notation emphasizes the error-correction perspective (compute error, then correct), while the fixed-point formulation emphasizes the theoretical structure. Both are equivalent, but understanding the connection helps us apply convergence theory to predict and analyze the method’s behavior.

Application Example: 3D Printer Flow Calibration

Beyond classical problems from celestial mechanics, fixed-point iteration appears in modern engineering applications. A common calibration task in 3D printing is tuning the “extrusion multiplier” or “flow rate” to ensure dimensional accuracy. One popular method involves printing a small, hollow cube and measuring its wall thickness. The goal is to adjust the extrusion multiplier, let’s call it \(M\), so that the measured wall thickness, \(T_{\text{meas}}\), matches an expected wall thickness, \(T_{\text{exp}}\).

A common formula to update the multiplier is: \[ M_{\text{new}} = M_{\text{old}} \cdot \frac{T_{\text{exp}}}{T_{\text{meas}}} \] This process is repeated until the measured thickness is acceptably close to the expected thickness. We can formulate this as a fixed-point iteration problem.

The goal is to find a multiplier \(M^*\) such that the resulting measurement equals the target: \(T_{\text{meas}}(M^*) = T_{\text{exp}}\). The iterative formula can be written as: \[ M_{k+1} = M_k \cdot \frac{T_{\text{exp}}}{T_{\text{meas}}(M_k)} \] This is a fixed-point iteration \(M_{k+1} = g(M_k)\) with the iteration function: \[ g(M) = M \cdot \frac{T_{\text{exp}}}{T_{\text{meas}}(M)} \] A solution \(M^*\) is a fixed point of \(g\) if \(M^* = g(M^*)\), which simplifies to \(1 = T_{\text{exp}} / T_{\text{meas}}(M^*)\), or \(T_{\text{meas}}(M^*) = T_{\text{exp}}\).

The iterative formula, widely used in the 3D printing community, is a practical application of the fixed-point method. It converges rapidly to the correct extrusion multiplier needed to achieve the desired dimensional accuracy.
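To watch the iteration converge without access to a printer, we can run it against a made-up printer response. The model \(T_{\text{meas}}(M) = c\,M^{0.9}\) below is purely an invented stand-in (not measured data); with it, \(g'(M^*) = 0.1\), so convergence is fast:

T_exp = 0.45                        # target wall thickness in mm (illustrative)
c = 0.50                            # hypothetical machine constant
T_meas = lambda M: c * M**0.9       # invented model of the measured thickness

M = 1.0                             # start from the default multiplier
for k in range(6):
    M = M * T_exp / T_meas(M)       # fixed-point update M_{k+1} = g(M_k)
    print(f"step {k+1}: M = {M:.5f}, measured = {T_meas(M):.5f} mm")
# converges towards M* = (T_exp/c)**(1/0.9), about 0.8895 for these numbers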

Newton’s Method

Fixed-point iteration achieves linear convergence when \(|g'(x^*)| < 1\), which is faster than bisection but still requires many iterations for high accuracy. We now introduce Newton’s method, which achieves much faster quadratic convergence by exploiting derivative information. The method is also known as Newton-Raphson’s method after Joseph Raphson, who published a simplified description in 1690, though Newton had developed the underlying ideas earlier.

Newton’s method uses the derivative of \(f\) to construct better approximations at each step. This extra information enables the method to converge rapidly in many cases, though it can also fail completely for certain starting points or functions. We therefore examine both how to use the method effectively and when it might fail.

We assume that \(f(x)=0\) has at least one solution \(x^*\). Starting with an initial guess \(x_0\), we construct the tangent line to \(f(x)\) at \(x_0\):

\[ l(x)-f\left(x_0\right)=f^{\prime}\left(x_0\right)\left(x-x_0\right) . \]

We use this tangent line as a linear approximation to \(f\) near \(x_0\). Where the tangent crosses the x-axis becomes our next guess \(x_1\). Setting \(l(x_1) = 0\) gives:

\[ 0-f\left(x_0\right)=f^{\prime}\left(x_0\right)\left(x_1-x_0\right) \implies x_1=x_0-\frac{f\left(x_0\right)}{f^{\prime}\left(x_0\right)} \]

Repeating this process gives the Newton iteration:

\[ \boxed{ x_{k+1}=x_k-\frac{f\left(x_k\right)}{f^{\prime}\left(x_k\right)} } \tag{7.2.2}\]

The stopping criterion is usually the same as for the previous methods: if \(\left|f\left(x_{k+1}\right)\right|<\delta\) or \(\left|x_{k+1}-x_k\right|<\varepsilon\) is satisfied with some predefined small parameters \(\delta, \varepsilon>0\), then we choose \(x^*=x_{k+1}\).

To illustrate the method, we find a root of \(f(x) = x^2 - 2\). The animation below shows the iterative process: starting from an initial guess \(x_0\), we draw a vertical line to \(f(x_0)\), construct the tangent line at that point, and find where it crosses the x-axis to obtain \(x_1\). Each iteration brings us closer to the root \(x^* = \sqrt{2} \approx 1.414\).

Code
from matplotlib.animation import FuncAnimation
import sympy as sp
import numpy as np
import matplotlib.pyplot as plt
import mechanicskit as mk

# Define function and derivative
x = sp.symbols('x')
f = x**2 - 2
f_prime = sp.diff(f, x)

# Convert to numpy functions
f_numpy = sp.lambdify(x, f, 'numpy')
f_prime_numpy = sp.lambdify(x, f_prime, 'numpy')

x_min, x_max = (0, 10)
y_min, y_max = (-10, 40)

# Newton's method iterations
x0 = 6.5  # Initial guess
x_vals = [x0]
n_iter = 4

x_current = x0
for _ in range(n_iter):
    x_next = x_current - f_numpy(x_current) / f_prime_numpy(x_current)
    x_vals.append(x_next)
    x_current = x_next

# Generate plot data
x_plot = np.linspace(x_min, x_max, 400)
y_plot = f_numpy(x_plot)

# Create animation frames structure
# Each iteration has 4 steps: mark x_i, vertical line, tangent line, mark x_{i+1}
frames_data = []
for i in range(len(x_vals) - 1):
    # Step 0: Mark x_i on x-axis
    frames_data.append({'iter': i, 'step': 0})
    # Step 1: Vertical line to f(x_i)
    frames_data.append({'iter': i, 'step': 1})
    # Step 2: Tangent line at f(x_i)
    frames_data.append({'iter': i, 'step': 2})
    # Step 3: Mark x_{i+1} on x-axis
    frames_data.append({'iter': i, 'step': 3})

# Create animation
fig, ax = plt.subplots(figsize=(10, 6), dpi=150)

def animate(frame_idx):
    ax.clear()

    # Get current frame data
    frame_data = frames_data[frame_idx]
    i = frame_data['iter']
    step = frame_data['step']

    x_curr = x_vals[i]
    x_next = x_vals[i + 1]
    f_curr = f_numpy(x_curr)
    f_prime_curr = f_prime_numpy(x_curr)

    # Problem info box
    problem_text = f"$f(x) = {sp.latex(f)}$"
    ax.text(0.5, 0.98, problem_text,
            transform=ax.transAxes,
            fontsize=12,
            verticalalignment='top',
            horizontalalignment='center',
            bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

    # Plot function
    ax.plot(x_plot, y_plot, 'b-', lw=2, label='$f(x)$')
    ax.axhline(0, color='black', lw=0.5, linestyle='--', alpha=0.5)
    ax.grid(True, alpha=0.3)

    # Step descriptions
    step_desc = [
        f"Mark $x_{i}$ on x-axis",
        f"Draw vertical line to $f(x_{i})$",
        f"Construct tangent line at $(x_{i}, f(x_{i}))$",
        f"Find $x_{i+1}$ where tangent crosses x-axis"
    ]

    # Step 0: Mark x_i on x-axis
    if step >= 0:
        ax.plot(x_curr, 0, 'ro', markersize=10, label=f'$x_{i}$ = {x_curr:.4f}', zorder=5)

    # Step 1: Vertical line from x_curr to f(x_curr)
    if step >= 1:
        ax.plot([x_curr, x_curr], [0, f_curr], 'r--', lw=2, alpha=0.7)
        ax.plot(x_curr, f_curr, 'ro', markersize=10, zorder=5)

    # Step 2: Tangent line
    if step >= 2:
        tangent_x = np.linspace(x_min, x_max, 100)
        tangent_y = f_curr + f_prime_curr * (tangent_x - x_curr)
        ax.plot(tangent_x, tangent_y, 'g-', lw=2, alpha=0.7, label='Tangent line')

    # Step 3: Mark x_{i+1}
    if step >= 3:
        ax.plot(x_next, 0, 'gs', markersize=10, label=f'$x_{i+1}$ = {x_next:.4f}', zorder=5)

    # Info text box
    exact_root = float(sp.nsolve(f, 2))
    error_curr = abs(x_curr - exact_root)
    error_next = abs(x_next - exact_root)

    info_text = (
        f"Iteration {i}, Step {step+1}/4\n"
        f"{step_desc[step]}\n"
        f"\n"
        f"$x_{i} = {x_curr:.6f}$\n"
        f"$f(x_{i}) = {f_curr:.6f}$\n"
        f"$f'(x_{i}) = {f_prime_curr:.6f}$\n"
    )

    if step >= 3:
        info_text += f"\n$x_{i+1} = x_{i} - \\frac{{f(x_{i})}}{{f'(x_{i})}} = {x_next:.6f}$\n"
        info_text += f"Error: ${error_curr:.2e}$ → ${error_next:.2e}$"

    ax.text(0.02, 0.98, info_text,
            transform=ax.transAxes,
            fontsize=9,
            verticalalignment='top',
            horizontalalignment='left',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
            family='monospace')

    ax.set_xlabel('$x$')
    ax.set_ylabel('$f(x)$')
    ax.set_title(f"Newton's Method: Iteration {i}, Step {step+1}/4")
    ax.legend(loc='lower right', fontsize=9)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)

anim = FuncAnimation(fig, animate, frames=len(frames_data), interval=1000, repeat=True)
plt.close()

mk.to_responsive_html(anim, container_id='newton-animation')

The animation demonstrates Newton’s method’s rapid convergence. Each tangent line approximation brings us significantly closer to the root, with the error decreasing quadratically at each step. This quadratic convergence distinguishes Newton’s method from bisection (which has linear convergence) and fixed-point iteration (which typically has linear convergence unless \(g'(x^*) = 0\)).

Example 7.2.3 Find the largest root of \(x^2-x-6=0\) using Newton’s method and compare the result with the exact solution. The roots are \(x_1=-2\) and \(x_2=3\); see the previous examples. Use (7.2.2) with \(x_0=2.5\).

f = lambda x: x**2 - x - 6
f_prime = lambda x: 2*x - 1

x0 = 2.5  # Initial guess
xi = x0
for i in range(5):
    xi = xi - f(xi) / f_prime(xi)
    fi = f(xi)
    error = abs(fi)
    print(f"Iteration {i+1}: x = {xi:0.6f}, f(x) = {fi:0.6f}, error = {error:0.6f}")
    if error < 1e-6:
        break
Iteration 1: x = 3.062500, f(x) = 0.316406, error = 0.316406
Iteration 2: x = 3.000762, f(x) = 0.003812, error = 0.003812
Iteration 3: x = 3.000000, f(x) = 0.000001, error = 0.000001

Error Analysis and Convergence

We now analyze why Newton’s method converges so quickly when it works, and under what conditions convergence is guaranteed. The analysis reveals the method’s quadratic convergence property, the defining characteristic that makes Newton’s method so powerful.

Convergence Theory

Newton’s method can be viewed as a special case of fixed-point iteration. Rearranging (7.2.2), we write:

\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} = g(x_k) \]

where the iteration function is \(g(x) = x - \frac{f(x)}{f'(x)}\). At a root \(x^*\) where \(f(x^*) = 0\), we have \(g(x^*) = x^* - 0/f'(x^*) = x^*\), confirming that \(x^*\) is a fixed point.

The convergence rate depends on \(|g'(x^*)|\). Computing the derivative using the quotient rule:

\[ g'(x) = 1 - \frac{f'(x)^2 - f(x)f''(x)}{f'(x)^2} = \frac{f(x)f''(x)}{f'(x)^2} \]

At the root \(x^*\) where \(f(x^*) = 0\), we obtain:

\[ g'(x^*) = \frac{f(x^*)f''(x^*)}{f'(x^*)^2} = 0 \]

This is remarkable: not only is \(|g'(x^*)| < 1\) (which would guarantee linear convergence), but \(g'(x^*) = 0\) exactly. This leads to quadratic convergence, meaning the error roughly squares at each iteration.
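We can confirm this symbolically for the running example \(f(x)=x^2-x-6\); a short SymPy sketch:

import sympy as sp

x = sp.symbols('x')
f = x**2 - x - 6
g = x - f / sp.diff(f, x)                # Newton iteration function

g_prime = sp.simplify(sp.diff(g, x))     # simplifies to f(x)*f''(x)/f'(x)**2
print(g_prime)
print(g_prime.subs(x, 3))                # 0 at the root x* = 3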

To make this precise, we use Taylor expansion around \(x^*\):

\[ 0 = f(x^*) = f(x_k) + f'(x_k)(x^* - x_k) + \frac{1}{2}f''(\xi)(x^* - x_k)^2 \]

for some \(\xi\) between \(x_k\) and \(x^*\). Dividing by \(f'(x_k)\) and rearranging:

\[ x^* - x_k + \frac{f(x_k)}{f'(x_k)} = -\frac{f''(\xi)}{2f'(x_k)}(x^* - x_k)^2 \]

and with (7.2.2), we have \[ x^* - x_{k+1} = x^* - \left(x_k - \frac{f(x_k)}{f'(x_k)}\right) = x^* - x_k + \frac{f(x_k)}{f'(x_k)}. \] Thus the left side is exactly \(x^* - x_{k+1}\), and taking absolute values gives:

\[ |x_{k+1} - x^*| = \frac{|f''(\xi)|}{2|f'(x_k)|}|x_k - x^*|^2 \]

If \(f'\) and \(f''\) are bounded near \(x^*\) with \(f'(x^*) \neq 0\), then there exists a constant \(C\) such that:

\[ |x_{k+1} - x^*| \leq C|x_k - x^*|^2 \]

This is quadratic convergence: the error at step \(k+1\) is proportional to the square of the error at step \(k\). Once the iterates are close to \(x^*\), the number of correct digits roughly doubles at each step.

Convergence requires a simple root with \(f'(x^*) \neq 0\), smoothness of \(f\) near \(x^*\) so that \(f'\) and \(f''\) exist and are continuous, and an initial guess \(x_0\) sufficiently close to the root.

The third condition is crucial: Newton’s method has only local convergence. Starting far from a root can lead to divergence or cycling, as we demonstrate below.

Comparing Convergence Rates

To see quadratic convergence in action, we compare bisection, fixed-point iteration, and Newton’s method on the same problem: finding the root \(x^* = 3\) of \(f(x) = x^2 - x - 6\).

Code
import numpy as np
import matplotlib.pyplot as plt

# Define the function and derivatives
f = lambda x: x**2 - x - 6
f_prime = lambda x: 2*x - 1
x_star = 3.0  # Exact root

# Bisection method
def bisection(f, a, b, max_iter=50):
    errors = []
    for _ in range(max_iter):
        c = (a + b) / 2
        errors.append(abs(c - x_star))
        if f(a) * f(c) < 0:
            b = c
        else:
            a = c
    return errors

# Fixed-point iteration: x = sqrt(x + 6)
def fixed_point(g, x0, max_iter=50):
    errors = []
    x = x0
    for _ in range(max_iter):
        x = g(x)
        errors.append(abs(x - x_star))
        if errors[-1] < 1e-12:
            break
    return errors

# Newton's method
def newton(f, f_prime, x0, max_iter=50):
    errors = []
    x = x0
    for _ in range(max_iter):
        x = x - f(x) / f_prime(x)
        errors.append(abs(x - x_star))
        if errors[-1] < 1e-12:
            break
    return errors

# Run all three methods
bisection_errors = bisection(f, 1, 4)
fixed_point_errors = fixed_point(lambda x: np.sqrt(x + 6), 2.5)
newton_errors = newton(f, f_prime, 1.0)  # Start further from root for more iterations

# Display first 10 iterations
print("Iteration | Bisection  | Fixed-Point | Newton")
print("-" * 55)
for i in range(min(10, len(bisection_errors))):
    bis_err = bisection_errors[i] if i < len(bisection_errors) else np.nan
    fp_err = fixed_point_errors[i] if i < len(fixed_point_errors) else np.nan
    newt_err = newton_errors[i] if i < len(newton_errors) else np.nan
    print(f"{i:9d} | {bis_err:10.2e} | {fp_err:11.2e} | {newt_err:10.2e}")
Iteration | Bisection  | Fixed-Point | Newton
-------------------------------------------------------
        0 |   5.00e-01 |    8.45e-02 |   4.00e+00
        1 |   2.50e-01 |    1.41e-02 |   1.23e+00
        2 |   1.25e-01 |    2.35e-03 |   2.03e-01
        3 |   6.25e-02 |    3.92e-04 |   7.62e-03
        4 |   3.12e-02 |    6.54e-05 |   1.16e-05
        5 |   1.56e-02 |    1.09e-05 |   2.69e-11
        6 |   7.81e-03 |    1.82e-06 |   0.00e+00
        7 |   3.91e-03 |    3.03e-07 |        nan
        8 |   1.95e-03 |    5.05e-08 |        nan
        9 |   9.77e-04 |    8.41e-09 |        nan

The table shows three distinct behaviors. Bisection reduces the error by a factor of two at each iteration, which is linear convergence with rate 0.5. Fixed-point iteration reduces the error by a roughly constant factor (about \(1/6\) here, matching \(g'(3) = 1/6\)), which is also linear convergence. Newton’s method squares the error once the iterates are close, which is quadratic convergence; it reaches machine precision in just a few iterations for this problem.

We can visualize the convergence order by plotting \(\log(e_{k+1})\) vs \(\log(e_k)\). On such a plot, the slope reveals the convergence order: slope = 1 for linear (\(P_1\)), slope = 2 for quadratic (\(P_2\)).

Code
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), dpi=150)

# Left plot: Error vs iteration (log scale)
ax1.semilogy(range(len(bisection_errors)), bisection_errors, 'b-o',
            label='Bisection', markersize=6, linewidth=2)
ax1.semilogy(range(len(fixed_point_errors)), fixed_point_errors, 'g-s',
            label='Fixed-Point', markersize=6, linewidth=2)
ax1.semilogy(range(len(newton_errors)), newton_errors, 'r-^',
            label="Newton", markersize=6, linewidth=2)

ax1.axhline(np.finfo(float).eps, color='gray', linestyle=':',
           linewidth=1.5, alpha=0.7, label='Machine epsilon')

ax1.set_xlabel('Iteration $k$', fontsize=12)
ax1.set_ylabel('Absolute Error $|x_k - x^*|$', fontsize=12)
ax1.set_title('Error vs Iteration', fontsize=13)
ax1.legend(loc='upper right', fontsize=10)
ax1.grid(True, alpha=0.3, which='both')
ax1.set_xlim(-0.5, 12)

# Right plot: log(e_{k+1}) vs log(e_k) - shows convergence order
# For bisection
bis_ek = np.array(bisection_errors[:-1])
bis_ek1 = np.array(bisection_errors[1:])
# Filter out only errors at machine precision
eps_threshold = np.finfo(float).eps * 10
mask_bis = (bis_ek > eps_threshold) & (bis_ek1 > eps_threshold)
ax2.loglog(bis_ek[mask_bis], bis_ek1[mask_bis], 'bo-',
          label='Bisection', markersize=6, linewidth=2)

# For fixed-point
fp_ek = np.array(fixed_point_errors[:-1])
fp_ek1 = np.array(fixed_point_errors[1:])
mask_fp = (fp_ek > eps_threshold) & (fp_ek1 > eps_threshold)
ax2.loglog(fp_ek[mask_fp], fp_ek1[mask_fp], 'gs-',
          label='Fixed-Point', markersize=6, linewidth=2)

# For Newton
newt_ek = np.array(newton_errors[:-1])
newt_ek1 = np.array(newton_errors[1:])
mask_newt = (newt_ek > eps_threshold) & (newt_ek1 > eps_threshold)
ax2.loglog(newt_ek[mask_newt], newt_ek1[mask_newt], 'r^-',
          label="Newton", markersize=6, linewidth=2)

# Add reference lines for P1 and P2
e_ref = np.logspace(-3, 0, 50)
p1_line = 0.5 * e_ref  # Slope = 1 on log-log
p2_line = 0.5 * e_ref**2  # Slope = 2 on log-log

ax2.loglog(e_ref, p1_line, 'k--', linewidth=2, alpha=0.5, label='$P_1$: slope = 1')
ax2.loglog(e_ref, p2_line, 'k:', linewidth=2, alpha=0.5, label='$P_2$: slope = 2')

ax2.set_xlabel('$|e_k|$', fontsize=12)
ax2.set_ylabel('$|e_{k+1}|$', fontsize=12)
ax2.set_title('Convergence Order Analysis', fontsize=13)
ax2.legend(loc='upper left', fontsize=10)
ax2.grid(True, alpha=0.3, which='both')
ax2.set_xlim(1e-4, 1)
ax2.set_ylim(1e-8, 1)

plt.tight_layout()
plt.show()

Analysis of convergence rates. (a) Absolute error vs. iteration on a log-linear scale, showing Newton’s method reaching machine precision much faster. (b) A log-log plot of successive errors, where the slope indicates the order of convergence (slope=1 for linear, slope=2 for quadratic).

The right plot reveals the convergence orders. Bisection and fixed-point iteration follow the \(P_1\) reference line (slope = 1 on the log-log plot), confirming linear convergence. Newton’s method follows the \(P_2\) reference line (slope = 2), confirming quadratic convergence. This means that for Newton’s method, when the error is \(10^{-2}\), the next error is roughly \((10^{-2})^2 = 10^{-4}\), then \(10^{-8}\), then \(10^{-16}\), reaching machine precision in only a handful of iterations.

This analysis explains why Newton’s method is the default choice for root finding in most scientific computing applications: when it works, it works spectacularly well. However, the “when it works” qualification is important. Newton’s method does not always converge. We now examine two failure modes: cycling and divergence.

Failure Mode 1: Cycling

For certain starting points, Newton’s method can get stuck in an infinite loop, cycling between two or more values without ever reaching the root. Consider \(f(x) = \arctan(x)\), which has a root at \(x^* = 0\). If we start from \(x_0 \approx 1.39175\), the method oscillates between two points indefinitely.

Code
from matplotlib.animation import FuncAnimation
import sympy as sp
import numpy as np
import matplotlib.pyplot as plt
import mechanicskit as mk

# Define function and derivative
x = sp.symbols('x')
f = sp.atan(x)
f_prime = sp.diff(f, x)

# Convert to numpy functions
f_numpy = sp.lambdify(x, f, 'numpy')
f_prime_numpy = sp.lambdify(x, f_prime, 'numpy')

# Find the cycling point numerically
# For arctan, we want x where g(g(x)) = x but g(x) ≠ x
# where g(x) = x - arctan(x)*(1+x^2)
x0 = 1.39174754  # This value creates a 2-cycle

# Newton's method iterations
x_vals = [x0]
n_iter = 6

x_current = x0
for _ in range(n_iter):
    x_next = x_current - f_numpy(x_current) / f_prime_numpy(x_current)
    x_vals.append(x_next)
    x_current = x_next

# Generate plot data
x_plot = np.linspace(-3, 3, 400)
y_plot = f_numpy(x_plot)

# Create animation
fig, ax = plt.subplots(figsize=(10, 6), dpi=150)

def animate(frame):
    ax.clear()

    # Problem info box
    problem_text = r"$f(x) = \arctan(x)$, root at $x^* = 0$"
    ax.text(0.5, 0.98, problem_text,
            transform=ax.transAxes,
            fontsize=12,
            verticalalignment='top',
            horizontalalignment='center',
            bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.8))

    # Plot function
    ax.plot(x_plot, y_plot, 'b-', lw=2, label='$f(x) = \\arctan(x)$')
    ax.axhline(0, color='black', lw=0.5, linestyle='--', alpha=0.5)
    ax.axvline(0, color='black', lw=0.5, linestyle='--', alpha=0.5)
    ax.grid(True, alpha=0.3)

    # Show all iterations up to current frame
    for i in range(frame + 1):
        if i < len(x_vals) - 1:
            x_curr = x_vals[i]
            x_next = x_vals[i + 1]
            f_curr = f_numpy(x_curr)
            f_prime_curr = f_prime_numpy(x_curr)

            # Vertical line from x_curr to f(x_curr)
            alpha_val = 0.3 if i < frame else 1.0
            ax.plot([x_curr, x_curr], [0, f_curr], 'r--', lw=1.5, alpha=alpha_val)

            # Point on curve
            ax.plot(x_curr, f_curr, 'ro', markersize=8, alpha=alpha_val, zorder=5)

            # Tangent line
            tangent_x = np.linspace(-3, 3, 100)
            tangent_y = f_curr + f_prime_curr * (tangent_x - x_curr)
            ax.plot(tangent_x, tangent_y, 'g-', lw=1.5, alpha=alpha_val * 0.5)

            # Mark x values on axis
            ax.plot(x_curr, 0, 'ro', markersize=8, alpha=alpha_val, zorder=5)
            if i < frame:
                ax.plot(x_next, 0, 'gs', markersize=8, alpha=alpha_val, zorder=5)

    # Highlight current iteration
    if frame < len(x_vals) - 1:
        x_curr = x_vals[frame]
        x_next = x_vals[frame + 1]
        ax.plot(x_next, 0, 'gs', markersize=10, label=f'$x_{frame+1}$ = {x_next:.5f}', zorder=6)

    # Info text box
    info_text = f"Iteration {frame}\n"
    if frame < len(x_vals) - 1:
        info_text += f"$x_{frame} = {x_vals[frame]:.6f}$\n"
        info_text += f"$x_{frame+1} = {x_vals[frame+1]:.6f}$\n"
        if frame >= 2:
            info_text += f"\nCycling between:\n"
            info_text += f"{x_vals[0]:.6f} and {x_vals[1]:.6f}"

    ax.text(0.02, 0.98, info_text,
            transform=ax.transAxes,
            fontsize=9,
            verticalalignment='top',
            horizontalalignment='left',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
            family='monospace')

    ax.set_xlabel('$x$')
    ax.set_ylabel('$f(x)$')
    ax.set_title(f"Newton's Method Failure: Cycling (Iteration {frame})")
    ax.legend(loc='upper left', fontsize=9)
    ax.set_xlim(-3, 3)
    ax.set_ylim(-1.6, 1.6)

anim = FuncAnimation(fig, animate, frames=len(x_vals)-1, interval=1200, repeat=True)
plt.close()

mk.to_responsive_html(anim, container_id='newton-cycling')

The animation shows the method trapped in a 2-cycle, oscillating between approximately \(x \approx 1.39175\) and \(x \approx -1.39175\), never approaching the root at \(x^* = 0\). This occurs because the starting point lies on a 2-periodic orbit of the Newton iteration map.
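The special starting value is not arbitrary. A symmetric 2-cycle with \(x_1 = -x_0\) requires the Newton update to satisfy \(x_0 - \arctan(x_0)(1+x_0^2) = -x_0\), that is, \((1+x_0^2)\arctan(x_0) = 2x_0\). Solving this condition numerically (a SymPy sketch) recovers the value used in the animation:

import sympy as sp

x = sp.symbols('x')
cycle_eq = (1 + x**2) * sp.atan(x) - 2 * x   # condition for a symmetric 2-cycle

x0_cycle = sp.nsolve(cycle_eq, 1.4)          # start away from the trivial solution x = 0
print(x0_cycle)                              # approximately 1.3917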

Failure Mode 2: Divergence

Newton’s method can also diverge, sending iterates progressively farther from the root. This happens when the tangent line has a very gentle slope, causing the next iterate to overshoot dramatically. A classic example is \(f(x) = x^{1/3}\) with root at \(x^* = 0\).

# Demonstrate divergence with f(x) = x^(1/3)
x = sp.symbols('x', real=True)
f = x**(sp.Rational(1, 3))
f_prime = sp.diff(f, x)

# Create custom cube root that handles negative numbers properly
def cbrt(x):
    """Real cube root that works for negative numbers."""
    return np.sign(x) * np.abs(x)**(1/3)

def cbrt_derivative(x):
    """Derivative of real cube root."""
    return 1 / (3 * np.abs(x)**(2/3))

# Newton's method from x_0 = 1
x_vals_div = [1.0]
for _ in range(4):
    x_curr = x_vals_div[-1]
    x_next = x_curr - cbrt(x_curr) / cbrt_derivative(x_curr)
    x_vals_div.append(x_next)

print("Newton's method on f(x) = x^(1/3) starting from x_0 = 1:")
for i, x_val in enumerate(x_vals_div):
    print(f"x_{i} = {x_val:10.4f}")
Newton's method on f(x) = x^(1/3) starting from x_0 = 1:
x_0 =     1.0000
x_1 =    -2.0000
x_2 =     4.0000
x_3 =    -8.0000
x_4 =    16.0000

Notice the iterates diverge: \(x_0 = 1\), \(x_1 = -2\), \(x_2 = 4\), \(x_3 = -8\), doubling in magnitude (with alternating sign) at each step. The issue is that \(f'(0) = \infty\) (vertical tangent at the root), violating Newton’s method’s assumptions.
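The doubling has a simple closed form. With \(f(x) = x^{1/3}\) and \(f'(x) = \tfrac{1}{3}x^{-2/3}\), the Newton update becomes

\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} = x_k - \frac{x_k^{1/3}}{\tfrac{1}{3}x_k^{-2/3}} = x_k - 3x_k = -2x_k, \]

so every step doubles the distance from the root and flips the sign, exactly as the printed iterates show.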

We can visualize the first few iterations:

Code
fig, ax = plt.subplots(figsize=(10, 6), dpi=150)

# Plot function using the real cube root
x_plot = np.linspace(-5, 5, 400)
x_plot = x_plot[x_plot != 0]  # Remove zero to avoid issues
y_plot = cbrt(x_plot)

ax.plot(x_plot, y_plot, 'b-', lw=2, label='$f(x) = x^{1/3}$')
ax.axhline(0, color='black', lw=0.5, linestyle='--', alpha=0.5)
ax.axvline(0, color='red', lw=2, linestyle='--', alpha=0.7, label='Root at $x^* = 0$')
ax.grid(True, alpha=0.3)

# Show first 3 Newton iterations
colors = ['red', 'orange', 'purple']
for i in range(min(3, len(x_vals_div)-1)):
    x_curr = x_vals_div[i]
    x_next = x_vals_div[i+1]
    f_curr = cbrt(x_curr)
    f_prime_curr = cbrt_derivative(x_curr)

    # Vertical line
    ax.plot([x_curr, x_curr], [0, f_curr], '--', color=colors[i], lw=1.5, alpha=0.7)

    # Point on curve
    ax.plot(x_curr, f_curr, 'o', color=colors[i], markersize=10, zorder=5,
            label=f'$x_{i}$ = {x_curr:.1f}')

    # Tangent line
    if abs(x_next) < 10:  # Don't draw if too far
        tangent_x = np.linspace(max(-5, x_curr-3), min(5, x_curr+3), 100)
        tangent_y = f_curr + f_prime_curr * (tangent_x - x_curr)
        ax.plot(tangent_x, tangent_y, '-', color=colors[i], lw=1.5, alpha=0.5)

    # Next point on x-axis
    if abs(x_next) < 5:
        ax.plot(x_next, 0, 's', color=colors[i], markersize=10, zorder=5)

ax.set_xlabel('$x$')
ax.set_ylabel('$f(x)$')
ax.set_title("Newton's Method Failure: Divergence on $f(x) = x^{1/3}$")
ax.legend(loc='upper left', fontsize=9)
ax.set_xlim(-5, 5)
ax.set_ylim(-2, 2)

plt.tight_layout()
plt.show()

Failure mode of Newton’s method: divergence. For \(f(x) = x^{1/3}\), the iterates move progressively farther from the root at \(x=0\) due to the vertical tangent at the root.

The plot shows how each tangent line sends us farther from the root. The steep curve near the root (approaching a vertical tangent) causes the tangent line to intersect the x-axis far away, and the iterates diverge to \(\pm\infty\).

These examples illustrate why Newton’s method requires careful initialization and why bisection, despite its slower convergence, is sometimes preferred for its guaranteed convergence.

Multidimensional Newton’s Method

Having established Newton’s method for scalar equations, we now extend it to systems of nonlinear equations. Many engineering problems require solving multiple coupled nonlinear equations simultaneously, from steady-state chemical reactor design to structural equilibrium under nonlinear material behavior. The generalization to higher dimensions preserves the core Newton idea while replacing derivatives with matrices.

Consider a system of \(n\) nonlinear equations in \(n\) unknowns, which we write compactly as \(\mathbf{f}(\mathbf{x}) = \mathbf{0}\), where \(\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^n\) and \(\mathbf{x} \in \mathbb{R}^n\). Expanded componentwise, this represents:

\[ \mathbf{f}(\mathbf{x})=\mathbf{0} \Leftrightarrow\left\{\begin{array}{c} f_1\left(x_1, x_2, \ldots, x_n\right)=0 \\ f_2\left(x_1, x_2, \ldots, x_n\right)=0 \\ \vdots \\ f_n\left(x_1, x_2, \ldots, x_n\right)=0 \end{array}\right. \]

The multidimensional generalization follows directly from the scalar case. Recall that Newton’s method for a single equation \(f(x) = 0\) uses the iteration \(x_{k+1} = x_k - f(x_k)/f'(x_k)\), which we can rearrange as \(f'(x_k)(x_{k+1} - x_k) = -f(x_k)\). This form reveals the essential structure: we solve a linear equation for the update \(\Delta x = x_{k+1} - x_k\).

For the vector case, the derivative \(f'(x)\) generalizes to the Jacobian matrix \(\mathbf{J}(\mathbf{x})\), whose entries are the partial derivatives \(J_{ij} = \frac{\partial f_i}{\partial x_j}\). The Jacobian captures how each component of \(\mathbf{f}\) responds to changes in each component of \(\mathbf{x}\). Just as the scalar derivative measures the local slope, the Jacobian measures the local linear approximation to the vector function.

Notation: We now switch from subscript notation \(x_k\) (used for scalar iterations) to superscript notation \(\mathbf{x}^{(k)}\) for the iteration counter. This avoids confusion with vector component indices \(x_i\). The parentheses distinguish iteration numbers from exponents: \(\mathbf{x}^{(k)}\) denotes the \(k\)-th iterate, while \(x_i^{(k)}\) denotes the \(i\)-th component of the \(k\)-th iterate.

The multidimensional Newton iteration becomes:

\[ \mathbf{J}\left(\mathbf{x}^{(k)}\right)\left(\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\right)=-\mathbf{f}\left(\mathbf{x}^{(k)}\right) \tag{7.2.3}\]

where \(k\) denotes the iteration counter. At each step, we solve this linear system for the update vector \(\Delta \mathbf{x} = \mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}\), then compute the new iterate \(\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \Delta \mathbf{x}\). The solution of (7.2.3) uses Gaussian elimination or LU factorization, never matrix inversion, which is both computationally expensive and numerically unstable.

The convergence properties from the scalar case extend to the vector setting. When started sufficiently close to a solution \(\mathbf{x}^*\) where \(\mathbf{J}(\mathbf{x}^*)\) is nonsingular, the method converges quadratically. This means the error roughly squares at each iteration, just as in the one-dimensional case. However, the method can fail if the Jacobian becomes singular or nearly singular during iteration, corresponding to the scalar case where \(f'(x) \approx 0\).
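To make the iteration (7.2.3) concrete, here is a minimal sketch of a general Newton solver in NumPy. The function newton_system and its interface are our own illustration, not a library routine; it assumes the caller supplies callables for \(\mathbf{f}\) and \(\mathbf{J}\) together with a starting vector.

import numpy as np

def newton_system(f, J, x0, tol=1e-12, max_iter=50):
    """Sketch of the multidimensional Newton iteration (7.2.3)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        fx = np.asarray(f(x), dtype=float)
        Jx = np.asarray(J(x), dtype=float)
        # Solve J(x) dx = -f(x); never form the inverse explicitly
        dx = np.linalg.solve(Jx, -fx)
        x = x + dx
        if np.linalg.norm(dx) < tol:
            return x, k + 1
    return x, max_iter

# Example usage on the system studied in Example 7.2.4 below
f = lambda x: np.array([x[0] + x[1]**2 - 2, x[0]*x[1] + x[1] - 1])
J = lambda x: np.array([[1.0, 2*x[1]], [x[1], x[0] + 1.0]])
x_sol, n_iter = newton_system(f, J, [0.0, 2.0])
print(x_sol, n_iter)

Note that np.linalg.solve raises an error for an exactly singular Jacobian, while a nearly singular one produces huge, unreliable updates: the multidimensional analogue of the \(f'(x) \approx 0\) failure mode.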

For overdetermined systems, where \(\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m\) has more equations than unknowns (\(m > n\)) and the Jacobian is a rectangular \(m \times n\) matrix, we cannot generally satisfy all equations simultaneously. In this case, we seek a least-squares solution by minimizing \(\|\mathbf{f}(\mathbf{x})\|^2\). The Newton-type iteration for this problem uses the normal equations:

\[ \mathbf{J}\left(\mathbf{x}^{(k)}\right)^{\mathrm{T}} \mathbf{J}\left(\mathbf{x}^{(k)}\right)\left(\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\right)=-\mathbf{J}\left(\mathbf{x}^{(k)}\right)^{\mathrm{T}} \mathbf{f}\left(\mathbf{x}^{(k)}\right) \]

This formulation multiplies both sides of (7.2.3) by \(\mathbf{J}^T\), producing a square system that can be solved for the update. The resulting method is known as the Gauss-Newton algorithm for nonlinear least squares.
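As an illustration (our own helper, not a library function), a single Gauss-Newton step can be written directly from the normal equations; in practice a QR-based least-squares solve of \(\mathbf{J}\,\Delta\mathbf{x} = -\mathbf{f}\) is preferred for conditioning, but the structure is the same.

import numpy as np

def gauss_newton_step(f, J, x):
    """One Gauss-Newton update for minimizing ||f(x)||^2 with a rectangular Jacobian."""
    fx = np.asarray(f(x), dtype=float)
    Jx = np.asarray(J(x), dtype=float)
    # Normal equations: (J^T J) dx = -J^T f
    dx = np.linalg.solve(Jx.T @ Jx, -Jx.T @ fx)
    return x + dx

# Toy overdetermined system: three equations, two unknowns (cannot all hold exactly)
f = lambda x: np.array([x[0] + x[1] - 2, x[0] - x[1], x[0]*x[1] - 0.9])
J = lambda x: np.array([[1.0, 1.0], [1.0, -1.0], [x[1], x[0]]])
x = np.array([2.0, 0.5])
for _ in range(10):
    x = gauss_newton_step(f, J, x)
print(x, f(x))   # small but nonzero residual at the least-squares solution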

Example 7.2.4 Study the system of nonlinear equations

\[ \left\{\begin{array}{l} x_1+x_2^2=2 \\ x_1 x_2+x_2=1 \end{array} \right. \]

Code
# Let's create a parametric plot to visualize the system of equations
import numpy as np
import matplotlib.pyplot as plt

# Define the functions
def f1(x1, x2):
    return x1 + x2**2 - 2
def f2(x1, x2):
    return x1 * x2 + x2 - 1
# Create a grid of x1 and x2 values
x1_vals = np.linspace(-4, 4, 400)
x2_vals = np.linspace(-4, 4, 400)
X1, X2 = np.meshgrid(x1_vals, x2_vals)
Z1 = f1(X1, X2)
Z2 = f2(X1, X2)
# Plot the contours where f1=0 and f2=0
plt.figure(figsize=(8, 8), dpi=150)
contour1 = plt.contour(X1, X2, Z1, levels=[0], colors='blue', linewidths=2)
contour2 = plt.contour(X1, X2, Z2, levels=[0], colors='orange', linewidths=2)
plt.title('Contour Plot of the System of Nonlinear Equations', fontsize=14)
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.axhline(0, color='black', lw=0.5, linestyle='--', alpha=0.5)
plt.axvline(0, color='black', lw=0.5, linestyle='--', alpha=0.5)
plt.grid(True, alpha=0.3)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
# Proxy artists so the two contour lines get correct legend entries
plt.plot([], [], color='blue', lw=2, label='$f_1(x_1, x_2) = 0$')
plt.plot([], [], color='orange', lw=2, label='$f_2(x_1, x_2) = 0$')
plt.legend(loc='upper right', fontsize=14)
plt.show()

# Solve the system using SciPy's root finder (method='hybr', a modified Powell hybrid method)
import scipy.optimize
def system(vars):
    x1, x2 = vars
    return [x1 + x2**2 - 2,
            x1 * x2 + x2 - 1]
initial_guess = [[0.0, 2.0],
                 [2.0, 0.0],
                 [-1.0, 2.0]]

for guess in initial_guess:
    solution = scipy.optimize.root(system, guess, method='hybr')
    x1_sol, x2_sol = solution.x
    print(f"Initial guess: {guess} => Solution found: x1 = {x1_sol:.6f}, x2 = {x2_sol:.6f}")    
Initial guess: [0.0, 2.0] => Solution found: x1 = -0.347296, x2 = 1.532089
Initial guess: [2.0, 0.0] => Solution found: x1 = 1.879385, x2 = 0.347296
Initial guess: [-1.0, 2.0] => Solution found: x1 = -0.347296, x2 = 1.532089

We seek the first root, starting from the point \((0,2)\). We first define the equations \(\mathbf{f}\), then the Jacobian matrix \(\mathbf{J}\), and finally run a few Newton iterations explicitly.

import numpy as np
import sympy as sp
import mechanicskit as mk
x1, x2 = sp.symbols('x1 x2', real=True)

f_sym = sp.Matrix([x1 + x2**2 - 2,
                   x1 * x2 + x2 - 1])

J_sym = f_sym.jacobian([x1, x2])
mk.display_labeled_latex("J(x_1, x_2) = ", J_sym)

\[ J(x_1, x_2) = {\def\arraystretch{1.5}\left[\begin{matrix}1 & 2 x_{2}\\x_{2} & x_{1} + 1\end{matrix}\right]} \]

vars = (x1, x2)
f_num = sp.lambdify(vars, f_sym, 'numpy')
J_num = sp.lambdify(vars, J_sym, 'numpy')


# Initial values
xk = np.array([0.0, 2.0], dtype=float)

for k in range(5):
    f_vec = np.array(f_num(*xk)).ravel()  # lambdified Matrix returns a (2, 1) column vector
    J_mat = np.array(J_num(*xk))

    # Solve J * delta = -f for the full update vector
    delta = np.linalg.solve(J_mat, -f_vec)

    x_new = xk + delta
    
    x_str = np.array2string(x_new, formatter={'float_kind': lambda x: f"{x:8.4f}"})
    
    eps = np.linalg.norm(x_new - xk)
    print(f"Iteration {k}:\t x = {x_str},\t ||delta|| = {eps:.4e}")
    if eps < 1e-12:
        xk = x_new
        break
    xk = x_new

print(f"Converged in {k} iterations.")
print(f"Result: {xk}")
Iteration 0:     x = [ -0.2857   1.5714],    ||delta|| = 5.1508e-01
Iteration 1:     x = [ -0.3458   1.5321],    ||delta|| = 7.1780e-02
Iteration 2:     x = [ -0.3473   1.5321],    ||delta|| = 1.5407e-03
Iteration 3:     x = [ -0.3473   1.5321],    ||delta|| = 2.7163e-09
Iteration 4:     x = [ -0.3473   1.5321],    ||delta|| = 2.2204e-16
Converged in 4 iterations.
Result: [-0.34729636  1.53208889]

Newton’s method video

Newton’s fractal: a video on the fractal patterns that arise when Newton’s method is used for root finding in the complex plane.

The Secant Method

Newton’s method delivers rapid quadratic convergence but requires evaluating the derivative \(f'(x)\) at each iteration. For many functions, computing derivatives is expensive, inconvenient, or impossible. The secant method addresses this limitation by approximating the derivative using finite differences, eliminating the need for analytical derivatives while retaining superlinear convergence.

Derivation and Geometric Interpretation

We replace the derivative \(f'(x_k)\) in Newton’s formula with a finite difference approximation using the two most recent points. If we have iterates \(x_{k-1}\) and \(x_k\), we approximate the slope by:

\[ f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}} \]

Substituting this approximation into Newton’s formula \(x_{k+1} = x_k - f(x_k)/f'(x_k)\) yields the secant method:

\[ \boxed{ x_{k+1} = x_k - f(x_k) \frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})} } \tag{7.2.4}\]

Geometrically, we construct a straight line (the secant) through the points \((x_{k-1}, f(x_{k-1}))\) and \((x_k, f(x_k))\). The next approximation \(x_{k+1}\) is where this secant line crosses the x-axis. We then discard the oldest point \(x_{k-1}\) and repeat using points \(x_k\) and \(x_{k+1}\) to find \(x_{k+2}\).

To see this geometrically, write the equation of the line passing through \((x_{k-1}, f(x_{k-1}))\) and \((x_k, f(x_k))\):

\[ y - f(x_k) = \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}(x - x_k) \]

Setting \(y = 0\) and \(x = x_{k+1}\) gives:

\[ -f(x_k) = \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}(x_{k+1} - x_k) \]

Solving for \(x_{k+1}\) recovers (7.2.4). As iterations progress, successive points converge toward the root, and the secant line increasingly resembles the tangent line. In the limit, the secant method behaves like Newton’s method.

Example 7.2.5 (Secant Method) We apply the secant method to find the largest root of \(f(x) = x^2 - x - 6 = 0\), using initial guesses \(x_0 = 1\) and \(x_1 = 2.5\).

f = lambda x: x**2 - x - 6

x0 = 1.0
x1 = 2.5
for i in range(10):
    f0 = f(x0)
    f1 = f(x1)
    
    # Avoid division by zero
    if (f1 - f0) == 0:
        print("f(x1) - f(x0) is zero, secant method fails.")
        break
        
    x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
    
    error = abs(f(x2))
    print(f"Iteration {i+1}: x = {x2:0.6f}, f(x) = {f(x2):0.6f}, error = {error:0.6f}")

    if error < 1e-6:
        break
        
    # Update points for next iteration
    x0 = x1
    x1 = x2
Iteration 1: x = 3.400000, f(x) = 2.160000, error = 2.160000
Iteration 2: x = 2.959184, f(x) = -0.202416, error = 0.202416
Iteration 3: x = 2.996954, f(x) = -0.015223, error = 0.015223
Iteration 4: x = 3.000025, f(x) = 0.000125, error = 0.000125
Iteration 5: x = 3.000000, f(x) = -0.000000, error = 0.000000

Convergence Analysis

The secant method exhibits superlinear convergence, a rate intermediate between the linear convergence of bisection and the quadratic convergence of Newton’s method. Analysis shows that the convergence order is the golden ratio \(p = \phi = (1 + \sqrt{5})/2 \approx 1.618\). See e.g., [1] and [2] for a deeper analysis. This means that asymptotically, the error satisfies:

\[ |x_{k+1} - x^*| \approx C |x_k - x^*|^{1.618} \]

for some constant \(C\) depending on \(f\) and \(x^*\). Practically, this means the number of correct digits grows by roughly a factor of 1.6 at each iteration once the method is close to the root.
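We can check this numerically. The sketch below regenerates the secant iterates for \(f(x) = x^2 - x - 6\) from Example 7.2.5 and estimates the order from three consecutive errors via \(p \approx \ln(e_{k+1}/e_k)/\ln(e_k/e_{k-1})\); once the iterates are close to the root, the estimates hover around 1.6.

import numpy as np

f = lambda x: x**2 - x - 6
x_star = 3.0   # known root, used only to measure the error

# Secant iterates as in Example 7.2.5
x0, x1 = 1.0, 2.5
errors = []
for _ in range(8):
    if f(x1) == f(x0):
        break
    x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
    errors.append(abs(x2 - x_star))
    x0, x1 = x1, x2
    if errors[-1] == 0:
        break

# Estimate the convergence order from consecutive error triples
for k in range(1, len(errors) - 1):
    if min(errors[k - 1], errors[k], errors[k + 1]) > 0:
        p = np.log(errors[k + 1] / errors[k]) / np.log(errors[k] / errors[k - 1])
        print(f"k = {k}: estimated order p = {p:.3f}")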

The superlinear rate arises because the secant method uses information from two previous points rather than one. While not as fast as Newton’s quadratic convergence (where correct digits double at each step), the secant method often requires fewer total function evaluations. Newton’s method needs both \(f(x_k)\) and \(f'(x_k)\) at each iteration, while the secant method needs only \(f(x_k)\) since it reuses \(f(x_{k-1})\) from the previous iteration.

Convergence requires similar conditions to Newton’s method. Starting sufficiently close to a simple root where \(f'(x^*) \neq 0\) ensures convergence. However, like Newton’s method, the secant method has only local convergence. Poor initial guesses can lead to divergence or very slow convergence. Unlike bisection, the method does not maintain a bracketing interval, so there is no guarantee that iterates stay near the root.

The secant method extends to multidimensional problems by replacing analytical Jacobian matrices with finite-difference approximations. However, the book-keeping becomes more complex, as we must approximate each column of the Jacobian using differences of function evaluations. Methods like Broyden’s method provide more sophisticated quasi-Newton approaches that build approximate Jacobians iteratively without computing all partial derivatives, see e.g. [3].
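As a brief illustration, SciPy exposes Broyden-type quasi-Newton solvers through the same scipy.optimize.root interface used earlier; the sketch below applies one to the system from Example 7.2.4 without supplying any Jacobian.

from scipy.optimize import root

def system(x):
    x1, x2 = x
    return [x1 + x2**2 - 2,
            x1 * x2 + x2 - 1]

# Broyden's "good" method builds an approximate Jacobian from the iterates themselves
sol = root(system, [0.0, 2.0], method='broyden1')
print(sol.x)   # should agree with the root found earlier with method='hybr'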

Numerical Comparison of All Methods

Having introduced four distinct root-finding methods, we now compare their performance on a common test problem. This comparison reveals the trade-offs between guaranteed convergence, convergence speed, and computational cost.

We solve \(f(x) = x^2 - x - 6 = 0\) for the root \(x^* = 3\) using all four methods: bisection, fixed-point iteration, Newton’s method, and the secant method. Each method starts from comparable initial conditions and runs until achieving a tolerance of \(10^{-12}\).

Code
import numpy as np

# Define the problem
f = lambda x: x**2 - x - 6
f_prime = lambda x: 2*x - 1
g = lambda x: np.sqrt(x + 6)  # Fixed-point function
x_star = 3.0  # Exact root

# Method 1: Bisection
def bisection(f, a, b, tol=1e-12, max_iter=100):
    iterations = []
    for k in range(max_iter):
        c = (a + b) / 2
        error = abs(c - x_star)
        iterations.append({'k': k+1, 'x': c, 'error': error})
        if b - a < tol:
            break
        if f(a) * f(c) < 0:
            b = c
        else:
            a = c
    return iterations

# Method 2: Fixed-point iteration
def fixed_point(g, x0, tol=1e-12, max_iter=100):
    iterations = []
    x = x0
    for k in range(max_iter):
        x = g(x)
        error = abs(x - x_star)
        iterations.append({'k': k+1, 'x': x, 'error': error})
        if error < tol:
            break
    return iterations

# Method 3: Newton's method
def newton(f, f_prime, x0, tol=1e-12, max_iter=100):
    iterations = []
    x = x0
    for k in range(max_iter):
        x = x - f(x) / f_prime(x)
        error = abs(x - x_star)
        iterations.append({'k': k+1, 'x': x, 'error': error})
        if error < tol:
            break
    return iterations

# Method 4: Secant method
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    iterations = []
    for k in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if abs(f1 - f0) < 1e-15:
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        error = abs(x2 - x_star)
        iterations.append({'k': k+1, 'x': x2, 'error': error})
        if error < tol:
            break
        x0, x1 = x1, x2
    return iterations

# Run all methods
bisection_iters = bisection(f, 1, 4)
fixed_point_iters = fixed_point(g, 2.5)
newton_iters = newton(f, f_prime, 2.5)
secant_iters = secant(f, 1.0, 2.5)

# Print comparison table
print("Iteration | Bisection   | Fixed-Point | Newton      | Secant")
print("-" * 70)
max_len = max(len(bisection_iters), len(fixed_point_iters),
              len(newton_iters), len(secant_iters))

for i in range(min(15, max_len)):
    bis = bisection_iters[i]['error'] if i < len(bisection_iters) else None
    fp = fixed_point_iters[i]['error'] if i < len(fixed_point_iters) else None
    newt = newton_iters[i]['error'] if i < len(newton_iters) else None
    sec = secant_iters[i]['error'] if i < len(secant_iters) else None

    def fmt(val):
        return f"{val:11.2e}" if val is not None else "     ---    "

    print(f"{i+1:9d} | {fmt(bis)} | {fmt(fp)} | {fmt(newt)} | {fmt(sec)}")

print(f"\nTotal iterations to reach tolerance {1e-12}:")
print(f"  Bisection:     {len(bisection_iters)}")
print(f"  Fixed-point:   {len(fixed_point_iters)}")
print(f"  Newton:        {len(newton_iters)}")
print(f"  Secant:        {len(secant_iters)}")
Iteration | Bisection   | Fixed-Point | Newton      | Secant
----------------------------------------------------------------------
        1 |    5.00e-01 |    8.45e-02 |    6.25e-02 |    4.00e-01
        2 |    2.50e-01 |    1.41e-02 |    7.62e-04 |    4.08e-02
        3 |    1.25e-01 |    2.35e-03 |    1.16e-07 |    3.05e-03
        4 |    6.25e-02 |    3.92e-04 |    2.66e-15 |    2.51e-05
        5 |    3.12e-02 |    6.54e-05 |      ---     |    1.53e-08
        6 |    1.56e-02 |    1.09e-05 |      ---     |    7.68e-14
        7 |    7.81e-03 |    1.82e-06 |      ---     |      ---    
        8 |    3.91e-03 |    3.03e-07 |      ---     |      ---    
        9 |    1.95e-03 |    5.05e-08 |      ---     |      ---    
       10 |    9.77e-04 |    8.41e-09 |      ---     |      ---    
       11 |    4.88e-04 |    1.40e-09 |      ---     |      ---    
       12 |    2.44e-04 |    2.34e-10 |      ---     |      ---    
       13 |    1.22e-04 |    3.89e-11 |      ---     |      ---    
       14 |    6.10e-05 |    6.49e-12 |      ---     |      ---    
       15 |    3.05e-05 |    1.08e-12 |      ---     |      ---    

Total iterations to reach tolerance 1e-12:
  Bisection:     43
  Fixed-point:   16
  Newton:        4
  Secant:        6

The table shows the dramatic differences in convergence speed. Bisection makes steady progress, halving the error at each step. Fixed-point iteration converges linearly with a rate determined by \(|g'(x^*)|\). Newton’s method achieves quadratic convergence, doubling the number of correct digits once close to the root. The secant method converges superlinearly with rate \(\phi \approx 1.618\), faster than linear methods but slower than Newton.

We can verify the convergence orders of all four methods through log-log plots of successive errors. On such a plot, the slope reveals the convergence order directly: linear convergence gives slope 1, the secant method’s superlinear convergence gives slope \(\phi \approx 1.618\), and quadratic convergence gives slope 2.

Code
import numpy as np
import matplotlib.pyplot as plt

# Define the problem
f = lambda x: x**2 - x - 6
f_prime = lambda x: 2*x - 1
g = lambda x: np.sqrt(x + 6)  # Fixed-point function
x_star = 3.0

# Bisection method with error tracking
def bisection_with_errors(f, a, b, max_iter=50):
    errors = []
    for k in range(max_iter):
        c = (a + b) / 2
        error = abs(c - x_star)
        errors.append(error)
        if b - a < 1e-14:
            break
        if f(a) * f(c) < 0:
            b = c
        else:
            a = c
    return errors

# Fixed-point iteration with error tracking
def fixed_point_with_errors(g, x0, max_iter=50):
    errors = []
    x = x0
    for k in range(max_iter):
        x = g(x)
        error = abs(x - x_star)
        errors.append(error)
        if error < 1e-14:
            break
    return errors

# Newton's method with error tracking
def newton_with_errors(f, f_prime, x0, max_iter=20):
    errors = []
    x = x0
    for k in range(max_iter):
        x = x - f(x) / f_prime(x)
        error = abs(x - x_star)
        errors.append(error)
        if error < 1e-14:
            break
    return errors

# Secant method with error tracking
def secant_with_errors(f, x0, x1, max_iter=20):
    errors = []
    for k in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if abs(f1 - f0) < 1e-15:
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        error = abs(x2 - x_star)
        errors.append(error)
        if error < 1e-14:
            break
        x0, x1 = x1, x2
    return errors

# Run all methods (Newton starts farther from root to show more iterations)
bisection_errors = bisection_with_errors(f, 1, 4)
fixed_point_errors = fixed_point_with_errors(g, 2.5)
newton_errors = newton_with_errors(f, f_prime, 4.0)  # Start farther from root
secant_errors = secant_with_errors(f, 1.0, 2.5)

# Create log-log plot
fig, ax = plt.subplots(figsize=(10, 7), dpi=150)

eps_threshold = np.finfo(float).eps * 10  # Lower threshold to capture more points

# Plot each method
def plot_method(errors, label, color, marker):
    e_k = np.array(errors[:-1])
    e_k1 = np.array(errors[1:])
    mask = (e_k > eps_threshold) & (e_k1 > eps_threshold)
    if np.sum(mask) > 0:
        ax.loglog(e_k[mask], e_k1[mask], marker=marker, color=color,
                  markersize=7, linewidth=2.5, label=label, alpha=0.8, zorder=3)

plot_method(bisection_errors, 'Bisection', 'blue', 'o')
plot_method(fixed_point_errors, 'Fixed-Point', 'green', 's')
plot_method(newton_errors, 'Newton', 'red', '^')
plot_method(secant_errors, 'Secant', 'magenta', 'd')

# Add reference lines for convergence orders
e_ref = np.logspace(-7, -1, 50)
p1_line = 0.5 * e_ref           # Linear (slope = 1)
p_golden = 0.2 * e_ref**1.618   # Golden ratio (slope = 1.618)
p2_line = 0.2 * e_ref**2        # Quadratic (slope = 2)

ax.loglog(e_ref, p1_line, 'k--', linewidth=1.5, alpha=0.4,
          label='P1: slope = 1 (linear)')
ax.loglog(e_ref, p_golden, 'k-.', linewidth=1.5, alpha=0.6,
          label=f'P$\\phi$: slope = 1.618 (superlinear)')
ax.loglog(e_ref, p2_line, 'k:', linewidth=2, alpha=0.5,
          label='P2: slope = 2 (quadratic)')

ax.set_xlabel('$|e_k|$ (error at iteration $k$)', fontsize=13)
ax.set_ylabel('$|e_{k+1}|$ (error at iteration $k+1$)', fontsize=13)
ax.set_title('Convergence Order Analysis: All Methods', fontsize=14, fontweight='bold')
ax.legend(loc='upper left', fontsize=10, ncol=2)
ax.grid(True, alpha=0.3, which='both')
ax.set_xlim(1e-8, 1e-1)
ax.set_ylim(1e-15, 1e-1)

plt.tight_layout()
plt.show()

Log-log plot of successive errors for all four root-finding methods. The slope of each curve reveals its convergence order: bisection and fixed-point follow the linear reference (slope = 1), secant follows the golden ratio line (slope ≈ 1.618), and Newton follows the quadratic reference (slope = 2).

The log-log plot reveals the dramatic differences in convergence orders. Bisection and fixed-point iteration follow the \(P_1\) reference line (slope = 1), confirming linear convergence. Each iteration reduces the error by a roughly constant factor. The secant method tracks the golden ratio line (slope ≈ 1.618), demonstrating superlinear convergence between linear and quadratic rates. Newton’s method follows the \(P_2\) reference line (slope = 2), showing quadratic convergence where the error squares at each iteration. Once Newton’s method gets close to the root, it achieves machine precision in just a few iterations, while bisection requires many more steps to reach the same accuracy.

Discussion: Choosing the Right Method

The comparison reveals clear trade-offs between the four methods, and the best choice depends on problem characteristics and available information.

Bisection is the most robust method. It always converges when started with a bracketing interval and provides guaranteed error bounds at every step. The linear convergence means many iterations are required for high accuracy, but each iteration is cheap (one function evaluation) and the method never fails. Use bisection when reliability is paramount, when good initial guesses are unavailable, or when \(f\) is poorly behaved (discontinuous derivative, multiple roots nearby). It serves as an excellent fallback when faster methods fail.

Fixed-point iteration offers simplicity and sometimes reveals problem structure. When \(|g'(x^*)| \ll 1\), convergence can be fast. However, finding a good iteration function \(g\) requires insight, and the method can diverge or converge slowly for poor choices of \(g\). The method’s theoretical framework (Banach fixed-point theorem) makes it valuable for proving existence and uniqueness of solutions beyond just computing them. Use fixed-point iteration when a natural contraction mapping exists or when theoretical analysis is needed.

Newton’s method dominates when applicable. Quadratic convergence means exponentially fast error reduction once the iterates are near the root. The cost per iteration is higher (both \(f\) and \(f'\) must be evaluated), but far fewer iterations are needed. The method can fail catastrophically for poor starting points or when \(f'(x) \approx 0\). Use Newton’s method when derivatives are readily available (analytically or via automatic differentiation), when a reasonable initial guess is known, and when the function is smooth near the root. This is the default choice for most scientific computing applications.

The secant method provides a compromise. It achieves superlinear convergence without requiring derivatives, making it attractive when \(f'\) is expensive or unavailable. The convergence rate (\(p \approx 1.618\)) lies between fixed-point and Newton methods. Each iteration requires only one new function evaluation (reusing the previous value), giving it an efficiency advantage over Newton when evaluating \(f\) is much cheaper than evaluating \(f'\). Use the secant method when derivatives are unavailable but faster convergence than bisection is desired, or when function evaluations are expensive and the cost of computing derivatives cannot be justified.

For practical problems, hybrid strategies often work best. Start with a few bisection iterations to move into the basin of attraction, then switch to Newton or secant for rapid final convergence. Most robust numerical libraries employ such adaptive strategies, balancing reliability against speed.
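SciPy’s brentq is one example of such a hybrid: it keeps a bracket like bisection but takes secant/inverse-quadratic steps whenever they are safe. A quick sketch on our test problem (the bracket \([1, 4]\) is our own choice):

from scipy.optimize import brentq

f = lambda x: x**2 - x - 6

# Guaranteed convergence from a sign-changing bracket, far fewer steps than pure bisection
x_root, info = brentq(f, 1.0, 4.0, xtol=1e-12, full_output=True)
print(f"root = {x_root:.12f}, iterations = {info.iterations}, function calls = {info.function_calls}")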

Computer Arithmetic and Error Analysis

The preceding analysis assumed exact arithmetic, but computers use finite-precision floating-point numbers that introduce unavoidable errors. These errors interact with the iterative methods we have studied in subtle ways. Bisection’s error bound \(|x_k - x^*| \leq 2^{-k}(b_0 - a_0)\) holds in theory, but eventually roundoff error prevents further refinement. Newton’s method’s quadratic convergence can amplify small errors catastrophically when \(f'(x) \approx 0\). Understanding how numbers are represented in binary and how errors accumulate during computation transforms abstract convergence theory into practical implementation guidance.

When implementing numerical methods on a computer, we immediately confront a fundamental limitation: computers cannot represent arbitrary real numbers exactly. The fraction \(1/10 = 0.1\) has no exact binary representation, just as \(1/3 = 0.333\ldots\) has no exact decimal representation. This section explores the IEEE 754 floating-point standard that governs how modern computers represent real numbers, quantifies the inherent limitations through machine epsilon, and provides practical guidance for managing numerical errors in iterative methods.

Floating-Point Representation

Modern computers represent real numbers using the IEEE 754 standard for floating-point arithmetic, established in 1985 and nearly universal today. A 64-bit double-precision floating-point number consists of three parts: one sign bit \(s\), eleven bits for the exponent \(e\), and fifty-two bits for the fractional part (mantissa) \(m\). The represented value is

\[ (-1)^s \times 2^{e-1023} \times (1 + m), \]

where \(m\) is interpreted as a binary fraction \(m = \sum_{i=1}^{52} b_i 2^{-i}\) with binary digits \(b_i \in \{0,1\}\). The exponent bias of 1023 allows representation of both very large and very small numbers. The implicit leading 1 in \((1 + m)\) provides one extra bit of precision: numbers are normalized so the leading bit is always 1 and need not be stored.
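To make this layout concrete, the following sketch (using the standard struct module) unpacks the sign, exponent, and mantissa fields of a few doubles and reconstructs the value from the formula above; it covers normalized numbers only.

import struct

def decompose_double(x):
    """Split a float64 into its IEEE 754 sign, exponent, and mantissa fields."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]   # raw 64-bit pattern
    s = bits >> 63                      # 1 sign bit
    e = (bits >> 52) & 0x7FF            # 11 exponent bits (biased by 1023)
    m = bits & ((1 << 52) - 1)          # 52 mantissa bits
    value = (-1)**s * 2.0**(e - 1023) * (1 + m / 2**52)   # normalized numbers only
    return s, e, m, value

for x in [1.0, 0.1, -6.5]:
    s, e, m, value = decompose_double(x)
    print(f"x = {x:5}: sign = {s}, biased exponent = {e}, reconstructed = {value!r}")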

This representation has several important consequences. First, not all decimal numbers have exact binary representations. The fraction \(1/10 = 0.1\) in decimal becomes the repeating binary expansion \(0.0001100110011\ldots_2\), which must be truncated to fit in 52 bits. This is why seemingly simple calculations like 0.1 + 0.2 in Python produce 0.30000000000000004 rather than exactly 0.3. The representation error is small but unavoidable.

# Demonstrate floating-point representation error
result = 0.1 + 0.2
print(f"0.1 + 0.2 = {result}")
print(f"Is 0.1 + 0.2 == 0.3? {result == 0.3}")
print(f"Exact representation: {result:.20f}")
0.1 + 0.2 = 0.30000000000000004
Is 0.1 + 0.2 == 0.3? False
Exact representation: 0.30000000000000004441

Second, the gaps between representable numbers vary with magnitude. Numbers near zero are densely packed, while numbers near the maximum representable value (approximately \(1.8 \times 10^{308}\)) are separated by huge gaps. The relative spacing between consecutive representable numbers is characterized by a fundamental constant called machine epsilon.

# Compute machine epsilon experimentally
import numpy as np

eps = 1.0
while (1.0 + eps) > 1.0:
    eps_last = eps
    eps = eps / 2.0

print(f"Computed machine epsilon: {eps_last}")
print(f"Theoretical value (2^-52): {2**-52}")
print(f"NumPy's value: {np.finfo(float).eps}")
print(f"Match: {eps_last == 2**-52 == np.finfo(float).eps}")
Computed machine epsilon: 2.220446049250313e-16
Theoretical value (2^-52): 2.220446049250313e-16
NumPy's value: 2.220446049250313e-16
Match: True

The values agree perfectly with theory. This fundamental limit means Python cannot distinguish between \(1\) and \(1 + \varepsilon\) for any \(\varepsilon < \varepsilon_{\text{mach}}\). No computation can reliably resolve differences smaller than roughly \(10^{-16}\) times the magnitude of the numbers involved. This sets an absolute floor on achievable accuracy.
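The varying gap size is easy to inspect directly with np.spacing, which returns the distance from a number to the next representable double; a short check:

import numpy as np

# Absolute gap to the next double grows with magnitude; the relative gap stays near eps
for x in [1.0, 1000.0, 1e16, 1e-16]:
    print(f"x = {x:8.0e}: spacing = {np.spacing(x):.3e}, spacing/x = {np.spacing(x)/x:.3e}")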

Sources of Error in Numerical Computation

When implementing iterative methods like bisection, multiple error sources affect the final result. Understanding the distinction between error types is essential for writing robust numerical code.

Truncation error arises when we terminate an infinite process after finitely many steps. In the bisection method with tolerance \(\varepsilon = 0.01\), we stopped after 9 iterations, still \(0.002\) units from the exact root \(x^* = 3\). This gap is truncation error: we truncated the (theoretically) infinite sequence of bisections. Our theoretical bound \(|x_k - x^*| \leq 2^{-k}(b_0 - a_0)\) quantifies this truncation error precisely.

Roundoff error accumulates from representation errors at each arithmetic operation. Every addition, multiplication, or function evaluation introduces a small error of magnitude roughly \(\varepsilon_{\text{mach}}\) relative to the operands. Over many operations, these can accumulate or sometimes catastrophically amplify. Fortunately, bisection performs only simple operations (midpoint calculation, function evaluation) and is remarkably robust to roundoff.

To see the interplay, consider evaluating \(f(c) = c^2 - c - 6\) at \(c = 2.998\) during bisection. First, \(c\) itself carries a representation error. Then computing \(c^2\) introduces roundoff, as does the subtraction. Each error scales with \(\varepsilon_{\text{mach}}\), so for values near unity, total accumulated error is perhaps \(10^{-15}\). This is negligible compared to our truncation error of \(0.002\), explaining why bisection reliably produces accurate results despite finite-precision arithmetic.

Measuring and Controlling Error

When assessing a numerical approximation \(x\) to an unknown exact value \(x^*\), we use two complementary measures. The absolute error \(|x - x^*|\) gives the raw difference, useful when the problem scale is known. The relative error \(|x - x^*|/|x^*|\) expresses error as a fraction, providing scale-independent assessment.

For our bisection example, the final midpoint \(c \approx 2.998\) has absolute error \(|2.998 - 3| = 0.002\), meeting tolerance \(\varepsilon = 0.01\). The relative error is \(0.002/3 \approx 0.00067\) or 0.067%, indicating high relative accuracy.

However, relative error becomes problematic when \(|x^*| \approx 0\). Dividing by a near-zero number produces arbitrarily large relative errors even when absolute error is small. To handle all scales robustly, practical codes often use

\[ \text{error measure} = \frac{|x - x^*|}{1 + |x^*|}, \]

which reduces to absolute error when \(|x^*| \ll 1\) and relative error when \(|x^*| \gg 1\).

In practice we do not know \(x^*\) (finding it is the whole point!), so we cannot compute these error measures directly. Iterative methods instead estimate error using consecutive iterates: \(|x_{k+1} - x_k| < \varepsilon\) as a termination criterion. For bisection specifically, we have the stronger guarantee \(|x_k - x^*| \leq b_k - a_k\), allowing direct error bounds.
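A small sketch of how such a test might be coded in practice (the helper name and the use of the latest iterate in place of the unknown \(x^*\) are our own choices):

def converged(x_new, x_old, tol=1e-10):
    """Mixed absolute/relative stopping test based on two consecutive iterates."""
    return abs(x_new - x_old) / (1.0 + abs(x_new)) < tol

# Behaves like an absolute test near zero and a relative test for large magnitudes
print(converged(5e-13, 1e-12))          # True: tiny absolute change near zero
print(converged(1.0e6 + 1.0, 1.0e6))    # False: a change of 1 is large relative to tol at this scale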

Let’s examine how errors evolve during bisection, running the method to a tight tolerance so we can see how closely the error tracks the interval width and how far the final accuracy still is from the roundoff limit:

Code
# Track error during bisection for visualization
import sympy as sp
import numpy as np
import matplotlib.pyplot as plt
import mechanicskit as mk

x = sp.symbols('x')
f_expr = x**2 - x - 6
x_exact = 3.0  # Known exact root

a, b = 1.0, 4.0
tol = 1e-10  # Very tight tolerance to see many iterations
errors_abs = []
errors_rel = []
interval_widths = []
midpoints = []

for k in range(40):  # Limit iterations
    c = (a + b) / 2
    midpoints.append(c)
    
    # Compute errors
    err_abs = abs(c - x_exact)
    err_rel = err_abs / abs(x_exact)
    interval_width = b - a
    
    errors_abs.append(err_abs)
    errors_rel.append(err_rel)
    interval_widths.append(interval_width)
    
    if interval_width < tol:
        break
    
    # Update interval
    if float(f_expr.subs(x, c) * f_expr.subs(x, a)) < 0:
        b = c
    else:
        a = c

# Plot error convergence
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), dpi=150)

iterations = list(range(len(errors_abs)))

# Absolute error plot
ax1.semilogy(iterations, errors_abs, 'bo-', label='Absolute error', markersize=4)
ax1.semilogy(iterations, interval_widths, 'r--', label='Interval width', linewidth=2, alpha=0.7)
ax1.axhline(np.finfo(float).eps, color='gray', linestyle=':', linewidth=2, label='Machine epsilon')
ax1.set_xlabel('Iteration $k$')
ax1.set_ylabel('Error')
ax1.set_title('Absolute Error Convergence')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Relative error plot
ax2.semilogy(iterations, errors_rel, 'go-', label='Relative error', markersize=4)
ax2.set_xlabel('Iteration $k$')
ax2.set_ylabel('Relative Error')
ax2.set_title('Relative Error Convergence')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"After {len(errors_abs)} iterations:")
print(f"Final approximation: {midpoints[-1]:.15f}")
print(f"Exact root:          {x_exact:.15f}")
print(f"Absolute error:      {errors_abs[-1]:.3e}")
print(f"Relative error:      {errors_rel[-1]:.3e}")
print(f"Machine epsilon:     {np.finfo(float).eps:.3e}")

After 36 iterations:
Final approximation: 3.000000000014552
Exact root:          3.000000000000000
Absolute error:      1.455e-11
Relative error:      4.851e-12
Machine epsilon:     2.220e-16

The plots reveal geometric convergence: both errors decrease linearly on the logarithmic scale, halving with each iteration. The absolute error closely tracks the interval width (red dashed line), confirming our theoretical bound. With the tolerance of \(10^{-10}\) used here, the run stops after 36 iterations, still several orders of magnitude above machine epsilon. If we kept iterating, the error would stagnate near \(\varepsilon_{\text{mach}} \approx 10^{-16}\) after roughly 50 iterations: roundoff error in evaluating \(f(c)\) and in computing midpoints prevents resolving differences smaller than \(\varepsilon_{\text{mach}}|x^*|\), so further iteration cannot improve accuracy.

This observation provides practical guidance: setting tolerance much smaller than \(10^{-14}\) wastes computation without improving accuracy. For double-precision arithmetic, targeting 10-12 significant digits balances accuracy against computational efficiency.

Practical Recommendations

The interplay between truncation and roundoff error characterizes all numerical computation, so we need practical rules for implementation. For double‑precision arithmetic, absolute tolerances below \(10^{-14}\) or relative tolerances below \(10^{-12}\) rarely make sense because roundoff dominates. We should also match tolerance to the physical scale, e.g., when measuring lengths in meters, \(10^{-6}\) often suffices.

When solution magnitude is unknown, relative error provides scale‑independent stopping criteria, but we must switch to absolute error or mixed criteria near \(x^* \approx 0\) to avoid division by small numbers. Bisection’s bound \(|x_k - x^*| \leq 2^{-k}(b_0-a_0)\) lets us predict iteration counts in advance, while Newton’s method lacks such guarantees and requires careful monitoring. Finally, we must avoid catastrophic cancellation. When computing numerical derivatives \([f(x+h) - f(x)]/h\), choosing \(h\) too small destroys accuracy even though the truncation error is smaller. This is a recurring theme in scientific computing: we balance modeling error, truncation error, and roundoff error rather than pursuing arbitrarily small step sizes.
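The cancellation effect is easy to demonstrate. The sketch below approximates \(f'(1)\) for \(f(x)=\sin x\) with a forward difference and shows that the error is smallest for an intermediate \(h\) (near \(\sqrt{\varepsilon_{\text{mach}}} \approx 10^{-8}\)), not for the tiniest one:

import numpy as np

f = np.sin
x0 = 1.0
exact = np.cos(x0)   # exact derivative, for measuring the error

for h in [1e-2, 1e-5, 1e-8, 1e-11, 1e-14]:
    approx = (f(x0 + h) - f(x0)) / h     # forward difference
    print(f"h = {h:8.0e}: |error| = {abs(approx - exact):.3e}")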

Case Study: Fast Inverse Square Root

One of the most celebrated examples of exploiting floating-point representation for performance is the “fast inverse square root” algorithm from the Quake III Arena video game engine (1999). We include it here because it illustrates a recurring scientific computing trade‑off: approximate arithmetic combined with one iteration of Newton’s method can be faster and accurate enough for real‑time simulation. This algorithm computes \(1/\sqrt{x}\) approximately 4 times faster than standard methods, using a clever combination of bit manipulation and Newton’s method. The problem arises constantly in 3D graphics: normalizing vectors (making them unit length) requires dividing by their magnitude, which involves computing \(1/\sqrt{x^2 + y^2 + z^2}\). When rendering thousands of polygons per frame at 60 frames per second, this operation becomes a critical bottleneck.

The video below provides an excellent explanation of the algorithm’s inner workings, connecting the IEEE 754 representation we just studied to practical performance optimization:

The algorithm exploits the IEEE 754 floating-point representation we discussed earlier. Recall that a 32-bit float has the form

\[ \text{float} = (-1)^s \times 2^{e-127} \times (1 + m), \]

where \(s\) is the sign bit, \(e\) is an 8-bit exponent, and \(m\) is the 23-bit mantissa representing a fraction in \([0,1)\). The insight is that when interpreted as an integer, the bit pattern approximately represents \(\log_2(\text{float})\). This logarithmic relationship enables fast approximate computation of inverse square roots through simple integer arithmetic.
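We can verify this logarithmic relationship directly: for a positive 32-bit float \(x\) with bit pattern \(I\), the scaled integer \(I/2^{23} - 127\) is approximately \(\log_2 x\) (the remaining error comes from replacing \(\log_2(1+m)\) by \(m\)). A short check:

import struct
import numpy as np

def float_bits_as_int(x):
    """Reinterpret the bits of a 32-bit float as an unsigned integer."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

for x in [0.5, 1.0, 2.0, 10.0, 100.0]:
    approx_log2 = float_bits_as_int(x) / 2**23 - 127
    print(f"x = {x:6}: I/2^23 - 127 = {approx_log2:8.4f}, log2(x) = {np.log2(x):8.4f}")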

Here is the original C code from Quake III, famous for its mysterious “magic constant” and terse comment:

float Q_rsqrt(float number) {
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration of Newton's method
    return y;
}

The algorithm has three phases. First, it interprets the floating-point bits as an integer (i = * ( long * ) &y). Second, it performs the “magic” integer operation 0x5f3759df - (i >> 1), which exploits the logarithmic structure to approximate \(\log_2(1/\sqrt{x})\). The right shift by 1 corresponds to dividing the logarithm by 2 (taking a square root in log space), and subtracting from the magic constant inverts and adjusts the bias. Third, it applies one iteration of Newton’s method to refine the approximation.
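The Newton step in the last line can be derived by applying one iteration of Newton’s method to \(g(y) = 1/y^2 - x\), whose positive root is \(y = 1/\sqrt{x}\):

\[ y_{k+1} = y_k - \frac{g(y_k)}{g'(y_k)} = y_k - \frac{y_k^{-2} - x}{-2\,y_k^{-3}} = y_k\left(\frac{3}{2} - \frac{x}{2}\,y_k^2\right), \]

which is exactly the expression y * (threehalfs - (x2 * y * y)) with x2 = 0.5 * number.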

While we cannot do direct bit-casting in pure Python as elegantly as C, we can implement the algorithm using the struct module to manipulate bit representations:

import struct
import numpy as np

def fast_inverse_sqrt(number):
    """Fast inverse square root using Quake III algorithm."""
    threehalfs = 1.5
    x2 = number * 0.5
    
    # Convert float to integer bits
    i = struct.unpack('>l', struct.pack('>f', number))[0]
    
    # Magic bit hack
    i = 0x5f3759df - (i >> 1)
    
    # Convert back to float
    y = struct.unpack('>f', struct.pack('>l', i))[0]
    
    # Newton iteration
    y = y * (threehalfs - (x2 * y * y))
    
    return y

# Test the implementation
test_values = [1.0, 4.0, 9.0, 16.0, 100.0]
print("x      | Fast 1/√x  | Actual 1/√x | Error")
print("-" * 50)
for val in test_values:
    fast_result = fast_inverse_sqrt(val)
    actual_result = 1.0 / np.sqrt(val)
    error = abs(fast_result - actual_result)
    print(f"{val:6.1f} | {fast_result:10.7f} | {actual_result:11.7f} | {error:.2e}")
x      | Fast 1/√x  | Actual 1/√x | Error
--------------------------------------------------
   1.0 |  0.9983071 |   1.0000000 | 1.69e-03
   4.0 |  0.4991536 |   0.5000000 | 8.46e-04
   9.0 |  0.3329532 |   0.3333333 | 3.80e-04
  16.0 |  0.2495768 |   0.2500000 | 4.23e-04
 100.0 |  0.0998449 |   0.1000000 | 1.55e-04

The algorithm achieves remarkable accuracy with just one Newton iteration. The magic constant 0x5f3759df was determined empirically to minimize maximum error across the range of inputs. To understand why this works so well, let’s visualize how the magic initial guess compares to the exact value, and how Newton’s method refines it:

Code
# Analyze the algorithm's accuracy across a range of inputs
x_range = np.linspace(0.1, 100, 500)
exact = 1.0 / np.sqrt(x_range)
fast_results = np.array([fast_inverse_sqrt(val) for val in x_range])

# Also compute the magic guess before Newton iteration
def magic_guess_only(number):
    """Just the magic bit hack, no Newton iteration."""
    i = struct.unpack('>l', struct.pack('>f', number))[0]
    i = 0x5f3759df - (i >> 1)
    y = struct.unpack('>f', struct.pack('>l', i))[0]
    return y

magic_only = np.array([magic_guess_only(val) for val in x_range])

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), dpi=150)

# Plot 1: Function values
ax1.plot(x_range, exact, 'b-', label='Exact $1/\\sqrt{x}$', linewidth=2)
ax1.plot(x_range, magic_only, 'r--', label='Magic guess (before Newton)', linewidth=1.5, alpha=0.7)
ax1.plot(x_range, fast_results, 'g:', label='After 1 Newton iteration', linewidth=2)
ax1.set_xlabel('$x$')
ax1.set_ylabel('$1/\\sqrt{x}$')
ax1.set_title('Fast Inverse Square Root Approximation')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 20)
ax1.set_ylim(0, 3)

# Plot 2: Relative error
rel_error_magic = np.abs((magic_only - exact) / exact)
rel_error_fast = np.abs((fast_results - exact) / exact)

ax2.semilogy(x_range, rel_error_magic, 'r--', label='Magic guess error', linewidth=1.5, alpha=0.7)
ax2.semilogy(x_range, rel_error_fast, 'g-', label='After Newton iteration', linewidth=2)
ax2.axhline(np.finfo(np.float32).eps, color='gray', linestyle=':', linewidth=2, label='Float32 machine epsilon')
ax2.set_xlabel('$x$')
ax2.set_ylabel('Relative Error')
ax2.set_title('Relative Error Analysis')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nMagic guess max relative error: {np.max(rel_error_magic):.2%}")
print(f"After Newton max relative error: {np.max(rel_error_fast):.2%}")
print(f"Float32 machine epsilon: {np.finfo(np.float32).eps:.2e}")

Analysis of the Fast Inverse Square Root algorithm. (a) The ‘magic guess’ provides a close initial approximation to the exact function, which is then refined by a single Newton iteration. (b) The relative error of the magic guess is already at the few-percent level, and a single Newton iteration reduces it by more than an order of magnitude.

Magic guess max relative error: 3.43%
After Newton max relative error: 0.17%
Float32 machine epsilon: 1.19e-07

The visualization shows how the method improves accuracy. The magic bit hack alone produces an initial guess with a relative error of at most a few percent (about 3.4% in the worst case measured above), which is remarkably close given that it uses only integer arithmetic. One Newton iteration then reduces this error to below 0.2%, more than accurate enough for real-time vector normalization. This combines our understanding of IEEE 754 representation (the bit manipulation exploits the exponent-mantissa structure), error analysis (we can quantify the approximation quality), and Newton’s method. The result highlights how careful use of computer arithmetic yields substantial speedups in real-time graphics, with reported speedups of roughly 4x in the late 1990s.

References

[1]
Verschelde J. The convergence of the secant method is superlinear. Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago; 2005.
[2]
Burden RL, Faires JD. Numerical analysis. 9th ed. Boston, MA: Brooks/Cole, Cengage Learning; 2010.
[3]
Stoer J, Bulirsch R. Introduction to numerical analysis. 3rd ed. New York: Springer; 2002.