Curve Fitting with Julia

Why Oil Production May Not Follow a Gaussian Function

“How did you go bankrupt?” Bill asked.
“Two ways,” Mike said. “Gradually, then suddenly.”

—Ernest Hemingway, The Sun Also Rises

The Weierstrass Approximation Theorem

In 1885, Karl Weierstrass proved that any continuous function can be approximated as closely as you like with a polynomial. Written in mathematical terms, the theorem he proved is

Theorem: Suppose $f:[0,1] \to \mathbb{R}$ is continuous. For any $\epsilon > 0$, there exists a polynomial $p$ such that

$$\sup_{x \in [0,1]} |f(x) - p(x)| \leq \epsilon.$$

What does all this mean? Let’s start with $f: [0,1] \to \mathbb{R}$. This says that the function $f$ takes any value of $x$ between $0$ and $1$ and returns a real number $y = f(x)$, where $y$ can be anything. A function is continuous if it is defined at every $x$ and you can trace the curve without lifting your pencil.

The part about “For any $\epsilon > 0$” is a trick mathematicians use a lot. It means that however small a positive number you pick, the statement still holds; there is always an even smaller, still positive number to try. The way we use this very, very tiny $\epsilon$ is to take the absolute value of the difference between $f(x)$ and $p(x)$ at every $x$ and make sure it’s less than $\epsilon$.

The $\sup_{x \in [0,1]}$ is not “'Sup, dude?”, but short for supremum: the largest absolute difference between $f(x)$ and $p(x)$ over all $x$ in the interval $[0,1]$. The theorem says we can find a polynomial $p$ whose biggest error over the interval is at most $\epsilon$. In other words, we can choose a $p$ that makes even the worst-case error as small as we like.

You might be wondering why the $x$-values are constrained to be between $0$ and $1$, and the answer is, it doesn’t matter. You can shrink, stretch, and shift $x$ to fit any interval. Suppose your function is defined over the interval $[a,b]$. Then, for every $x$, set

$$x' = \frac{x-a}{b-a}$$

so when $x = a$, $x' = 0$, and when $x = b$, $x' = 1$. Then define a new function

$$g(x') = f\big(a + (b-a)x'\big)$$

and $g$ takes on the same $y$ values as $f$, but over the interval $[0,1]$.
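A quick Julia sanity check of this rescaling (a minimal sketch; the function and interval are arbitrary examples):

```julia
# Reparameterize a function on [a, b] so it can be evaluated on [0, 1].
a, b = 2.0, 5.0
f(x) = x^2                     # example function defined on [a, b]
g(x′) = f(a + (b - a) * x′)    # same y values, now indexed by x′ ∈ [0, 1]

g(0.0) == f(a)    # true
g(1.0) == f(b)    # true
```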

A theorem isn’t a theorem until it’s been proved. Weierstrass proved his theorem, but in 1912, Sergei Bernstein used Bernstein basis polynomials to construct approximating functions explicitly. In “Yacht Design with Mathematics” we used Bernstein polynomials as the basis for Bezier functions. For any $n \geq 0$, the $n+1$ Bernstein basis polynomials are defined as

$$B_{k,n}(x) = {n \choose k} x^k (1-x)^{n-k}, \quad k = 0, 1, 2, \ldots, n$$

with $x$ in the interval $[0,1]$. The Binomial Theorem shows how to expand the polynomial in $x$ and $y$ to the $n^{th}$ power,

$$(x+y)^n = \sum_{k=0}^n {n \choose k} x^{n-k} y^k.$$

Substitute $y = 1-x$, add up all $n+1$ Bernstein polynomials, and you get

$$\sum_{k=0}^n B_{k,n}(x) = (x + (1-x))^n = 1^n = 1$$

for any $x$. Bernstein then defined a new function, built from a continuous function $f$ on the interval $[0,1]$, as

$$B_n(f)(x) = \sum_{k=0}^n f\left( \frac{k}{n} \right) B_{k,n}(x) = \sum_{k=0}^n f\left( \frac{k}{n} \right) {n \choose k} x^k (1-x)^{n-k}.$$
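Here’s a minimal Julia sketch of Bernstein’s construction (the test function and degrees are arbitrary choices; the convergence is slow but steady):

```julia
# k-th Bernstein basis polynomial of degree n at x. big(n) keeps the
# binomial coefficient from overflowing Int64 for larger n.
B(k, n, x) = Float64(binomial(big(n), big(k))) * x^k * (1 - x)^(n - k)

# Bernstein approximation of f on [0, 1]: B_n(f)(x) = Σ f(k/n) B_{k,n}(x)
bernstein(f, n, x) = sum(f(k / n) * B(k, n, x) for k in 0:n)

f(x) = sin(2π * x)
for n in (10, 50, 200)
    err = maximum(abs(bernstein(f, n, x) - f(x)) for x in 0:0.01:1)
    println("n = $n  max error ≈ $(round(err, digits = 4))")
end
```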

Michael Bertrand gives the complete version of Bernstein’s proof in his article, “Bernstein Proves the Weierstrass Approximation Theorem”, and I’ll provide another version at the end.

In 1938, Marshall Stone extended Weierstrass’s theorem; the generalization is now known as the Stone-Weierstrass Theorem.

Some Applications

What can you do with a polynomial approximation to a function? With a good polynomial fit, you can use the polynomial to estimate the derivative or the integral of $f$, and this is what is done with software packages like Chebfun for Matlab, ApproxFun in Julia, or Python libraries such as pychebfun, ChebPy, or PaCAL. Polynomials are easy to differentiate and integrate, which is often not true for arbitrary functions, so having a good approximation is quite useful. These software tools can approximate continuous functions to machine precision, and computing with polynomials is fast.
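For instance, here’s a minimal ApproxFun sketch (assuming the package is installed; the function and interval are arbitrary choices):

```julia
using ApproxFun

# Build a polynomial (Chebyshev) approximation on [0, 10], then operate
# on the approximation instead of the original function.
f = Fun(x -> sin(x) * exp(-x / 5), 0..10)

df = f'         # derivative of the approximant
area = sum(f)   # definite integral over [0, 10]
zs = roots(f)   # zeros of f on the interval
```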

With an approximate derivative or integral you can solve differential equations, and this has led to applications in neural networks as described in “Universal Invariant and Equivariant Graph Neural Networks”. For insight into how Chebfun works, see “Chebfun: A New Kind of Numerical Computing” by Platte and Trefethen. The Chebfun website has many examples.

You’ll find plenty of nerdy-cool projects to try out. The only negative aspect of Chebfun is that it requires the commercial software Matlab, but you may find an open-source alternative you prefer among the Chebfun-related projects.

Polynomials aren’t the only way to approximate functions. Fourier approximation and wavelets are other methods, and Nick Trefethen has written the book, “Approximation Theory and Approximation Practice” (partially available online). See the Publications page for other references.

Something you should never do is fit a function $f$ with a high-degree polynomial $p$ and then extrapolate $p$ beyond the limits of $f$. You only get a good fit over the interval $[a,b]$ where you defined $f$.

[Figure: polynomial approximation of $\sin(x)/x$]

Here, the function

$$f(x) = \frac{\sin(x)}{x}$$

may be approximated by the polynomial

$$p(x) = 1 - \frac{x^2}{6} + \frac{x^4}{120}.$$

Between $-\frac{\pi}{2}$ and $\frac{\pi}{2}$ the fit is pretty good, but outside the interval, the fit is much worse.
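You can check the breakdown numerically with a few sample points (a minimal sketch):

```julia
# Compare sin(x)/x with its degree-4 Taylor polynomial inside and
# outside the interval where the fit is good.
f(x) = sin(x) / x
p(x) = 1 - x^2 / 6 + x^4 / 120

for x in (0.5, 1.5, 3.0, 6.0)
    println("x = $x   |f(x) - p(x)| = $(abs(f(x) - p(x)))")
end
# The error is roughly 1e-6 near zero and grows to order 1 and beyond
# once x leaves [-π/2, π/2].
```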

A Naive Approach to Approximations

Suppose you knew nothing about approximation theory and decided to try fitting a function (or data) with a sum of simpler functions.

[Figure: Marion King Hubbert]

Marion King Hubbert was a geologist, mathematician, and physicist who worked for Shell Oil, the USGS, and taught at Columbia, Stanford, and UC Berkeley. In 1956, he looked at oil production data and proposed a model for the amount of oil extracted from a field,

$$Q(t) = \frac{Q_{\text{max}}}{1 + a e^{-bt}}$$

where $Q_{\text{max}}$ is the total quantity of oil to eventually be produced from the field, $Q(t)$ is the cumulative production up to time $t$, and the constants $a$ and $b$ are parameters set by the local geology. This is the logistic function, first described by Pierre François Verhulst to model population growth. The derivative of the logistic function is a bell-shaped curve (similar to, though not exactly, a Gaussian function), which for Hubbert was the instantaneous rate of oil flow from the field.
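In Julia, Hubbert’s model and its rate curve look like this (a sketch; the parameter values are made up for illustration, not fitted to any field):

```julia
# Logistic cumulative production Q(t) and its derivative, the production
# rate. Qmax, a, and b are illustrative values only.
Qmax, a, b = 200.0, 50.0, 0.1

Q(t) = Qmax / (1 + a * exp(-b * t))
rate(t) = Qmax * a * b * exp(-b * t) / (1 + a * exp(-b * t))^2

t_peak = log(a) / b   # the rate peaks where Q(t) = Qmax / 2
```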

[Figure: the logistic function and its bell-shaped derivative]

Hubbert used the logistic function because initially with only a few wells production would be low, but as more came on line production should increase. Eventually, though, depletion would slow the rate of production. Hubbert thought his model could be extended to all U.S. oil production and predicted the peak of production would occur in 1970 for the lower 48 states.

[Figure: Hubbert’s prediction for the lower 48 U.S. states]

In 1976 Hubbert predicted the world peak in production might happen as soon as 1995. A 2007 study by the UK Energy Research Centre found no geological reason to expect cumulative oil production to follow a logistic curve. They reported field data suggesting peaks tended to happen at about 25% of the total production $Q_\text{max}$.

On the other hand, Ugo Bardi says decline is faster than growth and likes to quote Seneca,

“It would be some consolation for the feebleness of our selves and our works if all things should perish as slowly as they come into being; but as it is, increases are of sluggish growth, but the way to ruin is rapid.” Lucius Annaeus Seneca, Letters to Lucilius, n. 91

[Figure: Bardi’s Seneca Cliff]

We can model individual fields using Gaussian functions,

$$g(t) = \alpha e^{-\frac{(t-\mu)^2}{2 \sigma^2}}$$

where $\alpha$ represents the peak amount produced, $\mu$ is the time of peak production, and $\sigma$ determines the width of the bell curve. The Ghawar field in Saudi Arabia produces millions of barrels per day ($\alpha$), has been in operation since $\mu = 1948$, and continues to this day, so $\sigma$ is large. More recently, U.S. shale wells have increased production (the green curve departing from Hubbert’s prediction), but their peaks are much lower and production drops 90% from the peak after only three years.

Is it possible to generate Bardi’s Seneca Cliff with a series of Gaussians? Using Julia, the Seneca Cliff can be modeled as

f(x) = 0 < x <= 1 ? x : 0

which asks the question: is $x$ between $0$ and $1$? If it is, $f(x) = x$; otherwise it’s zero.

[Figure: the cliff function]

If oil production mimicked this f(x)f(x), the rate would increase linearly until suddenly every oil well in the world stopped producing on the same day. This is about as likely as Zaphod Beeblebrox’s Infinite Improbability Drive, but we’re just trying to build a simple mind-sized problem. “See, all my procedures are mind-sized bites”, explains seventh-grader Robert in Seymour Papert’s Mindstorms.

First we need to create a vector of $x$-values,

x = collect(-0.5:0.01:1.5);

and then the corresponding $y$-values using an array comprehension

y = [f(x_i) for x_i in x];

Next, for each value of the parameters $\alpha$, $\mu$, and $\sigma$ we can define a function $g$ and evaluate it over $x$,

g(x,α,μ,σ) = sum( α[i] * exp( - (x-μ[i])^2/(2*σ[i]^2) ) for i = 1:n)

or

$$g(x,\alpha,\mu,\sigma) = \sum_{i=1}^n \alpha_i \exp \left( - \frac{(x-\mu_i)^2}{2 \sigma_i^2} \right),$$

which gives an estimate for $y$ using the sum of $n$ Gaussians

y_fit = [g(x_i,α,μ,σ) for x_i in x]

Using the LsqFit.jl package, we’ll get the best parameters to fit $f(x)$. LsqFit requires an initial estimate, so I chose

$$\begin{aligned} \alpha_i &= 1 - \frac{1}{2^i} \\ \mu_i &= 1 - \frac{1}{2^i} \\ \sigma_i &= \frac{1}{4^i} \end{aligned}$$

and then created an initial vector $p_0 = [\alpha; \mu; \sigma]$ with $n = 4$. To get the fit, LsqFit uses the function curve_fit

fit = curve_fit(gaussSum, x, y, p₀, lower = lb, upper = ub)

where gaussSum contains the definition for $g$, extracts the parameters from $p_0$, and generates y_fit. The values of the parameters $\alpha$, $\mu$, and $\sigma$ are required to be between the lower bound lb and the upper bound ub, which were set to zero and one for this example.
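Putting the pieces together, here’s a minimal end-to-end sketch (how gaussSum unpacks $p_0$ is my assumption about the code’s structure; the GitHub version may differ):

```julia
using LsqFit

n = 4
f(x) = 0 < x <= 1 ? x : 0
x = collect(-0.5:0.01:1.5)
y = [f(x_i) for x_i in x]

# Model for curve_fit: p packs the parameters as [α; μ; σ], each of length n.
function gaussSum(x, p)
    α, μ, σ = p[1:n], p[n+1:2n], p[2n+1:3n]
    [sum(α[i] * exp(-(x_j - μ[i])^2 / (2σ[i]^2)) for i in 1:n) for x_j in x]
end

α₀ = [1 - 1/2^i for i in 1:n]
μ₀ = [1 - 1/2^i for i in 1:n]
σ₀ = [1/4^i for i in 1:n]
p₀ = [α₀; μ₀; σ₀]

lb, ub = zeros(3n), ones(3n)   # the parameter bounds described in the text
fit = curve_fit(gaussSum, x, y, p₀, lower = lb, upper = ub)
α, μ, σ = fit.param[1:n], fit.param[n+1:2n], fit.param[2n+1:3n]
```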

The result is a pretty good approximation

[Figure: the sum-of-Gaussians fit to the cliff function]

You can see the peaks of each Gaussian with centers $\mu = [0.52, 0.80, 0.93, 0.99]$ and widths $\sigma = [0.218, 0.099, 0.041, 0.014]$. The optimal amplitudes are $\alpha = [0.51, 0.57, 0.63, 0.69]$. The function $f$ has a discontinuity at $x = 1$, so we wouldn’t be able to find a polynomial approximation. Smoothing over the discontinuity would help with the sum-of-Gaussians technique. This doesn’t mean this model necessarily represents what’s actually happening, but it shows that a sum of Gaussian-like production curves can still lead to a Seneca Cliff. The Julia code is on GitHub.

This is a good experimental start to finding a method for fitting Gaussians to functions. Is there a pattern to the values for $\alpha$, $\mu$, and $\sigma$? Can you find a functional representation for the parameters that works for different $n$? What happens if you change $f$? Can you find a general rule giving the parameters for any $f$ and any $n$? Test your hypotheses by changing $f$ (line 17 of the GitHub code) and plotting the fit.

The Proof

Here’s the proof of Weierstrass’ theorem. It’s actually pretty easy to follow if you take it in mind-sized bites. It comes from Chris Tisdell, a mathematician at UNSW Sydney. The proof below was taken from his YouTube video and is based on the proof by Dunham Jackson.

[Figure: Dunham Jackson]

Theorem: Suppose $f: [a,b] \to \mathbb{R}$ is a continuous function. Then we can construct a sequence of polynomials $P_n$ such that for any $\epsilon > 0$ there exists $N = N(\epsilon)$ such that for all $x \in [a,b]$ we have

$$|P_n(x) - f(x)| < \epsilon, \text{ for all } n > N.$$

Proof: Assume $0 < a < b < 1$. Extend $f$ to the entire real line by defining

$$f(x) = \begin{cases} 0, & \text{if } x \leq 0 \\ \frac{x}{a}f(a), & \text{if } 0 < x < a \\ f(x), & \text{if } a \leq x \leq b \\ \frac{1-x}{1-b}f(b), & \text{if } b < x < 1 \\ 0, & \text{if } x \geq 1 \end{cases}$$

This just adds pieces to $f(x)$ so that it covers the entire real line by connecting $(0,0)$ to $(a,f(a))$ and $(b,f(b))$ to $(1,0)$, and then defining $f(x) = 0$ for any point outside the interval $[0,1]$ (and it’s still continuous).

[Figure: the extended function $f$]

Now, create a sequence of constants (called a Landau integral),

$$J_n := \int_{-1}^1 (1-u^2)^n \, du.$$

Here are some examples of the Landau integral and its numeric values over the interval $[-1,1]$. Integration gives the area below the curve and above the $x$-axis.

[Figure: the curves $(1-u^2)^n$ over $[-1,1]$]

$$\begin{aligned} \int_{-1}^1 (1-u^2) \, du &= \left. \frac{u}{3}(3-u^2) \right|_{-1}^1 = \frac{2}{3} - \left(- \frac{2}{3} \right) = \frac{4}{3} \\ \int_{-1}^1 (1-u^2)^2 \, du &= \left. \frac{u}{15}(3u^4-10u^2+15) \right|_{-1}^1 = \frac{16}{15} \\ \int_{-1}^1 (1-u^2)^3 \, du &= \left. \frac{u}{35}(-5u^6+21u^4-35u^2+35) \right|_{-1}^1 = \frac{32}{35}. \end{aligned}$$
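These closed forms are easy to double-check numerically (a sketch using the QuadGK package, assuming it is installed):

```julia
using QuadGK

# J_n = ∫₋₁¹ (1 - u²)ⁿ du; quadgk returns (integral, error_estimate).
J(n) = first(quadgk(u -> (1 - u^2)^n, -1, 1))

J(1) ≈ 4/3     # true
J(2) ≈ 16/15   # true
J(3) ≈ 32/35   # true
```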

Next, define polynomials using these constants,

$$P_n(x) := \frac{1}{J_n}\int_0^1 f(t) \left( 1 - (t-x)^2 \right)^n \, dt.$$

The first three polynomials of the sequence are

$$\begin{aligned} P_1(x) &= \frac{1}{J_1} \int_0^1 f(t) (1-(t-x)^2) \, dt \\ P_2(x) &= \frac{1}{J_2} \int_0^1 f(t) (1-(t-x)^2)^2 \, dt \\ P_3(x) &= \frac{1}{J_3} \int_0^1 f(t) (1-(t-x)^2)^3 \, dt. \end{aligned}$$

The idea is that when $n$ gets big enough, $P_n(x)$ will very closely match $f(x)$. Notice that the degree of $P_n(x)$ is at most $2n$.
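You can watch this convergence happen with numerical quadrature (a sketch; the test function is an arbitrary continuous choice that vanishes at the endpoints, so the zero extension stays continuous):

```julia
using QuadGK

f(x) = x * (1 - x)   # continuous on [0, 1], zero at both endpoints
J(n) = first(quadgk(u -> (1 - u^2)^n, -1, 1))
P(n, x) = first(quadgk(t -> f(t) * (1 - (t - x)^2)^n, 0, 1)) / J(n)

for n in (5, 20, 80)
    err = maximum(abs(P(n, x) - f(x)) for x in 0.1:0.1:0.9)
    println("n = $n  max error ≈ $(round(err, digits = 4))")
end
```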

Since $f(x) = 0$ for every $x$ outside of the interval $[0,1]$, then

$$\begin{aligned} P_n(x) &= \frac{1}{J_n}\int_0^1 f(t) \left( 1 - (t-x)^2 \right)^n \, dt \\ &= \frac{1}{J_n}\int_{-1+x}^{1+x} f(t) \left( 1 - (t-x)^2 \right)^n \, dt \\ &= \frac{1}{J_n}\int_{-1}^1 f(x+u) \left( 1 - u^2 \right)^n \, du. \end{aligned}$$

Changing the limits of integration from $0$ to $1$ in the first line to $-1+x$ to $1+x$ in the second line doesn’t really do anything for $x \in [0,1]$, except that the integral now runs over a window inside $[-1,2]$ instead. In other words, let

$$g(t) = f(t) \left( 1 - (t-x)^2 \right)^n$$

so

$$\int_{-1+x}^{1+x} g(t) \, dt = \int_{-1+x}^0 g(t) \, dt + \int_0^1 g(t) \, dt + \int_1^{1+x} g(t) \, dt$$

where the first and third terms are zero since $f(t)$ is zero on those intervals, leaving the middle term unchanged.

If we change variables, letting $u = t-x$ so $t = u+x$, then $dt = du$. The limits of the integral change from

$$\begin{aligned} t &= -1 + x \Rightarrow u + x = -1 + x \Rightarrow u = -1 \\ t &= 1 + x \Rightarrow u + x = 1 + x \Rightarrow u = 1. \end{aligned}$$

Now, look at the difference between this polynomial approximation $P_n(x)$ and the function we want to approximate, $f(x)$,

$$\begin{aligned} P_n(x) - f(x) &= \frac{1}{J_n}\int_{-1}^1 f(x+u) \left( 1 - u^2 \right)^n \, du - f(x) \\ &= \frac{1}{J_n}\int_{-1}^1 \left[ f(x+u) - f(x) \right] \left( 1 - u^2 \right)^n \, du \end{aligned}$$

where we use the Landau integral

$$\begin{aligned} J_n &= \int_{-1}^1 (1-u^2)^n \, du \\ 1 &= \frac{1}{J_n} \int_{-1}^1 (1-u^2)^n \, du \\ f(x) &= \frac{1}{J_n} \int_{-1}^1 f(x) (1-u^2)^n \, du. \end{aligned}$$

Since $f$ is continuous (and zero outside a bounded interval, so in fact uniformly continuous), for any $\epsilon > 0$ there is a $\delta$ such that

$$|f(x+u) - f(x)| < \frac{\epsilon}{2}, \text{ for all } |u| < \delta.$$

If $M$ is the maximum value of $|f|$, then

$$|f(x+u) - f(x)| \leq 2M, \text{ for all } u$$

since $|f(x+u)| \leq M$ and $|f(x)| \leq M$, then

$$|f(x+u)-f(x)| \leq |f(x+u)| + |f(x)| \leq M + M = 2M.$$

For any $|u| > \delta$ we have $u^2 > \delta^2$, so

$$1 \leq \frac{u^2}{\delta^2}$$

and therefore

$$|f(x+u)-f(x)| \leq \frac{2Mu^2}{\delta^2}.$$

In either case, $|f(x+u)-f(x)|$ must be less than the sum of these two bounds, so

$$|f(x+u)-f(x)| \leq \frac{\epsilon}{2} + \frac{2Mu^2}{\delta^2}.$$

With these bounds, we can write

$$\begin{aligned} |P_n(x) - f(x)| &= \left| \frac{1}{J_n}\int_{-1}^1 \left[ f(x+u) - f(x) \right] (1 - u^2)^n \, du \right| \\ &\leq \frac{1}{J_n}\int_{-1}^1 \left[ \frac{\epsilon}{2} + \frac{2Mu^2}{\delta^2} \right] (1 - u^2)^n \, du \\ &= \frac{1}{J_n}\int_{-1}^1 \frac{\epsilon}{2} (1 - u^2)^n \, du + \frac{1}{J_n}\int_{-1}^1 \frac{2Mu^2}{\delta^2} (1 - u^2)^n \, du \\ &= \frac{\epsilon}{2} \, \frac{1}{J_n}\int_{-1}^1 (1 - u^2)^n \, du + \frac{2M}{\delta^2 J_n}\int_{-1}^1 u^2 (1 - u^2)^n \, du \\ &= \frac{\epsilon}{2} + \frac{2M}{\delta^2 J_n}\int_{-1}^1 u^2 (1 - u^2)^n \, du. \end{aligned}$$

Let

$$I_n = \int_{-1}^1 u^2 (1 - u^2)^n \, du$$

and then use integration by parts ($\int \mu \, d\nu = \mu \nu - \int \nu \, d\mu$), letting

$$\begin{aligned} \mu &= u \Rightarrow d\mu = du \\ d\nu &= u(1-u^2)^n \, du \Rightarrow \nu = \frac{(1-u^2)^n(u^2-1)}{2(n+1)} = -\frac{(1-u^2)^{n+1}}{2(n+1)}. \end{aligned}$$

The boundary term $\mu\nu$ vanishes at $u = \pm 1$, and the remaining integral $-\int \nu \, d\mu$ is $\frac{1}{2(n+1)}\int_{-1}^1 (1-u^2)^{n+1} \, du = \frac{J_{n+1}}{2(n+1)}$.

So this means the integral $I_n$ can be bounded by $J_n/2n$, since $J_{n+1} < J_n$ and $2(n+1) > 2n$:

$$I_n = 0 + \frac{1}{2(n+1)}J_{n+1} < \frac{J_n}{2n}.$$

In other words,

$$\begin{aligned} |P_n(x) - f(x)| &\leq \frac{\epsilon}{2} + \frac{2M}{\delta^2 J_n}I_n \\ &< \frac{\epsilon}{2} + \frac{2M}{\delta^2 J_n} \, \frac{J_n}{2n} \\ &= \frac{\epsilon}{2} + \frac{2M}{2n \delta^2} \\ &= \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon \end{aligned}$$

so long as $n$ is sufficiently large that

$$\frac{2M}{2n \delta^2} < \frac{\epsilon}{2}. \; \blacksquare$$