Creating Synthetic Data with AAFT and Time Dependent Trend in Python

Thanks to everyone who helped me the other day. Here is the finished code below. Check for spelling/paranthesis/formatting errors as it didn't copy over well and I wrote most of it here quickly.

Link to paper I have been working from (not mine). AAFT information can be found on page 106 of pdf.

AAFT: Amplitude Adjusted Fourier Transform

Given a real univariate time-series, the AAFT algorithm operates in Fourier (i.e., frequency) domain, where it preserves the amplitude spectrum of the series, but randomizes the phase, leading to a new realized signal.

What is great about it:

In other words, first and second order statistical moments (i.e., due to autocorrelation) of the signal are encoded in its power spectrum, which is purely dependent on the amplitude spectrum. Consequently, the randomization of the phase does not impact the first and second order moments of the series, hence the surrogates share statistical properties of the original signal.

Let's use SPY data from 2019-11-18 through 2020-10-30 (no reason for this)

Our Data:


So, we are going to use the same amplitudes with different phases. Let's get to the code:

First, we will define a way to get a polynomial fit to our data. I like using the Spyder IDE and it doesn't like taking user input while it displays a graph. Luckily, Matplotlib has event handling. Just hit the corresponding number on the keyboard while the plot is active.

import numpy as np import pandas as pd import matplotlib.pyplot as plt class PlotSelection(object): def __init__(self,y): self.y = y def on_key(self,event): = int(event.key) #generates warning but still works. def get_degree(self): xr = np.arange(len(self.y)) fx = lambda n: np.poly1d(np.polyfit(xr,self.y,n))(xr) fig,axes = plt.subplots(nrows=4,ncols=2,sharex=True,sharey=True) fig.suptitle("Hit Corresponding Key") cid = fig.canvas.mpl_connect('key_press_event',self.on_key) for i in range(4): for j in range(2): n = (2*i)+j+1 axes[i][j].plot(xr,self.y,'k') axes[i][j].plot(xr,fx(n),'r',linewidth=2) axes[i][j].set_title('N = {}'.format(n)) 

Result of the code above; I picked 7

#Assume data = pandas DataFrame of form OHLCV containing 241 days of data c = data.Close.values ps = PlotSelection(c) ps.get_degree() #throws error but still works; don't ask xr = np.arange(len(c)) yy = np.poly1d(np.polyfit(xr,c, dy = np.diff(yy) #the time dependent trend; same len as rr below rr = np.diff(c) #***240 values; values must be even for this version of fft!*** #Get the FFT fy = np.fft.rfft(rr) m = abs(fy) #magnitudes m[0]=0. theta = np.angle(fy) #unnecesary, used to show why we use a uniform random for phi phi = np.random.uniform(-np.pi,np.pi,size=(100,len(m))) z = m*np.exp(phi*1j) #Ask Euler for why this works sreturns = np.fft.irfft(z)+dy #synthetic returns #Calculate synthetic prices rm = np.hstack((np.full((100,1),c[0]),sreturns)) #return matrix prices = rm.cumsum(axis=1) #Get Plot; Use the semicolons if in ipython! plt.plot(prices.T,color='b',linewidth=.3,alpha=.5); plt.plot(c,color='r',linewidth=2); 

The resulting plot:

SPY w/ Synthetic Data

And, why we used a uniform random distribution between -pi and pi:

Histogram of variable (theta) in code above

If you take max and min of theta, the values should be very close to -pi and pi. These values are the bounds of the possible angles (complex values).

Anyway, that's it! I've been using this to create an initial portfolio that is then optimized using a variety of different methods. I will probably post the code to create an initial portfolio at a later date. It does require at least 32gb ram and a gpu to be worth running however. Otherwise, it's hitting memory limit after memory limit.

The polynomial fits aren't perfect but are easy to add to existing ARIMA like models.

The linear version of this is applicable most of the time and is easier to implement. Just get rid of the plot selection code and dy, get rid of the m[0]=0., and make phi[:,0]=0. The rest is the same. If looking at years of data, log transformation gives a linear plot and you can use log returns instead of simple returns. Just take the exponential of the synthetic log returns and use 1. as the fill value in rm or np.ones((100,1)). Then c[0] * rm.cumprod(axis=1) will return the price list.

If you use synthetic data in your research/algos, I would like to hear about it!

Submitted November 09, 2020 at 06:42PM by boolean_10

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s