This is the notes for mini course 1 of Machine Learning for Trading, offered by Udacity.

- Reading in a CSV file
- NumPy
- Statistical Analysis
- Incomplete Data
- Histograms and Scatter Plots
- Portfolio Statistics
- Optimizers

# Reading in a CSV file

You can read in the contents of a CSV (comma-separated values) file into a Pandas dataframe using: `df = pd.read_csv(<filename>)`

. This creates a dataframe with default index. To specify which column should be treated as the index, specify the optional parameter `index_col=<column_name>`

.

Selecting rows from a dataframe:

- First 5 rows:
`df.head()`

- Last 5 rows:
`df.tail()`

. Similarly, last n rows:`df.tail(n)`

. - Slicing:
`df[start:end]`

in which`end`

is exclusive.

Selecting a column: `df[<column_name>]`

`<column_name>`

is a string. To select multiple columns, use `df[[col1, col2, ...]]`

. It can be combined with slicing by rows: `df.ix[start:end, [col1, col2, ...]]`

.

For more information regarding slicing data, refer to pandas documentation on “Indexing and Selecting Data”.

## Building DataFrame

- Create a date series:
`pd.date_range(start_date, end_date)`

, returns a list of`DateTimeIndex`

. - Create a dataframe with a non-default index:
`pd.DataFrame(index=dates)`

. If`index`

parameter is not specified, the DataFrame will have a default index 0, 1, 2, … - Join dataframes on indexes:
`[df1.join(df2)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html)`

. By default`df.join()`

performs a left join, and different ways of joins can be specified using`how`

parameter. `df.dropna()`

will drop rows with`NaN`

values.`df.dropna(subset='COLNAME')`

removes rows with NaN values in specified column.- Rename a column:
`df.rename(columns=<old_name_to_new_name_dict>)`

. - Copy from existing DataFrame:
`df.copy()`

An example:

```
def get_data(symbols, dates):
"""Read stock data (adjusted close) for given symbols from CSV files."""
df = pd.DataFrame(index=dates)
if 'SPY' not in symbols: # add SPY for reference, if absent
symbols.insert(0, 'SPY')
for symbol in symbols:
df_temp = pd.read_csv(symbol_to_path(symbol), index_col='Date',
parse_dates=True, usecols=['Date', 'Adj Close'], na_values=['nan'])
df_temp = df_temp.rename(columns={'Adj Close': symbol})
df = df.join(df_temp)
if symbol == 'SPY': # drop dates SPY did not trade
df = df.dropna(subset=["SPY"])
return df
```

Normalization: `df / df.ix[0, :]`

.

## Plot DataFrame

To plot, call `df.plot()`

. You will need to `import matplotlib.pyplot as plt`

and call `plt.show()`

to display the output. Common customizations to plots are:

- Title and font size: pass
`title=str`

and`fontsize=int`

keyword parameters to`df.plot()`

function. - Axis label: use
`set_xlabel(str)`

and`set_ylabel(str)`

methods on the return value of`df.plot()`

.

An example:

```
import matplotlib.pyplot as plt
def plot_data(df, title="Stock prices"):
"""Plot stock prices with a custom title and meaningful axis labels."""
ax = df.plot(title=title, fontsize=12)
ax.set_xlabel("Date")
ax.set_ylabel("Price")
plt.show()
```

Plot a vertical line: `plt.axvline(value, color='r', linestyle='dashed', linewidth=2)`

# NumPy

NumPy is a wrapper around numeric library in C and Fortran. Pandas DataFrame wraps NumPy `ndarray`

(n-dimensional array type) by adding indexes and labels. The underlying ndarray can be retrieved by `df.values`

.

## Indexing and Slicing

- indexing:
`nd[row, col]`

, where`row`

and`col`

starts at 0. - slicing:
`nd[row_start:row_end, col_start:col_end]`

. Slice`m:n:t`

speficies a range`[m, n)`

in step`t`

.

Just like Python, the start and/or index can be omitted. For example, the colon itself represents all rows or columns, so `nd[:, 3]`

returns the 4th column. The index -1 represents the last index. So for example, `nd[:, -1]`

returns the last column.

For more information, refer to NumPy documentation.

### Integer Array Indexing

You can index an array with another integer array of indices.

```
a = np.random.rand(5)
indices = np.array([1, 2, 1])
a[indices] # returns the array of elements at each index
```

### Boolean Value Indexing

You can index an array with a boolean array, and only those positions are selected where there is a `True`

. You can create a corresponding boolean array from boolean expressions:

```
a = np.random.rand(5)
mean = a.mean()
a[a < mean] # elements less than the mean
a[a < mean] = mean # replace elements smaller than mean with mean
```

## Assignment

`nd[row, col] = 2`

assigns`2`

to a single position in the array.`nd[row, :] = 2`

assigns`2`

to the entire row.`nd[row, :2] = [0, 1]`

performs element-wise assignment, but make sure the size matches.

## Array Creation

In general, numerical data arranged in an array-like structure in Python can be converted to arrays through the use of the `np.array()`

function. The most obvious examples are lists and tuples. For example,

```
# note mix of tuple and lists, and types
np.array([[1,2.0],[0,0],(1+1j,3.)])
```

NumPy also provides functions to create arrays with initial values:

`np.empty(shape)`

: create an empty array. The value at each position is uninitialized (random value depending on the memory location).`np.ones(shape)`

: create an array with all 1s.- Random values:
`numpy.random`

package provides functions: **`random(shape)`

and`rand(rows, cols)`

that generates ndarrays filled with random values distributed in`[0.0, 1.0)`

. Note that`random()`

takes a tuple and`rand()`

takes multiple parameters, one for each dimension. **`normal(shape)`

uses normal distribution. **`randint(start, end, shape)`

generates integers from uniform distribution.

They all take a positional parameter, shape, which is a number for one-dimensional array, or a tuple for multi-dimensional array. The default data type is `float64`

, you can specify any data type using the `dtype`

parameter.

## Array Attributes

`ndarray.shape`

: a tuple of the shape of the ndarray.`shape[0]`

is number of rows, and`shape[1]`

is number of columns.`ndarray.ndim`

: number of dimensions (i.e.`len(ndarray.shape)`

).`ndarray.size`

: total number of elements in the array.`ndarray.dtype`

: the data type of the array.

## Array Operations

Mathmatical functions: reference

`ndarray.sum()`

sum of elements, along rows, columns, or all. To sum over rows (i.e. of each column), pass the keyword parameter`axis=0`

. To sum over columns (i.e. of each row), pass the keyword parameter`axis=1`

.`ndarray.min()`

minimum along axis or all, same as`sum()`

.`ndarray.max()`

maximum along axis or all, same as`sum()`

.`ndarray.mean()`

mean along axis or all, same as`sum()`

.

Arithmetic operations are applied element-wise, and return a new array. For example, `a + 2`

returns a new array with every element in `a`

incremented by 2.

`numpy.add`

: Element-wise addition, same as`+`

operator`numpy.subtract`

: Element-wise subtraction, same as`-`

`numpy.multiply`

: Element-wise multiplication, same as`*`

`numpy.divide`

: Element-wise division, same as`/`

`numpy.dot`

: Dot product (1D arrays), matrix multiplication (2D)

# Statistical Analysis

DataFrame provides many global statistics, such as `mean()`

, `median()`

, `std()`

, `sum()`

, `prod()`

(the product), etc. They all support specifying axis, by default `0`

along the index (i.e. stats of each column), `1`

along the column (i.e. stats of each index). These statistics can also be computed on all rolling (time window) basis.

DataFrame and Series provides a `rolling()`

method that takes parameters of the window to create a calling object, which stats methods such as `mean()`

, `max()`

are defined.

Bollinger Bands is the rolling range \(2\sigma\) above and below the rolling mean. For example, a 20-day Bollinger Band can be computed as follows:

```
# df is the Series of prices
window = 20
rm = df.rolling(window=window).mean()
rstd = df.rolling(window=window).std()
upper_band = rm + rstd * 2
lower_band = rm - rstd * 2
# Plot raw SPY values, rolling mean and Bollinger Bands
ax = df.plot(title="Bollinger Bands", label='SPY')
rm.plot(label='Rolling mean', ax=ax)
upper_band.plot(label='upper band', ax=ax) # Plot onto the same Axes object
lower_band.plot(label='lower band', ax=ax)
# Add axis labels and legend
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.legend(loc='upper left')
plt.show()
```

Daily return is the percentage increased between today’s price and previous day’s price. It can be computed as follows:

```
# df is the DataFrame of prices
daily_returns = df.copy() # Make a copy of the data to preserve the size
# When making element-wise operations, pandas will match rows by index, but we don't want that.
# df.values returns the underlying ndarray, erasing the index information.
daily_returns[1:] = df[1:] / df[:-1].values - 1
# Or
daily_returns = (df / df.shift(1)) - 1
# There is no daily return at index 0, so set it to 0.
daily_returns.iloc[0] = 0
```

Cumulative return is the percentage increased between today’s price and the price at the beginning of the period.

```
df / df.iloc[0] - 1 #iloc gets the 1st row by index
```

# Incomplete Data

Financial data are not pristine. Stocks are not listed before IPO, no longer listed after merger, or sometimes they stopped being traded. In a time series, NaN can take place anywhere. To address NaN and not leaking information about the future, we should:

- Fill forward. For any given NaN, if there is a known value before it, fill it with the closest known value before it;
- Fill backward. This is to fill NaN with no known previous values, at the beginning of the time series.

`df.fillna()`

method can fill NaN. `method`

keyword parameter allows forward filling (`ffill`

or `pad`

) or backward filling (`bfill`

or `backfill`

), and `inplace`

keyword, when set to `True`

, can modify the DataFrame in place without creating a new one.

# Histograms and Scatter Plots

Daily returns are not interesting when examined as a time series. However, histograms and scatter plots are two useful way to further examine it.

## Histograms

we can use `df.hist()`

to plot a histogram. `bins`

parameter specifies how many bins to use for the histogram, and the default is 10. If the DataFrame contains multiple columns, `df.hist()`

will plot one subgraph for each column. To plot multiple histograms onto the same graph, call `df[symbol].hist()`

on each symbol. This is useful to compare the histograms of multiple stocks.

Once we plot the histogram of daily returns of a stock, there are several statistics of interest:

- mean:
`df.mean()`

; - standard deviation:
`df.std()`

; - kurtosis (measurement of fat tail): a positive value means fatter tail than normal distribution.
`df.kurtosis()`

## Scatter Plots

Scatter plots are useful to identify correlations between daily returns of different stocks, or between stocks and the market. If a line is fitted to the data points on a scatter plot, the slope is \(\beta\), and the vertical distance from the origin is the \(\alpha\).

By passing in `kind='scatter'`

parameter to `df.plot()`

, we can get a scatter plot. Note that if the DataFrame contains more than 2 columns, you can use keyword parameter `x`

and `y`

to specify two columns.

To fit a line between two columns, we can use `beta, alpha = numpy.ployfit(x, y, degree)`

function with `degree=1`

. To plot a line,

```
beta, alpha = numpy.ployfit(daily_returns['SPY'], daily_returns['XOM'], 1)
# plot.plot() takes two series.
# the 3rd parameter indicates we want a line plot.
plt.plot(daily_returns['SPY'], beta * daily_returns['SPY'] + alpha, '-', color='r')
plt.show()
```

Note that \(\beta\) is not the same as correlation. Higher correlation is indicated by the tighter correlations between the fitted line and the points. You can also compute the correlation among each column of the DataFrame using `df.corr()`

method. It computes the Pearson correlation.

# Portfolio Statistics

You can compute the value of your portfolio as follows:

```
start_val = 1000000 # start at $1 MM
symbollist = ['SPY', 'XOM', 'GOOG', 'GLD']
start_date = '2009-01-01'
end_date = '2012-12-31'
allocs = [0.4, 0.4, 0.1, 0.1]
# Read stock data from portfolio
idx = pd.date_range(start_date, end_date)
df_data = get_data(symbollist, idx)
# Normalize stock price
normed = df_data / df_data.iloc[0]
# Normalized price for each allocation
alloced = normed * allocs
# Position for each allocation
pos_vals = alloced * start_val
# Sum across columns (axis=1) to get the portfolio value
port_val = pos_vals.sum(axis=1)
```

We can calculate the daily returns from the portfolio. Note that because the first value of daily returns is always 0, we would like to exclude that. You can do so by `daily_rets = daily_rets[1:]`

. Some useful statistics are:

- Cumulative return:
`port_val[-1] / port_val[0] - 1`

; - average daily return:
`daily_rets.mean()`

; - standard deviation of daily return:
`daily_rets.std()`

; - Sharpe ratio.

Sharpe ratio quantifies risk adjusted return. Sharpe ratio is \(S = \frac{E[R_p-R_f]}{std[R_p-R_f]}\), where \(R_p\) is the return of the portfolio, \(R_f\) is the risk-free rate. This is the *ex ante* form, meaning the forward-looking form. When computing Sharpe Ratio using historical data, the expectation is just the mean.

Risk-free rate can be

- LIBOR;
- 3-month T-bill;

which are usually in annual rate. To convert from annual rate to daily rate, the traditional short cut is \(\sqrt[252]{1+R_f} - 1\). Mostly we can treat the daily return of the risk-free rate as a constant, thus Sharpe Ratio becomes \(S = \frac{E[R_p]-R_f}{std[R_p]}\).

Note, Sharpe Ratio can vary widely depending on the sample frequency. Traditionally, Sharpe ratio is an annual measure, so it must be normalized from other sampling frequency. The normalization factor is the square root of the number of samples per year. So the normalization factor is \(\sqrt{252}\) for daily data, \(\sqrt{52}\) for weekly data, and \(\sqrt{12}\) for monthly data. Note that the normalization factor has nothing to do with number of samples.

```
risk_free_rate = 0.0
norm = sqrt(252) # Normalization factor
mean = daily_returns.mean()
std = daily_returns.std()
sharpe_ratio = norm * (mean - risk_free_rate) / std
```

# Optimizers

We can use SciPy to minimize an objective function.

```
import scipy.optimize as spo
# The objective function
def f(X):
return (X - 1.5) ** 2 + 0.5
Xguess = 2.0 # Provide an initial guess
min_result = spo.minimize(f, Xguess, method='SLSQP', options={'disp': True})
print("X = {}, Y = {}".format(min_result.x, min_result.fun))
```

From a given data set, we can build a parameterized model that fits the data. We need to define an error function for which the minimizer can minimize. For example, to fit a line (i.e. first-order polynomial) to a data set, we can:

```
# Define the error function
def error(line, data):
err = np.sum((data[:, 1] - (line[0] * data[:, 0] + line[1])) ** 2)
return err
def fit_line(data, error_func):
# initial guess
l = np.float32([0, np.mean(data[:, 1])]) # a horizontal line
# minimize
result = spo.minimize(error_func, l, args=(data,), method = "SLSQP", options={'disp': True})
return result.x
```

Portfolio optimization aims to find the best allocation from a given set of assets and time period, using some criteria such as

- Cumulative return;
- Minimum volatility;
- Sharpe Ratio.

Cumulative return and volatility are easy to optimize, just allocate all investments into the asset with the best cumulative return or minimum volatility. Thus, we will optimize by Sharpe Ratio.

So the optimization problem becomes: given an allocation \(X\), find the optimal value that minimizes \(-S\) (negation of Sharpe Ratio). To improve the performance of the optimizer, we can provide ranges and constraints. In this case, we have:

- Range: \(0\leq x_i\leq 1\);
- Constraint: properties of \(X\) that must be
`True`

. In this case, \(\sum_i x_i = 1.0\).

For `spo.minimize()`

, range can be expressed by a sequence of `(min, max)`

pairs for each element in `x`

. `None`

is used to specify no bound. Constraints are more complicated, see the documentation.

```
# Initial guess
allocs = np.asarray([1.0 / len(syms) for _ in syms])
# Optimization function: negation of Sharpe Ratio
def optimization_func(alloc):
daily_returns = compute_daily_returns(portfolio_value(alloc, normed_prices))
return -1 * sharpe_ratio(daily_returns)
# Each allocation must be 0.0 <= x <= 1.0
bounds = [(0.0, 1.0) for _ in syms]
# Sum of the allocation must equal to 1.0
# The lambda function must return 0 when True
constraint = {'type': 'eq', 'fun': lambda alloc: 1.0 - alloc.sum()}
result = spo.minimize(optimization_func, allocs, method='SLSQP', options={'disp': True}, bounds=bounds, constraints=constraint)
allocs = result.x
```