Wednesday 24 February 2016

Forecasting Financial Time Series Models & Python's Scikit-Learn Machine Learning

This blog post focuses on models that use classification to forecast the direction of a financial time series, which can act as a foundation for further algo trading strategies. In Python, the relevant library to import is Scikit-Learn, which provides ready-made implementations of the machine learning techniques discussed below.

Performance Accuracy

The methods we focus on utilise binary supervised classification to predict whether the % return for a particular future day is positive or negative. Accuracy in this case is evaluated on two fronts: whether the predicted direction deviates from the actual outcome, and the magnitude of that deviation.


Hit Rate


As the name suggests, the hit rate measures how many times the correct direction was predicted, as a % of overall predictions, using the training dataset. The Scikit-Learn library has a function that can calculate this as part of training, as sketched below.
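A minimal sketch of a hit-rate calculation with scikit-learn's accuracy_score. The y_true and y_pred arrays below are hypothetical, invented purely for illustration (+1 for an up day, -1 for a down day).

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, -1, 1, 1, -1, 1, -1, -1])   # realised directions (hypothetical)
y_pred = np.array([1, -1, -1, 1, -1, 1, 1, -1])   # model predictions (hypothetical)

# Hit rate = fraction of days where the predicted direction matched the realised one
hit_rate = accuracy_score(y_true, y_pred)
print("Hit rate: {:.1%}".format(hit_rate))
```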


Confusion Matrix

This is also known as a contingency table, and it is commonly used after determining the hit rate. The matrix identifies how many times the model forecasted positive returns accurately versus how many times it forecasted negative returns accurately. The point of this is to work out whether the algorithm is more useful for predicting a certain direction, e.g. better at predicting falls in the index. An example strategy based on this involves going long or short depending on the model's bias.


Behind the scenes, the matrix tallies how many Type I errors and Type II errors the classifier makes. A Type I error is where the model incorrectly rejects a true null ("false positive"), whilst a Type II error incorrectly fails to reject a false null ("false negative"). These concepts are emphasised in QBUS2810 Statistical Modelling for Business (Usyd Business Analytics subject). Again, Scikit-Learn has a function to calculate the confusion matrix, as sketched below.
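A minimal sketch of scikit-learn's confusion_matrix on the same kind of hypothetical direction arrays used above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, -1, 1, 1, -1, 1, -1, -1])   # realised directions (hypothetical)
y_pred = np.array([1, -1, -1, 1, -1, 1, 1, -1])   # model predictions (hypothetical)

# Rows are actual classes, columns are predicted classes; labels fixes the ordering:
# [[down predicted down, down predicted up],
#  [up predicted down,   up predicted up  ]]
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
print(cm)
```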

Factor Choice


A logical approach to selecting predictors involves identifying the fundamental drivers of asset movement. It is sometimes simpler than most people think - in the S&P500, the 500 listed companies are the drivers, as the index is composed of their value. The question is whether the past can predict the future, i.e. do prior returns hold any predictive power? We could add fundamental macro data, e.g. employment, inflation, interest rates, earnings, etc. A more creative approach is examining exchange rates with the countries that trade most with the US as drivers.


When considering historical asset prices in a time series, common indicators are the lagged values, e.g. k-1, k-2, k-3, ..., k-p for a daily time series with p lags. Traded volume is also a common indicator, so we form a (p + 1)-dimensional feature vector each day which includes the p time lags and volume, as sketched below.
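A minimal sketch of building the (p + 1)-dimensional feature vector of p lagged percentage returns plus volume. The synthetic price and volume series and the column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))   # synthetic prices
volume = pd.Series(rng.integers(1_000, 5_000, 500))                     # synthetic volume

returns = prices.pct_change() * 100.0
p = 3  # number of lags

features = pd.DataFrame({f"Lag{k}": returns.shift(k) for k in range(1, p + 1)})
features["Volume"] = volume.shift(1)          # yesterday's volume, avoiding look-ahead
target = np.sign(returns)                     # +1 / -1 direction to be predicted

dataset = pd.concat([features, target.rename("Direction")], axis=1).dropna()
print(dataset.head())
```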

The above information is micro data, related to and found within the time series itself. External macro time series can be overlaid on the forecasts: commodity prices can be correlated with weather, or forex with offshore interest rates. First we work out whether the correlations are statistically significant enough to include in a trading strategy.

Classification Models


Below are the more commonly used models in supervised classification. We describe the techniques behind each model and in what situations they would be useful. 

Logistic Regression

Logistic regression measures the relationship between continuous independent variables (lagged % returns) and a binary categorical dependent variable ("positive", "negative"). The regression outputs a probability between 0 and 1 that the next time period will be classified as positive or negative, based on the past % returns. We use logistic rather than linear regression because a linear model can produce fitted probabilities below 0 or above 1. The classification threshold defaults to 50%, but this can be modified.

In mathematical notation, the probability of a positive return day, given previous returns L1 and L2, is:

P(Y = Up | L1, L2) = e^(β0 + β1·L1 + β2·L2) / (1 + e^(β0 + β1·L1 + β2·L2))

The maximum likelihood method is used to estimate the beta coefficients, and this is handled by the Scikit-Learn library.

Logistic regression has an advantage over models such as Naive Bayes in that there are fewer restrictions on correlation between features. Because the output is probabilistic, it is well suited to strategies that use thresholds rather than automatically selecting the highest-probability category: for instance, requiring a 75% probability before acting, rather than trading on a "positive" prediction of only 51%, as sketched below.
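A minimal sketch of scikit-learn's LogisticRegression applied to two lagged-return features, with a 75% probability threshold rather than the default 50%. The synthetic data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(500, 2))            # two lagged-return features, L1 and L2
y = np.where(X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 500) > 0, 1, -1)

model = LogisticRegression()
model.fit(X[:400], y[:400])                    # fit on the first 400 observations

# Probability of an "up" day, then only trade when the model is confident enough
proba_up = model.predict_proba(X[400:])[:, list(model.classes_).index(1)]
signal = np.where(proba_up > 0.75, 1, np.where(proba_up < 0.25, -1, 0))   # 0 = no trade
print(signal[:20])
```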

Discriminant Analysis


Discriminant analysis is stricter in its assumptions than logistic regression, though if those strict assumptions hold then its predictive power is stronger.

Linear Discriminant Analysis models the distribution of the predictor variables separately within each class, using Bayes' Theorem to obtain the class probability. The theorem describes how a conditional probability can be calculated from the probability of each cause and the conditional probability of the outcome given each cause. The predictors are assumed to follow a multivariate Gaussian (normal) distribution, and those estimated parameters are used in Bayes' Theorem to predict which class an observation should be classified under. A key assumption is that all outcomes ("positive", "negative", "up", "down") share a common covariance matrix, which is handled by Python's scikit-learn library. Covariance measures how two variables move together.

Quadratic Discriminant Analysis differs in that it assumes each outcome has its own covariance matrix. We use it when the decision boundaries are non-linear and when there are more training observations (so reducing variance is not a priority). The choice between linear and quadratic discriminant analysis therefore comes down to the bias-variance trade-off; a small comparison is sketched below.
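A minimal sketch comparing scikit-learn's LDA and QDA classifiers on the same style of synthetic lagged-return features used above (the data is an assumption for illustration).

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(2)
X = rng.normal(0, 1, size=(500, 2))            # two lagged-return features
y = np.where(X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 500) > 0, 1, -1)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    clf.fit(X[:400], y[:400])                  # train on the first 400 observations
    print(name, "hit rate:", clf.score(X[400:], y[400:]))   # out-of-sample accuracy
```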

Support Vector Machines


Support Vector Classifiers try to find a linear separating boundary that accurately classifies most of the observations into distinct groups. This works well if the class separation is mostly linear, but otherwise further techniques are needed to enable non-linear decision boundaries. Support Vector Machines have the advantage of enabling this non-linear expansion while remaining computationally efficient: instead of a fully linear separating boundary, the kernel can be swapped for a quadratic or higher-order polynomial, thereby defining non-linearity in the boundary. This makes them relatively flexible models, though the right kernel must be selected for optimal results. In real-life applications, Support Vector Machines are useful in fields such as text classification where dimensionality is high, though drawbacks include heavy computation, difficulty in fine-tuning and harder model interpretation.
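A minimal sketch of scikit-learn's SVC with a quadratic polynomial kernel, illustrating the non-linear boundary discussion above; the circular class structure in the synthetic data is invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(0, 1, size=(500, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # non-linear class structure

clf = SVC(kernel="poly", degree=2, C=1.0)                 # quadratic polynomial kernel
clf.fit(X[:400], y[:400])
print("Hit rate:", clf.score(X[400:], y[400:]))
```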

Decision Trees & Random Forests

Decision trees use a tree structure to allocate observations into recursive subsets via a decision at each tree node. This can be illustrated with a sample scenario: asking whether yesterday's price was above or below a certain level creates two subsets; asking whether volume was above or below a certain level then forms four subsets. This continues until the partitioning stops adding predictive power. The advantage of a decision tree is that it is naturally interpretable, relative to the "behind the scenes" approach of Discriminant Analysis or Support Vector Machines.

The advantages of using Decision Trees are extensive, such as ability to handle interactions between features and being non-parametric. They are also useful when it is difficult to linearly separate data into classes, an assumption required in support vector machines. 

The disadvantage of using individual decision trees is that they are prone to over-fitting (high variance). A newer field in classification involves ensemble learning, where a large amount of classifiers is created using the same model and trained with differing parameters. The results are then combined and averaged out with the goal of achieving a prediction accuracy greater than just one classifier. 

One of the most popular ensemble learning techniques is the Random Forest (consistently a popular topic on the Quantopian forums, and arguably the best classifier to use in machine learning competitions). Scikit-Learn has a RandomForestClassifier class in its ensemble module that enables predictions from thousands of decision tree learners to be combined. The main parameters of the RandomForestClassifier include n_jobs, which refers to how many processing cores to spread the calculations over, and n_estimators, which refers to how many decision trees to build (a short sketch follows). These features will be discussed in future blog posts.
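A minimal sketch of RandomForestClassifier with the two parameters mentioned above; the five synthetic features stand in for lagged returns and are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(1000, 5))                      # e.g. five lagged returns
y = np.where(X[:, 0] - X[:, 2] + rng.normal(0, 1, 1000) > 0, 1, -1)

clf = RandomForestClassifier(n_estimators=500,  # number of decision trees in the forest
                             n_jobs=-1,         # spread the fit across all CPU cores
                             random_state=0)
clf.fit(X[:800], y[:800])
print("Hit rate:", clf.score(X[800:], y[800:]))
```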

Principal Components Analysis

This is an example of an unsupervised technique, where the algorithm identifies features by itself. We would use this technique if we wanted to filter the problem down to only the important dimensions, to find topics in large amounts of textual data, or to find features that unexpectedly hold predictive power in time series analysis.

Principal Components Analysis uses the correlation structure of the data (in a time series, the autocorrelation between lagged values) to transform a set of potentially correlated variables into a set of linearly uncorrelated variables. These are the principal components, which are ranked by the amount of variability they explain. So if we have many dimensions, we can reduce the feature space to just the 2 or 3 components that capture almost all of the variance in the data, resulting in stronger features being fed through to a supervised classifier, as sketched below.
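A minimal sketch of reducing a larger feature space to its first few principal components before feeding a supervised classifier. The synthetic, deliberately correlated features are an assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(0, 1, size=(500, 10))                        # ten candidate features
X[:, 5:] = X[:, :5] + rng.normal(0, 0.1, size=(500, 5))     # make half of them correlated

pca = PCA(n_components=3)                                   # keep the top 3 components
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))
print("Variance explained:", pca.explained_variance_ratio_)
```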

Multinomial Naive Bayes Classifier

Naive Bayes is useful when we have a limited dataset, as it is a high-bias classifier which assumes conditional independence between features. This means it cannot identify interactions between individual features unless they are added explicitly as extra features. Sentiment analysis is a common example, since document classification deals with qualitative data: Naive Bayes learns which individual words are associated with which topics, but phrases or slang whose underlying meaning differs from that of the individual words would not be classified under the same topic. Instead, we treat the phrase or slang as an additional feature rather than associating it with the category the individual words are classified under, as in the sketch below.
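A minimal sketch of a multinomial Naive Bayes text classifier for sentiment, where bigram features are added so that short phrases are captured alongside the individual words. The tiny corpus and its labels are invented purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["profits surge on strong earnings", "shares tumble after weak guidance",
        "record revenue lifts outlook", "losses widen as demand falls"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),   # unigrams plus bigram features
                    MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["earnings surge on record demand"]))
```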

Conclusion

If there is tick-level data, the type of classifier applied matters less; technical efficiency and the scalability of the algo become the priorities to address. As the data size grows, the marginal gain in predictive performance from a better classifier shrinks. Note that classifiers often have differing assumptions, so if a specific classifier exhibits poor performance, it is usually due to a violation of its assumptions.





Tuesday 23 February 2016

Intro to Statistical Learning

Overview

Statistical learning addresses how a fund creates a trading model by utilising its collection of fundamental data to generate index forecasts. This is achieved by modelling the behaviour of an outcome (the market index value) with predictors that have a relationship with it (fundamental data). In mathematical terms the relationship is Y = f(X) + error, where f is an unknown function of the independent variables and the error term is independent of the predictors with mean 0. We aim to estimate the form of f from our observations and subsequently evaluate the accuracy of that estimate.

Prediction

Predictive modelling is the central focus of our blog posts. It concerns itself with predicting Y from a newly observed value of the independent variable X. If the optimal model equation has been estimated, then we are able to predict the response Y for this new predictor value. Different predictors will result in differing accuracies in the estimate of Y. Reducible error refers to the error that comes from poor predictors or a poor estimate of f, and we aim to make it as small as possible. Note that irreducible error is always present through the error term in the original equation, which consists of unmeasured influences.

Inference

Inference refers to identifying the relationship between the predictor variables and the dependent variable Y - for instance, determining which predictors are important and what type of relationship exists between each predictor and the Y outcome. Inference favours simpler models that can be understood relatively easily, but this comes at the cost of weaker predictive power; more flexible models that capture whether the relationship is linear or non-linear predict better but are harder to interpret. Because the actual form of f is less important than the ability to create accurate predictions, predictive modelling has greater emphasis in the quant finance community.

Parametric Models

These methods require the user to assume a functional form for the model, such as linearity. From this, we know which parameters we need to estimate. In a linear model, the line of best fit does not necessarily pass through the origin, so a coefficient (α) specifies the y-axis intercept; with multiple predictors, β represents the vector of slope coefficients, which we estimate using Ordinary Least Squares, for instance. This is easier than fitting a potentially non-linear function, though it comes at the expense that the estimate of f is unlikely to represent the true form, and it reduces the flexibility of the model. Adding further parameters can help, but avoid over-fitting so that the model follows trading signals rather than noise. A sketch of an OLS fit follows.
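A minimal sketch of fitting a parametric (linear) model by Ordinary Least Squares with scikit-learn; the intercept plays the role of α and the fitted coefficients the role of the β vector. The synthetic data and true coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.normal(0, 1, size=(200, 3))                          # three predictors
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + rng.normal(0, 0.5, 200)

ols = LinearRegression().fit(X, y)                           # OLS fit
print("alpha (intercept):", ols.intercept_)
print("beta estimates:   ", ols.coef_)
```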

Non Parametric Models

Alternatively, a non-parametric model is more flexible, though because of this flexibility many more observations are required. (A non-parametric test is a hypothesis test that does not require the population's distribution to be characterised by certain parameters.) In terms of trading, extensive historical data is already available, so non-parametric models would seem to have an advantage. Nevertheless, financial time series often embody a poor signal-to-noise ratio, so over-fitting bias can still be an issue. It is evident a balance should be struck between parametric and non-parametric models.


Supervised Machine Learning

In a supervised model, for each independent variable/predictor there is an associated response in the Y outcome, and the model is "trained" on the dataset. For example, the Ordinary Least Squares method trains a linear regression model on the dataset, resulting in an estimate of β, the vector of regression coefficients.

Unsupervised Machine Learning

Despite the lack of a labelled training response with which to evaluate accuracy, this technique is still useful, particularly for clustering.


Parameterised Clustering Model 

This model is often used to determine unexpected relationships evident in the dataset that would otherwise not be easily found. In finance, this is usually useful in analysing volatility.

Linear & Logistic Regression

Regression uses supervised machine learning to model the relationship between the x and y variables. The resulting equation identifies the change in the response y when x changes, ceteris paribus. Linear regression, for instance, uses Ordinary Least Squares to produce parameter estimates assuming a linear relationship with the x predictors. Such a model can predict the value of the ASX from historical data, dynamically updated with new market data to predict the next day's price. With inference, the strength of the relationship between the price and the market data predictors can be analysed to determine the reasons behind changes in the outcome; however, the underlying relationship is less of a priority than prediction quality when developing algorithms for trading. Another very common and easy-to-learn regression is logistic regression, which produces a response suited to a categorical type (e.g. "positive", "negative", "up", "down") as opposed to continuous (e.g. stock prices). We recommend taking QBUS3830 (Usyd Business Analytics subject) to learn about Maximum Likelihood Estimation, the statistical procedure used in logistic regression to estimate parameters. MLE finds the value of the parameter(s) that maximises the likelihood of the observed data; a small sketch follows.
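A minimal sketch of logistic regression fitted by maximum likelihood using statsmodels (an assumed dependency); the synthetic up/down labels are coded as 1/0 purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = rng.normal(0, 1, size=(500, 2))                        # e.g. two lagged returns
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, 500) > 0).astype(int)   # 1 = up, 0 = down

logit = sm.Logit(y, sm.add_constant(X)).fit(disp=False)    # MLE under the hood
print(logit.params)                                        # intercept and slope estimates
```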

Classification Technique

This technique refers to supervised machine learning that tries to classify an observation into a user-specified category based on its features. Such categories may be ordinal or unordered. Classifiers are the algorithms behind this technique, and they are commonly used in the quant finance field, particularly for predicting market direction: they predict whether a certain time series will have positive or negative returns in the future (note: not the actual value, as in a regression). The predictors themselves can be continuous. Common classifiers include logistic regression, linear and non-linear discriminant analysis, artificial neural networks and support vector machines.


Time Series Technique

This technique is often deemed a combination of regression and classification. Time series rely on the chronological ordering of the series, therefore predictors are often derived from past and present values. The main types of time series models relevant to algorithmic trading are ARIMA and ARCH models; these concepts are covered in depth in Usyd's Predictive Analytics class (QBUS2820). ARIMA refers to linear autoregressive integrated moving average models, which are used to model the level of a time series and its changes. ARCH refers to autoregressive conditional heteroskedasticity models, which are used to model the variance/volatility of a time series, i.e. using previous volatilities to predict future volatility. Stochastic volatility models differ by using separate stochastic processes to model volatility. In a time series, asset prices are observed at discrete points, i.e. finite values. However it is usual in quant finance to examine continuous-time models such as Heston Stochastic Volatility, Geometric Brownian Motion and Ornstein-Uhlenbeck. These models are further explained in the next blog post with the aim of taking advantage of their features to form trading strategies. A short ARIMA sketch follows.
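A minimal sketch of fitting an ARIMA model to a synthetic return series with statsmodels (an assumed dependency) and producing a short forecast; the simulated AR(1) process is purely for illustration.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(8)
eps = rng.normal(0, 0.01, 500)
returns = np.zeros(500)
for t in range(1, 500):                          # simulate a simple AR(1) return process
    returns[t] = 0.3 * returns[t - 1] + eps[t]

model = ARIMA(returns, order=(1, 0, 1))          # ARIMA with p=1, d=0, q=1
fitted = model.fit()
print(fitted.forecast(steps=5))                  # next five predicted returns
```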

Monday 22 February 2016

Backtesting in Algorithmic Trading

Backtesting is what we have been doing in our previous posts - using historical market data, we see how our strategy would have worked over different timeframes. The higher the frequency of the algo, the more challenging it is to determine whether the backtest provides realistic results, given that execution becomes the key advantage of the strategy and a millisecond change in any of the parameters can greatly affect performance.

Michael Halls-Moore attributes three key purposes to backtesting:
1. Filtration
We can filter strategies to only those that meet our needs, such as by taking advantage of performance metrics including Sharpe Ratio and Max Relative Drawdown/Drawdown duration as provided in Quantopian.

2. Modelling
We can relate theory to practice by testing (with no risk) under realistic market microstructure conditions, and realise impacts from illiquidity.

3. Strategy Optimisation
We can change different parameters in the strategy and see the impact of that change in the overall performance, as we did in our previous posts.

Interestingly, most biases in backtesting idealise performance, and hence one should treat the backtested PnL as a best-case result. This can be due to optimisation/curve-fitting/data-snooping bias, where parameters are tuned until they generate the highest profit. Although more data and fewer parameters can help reduce this bias, the timeframe chosen is also subject to it, as older data can relate to a different market regulatory structure and hence no longer be applicable.

As I learnt from Michael Halls-Moore, a sensitivity analysis can be used to guard against optimisation bias. This involves adjusting parameters incrementally and plotting the resulting performance surface; a volatile parameter surface suggests the parameter is an artefact of the past data (a rough sketch follows). Going to check how this works out on Quantopian, and will update later!
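A minimal sketch of a sensitivity analysis: sweep one strategy parameter (a moving-average lookback) and record the Sharpe ratio at each setting. The synthetic price series and the simple long/flat rule are assumptions for illustration; a stable region of parameters is preferable to a single sharp peak.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0002, 0.01, 1000))))
returns = prices.pct_change().fillna(0)

results = {}
for lookback in range(10, 110, 10):
    # Long when price is above its moving average, flat otherwise; lag the signal one day
    signal = (prices > prices.rolling(lookback).mean()).astype(float).shift(1).fillna(0.0)
    strat = returns * signal
    sharpe = np.sqrt(252) * strat.mean() / strat.std()
    results[lookback] = round(float(sharpe), 2)

print(results)
```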

Look-ahead bias, where future data is used in the backtest at a point in time when it could not have been known, can be introduced in various ways. One I found particularly interesting concerns trading strategies that use maximum and minimum values: these should be lagged by at least one period because they can only be calculated at the end of a period (see the sketch below). Another risk is shorting constraints - a lack of liquidity in certain equities and market regulations, e.g. the US 2008 shorting ban, can explain inflated backtest results.
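A minimal sketch of lagging a rolling maximum by one period so that a breakout signal only uses information available at the time. The synthetic prices and the 20-day window are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 250))))

rolling_max_today = prices.rolling(20).max()        # includes today's close: look-ahead risk
rolling_max_lagged = rolling_max_today.shift(1)     # known before today's close: safe

breakout_signal = prices > rolling_max_lagged       # breakout above the prior 20-day high
print(breakout_signal.tail())
```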

Backtesting should be performed on complete data, though this is typically expensive and mainly used by institutions. Free sources such as Yahoo Finance often only include assets that did not end up delisting, meaning the strategy is only tested on assets that have already proven resilient; equities are particularly prone to this survivorship bias, and a shorter timeframe can reduce the chance of delisted names being excluded. Free data can also have OHLC prices (opens, highs, lows and closes) that are biased by outliers from small orders on different exchanges. FX markets are challenging in this respect too, as the bid-ask prices on one venue differ from another, so consolidated price (and transaction cost) data from multiple ECNs is not ideal.

Transaction costs can eat into the profits of algorithms, particularly those that rely on high-volume, small-margin trades, and should therefore be modelled. The fixed costs are the easiest to implement in backtests (a rough sketch follows the list below).

  • broker commission, fees to clear and settle trades (fixed)
  • taxes (fixed)
  • slippage: affected by latency (the time difference between signal and execution) and can be high for assets with high volatility. Momentum and contrarian strategies are sensitive to high slippage as you are trying to buy assets that are already moving in the expected direction or opposite direction respectively.
  • market impact (breaking down into smaller volume) and spread (wider for illiquid assets)
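A rough sketch of netting simple transaction costs out of gross strategy returns; the commission and slippage figures, and the synthetic return and trade-count series, are invented purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
gross_returns = pd.Series(rng.normal(0.0005, 0.01, 250))    # daily strategy returns
trades_per_day = pd.Series(rng.integers(0, 3, 250))         # number of round trips per day

commission = 0.0002      # 2 bps per trade (fixed, easy to model)
slippage = 0.0005        # 5 bps per trade (a crude constant estimate)

net_returns = gross_returns - trades_per_day * (commission + slippage)
print("Gross annualised return:", round(252 * gross_returns.mean(), 4))
print("Net annualised return:  ", round(252 * net_returns.mean(), 4))
```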

Sunday 21 February 2016

Strategy Formulation and Evaluation in Algo Trading

Overview

One of the many benefits of automated trading includes the ability to verify and subsequently optimise a strategy through testing it on historical data. This is a process known as backtesting, a topic we will focus on in future blog posts. Strategy formulation and evaluation is the key focus of this blog post. 

Automation also means traders do not need to constantly monitor prices. Risk management is also important to traders, and position sizes, leverage, etc. can likewise be dynamically adjusted depending on market movements. 

With various performance metrics being tracked, capital allocation is more efficient as comparison is made easier relative to the traditional profit and loss tracking which masks drawdown. Drawdown refers to the decline between the peak and trough of an investment. 

Algo trading also eliminates psychological biases, which sometimes act as motivators in trading and erode the performance of a trading strategy. However, human judgement remains important when determining whether it is sensible and logical to modify a strategy's parameters in response to external influences.

Strategy Formulation

There are two main approaches to coming up with a trading strategy. The first is the data mining approach: we apply numerous parameters to a time series and utilise those parameters in a strategy. Whilst this may appear effective, it is difficult and time consuming to identify the reason behind any performance erosion. This then results in arbitrary, non-logical optimisation, where further irrelevant parameters are applied that may lead to only temporary, short-term profits, hence a cycle of poor performance. In our opinion, a better approach is the hypothesis testing method. Traders first observe trends and correlations and come up with a hypothesis, with the random walk as the null. We then apply statistical tests such as variance-ratio tests to assess the hypothesis (a small sketch of the variance-ratio statistic follows); this test and further statistical models will be explained in future blog posts. Always remember that even if the hypothesis is not supported at the chosen level of significance it could still be true - try a larger dataset, and think of additional relevant parameters to refine the hypothesis. Even if profitability erodes in the future, we are able to refine the hypothesis.
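A minimal sketch of a variance-ratio statistic in the spirit of the Lo-MacKinlay test of the random walk null: under a random walk, the variance of q-period returns is q times the one-period variance, so VR(q) should be close to 1. This simplified version uses non-overlapping blocks and omits the test's standard errors.

```python
import numpy as np

def variance_ratio(returns, q):
    r = np.asarray(returns)
    r = r[: len(r) - len(r) % q]                               # truncate so blocks divide evenly
    q_period = np.add.reduceat(r, np.arange(0, len(r), q))     # non-overlapping q-period returns
    return q_period.var(ddof=1) / (q * r.var(ddof=1))

rng = np.random.default_rng(12)
random_walk_returns = rng.normal(0, 0.01, 2000)                # i.i.d. returns: VR should be ~1
print("VR(5):", round(variance_ratio(random_walk_returns, 5), 3))
```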

We use the Python language to code up our algorithms. Python is one of the fastest growing languages in the finance industry due to its development speed. We also found it the easiest to learn, as it comes with libraries relevant to statistical modelling; the most useful are NumPy, SciPy, matplotlib, statsmodels, scikit-learn, IPython, and pandas. We will expand on coding in future blog posts. For those with an extensive programming background, C++ is most useful in terms of execution speed, making it effective for High Frequency Trading; Python execution can be optimised to approach C speeds at the cost of more complex code.

Ideas can be sourced from academic journals, trading forums, finance magazines and textbooks. The aim is to establish a consistent approach to sourcing, testing and executing strategies so ideas are of consistent quantity and frameworks are more easily developed to accept and reject ideas without psychological and emotional attachment. It is recommended by numerous quant finance specialists to avoid cognitive biases in strategy, which could even come down to personal preference of asset classes. Such preferences must always be justified logically using metrics such as leverage and capital constraints. 

From our experience, many websites and forums refer to technical analysis for trade ideas. Technical analysis is essentially the use of price signals to enter trades, and of behavioural finance to predict trends and mean reversion in prices. It seems to be less emphasised in the quant finance literature; nevertheless, programming and modelling give us the capability to statistically evaluate the profitability of a technical analysis based strategy.

Academic journals can provide strong fundamental trading ideas; however, extensive re-testing using more recent data and liquid asset classes, taking into consideration transaction costs such as fees, spread and slippage, is important to ensure a more realistic replica of the strategy.

Strategy Evaluation

Below are important criteria we have compiled inspired by Quantopian and Michael Halls-Moore to assess strategy performance: 


  • Approach - The main types of trading strategies we have come across fall under momentum/directional, mean reversion or hedged (market neutral). Strategies have different profit and loss characteristics. For instance, momentum strategies tend to rely on a small number of winning trades to generate profits, even if the majority of trades are losers; mean reversion tends to be the opposite, where most trades are winners but mis-timing the reversion leads to severe losses. We should always be able to identify the reason behind the market movement we are trying to exploit, to ensure the strategy still holds if there is an external event, e.g. a regulatory change. This also enables greater understanding of whether the strategy only applies to a certain asset class or a specific time series, to avoid wasting time and resources backtesting and refining ineffective strategies. It is also important that strategies do not come with numerous parameters, as this leads to optimisation bias. 
  • Parameters - If you've learnt regression then you should be familiar with optimisation bias/curve fitting. In any regression, the more parameters there are, the better the fit of the model, even if one parameter is almost unrelated and only adds to the fit by less than a percent. The R squared does not consider the relevance of each individual parameter. Therefore it is important to stick with strategies with minimal parameters and ensure datasets are large enough to test such parameters. 
  • Benchmark - Depending on your strategy, there is usually an appropriate benchmark to measure the strategy against average market performance. After all, there is no point in designing and optimising a strategy that returns 5% when the benchmark is returning over 10%. It is recommended to use a performance benchmark that is composed of the underlying asset class the strategy is based on. For instance, using the most liquid equities in the ASX means the ASX200 is an effective benchmark. This leads to terminology such as "alpha" and "beta". Alpha is a risk adjusted rate of return. That is, it considers the risk involved and the coefficient will indicate whether the return was worth the risk. Beta refers to whether the investment strategy was more or less volatile relative to the market. For fixed income funds, you should compare performance relative to a basket of bonds, fixed income products or risk free rate. 


  • Sharpe Ratio - This metric is a risk-reward ratio: how much return is achieved for the level of volatility the investor encounters (see the sketch after this list). If the strategy is a higher frequency strategy, the sampling rate of volatility (standard deviation) should be increased to match. 
  • Volatility - This important metric is embodied in the Sharpe Ratio and is useful for identifying whether hedging should be undertaken. For instance, high volatility in the underlying assets leads to higher volatility in the equity curve, reducing the Sharpe ratio. The equity curve is a graphical representation of the changes in value of a trading account, and a positive slope implies a profitable trading strategy.  

  • Maximum Drawdown - Maximum drawdown refers to the largest peak-to-trough drop on the equity curve (see the sketch after this list). A run of losing trades leads to extended drawdowns, which is common in momentum strategies. It is wise to avoid giving up during such drawdowns when historical backtesting implies they are to be expected, though it is up to the trader to determine what drawdown percentage is acceptable before exiting the strategy, similar to setting stop losses on a trade. 
  • Frequency - Frequency is related to technological skills, the Sharpe ratio and transaction costs, and it introduces the idea of the technology stack: the operating system and related support programs and runtime environments needed to support applications such as Python. Higher frequency strategies are naturally more demanding in terms of programming expertise, cost and storage requirements, as intraday tick and order book data is often required. 
  • Liquidity - We only have experience with highly liquid instruments when it comes to algorithmic trading. We must always ensure the strategy is scalable in case of increased capital allocation in the future. 
  • Leverage - Futures, options, swaps are leveraged derivatives. These instruments have high volatility and can therefore easily result in margin calls, which requires capital (and patience). 
  • Technology - In order to generate the above metrics, a database engine is needed for data storage. These databases can include MySQL or Oracle and accessed via an application code (R, Matlab, Excel) that queries the database then provides users with tools to analyse such data. Code can be Python as previously mentioned, or C++ or Java. 
  • Data - The type of data stored is important, as this is what the strategy will use for entry/exit points and profitability calculations. Fundamental data - macro metrics including inflation, interest rates, corporate actions, earnings reports, etc. - is helpful for valuing a company or asset on its fundamentals. Its storage requirements are minimal unless multiple companies are being analysed simultaneously, since it does not involve full time series of asset prices. Asset price data is considered the norm for finance quants and can cover equities, fixed income, commodities and forex. If intraday data is also included, storage requirements increase significantly, and we must also consider cleaning the data for increased accuracy. Qualitative data can also be included; subscriptions to media feeds fall under this category. A newer trend in data analysis involves machine learning classifiers that gauge investor sentiment, and due to the qualitative nature of such data it is often stored in NoSQL document databases. 
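A minimal sketch of two of the metrics above, the annualised Sharpe ratio (ignoring the risk-free rate) and the maximum drawdown, computed from a synthetic daily equity curve invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(13)
daily_returns = pd.Series(rng.normal(0.0004, 0.01, 252 * 3))   # three years of daily returns
equity_curve = (1 + daily_returns).cumprod()                   # growth of 1 unit of capital

sharpe = np.sqrt(252) * daily_returns.mean() / daily_returns.std()

running_peak = equity_curve.cummax()
drawdown = equity_curve / running_peak - 1          # percentage below the previous peak
max_drawdown = drawdown.min()

print("Annualised Sharpe ratio:", round(float(sharpe), 2))
print("Maximum drawdown:       ", "{:.1%}".format(max_drawdown))
```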

The returns of a strategy do not directly reveal what is and isn't working, as they give no insight into capital requirements, leverage, volatility, benchmarks, etc. It is therefore advisable to consider the risk characteristics of a strategy prior to evaluating returns. The way we like to think of it is that returns represent the y variable, the dependent variable. Sites such as Quantopian have backtesting platforms embedded, so these metrics are generated automatically, allowing us to focus solely on strategy formulation and optimisation. 

Wednesday 17 February 2016

Data Analysis in R - Interesting Datasets & Graphs

Hi everyone, 

This blog post is also related to functions in R useful for data analysis. The datasets we used were quite interesting. If you're a student attending University of Sydney, a great Business School major to take is Business Analytics as it allows us to perform our own analysis on whatever dataset we like, and R is a highly recommended tool by our academics. Read on for tips on how to approach datasets and apply functions in R. 

1) We are using a database of race results available on the American Racing Pigeon Union's website. The aim is to identify the relationship between a pigeon's speed and the colour of its feathers. For greater reliability, we formatted the data frame to only include colours that appear more than ten times in the database. 

We import ggplot2, which enables powerful graphics for creating complex plots such as correlation plots. Calculations include average flying speed by colour group. Speed is our independent variable and rank is the dependent variable, though these variables could be switched. 

Note: %>% is a piping operator which enables functions relating to the same variable to be passed along without needing to re-enter the variable name for manipulation. It is a component of the dplyr library. 





2) There is a research paper on the optimum length of chopsticks- the key performance indicator is the number of peanuts able to be picked up and placed into a cup (variable name: food picking efficiency). Whilst the results of the study have been publicised, it is always fun and rewarding to reproduce the graph in a new software environment. Data visualisation is just as important as the analysis itself. 

We use ggplot to visualise the data. aes refers to aesthetics, i.e. mapping a user-specified variable to a user-specified part of the plot. We use fill to group the data by chopstick length, and the dependent variable is relative frequency, that is, how often a certain food picking efficiency is observed relative to the total. geom_density displays a smooth density estimate of the relative frequency, while alpha refers to transparency, useful when there are multiple overlapping plots. Other optional settings include the weight of an observation (weight), border colour (colour), size, and line type (linetype). We can conclude a chopstick length of 240mm is optimal. 


3) We are reproducing two graphs based on Spanish silver production during the 18th century. Firstly, we are plotting the annual silver production as a time series graph, and secondly we are overlaying another time series plot on the annual amounts to demonstrate cumulative production over time.

We use ggplot2 for the graphical output. geom_area produces area plots, which are similar to a continuous stacked bar chart and aim to visualise how the composition varies over the x range (time). We store the cumulative graph in a new variable, silver_cs. We mutate the data frame to add an extra column named cumsum, the cumulative sum of silver production; mutate seems to act much faster than transform for large data frames. The user-defined format #c0c0c0 refers to the colour silver. Here transparency is slightly lower, with alpha set to 0.5. 





Data Analysis in R - 2013 American Community Survey

Hi everyone,

Mary and I both recently finished a summer internship at UBS in their FRC department. 
It was a highly rewarding experience for the both of us. During that time I mastered a new technical skill, R, and Mary learnt Excel VBA. R is a programming language that enables statistical computing and output of graphics. 

Note this blog post is related to data mining rather than finance. 

To further consolidate on my R knowledge, we decided to use DataCamp's R platform to undertake a quick analysis of the 2013 American Community Survey, a dataset provided in Kaggle. This survey is similar to ABS surveys in Australia. The aim is to determine whether it is worthwhile pursuing a PhD. 

This was our approach: 

1. Load in data to identify how observations are formatted in the dataset.
2. Load in dplyr package. This package provides tools to manipulate datasets efficiently. 
3. Using the dplyr package, we convert the dataset into a table.
4. Clean up the dataset: remove NA values, use only university level education qualifications including Bachelors, Masters and PhDs and then grouping by such levels for further analysis.
5. We perform an inner join of our formatted table to one which contains data on the number of higher level education holders to produce a bar graph. 
6. Using a separate income dataset (code for relevant calculations such as min, median, max and interquartile ranges provided below), we use box plots to compare incomes. 







  Calculations for box plot:



Our next few blog posts will be related to data mining and analysis. 

Thanks for reading!