Jan Freyberg's blog: Statistics, data, and cognitive science. I blog neither frequently nor regularly.
www.janfreyberg.com
The most momentous year in history?<p>I really like history, and one of the things I sometimes spend my time doing is reading historical articles on <a href="https://www.wikipedia.org/">Wikipedia</a>. The other day, this got me thinking about whether the amount of historical fact on Wikipedia can be analysed in any interesting way.</p>
Sun, 25 Mar 2018 20:21:52 +0000
www.janfreyberg.com/blog/2018-03-25-the-most-momentous-year-in-history/
www.janfreyberg.com/blog/2018-03-25-the-most-momentous-year-in-history/Project-specific cookiecutter templates for reproducible work<p>I recently read a really good blogpost by Enrico Glerean titled <a href="https://eglerean.wordpress.com/2017/05/24/project-management-data-management/">Project management == Data management</a>. In it, he explains best practices for managing data, and standardising project file structures and layouts. One of the tools he mentioned was <code class="highlighter-rouge">cookiecutter</code>, a tool with which a template github repository can be cloned, with specific variables and file/folder names in the template being replaced by questions <code class="highlighter-rouge">cookiecutter</code> asks you during setup.</p>
Fri, 23 Jun 2017 00:00:00 +0000
www.janfreyberg.com/blog/2017-06-23-project-specific-cookiecutter-templates/
www.janfreyberg.com/blog/2017-06-23-project-specific-cookiecutter-templates/Visualising the bootstrap with shiny<p><a href="https://en.wikipedia.org/wiki/Bootstrapping">Bootstrapping</a> is a really useful statistical tool. It relies on re-sampling, with replacement, from a sample of data you have acquired. The idea is that by re-sampling your sample over and over again, you simulate running studies over and over again. It’s obviously not exactly analogous - sampling bias in your original sample will still affect your bootstrapped samples. But what’s great is that you can re-calculate summary statistics, such as standard deviation, for each bootstrapped sample. And due to the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>, these statistics will be normally distributed.</p>
Sat, 27 May 2017 00:00:00 +0000
www.janfreyberg.com/blog/2017-05-27-visualising-bootstrapping/
A workflow for writing papers in Rmarkdown<p><a href="http://rmarkdown.rstudio.org">Rmarkdown</a> is a syntax for writing plain text documents that get converted to rich text webpages, pdfs, word documents and presentations. At its basic level, it follows the ideas behind all plain-to-rich text converters: that writing without having to focus on the layout of the document makes it easier to concentrate on what you want to convey, not how you are going to convey it.</p>
Sat, 27 May 2017 00:00:00 +0000
www.janfreyberg.com/blog/2017-05-27-workflow-papers-rmarkdown/
www.janfreyberg.com/blog/2017-05-27-workflow-papers-rmarkdown/Analysing SSVEPs and other evoked frequencies in python<p>I frequently work with evoked frequencies. This involves stimulating your subjects at a certain frequency (say, by flickering a light at a 10 Hz), and simultaneously recording brain activity. Usually, this recording is done with EEG or MEG data, since it gives you enough temporal resolution to pick up a wide range of frequencies. When you then analyse the frequencies in the recorded data, you can usually very distinctly pick up the frequency at which you stimulated.</p>
Sat, 20 May 2017 00:00:00 +0000
www.janfreyberg.com/blog/2017-05-20-evoked-frequency-analysis-in-python/
www.janfreyberg.com/blog/2017-05-20-evoked-frequency-analysis-in-python/Interactive Brain Visualisations in Notebooks<p>One of the things I think about a lot these days is interactive data. I believe that when you’re presenting research or analyses to a data-literate crowd, it’s best to allow people to interact with your data. This means readers can explore the data themselves, and if you did your analysis well, then providing interaction with the data will let the reader convince themselves of that fact. It makes your arguments more compelling.</p>
Fri, 27 Jan 2017 00:00:00 +0000
www.janfreyberg.com/blog/2017-01-27-interactive-brains-in-notebooks/
Univariate k-means clustering<p>Often data is distributed normally. But sometimes there is sub-grouping in your data that is worth exploring a bit more: for example, a subset of your data may be drawn from a different population, and therefore have a different mean.</p>
<p>One popular technique to find these clusters without any prior knowledge is k-means clustering. The idea is that you ask an algorithm to provide you with <em>k</em> clusters, where k is an integer (between 1 and your sample size). The algorithm then selects some random cluster centers, and assigns each datapoint to the cluster it is closest to. It then calculates the within-group sum of squares (SS), a proxy for how much variance is in your data <em>after</em> you account for the cluster centers.</p>
<p>The algorithm then moves around the cluster centers until this variance is minimised.</p>
<p>The tricky part comes when you need to decide how many clusters your data likely has. Of course, the more clusters you have, the more variance they account for, and consequently the less variance is left after accounting for cluster centers. If you have the same number of clusters as you have data points, then each cluster just contains one point - and there is no variance left at the end. However, you’ve also not learned anything about your dataset.</p>
<p>One simple method is the “elbow” method, where you try out a certain number of clusters, and plot the within-group sum of squares against number of clusters. You then try and find the “elbow”, or the point at which suddenly, increasing the number of clusters doesn’t reduce the residual variance very much.</p>
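A bare-bones version of this procedure can be sketched in Python (the two-group dataset and the range of k are made-up assumptions; for real work you'd use an established implementation such as scikit-learn's):

```python
import random

random.seed(2)
# hypothetical data with two sub-groups, centred on 0 and 5
data = ([random.gauss(0, 1) for _ in range(100)]
        + [random.gauss(5, 1) for _ in range(100)])

def kmeans_wss(data, k, n_iter=50, n_restart=5):
    """1-D Lloyd's algorithm; returns the best within-group SS over restarts."""
    best = float("inf")
    for _ in range(n_restart):
        centers = random.sample(data, k)
        for _ in range(n_iter):
            # assign each point to its nearest center
            clusters = [[] for _ in range(k)]
            for x in data:
                clusters[min(range(k), key=lambda i: (x - centers[i]) ** 2)].append(x)
            # move each center to the mean of its cluster
            centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        wss = sum((x - sum(c) / len(c)) ** 2 for c in clusters if c for x in c)
        best = min(best, wss)
    return best

# within-group sum of squares for k = 1..6; plot these against k to find the
# "elbow" - here, the drop from k=1 to k=2 dwarfs all later drops
wss_by_k = [kmeans_wss(data, k) for k in range(1, 7)]
```
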
<p>Since this is a visual procedure, it lends itself well to an interactive framework, so I’ve built one! You can try it below, or find it <a href="http://shiny.janfreyberg.com/shiny-elbow-kmeans">here</a>. You can also see the backend code <a href="http://www.github.com/janfreyberg/shiny-elbow-kmeans">here</a>, although if you want to include a k-means algorithm in your own analysis, you probably don’t want to use my code: for the purposes of making it interactive, it’s not the prettiest implementation.</p>
<p>I’ll work on a bivariate version soon!</p>
<hr>
<!--break out of column and row, container, open new-->
</div></div></div>
<div class="container-fluid">
<div class="row">
<div class="col-lg-12">
<iframe class="shiny-embed" style="" src="http://shiny.janfreyberg.com/elbow-kmeans/">
</iframe>
<!--close breakout divs, open original divs-->
</div></div></div>
<div class="container"><div class="row"><div class="col-lg-12 text-center single-col">
<hr>
Fri, 27 Jan 2017 00:00:00 +0000
www.janfreyberg.com/blog/2017-01-27-interactive-univariate-clustering/
Building this website<p>I could really get into web development. I’ve created two websites: this one, and one for a charity I used to volunteer for, <a href="http://www.epafrica.org.uk/">Education Partnerships Africa</a>. The website for EPAfrica was primarily built by others, but I was involved in some of the detail work. It’s built with wordpress, which is great: you can use it as a simple drag-and-drop editor, but you can also change almost every detail. I always found it really fiddly, though, and small changes take a long time.</p>
<p>There are many reasons I wanted to build my personal website without a managed platform like wordpress. One, I like understanding anything I’m working on. Two, I want my page to be lightweight, and wordpress is anything but. Three, I usually code in <a href="http://www.atom.io">Atom</a>, and really like writing in it - it’s what I’m writing this blogpost in right now. And four, I want to have a good place to host the website<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</p>
<p>So I turned to github pages, which I had seen other scientists use for their website. I didn’t really know how it worked, but since it’s based on github, it would also have excellent version control. You can just upload your own fully written HTML to it, but that would make it very hard to blog with. It turns out github pages uses Jekyll, a nifty framework that puts together your website for you and uses variables to e.g. present your most recent blogposts on the front page (see <a href="/">here</a>). When you put a new blogpost - which is completely written in markdown - in your posts folder, Jekyll will find it, get its title and content, and display it on the <a href="/">front page</a>, the <a href="/blog/">blog index</a>, or even a <a href="/tags/">tags page</a>.</p>
<p>I can highly recommend Jekyll, since it applies much of the same logic that you use in regular scientific computing, but uses it for websites and blogging. Some of the introductions that helped me a lot are <a href="http://jmcglone.com/guides/github-pages/">this guide to getting started</a> and <a href="https://github.com/volny/stylish-portfolio-jekyll">this template</a>. In particular, the template I used also uses bootstrap, a front-end framework developed by twitter for laying out your website. It’s great.</p>
<p>I also really wanted to make R shiny visualisations a part of this blog, since I’ve been using it in teaching recently and like it for illustrating statistical concepts. So in addition to the website itself being hosted on github pages, I started an <a href="https://aws.amazon.com/free/">amazon web services free-tier computing instance</a>. This is essentially a linux computer in the cloud. I installed git, R, and shiny server on it, and I was good to go<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. Simply cloning the <a href="https://www.github.com/janfreyberg/factorial-anova/">github repository</a> of one of my shiny visualisations will make it available at <a href="https://shiny.janfreyberg.com/factorial-anova/">shiny.janfreyberg.com</a>. I can then embed these in blogposts, such as <a href="/2016/11/16/visualising-a-2x2-anova/">this one</a>.</p>
<p>Having my own shiny server also means I will be able to host R-markdown documents such as presentations on it. I don’t have any of those yet - but I will eventually…</p>
<h4 id="footnotes">Footnotes</h4>
<p>PS: All pictures on this website are either taken by me or from the <a href="http://deathtothestockphoto.com/">Death To Stock Photo</a> mailing list.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I don’t mind paying for hosting space, but I get so little traffic that I wanted to go free for now. Github pages seems like a good place to do that. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This is a slight exaggeration; it actually took some fiddling to get all the R packages I needed installed. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 20 Nov 2016 00:00:00 +0000
www.janfreyberg.com/blog/2016-11-20-how-i-built-this/
www.janfreyberg.com/blog/2016-11-20-how-i-built-this/Visualising Confidence Intervals<p>Confidence intervals are a nice way to present your results. They get you away from the dichotomous nature of <em>p</em>-values, and allow you to express the precision of the variable of interest. Whether it is the difference between two groups in a t-test or the slope of a linear regression, it helps the reader in understanding your data if you tell them how precisely you can estimate the data.</p>
<p>There is a common misconception, however, about confidence intervals: that they allow you to predict the range in which the true value falls with a certain probability. The idea that e.g. a 95% confidence interval of an average implies that there is a 95% chance that the true population average falls inside the interval is false.</p>
<p>This is because once you’ve completed a study, the confidence interval either includes the true value or it doesn’t, so it’s not helpful to think of it as a predictor of the possible range of values for the true mean. The confidence level actually refers to the fact that if you ran the same study many many times over, in e.g. 95% of cases the confidence interval would include the true mean.</p>
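This repeated-study interpretation can be checked directly with a small simulation (the true mean, noise level, sample size, and number of simulated studies below are all arbitrary assumptions):

```python
import math
import random
import statistics

random.seed(4)
true_mean, n, n_studies = 10.0, 50, 2000

covered = 0
for _ in range(n_studies):
    # one simulated "study": n noisy measurements around the true mean
    study = [random.gauss(true_mean, 3.0) for _ in range(n)]
    m = statistics.mean(study)
    sem = statistics.stdev(study) / math.sqrt(n)
    # normal-approximation 95% confidence interval for the mean
    if m - 1.96 * sem <= true_mean <= m + 1.96 * sem:
        covered += 1

# close to 0.95 across many studies - but says nothing about any one interval
coverage = covered / n_studies
```

(With a sample of 50, a t-based critical value would be slightly more accurate than 1.96, but the normal approximation keeps the sketch simple.)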
<p>Instead, confidence intervals allow you to think about the precision of your estimates. So if you estimate the difference between two groups to be 0.5 (in some arbitrary unit), and your 95% confidence interval ranges from 0.05 to 0.95, your data isn’t providing you with a particularly precise estimate. Even though you may have a statistically significant result at your chosen alpha level, you may want to consider replicating your study in a bigger sample, or with less measurement error.</p>
<p>I have written a visualisation to demonstrate this idea. It simulates a study many many times over, and plots the resulting confidence interval (in this case a simple group difference) in the top plot. As you run it many times over, it accumulates a histogram of how often a value was inside the confidence interval. You will see that the true mean falls inside the confidence interval 95% of the time (or 99% if that is where you set your confidence level).</p>
<p>But you’ll also see that wide confidence intervals cover a much larger range of values, and that narrowing your confidence interval by increasing your sample size means that the values inside it are much less frequently far away from the true mean.</p>
<hr>
<!--break out of column and row, container, open new-->
</div></div></div>
<div class="container-fluid">
<div class="row">
<div class="col-lg-12">
<iframe class="shiny-embed" style="min-height: 850px;" src="http://shiny.janfreyberg.com/confidence-intervals/">
</iframe>
<!--close breakout divs, open original divs-->
</div></div></div>
<div class="container"><div class="row"><div class="col-lg-12 text-center single-col">
<hr>
<p>The code for this visualisation is available <a href="https://www.github.com/janfreyberg/confidence-interval-simulation/">here</a>, and you can view it below, or on a separate page <a href="http://shiny.janfreyberg.com/confidence-intervals/">here</a>. If you have any thoughts on improving the visualisation (much of which I gleaned from others’ work online, but which I didn’t see presented anywhere in the way I wanted), send me a tweet or open a Github issue.</p>
Sat, 19 Nov 2016 00:00:00 +0000
www.janfreyberg.com/blog/2016-11-19-visualising-confidence-intervals/
Visualising a 2x2 ANOVA<p>The factorial design ANOVA (or Analysis of Variance) is maybe one of the simplest yet most used tools in psychological research. It’s a test of mean differences between groups, but it tests for those mean differences using a “Variance explained” approach.</p>
<p>If we assign incoming participants to one of 2 levels (level “a” or level “b”) in two factors (factors 1 and 2) each, we end up with four groups<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. Let’s assume these four groups are our only way of understanding this data, so we didn’t ask participants for age, sex, or anything else (or at least, we don’t want to use these metrics in our model). This means that the best model we can construct to explain this data is a model of group differences: for a participant in level <em>“a”</em> in factor 1 and level <em>“b”</em> in factor 2, the best prediction we have for this participant’s dependent variable is the average of all participants in his or her group (the <em>Factor 1: a / Factor 2: b</em> group).</p>
<p>An ANOVA performs an analysis of the variance in the data, and of how well the group differences explain it. To do so, we calculate the variation explained by our various factors, and compare it with the total variation in our sample. The metric of variation is the sum of squares, or SS: subtracting the mean from each value, squaring these differences, and summing them<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. We can then compare the amount of variation explained by our factors, and our model as a whole, with the <em>residual</em> amount of variation, or <em>error</em>. This is simply the variation within each group: the sum of the squared differences between each point and its group mean.</p>
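The bookkeeping can be made concrete with a small made-up balanced 2x2 dataset, checking that the factor, interaction, and error sums of squares add up to the total:

```python
import random

random.seed(3)

def mean(xs):
    return sum(xs) / len(xs)

# hypothetical balanced 2x2 design: 4 subjects per cell, made-up cell effects
effects = {("a", "a"): 0.0, ("a", "b"): 1.0, ("b", "a"): 2.0, ("b", "b"): 3.0}
n_cell = 4
data = {cell: [mu + random.gauss(0, 1) for _ in range(n_cell)]
        for cell, mu in effects.items()}

all_values = [x for xs in data.values() for x in xs]
grand = mean(all_values)
cell_means = {cell: mean(xs) for cell, xs in data.items()}
# marginal means: each factor level spans 2 cells = 8 observations
m1 = {lvl: mean([x for (f1, _), xs in data.items() if f1 == lvl for x in xs])
      for lvl in "ab"}
m2 = {lvl: mean([x for (_, f2), xs in data.items() if f2 == lvl for x in xs])
      for lvl in "ab"}

ss_total = sum((x - grand) ** 2 for x in all_values)
ss_f1 = sum(2 * n_cell * (m1[lvl] - grand) ** 2 for lvl in "ab")
ss_f2 = sum(2 * n_cell * (m2[lvl] - grand) ** 2 for lvl in "ab")
ss_inter = sum(n_cell * (cell_means[c] - m1[c[0]] - m2[c[1]] + grand) ** 2
               for c in data)
ss_error = sum((x - cell_means[c]) ** 2 for c, xs in data.items() for x in xs)
# for a balanced design these partition the total variation exactly:
# ss_total == ss_f1 + ss_f2 + ss_inter + ss_error (up to floating point error)
```

The F-tests in an ANOVA are then just ratios of these sums of squares, each divided by its degrees of freedom.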
<p>This calculation can sometimes be tricky to understand: to do an ANOVA, you (or your statistical program) need to calculate many different sums of squares, and it’s not always obvious what these are. To make it a bit clearer, I wrote the <a href="http://shiny.janfreyberg.com/factorial-anova">following visualisation</a> - it lets you look at a very simple 2x2 factorial design dataset, with 4 subjects in each group. You can first add the three effects that can be present in a 2x2 ANOVA (main effect 1, main effect 2, and interaction), and then choose which sum of squares gets visualised.</p>
<p>The code for this visualisation is <a href="https://www.github.com/janfreyberg/factorial-anova">available on github</a>.</p>
<hr>
<!--break out of column and row, container, open new-->
</div></div></div>
<div class="container-fluid">
<div class="row">
<div class="col-lg-12">
<iframe class="shiny-embed" style="" src="http://shiny.janfreyberg.com/factorial-anova/">
</iframe>
<!--close breakout divs, open original divs-->
</div></div></div>
<div class="container"><div class="row"><div class="col-lg-12 text-center single-col">
<hr>
<h4 id="footnotes">Footnotes</h4>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>As an example, think of a drug study where we have 2 drugs - say paracetamol and aspirin - and we assign people to a level in each of these <em>factors</em> - say placebo and standard dose. This yields 4 groups: A placebo/placebo group, a placebo/paracetamol group, an aspirin/placebo group, and an aspirin/paracetamol group. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Note that <em>variation</em> is different from <em>variance</em>: variation is calculated as the sum of the squared differences. To obtain the variance of a dataset, you calculate the average variation - dividing the sum of the squared differences by the number of data points. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Wed, 16 Nov 2016 00:00:00 +0000
www.janfreyberg.com/blog/2016-11-16-visualising-a-2x2-anova/