Jan Freyberg's blog
Statistics, data, and cognitive science. I blog neither frequently nor regularly.
https://www.janfreyberg.com
Why coverage doesn't cover pytorch backward calls.<p>Having recently switched to using pytorch for modeling, after primarily building
neural networks in tensorflow / keras, I have been enjoying how easy it is to
write new (automatically differentiable) functions and layers.</p>
Mon, 01 Apr 2019 00:00:00 +0000
https://www.janfreyberg.com/blog/2019-04-01-testing-pytorch-functions/
https://www.janfreyberg.com/blog/2019-04-01-testing-pytorch-functions/
The most momentous year in history?<p>I really like history, and one of the things I sometimes spend my time doing is reading historical articles on <a href="https://www.wikipedia.org/">Wikipedia</a>. The other day, this got me thinking about whether the amount of historical fact on wikipedia can be analysed in any interesting way.</p>
Sun, 25 Mar 2018 20:21:52 +0000
https://www.janfreyberg.com/blog/2018-03-25-the-most-momentous-year-in-history/
https://www.janfreyberg.com/blog/2018-03-25-the-most-momentous-year-in-history/
Project-specific cookiecutter templates for reproducible work<p>I recently read a really good blogpost by Enrico Glerean titled <a href="https://eglerean.wordpress.com/2017/05/24/project-management-data-management/">Project management == Data management</a>. In it, he explains best practices for managing data and standardising project file structures and layouts. One of the tools he mentioned was <code class="language-plaintext highlighter-rouge">cookiecutter</code>, a tool that clones a template github repository, replacing specific variables and file/folder names in the template with your answers to questions <code class="language-plaintext highlighter-rouge">cookiecutter</code> asks you during setup.</p>
Fri, 23 Jun 2017 00:00:00 +0000
https://www.janfreyberg.com/blog/2017-06-23-project-specific-cookiecutter-templates/
https://www.janfreyberg.com/blog/2017-06-23-project-specific-cookiecutter-templates/
Visualising the bootstrap with shiny<p><a href="https://en.wikipedia.org/wiki/Bootstrapping">Bootstrapping</a> is a really useful statistical tool. It relies on re-sampling, with replacement, from a sample of data you have acquired. The idea is that by re-sampling your sample over and over again, you simulate running studies over and over again. It’s obviously not exactly analogous - sampling bias in your original sample will still affect your bootstrapped samples. But what’s great is that you can re-calculate summary statistics, such as standard deviation, for each bootstrapped sample. And due to the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>, these statistics will be normally distributed.</p>
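The resampling loop itself is short. Here is a minimal sketch in Python with numpy (the shiny app is written in R, so this is just the same idea transplanted; all variable names and parameter values are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# the one sample we actually collected (simulated here)
sample = rng.normal(loc=5.0, scale=2.0, size=100)

# re-sample with replacement many times, computing a summary
# statistic (here, the mean) for each bootstrapped sample
n_boot = 10_000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# by the central limit theorem the bootstrapped means are roughly
# normal, so their spread estimates the standard error of the mean
print(boot_means.mean(), boot_means.std())
```

The spread of `boot_means` should come out close to the textbook standard error, `sample.std() / sqrt(n)`.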
Sat, 27 May 2017 00:00:00 +0000
https://www.janfreyberg.com/blog/2017-05-27-visualising-bootstrapping/
https://www.janfreyberg.com/blog/2017-05-27-visualising-bootstrapping/
A workflow for writing papers in Rmarkdown<p><a href="https://rmarkdown.rstudio.org">Rmarkdown</a> is a syntax for writing plain text documents that get converted to rich text webpages, pdfs, word documents and presentations. At its basic level, it follows the ideas behind all plain-to-rich text converters: that writing without having to focus on the layout of the document makes it easier to concentrate on what you want to convey, not how you are going to convey it.</p>
Sat, 27 May 2017 00:00:00 +0000
https://www.janfreyberg.com/blog/2017-05-27-workflow-papers-rmarkdown/
https://www.janfreyberg.com/blog/2017-05-27-workflow-papers-rmarkdown/
Analysing SSVEPs and other evoked frequencies in python<p>I frequently work with evoked frequencies. This involves stimulating your subjects at a certain frequency (say, by flickering a light at 10 Hz), and simultaneously recording brain activity. Usually, this recording is done with EEG or MEG, since they give you enough temporal resolution to pick up a wide range of frequencies. When you then analyse the frequencies in the recorded data, you can usually very distinctly pick up the frequency at which you stimulated.</p>
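The basic idea is easy to demonstrate with a simulated recording: a small 10 Hz oscillation buried in much larger noise still produces an unmistakable spike in the frequency spectrum. A sketch in Python with numpy (the sampling rate, amplitudes and names here are mine, not from any real recording):

```python
import numpy as np

fs = 500.0                      # sampling rate in Hz (plausible for EEG)
t = np.arange(0, 10, 1 / fs)    # 10 seconds of "recording"
rng = np.random.default_rng(1)

# simulated recording: a weak 10 Hz evoked response plus noise
signal = 0.5 * np.sin(2 * np.pi * 10 * t) + rng.normal(scale=1.0, size=t.size)

# frequency spectrum via the FFT (real-valued input, so rfft)
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
power = np.abs(np.fft.rfft(signal)) ** 2

# the stimulation frequency stands out clearly; skip the DC bin
peak_freq = freqs[np.argmax(power[1:]) + 1]
print(peak_freq)
```

Because the stimulation is locked to a single frequency, its power concentrates in one FFT bin, while the noise power is smeared across all of them - which is why evoked-frequency designs have such good signal-to-noise.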
Sat, 20 May 2017 00:00:00 +0000
https://www.janfreyberg.com/blog/2017-05-20-evoked-frequency-analysis-in-python/
https://www.janfreyberg.com/blog/2017-05-20-evoked-frequency-analysis-in-python/
Interactive Brain Visualisations in Notebooks<p>One of the things I think about a lot these days is interactive data. I believe that when you’re presenting research or analyses to a data-literate crowd, it’s best to allow people to interact with your data. This means readers can explore the data themselves, and if you did your analysis well, then providing interaction with the data will let the reader convince themselves of that fact. It makes your arguments more compelling.</p>
Fri, 27 Jan 2017 00:00:00 +0000
https://www.janfreyberg.com/blog/2017-01-27-interactive-brains-in-notebooks/
https://www.janfreyberg.com/blog/2017-01-27-interactive-brains-in-notebooks/
Univariate k-means clustering<p>Often data is distributed normally. But sometimes there is sub-grouping in your data that is worth exploring a bit more: for example, a subset of your data is drawn from a different population, and therefore has a different mean.</p>
<p>One popular technique to find these clusters without any prior knowledge is k-means clustering. The idea is that you ask an algorithm to provide you with <em>k</em> clusters, where k is an integer (between 1 and your sample size). The algorithm then selects some random cluster centers, and assigns each datapoint to the cluster it is closest to. It then calculates the within-group sum of squares (SS), a proxy for how much variance is in your data <em>after</em> you account for the cluster centers.</p>
<p>The algorithm then moves around the cluster centers until this variance is minimised.</p>
<p>The tricky part comes when you need to decide how many clusters your data likely has. Of course, the more clusters you have, the more variance they account for, and subsequently the less variance is left after accounting for cluster centers. If you have the same number of clusters as you have data points, then each cluster just contains one point - and there is no variance left at the end. However, you’ve also not learned anything about your dataset.</p>
<p>One simple method is the “elbow” method, where you try out a certain number of clusters, and plot the within-group sum of squares against number of clusters. You then try and find the “elbow”, or the point at which suddenly, increasing the number of clusters doesn’t reduce the residual variance very much.</p>
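The whole procedure - Lloyd's algorithm plus the elbow plot - is compact enough to sketch in Python. This is a bare-bones univariate implementation using only numpy (the shiny app itself is in R; function and variable names here are mine):

```python
import numpy as np

def kmeans_1d(data, k, n_iter=50, seed=0):
    """Lloyd's algorithm for univariate data: returns (centers, within-group SS)."""
    rng = np.random.default_rng(seed)
    # pick k random data points as initial cluster centers
    centers = rng.choice(data, size=k, replace=False)
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean()
    # final assignment and within-group sum of squares
    labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
    ss = np.sum((data - centers[labels]) ** 2)
    return centers, ss

# two clearly separated sub-groups
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)])

# print within-group SS against k and look for the "elbow"
for k in range(1, 6):
    _, ss = kmeans_1d(data, k)
    print(k, round(ss, 1))
```

With two well-separated groups, the sum of squares drops sharply from k=1 to k=2 and only creeps down afterwards - that sharp bend is the elbow.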
<p>Since this is a visual procedure, it lends itself well to an interactive framework, so I’ve built one =) You can try it below, or find it <a href="https://shiny.janfreyberg.com/shiny-elbow-kmeans">here</a>. You can also see the backend code <a href="https://www.github.com/janfreyberg/shiny-elbow-kmeans">here</a>, although if you want to use k-means in your own analysis you probably shouldn’t use my code: making it interactive meant it’s not the prettiest implementation.</p>
<p>I’ll work on a bivariate version soon!</p>
<hr>
<!--break out of column and row, container, open new-->
</div></div></div>
<div class="container-fluid">
<div class="row">
<div class="col-lg-12">
<iframe class="shiny-embed" style="" src="https://shiny.janfreyberg.com/elbow-kmeans/">
</iframe>
<!--close breakout divs, open original divs-->
</div></div></div>
<div class="container"><div class="row"><div class="col-lg-12 text-center single-col">
<hr>
Fri, 27 Jan 2017 00:00:00 +0000
https://www.janfreyberg.com/blog/2017-01-27-interactive-univariate-clustering/
https://www.janfreyberg.com/blog/2017-01-27-interactive-univariate-clustering/
Building this website<p>I really like web development. I’ve created two websites: this one, and one for a charity I used to volunteer for, <a href="https://www.epafrica.org.uk/">Education Partnerships Africa</a>. The website for EPAfrica was primarily built by others, but I was involved in some of the detail work. It’s built with wordpress, which is great: you can work with simple drag-and-drop, but you can also change almost every detail. But I always found it really fiddly, and small changes take a long time.</p>
<p>There are many reasons I wanted to build my personal website without a managed platform like wordpress. One, I like understanding anything I’m working on. Two, I want my page to be lightweight, and wordpress is anything but. Three, I usually code in <a href="https://www.atom.io">Atom</a>, and really like writing in it - it’s what I’m writing this blogpost in right now. And four, I want to have a good place to host the website<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</p>
<p>So I turned to github pages, which I had seen other scientists use for their website. I didn’t really know how it worked, but since it’s based on github, it would also have excellent version control. You can just upload your own fully written HTML to it, but that would make it very hard to blog with it. But it turns out it uses Jekyll, which is a nifty framework that puts together your website for you and uses variables to e.g. present your most recent blogposts on the front page (see <a href="/">here</a>). When you put a new blogpost - which is completely written in markdown - in your posts folder, jekyll will find it, get its title and content, and display it on the <a href="/">front page</a>, the <a href="/blog/">blog index</a>, or even a <a href="/tags/">tags page</a>.</p>
<p>I can highly recommend jekyll, since it applies much of the same logic that you use in regular scientific computing, but uses it for websites and blogging. Some of the introductions that helped me a lot are <a href="https://jmcglone.com/guides/github-pages/">this guide to getting started</a>, and <a href="https://github.com/volny/stylish-portfolio-jekyll">this template</a>. In particular, the template I used also uses bootstrap, which is a framework developed by twitter for laying out your website. It’s great.</p>
<p>I also really wanted to make R shiny visualisations a part of this blog, since I’ve been using it in teaching recently and like it for illustrating statistical concepts. So in addition to the website itself being hosted on github pages, I started an <a href="https://aws.amazon.com/free/">amazon web services free-tier computing instance</a>. This is essentially a linux computer in the cloud. I installed git, R, and shiny server on it, and I was good to go<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. Simply cloning the <a href="https://www.github.com/janfreyberg/factorial-anova/">github repository</a> of one of my shiny visualisations will make it available at <a href="https://shiny.janfreyberg.com/factorial-anova/">shiny.janfreyberg.com</a>. I can then embed these in blogposts, such as <a href="/2016/11/16/visualising-a-2x2-anova/">this one</a>.</p>
<p>Having my own shiny server also means I will be able to host R-markdown documents such as presentations on it. I don’t have any of those yet - but I will eventually…</p>
<h4 id="footnotes">Footnotes</h4>
<p>PS: All pictures on this website are either taken by me or from the <a href="https://deathtothestockphoto.com/">Death To Stock Photo</a> mailing list.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I don’t mind paying for hosting space, but I get so little traffic that I wanted to go free for now. Github pages seems like a good place to do that. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This is a slight exaggeration; it actually took some fiddling to get all the R packages I needed installed. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
Sun, 20 Nov 2016 00:00:00 +0000
https://www.janfreyberg.com/blog/2016-11-20-how-i-built-this/
https://www.janfreyberg.com/blog/2016-11-20-how-i-built-this/
Visualising Confidence Intervals<p>Confidence intervals are a nice way to present your results. They get you away from the dichotomous nature of <em>p</em>-values, and allow you to express the precision of the variable of interest. Whether it is the difference between two groups in a t-test or the slope of a linear regression, it helps the reader understand your data if you tell them how precisely you can estimate it.</p>
<p>There is a common misconception, however, about confidence intervals: that they allow you to predict the range in which the true value falls with a certain probability. The idea that e.g. a 95% confidence interval of an average implies that there is a 95% chance that the true population average falls inside the interval is false.</p>
<p>This is because once you’ve completed a study, the confidence interval either includes the true value or not, so it’s not helpful to think of it as a predictor of the possible range of values for the true mean. The confidence level actually refers to the fact that if you ran the same study many times over, in e.g. 95% of cases the confidence interval would include the true mean.</p>
<p>Instead, confidence intervals allow you to think about the precision of your estimates. So if you estimate the difference between two groups to be 0.5 (on an arbitrary measure), and your 95% confidence interval ranges from 0.05 to 0.95, your data isn’t providing you with a particularly precise estimate, and even though you may have a statistically significant result at your chosen alpha level, you may want to consider replicating your study in a bigger sample, or with less measurement error.</p>
<p>I have written a visualisation to demonstrate this idea. It simulates a study many many times over, and plots the resulting confidence interval (in this case a simple group difference) in the top plot. As you run it many times over, it accumulates a histogram of how often a value was inside the confidence interval. You will see that the true mean falls inside the confidence interval 95% of the time (or 99% if that is where you set your confidence level).</p>
<p>But you’ll also see that large confidence intervals cover a much larger range of values, and that narrowing your confidence interval by increasing your sample size means that values inside your confidence interval are much less frequently far away from the true mean.</p>
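The core of that simulation fits in a few lines of Python (numpy only, using the 1.96-standard-error normal approximation for the interval; this is my own sketch of the idea, not the R code behind the shiny app):

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean, sd, n = 0.0, 1.0, 30   # the "population" we repeatedly sample from
n_studies = 10_000

covered = 0
for _ in range(n_studies):
    sample = rng.normal(true_mean, sd, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    # 95% CI via the normal approximation: mean +/- 1.96 standard errors
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += lo <= true_mean <= hi

coverage = covered / n_studies
print(coverage)
```

Each individual interval either contains `true_mean` or it doesn’t; it’s only across the 10,000 repeated "studies" that the coverage comes out close to 95% - which is exactly the point the visualisation makes.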
<hr>
<!--break out of column and row, container, open new-->
</div></div></div>
<div class="container-fluid">
<div class="row">
<div class="col-lg-12">
<iframe class="shiny-embed" style="min-height: 850px;" src="https://shiny.janfreyberg.com/confidence-intervals/">
</iframe>
<!--close breakout divs, open original divs-->
</div></div></div>
<div class="container"><div class="row"><div class="col-lg-12 text-center single-col">
<hr>
<p>The code for this visualisation is available <a href="https://www.github.com/janfreyberg/confidence-interval-simulation/">here</a>, and you can view it below, or on a separate page <a href="https://shiny.janfreyberg.com/confidence-intervals/">here</a>. If you have any thoughts on improving the visualisation (much of which I gleaned from others online, though I never saw it presented quite the way I wanted), send me a tweet or open a Github issue.</p>
Sat, 19 Nov 2016 00:00:00 +0000
https://www.janfreyberg.com/blog/2016-11-19-visualising-confidence-intervals/
https://www.janfreyberg.com/blog/2016-11-19-visualising-confidence-intervals/