<p><em>Feed for pwills.com (source for pwills.com), by Peter Wills (peter@pwills.com). Generated by Jekyll, 2023-08-09T10:46:25+00:00.</em></p>
<h1 id="human-in-the-loop-ml">Human-in-the-Loop ML (2023-08-08)</h1>
<p>It’s interesting to look back and think about what I’ve learned since I started working
as a data scientist back in 2016. In school we’re taught about various algorithms and
models, their mathematical properties, and some common use-cases. What we’re not taught
is the nuances of actually turning these models into viable products.</p>
<p>My experience in industry has shown me that it can be essential to have a <strong>human in the
loop</strong> for many ML products. In this post, I’ll discuss when you do and don’t need to use
human-in-the-loop ML, how humans can be incorporated into ML systems, and the pros and cons
of such an approach.</p>
<h1 id="when-should-you-use-a-human-in-the-loop">When should you use a human in the loop?</h1>
<p>Augmenting your machine learning systems with human input can bring dramatic benefits, but also
significant drawbacks. It can often change the properties of your system (e.g. speed,
accuracy, explainability, cost) by orders of magnitude, so it merits careful
consideration of the pros and cons.</p>
<p>One dimension to consider is the <strong>importance of accuracy</strong> in your model. Serving ads and
recommending content on social media need not be high-accuracy - if even 80% of what you
recommend is relevant, the model can still have a strong positive impact on the
product. Contrast this with applications like driverless cars. In these cases, system
“misses” carry a very high cost, and so having human intervention as an option (or
necessity) is often helpful.</p>
<p>Another important consideration is <strong>cost and scale</strong>. Involving humans in a process is much
more costly than running a raw machine learning model. It is also much harder to scale; building
out a large network of laborers involves complicated financial, logistical, and legal
considerations. This is more of a concern for organizations trying to go from 1 to 100
than for those going from 0 to 1, who operate at a smaller, more manageable scale. This suggests a
bootstrapping strategy I will discuss later on.</p>
<p>Of course, adding a human into an automated system will have dramatic impacts on
<strong>latency</strong>. A low-latency machine-learning system can return results in milliseconds; most
human-based systems will have SLAs<sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup> on the order of hours. For applications like online
ad serving, which must entirely run within the time a webpage loads, human intervention
is therefore a non-starter. However, for applications like content moderation, where
accuracy trumps latency concerns, human intervention can be very useful. A hybrid
approach can also address this concern: the initial recommendation is provided and acted
upon quickly, and human “review” of the action then happens within the much slower
SLA.</p>
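<p>To make the hybrid pattern concrete, here is a minimal sketch (all names, the 0.8 threshold, and the queue mechanics are illustrative assumptions, not any real system): the model’s decision is acted on immediately, and each action is recorded for asynchronous human review on the slower SLA.</p>

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HybridModerator:
    """Sketch of a hybrid system: act on the model's output immediately,
    then queue each action for asynchronous human review."""

    review_queue: deque = field(default_factory=deque)

    def handle(self, item_id: str, model_score: float, threshold: float = 0.8) -> str:
        # Phase 1 (milliseconds): act on the model's prediction right away.
        action = "remove" if model_score >= threshold else "keep"
        # Phase 2 (hours): record the action so a human can review it later,
        # within the much slower human SLA.
        self.review_queue.append((item_id, action))
        return action

    def human_review(self, is_appropriate: dict) -> list:
        """Reverse erroneous removals once humans have weighed in.
        is_appropriate maps item_id -> the human reviewer's judgment."""
        reinstated = []
        while self.review_queue:
            item_id, action = self.review_queue.popleft()
            if action == "remove" and is_appropriate.get(item_id, False):
                reinstated.append(item_id)  # the model removed it in error
        return reinstated
```

<p>The key property is that <code class="language-plaintext highlighter-rouge">handle</code> returns in model-inference time, while the correction path runs whenever reviewers get to the queue.</p>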
<p>Finally, there is the more fuzzy notion of <strong>explainability and perception</strong>. If your model
is customer-facing and there is a high cost for individual model failures (e.g. medical
diagnosis models), then the customer will often demand an explanation for why a certain
failure occurred. The same holds true for some cybersecurity applications; if an attack
passes through, and a customer demands an explanation, human filtering can prevent
embarrassing situations where the output of the ML system seems obviously wrong to a
human, but we cannot explain why the system made the judgement it did. (This can be more
than embarrassing; a handful of such incidents can be enough to drive away customers and
hurt the market’s perception of the product.)<sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote" rel="footnote">2</a></sup></p>
<h1 id="how-do-you-incorporate-humans-into-your-ml-system">How do you incorporate humans into your ML system?</h1>
<p>There are a few different approaches to incorporating humans into an ML system, which
each have their own advantages and disadvantages.</p>
<h2 id="human-confirmation-before-output">Human confirmation before output</h2>
<p>One approach is to have an automated system that makes a suggestion to a human agent,
who then confirms or modifies the suggestion <em>before</em> it is sent to the user. This is the
approach used by Stitch Fix. Their recommendation engine sends recommendations to the
stylists, who have final say in selecting what clothing is sent to the customer.</p>
<p>This approach is the “strongest” in terms of human intervention. Each output will be
vetted by a human, and therefore the system will achieve maximum accuracy, but will have
much slower output time and higher cost. Such an approach will only be useful for
systems where the human-in-the-loop is a characteristic aspect of the product
(e.g. Stitch Fix) since the approach leads to a system that behaves quite differently
from a fully automated ML system.</p>
<h2 id="human-verification-after-output">Human verification after output</h2>
<p>An alternative is to have an ML system that generates outputs that are sent to the user,
and then verified by human agents some time later. For example, YouTube might run a
system that detects whether a video should be removed from the site.<sup id="fnref:fn3" role="doc-noteref"><a href="#fn:fn3" class="footnote" rel="footnote">3</a></sup> It can
immediately remove the video based on the system’s output, and then later have the video
undergo human review. This review could result in the video being reinstated if it is
determined to be appropriate.</p>
<p>This system has the benefit that the automated outputs are available immediately; in the
example above, videos deemed problematic by the model are removed within seconds. The
downside is that there is a time gap where the user is exposed to the system’s
(potentially erroneous) output. In the example above, there may be a few hours during
which a video is removed from the platform in error. That said, if your base model is
reasonably accurate, then such system misses become more rare and such an approach can
be very fruitful.</p>
<p>Most thoughtful system architects will employ <em>partial</em> verification. For example, in the
above example, if the model outputs >99.5% probability that a video should be removed,
then that video might not undergo human review; if the model outputs between 80% and
99.5% probability, the video would undergo review.</p>
<p>Partial verification is tunable; one can reduce cost by lowering the upper threshold (more
videos skip review) or increase recall by lowering the lower threshold (more borderline
videos get reviewed).</p>
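<p>The two-threshold routing described above fits in a few lines. A minimal sketch, assuming the 80% and 99.5% thresholds from the example (the function name and return values are illustrative):</p>

```python
def route(p_remove: float, lower: float = 0.80, upper: float = 0.995) -> str:
    """Route one prediction through partial verification.

    p_remove is the model's probability that the video should be removed.
    """
    if p_remove > upper:
        # Model is confident enough that human review adds little value.
        return "remove, skip review"
    if p_remove >= lower:
        # Borderline: act on the model's output, but queue for human review.
        return "remove, queue for review"
    # Below the action threshold: leave the video up.
    return "keep"
```

<p>Tuning is then explicit: lowering <code class="language-plaintext highlighter-rouge">upper</code> lets more videos skip review (cheaper), while lowering <code class="language-plaintext highlighter-rouge">lower</code> widens the band of borderline videos that humans see (better recall, more review cost).</p>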
<h2 id="product-bootstrapping">Product Bootstrapping</h2>
<p>Another partial approach is to use human intervention to bootstrap your organization. A
small startup with a good idea for an ML product might not have enough data for a highly
accurate model. For such an organization, the benefits of a high-touch approach outweigh
the costs. As the organization scales, they can transition away from the
human-in-the-loop model and focus on (mostly) independent automated systems.</p>
<p>Of course, this transition is not a trivial matter. It can often be difficult to remove
human verification and maintain system accuracy. However, as I said above, scale gives
advantages to machine learning products, and these advantages may make it possible for
the system to be freestanding (or at least much closer to freestanding) than it was
initially.</p>
<h1 id="oh-the-humanity">Oh, the humanity</h1>
<p>Up until now, we’ve been discussing the impact of the humans on the system. It would be
remiss of me not to discuss the impact of the system on the humans. I don’t have any
simple conclusions here, but it’s worth at least raising a few points for consideration.</p>
<p>Working <a href="https://rein.pk/replacing-middle-management-with-apis">below the API</a> can be challenging. The work is often hourly, with strict SLAs and
a focus on productivity metrics. This can generate a high-pressure environment.
Depending on the product area, hours can be irregular. The issues with the “<a href="https://www.nytimes.com/2023/04/13/magazine/gig-jobs-apps.html">gig economy</a>”
have been widely discussed in the media, and many of those issues are shared by the kind
of systems we’re describing here.</p>
<p>However, there are also many benefits. The work is often remote, and since it is hourly,
can be flexible and work with irregular schedules. Many of the stylists at Stitch Fix
are mothers who supplement the family income by styling part-time. Before such systems
existed, it would be very hard to find work that could be done from home in the three
hours between when the baby goes to sleep and when the mother does; ideally,
incorporating humans into ML systems can enable such work and be a win-win.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Going from “we could solve this with ML!” to an actually-viable product is often<sup id="fnref:fn4" role="doc-noteref"><a href="#fn:fn4" class="footnote" rel="footnote">4</a></sup> a
bumpy road. Incorporating human feedback into an automated system is a key tool to help
ease this transition. I don’t have any easy recommendations here; whether and how you
should incorporate human input into your particular product is highly dependent on the
product and the market in which it is situated.</p>
<p>But you should consider it. I’ve seen human augmentation assist ML companies at every
stage of growth, from pre-seed to post-IPO. It is a tool that, in my opinion, every
technology strategist should have in their toolkit.</p>
<!----- Footnotes ----->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p><a href="https://www.cio.com/article/274740/outsourcing-sla-definitions-and-solutions.html">Service-level agreements</a>, which “defines the level of service expected by a customer from a supplier”; in this case, the “level of service” refers to the latency of a system. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>Even if a model is more accurate than the human-only alternative, explainability can still be an important psychological issue for customers. Consider a driverless car that has accident rates 1/10th those of an average driver; however, when it does crash, it does so seemingly at random. Public perception and adoption of such a product would (I predict) be poor, since when we are in such critical situations, we often rely on explanations to feel safe and in-control. Note that this may be less of an issue for internal-use models, where adoption can be decreed by management, and not driven by user perception. <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn3" role="doc-endnote">
<p>I’m not saying this is what YouTube actually does. This is just an example. <a href="#fnref:fn3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn4" role="doc-endnote">
<p>Read: always. <a href="#fnref:fn4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="its-been-a-while">It’s Been a While… (2023-08-07)</h1>
<p>Since the last time I wrote on this site, a lot has changed. A global pandemic has come
and (mostly) gone, which has given rise to remote work as a viable option for many in
the tech sector, but also pushed forward an economic chain of events that made it harder
for companies to get easy funding without demonstrating strong fiscal
responsibility.<sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup> The industry looks very different today than it did back in
September of 2019, when I wrote my last post.</p>
<p>What have I been up to in that time? I spent the beginning of the pandemic working
remotely for Stitch Fix. During this time, my partner and I moved around the country,
spending a month or two in various places in California, Washington, and finally near my
childhood home in western New York State. Experiencing the support of being near my
family and dear friends of decades, as well as the newfound viability of a remote-first
career, led us to purchase a home in <a href="https://ecovillageithaca.org/">the Ecovillage at Ithaca</a>. We’ve been living here
for over a year now, and it’s been food for my soul to be near my people and live closer
to the land. Swimming in the pond, sweating in the sauna in winter, swimming in the
gorges in the summer, sharing meals with the community, going on night walks surrounded
by fireflies; it really is a magical place.</p>
<p>Around the time I moved to Ithaca, I also left Stitch Fix to start a new job as Senior
ML Engineer with <a href="https://abnormalsecurity.com/">Abnormal Security</a>. I enjoyed the high-impact nature of the role, and
the dynamic character of the organization, but after a year I was feeling really burned
out, and realized I needed some time to myself. Over the last three months I’ve taken
time to meditate, ride my bicycle, and reflect on where I want to go next in my career.</p>
<p>That reflection has coincided with one of the hottest summers on record. Sea ice in the
Southern Hemisphere is at a record low, and wildfires raging in Canadian forests have
frequently blanketed the Eastern US in dangerous levels of smoke. It is now glaringly
apparent that there is nowhere we can go to escape the impact of climate change - it
will have effects everywhere in the world, even in places like Ithaca that are better
insulated against the (local) effects.</p>
<p>Seeing this, and feeling the direct impact it has on my life, I think I need to try and
do what I can to contribute. As the summer winds down, I feel ready to dive back into
the professional world. To that end, I’m looking for a Senior ML/DS role in an
organization that is working to mitigate climate change.</p>
<p>If you think you have an opening where I could be a good fit, or would just like to
connect, don’t hesitate to reach out! My email is <a href="mailto:peter@pwills.com">peter@pwills.com</a> and I’m happy to put
some time on the calendar to chat.</p>
<p>To the next adventure!</p>
<!----- Footnotes ----->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>The Federal Reserve has increased interest rates in order to combat inflation. From <a href="https://www.bls.gov/opub/mlr/2023/beyond-bls/what-caused-inflation-to-spike-after-2020.htm">this study on the sources of inflation</a> by the US Bureau of Labor Statistics: “So, from this research, the authors find that three main components explain the rise in inflation since 2020: volatility of energy prices, backlogs of work orders for goods and service caused by supply chain issues due to COVID-19, and price changes in the auto-related industries.” <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="blogging-in-org-mode">Blogging in Org Mode (2019-09-24)</h1>
<p>I recently transitioned from writing my posts directly in markdown to writing them in
<a href="https://orgmode.org/">org mode</a>, a document authoring system built in GNU Emacs. I learned a lot in the
process, and also built a new org exporter along the way, <a href="https://github.com/peterewills/ox-jekyll-lite"><code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code></a>.<sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup></p>
<h1 id="org-mode-and-the-meaning-of-life">Org Mode and the Meaning of Life</h1>
<h2 id="what-is-org-mode">What is Org Mode?</h2>
<p>Laozi said that the Tao that can be told is not the eternal Tao; I think we can safely
say the same of org mode. Org mode is many things to many people, but at its core it is
a tool for taking notes and organizing lists. Additional functionality allows for simple
text markup, links, inline images, rendered \(\LaTeX\) fragments, and so on. You can embed
and run code blocks within org files, using the powerful <a href="https://orgmode.org/worg/org-contrib/babel/"><code class="language-plaintext highlighter-rouge">org-babel</code></a> package. Some
people have even <a href="https://write.as/dani/writing-a-phd-thesis-with-org-mode">written
their Ph.D. thesis in org mode</a>. It’s an amazingly powerful tool, with a passionate
user base that is constantly expanding its capabilities.</p>
<h2 id="why-not-markdown">Why Not Markdown?</h2>
<p>I like to use org mode for my personal and professional note-taking because it has very
good folding features - you can hide all headings besides the one you’re focusing
on. You can even “narrow” your buffer, so that only the heading (“subtree”, in org-mode
parlance) that you’re working on is present at all.</p>
<p>Org mode also has some nice visual features for writing, such as:</p>
<ul>
<li>rendering \(\LaTeX\) fragments inline</li>
<li>styling <strong>bold</strong>, <u>underlined</u>, and <em>italicized</em> text properly</li>
<li>excellent automatic formatting of tables</li>
<li>code syntax highlighting in various languages</li>
<li>display of images inline</li>
</ul>
<p>I wrote in markdown (using <code class="language-plaintext highlighter-rouge">markdown-mode</code> within emacs) for some time, but once I saw
what org mode had to offer, I realized that I needed to transfer my blogging over to
org. In particular, the Emacs mode <code class="language-plaintext highlighter-rouge">markdown-mode</code> doesn’t have a lot of the features that
org mode does, such as inline rendering of math and images or well-built text folding. I
used org for notes, and I realized that it would be much easier to just write in org
instead of trying to get markdown mode to work the way I want it to.</p>
<p>Below is a short clip that shows just some of what org mode has to offer. You’ll
want to full-screen it to make the text legible.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/MV9LR2LCxAE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p> </p>
<p>Overall, I find the experience of writing in org much more enjoyable than writing in
markdown. Plus, I love hacking on emacs, and moving my blogging workflow over to org
presented me with an opportunity to do just that! So of course, I couldn’t resist.</p>
<h1 id="org-export-and-jekyll">Org-Export and Jekyll</h1>
<h2 id="blogging-in-jekyll">Blogging in Jekyll</h2>
<p>The primary tool I use to generate my blog is a static-site generator called <a href="https://jekyllrb.com/">Jekyll</a>,
which is written in Ruby. I wrote <a href="/_posts/2017-12-29-website.md">a previous post</a> describing my process for setting up
my site. <a href="https://blog.getpelican.com/">Pelican</a> is a similar tool written in Python, and <a href="https://gohugo.io/">Hugo</a> is a
static-site generator written in Go. We’ll talk a bit more about Hugo later.</p>
<p>All of these tools allow the user to write content in simple markdown, with the site
generator doing most of the heavy lifting in generating a full static site behind the
scenes. In Jekyll, the user provides some basic configuration for each post, like a
title, date, and excerpt, and then the theme determines the details of how the text is
rendered into fully styled HTML. I use the excellent <a href="https://mmistakes.github.io/minimal-mistakes/">minimal mistakes</a> theme.</p>
<p>Unfortunately, markdown is not a nicely unified language specification. There are many
dialects of markdown, and each has subtle differences, so there is not, in general, one
markdown specification to rule them all. For example, so-called “GitHub-flavored
markdown”, which renders markdown from READMEs in GitHub repositories, has certain
quirks that are not shared by the markdown I write for this site. To further complicate
things, the static site generators often have their own quirks - Jekyll requires
particularly-formatted front-matter to specify the configuration for each post, which is
not part of the general markdown specification.</p>
<p>All that is to say, it wasn’t a trivial task to find something that converted org to the
exact markdown that I need for my site. But before we jump into the details there, we
should talk a bit about org exporters in general.</p>
<h2 id="org-export">Org-Export</h2>
<p>Org mode comes packaged with many built-in “exporters”, which convert from the org
format to other text formats, including HTML, \(\LaTeX\), iCalendar, and more. It <em>does</em>
come with a backend that converts org to markdown, which I hoped would be all I
needed.</p>
<p>Unfortunately, the built-in <code class="language-plaintext highlighter-rouge">ox-md</code> exporter doesn’t work very well, for a few reasons. It
falls back on using pure HTML (for example, to generate footnotes) when there are
markdown-native ways of accomplishing the same thing. Also, some things don’t work at
all - for example, equation exporting won’t work, since markdown requires you to enclose
LaTeX with <code class="language-plaintext highlighter-rouge">\\[</code> and <code class="language-plaintext highlighter-rouge">\\]</code>, whereas HTML only requires a single slash.<sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote" rel="footnote">2</a></sup></p>
<p>A quick search will show that there are many tools built to address this problem. Org
exporter backends are designed to be easy to extend, and many users have extended the
markdown backend to work with specific static site generators. The most fully developed
of these is <a href="https://ox-hugo.scripter.co/"><code class="language-plaintext highlighter-rouge">ox-hugo</code></a>, which is built to work with the site generator Hugo. This
package in particular would be a big source of the transcoding functions I would use,
but since it is built to be tightly integrated with Hugo, I couldn’t just use it out of
the box.</p>
<p>Elsa Gonsiorowski developed a Jekyll-friendly org exporter, called <a href="https://www.gonsie.com/blorg/ox-jekyll.html"><code class="language-plaintext highlighter-rouge">ox-jekyll-md</code></a>, which
provided the basis for what I would eventually build. She also wrote <a href="https://www.gonsie.com/blorg/ox-jekyll.html">a blog post</a> about
it - if you’re interested in customizing org exporters, I’d recommend giving it a read.</p>
<h2 id="building-ox-jekyll-lite">Building <code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code></h2>
<p>There are some things that <code class="language-plaintext highlighter-rouge">ox-jekyll-md</code> does very well, including generating the
Jekyll-specific YAML front matter. However, I found that it lacks a few key features:</p>
<ul>
<li>handling footnotes in a markdown-native way</li>
<li>rendering MathJax delimiters with double slashes (to make them markdown-compatible)</li>
<li>exporting image links appropriately</li>
<li>exporting link paths relative to the Jekyll root directory</li>
</ul>
<p>Since these were essential to my blogging workflow, I forked that project and began work
on my org exporter, <a href="https://github.com/peterewills/ox-jekyll-lite"><code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code></a>.</p>
<h3 id="customizing-an-org-export-backend">Customizing an Org Export Backend</h3>
<p>You can think of an org-export backend as a collection of rules for transforming org
files into another text format. For example, how should underlined text be handled? How
about code blocks? How about \(\LaTeX\) snippets? Each of these rules is encapsulated by a
so-called “transcoding function.”</p>
<p>Org export backends are built to be highly extensible. If you extend <code class="language-plaintext highlighter-rouge">ox-md</code>, for example,
then you “inherit” all the transcoding functions that it provides, and you can add or replace
only the functions you want to. For example, part of <code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code> looks like</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">org-export-define-derived-backend</span> <span class="ss">'jekyll</span> <span class="ss">'md</span>
<span class="ss">:translate-alist</span>
<span class="o">'</span><span class="p">((</span><span class="nv">headline</span> <span class="o">.</span> <span class="nv">org-jekyll-lite-headline-offset</span><span class="p">)</span>
<span class="p">(</span><span class="nv">inner-template</span> <span class="o">.</span> <span class="nv">org-jekyll-lite-inner-template</span><span class="p">)))</span>
</code></pre></div></div>
<p>This tells us that we’re defining a backend named <code class="language-plaintext highlighter-rouge">jekyll</code>, which derives from the backend
named <code class="language-plaintext highlighter-rouge">md</code> (which, if you look, itself derives from the <code class="language-plaintext highlighter-rouge">html</code> backend).</p>
<p>In the code above, the <code class="language-plaintext highlighter-rouge">translate-alist</code> indicates that this backend handles <code class="language-plaintext highlighter-rouge">headline</code>
objects via the <code class="language-plaintext highlighter-rouge">org-jekyll-lite-headline-offset</code> method, and handles the <code class="language-plaintext highlighter-rouge">inner-template</code>
object via <code class="language-plaintext highlighter-rouge">org-jekyll-lite-inner-template</code>. These functions take in org elements,
returning text that will get dumped into the export buffer.</p>
<p>The transcoding function <code class="language-plaintext highlighter-rouge">org-jekyll-lite-underline</code> is a particularly simple example:</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">org-jekyll-lite-underline</span> <span class="p">(</span><span class="nv">underline</span> <span class="nv">contents</span> <span class="nv">info</span><span class="p">)</span>
<span class="s">"Transcode UNDERLINE from Org to Markdown.
CONTENTS is the text with underline markup. INFO is a plist
holding contextual information."</span>
<span class="p">(</span><span class="nb">format</span> <span class="s">"<u>%s</u>"</span> <span class="nv">contents</span><span class="p">))</span>
</code></pre></div></div>
<p>Extending a backend consists of figuring out which elements you want to handle via
special logic, then writing the appropriate transcoding functions for each.</p>
<h3 id="implementation-details-for-ox-jekyll-lite">Implementation Details for <code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code></h3>
<p>Most of the more complicated transcoding functions in <code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code> were not written by
me. They come either from <code class="language-plaintext highlighter-rouge">ox-jekyll-md</code> or from <code class="language-plaintext highlighter-rouge">ox-hugo</code>. For example, I got the
transcoder for footnotes, and for \(\LaTeX\) snippets, from <code class="language-plaintext highlighter-rouge">ox-hugo</code>.</p>
<p>The most interesting addition that I made was to render file links relative to the root
directory of Jekyll, when possible. For example, if you have an image in your
<code class="language-plaintext highlighter-rouge">assets/images</code> folder, Jekyll wants you to link to it as <code class="language-plaintext highlighter-rouge">/assets/images/kitties.jpg</code>, not
with the full path relative to the root directory of your computer’s filesystem.</p>
<p>However, when I use <code class="language-plaintext highlighter-rouge">C-c C-l</code> (along with Helm) to add a link to an org file, it renders
the link with the absolute path.<sup id="fnref:fn3" role="doc-noteref"><a href="#fn:fn3" class="footnote" rel="footnote">3</a></sup> It’s important that the link is “correct” for my
machine, so that any images can render inline, and the links are clickable from my org
file. But if the links are relative to my filesystem’s root in the markdown,
then they won’t work within the context of my site. So, we need to “fix” the links as we
export the post to markdown.</p>
<p>I don’t get too complicated here - I just have the user specify a custom variable
<code class="language-plaintext highlighter-rouge">org-jekyll-project-root</code>, which then gets pulled off of the beginning of file paths when
it is present.</p>
<p>For example, on my machine, this repository is located at
<code class="language-plaintext highlighter-rouge">~/code/jekyll/peterewills.github.io/</code>, and so if I link to the file
<code class="language-plaintext highlighter-rouge">~/code/jekyll/peterewills.github.io/assets/images/kitties.jpg</code> in my org file,
<code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code> will, upon export, transform this to a link to <code class="language-plaintext highlighter-rouge">/assets/images/kitties.jpg</code>
in the markdown output. This approach is nice and simple, but it doesn’t handle relative
links, or the situation where you have multiple Jekyll projects.</p>
<p>Anyways, if you want to give it a try, you can clone it <a href="https://github.com/peterewills/ox-jekyll-lite">from GitHub</a> and check it
out. You can just load it up and use <code class="language-plaintext highlighter-rouge">C-c C-e j J</code> to export an org file to a markdown
buffer.</p>
<p>Finally, as a side note, I just have to give a shoutout to the excellent <a href="https://github.com/magnars/s.el"><code class="language-plaintext highlighter-rouge">s.el</code></a> and
<a href="https://github.com/magnars/dash.el"><code class="language-plaintext highlighter-rouge">dash</code></a>, which makes working in elisp infinitely more pleasant. Many thanks to Magnar
Sveen for building such nice tools for us all to use.</p>
<h1 id="my-blogging-workflow">My Blogging Workflow</h1>
<p>Now, my workflow for writing a post is pretty simple.</p>
<ol>
<li>Have brilliant idea</li>
<li>Make an org file in the <code class="language-plaintext highlighter-rouge">_posts</code> directory, named like <code class="language-plaintext highlighter-rouge">YYYY-MM-DD-post-name.org</code></li>
<li>Write brilliant words/equations/cat pictures/etc.</li>
<li>Export to markdown via <code class="language-plaintext highlighter-rouge">C-c C-e j j</code></li>
<li>Commit & push to GitHub</li>
<li>Profit!<sup id="fnref:fn4" role="doc-noteref"><a href="#fn:fn4" class="footnote" rel="footnote">4</a></sup></li>
</ol>
<p>The only additional complication, compared to a pure-markdown workflow, is the addition
of the export step; other than that, it’s identical. And now I can blog in wonderful,
beautiful org mode instead of clunky markdown.</p>
<p>An important caveat for anyone using org and Jekyll; in order to not have Jekyll stumble
over the org artifacts, you should add <code class="language-plaintext highlighter-rouge">*.org</code> and <code class="language-plaintext highlighter-rouge">ltximg</code> to the <a href="https://github.com/peterewills/peterewills.github.io/blob/master/_config.yml#L13-L17">list of excluded files</a> in
your Jekyll <code class="language-plaintext highlighter-rouge">_config.yml</code>. You can see mine <a href="https://github.com/peterewills/peterewills.github.io/blob/master/_config.yml">on GitHub</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>If you are just starting to blog, and you love org mode, I’d recommend using Hugo to
build your site, so that you can use the excellent <code class="language-plaintext highlighter-rouge">ox-hugo</code>. It’s a truly org-centric
approach to building a static site, and it’s much more fully-featured than any of the
solutions I’ve found in Jekyll or Pelican.</p>
<p>But, you might want to use Jekyll, because it integrates automagically with GitHub
pages, or perhaps you just like some of the available themes or whatnot. If that’s the
case, then I think <code class="language-plaintext highlighter-rouge">org-jekyll-lite</code> is a reasonable solution for writing your posts in
org. It’s lightweight, and you’ll probably have to tweak it to fit your particular
needs, but it’s small enough that modifying it shouldn’t be too hard. Also, you can
always submit an issue on GitHub and I’ll see if I can help you out.</p>
<p>I hope this post has inspired you to explore more in org mode! It’s a great tool for
organizing notes, tracking agendas/calendars/TODO lists, and for general
writing.<sup id="fnref:fn5" role="doc-noteref"><a href="#fn:fn5" class="footnote" rel="footnote">5</a></sup> Happy blogging, and may the org be with you!</p>
<!----- Footnotes ----->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>As I explain later on, this tool was based on both <a href="https://github.com/gonsie/ox-jekyll-md"><code class="language-plaintext highlighter-rouge">ox-jekyll-md</code></a> and <a href="https://ox-hugo.scripter.co/"><code class="language-plaintext highlighter-rouge">ox-hugo</code></a>. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>The double slash is required because markdown interprets the first slash as an escape character. <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn3" role="doc-endnote">
<p>You can see an example of adding a link to an image in the org-mode demo video linked above. <a href="#fnref:fn3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn4" role="doc-endnote">
<p>This is actually a lie; I don’t make any money from this site. <a href="#fnref:fn4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn5" role="doc-endnote">
<p>There’s also the entire subject of <a href="http://cachestocaches.com/2018/6/org-literate-programming/">literate programming</a>, in which code is interwoven with documentation, which I think is a really nice paradigm, and for which org is a natural fit. <a href="#fnref:fn5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>Peter Wills (peter@pwills.com). My workflow for blogging in org mode, with jekyll and org-export.</p>
<hr />
<p><strong>Your p-values Are Bogus</strong> (2019-09-20). <a href="http://www.pwills.com/posts/2019/09/20/bogus">http://www.pwills.com/posts/2019/09/20/bogus</a></p>
<p>People often use a Gaussian to approximate distributions of sample means. This is
generally justified by the central limit theorem, which states that the sample mean of
an independent and identically distributed sequence of random variables converges to a
normal random variable in distribution.<sup id="fnref:fnote_clt" role="doc-noteref"><a href="#fn:fnote_clt" class="footnote" rel="footnote">1</a></sup> In hypothesis testing, we might use
this to calculate a \(p\)-value, which then is used to drive decision making.</p>
<p>I’m going to show that calculating \(p\)-values in this way is actually incorrect, and
leads to results that get <em>less</em> accurate as you collect more data! This has
substantial implications for those who care about the statistical rigor of their A/B
tests, which are often based on Gaussian (normal) approximations.</p>
<h1 id="a-simple-example">A Simple Example</h1>
<p>Let’s take a very simple example. Let’s say that the prevailing wisdom is that no more
than 20% of people like rollerskating. You suspect that the number is in fact much
larger, and so you decide to run a statistical test. In this test, you model each person
as a Bernoulli random variable with parameter \(p\). <strong>The null hypothesis \(H_0\) is
that \(p\leq 0.2\)</strong>. You decide to go out and ask 100 people their opinions on
rollerskating.<sup id="fnref:fnote_sample" role="doc-noteref"><a href="#fn:fnote_sample" class="footnote" rel="footnote">2</a></sup></p>
<p>You begin gathering data. Unbeknownst to you, it is <em>in fact</em> the case that a full 80%
of the population enjoys rollerskating. So, as you randomly ask people if they enjoy
rollerskating, you end up getting a lot of “yes” responses. Once you’ve gotten 100
responses, you start analyzing the data.</p>
<p>It turns out that you got 74 “yes” responses, and 26 “no” responses. Since you’re a
practiced statistician, you know that you can calculate a \(p\)-value by finding the
probability that a binomial random variable with parameter \(p_0=0.2\) would generate a
value \(k\geq74\) with \(n=100\). This probability is just</p>
\[p_\text{exact} = \text{Prob}(k\geq 74) = \sum_{k=74}^{n}{n \choose k} p_0^{k} (1-p_0)^{(n-k)}.\]
<p>However, you know that you can approximate a binomial distribution with a Gaussian of
mean \(\mu=np_0\) and variance \(\sigma^2=np_0(1-p_0)\), so you decide to calculate an
<em>approximate</em> \(p\)-value,</p>
\[p_\text{approx} = \frac{1}{\sqrt{2\pi np_0(1-p_0)}}\int_{74}^\infty \exp\left(-\frac{(k-np_0)^2}{2np_0(1-p_0)}\right)\,dk.\]
<p>However, <strong>this approximation is actually incorrect, and will give you progressively
worse estimates of \(p_\text{exact}\).</strong> Let’s observe this in action.</p>
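Before running the full simulation, it’s worth checking a single point. Here’s a quick sketch (not part of the original analysis) of both quantities at \(n=100\), \(k=74\), using <code class="language-plaintext highlighter-rouge">scipy.stats</code>:

```python
import numpy as np
from scipy.stats import binom, norm

n, k, p0 = 100, 74, 0.2

# Exact p-value: P(K >= 74) under Binomial(n=100, p=0.2).
# sf(k - 1) = P(K > k - 1) = P(K >= k); we work in log space
# because the values are astronomically small.
log_p_exact = binom.logsf(k - 1, n=n, p=p0)

# Gaussian approximation with mu = n*p0 and sigma^2 = n*p0*(1 - p0).
mu = n * p0
sigma = np.sqrt(n * p0 * (1 - p0))
log_p_approx = norm.logsf(k, loc=mu, scale=sigma)

print(log_p_exact, log_p_approx)
```

Even at \(n=100\), the log of the Gaussian-approximated \(p\)-value is substantially smaller than the log of the exact one, so the approximation already overstates the evidence against the null.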
<h2 id="python-simulation-of-data">Python Simulation of Data</h2>
<p>We simulate data for values \(n=1\) through \(n=1000\), and compute the corresponding
exact and approximate \(p\)-value. We plot the log of the \(p\) value, since they get
very small very quickly.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">norm</span><span class="p">,</span> <span class="n">binom</span>
<span class="n">plt</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span> <span class="p">[</span><span class="s">'classic'</span><span class="p">,</span> <span class="s">'ggplot'</span><span class="p">])</span>
<span class="n">p_true</span> <span class="o">=</span> <span class="mf">0.8</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">binom</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">p_true</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">p0</span> <span class="o">=</span> <span class="mf">0.2</span>
<span class="n">p_vals</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="n">index</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">n</span><span class="p">),</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'true p-value'</span><span class="p">,</span> <span class="s">'normal approx. p-value'</span><span class="p">]</span>
<span class="p">)</span>
<span class="k">for</span> <span class="n">n0</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">n</span><span class="p">):</span>
<span class="n">normal_dev</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n0</span><span class="o">*</span><span class="n">p0</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">p0</span><span class="p">))</span>
<span class="n">normal_mean</span> <span class="o">=</span> <span class="n">n0</span><span class="o">*</span><span class="n">p0</span>
<span class="n">k</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">data</span><span class="p">[:</span><span class="n">n0</span><span class="p">])</span>
<span class="c1"># the "survival function" is 1 - cdf, which is the p-value in our case
</span> <span class="n">normal_logpval</span> <span class="o">=</span> <span class="n">norm</span><span class="p">.</span><span class="n">logsf</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">normal_mean</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">normal_dev</span><span class="p">)</span>
<span class="n">true_logpval</span> <span class="o">=</span> <span class="n">binom</span><span class="p">.</span><span class="n">logsf</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="n">n0</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p0</span><span class="p">)</span>
<span class="n">p_vals</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">n0</span><span class="p">,</span> <span class="s">'true p-value'</span><span class="p">]</span> <span class="o">=</span> <span class="n">true_logpval</span>
<span class="n">p_vals</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">n0</span><span class="p">,</span> <span class="s">'normal approx. p-value'</span><span class="p">]</span> <span class="o">=</span> <span class="n">normal_logpval</span>
<span class="n">p_vals</span><span class="p">.</span><span class="n">replace</span><span class="p">([</span><span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="n">inf</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">inf</span><span class="p">],</span> <span class="n">np</span><span class="p">.</span><span class="n">nan</span><span class="p">).</span><span class="n">dropna</span><span class="p">().</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">));</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">"Number of Samples"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">"Log-p Value"</span><span class="p">);</span>
</code></pre></div></div>
<p>We have to drop <code class="language-plaintext highlighter-rouge">inf</code>s because after about \(n=850\) or so, the \(p\)-value actually
gets too small for <code class="language-plaintext highlighter-rouge">scipy.stats</code> to calculate; it just returns <code class="language-plaintext highlighter-rouge">-np.inf</code>.</p>
<p>The resulting plot tells a shocking tale:</p>
<p><img src="/assets/images/p-values.png" alt="P-value Divergence" /></p>
<p>The approximation diverges from the exact value! Seeing this, you begin to weep
bitterly. Is the Central Limit Theorem invalid? Has your whole life been a lie? It turns
out that the answer to the first is a resounding no, and the second… probably also
no. But then what is going on here?</p>
<h2 id="convergence-is-not-enough">Convergence Is Not Enough</h2>
<p>The first thing to note is that, mathematically speaking, the two \(p\)-values
\(p_\text{exact}\) and \(p_\text{approx}\) <strong>do, in fact, converge</strong>. That is to say,
as we increase the number of samples, their difference is approaching zero:</p>
\[\left| p_\text{exact} - p_\text{approx}\right| \rightarrow 0\]
<p>What I’m arguing, then, is that <strong>convergence is not enough</strong>.</p>
<p>If it were, then we could just approximate the true \(p\)-value with 0. That is, we
could report a \(p\)-value of \(p_\text{approx} = 0\), and claim that since our
approximation is converging to the actual value, it should be taken
seriously. Obviously, this should not be taken seriously as an approximation.</p>
<p>Our intuitive sense of “convergence”, the sense that \(p_\text{approx}\) is becoming “a
better and better approximation of” \(p_\text{exact}\) as we take more samples,
corresponds to the <em>percent error</em> converging to zero:</p>
\[\left| \frac{p_\text{approx} - p_\text{exact}}{p_\text{exact}}\right| \rightarrow 0.\]
<p>In terms of asymptotic decay, this is a stronger claim than convergence. Rather than
their difference converging to zero, which means it is \(o(1)\), we demand that their
difference converge to zero <em>faster than \(p_\text{exact}\)</em>,</p>
\[\left| p_\text{exact} - p_\text{approx}\right| = o\left(p_\text{exact}\right).\]
<p>It would also suffice to have an upper bound on the \(p\)-value; that is, if we could
say that \(p_\text{exact} < p_\text{approx}\), so \(p_\text{exact}\) is <em>at worst</em> our
approximate value \(p_\text{approx}\), and we knew that this held regardless of sample
size, then we could report our approximate result knowing that it was at worst a bit
conservative. However, as far as I can see, the central limit theorem and other similar
convergence results give us no such guarantee.</p>
<h2 id="implications">Implications</h2>
<p>What I’ve shown is that for the simple case above, Gaussian approximation is not a
strategy that will get you good estimates of the true \(p\)-value, especially for large
amounts of data. You will under-estimate your \(p\)-value, and therefore overestimate
the strength of evidence you have against the null hypothesis.</p>
<p>Although A/B testing is a slightly more complex scenario, I suspect that the same
problem exists in that realm. A refresher on a typical A/B test scenario: you, as the
administrator of the test, care about the difference between two sample means. If the
samples are from Bernoulli random variables (a good model of click-through rates), then
the <em>true</em> distribution of this difference is the distribution of the difference of
(scaled) binomial random variables, which is more difficult to write down and work
with. Of course, the Gaussian approximation is simple, since the difference of two
Gaussians is again a Gaussian.<sup id="fnref:fnote_AB" role="doc-noteref"><a href="#fn:fnote_AB" class="footnote" rel="footnote">3</a></sup></p>
<p>Most statistical tests are approximate in this way. For example, the \(\chi^2\) test for
goodness of fit is an approximate test. So what are we to make of the fact that this
approximation does not guarantee increasingly valid \(p\)-values? Honestly, I don’t
know. I’m sure that others have considered this issue, but I’m not familiar with the
thinking of the statistical community on it. (As always, please comment if you know
something that would help me understand this better.) All I know is that when doing
tests like this in the future, I’ll be much more careful about how I report my results.</p>
<h1 id="afterword-technical-details">Afterword: Technical Details</h1>
<p>As I said above, the two \(p\)-values do, in fact, converge. However, there is an
interesting mathematical twist in that <strong>the convergence is not guaranteed by the
central limit theorem.</strong> It’s a bit beside the point, and quite technical, but I found
it so interesting that I thought I should write it up.</p>
<p>As I said, this section isn’t essential to my central argument about the insufficiency
of simple convergence; it’s more of an interesting aside.</p>
<h2 id="limitations-of-the-central-limit-theorem">Limitations of the Central Limit Theorem</h2>
<p>To understand the problem, we have to do a deep dive into the details of the central
limit theorem. This will get technical. The TL;DR is that since our \(p\)-values are
getting smaller, the CLT doesn’t actually guarantee that they will converge.</p>
<p>Suppose we have a sequence of random variables \(X_1, X_2, X_3, \ldots\). These would
be, in the example above, the Bernoulli random variables that represent individual people’s
responses to your question about rollerskates. Suppose that these random variables are
independent and identically distributed, with mean \(\mu\) and finite variance
\(\sigma^2\).<sup id="fnref:fnote_bin" role="doc-noteref"><a href="#fn:fnote_bin" class="footnote" rel="footnote">4</a></sup></p>
<p>Let \(S_n\) be the sample mean of all the \(X_i\) up through \(n\):</p>
\[S_n = \frac{1}{n} \sum_{i=1}^n X_i.\]
<p>We want to say what distribution the sample mean converges to. First, we know it’ll
converge to something close to the mean, so let’s subtract that off so that it converges
to something close to zero. So now we’re considering \(S_n - \mu\). But we also know
that the standard deviation goes down like \(1/\sqrt{n}\), so to get it to converge to
something stable, we have to multiply by \(\sqrt{n}\). So now we’re considering the
shifted and scaled sample mean \(\sqrt{n}\left(S_n - \mu\right)\).</p>
<p>The central limit theorem states that this converges <strong>in distribution</strong> to a normal
random variable with distribution \(N(0, \sigma^2)\). Notationally, you might see
mathematicians write</p>
\[\sqrt{n}\left(S_n-\mu\right)\ \xrightarrow{D} N(0,\sigma^2).\]
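A quick numerical illustration of this statement (a sketch assuming NumPy; the seed and sample sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.8, 10_000, 2_000

# Each trial is the sample mean S_n of n Bernoulli(p) draws,
# shifted by mu = p and scaled by sqrt(n).
sample_means = rng.binomial(n, p, size=trials) / n
scaled = np.sqrt(n) * (sample_means - p)

# The CLT says `scaled` should look like N(0, sigma^2),
# where sigma = sqrt(p * (1 - p)) = 0.4 for a Bernoulli(0.8).
print(scaled.mean(), scaled.std())
```

The empirical mean of the shifted-and-scaled sample means sits near 0, and their standard deviation near \(\sigma = 0.4\), just as the theorem predicts.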
<p>What does it mean that they converge <strong>in distribution</strong>? It means that, for a fixed
region, the areas under the respective curves converge. Note that <strong>we have to fix the
region</strong> to get convergence. Let’s look at some pictures. First, note that we can plot the
exact distribution of the variable \(\sqrt{n}(S_n-\mu)\); it’s just a binomial random
variable, appropriately shifted and scaled. We’ll plot this alongside the normal
approximation \(N(0,\sigma^2)\).</p>
<!-- I'd like to have this centered. -->
<p><img src="/assets/images/clt.gif" alt="CLT gif" /></p>
<p>The area under the shaded part of the normal converges to the area of the bars in that
same shaded region. This is what convergence in distribution means.</p>
<p>Now for the crux. As we gather data, it becomes more and more obvious that our null
hypothesis is incorrect - that is, we move further and further out into the tail of the
null hypothesis’ distribution for \(S_n\). This is very intuitive - as we gather more
data, we expect our \(p\)-value to go down. The \(p\)-value is a tail integral of the
distribution, so we expect to be moving further and further into the tail of the
distribution.</p>
<p>Here’s a gif, where the shaded region represents the \(p\)-value that we’re calculating:</p>
<!-- I'd like to have this centered. -->
<p><img src="/assets/images/p-val.gif" alt="p-value gif" /></p>
<p>As we increase \(n\), the region we’re integrating over changes. So we don’t get
convergence guarantees from the CLT.</p>
<h2 id="the-berry-esseen-theorem">The Berry-Esseen Theorem</h2>
<p>It’s worth noting that there is a stronger statement of convergence that applies
specifically to the convergence of the binomial distribution to the corresponding
Gaussian. It is called the <strong>Berry-Esseen theorem</strong>, and it states that the maximum
distance between the cumulative distribution functions of the binomial and the
corresponding Gaussian is \(O(n^{-1/2})\). This claim, which is akin to uniform
convergence of functions (compare to the pointwise convergence of the CLT), does, in
fact, guarantee that our \(p\)-values will converge.</p>
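For reference, the classical form of the bound, for i.i.d. summands with mean \(\mu\), variance \(\sigma^2\), and finite third absolute moment \(\rho = E|X_i - \mu|^3\), is</p>

\[\sup_x \left| F_n(x) - \Phi(x) \right| \leq \frac{C\rho}{\sigma^3\sqrt{n}},\]

<p>where \(F_n\) is the CDF of the standardized sum, \(\Phi\) is the standard normal CDF, and \(C\) is a universal constant.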
<p>But, as I’ve said above, this is immaterial, albeit interesting; we know already that
the \(p\)-values converge, and we also know that this is not enough for us to be
reporting one as an approximation of the other.</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote_clt" role="doc-endnote">
<p>So long as the variance of the distribution being sampled is finite. <a href="#fnref:fnote_clt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_sample" role="doc-endnote">
<p>You should decide this number based on some alternative hypothesis and
a power analysis. Also, you should ensure that you are sampling people evenly -
going to a park, for example, might bias your sample towards those that enjoy
rollerskating. <a href="#fnref:fnote_sample" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_AB" role="doc-endnote">
<p>I haven’t done a numerical test on this scenario because the true
distribution (the difference between two scaled binomials) is nontrivial to
calculate, and numerical issues arise as we calculate such small \(p\)-values, which
SciPy takes care of for us in the above example. But as I said, I would be
unsurprised if our Gaussian-approximated \(p\)-values are increasingly poor
approximations of the true \(p\)-value as we gather more samples. <a href="#fnref:fnote_AB" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_bin" role="doc-endnote">
<p>In our case, for a single Bernoulli random variable with parameter \(p\),
we have \(\mu=p\) and \(\sigma^2=p(1-p)\). <a href="#fnref:fnote_bin" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>Peter Wills (peter@pwills.com). Is hypothesis testing built upon a house of lies? No, probably not. But still, read this article.</p>
<hr />
<p><strong>DS Interview Study Guide Part II: Software Engineering</strong> (2019-08-29). <a href="http://www.pwills.com/posts/2019/08/29/engineering">http://www.pwills.com/posts/2019/08/29/engineering</a></p>
<p>This post continues my series on data science interviews. A major difficulty of data
science interviews is that you must show expertise in a wide variety of skills. In
particular, I see four key subject areas that you might be asked about during
an interview:</p>
<ol>
<li>Statistics</li>
<li>Software Engineering/Coding</li>
<li>Machine Learning</li>
<li>“Soft” Questions</li>
</ol>
<p>This post focuses on software engineering & coding. It will be primarily a resource for
aggregating content that I think you should be familiar with. I will mostly point to
outside sources for technical exposition and practice questions.</p>
<p>I’ll link to these as appropriate throughout the post, but I thought it would be helpful
to put up front a list of the primary resources that I’ve used when studying for
interviews. Some of my favorites are:</p>
<ul>
<li><a href="https://www.amazon.com/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275/">Data Structures and Algorithms in Python</a>, for a good introduction to
data structures such as linked lists, arrays, hashmaps, and so on. It also can give
you good sense of how to write idiomatic Python code, for building fundamental
classes.</li>
<li><a href="https://sqlzoo.net/">SQLZoo</a> for studying SQL and doing practice questions. I particularly like
the “assessments”.</li>
<li><a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a> for lots of practice questions organized by
subject, and good general advice for the technical interviewing process.</li>
</ul>
<p>I also use coding websites like LeetCode to practice various problems, and look on
Glassdoor to see <a href="https://www.glassdoor.com/Interview/san-francisco-data-scientist-interview-questions-SRCH_IL.0,13_IM759_KO14,28.htm">what kinds of problems</a> people have been asked.</p>
<p>As always, I’m working to improve this post, so please do leave comments with feedback.</p>
<h1 id="what-languages-should-i-know">What Languages Should I Know?</h1>
<p>In this section of data science interviews, you are generally asked to implement things
in code. So, which language should you do it in? Generally, the best answer is
(unsurprisingly) that <strong>you should work in Python</strong>. The next most popular choice is R;
I’m not very familiar with R, so I can’t really speak to its capabilities.</p>
<p>There are a few reasons you should work in Python:</p>
<ol>
<li>It’s widely adopted within industry.</li>
<li>It has high-quality, popular packages for working with data (see <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">numpy</code>,
<code class="language-plaintext highlighter-rouge">scipy</code>, <code class="language-plaintext highlighter-rouge">statsmodels</code>, <code class="language-plaintext highlighter-rouge">scikit-learn</code>, <code class="language-plaintext highlighter-rouge">matplotlib</code>, etc).</li>
<li>It bridges the gap between academic work (e.g. using NumPy to build a fast solver for
differential equations) and industrial work (e.g. using Django to build webservices).</li>
</ol>
<p>This is far from an exhaustive list. Anyways, I mostly work in Python. I think it’s a
nice language because it is clear and simple to write.</p>
<p>If you want to use another language, you should make sure that you can do everything you
need to - this includes reading & writing data, cleaning/munging data, plotting,
implementing statistical and machine learning models, and leveraging basic data types
like hashmaps and arrays (more on those later).</p>
<p>I think if you wanted to do your interviews in R it would be fine, so long as you can do
the above. I would strongly recommend against languages like MATLAB, which are
proprietary and not open-source.</p>
<p>Languages like Java can be tricky since they might not have the data-oriented libraries
that Python has. For example, I’ve worked professionally in Scala, and am very
comfortable manipulating data via the Spark API within it, but still wouldn’t want to
have to use it in an interview; it just isn’t as friendly for general-purpose hacking as
Python.</p>
<p>So is Python all you need? Well, not quite. You should also be familiar with SQL for
querying databases; we’ll get into that later. I don’t think the dialect you use
particularly matters. <a href="https://sqlzoo.net/">SQLZoo</a> works with MySQL, which is fine. Familiarity with
bash and shell-scripting is useful for a data scientist in their day-to-day work, but
generally isn’t asked about in interviews. For the interviews, I’d say if you know one
general-purpose language (preferably Python, or R if need be) and SQL, then you’ll be
fine.</p>
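If you want a zero-setup way to practice SQL alongside Python, the standard library’s <code class="language-plaintext highlighter-rouge">sqlite3</code> module works well. Here’s a minimal sketch (the table and data are made up for illustration):

```python
import sqlite3

# In-memory database with a hypothetical table of user clicks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id INTEGER, clicked INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [(1, 1), (1, 0), (2, 1), (2, 1), (3, 0)],
)

# Per-user click-through rate: the kind of GROUP BY aggregation
# that SQL interview questions often ask for.
rows = conn.execute(
    "SELECT user_id, AVG(clicked) FROM clicks "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 0.5), (2, 1.0), (3, 0.0)]
```

The dialect differs slightly from MySQL, but for practicing joins, aggregations, and subqueries it’s close enough.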
<h1 id="general-tips-for-coding-interviews">General Tips for Coding Interviews</h1>
<p>Coding interviews are notorious for being high-stress, so it’s important that you
practice in a way that will maximize your comfort during the interview itself - you
don’t want to add any unnecessary additional stress into an already difficult
situation. There are a wide variety of philosophies and approaches to preparing yourself
for and executing a successful interview. I’m going to talk about some points that
resonate with me, but I’d also recommend reading <a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a>
for a good discussion. Of course, this isn’t the final word on the topic - there are
endless resources available online that address this.</p>
<h2 id="how-to-prepare">How to Prepare</h2>
<p>When preparing for the interview, make sure to practice in an environment similar to the
interview environment. There are a few aspects of this to keep in mind.</p>
<ul>
<li>Make sure that you replicate the <strong>writing environment</strong> of the interview. So, if
you’ll be coding on a whiteboard, try to get access to a whiteboard to practice. At
least practice on a pad of paper, so that you’re comfortable with handwriting code -
it’s really quite different than using a text editor. If you’ll be coding in a Google
Doc, practice doing that (protip: use a monospaced font). Most places I’ve
interviewed at don’t let you evaluate your code to test it, so you have to be prepared
for that.</li>
<li><strong>Time yourself!</strong> It’s important to make sure you can do these things in a reasonable
amount of time. Generally, these things last 45 minutes per “round” (with multiple
rounds for on-site interviews). Focus on being efficient at implementing simple ideas,
so that you don’t waste a bunch of time with your syntax and things like that.</li>
<li><strong>Practice talking.</strong> If you practice by coding silently by yourself, then it might
feel strange when you’re in the interview and have to talk through your process. The
best is if you can have a friend who is familiar with interviewing play the
interviewer, so that you can talk to them, get asked questions, etc. You can also
record yourself and just talk to the recorder, so that you get practice externalizing
your thoughts.</li>
</ul>
<p>There are some services online that will do “practice” interviews for you. When I was
practicing for a software engineer interview with Google, I used <a href="http://www.gainlo.co/#!/">Gainlo</a> for
this - they were kind of expensive, but you interview with real Google software
engineers, which I found helpful.</p>
<p>However, the interviews for a software engineering position at Google are very
standardized in format. I haven’t used any of the services that do this for data
science, and the data science interviews you’ll face are far more varied, so I imagine
it is harder to do a helpful “mock interview”. If you’ve used any of these services, I’d
be very curious to hear about your experience.</p>
<h2 id="tips-for-interviewing">Tips for Interviewing</h2>
<p>There are some things it’s important to keep in mind as you do the interview itself.</p>
<ul>
<li><strong>Talk about your thought process.</strong> Don’t just sit silently thinking, then go and
write something on the board. Let the interviewer into your mind so that they can see
how you are thinking about the problem. This is good advice at any point in a
technical interview.</li>
<li><strong>Start with a simple solution you have confidence in.</strong> If you know that you can
quickly write up a suboptimal solution (in this case, maybe insertion sort), then do
that! You can discuss <em>why</em> that solution is sub-optimal, and they will often
brainstorm with you about how to improve it. That said, if you are just as confident
in writing up something more optimal (say, quicksort) then feel free to jump right to
that.</li>
<li><strong>Sketch out your solution before doing real code.</strong> This is not necessary, but
sometimes for complicated stuff it’s nice to write out your approach in pseudocode
before jumping into real code. This can also help with exposing your thought process
to the interviewer, and making sure they’re on board with how you’re thinking about
it.</li>
<li><strong>Think about edge cases.</strong> Suppose they ask you to write a function that sorts a
list. What if you’re given an empty list? What if you’re given a list of
non-comparable things? (In Python, this might be a list of lists.) What does your
function do in this case? Is that what you <em>want</em> it to do? There’s no right answer
here, but you should definitely be thinking about this and asking the interviewer how
they want the function to behave on these cases.</li>
<li><strong>Be sure to do a time complexity analysis on your solution.</strong> They want to know that
you can think about efficiency, so unless they explicitly ask you not to do this, I’d
recommend it. We’ll discuss more about what this means below.</li>
</ul>
<p>For a more thorough discussion of preparation and day-of techniques, I’d recommend
<a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a>.</p>
<h2 id="tips-for-coding">Tips for Coding</h2>
<p>There are a few things specifically about how the interviewee writes code that I think are
worth mentioning. This kind of stuff usually isn’t a huge deal, but if you write good
code, it can show professionalism and help leave a good impression.</p>
<ul>
<li><strong>Name your variables well.</strong> If the variable is the average number of users per
region, use <code class="language-plaintext highlighter-rouge">num_users_per_region</code>, or <code class="language-plaintext highlighter-rouge">users_per_region</code>, not <code class="language-plaintext highlighter-rouge">avg_usr</code> or
<code class="language-plaintext highlighter-rouge">num_usr</code>. Unlike in mathematics, it’s good to have long, descriptive variable names.</li>
<li><strong>Use built-ins when you can!</strong> Python already <em>has</em> functions for sorting, for
building cartesian products of lists, for implementing various models (in
<code class="language-plaintext highlighter-rouge">statsmodels</code> and <code class="language-plaintext highlighter-rouge">scikit-learn</code>), and endless other things. It also has some cool
data structures already implemented, like the <a href="https://docs.python.org/3.7/library/heapq.html"><code class="language-plaintext highlighter-rouge">heap</code></a> and
<a href="https://docs.python.org/3/library/queue.html"><code class="language-plaintext highlighter-rouge">queue</code></a>. Get to know the <code class="language-plaintext highlighter-rouge">itertools</code> module; it has lots of useful stuff.
If you can use these built-ins effectively, it demonstrates skill and knowledge
without adding much effort on your part.</li>
<li><strong>Break things into functions.</strong> If one step of your code is sorting a list, and you
can’t use the built-in <code class="language-plaintext highlighter-rouge">sorted()</code> function, then write a separate function <code class="language-plaintext highlighter-rouge">def
sort()</code> before you write your main function. This increases both readability and
testability of code, and is essential for real-world software.</li>
<li><strong>Write idiomatic Python.</strong> This is a bit less important, but make sure to iterate
directly over iterables, don’t do <code class="language-plaintext highlighter-rouge">for i in range(len(my_iterable))</code>. Also,
familiarize yourself with <code class="language-plaintext highlighter-rouge">enumerate</code> and <code class="language-plaintext highlighter-rouge">zip</code> and know how to use them. Know how to
use list comprehensions, and be aware that you can do a similar thing for
dictionaries, sets, and even arguments of functions - for example, you can do
<code class="language-plaintext highlighter-rouge">max(item for item in l if item % 2 == 0)</code> to find the maximum even number in <code class="language-plaintext highlighter-rouge">l</code>. Know
how to do string formatting using either <code class="language-plaintext highlighter-rouge">.format()</code> or <code class="language-plaintext highlighter-rouge">f</code>-strings in Python
3.<sup id="fnref:fnote_py3" role="doc-noteref"><a href="#fn:fnote_py3" class="footnote" rel="footnote">1</a></sup></li>
</ul>
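<p>To make a few of these idioms concrete, here is a small sketch (the names and values are made up for illustration):</p>

```python
names = ['Peter', 'Kat', 'Jeff']
ages = [12, 25, 41]

# enumerate gives you the index alongside the item
for i, name in enumerate(names):
    print(f"{i}: {name}")

# zip pairs up two iterables; here it feeds a dict comprehension
name_to_age = {name: age for name, age in zip(names, ages)}

# A generator expression passed straight into max()
oldest_odd = max(age for age in ages if age % 2 == 1)
print(f"Oldest odd age: {oldest_odd}")  # uses an f-string
```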
<p>I’m only scratching the surface of how to write good code. It helps to read code that
others have written to see what you don’t know. You can also look at code in large
open-source libraries.</p>
<p>With all that said, let’s move on to some of the content that might be asked about in
these interviews.</p>
<h1 id="working-with-data">Working with Data</h1>
<p>One of the fundamental tasks of a data scientist is to load, manipulate, clean, and
visualize data in various formats. I’ll go through some of the basic tasks that I think
you should be able to do, and either include or link to Python implementations. If you
work in R, or any other language, you should make sure that you can still do these
things in your preferred language.</p>
<p>In Python, the key technologies are the packages pandas (for loading, cleaning, and
manipulating data), numpy (for efficiently working with unlabeled numeric data), and
matplotlib (for plotting and visualizing data).</p>
<h2 id="loading--cleaning-data">Loading & Cleaning Data</h2>
<p><a href="https://www.datacamp.com/community/tutorials/pandas-read-csv">This tutorial on DataCamp</a> nicely deals with the basics of using
<code class="language-plaintext highlighter-rouge">pd.read_csv()</code> to load data into Pandas. It is also possible to load from other
formats, but in my experience writing to and from comma- or tab-separated plaintext is
by far the most common approach for datasets that fit in memory.<sup id="fnref:fnote_parquet" role="doc-noteref"><a href="#fn:fnote_parquet" class="footnote" rel="footnote">2</a></sup></p>
<p>For example, suppose you had the following data in a csv file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name,age,country,favorite color
steve,7,US,green
jennifer,14,UK,blue
franklin,,UK,black
calvin,22,US,
</code></pre></div></div>
<p>You can copy and paste this into Notepad or whatever text editor you
like<sup id="fnref:fnote_emacs" role="doc-noteref"><a href="#fn:fnote_emacs" class="footnote" rel="footnote">3</a></sup>, and save it as <code class="language-plaintext highlighter-rouge">data.csv</code>.</p>
<p>You should be able to</p>
<ul>
<li>Load in data from text, whether it is separated by commas, tabs, or some other
arbitrary character (sometimes things are separated by the “pipe” character <code class="language-plaintext highlighter-rouge">|</code>). In
this case, you can just do <code class="language-plaintext highlighter-rouge">df = pd.read_csv('data.csv')</code> to load it.</li>
<li>Filter for missing data. If you wanted to find the row(s) where the age is missing,
for example, you could do <code class="language-plaintext highlighter-rouge">df[df['age'].isnull()]</code></li>
<li>Filter for data values. For example, to find people from the US, do <code class="language-plaintext highlighter-rouge">df[df['country'] == 'US']</code></li>
<li>Replace missing data; use <code class="language-plaintext highlighter-rouge">df.fillna(0)</code> to replace missing data with zeros. Think for
yourself about how you would want to handle missing data in this case - does it make
sense to replace everything with zeros? What <em>would</em> make sense?</li>
</ul>
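<p>As a quick, self-contained sketch of those bullets against the sample file above (here the csv text is inlined via <code>io.StringIO</code> so it runs as-is; in a real session you’d just pass the filename):</p>

```python
import io
import pandas as pd

csv_text = """name,age,country,favorite color
steve,7,US,green
jennifer,14,UK,blue
franklin,,UK,black
calvin,22,US,
"""
df = pd.read_csv(io.StringIO(csv_text))  # same as pd.read_csv('data.csv')

missing_age = df[df['age'].isnull()]   # rows where age is missing
from_us = df[df['country'] == 'US']    # rows where country == 'US'
# One (debatable!) choice: fill the missing age with the median age
filled = df.fillna({'age': df['age'].median()})
```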
<p>Dealing with missing data is, in particular, an important problem, and not one that has
an easy answer. <a href="https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4">Towards Data Science</a> has a decent post on this
subject, but if you’re curious, there’s a lot to read about and learn here.</p>
<p>More advanced topics in pandas-fu include <a href="https://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/">using <code class="language-plaintext highlighter-rouge">groupby</code></a>, joining
dataframes (this is called a “merge” in pandas, but works the same as a SQL join), and
<a href="https://hackernoon.com/reshaping-data-in-python-fa27dda2ff77">reshaping data</a>.</p>
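<p>A quick sketch of the first two, using a small made-up table (the ages and capitals here are invented for illustration):</p>

```python
import pandas as pd

people = pd.DataFrame({'name': ['steve', 'jennifer', 'franklin', 'calvin'],
                       'country': ['US', 'UK', 'UK', 'US'],
                       'age': [7, 14, 21, 22]})
capitals = pd.DataFrame({'country': ['US', 'UK'],
                         'capital': ['Washington', 'London']})

# groupby: average age per country
avg_age = people.groupby('country')['age'].mean()

# "merge" is pandas' name for a SQL-style join
joined = people.merge(capitals, on='country', how='left')
```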
<p>As I said before, loading and manipulating data is one of the fundamental tasks of a
data scientist. You should probably be comfortable doing most or all of these tasks if
asked. Pandas can be a bit unintuitive, so I’d recommend practicing if you aren’t
already comfortable with it. Doing slicing and reshaping tasks in numpy is also an
important skill, so make sure you are comfortable with that as well.</p>
<h2 id="visualization">Visualization</h2>
<p>Another essential aspect of data work is visualization. Of course, this is an entire
field unto itself; here, I’ll mostly be focusing on the practical aspects of making
simple plots. If you want to start to learn more about the overarching principles of the
visual representation of data, <a href="https://www.edwardtufte.com/tufte/books_vdqi">Tufte’s book</a> is the classic in the field.</p>
<p>In Python, the fundamental tool used for data visualization is the library
<code class="language-plaintext highlighter-rouge">matplotlib</code>. There exist many other libraries for more complicated visualization tasks,
such as <code class="language-plaintext highlighter-rouge">seaborn</code>, <code class="language-plaintext highlighter-rouge">bokeh</code>, and <code class="language-plaintext highlighter-rouge">plotly</code>, but the only one that you really <em>need</em> to be
comfortable with (in my opinion) is <code class="language-plaintext highlighter-rouge">matplotlib</code>.</p>
<p>You should be comfortable with:</p>
<ul>
<li>plotting two lists against one another</li>
<li>changing the labels on the x- and y-axis of your plot, and adding a title</li>
<li>changing the x- and y-limits of your plot</li>
<li>plotting a bar graph</li>
<li>plotting a histogram</li>
<li>plotting two curves together, labelling them, and adding a legend</li>
</ul>
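<p>A minimal sketch that exercises most of these in one figure (the <code>Agg</code> backend is selected so it runs without a display):</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; no window is opened
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y1 = [xi ** 2 for xi in x]
y2 = [2 * xi for xi in x]

fig, ax = plt.subplots()
ax.plot(x, y1, label='quadratic')  # two curves on the same axes...
ax.plot(x, y2, label='linear')
ax.set_xlabel('x')                 # ...with axis labels and a title...
ax.set_ylabel('y')
ax.set_title('Two curves')
ax.set_xlim(0, 4)                  # ...custom x-limits...
ax.legend()                        # ...and a legend built from the labels
fig.savefig('two_curves.png')
```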
<p>I won’t go through the details here - I’m sure you can find many good guides to each of
these online. The <a href="https://matplotlib.org/3.1.1/tutorials/introductory/pyplot.html">matplotlib pyplot tutorial</a> is a good place to
start.<sup id="fnref:fnote_pyplot" role="doc-noteref"><a href="#fn:fnote_pyplot" class="footnote" rel="footnote">4</a></sup></p>
<p>It’s worth noting that you can plot directly from pandas, by doing <code class="language-plaintext highlighter-rouge">df.plot()</code>. This
just calls out to matplotlib and plots your dataframe; I will often find myself both
plotting from the pandas <code class="language-plaintext highlighter-rouge">DataFrame.plot()</code> method as well as directly using
<code class="language-plaintext highlighter-rouge">pyplot.plot()</code>. They work on the same objects, and so you can use them together to make
more complicated plots with multiple values plotted.</p>
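<p>For instance, the two interfaces can share one set of axes (a sketch, again using the <code>Agg</code> backend so it runs headless):</p>

```python
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
ax = df.plot(x='x', y='y', label='from pandas')     # pandas draws onto a matplotlib Axes
ax.plot([1, 2, 3], [1, 4, 9], label='from pyplot')  # add to the same Axes directly
ax.legend()
ax.figure.savefig('mixed.png')
```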
<h1 id="data-structures--algorithms">Data Structures & Algorithms</h1>
<p>Designing and building effective software is predicated on a solid understanding of the
basic data structures that are available, and familiarity with the ways that they are
employed in common algorithms. For me, learning this material opened up the world of
software engineering - it illuminated the inner workings of computer languages. It also
helped me understand the pros and cons of various approaches to problems, in ways that I
wouldn’t have been able to before.</p>
<p>This subject is fundamental to software engineering interviews, but for data scientists,
its importance can vary drastically from role to role. For engineering-heavy roles, this
material can make up half or more of the interview, while for more statistician-oriented
roles, it might only be very lightly touched upon. You will have to use your judgement
to determine to what extent this material is important to you.</p>
<p>I learned this material when I was interviewing by reading the book <a href="https://www.amazon.com/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275/">Data Structures and
Algorithms in Python</a>.<sup id="fnref:fnote_dsa2" role="doc-noteref"><a href="#fn:fnote_dsa2" class="footnote" rel="footnote">5</a></sup> It’s really a great book - it has good, clear
explanations of all the important topics, including complexity analysis and some of the
basics of the Python language. I can’t recommend it highly enough if you want to get
more familiar with this material.<sup id="fnref:fnote_dsa" role="doc-noteref"><a href="#fn:fnote_dsa" class="footnote" rel="footnote">6</a></sup> You can buy it, or look around online for
the PDF - it shouldn’t be too hard to find.</p>
<h2 id="time-and-space-complexity-analysis">Time and Space Complexity Analysis</h2>
<p>Before you begin writing algorithms, you need to know how to analyze their
complexity. The “complexity” of an algorithm tells you how the amount of time (or space)
that the algorithm takes depends on the size of the input data.</p>
<p>It is formalized using the so-called “big-O” notation. The precise mathematical
definition of \(\mathcal{O}(n)\) is somewhat confusing, so you can just think of it
roughly as meaning that an algorithm that is \(\mathcal{O}(n)\) “scales like \(n\)”; so,
if you double the input size, you double the amount of time it takes. If an algorithm is
\(\mathcal{O}(n^3)\), then, doubling the input size means that you multiply the time it
takes by \(2^3 = 8\).<sup id="fnref:fnote_bigo" role="doc-noteref"><a href="#fn:fnote_bigo" class="footnote" rel="footnote">7</a></sup> You can see how even a \(\mathcal{O}(n^2)\) algorithm wouldn’t
work for large data; even if it runs in a reasonable amount of time (say, 5 seconds) for
10,000 points, it would take on the order of 1,500 years to run on 1 billion data
points. Obviously, this is no good.</p>
<p>So complexity analysis is critical. You don’t want to settle for a \(\mathcal{O}(n^2)\)
solution when a \(\mathcal{O}(n)\) or \(\mathcal{O}(n \log n)\) solution is available. I
won’t get into how to do the analysis here, besides saying that I often like to annotate
my loops with their complexity when I’m writing things. For example, here’s a (slow)
approach to finding the largest k (unique) numbers in a list:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_top_k</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">input_list</span><span class="p">):</span>
<span class="n">top_k</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k</span><span class="p">):</span> <span class="c1"># happens k times
</span> <span class="n">remaining</span> <span class="o">=</span> <span class="p">[</span><span class="n">num</span> <span class="k">for</span> <span class="n">num</span> <span class="ow">in</span> <span class="n">input_list</span> <span class="k">if</span> <span class="n">num</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">top_k</span><span class="p">]</span> <span class="c1"># O(n)
</span> <span class="k">if</span> <span class="n">remaining</span><span class="p">:</span>
<span class="n">top_remaining</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">remaining</span><span class="p">)</span> <span class="c1"># O(n)
</span> <span class="n">top_k</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">top_remaining</span><span class="p">)</span> <span class="c1"># O(1)
</span> <span class="k">return</span> <span class="n">top_k</span>
</code></pre></div></div>
<p>I know that the outer loop happens <code class="language-plaintext highlighter-rouge">k</code> times, and since finding the maximum of a list is
\(\mathcal{O}(n)\), the total task is \(\mathcal{O}(nk)\).<sup id="fnref:fnote_asymptotics" role="doc-noteref"><a href="#fn:fnote_asymptotics" class="footnote" rel="footnote">8</a></sup> To learn
more about how to do complexity analysis, I’d look at <a href="https://www.amazon.com/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275/">DS&A</a>, <a href="http://www.crackingthecodinginterview.com/">Cracking the
Coding Interview</a>, or just look around online - I’m sure there are plenty of good
resources out there.</p>
<p>You can also consider not just the time of computation, but the amount of memory (space)
that your algorithm uses. This is not quite as common as time-complexity analysis, but
is still important to be able to do.</p>
<p>A very useful resource for anyone studying for a coding interview is the <a href="https://www.bigocheatsheet.com/">big-O cheat
sheet</a>, which shows the complexity of access, search, insertion, and deletion for
various data types, as well as the complexity of searching algorithms, and a lot more. I
often use it as a reference, but of course it’s important that you understand <em>why</em> (for
example) an array has \(\mathcal{O}(n)\) insertion. Just memorizing complexities won’t
help you much.</p>
<h2 id="arrays--hashmaps">Arrays & Hashmaps</h2>
<p>In my opinion, the two essential data structures for a data scientist to know are
the array and the hashmap. In Python, the <code class="language-plaintext highlighter-rouge">list</code> type is an array, while the <code class="language-plaintext highlighter-rouge">dict</code> type
is a hashmap. Since both are used so commonly, you have to know their properties if you
want to be able to design efficient algorithms and do your complexity analysis
correctly.</p>
<p><strong>Arrays</strong> are a data type where a piece of data (like a string) is linked to an index
(in Python, this is an integer, starting with 0). I won’t go too deep into the details
here, but for arrays, the important thing to know is that getting any element of an
array is easy (i.e. doing <code class="language-plaintext highlighter-rouge">mylist[5]</code> is \(\mathcal{O}(1)\), so it doesn’t depend on the
size of the array) but adding elements (particularly in the beginning or middle of the
array) is difficult; doing <code class="language-plaintext highlighter-rouge">mylist.insert(k, 'foo')</code> is \(\mathcal{O}(n-k)\), where
\(k\) is the position you wish to insert at.<sup id="fnref:fnote_linked" role="doc-noteref"><a href="#fn:fnote_linked" class="footnote" rel="footnote">9</a></sup></p>
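<p>You can see this asymmetry directly with a rough timing sketch (the absolute times will vary by machine; only the ratio matters):</p>

```python
import timeit

n = 20_000
# Appending at the end is amortized O(1) per operation...
t_append = timeit.timeit('l.append(0)', setup='l = []', number=n)
# ...while inserting at the front is O(n) per operation, since every
# existing element has to shift over by one slot.
t_front = timeit.timeit('l.insert(0, 0)', setup='l = []', number=n)
print(f"append: {t_append:.4f}s  insert-at-front: {t_front:.4f}s")
```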
<p>Arrays are what we usually use when we’re building simple, unlabelled collections of
objects in Python. This is fine, since insertion at the end of an array is fast, and
we’re often accessing slices of arrays in a complicated fashion (particularly in
numpy). I generally use arrays by default, without thinking too much about it, and it
generally works out alright.</p>
<p><strong>Hashmaps</strong> also link values to keys, but in this case the key can be anything you
want, rather than having to be an ordered set of integers. In Python, you build them by
specifying the key and the value, like <code class="language-plaintext highlighter-rouge">{'key': 'value'}</code>. Hashmaps are magical in that
accessing elements <em>and</em> adding elements are both
\(\mathcal{O}(1)\).<sup id="fnref:fnote_array_hashmap" role="doc-noteref"><a href="#fn:fnote_array_hashmap" class="footnote" rel="footnote">10</a></sup> Why is this cool? Well, say you wanted to
store a bunch of people’s names and ages. You might think to do a list of tuples:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">name_ages</span> <span class="o">=</span> <span class="p">[(</span><span class="s">'Peter'</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span> <span class="p">(</span><span class="s">'Kat'</span><span class="p">,</span> <span class="mi">25</span><span class="p">),</span> <span class="p">(</span><span class="s">'Jeff'</span><span class="p">,</span> <span class="mi">41</span><span class="p">)]</span>
</code></pre></div></div>
<p>Then, if you wanted to find out Jeff’s age, you would have to iterate through the list
and find the correct tuple:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span> <span class="ow">in</span> <span class="n">name_ages</span><span class="p">:</span> <span class="c1"># happens n times
</span> <span class="k">if</span> <span class="n">name</span> <span class="o">==</span> <span class="s">'Jeff'</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Jeff's age is </span><span class="si">{</span><span class="n">age</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<p>This is \(\mathcal{O}(n)\) - not very efficient. With hashmaps, you can just do</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">name_ages</span> <span class="o">=</span> <span class="p">{</span><span class="s">'Peter'</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span> <span class="s">'Kat'</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span> <span class="s">'Jeff'</span><span class="p">:</span> <span class="mi">41</span><span class="p">}</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Jeff's age is </span><span class="si">{</span><span class="n">name_ages</span><span class="p">[</span><span class="s">'Jeff'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> <span class="c1"># O(1)! Wow!
</span></code></pre></div></div>
<p>It might not be obvious how cool this is until you see how to use it in
problems. <a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a> has lots of good problems on hashmaps,
but I’ll just reproduce some of the classics here. I think it’s worth knowing these,
because they really can give you an intuitive sense of when and how hashmaps are
valuable.</p>
<p>The first classic hashmap algorithm is <strong>counting frequencies of items in a list.</strong> That
is, given a list, you want to know how many times each item appears. You can do this via
the following:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_freqs</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
<span class="n">freqs</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">l</span><span class="p">:</span> <span class="c1"># happens O(n) times
</span> <span class="k">if</span> <span class="n">item</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">freqs</span><span class="p">:</span> <span class="c1"># This check is O(1)! Wow!
</span> <span class="n">freqs</span><span class="p">[</span><span class="n">item</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">freqs</span><span class="p">[</span><span class="n">item</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="c1"># Also O(1)! Wow!
</span> <span class="k">return</span> <span class="n">freqs</span>
</code></pre></div></div>
<p>Try and think of how you’d do this <em>without</em> hashmaps. Probably, you’d sort the list,
and then look at adjacent values. But sorting is, at best, \(\mathcal{O}(n \log n)\). This
solution does it in \(\mathcal{O}(n)\)!</p>
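<p>Incidentally, the standard library already packages this pattern up as <code>collections.Counter</code>, which is exactly such a frequency hashmap:</p>

```python
from collections import Counter

freqs = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
print(freqs['a'])            # 3
print(freqs.most_common(1))  # [('a', 3)] - sorted by frequency for you
```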
<p>Another classic problem that is solved with hashmaps is to <strong>find all repeated elements
in a list.</strong> This is really just a variant of the last, where you look for elements that
have frequency greater than 1.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_repeated</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">get_freqs</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="k">return</span> <span class="p">[</span><span class="n">item</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">f</span> <span class="k">if</span> <span class="n">f</span><span class="p">[</span><span class="n">item</span><span class="p">]</span> <span class="o">></span> <span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<p>Now, if you only need <em>one</em> repeated element, you can be efficient and just terminate on
the first one you find. For this, we’ll use a <code class="language-plaintext highlighter-rouge">set</code>, which you can think of as a <code class="language-plaintext highlighter-rouge">dict</code> with
keys but no values. That is to say, <strong>sets are also hashmaps</strong>. The important thing to know is
that adding to them and checking if something is in them are both \(\mathcal{O}(1)\).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_repeated</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
<span class="n">items</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">l</span><span class="p">:</span> <span class="c1"># happens O(n) times
</span> <span class="k">if</span> <span class="n">item</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">items</span><span class="p">:</span> <span class="c1"># This check is O(1)! Wow!
</span> <span class="n">items</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">None</span> <span class="c1"># if this happens, all elements are unique
</span></code></pre></div></div>
<p>The last one we’ll do is a bit trickier. You’re given a list of numbers, and a “target”,
and your task is to find a pair of numbers in the list that add up to the target. Try
and think for yourself how you’d do this - the fact you use hashmaps is a big hint. You
should be able to do it in \(\mathcal{O}(n)\).</p>
<p>Have you thought about it? When I first encountered this one I had to look up the
answer. But here’s how you do it in \(\mathcal{O}(n)\):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_sum_pair</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">target</span><span class="p">):</span>
<span class="n">nums_set</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">num</span> <span class="ow">in</span> <span class="n">l</span><span class="p">:</span>
<span class="n">other_num</span> <span class="o">=</span> <span class="n">target</span><span class="o">-</span><span class="n">num</span>
<span class="k">if</span> <span class="n">other_num</span> <span class="ow">in</span> <span class="n">nums_set</span><span class="p">:</span>
<span class="k">return</span> <span class="p">(</span><span class="n">num</span><span class="p">,</span> <span class="n">other_num</span><span class="p">)</span>
<span class="n">nums_set</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">num</span><span class="p">)</span> <span class="c1"># no-op if num is already there
</span> <span class="k">return</span> <span class="bp">None</span>
</code></pre></div></div>
<p>Note that <code class="language-plaintext highlighter-rouge">other_num = target-num</code> is the number that you would need to complete the sum
pair; using a hashmap, you can check in \(\mathcal{O}(1)\) if you’ve already seen it!
Wow!</p>
<p>Hopefully you get it - hashmaps are cool. Go on LeetCode, or pop open <a href="https://www.amazon.com/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275/">your favorite
data structures book</a>, or even <a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a>, and
get some practice with them.</p>
<h2 id="sorting--searching">Sorting & Searching</h2>
<p>Sorting and searching are two of the basic tasks you have to be familiar with for any
coding interview. You can go into a lot of depth with these, but I’ll stick to the
basics here, because that’s what I find most helpful.</p>
<h3 id="sorting">Sorting</h3>
<p><strong>Sorting</strong> is a nice problem in that the statement of the problem is fairly
straightforward; given a list of numbers, reorder the list so that every element is less
than or equal to the next. There are a number of approaches to sorting. The naive
approach is called <a href="https://en.wikipedia.org/wiki/Insertion_sort"><strong>insertion sort</strong></a>; for example, it is what most people
do when sorting a hand of cards. It has some advantages, but is \(\mathcal{O}(n^2)\) in
time, and so is not the most efficient available.</p>
<p>The two most common fast sorting algorithms are <a href="https://en.wikipedia.org/wiki/Quicksort"><strong>quicksort</strong></a> and
<a href="https://en.wikipedia.org/wiki/Merge_sort"><strong>mergesort</strong></a>. They are both \(\mathcal{O}(n \log n)\) in
time,<sup id="fnref:fnote_sort" role="doc-noteref"><a href="#fn:fnote_sort" class="footnote" rel="footnote">11</a></sup> and so scale close-to-linearly with the size of the list. I won’t go
into the implementation details here; there are plenty of good discussions of them
available on the internet.</p>
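<p>For reference, here is a minimal mergesort sketch (not tuned for performance, just to show the split-and-merge structure):</p>

```python
def merge_sort(l):
    """Sort a list in O(n log n) time by recursively splitting and merging."""
    if len(l) <= 1:
        return l
    mid = len(l) // 2
    left = merge_sort(l[:mid])
    right = merge_sort(l[mid:])
    # Merge the two sorted halves in O(n)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```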
<p>When thinking about sorting, it’s also worth considering space complexity -
can you sort without needing to carry around a second sorted copy of the list? If so,
that’s a significant advantage, especially for larger lists. It’s also worth thinking
about worst-case vs. average performance - how does the algorithm perform on a randomly
shuffled list, and how does it perform on a list specifically designed to take the
maximum number of steps for that algorithm to sort? Quicksort, for example, is actually
\(\mathcal{O}(n^2)\) in the worst case, but is \(\mathcal{O}(n \log n)\) on
average. Again, you can look to the <a href="https://www.bigocheatsheet.com/">big-O cheat sheet</a> to make sure you’re
remembering all your complexities correctly.</p>
<h3 id="searching">Searching</h3>
<p>The problem of <strong>searching</strong> is often stated as <strong>given a sorted list <code class="language-plaintext highlighter-rouge">l</code> and an object
<code class="language-plaintext highlighter-rouge">x</code>, find the index at which an element <code class="language-plaintext highlighter-rouge">x</code> lives.</strong> (You should immediately ask: What
should I return if <code class="language-plaintext highlighter-rouge">x</code> is not in <code class="language-plaintext highlighter-rouge">l</code>?) The name of the game here is <strong>binary
search</strong>. You basically split the list, then if the number is greater than the split,
search the top; otherwise, search the bottom. This is an example of a <em>recursive
algorithm</em>, so the way it’s written can be a bit opaque to those not used to looking at
recursive code. Once you wrap your head around it, though, it’s quite elegant. The
important thing to know is that this search is \(\mathcal{O}(\log n)\), which means that
you don’t touch every element in the list - it’s very fast, even for a large list. The
key to this is that the list is already sorted - if it’s not sorted, then you’re out of
luck; you’ve got to check every element to find <code class="language-plaintext highlighter-rouge">x</code>.</p>
<p>There are tons of examples of binary search in Python online, so I won’t put one
here. That said, I have found it interesting to see how thinking in terms of binary
search can help you in a variety of areas.</p>
<p>For example, suppose you had some eggs, and worked in a 40-story building, and wanted to
know the highest floor you could drop the egg off of without it breaking (it’s kind of a
dumb example because the egg would probably break even on the first floor, but pretend
it’s a super-tough egg.) You could drop it from the first floor, and see what
happens. Say it doesn’t break. Then drop it from the 40th, and see what happens. Say it
does break. Then, you bisect and use the midpoint - drop from the 20th floor. If it
breaks here, you next try the 10th - if it doesn’t you next try the 30th. This allows
you to find the correct floor much faster than trying each floor in succession.</p>
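<p>As a sketch, the egg-drop strategy above is just bisection on a monotone yes/no question. The <code class="language-plaintext highlighter-rouge">breaks</code> function below is a hypothetical stand-in; here the egg is assumed to survive floors 1 through 23, a made-up threshold:</p>

```python
def highest_safe_floor(breaks, lo=1, hi=40):
    """Binary-search for the highest floor from which the egg survives.

    `breaks(floor)` returns True if the egg breaks when dropped from `floor`,
    and is assumed monotone: once it breaks, it breaks from all higher floors.
    """
    if breaks(lo):
        return 0          # breaks even on the lowest floor
    if not breaks(hi):
        return hi         # survives even the top floor
    # Invariant: egg survives floor `lo`, breaks when dropped from floor `hi`.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if breaks(mid):
            hi = mid
        else:
            lo = mid
    return lo

print(highest_safe_floor(lambda floor: floor > 23))  # prints 23
```

<p>Note that each drop halves the range of candidate floors, so this takes \(\mathcal{O}(\log n)\) drops instead of \(\mathcal{O}(n)\).</p>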
<p>Sorting and searching are fundamental algorithms, and have been well studied for
decades. Having a basic fluency in them shows a familiarity with the field of computer
science that many employers like to see. In my opinion, <strong>you should be able to quickly
and easily implement the three sorting algorithms above, and binary search,</strong> in Python,
or whatever your language of choice is.</p>
<h1 id="working-with-sql">Working with SQL</h1>
<p>Finally, let’s talk a bit about SQL. SQL is a tool used to interact with so-called
“relational” databases, which just means that each row in a table has certain values
(columns), and that those values have the same type for each row (that is, the schema is
uniform throughout the table).<sup id="fnref:fnote_nosql" role="doc-noteref"><a href="#fn:fnote_nosql" class="footnote" rel="footnote">12</a></sup> SQL is not exactly a single language; it’s more
like a family of languages. There are many “dialects,” which all have slight differences,
but they behave the same with regards to core functionality; for example, you can do</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">column</span> <span class="k">FROM</span> <span class="k">table</span> <span class="k">WHERE</span> <span class="n">column</span> <span class="o">=</span> <span class="s1">'value'</span>
</code></pre></div></div>
<p>in any SQL-like language.<sup id="fnref:fnote_ansi" role="doc-noteref"><a href="#fn:fnote_ansi" class="footnote" rel="footnote">13</a></sup> Modern data-storage and -access solutions like
Spark and Presto are very different from older databases in their underlying
architecture, but still use a SQL dialect for accessing data.</p>
<p>Solving problems in SQL involves thinking in a quite different way than solving a
similar problem on an array in Python. There is no real notion of iteration, or at least
it’s not easily accessible, so most of the complicated action happens via table joins. I
used <a href="https://sqlzoo.net/">SQLZoo</a>, and particularly the “assessments”, to practice my SQL and get it
up to snuff. LeetCode also has a SQL section (I think they call it “database”).</p>
<p>It’s essential to know SQL as a working data scientist. You’ll almost certainly use it
in your day-to-day activities. That said, it’s not always asked in the interviews, so
you might clarify with the company whether they will ask you SQL questions.</p>
<h2 id="a-note-on-dialects">A Note on Dialects</h2>
<p>There are many dialects of SQL, and changing the dialect changes things like (for
example) how you work with dates. It’s worth asking the company you’re interviewing with
what dialect they want you to know, if they have one in mind. If you’re just writing SQL
on a whiteboard, then I would be surprised if they were picky about this; I would just
say something like “here I’d use <code class="language-plaintext highlighter-rouge">DATE(table.dt_str)</code> or whatever the string-to-date
conversion function is in your dialect”. In this case it’s just details that move
around, but the big picture is generally the same for different dialects.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Coding interviews are stressful. From what I can tell, that’s just the way it is. For
me, the best antidote to that is being well-prepared. I think companies are moving more
towards constructive, cooperative interview formats, and away from the classic Google
brain-teaser kind of questions, which helps with this, but you can still expect to be
challenged during these interviews.</p>
<p>Remember to be kind to yourself. You’ll probably fail many times before you
succeed. That’s fine, and is what happens to almost everyone. Just keep practicing, and
keep learning from your mistakes. Good luck!</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote_py3" role="doc-endnote">
<p>You should be using Python 3 at this point, but also be familiar with the
differences between 2 and 3, and be able to write code in Python 2 if need be. <a href="#fnref:fnote_py3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_parquet" role="doc-endnote">
<p>For “big data” stored in the cloud, an efficient format called Parquet
is the standard. In my experience, however, it’s uncommon to work with parquet files
directly in Pandas; you often read them into a distributed framework like Spark and work
with them in that context. <a href="#fnref:fnote_parquet" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_emacs" role="doc-endnote">
<p>The correct answer is, of course, emacs. <a href="#fnref:fnote_emacs" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_pyplot" role="doc-endnote">
<p><code class="language-plaintext highlighter-rouge">pyplot</code> is an API within matplotlib that was designed to
mimic the MATLAB plotting API. It is generally what I use; I begin most of my matplotlib
work with <code class="language-plaintext highlighter-rouge">from matplotlib import pyplot as plt</code>. I only rarely need to <code class="language-plaintext highlighter-rouge">import
matplotlib</code> directly, and that’s generally for configuration work. <a href="#fnref:fnote_pyplot" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_dsa2" role="doc-endnote">
<p>I read the book when preparing for a software engineer interview at
Google, so I picked up a lot more than was necessary for a data science interview. I
still find the material helpful, however, and it’s nice to be able to demonstrate
that you have gone above and beyond in a realm that data scientists sometimes
neglect (efficient software design). <a href="#fnref:fnote_dsa2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_dsa" role="doc-endnote">
<p>It goes well beyond what you’ll need for a data science interview,
however - it gets into tree structures, graphs (and graph traversal algorithms), and
other more advanced topics. I’d recommend focusing on complexity analysis, arrays,
and hashmaps as the most important data structures that a data scientist will use
day-to-day. <a href="#fnref:fnote_dsa" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_bigo" role="doc-endnote">
<p>This is only approximately true, or rather it is <em>asymptotically</em>
true; this scaling law holds in the limit as \(n\rightarrow\infty\). <a href="#fnref:fnote_bigo" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_asymptotics" role="doc-endnote">
<p>It’s a bit weird to use <em>both</em> \(n\) and \(k\) in your
complexity - mathematically, what this means is that we consider them separate
variables, and we can take the limit of either one independently from the
other. If, for example, you knew that \(k = n/4\), so you always wanted the top
quarter of the list, then this would be \(\mathcal{O}(n^2)\), since \(n/4 =
\mathcal{O}(n)\). <a href="#fnref:fnote_asymptotics" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_linked" role="doc-endnote">
<p>I’m glossing over some details here - the numbers I quote above are for
a fixed-size array. So, if you build up an array by adding elements at the end, it
may seem like you get to just do a bunch of \(\mathcal{O}(1)\) <code class="language-plaintext highlighter-rouge">.append</code>s, but in
reality, you have to occasionally resize the array to make more space; a resizing
append costs \(\mathcal{O}(n)\), though with geometric resizing the <em>amortized</em>
cost per append stays \(\mathcal{O}(1)\). If you want a list-like
type where inserting elements is easy (\(\mathcal{O}(1)\)) but accessing elements is
difficult (\(\mathcal{O}(n)\)), then you want a <em>linked list</em>. Linked lists aren’t
as important for data scientists to use, so I won’t get into them much here. <a href="#fnref:fnote_linked" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_array_hashmap" role="doc-endnote">
<p>You might wonder why we would ever use an array over a hashmap
if hashmaps are strictly superior with respect to their complexity. It’s a good
question. The answer is that arrays take up less space (they don’t have to store the
keys, only the values) and they are much easier to work with in code (they look
cleaner, and are more intuitive for unordered data). Furthermore, if you had a
hashmap that linked integers <code class="language-plaintext highlighter-rouge">0</code> through <code class="language-plaintext highlighter-rouge">10</code> to strings, and you wanted to insert
a new element at key <code class="language-plaintext highlighter-rouge">5</code>, then you’d have to go through what is currently at keys
<code class="language-plaintext highlighter-rouge">5</code> through <code class="language-plaintext highlighter-rouge">10</code>, and increment their keys by one, so you would end up back at an
inefficient insertion algorithm like you have with arrays. <a href="#fnref:fnote_array_hashmap" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_sort" role="doc-endnote">
<p>This is true <em>on average</em>; see the section below for a discussion of
average vs. worst-case complexity. <a href="#fnref:fnote_sort" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_nosql" role="doc-endnote">
<p>Non-relational (“NoSQL”) database formats, like HBase, basically
function like giant hashmaps; they have a single “key”, and then the “value” can
contain arbitrary data - you don’t have to have certain columns in there. The
advantage of this is flexibility, but the disadvantage is that sorting and filtering
are slower because the database doesn’t have a pre-defined schema. <a href="#fnref:fnote_nosql" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_ansi" role="doc-endnote">
<p>Technically, SQL is an ANSI Standard that many different dialects
implement - so, to call yourself a SQL dialect, you must have features defined by
this standard, like the <code class="language-plaintext highlighter-rouge">SELECT</code>, <code class="language-plaintext highlighter-rouge">FROM</code>, and <code class="language-plaintext highlighter-rouge">WHERE</code> clauses shown above. <a href="#fnref:fnote_ansi" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="ds-interview-study-guide-part-i-statistics">DS Interview Study Guide Part I: Statistics (2019-08-24)</h1>
<p>As I have gone through a couple of rounds of interviews for data scientist
positions, I’ve been compiling notes on what I consider to be the essential
areas of knowledge. I want to make these notes available to the general public;
although there are many blog posts out there that are supposed to help one
prepare for data science interviews, I haven’t found any of them to be very
high-quality.</p>
<p>From my perspective, there are four key subject areas that a data scientist
should feel comfortable with when going into an interview:</p>
<ol>
<li>Statistics (including experimental design)</li>
<li>Machine Learning</li>
<li>Software Engineering (including SQL)</li>
<li>“Soft” Questions</li>
</ol>
<p>I’m going to go through each of these individually. This first post will focus
on statistics. We will go over a number of topics in statistics in no particular
order. Note that <strong>this post will not teach you statistics; it will remind you
of what you should already know.</strong></p>
<p>If you’re utterly unfamiliar with the concepts I’m mentioning, I’d recommend <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/index.htm">this
excellent MIT course on probability & statistics</a> as a good starting point. When I
began interviewing, I had never taken a statistics class before; I worked through the
notes, homeworks, and exams for this course, and at the end had a solid foundation to
learn the specific things that you need to know for these interviews. In my studying, I
also frequently use <a href="https://stats.stackexchange.com">cross-validated</a>, a website for asking and answering questions
about statistics. It’s good for in-depth discussions of subtle issues in
statistics. Finally, <a href="https://www.goodreads.com/book/show/619590.Bayesian_Data_Analysis">Gelman’s book</a> is the classic in Bayesian inference. If you
have recommendations for good books that cover frequentist statistics in a clear manner,
I’d love to hear them.</p>
<p>These are the notes that I put together in my studying, and I’m sure that there is
plenty of room for additions and corrections. I hope to improve this guide over time;
please let me know in the comments if there’s something you think should be added,
removed, or changed!</p>
<h1 id="the-central-limit-theorem">The Central Limit Theorem</h1>
<p>The Central Limit Theorem is a fundamental tool in statistical analysis. It states
(roughly) that when you add up a bunch of independent and identically distributed random
variables (with finite variance), then their suitably normalized sum will converge to a
Gaussian distribution.<sup id="fnref:fnote1" role="doc-noteref"><a href="#fn:fnote1" class="footnote" rel="footnote">1</a></sup></p>
<p>How is this idea useful to a data scientist? Well, one place where we see a sum of
random variables is in a <em>sample mean</em>. One consequence of the central limit theorem is
that the sample mean of a variable with mean \(\mu\) and variance \(\sigma^2\) will
be approximately normally distributed, with mean \(\mu\) and variance \(\sigma^2/n\),
where \(n\) is the number of samples.</p>
<p>I’d like to point out that this is pretty surprising. The distribution of the sum of two
random variables is not, in general, trivial to calculate. So it’s kind of awesome that,
if we’re adding up a large enough number of (independent and identically distributed)
random variables, then we <em>do</em>, in fact, have a very easy expression for the
(approximate) distribution of the sum. Even better, we don’t need to know much of
anything about the distribution we’re sampling from, besides its mean and
variance - its other moments, or general shape, don’t matter for the CLT.</p>
<p>As we will see below, the simplification that the CLT introduces is the basis of one of
the fundamental hypothesis tests that data scientists perform: testing equality of
sample means. For now, let’s work through an example of the theorem itself.</p>
<h2 id="an-example">An Example</h2>
<p>Suppose that we are sampling a Bernoulli random variable. This is a 0/1 random
variable that is 1 with probability \(p\) and 0 with probability \(1-p\). If we
get the sequence of ten draws \([0,1,1,0,0,0,1,0,1,0]\), then our sample mean is</p>
\[\hat \mu = \frac{1}{10}\sum_{i=1}^{10} x_i = 0.4\]
<p>Of course, this sample mean is itself a random variable - when we report it, we
would like to report an estimate on its variance as well. The central limit
theorem tells us that this will, as \(n\) increases, converge to a Gaussian
distribution. Since the mean of the Bernoulli random variable is \(p\) and its
variance is \(p(1-p)\), we know that the distribution of the sample mean will
converge to a Gaussian with mean \(p\) and variance \(p(1-p)/n\). So we could
say that our estimate of the parameter \(p\) is 0.4 \(\pm\) 0.155. Of course,
we’re playing a bit loose here, since we’re using the estimate \(\hat p\) from
the data, as we don’t actually know the <em>true</em> parameter \(p\).</p>
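<p>As a quick sanity check, here is that computation in code (a minimal sketch using only numpy):</p>

```python
import numpy as np

draws = np.array([0, 1, 1, 0, 0, 0, 1, 0, 1, 0])  # the ten draws above
n = len(draws)
p_hat = draws.mean()                       # sample mean, our estimate of p
se = np.sqrt(p_hat * (1 - p_hat) / n)      # CLT-based std. dev. of the sample mean
print(f"{p_hat:.3f} +/- {se:.3f}")         # prints 0.400 +/- 0.155
```

<p>This is exactly the \(0.4 \pm 0.155\) reported above, with \(\hat p\) substituted for the unknown true \(p\) inside the variance.</p>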
<p>Now, a sample size of \(n=10\) is a bit small to be relying on a “large-\(n\)”
result like the CLT. Actually, in this case, we know the exact distribution of
the sample mean, since \(\sum_i x_i\) is binomially distributed with parameters
\(p\) and \(n\).</p>
<h2 id="other-questions-on-the-clt">Other Questions on the CLT</h2>
<p>I find that the CLT comes up more as a piece of context in other questions
than as something that gets asked about directly, but you should be
prepared to answer the following questions.</p>
<ul>
<li>
<p><strong>What is the central limit theorem?</strong> We’ve addressed this above - I doubt
they’ll be expecting a mathematically-correct statement of the theorem, but
you should know the gist of it, along with significant limitations (finite
variance being the major one).</p>
</li>
<li>
<p><strong>When can you <em>not</em> use the CLT?</strong> I think the key thing here is that you
have to be normalizing the data in an appropriate way (dividing by the sample
size), and that the underlying variance must be finite. The answer here can
get very subtle and mathematical, involving modes of convergence for random
variables and all that, but I doubt they will push you to go there, unless
you’re applying for a job specifically as a statistician.</p>
</li>
<li>
<p><strong>Give me an example of the CLT in use.</strong> The classic example here is the
distribution of the sample mean converging to a normal distribution as the
number of samples grows large.</p>
</li>
</ul>
<h1 id="hypothesis-testing">Hypothesis Testing</h1>
<p>Hypothesis testing (also known by the more verbose “null hypothesis significance
testing”) is a huge subject, both in scope and importance. We use statistics to
quantitatively answer questions based on data, and (for better or for worse) null
hypothesis significance testing is one of the primary methods by which we construct
these answers.</p>
<p>I won’t cover the background of NHST here. It’s well-covered in the MIT course; look at
<a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/">the readings</a> to find the relevant sections. Instead of covering the background,
we’ll work through one example of a hypothesis test. It’s simple, but it comes up all the
time in practice, so it’s essential to know. I might go so far as to say that this is
the fundamental example of hypothesis testing in data science.</p>
<h2 id="an-example-1">An Example</h2>
<p>Suppose we have two buttons, one green and one blue. We put them in front of
two different samples of users. For simplicity, let’s say that each sample has
size \(n=100\). We observe that \(k_\text{green} = 57\) users click the green
button, and only \(k_\text{blue} = 48\) click the blue button.</p>
<p>Seems like the green button is better, right? Well, we want to be able to say
how <em>confident</em> we are of this fact. We’ll do this in the language of null
hypothesis significance testing. As you should (hopefully) know, in order to do NHST, we
need a null hypothesis and a test statistic; we need to know the test statistic’s
distribution (under the null hypothesis); and we need to know the probability of
observing a value “at least as extreme” as the observed value according to this
distribution.</p>
<p>I’m going to lay out a table of all the important factors here, and then discuss how we
use them to arrive at our \(p\)-value.</p>
<table>
<thead>
<tr>
<th>Description</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Null Hypothesis</td>
<td>\(p_{blue} - p_{green} = 0\) (alternative: \(p_{blue} - p_{green} < 0\))</td>
</tr>
<tr>
<td>Test Statistic</td>
<td>\(\frac{k_\text{blue}}{n} - \frac{k_\text{green}}{n}\)</td>
</tr>
<tr>
<td>Test Statistic’s Distribution</td>
<td>\(N(0, (p_b(1-p_b) + p_g(1-p_g)) / n)\)</td>
</tr>
<tr>
<td>Test Statistic’s Observed Value</td>
<td>-0.09</td>
</tr>
<tr>
<td>\(p\)-value</td>
<td>0.1003</td>
</tr>
</tbody>
</table>
<p>There are a few noteworthy things here. First, we really want to know whether
\(p_g > p_b\), but that’s equivalent to \(p_b-p_g < 0\). Second, we assume that
\(n\) is large enough so that \(k/n\) is approximately normally distributed,
with mean \(\mu = p\) and variance \(\sigma^2 = p(1-p)/n\). Third, since the
difference of two normals is itself a normal, the test statistic’s distribution
is (under the null hypothesis) a normal with mean zero and the variance given
(which is the sum of the two variances of \(k_b/n\) and \(k_g/n\)).</p>
<p>Finally, we don’t actually know \(p_b\) or \(p_g\), so we can’t really compute
the \(p\)-value; what we do is we say that \(k_b/n\) is “close enough” to
\(p_b\) and use it as an approximation. That gives us our final \(p\)-value.</p>
<p>The \(p\)-value was calculated in Python, as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">norm</span>
<span class="n">pb</span> <span class="o">=</span> <span class="mf">0.48</span>
<span class="n">pg</span> <span class="o">=</span> <span class="mf">0.57</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">((</span><span class="n">pb</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">pb</span><span class="p">)</span> <span class="o">+</span> <span class="n">pg</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">pg</span><span class="p">))</span><span class="o">/</span><span class="n">n</span><span class="p">)</span>
<span class="n">norm</span><span class="p">.</span><span class="n">cdf</span><span class="p">(</span><span class="o">-</span><span class="mf">0.09</span><span class="p">,</span> <span class="n">loc</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">sigma</span><span class="p">)</span> <span class="c1"># 0.10034431272089045</span></code></pre></figure>
<p>Calculating the CDF of a normal at \(x=-0.09\) tells us the probability that the test
statistic is less than or equal to \(-0.09\), which is to say the probability that our
test statistic is at least as extreme as the observed value. This probability is
precisely our \(p\)-value.</p>
<p>So what’s the conclusion? Well, often a significance level is set before the test
is performed; if the \(p\)-value is not below this threshold, then the null hypothesis
is not rejected. Suppose we had set a significance level of 0.05 before the test began -
then, with this data, we would not be able to reject the null hypothesis, which is that
the buttons are equally appealing to users.</p>
<p>Phew! I went through that pretty quickly, but if you can’t follow the gist of what
I was doing there, I’d recommend you think through it until it is clear to
you. You will be faced with more complicated situations in practice; it’s
important that you begin by understanding the most simple situation inside out.</p>
<h2 id="other-topics-in-hypothesis-testing">Other Topics in Hypothesis Testing</h2>
<p>Some important follow-up questions you should be able to answer:</p>
<ul>
<li>
<p><strong>What are Type I & II error? What is a situation where you would be more concerned
with Type I error? Vice versa?</strong> These are discussed <a href="https://en.wikipedia.org/wiki/Type_I_and_type_II_errors#Type_I_error">on Wikipedia</a>. Type I error
is false-positive error. You might be very concerned with Type I error if you are
interviewing job candidates; it is very costly to hire the wrong person for the job,
so you really want to avoid false positives. Type II error is false-negative error. If
you are testing for a disease that is deadly but has a simple cure, then you would
certainly NOT want to have a false negative result of the test, since that would
result in an easily-avoidable negative outcome.</p>
</li>
<li>
<p><strong>What is the <em>power</em> of a test? How do you calculate it?</strong> The power of a test is the
probability that you will reject the null hypothesis, given an alternative
hypothesis. Therefore, to calculate the power, you need an alternative hypothesis; in
the example above, this would look like \(p_b-p_g = -0.1\). Although these alternative
hypotheses are often somewhat ad-hoc, the power analysis depends critically upon
them. Google will turn up plenty of videos and tutorials on calculating the power of a
test.</p>
</li>
<li>
<p><strong>What is the significance of a test?</strong> This is the same as the
\(p\)-value threshold below which we reject the null
hypothesis. (In)famously, 0.05 has become the de-facto standard throughout
many sciences for significance levels worthy of publication.</p>
</li>
<li>
<p><strong>How would you explain a p-value to a lay person</strong>? Of course, you should
have a solid understanding of the statistical definition of the
\(p\)-value. A generally accepted answer is “a \(p\)-value quantifies the
evidence for a hypothesis - closer to zero means more evidence.” Of course,
this is wrong on a lot of levels - it’s actually quantifying evidence
<em>against</em> the null hypothesis, not <em>for</em> the alternative. For what it’s
worth, I’m not convinced there’s a great answer to that one; it’s an
inherently technical quantity that is frequently misrepresented and abused by
people trying to (falsely) simplify its meaning.</p>
</li>
<li>
<p><strong>If you measure many different test statistics, and get a \(p\)-value for each (all
based on the same null hypothesis), how do you combine them to get an aggregate
\(p\)-value?</strong> This one is more of a bonus question, but it’s worth knowing. It’s
actually not obvious how do to this, and the true \(p\)-value depends on how the tests
depend on each other. However, you can get an upper-bound (worst-case estimate) on the
aggregate \(p\)-value by adding together the different \(p\)-values. The validity of
this bound follows from the union bound (Boole’s inequality).</p>
</li>
</ul>
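<p>To make the power question concrete, here is a sketch of a power calculation for the button example above, under the (ad-hoc) alternative \(p_b - p_g = -0.1\) and a one-sided significance level of 0.05; the numbers are the ones from the earlier example, and the normal approximation is assumed throughout:</p>

```python
import numpy as np
from scipy.stats import norm

n = 100
pb, pg = 0.48, 0.57
sigma = np.sqrt((pb * (1 - pb) + pg * (1 - pg)) / n)

# Reject the null when the test statistic falls below this cutoff
# (the 0.05 quantile of the null distribution).
critical = norm.ppf(0.05) * sigma

# Power: probability of rejecting, if the statistic is actually
# centered at the alternative value -0.1.
power = norm.cdf(critical, loc=-0.1, scale=sigma)
```

<p>With these numbers the power comes out to roughly 0.4 - a reminder that \(n=100\) per arm is a fairly small experiment for detecting effects of this size.</p>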
<h1 id="confidence-intervals">Confidence Intervals</h1>
<p>Confidence intervals allow us to state a statistical result as a range, rather than a
single value. If we count that 150 out of 400 people sampled randomly from a city
identify themselves as male, then our best estimate of the fraction of women in the city
is 250/400, or 5/8. But we only looked at 400 people, so it’s reasonable to expect that
the true value might be a bit more or less than 5/8. Confidence intervals allow us to
quantify this width in a statistically rigorous way.</p>
<p>As per usual, we won’t actually introduce the concepts here - I’ll refer you to the
<a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/">readings from the MIT course</a> for an introduction. We’ll focus on working through
an example, and looking at some different approaches.</p>
<h2 id="the-exact-method">The Exact Method</h2>
<p>Suppose that we want to find a 95% confidence interval on the female fraction in the
city discussed above. This corresponds to a significance level of \(\alpha = 0.05\), with
\(\alpha/2\) in each tail. One way
to get the <strong>exact confidence interval</strong> is to use the CDF of our test statistic, but
substitute in the observed parameter for the true parameter, and then invert it to find
where it hits \(\alpha/2\) and \(1-\alpha/2\). That is, we need to find the value
\(p_l\) that solves the equation</p>
\[CDF\left(n, p_l\right) = \alpha/2\]
<p>and the value \(p_u\) that solves the equation</p>
\[CDF\left(n, p_u\right) = 1 - \alpha/2.\]
<p>In these, \(CDF(n,p)\) is the cumulative distribution function of our test statistic,
assuming that the true value of \(p\) is in fact the observed value \(\hat p\). This is
a bit confusing, so it’s worth clarifying. In our case, the sample statistic is the
sample mean of \(n\) Bernoulli random variables, so this CDF is the CDF of the sample
mean of \(n\) Bernoulli random variables with parameter \(5/8\). Solving the two
equations above would give us our confidence interval \([p_l, p_u]\).</p>
<p>It took me a bit of work to see that solving the above two equations would in fact give
us bounds that satisfy the definition of a \(1-\alpha\) confidence interval, which says
that, were we to run many experiments, we would find that the true value of \(p\) would
fall between \(p_l\) and \(p_u\) with the probability</p>
\[P\left(p_l\leq p \leq p_u\right) = 1-\alpha.\]
<p>If you’re into this sort of thing, I’d suggest you take some time thinking through why
inverting the CDF as above guarantees bounds \([p_l, p_u]\) that solve the above
equation.</p>
<p>Although it is useful for theoretical analysis, I rarely use this method in
practice, because I often do not actually know the true CDF of the statistic
I am measuring. Sometimes I do know the true CDF, but even in such cases, the
next (approximate) method is generally sufficient.</p>
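<p>One standard realization of this CDF-inversion idea is the Clopper–Pearson interval, which can be written with scipy’s beta quantiles; this is a sketch for the 250-of-400 example above (note it inverts the binomial test over \(p\), a slight refinement of plugging in \(\hat p\)):</p>

```python
from scipy.stats import beta

def exact_interval(k, n, alpha=0.05):
    """Clopper-Pearson interval for a binomial proportion, via CDF inversion."""
    # The edge cases k = 0 and k = n are handled separately, since the
    # corresponding beta quantiles are degenerate there.
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

lo, hi = exact_interval(250, 400)  # 250 of 400 respondents
```

<p>For this data the interval sits around \(0.625\), a bit wider than the normal approximation gives, which is typical of the exact construction.</p>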
<h2 id="the-approximate-method">The Approximate Method</h2>
<p>If your statistic can be phrased as a sum, then its distribution approaches a normal
distribution.<sup id="fnref:fnote2" role="doc-noteref"><a href="#fn:fnote2" class="footnote" rel="footnote">2</a></sup> This means that you can solve the above equations for a normal
CDF rather than the true CDF of the sum (in the case above, a binomial CDF).</p>
<p>How does this help? For a normal distribution, the solutions for the above equations to
find lower and upper bounds are well known. In particular, the interval
\([\mu-\sigma,\mu+\sigma]\), also called a \(1\sigma\)-interval, covers about 68% of the
mass (probability) of the normal PDF, so if we wanted to find a confidence interval of
level \(0.68\), then we know to use the bounds \((\overline x-\sigma, \overline
x+\sigma)\), where \(\overline x\) is our estimate of the true mean \(\mu\).</p>
<p>This sort of result is very powerful, because it saves us from having to do any
inversion by hand. A table below indicates the probability mass contained in various
symmetric intervals on a normal distribution:</p>
<table>
<thead>
<tr>
<th>Interval</th>
<th>Width<sup id="fnref:fnote3" role="doc-noteref"><a href="#fn:fnote3" class="footnote" rel="footnote">3</a></sup></th>
<th>Coverage</th>
</tr>
</thead>
<tbody>
<tr>
<td>\([\mu-\sigma,\mu+\sigma]\)</td>
<td>\(1\sigma\)</td>
<td>0.683</td>
</tr>
<tr>
<td>\([\mu-2\sigma,\mu+2\sigma]\)</td>
<td>\(2\sigma\)</td>
<td>0.954</td>
</tr>
<tr>
<td>\([\mu-3\sigma,\mu+3\sigma]\)</td>
<td>\(3\sigma\)</td>
<td>0.997</td>
</tr>
</tbody>
</table>
<p>Let’s think through how we would use this in the above example, where we give a
confidence interval on our estimate of the binomial parameter \(p\).</p>
<p>A binomial distribution has mean \(\mu=np\) and variance \(\sigma^2=np(1-p)\). Since
the sample statistic \(\hat p\) is just the binomial divided by \(n\), it has mean
\(\mu=p\) and variance \(\sigma^2 = p(1-p)/n\). The central limit theorem tells us that
the distribution of \(\hat p\) will converge to a normal with just these parameters.</p>
<p>Suppose we want an (approximate) 95% confidence interval on the percentage of women in
the population of our city; the table above tells us we can just do a two-sigma
interval. (This is not <em>exactly</em> a 95% confidence interval; it’s a bit over, as we see
in the table above). The parameter \(\hat p\) has mean \(\mu= p\) and variance
\(\sigma^2 = p(1-p)/n\).<sup id="fnref:fnote4" role="doc-noteref"><a href="#fn:fnote4" class="footnote" rel="footnote">4</a></sup> In our case, \(\hat p=5/8\), so \(\sigma = \sqrt{(5/8)(3/8)/400} \approx 0.0242\) and our confidence
interval is \(5/8 \pm 2\sigma \approx 0.625 \pm 0.048\). Note that we approximated
\(p\) with our experimental value \(\hat p\); the theoretical framework that allows us
to do this substitution is beyond the scope of this article, but is nicely covered in
the MIT readings (Reading 22, in particular).</p>
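<p>As a quick numeric check, here is a minimal sketch of the two-sigma interval for the running example (400 people sampled, 250 identifying as female); everything here is standard library:</p>

```python
import math

n, k = 400, 250          # sample size and number identifying as female
p_hat = k / n            # sample estimate of p (5/8)

# Approximate the standard deviation of p_hat, substituting p_hat for p
sigma = math.sqrt(p_hat * (1 - p_hat) / n)

# Two-sigma interval: covers about 95.4% of a normal distribution
lower, upper = p_hat - 2 * sigma, p_hat + 2 * sigma
print(f"p_hat = {p_hat:.4f}, interval = ({lower:.4f}, {upper:.4f})")
```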
<h2 id="the-bootstrap-method">The Bootstrap Method</h2>
<p>The previous approach relies on the accuracy of approximating our statistic’s
distribution by a normal distribution. Bootstrapping is a pragmatic, flexible
approach to calculating confidence intervals, which makes no assumptions on the
underlying statistics we are calculating. We’ll go into more detail on
bootstrapping in general below, so we’ll be pretty brief here.</p>
<p>The basic idea is to repeatedly draw samples of size 400 <em>with replacement</em> from the sampled
data. For each set of 400 samples, we get an estimate \(\hat p\), and thus can build an
empirical distribution on \(\hat p\). Of course, the CLT indicates that this empirical
distribution should look a lot like a Gaussian distribution with mean \(\mu= p\) and variance
\(\sigma^2 = p(1-p)/n\).</p>
<p>Once you have bootstrapped an empirical distribution for your statistic of interest (in
the example above, this is the percentage of the population that is women), then you can
simply find the \(\alpha/2\) and \(1-\alpha/2\) percentiles, which then become your
confidence interval. Although in this case our empirical distribution is (approximately)
normal, it’s worth realizing that we can reasonably calculate percentiles <em>regardless</em>
of what the empirical distribution is; this is why bootstrapped confidence intervals
are so flexible.</p>
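<p>Here is a sketch of the percentile method in NumPy for our running example, rebuilding the sample as 250 ones and 150 zeros; the resample count of 10,000 is an arbitrary choice:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1] * 250 + [0] * 150)   # 250 women out of 400, as 1s and 0s

# Resample with replacement many times, recomputing p_hat each time
boot = np.array([rng.choice(data, size=len(data), replace=True).mean()
                 for _ in range(10_000)])

# The percentile method: cut off alpha/2 of the mass from each tail
alpha = 0.05
lower, upper = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(f"95% bootstrap CI: ({lower:.3f}, {upper:.3f})")
```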
<p>As you’ll see below, the downside of bootstrapping confidence intervals is that
it requires some computation. The amount of computation required can be
anywhere from trivial to daunting, depending on how many samples you want in
your empirical distribution. Another downside is that their statistical interpretation
is not exactly in alignment with the definition of a confidence interval, but I’ll leave
the consideration of that as an exercise for the reader.<sup id="fnref:fnotez" role="doc-noteref"><a href="#fn:fnotez" class="footnote" rel="footnote">5</a></sup> <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf">One of the MIT
readings</a> has an in-depth discussion of confidence intervals generated via the
bootstrap method.</p>
<p><strong>Overall, I would recommend using the approximate method when you have good reason to
believe your sample statistic is approximately normal, or bootstrapping otherwise.</strong> Of
course, the central limit theorem can provide some guarantees about the asymptotic
distribution of certain statistics, so it’s worth thinking through whether that applies
to your situation.</p>
<h2 id="other-topics-in-confidence-intervals">Other Topics in Confidence Intervals</h2>
<ul>
<li>
<p><strong>What is the definition of a confidence interval?</strong> This is a bit more technical, but
it’s essential to know that it is <strong>not</strong> “there is a 95% probability that the true
parameter is in this range.” What it actually means is: “if you recomputed this
interval across many reruns of the experiment, then 95% of those intervals would
contain the true value of the parameter.” It’s worth noting that the <em>range</em> is the random
variable here - the parameter itself (the true percentage of the population that
identifies as female, in our example) is fixed.</p>
</li>
<li>
<p><strong>How would this change if you wanted a <em>one-sided</em> confidence interval?</strong>
This one isn’t too bad - you just solve either \(CDF(n,p_l) = \alpha\) or
\(CDF(n,p_u) = 1-\alpha\) for a lower- or upper-bounded interval,
respectively.</p>
</li>
<li>
<p><strong>What is the relationship between confidence intervals and hypothesis testing?</strong>
There are many ways to answer this question; it’s a good one to ponder in order to get
a deeper understanding of the two topics. One connection is the relationship between
confidence intervals and rejection regions in NHST - Reading 22 in the MIT course
addresses this one nicely.</p>
</li>
</ul>
<h1 id="bootstrapping">Bootstrapping</h1>
<p>Bootstrapping is a technique that allows you to get insight into the quality of your
estimates, based only on the data you have. It’s a key tool in a data scientist’s
toolbag, because we frequently don’t have a clear theoretical understanding of our
statistics, and yet we want to provide uncertainty estimates. To understand how it
works, let’s look through an example.</p>
<p>In the last section, we sampled 400 people in an effort to understand what percentage of
a city’s population identified as female. Since 250 of them identified themselves as
female, our estimate of the ratio for the total population is \(5/8\). This estimate is
itself a random variable; if we had sampled different people, we might have ended up
with a different number. What if we want to know the distribution of this estimate? How
would we go about getting that?</p>
<p>Well, the obvious way is to go out and sample 400 more people, and repeat this over and
over again, until we have many such fractional estimates. But what if we don’t have
access to sampling more people? The natural thing is to think that we’re out of luck -
without the ability to sample further, we can’t actually understand more about the
distribution of our parameter (ignoring, for the moment, that we have lots of
theoretical knowledge about it via the CLT).</p>
<p>The idea behind bootstrapping is simple. Sample from the data you already have, with
replacement, a new sample of 400 people. This will give you an estimate of the female
fraction that is distinct from your original estimate, due to the replacement in your
sampling. You can repeat this process as many times as you like; you will then get an
empirical distribution which approaches the true distribution of the statistic.<sup id="fnref:fnote4:1" role="doc-noteref"><a href="#fn:fnote4" class="footnote" rel="footnote">4</a></sup></p>
<p>Bootstrapping has the advantage of being flexible, although it does have its
limitations. Rather than get too far into the weeds, I’ll just point you to the
<a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">Wikipedia article on bootstrapping</a>. There are also tons of resources about this
subject online. Try coding it up for yourself! By the time you’re interviewing, you
should be able to write a bootstrapping algorithm quite easily.</p>
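<p>The core loop is only a few lines. Here’s one possible generic version in NumPy; the function name and the median example are my own choices, not from any particular library:</p>

```python
import numpy as np

def bootstrap(data, statistic, n_resamples=1000, seed=None):
    """Return the empirical bootstrap distribution of `statistic`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    return np.array([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_resamples)
    ])

# Example: bootstrap distribution of the median of some skewed data
samples = np.random.default_rng(1).exponential(scale=2.0, size=500)
dist = bootstrap(samples, np.median, n_resamples=2000, seed=2)
print(f"median estimate: {np.median(samples):.3f} +/- {dist.std():.3f}")
```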
<p><a href="https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/">Machine Learning Mastery</a> has a good introduction to bootstrapping that uses the
scikit-learn API. <a href="https://towardsdatascience.com/an-introduction-to-the-bootstrap-method-58bcb51b4d60">Towards Data Science</a> codes it up directly in NumPy, which is a
useful thing to know how to be able to do. Asking someone to code up a bootstrapping
function would be an entirely reasonable interview question, so it’s something you
should be comfortable doing.</p>
<h2 id="other-topics-in-bootstrapping">Other Topics in Bootstrapping</h2>
<ul>
<li><strong>When would you <em>not</em> want to use bootstrapping?</strong> It might not be feasible when it
is very costly to calculate your sample statistic. To get accurate estimates you’ll
need to calculate your statistic thousands of times, which is impractical if a single
evaluation takes minutes or hours. Also, it is often difficult to
get strong theoretical guarantees about probabilities based on bootstrapping, so if
you need a highly statistically rigorous approach, you might be better served with
something more analytical. Finally, if you know the distribution of your statistic
already (for example, you know from the CLT that it is normally distributed) then you
can get better (more accurate) uncertainty estimates from an analytical approach.</li>
</ul>
<h1 id="linear-regression">Linear Regression</h1>
<p>Regression is the study of the relationship between variables; for example, we
might wish to know how the weight of a person relates to their height. <em>Linear</em>
regression assumes that your input (height, or \(h\)) and output (weight, or
\(w\)) variables are <em>linearly related</em>, with slope \(\beta_1\), intercept
\(\beta_0\), and noise \(\epsilon\).</p>
\[w = \beta_1\cdot h + \beta_0 + \epsilon.\]
<p>A linear regression analysis helps the user discover the \(\beta\)s in the
above equation. This is just the simplest application of LR; in reality, it is
quite flexible and can be used in a number of scenarios.</p>
<p>Linear regression is another large topic that I can’t really do justice to in this
article. Instead, I’ll just go through some of the common topics, and introduce the
questions you should be able to address. As is the case with most of these topics, you
can look at the <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/index.htm">MIT Statistics & Probability course</a> for a solid academic
introduction to the subject. You can also dig through <a href="https://en.wikipedia.org/wiki/Linear_regression">the Wikipedia article</a> to get
a more in-depth picture. The subject is so huge, and there’s so much to learn about it,
that you really can spend as much time as you want digging into it - I’m just going to
gesture at some of the simpler aspects of it.</p>
<h2 id="calculating-a-linear-regression">Calculating a Linear Regression</h2>
<p>Rather than go through an example here, I’ll just refer you to the many available guides
that show you how to do this in code. Of course, you could do it in raw NumPy, solving
the normal equations explicitly, but I’d recommend using scikit-learn or statsmodels, as
they have much nicer interfaces, and give you all sorts of additional information about
your model (\(r^2\), \(p\)-value, etc.)</p>
<p><a href="https://realpython.com/linear-regression-in-python/">Real Python</a> has a good guide to coding this up - see the section “Simple Linear
Regression with scikit-learn.” <a href="https://www.geeksforgeeks.org/linear-regression-python-implementation/">GeeksForGeeks</a> does the solution in raw NumPy; the
equations won’t be meaningful for you until you read up on the normal equation and how
to analytically solve for the optimal LR coefficients. If you want something similar in
R, or Julia, or MATLAB,<sup id="fnref:fnoted" role="doc-noteref"><a href="#fn:fnoted" class="footnote" rel="footnote">6</a></sup> then I’m sure it’s out there, you’ll just have to go do
some Googling to find it.</p>
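<p>For concreteness, here’s a minimal scikit-learn sketch of the height/weight regression on synthetic data; the slope, intercept, and noise level are made up for the example:</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
heights = rng.uniform(150, 200, size=200)               # cm
weights = 0.9 * heights - 90 + rng.normal(0, 5, 200)    # kg, linear plus noise

X = heights.reshape(-1, 1)        # scikit-learn expects a 2-D feature matrix
model = LinearRegression().fit(X, weights)

print(f"beta_1 = {model.coef_[0]:.3f}, beta_0 = {model.intercept_:.3f}")
print(f"r^2 = {model.score(X, weights):.3f}")
```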
<h2 id="a-statistical-view">A Statistical View</h2>
<p>This subject straddles the boundary between statistics and machine-learning. It has been
quite thoroughly studied from a statistical point of view, and there are some important
results that you should be familiar with when thinking about linear regression from a
statistical frame.<sup id="fnref:fnotec" role="doc-noteref"><a href="#fn:fnotec" class="footnote" rel="footnote">7</a></sup></p>
<p>Let’s look back at our foundational model for linear regression. LR assumes
that your input \(x\) and output \(y\) are related via</p>
\[y_i = \beta_1\cdot x_i + \beta_0 + \epsilon_i,\]
<p>where \(\epsilon_i\) are i.i.d., distributed as \(N(0, \sigma^2)\). Since the
\(\epsilon\) are random variables, the \(\beta_j\) are themselves random
variables. One important question is whether there is, in fact, any
relationship between our variables at all. If there is not, then we should see
\(\beta_1\) close to 0,<sup id="fnref:fnoteb" role="doc-noteref"><a href="#fn:fnoteb" class="footnote" rel="footnote">8</a></sup> but it will essentially never be exactly zero. One important
statistical technique in LR is <strong>doing a hypothesis test against the null
hypothesis that \(\beta_1 = 0\)</strong>. When a package like statsmodels returns a
“\(p\)-value of the regression”, this is the \(p\)-value they are talking
about.</p>
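<p>To make the hypothesis test concrete, here is one way to compute the t-statistic for the slope by hand with NumPy; this is just a sketch on made-up data, and in practice a package like statsmodels reports it for you:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)   # true beta_1 = 2

n = len(x)
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # least-squares slope
beta0 = y.mean() - beta1 * x.mean()

resid = y - (beta0 + beta1 * x)
sigma2_hat = (resid ** 2).sum() / (n - 2)             # unbiased noise variance
se_beta1 = np.sqrt(sigma2_hat / ((x - x.mean()) ** 2).sum())

# Under H0: beta_1 = 0, t follows a t-distribution with n-2 degrees of
# freedom; |t| > ~2 is significant at roughly the 5% level for large n
t = beta1 / se_beta1
print(f"beta_1 = {beta1:.3f}, SE = {se_beta1:.3f}, t = {t:.1f}")
```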
<p>Like I said before, there is a lot more to know about the statistics of linear
regression than just what I’ve said here. You can learn more about the statistics of LR
by looking at the <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading25.pdf">MIT course notes on the subject</a>, or by digging through your
favorite undergraduate statistics book - most of them should have sections covering it.</p>
<h2 id="validating-your-model">Validating Your Model</h2>
<p>Once you’ve calculated your LR, you’d like to validate it. This is very important to
do - if you’re asked to calculate a linear regression in an interview, you should always
go through the process of validating it after you’ve done the calculation.</p>
<p>I’d generally go through the following steps:</p>
<ul>
<li>If it’s just a simple (one independent variable) linear regression, then plot the two
variables. This should give you a good sense of whether it’s a good idea to use linear
regression in the first place. If you have multiple independent variables, you can
make separate plots for each one.</li>
<li>Look at your \(r^2\) value. Is it reasonably large? Remember, closer to 1 is
better. If it’s small, then doing a linear regression hasn’t helped much.</li>
<li>You can look at the \(p\)-value to see whether the slope’s difference from zero is
statistically significant (see the previous section). Also, you can have a very
significant \(p\)-value while still having a low \(r^2\), so be cautious in your
interpretation of this one.</li>
<li>You can also look at the RMSE of your model, but this number is not scaled between 0
and 1, so a “good” RMSE is highly dependent on the units of your dependent variable.</li>
<li>Plot your residuals, for each variable. The residual is just the observed output
minus the value predicted by your model, a.k.a. the error of your model. Plotting
each residual isn’t really feasible if you have hundreds of independent
variables, but it’s a good idea if your data is small enough. You should be
looking for “homoskedasticity” - that the variance of the error is uniform
across the range of the independent variable. If it’s not, then certain
things you’ve calculated (for example, the \(p\)-value of your regression)
are no longer valid. You might also see that your errors have a bias that
changes as the \(x_i\) changes; this means that there’s some more complicated
relationship between \(y\) and \(x_i\) that your regression did not pick up.</li>
</ul>
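<p>One crude numeric companion to the residual plot: compare residual variance across the range of \(x\). Splitting the data in half is just my own shorthand for eyeballing homoskedasticity, not a formal test:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 300))
y = 3 * x + 5 + rng.normal(scale=2.0, size=300)   # homoskedastic by construction

# Fit a simple linear regression and compute residuals
beta1, beta0 = np.polyfit(x, y, deg=1)
resid = y - (beta0 + beta1 * x)

# Compare residual variance on the left vs right half of the x range
left, right = resid[:150], resid[150:]
ratio = left.var() / right.var()
print(f"variance ratio (left/right): {ratio:.2f}")   # near 1 => homoskedastic
```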
<p>Some of the questions below address the assumptions of linear regression; you
should be familiar with them, and know how to test for them either before or
after the regression is performed, so that you can be confident that your model
is valid.</p>
<h2 id="basic-questions-on-lr">Basic Questions on LR</h2>
<p>Hopefully you’ve familiarized yourself with the basic ideas behind linear
regression. Here are some conceptual questions you should be able to answer.</p>
<ul>
<li>
<p><strong>How are the \(\beta\)s calculated?</strong> Practically, you let the library
you’re using take care of this. But behind the scenes, generally it’s solving
the so-called “normal equations”, which give you the optimal (highest
\(r^2\)) parameters possible. You can use gradient descent to approximate
the optimal solution when the design matrix is too large to invert; this is
available via the <code class="language-plaintext highlighter-rouge">SGDRegressor</code> model in scikit-learn.</p>
</li>
<li>
<p><strong>How do you decide if you should use linear regression?</strong> The best case is
when the data is 2- or 3-dimensional; then you can just plot the data and see
if it looks like “linear plus noise”. However, if you have lots of
independent variables, this isn’t really an option. In such a case, you
should perform a linear regression analysis, and then look at the errors
to verify that they look normally distributed and homoskedastic (constant
variance).</p>
</li>
<li>
<p><strong>What does the \(r^2\) value of a regression indicate?</strong> The \(r^2\) value
indicates “how much of the variance of the output data is explained by the
regression.” That is, your output data \(y\) has some (sample) variance, just
on its own. Once you discover the linear relationship and subtract it off,
then the remaining error \(y - \beta_0 - \beta_1x\) still has some variance,
but hopefully it’s lower - \(r^2\) is one minus the ratio of the remaining to
the original variance. When \(r^2=1\), then your line is a perfect fit of
the data, and there is no remaining error. It is often used to explain the
“quality” of your fit, although this can be a bit treacherous - see
<a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">Anscombe’s Quartet</a> for examples of very different situations with the
same \(r^2\) value.</p>
</li>
<li>
<p><strong>What are the assumptions you make when doing a linear regression?</strong> The
Wikipedia article <a href="https://en.wikipedia.org/wiki/Linear_regression#Assumptions">addresses this point</a> quite thoroughly. This is worth
knowing, because you don’t just want to jump in and blindly do LR; you want
to be sure it’s actually a reasonable approach.</p>
</li>
<li>
<p><strong>When is it a bad idea to do LR?</strong> When you do linear regression, you’re assuming a
certain relationship between your variables. Just the parameters and output of your
regression won’t tell you whether the data really are appropriate for a linear
model. <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">Anscombe’s Quartet</a> is a particularly striking example of how the outputs of
a linear regression analysis can look nearly identical across datasets while the quality
of the fit is radically different. Beyond this, it is a bad idea to do LR whenever the
assumptions of LR are violated by the data; see the above bullet for more info there.</p>
</li>
<li>
<p><strong>Can you do linear regression on a nonlinear relationship?</strong> In many cases,
yes. What we need is for the model to be linear in the parameters \(\beta\);
if, for example, you are comparing distance and time for a constantly
accelerating object \(d = 1/2at^2\), and you want to do regression to
discover the acceleration \(a\), then you can just use \(t^2\) as your
independent variable. The model relating \(d\) and \(t^2\) is linear in the
acceleration \(a\), as required.</p>
</li>
<li>
<p><strong>What does the “linear” in linear regression refer to?</strong> This one might seem
trivial, but it’s a bit of a trick question; the relationship \(y =
2\log(x)\) might not appear linear, but in fact it can be obtained via a
linear regression, by using \(\log(x)\) as the input variables, rather than
\(x\). Of course, for this to work, you need to know ahead of time that you
want to compare against \(\log(x)\), but this can be discovered via
trial-and-error, to some extent. So the “linear” <em>does</em>, as you’d expect,
mean that the relationship between independent and dependent variable is
linear, but you can always <em>change</em> either of them and re-calculate your
regression.</p>
</li>
</ul>
<h2 id="handling-overfitting">Handling Overfitting</h2>
<p>Overfitting is very important to understand, and is a fundamental challenge in machine
learning and modeling. I’m not going to go into great detail on it here; more
information will be presented in the machine learning section of the guide. There are
some techniques for handling it that are particular to LR, which is what I’ll talk about
here.</p>
<p><a href="https://realpython.com/linear-regression-in-python/">RealPython</a> has good images showing examples of over-fitting. You can
handle it by building into your model a “penalty” on the \(\beta_i\)s; that is,
tell your model “I want low error, <strong>and</strong> I don’t want large coefficients.”
The balance of these preferences is determined by a parameter, often denoted by
\(\lambda\).</p>
<p>Since you have many \(\beta\)s, in general, you have to combine them in some
fashion. Two such ways to calculate the measure of “overall badness” (which I’ll call
\(OB\)) are</p>
\[OB = \sqrt{ \beta_1^2 + \beta_2^2 + \ldots + \beta_n^2 }\]
<p>or</p>
\[OB = |\beta_1| + |\beta_2| + \ldots + |\beta_n|.\]
<p>The first will tend to emphasize outliers; that is, it is more sensitive to
single large \(\beta\)s. The second considers all the \(\beta\)s more
uniformly. If you use the first, it is called “ridge regression”, and if you
use the second it is called “LASSO regression.”</p>
<p>In mathematics, these are the \(\ell_2\) and \(\ell_1\) norms, respectively, of the vector
of \(\beta\)s; you can in theory use \(\ell_p\) norms for any \(p\), even
\(p=0\) (count the number of non-zero \(\beta\)s to get the overall badness) or
\(p=\infty\) (take the largest \(\beta\) as the overall badness). However, in
practice, LASSO and ridge regression are already implemented in common
packages, so it’s easy to use them right out of the box.</p>
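<p>Both are available out of the box in scikit-learn. Here’s a sketch on synthetic data where only the first two of ten features matter; the alpha values (scikit-learn’s name for the \(\lambda\) above) are arbitrary choices:</p>

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

# alpha plays the role of lambda: larger alpha => stronger penalty
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))
# LASSO tends to zero out irrelevant coefficients entirely;
# ridge shrinks them toward (but not exactly to) zero
```

Note the qualitative difference in the output: LASSO performs a kind of automatic feature selection, which is one reason to prefer it when you suspect many features are irrelevant.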
<p>As usual, there is a LOT to learn about how LASSO and ridge regression change your
output, and what kinds of problems they can address (and/or create). I’d highly
recommend searching around the internet to learn more about them if you aren’t already
confident in your understanding of how they work.</p>
<h2 id="logistic-regression">Logistic Regression</h2>
<p>Logistic regression is a way of modifying linear regression models to get a
classification model. The statistics of logistic regression are, generally speaking, not
as clean as those of linear regression. It will be covered in the machine learning
section, so we won’t discuss it here.</p>
<h1 id="bayesian-inference">Bayesian Inference</h1>
<p>Up until now this guide has primarily focused on frequentist topics in
statistics, such as hypothesis testing and the frequentist approach to
confidence intervals. There is an entire world of Bayesian statistical
inference, which differs significantly from the frequentist approach in both
philosophy and technique. I will only touch on the most basic application of
Bayesian reasoning in this guide.</p>
<p>In this section, I will mostly defer to outside sources, who I think speak more
eloquently on the topic than I can. Some companies (such as Google, or so I’m told) tend
to focus on advanced Bayesian skills in their data science interviews; if you want to
really learn the Bayesian approach, I’d recommend <a href="https://www.goodreads.com/book/show/619590.Bayesian_Data_Analysis">Gelman’s book</a>, which is a
classic in the field.</p>
<h2 id="bayesian-vs-frequentist-statistics">Bayesian vs Frequentist Statistics</h2>
<p>It’s worth being able to clearly discuss the difference in philosophy and approach
between the two schools of statistics. I particularly like the discussion in the MIT
course notes. They state, more or less, that while the Bayesians like to reason from
Bayes theorem</p>
\[P(H|D) = \frac{ P(D|H)P(H)}{P(D)},\]
<p>the frequentist school thinks that “the probability of the hypothesis” is a nonsense
concept - it is not a well-founded probabilistic value, in the sense that there is no
repeatable experiment you can run in which to gather relative frequency counts and
calculate probabilities. Therefore, the frequentists must reason directly from
\(P(D|H)\), the probability of the data given the hypothesis, which is just the
\(p\)-value. The upside of this is that the probabilistic interpretation of \(P(D|H)\)
is clean and unambiguous; the downside is that it is easy to misunderstand, since what
we really think we want is “the probability that the hypothesis is true.”</p>
<p>If you want to know more about this, there are endless discussions of it all over the
internet. Like many such dichotomies (emacs vs. vim, overhand vs underhand toilet paper,
etc.) it is generally overblown - a working statistician should be familiar with, and
comfortable using, both frequentist <em>and</em> Bayesian techniques in their analysis.</p>
<h2 id="basics-of-bayes-theorem">Basics of Bayes Theorem</h2>
<p>Bayes theorem tells us how to update our belief in light of new evidence. You
should be comfortably applying Bayes theorem in order to answer basic
probability questions. The classic example is the “base rate fallacy”:</p>
<p>Consider a routine screening test for a disease. Suppose the frequency of the
disease in the population (base rate) is 0.5%. The test is highly accurate with
a 5% false positive rate and a 10% false negative rate. You take the test and
it comes back positive. What is the probability that you have the disease?</p>
<p>The answer is NOT 0.95, even though the test has a 5% false positive rate. You should be
able to clearly work through this problem, building probability tables and using Bayes
theorem to calculate the final answer. The problem is worked through in the <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading3.pdf">MIT stats
course readings</a> (see Example 10), so I’ll defer to them for the details.</p>
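<p>The arithmetic itself is just a few lines; here’s a check of the numbers given above:</p>

```python
base_rate = 0.005   # P(disease)
fpr = 0.05          # P(positive | no disease)
fnr = 0.10          # P(negative | disease), so sensitivity is 0.90

# Bayes theorem: P(D | +) = P(+ | D) P(D) / P(+)
p_pos = (1 - fnr) * base_rate + fpr * (1 - base_rate)
p_disease_given_pos = (1 - fnr) * base_rate / p_pos

print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # about 0.083
```

Despite the positive result, the probability of disease is only about 8%, because the disease is rare enough that false positives dominate true positives.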
<h2 id="updating-posteriors--conjugate-priors">Updating Posteriors & Conjugate Priors</h2>
<p>The above approach of calculating out all the probabilities by hand works reasonably well
when there are only a few possible outcomes in the probability space, but it doesn’t
scale well to large (discrete) probability spaces, and won’t work at all in continuous
probability spaces. In such situations, you’re still fundamentally relying on Bayes
theorem, but the way it is applied looks quite different - you end up using sums and
integrals to calculate the relevant terms.</p>
<p>Again, I’ll defer to the <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/">MIT stats course readings</a> for the details - readings 12
and 13 are the relevant ones here.</p>
<p>It’s particularly useful to be familiar with the concept of <strong>conjugate
priors</strong>. In general, updating your priors involves computing an integral,
which as anyone who has taken calculus knows can be a pain in the ass. When
sampling from a distribution and estimating the parameters, there are certain
priors for which the updates based on successive samples work out to be very
simple.</p>
<p>For an example of this, suppose you’re flipping a biased coin and trying to
figure out the bias. This is equivalent to sampling a binomial distribution and
trying to estimate the parameter \(p\). If your prior is uniform (flat across
the interval \([0,1]\)), then after \(N\) flips, \(k\) of which come up heads,
your posterior probability density on \(p\) will be</p>
\[f(p) \propto p^{k}(1-p)^{N-k}.\]
<p>This is called a <strong>\(\beta\) distribution</strong>. It is kind of magical that we can
calculate this without having to do any integrals - this is because the
\(\beta\) distribution is “conjugate to” the binomial distribution. It’s
important that we started out with a uniform distribution as our prior - if we
had chosen an arbitrary prior, the algebra might not have worked out as
nicely. In particular, if we start with a non-\(\beta\) prior, then this trick
won’t work, because our prior will not be conjugate to the binomial distribution.</p>
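<p>In code, the conjugate update is pure bookkeeping. Assuming the standard Beta(a, b) parameterization (where a uniform prior is Beta(1, 1)), a sketch:</p>

```python
def update_beta(a, b, heads, tails):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    return a + heads, b + tails

# Uniform prior, then observe 7 heads in 10 flips
a, b = update_beta(1, 1, heads=7, tails=3)
posterior_mean = a / (a + b)       # mean of a Beta(a, b) distribution
print(f"posterior: Beta({a}, {b}), mean = {posterior_mean:.3f}")  # 8/12, about 0.667
```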
<p>The other important conjugate pair to know is that of the Gaussian
distribution; it is, in fact, conjugate to itself, so if you estimate the
parameters of a normal distribution, those estimates are themselves normal, and
updating your belief about the parameters based on new draws from the normal
distribution is as simple as doing some algebra.</p>
<p>There are many good resources available online and in textbooks discussing
conjugate priors; <a href="https://en.wikipedia.org/wiki/Conjugate_prior">Wikipedia</a> is a good place to start.</p>
<h1 id="maximum-likelihood-estimation">Maximum Likelihood Estimation</h1>
<p>We discussed before the case where you have a bunch of survey data, and want to estimate
the proportion of the population that identifies as female. Statistically speaking,
this proportion is a <em>parameter</em> of the probability distribution over gender identity in
that geographical region. We’ve intuitively been saying that if we see 250 out of
400 respond that they are female, then our best estimate of the proportion is 5/8. Let’s
get a little more formal about why exactly this is our best estimate.</p>
<p>First of all, I’m going to consider a simplified world in which there are only two
genders, male and female. I do this to simplify the statistics, not because it is an
accurate model of the world. In this world, if the <em>true</em> fraction of the population
that identifies as female is 0.6, then there is some non-zero probability that you would
draw a sample of 400 people in which 250 identify as female. We call this the
<em>likelihood</em> of the parameter 0.6. In particular, the binomial distribution tells us
that</p>
\[\mathcal{L}(0.6|n_\text{female}=250) = {400 \choose 250} \,0.6^{250}\, (1-0.6)^{400-250}\]
<p>Of course, I could calculate this for any parameter in \([0,1]\); if the parameter were
very far from 5/8, however, then the likelihood would be very small.</p>
<p>Now, a natural question to ask is “which parameter \(p\) would give us the highest
likelihood?” That is, which parameter best fits our data? That is the
<strong>maximum-likelihood estimate</strong> of the parameter \(p\). The actual calculation of that
maximum involves some calculus and a neat trick involving logarithms, but I’ll refer the
reader <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading10b.pdf">elsewhere</a> for those details. It’s worth noting that the MLE is often our
intuitive “best guess” at the parameter; in this case, as you might anticipate,
\(p=5/8\) maximizes the likelihood of seeing 250 people out of 400 identify as female.</p>
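<p>You can also see the maximum numerically, without any calculus, by evaluating the log-likelihood on a grid of candidate parameters:</p>

```python
import numpy as np

N, k = 400, 250
p_grid = np.linspace(0.001, 0.999, 999)

# Log-likelihood of each candidate p (dropping the constant binomial coefficient)
log_lik = k * np.log(p_grid) + (N - k) * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_lik)]
print(f"MLE of p: {p_mle:.3f}")   # 5/8 = 0.625
```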
<p>I won’t give any questions here, because I honestly have not seen any in my searching
around. Even so, I think it’s an important concept to be familiar with. Maximum
likelihood estimation often provides a theoretical foundation for our intuitive
estimates of parameters, and it’s helpful to be able to justify yourself in this
framework.</p>
<p>For example, if you’re looking at samples from an exponential distribution, and you want
to identify the parameter \(\lambda\), you might guess that since the mean of an
exponential random variable is \(\mu= 1/\lambda\), a good guess would be \(\lambda
\approx 1/\overline x\), where \(\overline x\) is your sample mean. In fact you would be
correct, and this is the MLE for \(\lambda\); you should be familiar with this way of
thinking about parameter estimation.</p>
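<p>A quick simulation confirms this: draw from an exponential with a known rate and check that \(1/\overline x\) recovers it (the rate value here is arbitrary):</p>

```python
import numpy as np

true_lambda = 2.0
rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / true_lambda, size=50_000)  # NumPy takes scale = 1/lambda

lambda_mle = 1 / x.mean()
print(f"MLE of lambda: {lambda_mle:.3f}")
```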
<h1 id="experimental-design">Experimental Design</h1>
<p>Last, but certainly not least, is the large subject of experimental design. This is a
more nebulous topic, and therefore harder to familiarize yourself with quickly, than the
others we’ve discussed so far.</p>
<p>If we have some new feature, we might have reason to think it will be good to include in
our product. For example, Facebook rolled out a “stories” feature some time ago (I
honestly couldn’t tell you what it does, but it’s some thing that sits on the top of
your newsfeed). However, before they expose this to all their users, they want to put it
out there “in the wild” and see how it performs. So, they run an experiment.</p>
<p>Designing this experiment in a valid way is essential to getting meaningful, informative
results. An interview question at Facebook might be: <strong>How will you analyze if launching
stories is a good idea? What data would you look at?</strong> The discussion of this question
could easily fill a full 45-minute interview session, as there are many nuances and
details to examine.</p>
<p>One basic approach would be to randomly show the “stories” feature to some people, and
not to others, and then see how it affects their behavior. This is an A/B test. Some
questions you should be thinking about are:</p>
<ul>
<li><strong>What metrics will we want to track in order to measure the effect of stories?</strong> For
example, we might measure the time spent on the site, the number of clicks, etc.</li>
<li><strong>How should we randomize the two groups?</strong> Should we randomly choose every time someone
visits the site whether to show them stories or not? Or should we make a choice for
each <em>user</em> and fix that choice? Generally, user-based randomization is preferable,
although sometimes it’s hard to do across devices (think about why this is).</li>
<li><strong>How long should we run the tests? How many people should be in each group?</strong> This
decision is often based on a <em>power calculation</em>, which gives us the probability of
rejecting the null hypothesis, given some alternative hypothesis. I personally am not
a huge fan of these because the alternative hypothesis is usually quite ad-hoc, but it
is the standard, so it’s good to know how to do it. For example, you might demand that
your test be large enough that if including stories increases site visit time by at
least one minute, the test will detect that with 90% probability.</li>
<li><strong>When can we stop the test?</strong> The important thing to note here is that you <strong>cannot</strong>
just stop the test once the results look good - you have to decide beforehand how long
you want it to run.</li>
<li><strong>How will you deal with confounding variables?</strong> What if, due to some technical
difficulty, you end up mostly showing stories to users at a certain time of day, or in
a certain geographical region? There are a variety of approaches here, and I won’t get
into the details, but it’s essential that you be able to answer this concern clearly
and thoroughly.</li>
</ul>
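<p>To make the power calculation above concrete, here is a rough sketch of the standard sample-size formula for a two-sided, two-sample z-test under a normal approximation. The one-minute effect comes from the example above; the ten-minute standard deviation is an invented assumption, purely for illustration:</p>

```python
from scipy.stats import norm

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Per-group n for a two-sided, two-sample z-test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.ppf(power)          # quantile corresponding to the desired power
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

# detect a 1-minute increase in visit time, assuming sigma = 10 minutes
n = sample_size_per_group(delta=1.0, sigma=10.0)  # roughly 2100 users per group
```

<p>Note how sensitive the answer is to the assumed standard deviation; this is one reason power calculations can feel ad-hoc in practice.</p>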
<p>It’s also worth considering scenarios where you have to analyze data after the fact in
order to perform “experiments”; sometimes you want to know (for example) if the color of
a product has affected how well it sold, and you want to do so using existing sales
data. What limitations might this impose? A key one is confounding
variables - perhaps the red version mostly sold in certain geographic regions,
whereas the blue version sold better in others. What impact will this
have on your analysis?</p>
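<p>One simple defense against such a confounder is stratification: compare colors <em>within</em> each region, then combine. The toy numbers below are invented purely to show how a naive pooled comparison and a stratified one can disagree:</p>

```python
import pandas as pd

# toy sales data in which region confounds the color effect
df = pd.DataFrame({
    "color":  ["red", "red", "blue", "blue", "red", "blue"],
    "region": ["west", "west", "east", "east", "east", "west"],
    "sales":  [120, 130, 90, 100, 80, 140],
})

# naive comparison: pool all regions together
naive = df.groupby("color")["sales"].mean()  # identical means - no apparent effect

# stratified comparison: difference within each region, then averaged
within = df.groupby(["region", "color"])["sales"].mean().unstack("color")
stratified_effect = (within["red"] - within["blue"]).mean()  # a clear negative effect
```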
<p>There are many other considerations to think about around experimental design. I don’t
have any particular posts that I like; I’d recommend searching around Google to find
more information on the topic.</p>
<p>If you have any friends who do statistics professionally, I’d suggest sketching out a
design for the above experiment and talking through it with them - the ability to think
through an experimental design is something that is best developed over years of
professional experience.</p>
<h1 id="conclusion">Conclusion</h1>
<p>This guide has focused on some of the basic aspects of statistics that get covered in
data science interviews. It is far from exhaustive - different companies focus on
different skills, and will therefore be asking you about different statistical concepts
and techniques. I haven’t discussed time-dependent statistics at all - Markov chains,
time-series analysis, forecasting, and stochastic processes all might be of interest to
employers if they are relevant to the field of work.</p>
<p>Please let me know if you have any corrections to what I’ve said here. I’m far
from a statistician, so I’m sure that I’ve made lots of small (and some large)
mistakes!</p>
<p>Stay tuned for the rest of the study guide, which should be appearing in the
coming months. And finally, best of luck with your job search! It can be a
challenging, and even demoralizing experience; just keep learning, and don’t
let rejection get you down. Happy hunting!</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote1" role="doc-endnote">
<p>Of course, the actual statement is careful about the mode of
convergence, and the fact that it is actually an appropriately-normalized
version of the distribution that converges, and so on. <a href="#fnref:fnote1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote2" role="doc-endnote">
<p>Again, we’re being loose here - it has to have finite variance, and
the convergence is only in a specific sense. <a href="#fnref:fnote2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote3" role="doc-endnote">
<p>I’m being a little loose with definitions here - the width of a
\(2\sigma\) interval is actually \(4\sigma\), but I think most would still
describe it using the phrase “two-sigma”. <a href="#fnref:fnote3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote4" role="doc-endnote">
<p>As usual, we’re being a bit sloppy - we’re just using the sample variance in
place of the true variance and pretending this is correct. This will work if the
number of samples \(n\) is large. If you need confidence intervals with few (say,
less than 15) samples, I recommend you look into confidence intervals based on the
student-t distribution. <a href="#fnref:fnote4" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:fnote4:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:fnotez" role="doc-endnote">
<p>In doing bootstrapping, we’re really trying to find the distribution of our
statistic \(\hat S\). So, what we find via this method are bounds \((l,u)\) such that
\(P(l\leq \hat S \leq u)\geq C\). How does this relate to the definition of a
confidence interval? This is a somewhat theoretic exercise, but can be helpful in
clarifying your understanding of the more technical aspects of confidence interval
computation. <a href="#fnref:fnotez" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnoted" role="doc-endnote">
<p>Why are you using MATLAB? Stop that. You’re not in school anymore. <a href="#fnref:fnoted" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnotec" role="doc-endnote">
<p>Some of the issues that arise here (for example, over- and
under-fitting) have solutions that are more practical and less theoretical and
statistical in nature - these will be covered in more depth in the machine
learning portion of this guide, and so we don’t go into too much detail in this
section. <a href="#fnref:fnotec" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnoteb" role="doc-endnote">
<p>\(\beta_0\) just represents the difference in the mean of the two
variables, so it could be non-zero even if the two are independent. <a href="#fnref:fnoteb" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comPart I of my guide to data science interviews, focusing on statistics and experimental design.New Paper: Metrics For Graph Comparison2019-07-05T00:00:00+00:002019-07-05T00:00:00+00:00http://www.pwills.com/posts/2019/07/05/metrics-paper<p>I just put a <a href="https://www.biorxiv.org/content/10.1101/611509v1">new paper up on bioRxiv</a>, and so I thought I would share it
here. This was the final paper I wrote for my Ph.D., and it’s the one I’m most proud
of. The paper is called “Metrics for Graph Comparison: a Practitioner’s Guide.”</p>
<h1 id="the-basic-idea">The Basic Idea</h1>
<p>Suppose you have two graphs, or even just a single graph that is changing in time. For
example, you might have a social network between students at a school that evolves as
time passes. Below, we see the social network for a particular French elementary school,
which is evolving as the day passes. Each vertex is a person, and each edge indicates
face-to-face contact.</p>
<p><img src="/assets/images/research/class_graphs.png" alt="Primary School Graphs" /></p>
<p>One important question that we must answer is “how much did the graph change between
times \(t\) and \(t+1\)?” Said another way, how similar are graphs \(G_t\) and
\(G_{t+1}\)? The central subjects of this paper are the many methods available for
comparing graphs.</p>
<p>We study these methods both by looking at empirical examples like the one above, as well
as by doing a large study of the statistics of comparing various random graph
models. Which graph comparison tool can best distinguish an Erdos-Renyi random graph
from a stochastic blockmodel? What about comparing a random graph with fixed degree
distribution to a preferential attachment graph? Using Monte Carlo simulation of the
graphs, we are able to answer these questions and gain insight into the behavior of our
distances when they are used on a variety of different structures and geometries.</p>
<p>One important focus of the paper is on practicality, and so we only look at distances
that are linear or near-linear (i.e. \(O(n)\) or \(O(n \log n)\)) in the number of
vertices in the graph.<sup id="fnref:fnote1" role="doc-noteref"><a href="#fn:fnote1" class="footnote" rel="footnote">1</a></sup> More computationally expensive distances may be of
theoretical interest, but for the graphs used in business, which often range upwards of
1 million vertices, they are not feasible to use.</p>
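<p>As a rough illustration of the kind of comparison the paper studies, here is a hand-rolled adjacency spectral distance (a sketch, not the paper’s reference implementation) applied to an Erdos-Renyi graph and a two-community stochastic blockmodel, using networkx:</p>

```python
import numpy as np
import networkx as nx

def adjacency_spectral_distance(G1, G2, k=None):
    """Euclidean distance between the sorted adjacency spectra of two graphs."""
    spectrum = lambda G: np.sort(np.linalg.eigvalsh(nx.to_numpy_array(G)))[::-1]
    s1, s2 = spectrum(G1), spectrum(G2)
    if k is None:
        k = min(len(s1), len(s2))  # compare the k largest eigenvalues
    return np.linalg.norm(s1[:k] - s2[:k])

# an Erdos-Renyi graph vs. a stochastic blockmodel with matched expected density
G_er = nx.erdos_renyi_graph(100, 0.10, seed=0)
G_sbm = nx.stochastic_block_model([50, 50], [[0.15, 0.05], [0.05, 0.15]], seed=0)
d = adjacency_spectral_distance(G_er, G_sbm)
```

<p>Since only the eigenvalues are needed, the comparison can use fast, truncated eigensolvers on large sparse graphs, which is what makes spectral distances practical at scale.</p>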
<h1 id="findings">Findings</h1>
<p>There is a lot of nuance in the interpretation of these comparisons - it’s not as
simple as “method X is the best”. The results depend strongly on which structural
differences you wish to detect in the graph. Do you care about total
connectivity? Then just use a simple edit distance. If you care about the community
structure of a graph, then you should probably use a spectral distance.</p>
<p>That said, we find that spectral methods (which are quite standard, and have been around
for some time) are strong performers all around. They are robust, flexible, and have the
added benefit of easy implementation - fast spectral algorithms are ubiquitous in modern
computing packages such as MATLAB, SciPy, and Julia.</p>
<p>For example, here is a plot showing how well the different distances are able to discern
an Erdos-Renyi random graph from a stochastic blockmodel.</p>
<p><img src="/assets/images/metric_comparison_plot.png" alt="ER_SBM_Comparison" /></p>
<p>Higher numbers mean that the distances can more reliably discern between the two
populations. We see that the adjacency spectral distance \(\lambda^A\) and the
normalized Laplacian spectral distance \(\lambda^{\mathcal L}\) are most reliably able
to pick out the community structure that differentiates between these two models. This
is not surprising, as the spectrum of the graph has a direct interpretation in terms of
vibrational modes, which depend critically upon community structure.</p>
<p>If you want to know more, check out <a href="https://www.biorxiv.org/content/10.1101/611509v1">the full paper</a>. The above result is just one of
a large collection of findings that we lay out. As I said before, the idea isn’t to come
to a single conclusion; it is to survey the landscape and to compare and contrast these
different tools.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In research, so many people spend so much time developing new methods, and I always
think to myself, “How does this compare to the standard method? Is it actually an
improvement?” This paper attempts to take stock of a number of standard and cutting-edge
methods in graph comparison, and see what works best. After spending some time doing a
theoretical analysis of a particular graph distance metric (see <a href="https://arxiv.org/abs/1707.07362">my previous paper</a>)
I was curious to see how all the tools available compared to one another.</p>
<p>Also, I’ve implemented many of these distances in my Python library <a href="https://www.github.com/peterewills/netcomp">NetComp</a>, which
you can get via <code class="language-plaintext highlighter-rouge">pip install netcomp</code>. Check it out, and feel free to post issues and/or
PRs if you want to add to/modify the library.</p>
<p>Let me know in the comments what you think! Or feel free to email me if you
have more detailed questions about graph metrics. Happy Friday!</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote1" role="doc-endnote">
<p>This is paired with the assumption that the graph is sparse, so the
number of edges is \(O(n \log n)\). <a href="#fnref:fnote1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comA brief discussion of my latest paper, which benchmarks various metrics used to compare complex networks, also known as graphs.Types as Propositions2018-11-30T00:00:00+00:002018-11-30T00:00:00+00:00http://www.pwills.com/posts/2018/11/30/types<p>Some of the most meaningful mathematical realizations that I’ve had have been
unexpected connections between two topics; that is, realizing that two concepts
that first appeared quite distinct are in fact one and the same. In our first
linear algebra courses, we learn that manipulating matrices is, in fact,
equivalent to solving systems of equations. In quantum mechanics, we see that
<a href="https://en.wikipedia.org/wiki/Observable">physically observable quantities</a> are, mathematically speaking, linear
operators (I still don’t quite grok this one). And, my personal favorite
example, we learn in functional analysis that the linear functionals in the dual
space of a Hilbert space are themselves in perfect correspondence with the
functions in the original space.<sup id="fnref:fnote1" role="doc-noteref"><a href="#fn:fnote1" class="footnote" rel="footnote">1</a></sup></p>
<p>Recently, I’ve stumbled upon another such result, which has captured my
attention for a while. The result, often referred to as Curry-Howard
correspondence, is the statement that propositions in a formal logical system
are equivalent to types in the simply typed lambda calculus. Loosely, this means
that <strong>logical statements are equivalent to data types</strong>!</p>
<p>Let’s unpack that a bit; “propositions” are just statements in a logical
system.<sup id="fnref:fnote15" role="doc-noteref"><a href="#fn:fnote15" class="footnote" rel="footnote">2</a></sup> In mathematics, for example, one might put forward the
proposition “no even numbers are prime,” or “14 is greater than 18”. Note that
propositions need not be <em>true</em>; in fact, some logical systems support
propositions that cannot even be determined to be true or false.<sup id="fnref:fnote2" role="doc-noteref"><a href="#fn:fnote2" class="footnote" rel="footnote">3</a></sup>
“Types” can be thought of as types in a computing language; <code class="language-plaintext highlighter-rouge">Integer</code>, <code class="language-plaintext highlighter-rouge">Boolean</code>,
and so on. We will have much more to say about types as we move forward, but for
now, hold in your mind the conventional notion of types as defined in a language
such as Java or Python (or better yet, Haskell).</p>
<p>How on earth could these two be in correspondence? On the surface, they appear
entirely separate concepts. In this post, I’ll spend some time unpacking what
this equivalence is actually saying, using a simple example. I am far from a
full understanding of it, but as usual, I write about it in the hopes that I’ll
be forced to clarify what I <em>do</em> understand, or even better, be corrected by
someone more knowledgable than myself.</p>
<p>Speaking of those more knowledgable than myself, there are various resources
online that I found very helpful in understanding the correspondence:
<a href="https://www.youtube.com/watch?v=IOiZatlZtGU&t=1176s">Philip Wadler’s talk</a> on the subject is a great starting point, and there
are a number of <a href="http://lambda-the-ultimate.org/node/1532">useful</a> <a href="https://stackoverflow.com/questions/2969140/what-are-the-most-interesting-equivalences-arising-from-the-curry-howard-isomorp">discussions</a> <a href="https://stackoverflow.com/questions/2829347/a-question-about-logic-and-the-curry-howard-correspondence">available</a> on StackExchange and
various functional programming forums.</p>
<h2 id="an-example">An Example</h2>
<p>I was confused by the idea of propositions as types when I first encountered it,
and after learning more, I believe that the root of my confusion lies in the
fact that types such as <code class="language-plaintext highlighter-rouge">Integer</code>, <code class="language-plaintext highlighter-rouge">Boolean</code>, and <code class="language-plaintext highlighter-rouge">String</code>, which we are
familiar with from programming, correspond to very trivial propositions, making
them poor examples. We’ll have to introduce something a bit fancier; a
<em>conditional type</em>. For example, <code class="language-plaintext highlighter-rouge">OddInt</code> might be odd Integers, and <code class="language-plaintext highlighter-rouge">PrimeInt</code>
might be prime integers. We’ll approximate these conditional types with custom
classes in Scala. Classes and types are <a href="https://stackoverflow.com/questions/5031640/what-is-the-difference-between-a-class-and-a-type-in-scala-and-java">different beasts</a>, of course, but
we will ignore that distinction in this post.<sup id="fnref:fnote3" role="doc-noteref"><a href="#fn:fnote3" class="footnote" rel="footnote">4</a></sup></p>
<p>Let’s consider one conditional type in particular: <code class="language-plaintext highlighter-rouge">BigInteger</code>. This type
(actually a class in this example) is defined as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">BigInteger</span> <span class="o">(</span><span class="k">val</span> <span class="nv">value</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="o">{</span>
<span class="k">private</span> <span class="k">final</span> <span class="k">val</span> <span class="nv">LOWER_BOUND</span> <span class="k">=</span> <span class="mi">10000</span>
<span class="nf">if</span> <span class="o">(</span><span class="n">value</span> <span class="o"><</span> <span class="nc">LOWER_BOUND</span><span class="o">)</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">IllegalArgumentException</span><span class="o">(</span><span class="s">"Too small!"</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">toString</span> <span class="k">=</span> <span class="n">s</span><span class="s">"BigInteger($value)"</span>
<span class="o">}</span></code></pre></figure>
<p>One could then instantiate a <code class="language-plaintext highlighter-rouge">BigInteger</code> as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="nv">big</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">BigInteger</span><span class="o">(</span><span class="mi">10001</span><span class="o">)</span>
<span class="c1">// res0: BigInteger(10001)</span>
<span class="k">val</span> <span class="nv">small</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">BigInteger</span><span class="o">(</span><span class="mi">500</span><span class="o">)</span>
<span class="c1">// java.lang.IllegalArgumentException: Too small!</span></code></pre></figure>
<p>Now the fundamental question: what proposition corresponds to this type? In
simple scenarios like this, the corresponding proposition is that the type can
be <em>inhabited</em>; that is, there exists a value that satisfies that type. For
example, the type <code class="language-plaintext highlighter-rouge">BigInteger</code> corresponds to the claim “there exists an integer
\(i\) for which \( i > 10,000 \)”. Obviously, such an integer exists, and the
fact that we can instantiate this type indicates that it corresponds to a true
proposition. Alternatively, consider a type <code class="language-plaintext highlighter-rouge">WeirdInteger</code>, which is an integer
satisfying <code class="language-plaintext highlighter-rouge">i < 3 && i > 5</code>. We can define the type well enough, but there are
no values which satisfy it; it is an uninhabitable type, and so corresponds to a
false proposition.</p>
<h2 id="functions-and-implication">Functions and Implication</h2>
<p>Let’s make things a little more interesting. In programming languages, there are
not only primitive types like <code class="language-plaintext highlighter-rouge">Integer</code> and <code class="language-plaintext highlighter-rouge">Boolean</code>, but there are also
<strong>function types</strong>, which are the types of functions. For example, in Scala, the
function <code class="language-plaintext highlighter-rouge">def f(x: Int) = x.toString</code> has type <code class="language-plaintext highlighter-rouge">Int => String</code>, which is to say
it is a function that maps integers to strings.</p>
<p>What sort of propositions would <em>functions</em> correspond to? It turns out that
functions naturally map to <em>implication</em>. In some ways, the correspondence here
is very natural. Consider the conditional type <code class="language-plaintext highlighter-rouge">BigInteger</code>, and the conditional
type <code class="language-plaintext highlighter-rouge">BiggerInteger</code>. The definition of the latter should look familiar, from
above:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">BiggerInteger</span> <span class="o">(</span><span class="k">val</span> <span class="nv">value</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="o">{</span>
<span class="k">private</span> <span class="k">final</span> <span class="k">val</span> <span class="nv">LOWER_BOUND</span> <span class="k">=</span> <span class="mi">20000</span>
<span class="nf">if</span> <span class="o">(</span><span class="n">value</span> <span class="o"><</span> <span class="nc">LOWER_BOUND</span><span class="o">)</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">IllegalArgumentException</span><span class="o">(</span><span class="s">"Too small!"</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">toString</span> <span class="k">=</span> <span class="n">s</span><span class="s">"BiggerInteger($value)"</span>
<span class="o">}</span></code></pre></figure>
<p>Now, we can write a function that maps <code class="language-plaintext highlighter-rouge">BigInteger</code> to <code class="language-plaintext highlighter-rouge">BiggerInteger</code>:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">def</span> <span class="nf">makeBigger</span><span class="o">(</span><span class="n">b</span><span class="k">:</span> <span class="kt">BigInteger</span><span class="o">)</span><span class="k">:</span> <span class="kt">BiggerInteger</span> <span class="o">=</span>
<span class="k">new</span> <span class="nc">BiggerInteger</span><span class="o">(</span><span class="nv">b</span><span class="o">.</span><span class="py">value</span> <span class="o">*</span> <span class="mi">2</span><span class="o">)</span></code></pre></figure>
<p>Recall that the proposition corresponding to the type <code class="language-plaintext highlighter-rouge">BigInteger</code> is the
statement “there exists an integer greater than 10,000”, and the proposition
corresponding to <code class="language-plaintext highlighter-rouge">BiggerInteger</code> is the statement “there exists an integer greater than
20,000”; the proposition corresponding to the function type <code class="language-plaintext highlighter-rouge">BigInteger =>
BiggerInteger</code> is then just the statement “the existence of an integer above
10,000 implies the existence of an integer above 20,000”. And note that, as it
should be for an implication, we do not care whether there actually <em>does</em> exist
an integer above 10,000; we simply know that <em>if</em> one exists, then its existence
implies the existence of an integer above 20,000.</p>
<p>To be a bit more explicit, the function that we wrote above can be thought of as
a <strong>proof</strong> of the implication; in particular, if we suppose that there exists
an \(i\) such that \(i > 10,000\), then clearly \(2i > 20,000\), and so
if we let \(j=2i\), then we have proven the existence of a \(j\) such that
\(j > 20,000\). This is what the theoretical computer scientists mean when
they say that “programs are proofs”.</p>
<p>Of course, Scala is not a proof-checking language, and cannot tell during
compilation that the function <code class="language-plaintext highlighter-rouge">makeBigger</code> is valid; we would need a much richer
type system to be able to validate such functions. Consider that the following
function compiles with no problem, although there are no input values for which
it will not throw a (runtime) exception:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">def</span> <span class="nf">wonky</span><span class="o">(</span><span class="n">b</span><span class="k">:</span> <span class="kt">BigInteger</span><span class="o">)</span><span class="k">:</span> <span class="kt">BiggerInteger</span> <span class="o">=</span>
<span class="k">new</span> <span class="nc">BiggerInteger</span><span class="o">(</span><span class="nv">b</span><span class="o">.</span><span class="py">value</span> <span class="o">%</span> <span class="mi">1000</span><span class="o">)</span></code></pre></figure>
<h3 id="wait-what">Wait… what?</h3>
<p>If you think about it a bit more, it’s sort of a weird example; you
could map <em>any</em> type to <code class="language-plaintext highlighter-rouge">BiggerInteger</code>, just by doing <code class="language-plaintext highlighter-rouge">def f[A](a:A):
BiggerInteger = new BiggerInteger(20001)</code>. This is because the proposition that
corresponds to <code class="language-plaintext highlighter-rouge">BiggerInteger</code> is true (the type is inhabitable), and if B is
true, then A implies B for any A at all.</p>
<p>Common languages such as Haskell only express very trivial propositions with
their types; there does exist one uninhabitable type (<code class="language-plaintext highlighter-rouge">Void</code>), but I have not
found much use for it in practice. The benefit of using conditional types for
these examples is that we can explore at least some types which have
corresponding <em>false</em> propositions, such as <code class="language-plaintext highlighter-rouge">WeirdInteger</code>, which are integers
<code class="language-plaintext highlighter-rouge">i</code> which satisfy <code class="language-plaintext highlighter-rouge">i < 3 && i > 5</code>.</p>
<h2 id="in-conclusion">In Conclusion</h2>
<p>Seeing all this, you can begin to get a sense of how computer-assisted proof
techniques might arise out of it. If the fact that a program compiles is
equivalent to the truth of the corresponding proposition, then all we need is a
language with a rich enough type system to express interesting
statements. Examples of languages used in this way include <a href="https://coq.inria.fr/">Coq</a> and
<a href="https://en.wikipedia.org/wiki/Agda_(programming_language)">Agda</a>. A thorough discussion of such languages is beyond both the scope of
this post and my understanding.</p>
<p>I think what keeps me interested in this subject is that it still remains quite
opaque to me; I’ve struggled to even come up with these simple (and flawed)
examples of how Curry-Howard correspondence plays out in practice. I hope that
anyone reading this who understands the subject better than I do will leave a
detailed list of my misunderstandings, so that I can better grasp this
mysterious and fascinating topic.</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote1" role="doc-endnote">
<p>This statement is difficult to understand without background in
functional analysis, but it is in fact one of the most beautiful examples of
such an equivalence result. <a href="#fnref:fnote1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote15" role="doc-endnote">
<p>I’m being a bit sloppy here. The type of logic we’re talking about
here is not classical logic, but rather in the sense of <a href="https://en.wikipedia.org/wiki/Natural_deduction">natural deduction</a>. <a href="#fnref:fnote15" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote2" role="doc-endnote">
<p>Such systems are called undecidable; see
<a href="https://en.wikipedia.org/wiki/Decidability_(logic)">the wiki entry on decidability</a> for more information. <a href="#fnref:fnote2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote3" role="doc-endnote">
<p>We won’t be careful about whether the idea of conditional types
presented here corresponds well with conditional types as they are actually
implemented in programming languages such as <a href="https://github.com/Microsoft/TypeScript/pull/21316">Typescript</a>. <a href="#fnref:fnote3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comWhat is the connection between data types and logical propositions? Surprisingly, it runs quite deep. This post explores and illuminates that link.Inverse Transform Sampling in Python2018-06-24T00:00:00+00:002018-06-24T00:00:00+00:00http://www.pwills.com/posts/2018/06/24/sampling<p>When doing data work, we often need to sample random variables. This is easy to
do if one wishes to sample from a Gaussian, or a uniform random variable, or a
variety of other common distributions, but what if we want to sample from an
arbitrary distribution? There is no obvious way to do this within
<code class="language-plaintext highlighter-rouge">scipy.stats</code>. So, I built a small library, <a href="https://www.github.com/peterewills/itsample"><code class="language-plaintext highlighter-rouge">inverse-transform-sample</code></a>,
that allows for sampling from arbitrary user-provided distributions. In use, it
looks like this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">pdf</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># unit Gaussian, not normalized
</span><span class="kn">from</span> <span class="nn">itsample</span> <span class="kn">import</span> <span class="n">sample</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">sample</span><span class="p">(</span><span class="n">pdf</span><span class="p">,</span><span class="mi">1000</span><span class="p">)</span> <span class="c1"># generate 1000 samples from pdf </span></code></pre></figure>
<p>The code is available <a href="https://www.github.com/peterewills/itsample">on GitHub</a>. In this post, I’ll outline the theory of
<a href="https://en.wikipedia.org/wiki/Inverse_transform_sampling">inverse transform sampling</a>, discuss computational details, and outline some
of the challenges faced in implementation.</p>
<h2 id="introduction-to-inverse-transform-sampling">Introduction to Inverse Transform Sampling</h2>
<p>Suppose we have a probability density function \(p(x)\), which has an
associated cumulative distribution function (CDF) \(F(x)\), defined as usual by</p>
\[F(x) = \int_{-\infty}^x p(s)ds.\]
<p>Recall that the cumulative distribution function \(F(x)\) tells us <em>the probability
that a random sample from \(p\) is less than or equal to \(x\)</em>.</p>
<p>Let’s take a second to notice something here. If we knew, for some x, that
\(F(x)=t\), then drawing \(x\) from \(p\) is in some way <strong>equivalent to
drawing \(t\) from a uniform random variable on \([0,1]\)</strong>, since the CDF for
a uniform random variable is \(F_u(t) = t\).<sup id="fnref:fnote1" role="doc-noteref"><a href="#fn:fnote1" class="footnote" rel="footnote">1</a></sup></p>
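<p>We can sanity-check this equivalence empirically (a quick sketch using <code class="language-plaintext highlighter-rouge">scipy.stats.norm</code>, separate from the library itself): push Gaussian samples through their own CDF, and the results should look uniform on \([0,1]\), with mean near \(1/2\) and variance near \(1/12\).</p>

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)  # samples from a unit Gaussian
t = norm.cdf(x)              # apply the Gaussian CDF to its own samples

# If the equivalence holds, t is uniform on [0, 1]:
# its mean should be near 1/2 and its variance near 1/12.
print(t.mean(), t.var())
```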
<p>That realization is the basis for inverse transform sampling. The procedure is:</p>
<ol>
<li>Draw a sample \(t\) uniformly from the interval \([0,1]\).</li>
<li>Solve the equation \(F(x)=t\) for \(x\) (invert the CDF).</li>
<li>Return the resulting \(x\) as the sample from \(p\).</li>
</ol>
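<p>A bare-bones version of these three steps (a hypothetical sketch, not the library's actual implementation) can be written with <code class="language-plaintext highlighter-rouge">scipy</code>'s quadrature and root-finding:</p>

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def sample_inverse_transform(pdf, lower, upper, n_samples, seed=0):
    """Draw samples from an unnormalized pdf by numerically inverting its CDF."""
    norm, _ = quad(pdf, lower, upper)  # normalize the pdf over its support
    cdf = lambda x: quad(pdf, lower, x)[0] / norm
    rng = np.random.default_rng(seed)
    samples = []
    for t in rng.uniform(0, 1, n_samples):
        # step 2: solve F(x) = t; brentq brackets the root on [lower, upper]
        samples.append(brentq(lambda x: cdf(x) - t, lower, upper))
    return np.array(samples)

pdf = lambda x: np.exp(-x**2 / 2)  # unit Gaussian, not normalized
samples = sample_inverse_transform(pdf, -8, 8, 200)
```

<p>Note that every root-finding iteration triggers a fresh quadrature call, which is exactly the performance problem discussed below.</p>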
<h2 id="computational-considerations">Computational Considerations</h2>
<p>Most of the computational work done in the above algorithm comes in at step 2,
in which the CDF is inverted.<sup id="fnref:fnote2" role="doc-noteref"><a href="#fn:fnote2" class="footnote" rel="footnote">2</a></sup> Consider Newton’s method, a typical
routine for finding numerical solutions to equations: the approach is iterative,
and so the function to be inverted, in our case the CDF \(F(x)\), is evaluated
many times. Since \(F\) is a (numerically computed) integral
of \(p\), we will have to run our numerical quadrature routine
once for each evaluation of \(F\). Since we need <em>many</em> evaluations of \(F\)
for a single sample, this can lead to a significant slowdown in sampling.</p>
<p>Again, the pain point here is that our CDF \(F(x)\) is slow to evaluate,
because each evaluation requires numerical quadrature. What we need is an
approximation of the CDF that is fast to evaluate, as well as accurate.</p>
<h3 id="chebyshev-approximation-of-the-cdf">Chebyshev Approximation of the CDF</h3>
<p>I snooped around on the internet a bit, and found <a href="https://github.com/scipy/scipy/issues/3747">this feature request</a> for
scipy, which is related to this same issue. Although it never got off the
ground, I found an interesting link to <a href="https://arxiv.org/pdf/1307.1223.pdf">a 2013 paper by Olver & Townsend</a>, in
which they suggest using Chebyshev polynomials to approximate the PDF. The
advantage of this approach is that the integral of a series of Chebyshev
polynomials is known analytically - that is, if we know the Chebyshev expansion
of the PDF, we automatically know the Chebyshev expansion of the CDF as
well. This should allow us to rapidly invert the (Chebyshev approximation of
the) CDF, and thus sample from the distribution efficiently.</p>
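<p>The idea can be sketched with <code class="language-plaintext highlighter-rouge">numpy</code>'s Chebyshev module (a simplified, hypothetical illustration, not the library's code): interpolate the PDF at Chebyshev points, integrate the series term-by-term, and the resulting CDF becomes a cheap polynomial evaluation.</p>

```python
import numpy as np
from numpy.polynomial import chebyshev as C

pdf = lambda x: np.exp(-x**2 / 2)  # unit Gaussian, not normalized
lower, upper = -8.0, 8.0

# Chebyshev polynomials live on [-1, 1], so map the support onto it
to_unit = lambda x: 2 * (x - lower) / (upper - lower) - 1
from_unit = lambda u: (u + 1) * (upper - lower) / 2 + lower

# Interpolate the PDF at the Chebyshev points, then integrate the series:
# the antiderivative of a Chebyshev series is known in closed form.
coeffs = C.chebinterpolate(lambda u: pdf(from_unit(u)), 64)
int_coeffs = C.chebint(coeffs) * (upper - lower) / 2  # account for the mapping

cdf_raw = lambda x: C.chebval(to_unit(x), int_coeffs) - C.chebval(-1.0, int_coeffs)
total = cdf_raw(upper)
cdf = lambda x: cdf_raw(x) / total  # normalized, quadrature-free CDF

print(cdf(0.0))  # should be 0.5 by symmetry
```

<p>After the one-time setup, each CDF evaluation is just <code class="language-plaintext highlighter-rouge">chebval</code>, with no quadrature in the loop.</p>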
<h3 id="other-approaches">Other Approaches</h3>
<p>There are also less mathematically sophisticated approaches that immediately
present themselves. One might consider solving \(F(x)=t\) on a grid of \(t\)
values, and then building the inverse function \(F^{-1}\) by interpolation. One
could even simply transform the provided PDF into a histogram, and then use the
functionality built in to <code class="language-plaintext highlighter-rouge">scipy.stats</code> for sampling from a provided histogram
(more on that later). However, due to time constraints,
<code class="language-plaintext highlighter-rouge">inverse-transform-sample</code> only includes the numerical quadrature and Chebyshev
approaches.</p>
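<p>For reference, the grid-plus-interpolation alternative can be sketched in a few lines (again a hypothetical illustration, not part of the library): tabulate the CDF once with the trapezoid rule, then interpolate the inverse map \(t \mapsto x\) by swapping the axes.</p>

```python
import numpy as np
from scipy.interpolate import interp1d

pdf = lambda x: np.exp(-x**2 / 2)  # unit Gaussian, not normalized
lower, upper = -6.0, 6.0

# Tabulate the CDF once on a grid via the trapezoid rule...
xs = np.linspace(lower, upper, 2001)
ps = pdf(xs)
cdf_vals = np.concatenate([[0.0], np.cumsum((ps[1:] + ps[:-1]) / 2 * np.diff(xs))])
cdf_vals /= cdf_vals[-1]  # normalize so F(upper) = 1

# ...then interpolate the inverse map t -> x by swapping the axes
inv_cdf = interp1d(cdf_vals, xs)

rng = np.random.default_rng(0)
samples = inv_cdf(rng.uniform(0, 1, 5000))
```

<p>This trades accuracy in the tails (where the tabulated CDF is nearly flat) for very fast sampling once the grid is built.</p>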
<h2 id="implementation-in-python">Implementation in Python</h2>
<p>The implementation of this approach is not horribly sophisticated, but in
exchange it exhibits that wonderful readability characteristic of Python
code. The complexity is the highest in the methods implementing the
Chebyshev-based approach; those without a background in numerical analysis may
wonder, for example, why the function is evaluated on <a href="https://en.wikipedia.org/wiki/Chebyshev_nodes">that particularly strange
set of nodes</a>.</p>
<p>In the quadrature-based approach, the numerical quadrature and root-finding
are both done via the <code class="language-plaintext highlighter-rouge">scipy</code> library (<code class="language-plaintext highlighter-rouge">scipy.integrate.quad</code> and
<code class="language-plaintext highlighter-rouge">scipy.optimize.root</code>, respectively). When using this approach, one can set the
boundaries of the PDF to be infinite, as <code class="language-plaintext highlighter-rouge">scipy.integrate.quad</code> supports
improper integrals. In the <a href="https://github.com/peterewills/itsample/blob/master/example.ipynb">notebook of examples</a>, we show that the samples
generated by this approach do, at least in the eyeball norm, conform to the
provided PDF. As we expected, this approach is slow - it takes about 7 seconds to generate
5,000 samples from a unit normal.</p>
<p>As with the quadrature and root-finding, pre-rolled functionality from <code class="language-plaintext highlighter-rouge">scipy</code> was
used to both compute and evaluate the Chebyshev approximants. When approximating
a PDF using Chebyshev polynomials, finite bounds must be provided. A
user-determined tolerance determines the order of the Chebyshev approximation;
however, rather than computing a true error, we simply use the size of the last
few Chebyshev coefficients as an error estimate. Since this
approach differs from the previous one only in the way the CDF is constructed,
we use the same function <code class="language-plaintext highlighter-rouge">sample</code> for both approaches; an option
<code class="language-plaintext highlighter-rouge">chebyshev=True</code> will generate a Chebyshev approximant of the CDF, rather than
using numerical quadrature.</p>
<p>I hoped that the Chebyshev approach would improve on this by an order of
magnitude or two; however, my hopes were thwarted. The implementation of the
Chebyshev approach is faster by perhaps a factor of 2 or 3, but does not offer
the kind of improvement I had hoped for. What happened? In testing, a single
evaluation of the Chebyshev CDF was not much faster than a single evaluation of
the quadrature CDF. The advantage of the Chebyshev CDF comes when one wishes to
evaluate a long, vectorized set of inputs; in this case, the Chebyshev CDF is
orders of magnitude faster than quadrature. But <code class="language-plaintext highlighter-rouge">scipy.optimize.root</code> does not
appear to take advantage of vectorization, which makes sense - in simple
iteration schemes, the value at which the next iteration occurs depends on the
outcome of the current iteration, so there is not a simple way to vectorize the
algorithm.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I suspect that the reason this feature is absent from large-scale libraries like
<code class="language-plaintext highlighter-rouge">scipy</code> and <code class="language-plaintext highlighter-rouge">numpy</code> is that it is difficult to build a sampler that is both fast
and accurate over a large enough class of PDFs. My approach sacrifices speed;
other approximation schemes may be very fast, but may not provide the accuracy
guarantees needed by some users.</p>
<p>What we’re left with is a library that is useful for generating small numbers
(less than 100,000) of samples. It’s worth noting that Olver &
Townsend seem able to use the Chebyshev approach to sample orders of
magnitude faster than my implementation, but sadly their code is nowhere
to be found in the Matlab library <a href="http://www.chebfun.org/"><code class="language-plaintext highlighter-rouge">chebfun</code></a>, which is the location
advertised in their work. Presumably they implemented their own root-finder, or
Chebyshev approximation scheme, or both. There’s a lot of space for improvement
here, but I simply ran out of time and energy on this one; if you feel inspired,
<a href="https://github.com/peterewills/itsample#contributing">fork the repo</a> and submit a pull request!</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote1" role="doc-endnote">
<p>This is only true for \(t\in [0,1]\). For \(t<0\),
\(F_u(t)=0\), and for \(t>1\), \(F_u(t)=1\). <a href="#fnref:fnote1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote2" role="doc-endnote">
<p>The inverse of the CDF is often called the percentile point function,
or PPF. <a href="#fnref:fnote2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comExplanation of, and code for, a small Python tool for sampling from arbitrary distributions.Algorithmic Musical Genre Classification2018-06-06T00:00:00+00:002018-06-06T00:00:00+00:00http://www.pwills.com/posts/2018/06/06/genre<p>If you are not automatically redirected, please <a href="/portfolio/genre_cls">click here</a></p>
<meta http-equiv="refresh" content="0;url=/portfolio/genre_cls" />Peter Willspeter@pwills.comA summary of a project of mine in which I build an algorithmic classifier that identifies the genre of a piece of music based directly on the waveform.