<h1 id="taking-a-stand">Taking a Stand (2020-11-04, Peter Wills, peter@pwills.com)</h1>
<p>This election cycle has made me see how tribalistic I can be in my perception of
politics. It’s very easy for me to fall into a point of view in which there is a “good
team” and a “bad team.” This perspective is harmful in that it causes other painful
emotions to arise (e.g. anger), but also in that it prevents me from contributing
meaningfully to the political discourse and being a citizen that contributes to our
country’s progress.</p>
<p>For me, the way to alleviate these tribalistic ideas is to replace them with a
meaningful and well-articulated set of ideas and opinions. I think it’s time for me to
put myself out there as a person with opinions and ideas on how to improve our country.</p>
<p>So, I’m going to outline my ideas in this post. These views will certainly change over
time, and I hope to become better informed on all of them. That said, I think the
foundational idea of democracy is that aggregating the opinions of many
only-partly-informed citizens is better than only considering the opinions of an
extremely-well-informed elite. So, in that spirit, I will endeavor to articulate my
only-partly-informed viewpoint.</p>
<p>Below, I will state my opinions as fact. I think it makes for a better articulation than
if I’m couching each sentence with “in my opinion…” and “as far as I can see…”, but
the reader should be aware that this is implied; I’m far from an expert in most of these
issues, and I am not citing most of my claims. If you think I’m wrong, then I probably
am; please let me know in the comments, preferably with some reference to back up your
correction. I’ll happily modify this post as I update my understanding.</p>
<h1 id="driving-values">Driving Values</h1>
<p>The various positions I take on issues below are all instrumental; that is, they aim to
be in service of certain core values that I hold. It’s important to lay these out before
discussing individual points, because they frame any discussions that will be had.</p>
<p>I value <strong>mutual respect and kindness</strong>. I think that society should be built on a
foundation of people respecting one another, and being kind to one another. This doesn’t
have to mean that we love one another or even particularly like one another; it just
means that we have to recognize other people as human beings, with wants, hopes, and
lives just as valid as our own.</p>
<p>I care about <strong>liberty</strong>. That is, I want people to have the freedom to do what they want to
do without unreasonable restriction. For example, what religion you practice, who you
choose to have sex with, and what drugs you choose to put in your body should not be
regulated by the government.<sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote">1</a></sup></p>
<p>I care about <strong>economic opportunity</strong>. I want all people in our society to have access to
basic comforts, such as sufficient food, clothing, shelter, and medical care.</p>
<p>I care about <strong>equality</strong>. This goes hand in hand with both liberty and mutual respect. I
think that all people are fundamentally deserving of certain rights, irrespective of
gender, race, intelligence, etc.<sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote">2</a></sup></p>
<p>I care about <strong>security</strong>. We should strive to provide a life without unnecessary physical
or psychological dangers.</p>
<h1 id="economic-issues">Economic Issues</h1>
<p>I think economic issues are fundamental. Many of the issues we face as a society are
symptoms of underlying economic issues. If the people in our society felt that they had
economic opportunity, and were confident that their children would have this opportunity
as well, then our society would function much more smoothly.</p>
<p>For example, it seems to me that our nation’s current drift towards partisan politics and
an anger-driven national discourse is driven by the lack of economic opportunity
available to much of the lower and middle class in America today. We can try to make
changes to the structure of government, or elect different leaders (both of which I
think are good ideas), but ultimately if we do not address the root (economic) causes of
people’s discontent, then we will never have a functioning governmental system.</p>
<h2 id="education">Education</h2>
<p>Education is a huge part of the economic life of an American. A college degree is seen
as crucial to economic success, and various studies have shown that having a bachelor’s
degree increases lifetime earnings, with some studies claiming to have established
causality in this relationship.<sup id="fnref:fn3" role="doc-noteref"><a href="#fn:fn3" class="footnote">3</a></sup></p>
<p>But, college is getting more and more expensive, and the marginal value of a college
degree is decreasing as more and more people get them. Now students graduate from
college with five- or six-figure debt, which prevents them from building wealth that can
provide economic security for them as they age and retire.</p>
<p>Bernie Sanders famously campaigned on free college for all. I think <strong>free college is a
bad solution</strong>. It would funnel large amounts of taxpayer money towards colleges that
charge too much and deliver too little in terms of valuable skills. We need to
transition our society away from its obsession with the four-year college degree, and
move towards a trade-school model like <a href="https://www.bmbf.de/en/the-german-vocational-training-system-2129.html">that implemented in Germany.</a></p>
<p>Some argue that college is about more than just job training; it is about developing
critical thinking skills, and becoming a fully functional citizen. I agree that these
skills are fundamental to civic participation; I don’t see most colleges doing an
acceptable job of providing their students with them. For example, the state could run
free community-based classes that teach critical thinking - these would be independent
of whatever vocational training a person decides to undertake.</p>
<h2 id="taxation-spending--the-social-safety-net">Taxation, Spending, & the Social Safety Net</h2>
<p>A strong social safety net is an excellent way to prevent economic distress. This will
require more taxes, but this doesn’t need to be burdensome on the middle (or even
upper-middle) classes, if done correctly.<sup id="fnref:fn4" role="doc-noteref"><a href="#fn:fn4" class="footnote">4</a></sup></p>
<p>This social safety net should include expansion of Medicare and Social
Security/welfare. We should provide free childcare for young children, to allow mothers
to participate in the workforce unhindered. I would love to see universal basic income
implemented in an effective way; we would need to significantly increase taxes in order
to do so, however, and probably on more than just the very highest earners, so it’s not
obvious to me that it’s actually a good idea.</p>
<p>To fund these endeavors, the government needs more revenue. We need to simplify the tax
code, and close loopholes. We need to increase taxes on the very highest earners and
corporations, and enforce the simplified tax code in order to ensure that those taxes
are actually paid. We should also significantly reduce defense spending to free up money
for domestic social spending.<sup id="fnref:fn5" role="doc-noteref"><a href="#fn:fn5" class="footnote">5</a></sup></p>
<h2 id="regulation--deregulation">Regulation & Deregulation</h2>
<p>I think it is an essential role of government to regulate certain aspects of the
market. For example, there are shared goods that the market does not incentivize
individual actors to protect, but they are of high value to society as a
whole. Environmental protection regulation is an example of this; I think we need to
implement thorough and carefully thought-out environmental regulation, which expands on
our existing system.</p>
<p>Another key area where the government needs to regulate is in antitrust. The government
should protect and promote a competitive marketplace. The current antitrust law, written
in the era of the railroad barons, is badly outdated and in need of an overhaul in order
to address potential anticompetitive behaviors of modern technology companies.</p>
<p>Conversely, there are many areas where heavy government regulation inhibits
economic activity and actually prevents the market from creating value. An example of
this is in urban land use; we need less regulation on building and zoning in urban areas
so that builders can generate a supply to meet the growing demand, and undercut the
exploding housing costs in many large American cities. Rent control is <u>not</u> a good
solution for this; it’s simply a supply-and-demand problem, and we need to increase the
supply.</p>
<h2 id="health-care">Health Care</h2>
<p>Health care should be mentioned, as it relates to regulation and deregulation, although
I don’t actually have a strong opinion on it. I have heard some solid arguments that the
tangled relationship between US health insurance companies and the health care sector is
a driver of our current explosion in health care costs, and that if we removed some of
the barriers put in place then we could have a more efficient market for health care
that would provide better value.</p>
<p>However, we tend to be bad at even <em>thinking</em> of health care as a good; for example, we
rarely do a cost-benefit analysis of chemotherapy for a loved one; instead, we generally
say “do whatever it takes.” For health care to function as a market, we would need to start
considering seriously (for example) whether it’s worth $500,000 to extend the life of a
75-year-old by another 8 years.</p>
<p>The opposite end of the spectrum is single-payer healthcare. This might actually be a
good way to get costs down because then the government, as the single customer of
health-care, would have a lot of bargaining power and be able to bring down the price
they pay for services provided. However, if they don’t do this effectively, then a <u>lot</u>
of taxpayer money would be going to services that may not be worth it (unnecessary
procedures or imaging, for example).</p>
<h1 id="governance">Governance</h1>
<p>The US has a representative government, and we need to make sure that our elected
officials are incentivized to genuinely reflect the views, opinions, and values of the
population that they represent.</p>
<p>I’m going to argue here for changes we should make to our existing system. I will try to
focus on changes that could be enacted legislatively, rather than by constitutional
amendment, because it’s very difficult to gather the consensus needed to enact the
latter, particularly in our current political environment.<sup id="fnref:fn6" role="doc-noteref"><a href="#fn:fn6" class="footnote">6</a></sup></p>
<h2 id="campaign-finance-reform">Campaign Finance Reform</h2>
<p>One of the key things that incentivizes elected officials is campaign finance. They need
to please their campaign donors, so that they can raise money to support their
re-election, and election of their party members.</p>
<p>We need to find a way to reduce the amount of money that flows through elections. It is
not always obvious how to make this happen, but one thing that seems clear is that we
should overturn <a href="https://en.wikipedia.org/wiki/Citizens_United_v._FEC">the Citizens United ruling</a> that grants free-speech rights to
corporations, allowing for unchecked corporate political spending. One solution would be
to cap the political donations by individuals & corporations to any campaigns or
political action committees at a relatively small amount (say, $5,000).</p>
<p>This, however, runs into free speech concerns that I’m not entirely settled on;
shouldn’t I be able to spend my money on television advertisements saying (within
reason) whatever I like? If I genuinely think that Michael Dukakis is a threat to
American democracy, shouldn’t I be able to freely promote that message?</p>
<p>It’s not obvious how to handle this, but I think we need to grapple with it in order to
re-establish integrity for our elected officials.</p>
<h2 id="legislative-gridlock">Legislative Gridlock</h2>
<p>Legislative gridlock is a big challenge to progress. It <em>appears</em> that it is more
significant now than it has been in the past, but I’m not certain of that. We have seen
evidence that Congress is more polarized, and that there is a trend away from compromise
and towards parties voting as predictable blocs on legislation.<sup id="fnref:fn7" role="doc-noteref"><a href="#fn:fn7" class="footnote">7</a></sup></p>
<p>This is something we need to address. There may be changes we can make to the
legislative process that encourage compromise, and that would be a positive
step. However, my belief is that this polarization and partisanship ultimately flows
from the people themselves. Politicians are afraid to compromise because they know that
if they work across the aisle, they will be demonized by their constituents and not
re-elected.</p>
<p>One way to reduce polarization is to avoid focus on already-politicized issues. For
example, if Democrats were to relax their traditional position on gun control, then
perhaps they would have more leverage to push for liberal economic policies that would
benefit lower-income Americans. Although some of these policies have been politicized
(e.g. single-payer healthcare) some of them have not, and maintain a fairly bipartisan
support base (e.g. universal basic income).</p>
<p>It’s worth noting that we can also sidestep an ineffective legislature by allowing the
private sector to address problems. This will work, sometimes; for example, SpaceX has a
promising new satellite internet technology (Starlink) that I hope will soon provide
broadband internet to any area with a clear view of the sky; this would work around our
nation’s embarrassingly poor broadband infrastructure (and lack of any political will to
address it). Some problems, however, are not well-addressed by the private sector
(e.g. nature conservation, antitrust law) because market forces work against them.</p>
<h2 id="court-packing">Court-Packing</h2>
<p>If our legislative branch is not functioning, then the executive and judicial branches
are encouraged to pick up the slack. This has resulted in presidents from Obama onward
severely expanding executive power via executive order, and in a focus on the
appointment of politically-motivated judges as a partisan strategy.</p>
<p>We need to change our system so that it is robust to a partisan judiciary; right now,
and for the foreseeable future, this is the reality of our situation. Assuming the
officials that are empowered to appoint judges are elected fairly,<sup id="fnref:fn8" role="doc-noteref"><a href="#fn:fn8" class="footnote">8</a></sup>
then our goal is to ensure that judges are appointed at a roughly consistent rate.</p>
<p>There are a few strategies that would encourage this. One strategy is to have
term-limits in place, rather than the current lifetime appointments. Another is to
increase the size of important courts, most notably the Supreme Court, so that the churn
of judges happens at a more consistent rate.</p>
<p>Finally, we cannot let the Senate control approval of appointments. The Senate gives
equal voice to <u>states</u>, rather than equal voice to individuals. So, it will always be
biased towards the lower-population-density areas, and therefore not be representative
of the will of the people. Such an institution should not hold control over who gets
appointed to the judiciary, especially in our current era of increasing judicial power.</p>
<p>One solution would be to lower the threshold for approval of judges; for example,
requiring only 35% approval for judicial appointments. This, however, has its own
downsides, since it would allow for appointment of even-more-partisan judges to the
bench. I don’t know a better solution right now, but I think that it is a problem we
need to address.</p>
<h2 id="gerrymandering">Gerrymandering</h2>
<p>Gerrymandering has long been a strategy used to bias legislative bodies. It is
problematic in that it can make that legislative body less representational of the
population, which undermines the fundamental dynamics of democracy that support fair
governance.</p>
<p>In short, we cannot allow the drawing of district lines to be a process controlled by an
inherently partisan legislative body (the state legislatures, in the US). I suspect that
there exist processes that guarantee a fair drawing of district lines; we should codify
those processes into law. Again, this is not something we can leave up to partisan
elected officials; we need to restrict it via a process that disallows such partisan
strategies.</p>
<h1 id="social-issues">Social Issues</h1>
<p>Social issues are a primary focus of politics in the US, and (from what I can see) are a
major driver of the partisan polarization we see today. They are also touchy, which is
to say that expressing certain opinions on social issues can have severe repercussions
for people in their personal and professional lives.</p>
<p>Because of that, I am going to refrain from going into much detail on these issues. I am
happy to discuss them in private, but I simply don’t trust our current social climate to
handle reasonable, well-thought-out discussion on these issues in the public sphere.</p>
<p>This has been true throughout, but it’s worth emphasizing here: these are <em>just my
opinions</em>, and I would enjoy the opportunity to change them. If you disagree with me, I
hope you will engage with me so that we can understand one another better, and hopefully
teach each other a thing or two. I certainly have a lot to learn on all these topics.</p>
<h2 id="race--policing">Race & Policing</h2>
<p>Racism is a significant issue in America. Slavery is a horrific part of our national
heritage,<sup id="fnref:fn9" role="doc-noteref"><a href="#fn:fn9" class="footnote">9</a></sup> and it reverberates throughout our culture today. We should
always work towards the goal of a society where opportunities (economic, social, etc.)
are not limited by skin color or heritage.</p>
<p>There is a lot of focus lately on the interactions of policing and race. I do support a
restructuring of the American policing system; we give too much authority, and too little
oversight, to police officers. This level of authority and oversight is appropriate when
handling certain issues, but is entirely absurd when (for example) an officer is
handling a routine traffic stop.</p>
<p>A key element of police reform should be a severe reduction in the kinds of situations
that armed police officers handle. Armed officers are generally well trained in
self-defense, but often have very little social or interpersonal skill, as evidenced by
a recent trend of poor decision-making and an inability to defuse tense situations,
which leads to violence and death. Traffic stops,
domestic disputes, etc. should be handled by public servants that have the appropriate
social skills.</p>
<p>That said, I’m not sure that policing is actually the most important driver of racism in
America. Improving economic opportunities for <u>all</u> Americans will go a long way towards
providing opportunities for marginalized groups. Even so, there are attitudes in our
society that will not be addressed simply by economic changes. I don’t have a good
answer for that problem, but I do think it is a fundamental one we must tackle if we
wish to function in alignment with the ideal that “all people are created equal.”</p>
<h2 id="gun-control">Gun Control</h2>
<p>The other social issue I will mention is gun control. This is an issue that contributes
<u>strongly</u> to polarization, and prevents liberals from making inroads into rural
communities. I think we should do everything we can to prevent people from having an
unreasonable ability to harm one another. That said, I don’t think gun control should be
a high priority.</p>
<p>About 38,000 people have died so far in 2020 due to gun violence. How many of these
would have been prevented if we could outlaw tactical/assault weapons? Only 16,000 of
these weren’t suicides, which are generally not committed with such weapons. I’d estimate that
outlawing tactical and assault weapons would save fewer than 5,000 lives per year, and I
would guess it would be <u>much</u> fewer, perhaps a reduction of 1,000 lives per year. Compare
that to (for example) obesity, which kills about 300,000 people a year. Smoking kills
over 400,000 people a year.</p>
<p>I think that liberals over-prioritize gun control in their agenda, and it hurts their
ability to enact other, much more important and impactful aspects of their
platform. Frankly, I wish they would give it a rest.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Again, these things I’ve been stating as fact are, actually, <em>just my opinions</em>. I have
taken up this declarative structure to embody my belief that we can only act from our
current, limited point of view, and that I cannot let the incompleteness of my knowledge
prevent me from having positions and taking action on issues that I see in the world.</p>
<p>The flipside of this is that we always have to be willing to listen, and learn, and
change our opinions. Perhaps (for example) gun violence <u>is</u> one of the most important
social issues facing our society today. I would be interested to see arguments for this,
and I aspire to be open-minded to any arguments that go against my existing
opinions. The things that are inarguable are <u>values</u>;<sup id="fnref:fn10" role="doc-noteref"><a href="#fn:fn10" class="footnote">10</a></sup> those are inherent in
us, and cannot be proven or disproven. As for <u>how</u> we go about enacting those values,
well, that must always be flexible and open to change.</p>
<!----- Footnotes ----->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Of course, there are limits; for example, I believe sex should be performed only with mutual consent, and young children should be prevented from using certain drugs. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>This gets complicated, though; what about people in vegetative states? What about animals? I’m glossing over some nuance here. <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn3" role="doc-endnote">
<p>I should look up citations for this. <a href="#fnref:fn3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn4" role="doc-endnote">
<p>This is an area where I am opining without concrete numbers to back it up. However, most tax analyses are done in a blatantly partisan way; it’s very difficult to find an analysis of taxation and spending that does not have ulterior motives. That said, if you have any recommendations, I’d love to hear them. <a href="#fnref:fn4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn5" role="doc-endnote">
<p>Our national security moving forward does not depend on having better airplanes or tactical equipment; it depends on intelligence and information security (cybersecurity). I actually think we should invest <u>more</u> heavily in intelligence and infosec. I wholeheartedly believe that we should support our troops; I think the best way to do so is to avoid needless conflict, and ensure that these troops have sufficient economic, social, and medical security when they arrive back home. <a href="#fnref:fn5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn6" role="doc-endnote">
<p>For example, I think that a two-party system does not necessarily encourage the best representation, but this is so baked in to how our voting systems work that I don’t really discuss it here. In that particular case, I also am not really confident that a many-party system (a la Israel) is actually more effective or representative. <a href="#fnref:fn6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn7" role="doc-endnote">
<p>Vox recently <a href="https://www.vox.com/polyarchy/2018/5/31/17406590/local-national-political-institutions-polarization-federalism">wrote an interesting article</a> on how our polarization may result from the fact that our system was designed for local political institutions, but most people now focus primarily on national politics. <a href="#fnref:fn7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn8" role="doc-endnote">
<p>It is my opinion that this is <u>not</u> currently the case, due to e.g. gerrymandering of congressional districts. <a href="#fnref:fn8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn9" role="doc-endnote">
<p>I would be remiss not to also mention the other horrific aspect of our national heritage: the systematic extermination of the indigenous peoples that inhabited North America prior to the arrival of Europeans. For a heartbreaking account of this, I recommend <a href="https://www.amazon.com/Bury-My-Heart-Wounded-Knee/dp/0099526409/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=">Bury My Heart at Wounded Knee</a> by Dee Alexander Brown. <a href="#fnref:fn9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn10" role="doc-endnote">
<p>Values can change, of course, but they are not subject to evidence in the same way that strategies are. I cannot prove to you that (for example) freedom is more important than security; it is simply an opinion that one holds. That said, if we spend time with one another, and maintain an open-minded attitude, then we tend to absorb one another’s values, which is a process that leads us towards a more harmonious society. <a href="#fnref:fn10" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="blogging-in-org-mode">Blogging in Org Mode (2019-09-24, Peter Wills)</h1>
<p>I recently transitioned from writing my posts directly in markdown to writing them in
<a href="https://orgmode.org/">org mode</a>, a document authoring system built in GNU Emacs. I learned a lot along the
way, and built a new org exporter, <a href="https://github.com/peterewills/ox-jekyll-lite"><code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code></a>.<sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote">1</a></sup></p>
<h1 id="org-mode-and-the-meaning-of-life">Org Mode and the Meaning of Life</h1>
<h2 id="what-is-org-mode">What is Org Mode?</h2>
<p>Laozi said that the Tao that can be told is not the eternal Tao; I think we can safely
say the same of org mode. Org mode is many things to many people, but at its core it is
a tool for taking notes and organizing lists. Additional functionality allows for simple
text markup, links, inline images, rendered \(\LaTeX\) fragments, and so on. You can embed
and run code blocks within org files, using the powerful <a href="https://orgmode.org/worg/org-contrib/babel/"><code class="language-plaintext highlighter-rouge">org-babel</code></a> package. Some
people have even <a href="https://write.as/dani/writing-a-phd-thesis-with-org-mode">written
their Ph.D. thesis in org mode</a>. It’s an amazingly powerful tool, with a passionate
user base that is constantly expanding its capabilities.</p>
<h2 id="why-not-markdown">Why Not Markdown?</h2>
<p>I like to use org mode for my personal and professional note-taking because it has very
good folding features - you can hide all headings besides the one you’re focusing
on. You can even “narrow” your buffer, so that only the heading (“subtree”, in org-mode
parlance) that you’re working on is present at all.</p>
<p>Org mode also has some nice visual features for writing, such as:</p>
<ul>
<li>rendering \(\LaTeX\) fragments inline</li>
<li>styling <strong>bold</strong>, <u>underlined</u>, and <em>italicized</em> text properly</li>
<li>excellent automatic formatting of tables</li>
<li>code syntax highlighting in various languages</li>
<li>display of images inline</li>
</ul>
<p>I wrote in markdown (using <code class="language-plaintext highlighter-rouge">markdown-mode</code> within emacs) for some time, but once I saw
what org mode had to offer, I realized that I needed to transfer my blogging over to
org. In particular, the Emacs mode <code class="language-plaintext highlighter-rouge">markdown-mode</code> doesn’t have a lot of the features that
org mode does, such as inline rendering of math and images or well-built text folding. I
used org for notes, and I realized that it would be much easier to just write in org
instead of trying to get markdown mode to work the way I want it to.</p>
<p>Below is a short clip that shows just some of what org mode has to offer. You’ll
want to full-screen it to make the text legible.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/MV9LR2LCxAE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p> </p>
<p>Overall, I find the experience of writing in org much more enjoyable than writing in
markdown. Plus, I love hacking on emacs, and moving my blogging workflow over to org
presented me with an opportunity to do just that! So of course, I couldn’t resist.</p>
<h1 id="org-export-and-jekyll">Org-Export and Jekyll</h1>
<h2 id="blogging-in-jekyll">Blogging in Jekyll</h2>
<p>The primary tool I use to generate my blog is a static-site generator called <a href="https://jekyllrb.com/">Jekyll</a>,
which is written in Ruby. I wrote <a href="/_posts/2017-12-29-website.md">a previous post</a> describing my process for setting up
my site. <a href="https://blog.getpelican.com/">Pelican</a> is a similar tool written in Python, and <a href="https://gohugo.io/">Hugo</a> is a
static-site generator written in Go. We’ll talk a bit more about Hugo later.</p>
<p>All of these tools allow the user to write content in simple markdown, with the site
generator doing most of the heavy lifting in generating a full static site behind the
scenes. In Jekyll, the user provides some basic configuration for each post, like a
title, date, and excerpt, and then the theme determines the details of how the text is
rendered into fully styled HTML. I use the excellent <a href="https://mmistakes.github.io/minimal-mistakes/">minimal mistakes</a> theme.</p>
<p>Unfortunately, markdown is not a nicely unified language specification There are many
dialects of markdown, and each has subtle differences, so there is not, in general, one
markdown specification to rule them all. For example, so-called “GitHub-flavored
markdown”, which renders markdown from READMEs in GitHub repositories, has certian
quirks that are not shared by the markdown I write for this site. To further complicate
things, the static site generators often have their own quirks - Jekyll requires
particularly-formatted front-matter to specify the configuration for each post, which is
not part of the general markdown specification.</p>
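<p>To make that Jekyll quirk concrete, here is a sketch of the YAML front matter that sits at the top of a post’s source file, fenced by triple-dash lines. The <code>title</code>, <code>date</code>, and <code>excerpt</code> fields are the ones mentioned above; the <code>layout</code> value shown is illustrative, since the valid layouts depend on the theme, and this is not the actual front matter of any post on this site.</p>

```yaml
---
layout: single        # illustrative; available layouts depend on the theme
title: "Blogging in Org Mode"
date: 2019-09-24
excerpt: "Moving my blogging workflow from markdown to org mode."
---
```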
<p>All that is to say, it wasn’t a trivial task to find something that converted org to the
exact markdown that I need for my site. But before we jump into the details there, we
should talk a bit about org exporters in general.</p>
<h2 id="org-export">Org-Export</h2>
<p>Org mode comes packaged with many built-in “exporters”, which convert from the org
format to other text formats, including HTML, \(\LaTeX\), iCalendar, and more. It <em>does</em>
come with a backend that converts org to markdown, which I hoped would be all that I
needed.</p>
<p>Unfortunately, the built-in <code class="language-plaintext highlighter-rouge">ox-md</code> exporter doesn’t work very well, for a few reasons. It
falls back on using pure HTML (for example, to generate footnotes) when there are
markdown-native ways of accomplishing the same thing. Also, some things don’t work at
all - for example, equation exporting won’t work, since markdown requires you to enclose
LaTeX with <code class="language-plaintext highlighter-rouge">\\[</code> and <code class="language-plaintext highlighter-rouge">\\]</code>, whereas HTML only requires a single backslash.<sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote">2</a></sup></p>
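<p>To make the delimiter problem concrete, here is a rough sketch in Python (not the actual exporter code) of the kind of backslash-doubling a Jekyll-friendly exporter has to perform:</p>

```python
import re

def double_latex_delimiters(text):
    """Double the backslash in \\[ \\] \\( \\) delimiters so that, after
    markdown's escape processing, a single backslash survives for MathJax."""
    return re.sub(r"\\(\[|\]|\(|\))", r"\\\\\1", text)

print(double_latex_delimiters(r"\(E = mc^2\)"))  # \\(E = mc^2\\)
```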
<p>A quick search will show that there are many tools built to address this problem. Org
exporter backends are designed to be easy to extend, and many users have extended the
markdown backend to work with specific static site generators. The most fully developed
of these is <a href="https://ox-hugo.scripter.co/"><code class="language-plaintext highlighter-rouge">ox-hugo</code></a>, which is built to work with the site generator Hugo. This
package in particular would be a big source of the transcoding functions I would use,
but since it is built to be tightly integrated with Hugo, I couldn’t just use it out of
the box.</p>
<p>Elsa Gonsiorowski developed a Jekyll-friendly org exporter, called <a href="https://www.gonsie.com/blorg/ox-jekyll.html"><code class="language-plaintext highlighter-rouge">ox-jekyll-md</code></a>, which
provided the basis for what I would eventually build. She also wrote <a href="https://www.gonsie.com/blorg/ox-jekyll.html">a blog post</a> about
it - if you’re interested in customizing org exporters, I’d recommend giving it a read.</p>
<h2 id="building-ox-jekyll-lite">Building <code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code></h2>
<p>There are some things that <code class="language-plaintext highlighter-rouge">ox-jekyll-md</code> does very well, including generating the
Jekyll-specific YAML front matter. However, I found that it lacks a few key features:</p>
<ul>
<li>handling footnotes in a markdown-native way</li>
<li>rendering MathJax delimiters with double backslashes (to make them markdown-compatible)</li>
<li>exporting image links appropriately</li>
<li>exporting link paths relative to the Jekyll root directory</li>
</ul>
<p>Since these were essential to my blogging workflow, I forked that project and began work
on my org exporter, <a href="https://github.com/peterewills/ox-jekyll-lite"><code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code></a>.</p>
<h3 id="customizing-an-org-export-backend">Customizing an Org Export Backend</h3>
<p>You can think of an org-export backend as a collection of rules for transforming org
files into other text formats. For example, how should underlined text be handled? How
about code blocks? How about \(\LaTeX\) snippets? Each of these rules is encapsulated by a
so-called “transcoding function.”</p>
<p>Org export backends are built to be highly extensible. If you extend <code class="language-plaintext highlighter-rouge">ox-md</code>, for example,
then you “inherit” all the transcoding functions that it provides, and you can add or replace
only the functions you want to. For example, part of <code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code> looks like</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">org-export-define-derived-backend</span> <span class="ss">'jekyll</span> <span class="ss">'md</span>
<span class="ss">:translate-alist</span>
<span class="o">'</span><span class="p">((</span><span class="nv">headline</span> <span class="o">.</span> <span class="nv">org-jekyll-lite-headline-offset</span><span class="p">)</span>
<span class="p">(</span><span class="nv">inner-template</span> <span class="o">.</span> <span class="nv">org-jekyll-lite-inner-template</span><span class="p">)))</span>
</code></pre></div></div>
<p>This tells us that we’re defining a backend named <code class="language-plaintext highlighter-rouge">jekyll</code>, which derives from the backend
named <code class="language-plaintext highlighter-rouge">md</code> (which, if you look, itself derives from the <code class="language-plaintext highlighter-rouge">html</code> backend).</p>
<p>In the code above, the <code class="language-plaintext highlighter-rouge">translate-alist</code> indicates that this backend handles <code class="language-plaintext highlighter-rouge">headline</code>
objects via the <code class="language-plaintext highlighter-rouge">org-jekyll-lite-headline-offset</code> method, and handles the <code class="language-plaintext highlighter-rouge">inner-template</code>
object via <code class="language-plaintext highlighter-rouge">org-jekyll-lite-inner-template</code>. These functions take in org elements,
returning text that will get dumped into the export buffer.</p>
<p>The transcoding function <code class="language-plaintext highlighter-rouge">org-jekyll-lite-underline</code> is a particularly simple example:</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">org-jekyll-lite-underline</span> <span class="p">(</span><span class="nv">underline</span> <span class="nv">contents</span> <span class="nv">info</span><span class="p">)</span>
<span class="s">"Transcode UNDERLINE from Org to Markdown.
CONTENTS is the text with underline markup. INFO is a plist
holding contextual information."</span>
<span class="p">(</span><span class="nb">format</span> <span class="s">"<u>%s</u>"</span> <span class="nv">contents</span><span class="p">))</span>
</code></pre></div></div>
<p>Extending a backend consists of figuring out which elements you want to handle via
special logic, then writing the appropriate transcoding functions for each.</p>
<h3 id="implementation-details-for-ox-jekyll-lite">Implementation Details for <code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code></h3>
<p>Most of the more complicated transcoding functions in <code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code> are not written by
me. They either come from <code class="language-plaintext highlighter-rouge">ox-jekyll-md</code>, or from <code class="language-plaintext highlighter-rouge">ox-hugo</code>. For example, I got the
transcoder for footnotes, and for \(\LaTeX\) snippets, from <code class="language-plaintext highlighter-rouge">ox-hugo</code>.</p>
<p>The most interesting addition that I made was to render file links relative to the root
directory of Jekyll, when possible. For example, if you have an image in your
<code class="language-plaintext highlighter-rouge">assets/images</code> folder, Jekyll wants you to link to it as <code class="language-plaintext highlighter-rouge">/assets/images/kitties.jpg</code>, not
with the full path relative to the root directory of your computer’s filesystem.</p>
<p>However, when I use <code class="language-plaintext highlighter-rouge">C-c C-l</code> (along with Helm) to add a link to an org file, it renders
the link with the absolute path.<sup id="fnref:fn3" role="doc-noteref"><a href="#fn:fn3" class="footnote">3</a></sup> It’s important that the link is “correct” for my
machine, so that any images can render inline, and the links are clickable when I’m in
my org file. But if the links are relative to my filesystem’s root in the markdown,
then they won’t work within the context of my site. So, we need to “fix” the links as we
export the post to markdown.</p>
<p>I don’t get too complicated here - I just have the user specify a custom variable
<code class="language-plaintext highlighter-rouge">org-jekyll-project-root</code>, which then gets pulled off of the beginning of file paths when
it is present.</p>
<p>For example, on my machine, this repository is located at
<code class="language-plaintext highlighter-rouge">~/code/jekyll/peterewills.github.io/</code>, and so if I link to the file
<code class="language-plaintext highlighter-rouge">~/code/jekyll/peterewills.github.io/assets/images/kitties.jpg</code> in my org file,
<code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code> will, upon export, transform this to a link to <code class="language-plaintext highlighter-rouge">/assets/images/kitties.jpg</code>
in the markdown output. This approach is nice and simple, but it doesn’t handle relative
links, or the situation where you have multiple Jekyll projects.</p>
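<p>The idea can be sketched in a few lines of Python (the real implementation is elisp, and the paths here are hypothetical):</p>

```python
def strip_project_root(path, project_root):
    """Return PATH relative to PROJECT_ROOT, with a leading slash, when PATH
    lives under PROJECT_ROOT; otherwise return PATH unchanged."""
    root = project_root.rstrip("/")
    if path.startswith(root + "/"):
        return path[len(root):]
    return path

print(strip_project_root("/home/me/site/assets/images/kitties.jpg", "/home/me/site/"))
# /assets/images/kitties.jpg
```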
<p>Anyways, if you want to give it a try, you can clone it <a href="https://github.com/peterewills/ox-jekyll-lite">from GitHub</a> and check it
out. You can just load it up and use <code class="language-plaintext highlighter-rouge">C-c C-e j J</code> to export an org file to a markdown
buffer.</p>
<p>Finally, as a side note, I just have to give a shoutout to the excellent <a href="https://github.com/magnars/s.el"><code class="language-plaintext highlighter-rouge">s.el</code></a> and
<a href="https://github.com/magnars/dash.el"><code class="language-plaintext highlighter-rouge">dash</code></a>, which make working in elisp infinitely more pleasant. Many thanks to Magnar
Sveen for building such nice tools for us all to use.</p>
<h1 id="my-blogging-workflow">My Blogging Workflow</h1>
<p>Now, my workflow for writing a post is pretty simple.</p>
<ol>
<li>Have brilliant idea</li>
<li>Make an org file in the <code class="language-plaintext highlighter-rouge">_posts</code> directory, named like <code class="language-plaintext highlighter-rouge">YYYY-MM-DD-post-name.org</code></li>
<li>Write brilliant words/equations/cat pictures/etc.</li>
<li>Export to markdown via <code class="language-plaintext highlighter-rouge">C-c C-e j j</code></li>
<li>Commit &amp; push to GitHub</li>
<li>Profit!<sup id="fnref:fn4" role="doc-noteref"><a href="#fn:fn4" class="footnote">4</a></sup></li>
</ol>
<p>The only additional complication, compared to a pure-markdown workflow, is the addition
of the export step; other than that, it’s identical. And now I can blog in wonderful,
beautiful org mode instead of clunky markdown.</p>
<p>An important caveat for anyone using org and Jekyll: to keep Jekyll from stumbling
over the org artifacts, you should add <code class="language-plaintext highlighter-rouge">*.org</code> and <code class="language-plaintext highlighter-rouge">ltximg</code> to the <a href="https://github.com/peterewills/peterewills.github.io/blob/master/_config.yml#L13-L17">list of excluded files</a> in
your Jekyll <code class="language-plaintext highlighter-rouge">_config.yml</code>. You can see mine <a href="https://github.com/peterewills/peterewills.github.io/blob/master/_config.yml">on GitHub</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>If you are just starting to blog, and you love org mode, I’d recommend using Hugo to
build your site, so that you can use the excellent <code class="language-plaintext highlighter-rouge">ox-hugo</code>. It’s a truly org-centric
approach to building a static site, and it’s much more fully-featured than any of the
solutions I’ve found in Jekyll or Pelican.</p>
<p>But, you might want to use Jekyll, because it integrates automagically with GitHub
pages, or perhaps you just like some of the available themes or whatnot. If that’s the
case, then I think <code class="language-plaintext highlighter-rouge">ox-jekyll-lite</code> is a reasonable solution for writing your posts in
org. It’s lightweight, and you’ll probably have to tweak it to fit your particular
needs, but it’s small enough that modifying it shouldn’t be too hard. Also, you can
always submit an issue on GitHub and I’ll see if I can help you out.</p>
<p>I hope this post has inspired you to explore more in org mode! It’s a great tool for
organizing notes, tracking agendas/calendars/TODO lists, and for general
writing.<sup id="fnref:fn5" role="doc-noteref"><a href="#fn:fn5" class="footnote">5</a></sup> Happy blogging, and may the org be with you!</p>
<!----- Footnotes ----->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>As I explain later on, this tool was based on both <a href="https://github.com/gonsie/ox-jekyll-md"><code class="language-plaintext highlighter-rouge">ox-jekyll-md</code></a> and <a href="https://ox-hugo.scripter.co/"><code class="language-plaintext highlighter-rouge">ox-hugo</code></a>. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>The double backslash is required because markdown interprets the first backslash as an escape character. <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn3" role="doc-endnote">
<p>You can see an example of adding a link to an image in the org-mode demo video linked above. <a href="#fnref:fn3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn4" role="doc-endnote">
<p>This is actually a lie; I don’t make any money from this site. <a href="#fnref:fn4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn5" role="doc-endnote">
<p>There’s also the entire subject of <a href="http://cachestocaches.com/2018/6/org-literate-programming/">literate programming</a>, in which code is interwoven with documentation, which I think is a really nice paradigm, and for which org is a natural fit. <a href="#fnref:fn5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comMy workflow for blogging in org mode, with jekyll and org-export.Your p-values Are Bogus2019-09-20T00:00:00+00:002019-09-20T00:00:00+00:00http://www.pwills.com/posts/2019/09/20/bogus<p>People often use a Gaussian to approximate distributions of sample means. This is
generally justified by the central limit theorem, which states that the sample mean of
an independent and identically distributed sequence of random variables converges to a
normal random variable in distribution.<sup id="fnref:fnote_clt" role="doc-noteref"><a href="#fn:fnote_clt" class="footnote">1</a></sup> In hypothesis testing, we might use
this to calculate a \(p\)-value, which then is used to drive decision making.</p>
<p>I’m going to show that calculating \(p\)-values in this way is actually incorrect, and
leads to results that get <em>less</em> accurate as you collect more data! This has
substantial implications for those who care about the statistical rigor of their A/B
tests, which are often based on Gaussian (normal) approximations.</p>
<h1 id="a-simple-example">A Simple Example</h1>
<p>Let’s take a very simple example. Let’s say that the prevailing wisdom is that no more
than 20% of people like rollerskating. You suspect that the number is in fact much
larger, and so you decide to run a statistical test. In this test, you model each person
as a Bernoulli random variable with parameter \(p\). <strong>The null hypothesis \(H_0\) is
that \(p\leq 0.2\)</strong>. You decide to go out and ask 100 people their opinions on
rollerskating.<sup id="fnref:fnote_sample" role="doc-noteref"><a href="#fn:fnote_sample" class="footnote">2</a></sup></p>
<p>You begin gathering data. Unbeknownst to you, it is <em>in fact</em> the case that a full 80%
of the population enjoys rollerskating. So, as you randomly ask people if they enjoy
rollerskating, you end up getting a lot of “yes” responses. Once you’ve gotten 100
responses, you start analyzing the data.</p>
<p>It turns out that you got 74 “yes” responses, and 26 “no” responses. Since you’re a
practiced statistician, you know that you can calculate a \(p\)-value by finding the
probability that a binomial random variable with parameter \(p_0=0.2\) would generate a
value \(k\geq74\) with \(n=100\). This probability is just</p>
\[p_\text{exact} = \text{Prob}(k\geq 74) = \sum_{k=74}^{n}{n \choose k} p_0^{k} (1-p_0)^{(n-k)}.\]
<p>However, you know that you can approximate a binomial distribution with a Gaussian of
mean \(\mu=np_0\) and variance \(\sigma^2=np_0(1-p_0)\), so you decide to calculate an
<em>approximate</em> \(p\)-value,</p>
\[p_\text{approx} = \frac{1}{\sqrt{2\pi np_0(1-p_0)}}\int_{74}^\infty \exp\left(-\frac{(k-np_0)^2}{2np_0(1-p_0)}\right)\,dk.\]
<p>However, <strong>this approximation is actually incorrect, and will give you progressively
worse estimates of \(p_\text{exact}\).</strong> Let’s observe this in action.</p>
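<p>As a sanity check before the full sweep, we can compute both values at \(n=100\), \(k=74\) using only the standard library (this snippet is mine, not part of the original analysis code):</p>

```python
from math import comb, erfc, sqrt

n, k, p0 = 100, 74, 0.2

# exact tail probability P(X >= k) for X ~ Binomial(n, p0)
p_exact = sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(k, n + 1))

# Gaussian approximation: mean n*p0, variance n*p0*(1-p0)
mu, sigma = n * p0, sqrt(n * p0 * (1 - p0))
p_approx = 0.5 * erfc((k - mu) / (sigma * sqrt(2)))

print(p_exact, p_approx)  # p_approx is far smaller than p_exact
```

Already at \(n=100\), the Gaussian approximation underestimates the exact tail probability by many orders of magnitude.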
<h2 id="python-simulation-of-data">Python Simulation of Data</h2>
<p>We simulate data for values \(n=1\) through \(n=1000\), and compute the corresponding
exact and approximate \(p\)-values. We plot the log of the \(p\)-values, since they get
very small very quickly.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">norm</span><span class="p">,</span> <span class="n">binom</span>
<span class="n">plt</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span> <span class="p">[</span><span class="s">'classic'</span><span class="p">,</span> <span class="s">'ggplot'</span><span class="p">])</span>
<span class="n">p_true</span> <span class="o">=</span> <span class="mf">0.8</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">binom</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">p_true</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">p0</span> <span class="o">=</span> <span class="mf">0.2</span>
<span class="n">p_vals</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="n">index</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">n</span><span class="p">),</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'true p-value'</span><span class="p">,</span> <span class="s">'normal approx. p-value'</span><span class="p">]</span>
<span class="p">)</span>
<span class="k">for</span> <span class="n">n0</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">n</span><span class="p">):</span>
<span class="n">normal_dev</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n0</span><span class="o">*</span><span class="n">p0</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">p0</span><span class="p">))</span>
<span class="n">normal_mean</span> <span class="o">=</span> <span class="n">n0</span><span class="o">*</span><span class="n">p0</span>
<span class="n">k</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">data</span><span class="p">[:</span><span class="n">n0</span><span class="p">])</span>
<span class="c1"># the "survival function" is 1 - cdf, which is the p-value in our case
</span> <span class="n">normal_logpval</span> <span class="o">=</span> <span class="n">norm</span><span class="p">.</span><span class="n">logsf</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">normal_mean</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">normal_dev</span><span class="p">)</span>
<span class="n">true_logpval</span> <span class="o">=</span> <span class="n">binom</span><span class="p">.</span><span class="n">logsf</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="n">n0</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p0</span><span class="p">)</span>
<span class="n">p_vals</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">n0</span><span class="p">,</span> <span class="s">'true p-value'</span><span class="p">]</span> <span class="o">=</span> <span class="n">true_logpval</span>
<span class="n">p_vals</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">n0</span><span class="p">,</span> <span class="s">'normal approx. p-value'</span><span class="p">]</span> <span class="o">=</span> <span class="n">normal_logpval</span>
<span class="n">p_vals</span><span class="p">.</span><span class="n">replace</span><span class="p">([</span><span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="n">inf</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">inf</span><span class="p">],</span> <span class="n">np</span><span class="p">.</span><span class="n">nan</span><span class="p">).</span><span class="n">dropna</span><span class="p">().</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">));</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">"Number of Samples"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">"Log-p Value"</span><span class="p">);</span>
</code></pre></div></div>
<p>We have to drop <code class="language-plaintext highlighter-rouge">inf</code>s because after about \(n=850\) or so, the \(p\)-value actually
gets too small for <code class="language-plaintext highlighter-rouge">scipy.stats</code> to calculate; it just returns <code class="language-plaintext highlighter-rouge">-np.inf</code>.</p>
<p>The resulting plot tells a shocking tale:</p>
<p><img src="/assets/images/p-values.png" alt="P-value Divergence" /></p>
<p>The approximation diverges from the exact value! Seeing this, you begin to weep
bitterly. Is the Central Limit Theorem invalid? Has your whole life been a lie? It turns
out that the answer to the first is a resounding no, and the second… probably also
no. But then what is going on here?</p>
<h2 id="convergence-is-not-enough">Convergence Is Not Enough</h2>
<p>The first thing to note is that, mathematically speaking, the two \(p\)-values
\(p_\text{exact}\) and \(p_\text{approx}\) <strong>do, in fact, converge</strong>. That is to say,
as we increase the number of samples, their difference is approaching zero:</p>
\[\left| p_\text{exact} - p_\text{approx}\right| \rightarrow 0\]
<p>What I’m arguing, then, is that <strong>convergence is not enough</strong>.</p>
<p>If it were, then we could just approximate the true \(p\)-value with 0. That is, we
could report a \(p\)-value of \(p_\text{approx} = 0\), and claim that since our
approximation is converging to the actual value, it should be taken
seriously. Obviously, this should not be taken seriously as an approximation.</p>
<p>Our intuitive sense of “convergence”, the sense that \(p_\text{approx}\) is becoming “a
better and better approximation of” \(p_\text{exact}\) as we take more samples,
corresponds to the <em>percent error</em> converging to zero:</p>
\[\left| \frac{p_\text{approx} - p_\text{exact}}{p_\text{exact}}\right| \rightarrow 0.\]
<p>In terms of asymptotic decay, this is a stronger claim than convergence. Rather than
their difference converging to zero, which means it is \(o(1)\), we demand that their
difference converge to zero <em>faster than \(p_\text{exact}\)</em>,</p>
\[\left| p_\text{exact} - p_\text{approx}\right| = o\left(p_\text{exact}\right).\]
<p>It would also suffice to have an upper bound on the \(p\)-value; that is, if we could
say that \(p_\text{exact} < p_\text{approx}\), so \(p_\text{exact}\) is <em>at worst</em> our
approximate value \(p_\text{approx}\), and we knew that this held regardless of sample
size, then we could report our approximate result knowing that it was at worst a bit
conservative. However, as far as I can see, the central limit theorem and other similar
convergence results give us no such guarantee.</p>
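<p>To see the relative error worsen with \(n\), we can fix the observed “yes” fraction at 74% (a hypothetical stand-in for data generated with \(p=0.8\)) and compare the two tail probabilities at a few sample sizes, again with only the standard library:</p>

```python
from math import comb, erfc, sqrt

def tail_probs(n, frac=0.74, p0=0.2):
    """Exact binomial and Gaussian-approximate P(X >= frac*n) under H0: p = p0."""
    k = round(frac * n)
    exact = sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(k, n + 1))
    approx = 0.5 * erfc((k - n * p0) / (sqrt(n * p0 * (1 - p0)) * sqrt(2)))
    return exact, approx

for n in (50, 100, 200):
    exact, approx = tail_probs(n)
    print(n, approx / exact)
```

The printed ratio \(p_\text{approx}/p_\text{exact}\) shrinks rapidly toward zero: the Gaussian tail falls short of the true tail by ever more orders of magnitude as \(n\) grows.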
<h2 id="implications">Implications</h2>
<p>What I’ve shown is that for the simple case above, Gaussian approximation is not a
strategy that will get you good estimates of the true \(p\)-value, especially for large
amounts of data. You will underestimate your \(p\)-value, and therefore overestimate
the strength of evidence you have against the null hypothesis.</p>
<p>Although A/B testing is a slightly more complex scenario, I suspect that the same
problem exists in that realm. A refresher on a typical A/B test scenario: you, as the
administrator of the test, care about the difference between two sample means. If the
samples are from Bernoulli random variables (a good model of click-through rates), then
the <em>true</em> distribution of this difference is the distribution of the difference of
(scaled) binomial random variables, which is more difficult to write down and work
with. Of course, the Gaussian approximation is simple, since the difference of two
Gaussians is again a Gaussian.<sup id="fnref:fnote_AB" role="doc-noteref"><a href="#fn:fnote_AB" class="footnote">3</a></sup></p>
<p>Most statistical tests are approximate in this way. For example, the \(\chi^2\) test for
goodness of fit is an approximate test. So what are we to make of the fact that this
approximation does not guarantee increasingly valid \(p\)-values? Honestly, I don’t
know. I’m sure that others have considered this issue, but I’m not familiar with the
thinking of the statistical community on it. (As always, please comment if you know
something that would help me understand this better.) All I know is that when doing
tests like this in the future, I’ll be much more careful about how I report my results.</p>
<h1 id="afterword-technical-details">Afterword: Technical Details</h1>
<p>As I said above, the two \(p\)-values do, in fact, converge. However, there is an
interesting mathematical twist in that <strong>the convergence is not guaranteed by the
central limit theorem.</strong> It’s a bit beside the point, and quite technical, but I found
it so interesting that I thought I should write it up.</p>
<p>As I said, this section isn’t essential to my central argument about the insufficiency
of simple convergence; it’s more of an interesting aside.</p>
<h2 id="limitations-of-the-central-limit-theorem">Limitations of the Central Limit Theorem</h2>
<p>To understand the problem, we have to do a deep dive into the details of the central
limit theorem. This will get technical. The TL;DR is that since our \(p\)-values are
getting smaller, the CLT doesn’t actually guarantee that they will converge.</p>
<p>Suppose we have a sequence of random variables \(X_1, X_2, X_3, \ldots\). These would
be, in the example above, the Bernoulli random variables that represent individual people’s
responses to your question about rollerskates. Suppose that these random variables are
independent and identically distributed, with mean \(\mu\) and finite variance
\(\sigma^2\).<sup id="fnref:fnote_bin" role="doc-noteref"><a href="#fn:fnote_bin" class="footnote">4</a></sup></p>
<p>Let \(S_n\) be the sample mean of all the \(X_i\) up through \(n\):</p>
\[S_n = \frac{1}{n} \sum_{i=1}^n X_i.\]
<p>We want to say what distribution the sample mean converges to. First, we know it’ll
converge to something close to the mean, so let’s subtract that off so that it converges
to something close to zero. So now we’re considering \(S_n - \mu\). But we also know
that the standard deviation goes down like \(1/\sqrt{n}\), so to get it to converge to
something stable, we have to multiply by \(\sqrt{n}\). So now we’re considering the
shifted and scaled sample mean \(\sqrt{n}\left(S_n - \mu\right)\).</p>
<p>The central limit theorem states that this converges <strong>in distribution</strong> to a normal
random variable with distribution \(N(0, \sigma^2)\). Notationally, you might see
mathematicians write</p>
\[\sqrt{n}\left(S_n-\mu\right)\ \xrightarrow{D} N(0,\sigma^2).\]
<p>What does it mean that they converge <strong>in distribution</strong>? It means that, over a fixed
region, the areas under the respective curves converge. Note that <strong>we have to fix the
region</strong> to get convergence. Let’s look at some pictures. First, note that we can plot the
exact distribution of the variable \(\sqrt{n}(S_n-\mu)\); it’s just a binomial random
variable, appropriately shifted and scaled. We’ll plot this alongside the normal
approximation \(N(0,\sigma^2)\).</p>
<!-- I'd like to have this centered. -->
<p><img src="/assets/images/clt.gif" alt="CLT gif" /></p>
<p>The area under the shaded part of the normal converges to the area of the bars in that
same shaded region. This is what convergence in distribution means.</p>
<p>Now for the crux. As we gather data, it becomes more and more obvious that our null
hypothesis is incorrect - that is, we move further and further out into the tail of the
null hypothesis’ distribution for \(S_n\). This is very intuitive - as we gather more
data, we expect our \(p\)-value to go down. The \(p\)-value is a tail integral of the
distribution, so we expect to be moving further and further into the tail of the
distribution.</p>
<p>Here’s a gif, where the shaded region represents the \(p\)-value that we’re calculating:</p>
<!-- I'd like to have this centered. -->
<p><img src="/assets/images/p-val.gif" alt="p-value gif" /></p>
<p>As we increase \(n\), the area we’re integrating changes. So we don’t get convergence
guarantees from the CLT.</p>
<h2 id="the-berry-esseen-theorem">The Berry-Esseen Theorem</h2>
<p>It’s worth noting that there is a stronger statement of convergence that applies
specifically to the convergence of the binomial distribution to the corresponding
Gaussian. It is called the <strong>Berry-Esseen Theorem</strong>, and it states that the maximum
distance between the cumulative distribution functions of the binomial and the
corresponding Gaussian is \(O(n^{-1/2})\). This claim, which is akin to uniform
convergence of functions (compare to the pointwise convergence of the CLT) does, in
fact, guarantee that our \(p\)-values will converge.</p>
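<p>Concretely, for i.i.d. summands with mean \(\mu\), variance \(\sigma^2\), and finite third absolute moment \(\rho = E|X_i - \mu|^3\), the theorem bounds the worst-case gap between the CDF \(F_n\) of the normalized sample mean and the standard normal CDF \(\Phi\), where \(C\) is an absolute constant:</p>

\[\sup_x \left| F_n(x) - \Phi(x) \right| \leq \frac{C\rho}{\sigma^3\sqrt{n}}.\]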
<p>But, as I’ve said above, this is immaterial, albeit interesting; we know already that
the \(p\)-values converge, and we also know that this is not enough for us to be
reporting one as an approximation of the other.</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote_clt" role="doc-endnote">
<p>So long as the variance of the distribution being sampled is finite. <a href="#fnref:fnote_clt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_sample" role="doc-endnote">
<p>You should decide this number based on some alternative hypothesis and
a power analysis. Also, you should ensure that you are sampling people evenly -
going to a park, for example, might bias your sample towards those that enjoy
rollerskating. <a href="#fnref:fnote_sample" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_AB" role="doc-endnote">
<p>I haven’t done a numerical test on this scenario because the true
distribution (the difference between two scaled binomials) is nontrivial to
calculate, and numerical issues arise as we calculate such small \(p\)-values, which
SciPy takes care of for us in the above example. But as I said, I would be
unsurprised if our Gaussian-approximated \(p\)-values are increasingly poor
approximations of the true \(p\)-value as we gather more samples. <a href="#fnref:fnote_AB" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_bin" role="doc-endnote">
<p>In our case, for a single Bernoulli random variable with parameter \(p\),
we have \(\mu=p\) and \(\sigma^2=p(1-p)\). <a href="#fnref:fnote_bin" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comIs hypothesis testing built upon a house of lies? No, probably not. But still, read this article.DS Interview Study Guide Part II: Software Engineering2019-08-29T00:00:00+00:002019-08-29T00:00:00+00:00http://www.pwills.com/posts/2019/08/29/engineering<p>This post continues my series on data science interviews. One of the major difficulties of
doing data science interviews is that you must show expertise in a wide variety of
skills. In particular, I see four key subject areas that you might be asked about during
an interview:</p>
<ol>
<li>Statistics</li>
<li>Software Engineering/Coding</li>
<li>Machine Learning</li>
<li>“Soft” Questions</li>
</ol>
<p>This post focuses on software engineering & coding. It will be primarily a resource for
aggregating content that I think you should be familiar with. I will mostly point to
outside sources for technical exposition and practice questions.</p>
<p>I’ll link to these as appropriate throughout the post, but I thought it would be helpful
to put up front a list of the primary resources that I’ve used when studying for
interviews. Some of my favorites are:</p>
<ul>
<li><a href="https://www.amazon.com/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275/">Data Structures and Algorithms in Python</a>, for a good introduction to
data structures such as linked lists, arrays, hashmaps, and so on. It also can give
you good sense of how to write idiomatic Python code, for building fundamental
classes.</li>
<li><a href="https://sqlzoo.net/">SQLZoo</a> for studying SQL and doing practice questions. I particularly like
the “assessments”.</li>
<li><a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a> for lots of practice questions organized by
subject, and good general advice for the technical interviewing process.</li>
</ul>
<p>I also use coding websites like LeetCode to practice various problems. I also look on
Glassdoor to see <a href="https://www.glassdoor.com/Interview/san-francisco-data-scientist-interview-questions-SRCH_IL.0,13_IM759_KO14,28.htm">what kinds of problems</a> people have been asked.</p>
<p>As always, I’m working to improve this post, so please do leave comments with feedback.</p>
<h1 id="what-languages-should-i-know">What Languages Should I Know?</h1>
<p>In this section of data science interviews, you are generally asked to implement things
in code. So, which language should you do it in? Generally, the best answer is
(unsurprisingly) that <strong>you should work in Python</strong>. The next most popular choice is R;
I’m not very familiar with R, so I can’t really speak to its capabilities.</p>
<p>There are a few reasons you should work in Python:</p>
<ol>
<li>It’s widely adopted within industry.</li>
<li>It has high-quality, popular packages for working with data (see <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">numpy</code>,
<code class="language-plaintext highlighter-rouge">scipy</code>, <code class="language-plaintext highlighter-rouge">statsmodels</code>, <code class="language-plaintext highlighter-rouge">scikit-learn</code>, <code class="language-plaintext highlighter-rouge">matplotlib</code>, etc).</li>
<li>It bridges the gap between academic work (e.g. using NumPy to build a fast solver for
differential equations) and industrial work (e.g. using Django to build webservices).</li>
</ol>
<p>This is far from an exhaustive list. Anyway, I mostly work in Python; I think it’s a
nice language because it is clear and simple to write.</p>
<p>If you want to use another language, you should make sure that you can do everything you
need to - this includes reading & writing data, cleaning/munging data, plotting,
implementing statistical and machine learning models, and leveraging basic data types
like hashmaps and arrays (more on those later).</p>
<p>I think if you wanted to do your interviews in R it would be fine, so long as you can do
the above. I would strongly recommend against languages like MATLAB, which are
proprietary rather than open-source.</p>
<p>Languages like Java can be tricky since they might not have the data-oriented libraries
that Python has. For example, I’ve worked professionally in Scala, and am very
comfortable manipulating data via the Spark API within it, but still wouldn’t want to
have to use it in an interview; it just isn’t as friendly for general-purpose hacking as
Python.</p>
<p>So is Python all you need? Well, not quite. You should also be familiar with SQL for
querying databases; we’ll get into that later. I don’t think the dialect you use
particularly matters. <a href="https://sqlzoo.net/">SQLZoo</a> works with MySQL, which is fine. Familiarity with
bash and shell-scripting is useful for a data scientist in their day-to-day work, but
generally isn’t asked about in interviews. For the interviews, I’d say if you know one
general-purpose language (preferably Python, or R if need be) and SQL, then you’ll be
fine.</p>
<h1 id="general-tips-for-coding-interviews">General Tips for Coding Interviews</h1>
<p>Coding interviews are notorious for being high-stress, so it’s important that you
practice in a way that will maximize your comfort during the interview itself - you
don’t want to add any unnecessary additional stress into an already difficult
situation. There are a wide variety of philosophies and approaches to preparing yourself
for and executing a successful interview. I’m going to talk about some points that
resonate with me, but I’d also recommend reading <a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a>
for a good discussion. Of course, this isn’t the final word on the topic - there are
endless resources available online that address this.</p>
<h2 id="how-to-prepare">How to Prepare</h2>
<p>When preparing for the interview, make sure to practice in an environment similar to the
interview environment. There are a few aspects of this to keep in mind.</p>
<ul>
<li>Make sure that you replicate the <strong>writing environment</strong> of the interview. So, if
you’ll be coding on a whiteboard, try to get access to a whiteboard to practice. At
least practice on a pad of paper, so that you’re comfortable with handwriting code -
it’s really quite different than using a text editor. If you’ll be coding in a Google
Doc, practice doing that (protip: use a monospaced font). Most places I’ve
interviewed at don’t let you evaluate your code to test it, so you have to be prepared
for that.</li>
<li><strong>Time yourself!</strong> It’s important to make sure you can do these things in a reasonable
amount of time. Generally, these things last 45 minutes per “round” (with multiple
rounds for on-site interviews). Focus on being efficient at implementing simple ideas,
so that you don’t waste a bunch of time with your syntax and things like that.</li>
<li><strong>Practice talking.</strong> If you practice by coding silently by yourself, then it might
feel strange when you’re in the interview and have to talk through your process. The
best is if you can have a friend who is familiar with interviewing play the
interviewer, so that you can talk to them, get asked questions, etc. You can also
record yourself and just talk to the recorder, so that you get practice externalizing
your thoughts.</li>
</ul>
<p>There are some services online that will do “practice” interviews for you. When I was
practicing for a software engineer interview with Google, I used <a href="http://www.gainlo.co/#!/">Gainlo</a> for
this - they were kind of expensive, but you interview with real Google software
engineers, which I found helpful.</p>
<p>However, the interviews for a software engineering position at Google are very
standardized in format. I haven’t used any of the services that do this for data
science, and since the interviews you’ll face are so varied, I imagine it is harder
to do helpful “mock interviews”. If you’ve used any of these services, I’d be very
curious to hear about your experience.</p>
<h2 id="tips-for-interviewing">Tips for Interviewing</h2>
<p>There are some things it’s important to keep in mind as you do the interview itself.</p>
<ul>
<li><strong>Talk about your thought process.</strong> Don’t just sit silently thinking, then go and
write something on the board. Let the interviewer into your mind so that they can see
how you are thinking about the problem. This is good advice at any point in a
technical interview.</li>
<li><strong>Start with a simple solution you have confidence in.</strong> If you know that you can
quickly write up a suboptimal solution (for a sorting question, maybe insertion sort),
then do that! You can discuss <em>why</em> that solution is sub-optimal, and they will often
brainstorm with you about how to improve it. That said, if you are just as confident
in writing up something more optimal (say, quicksort) then feel free to jump right to
that.</li>
<li><strong>Sketch out your solution before doing real code.</strong> This is not necessary, but
sometimes for complicated stuff it’s nice to write out your approach in pseudocode
before jumping into real code. This can also help with exposing your thought process
to the interviewer, and making sure they’re on board with how you’re thinking about
it.</li>
<li><strong>Think about edge cases.</strong> Suppose they ask you to write a function that sorts a
list. What if you’re given an empty list? What if you’re given a list of
non-comparable things? (In Python, this might be a list of lists.) What does your
function do in this case? Is that what you <em>want</em> it to do? There’s no right answer
here, but you should definitely be thinking about this and asking the interviewer how
they want the function to behave on these cases.</li>
<li><strong>Be sure to do a time complexity analysis on your solution.</strong> They want to know that
you can think about efficiency, so unless they explicitly ask you not to do this, I’d
recommend it. We’ll discuss more about what this means below.</li>
</ul>
<p>For a more thorough discussion of preparation and day-of techniques, I’d recommend
<a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a>.</p>
<h2 id="tips-for-coding">Tips for Coding</h2>
<p>There are a few things specifically in how the interviewee writes code that I think are
worth mentioning. This kind of stuff usually isn’t a huge deal, but if you write good
code, it can show professionalism and help leave a good impression.</p>
<ul>
<li><strong>Name your variables well.</strong> If the variable is the average number of users per
region, use <code class="language-plaintext highlighter-rouge">num_users_per_region</code>, or <code class="language-plaintext highlighter-rouge">users_per_region</code>, not <code class="language-plaintext highlighter-rouge">avg_usr</code> or
<code class="language-plaintext highlighter-rouge">num_usr</code>. Unlike in mathematics, it’s good to have long, descriptive variable names.</li>
<li><strong>Use built-ins when you can!</strong> Python already <em>has</em> functions for sorting, for
building cartesian products of lists, for implementing various models (in
<code class="language-plaintext highlighter-rouge">statsmodels</code> and <code class="language-plaintext highlighter-rouge">scikit-learn</code>), and endless other things. It also has some cool
data structures already implemented, like the <a href="https://docs.python.org/3.7/library/heapq.html"><code class="language-plaintext highlighter-rouge">heapq</code></a> heap and the
<a href="https://docs.python.org/3/library/queue.html"><code class="language-plaintext highlighter-rouge">queue</code></a> module. Get to know the <code class="language-plaintext highlighter-rouge">itertools</code> module; it has lots of useful stuff.
If you can use these built-ins effectively, it demonstrates skill and knowledge
without adding much effort on your part.</li>
<li><strong>Break things into functions.</strong> If one step of your code is sorting a list, and you
can’t use the built-in <code class="language-plaintext highlighter-rouge">sorted()</code> function, then write a separate function <code class="language-plaintext highlighter-rouge">def
sort()</code> before you write your main function. This increases both readability and
testability of code, and is essential for real-world software.</li>
<li><strong>Write idiomatic Python.</strong> This is a bit less important, but make sure to iterate
directly over iterables, don’t do <code class="language-plaintext highlighter-rouge">for i in range(len(my_iterable))</code>. Also,
familiarize yourself with <code class="language-plaintext highlighter-rouge">enumerate</code> and <code class="language-plaintext highlighter-rouge">zip</code> and know how to use them. Know how to
use list comprehensions, and be aware that you can do a similar thing for
dictionaries, sets, and even arguments of functions - for example, you can do
<code class="language-plaintext highlighter-rouge">max(item for item in l if item % 2 == 0)</code> to find the maximum even number in l. Know
how to do string formatting using either <code class="language-plaintext highlighter-rouge">.format()</code> or <code class="language-plaintext highlighter-rouge">f</code>-strings in Python
3.<sup id="fnref:fnote_py3" role="doc-noteref"><a href="#fn:fnote_py3" class="footnote">1</a></sup></li>
</ul>
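<p>As a quick illustration of a few of these points, here’s a small sketch (the data is
made up for the example) showing <code class="language-plaintext highlighter-rouge">enumerate</code>, <code class="language-plaintext highlighter-rouge">zip</code>, a comprehension, and
<code class="language-plaintext highlighter-rouge">heapq.nlargest</code> in action:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import heapq

names = ['ana', 'bo', 'cy']
scores = [88, 95, 71]

# zip pairs up parallel lists; enumerate adds an index
for rank, (name, score) in enumerate(zip(names, scores), start=1):
    print(f"{rank}. {name}: {score}")

# a comprehension inside a function call, no intermediate list needed
top_even = max(s for s in scores if s % 2 == 0)

# heapq.nlargest finds the k largest items without a full sort
top_two = heapq.nlargest(2, scores)
</code></pre></div></div>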
<p>I’m only scratching the surface of how to write good code. It helps to read code that
others have written to see what you don’t know. You can also look at code in large
open-source libraries.</p>
<p>With all that said, let’s move on to some of the content that might be asked about in
these interviews.</p>
<h1 id="working-with-data">Working with Data</h1>
<p>One of the fundamental tasks of a data scientist is to load, manipulate, clean, and
visualize data in various formats. I’ll go through some of the basic tasks that I think
you should be able to do, and either include or link to Python implementations. If you
work in R, or any other language, you should make sure that you can still do these
things in your preferred language.</p>
<p>In Python, the key technologies are the packages pandas (for loading, cleaning, and
manipulating data), numpy (for efficiently working with unlabeled numeric data), and
matplotlib (for plotting and visualizing data).</p>
<h2 id="loading--cleaning-data">Loading & Cleaning Data</h2>
<p><a href="https://www.datacamp.com/community/tutorials/pandas-read-csv">This tutorial on DataCamp</a> nicely deals with the basics of using
<code class="language-plaintext highlighter-rouge">pd.read_csv()</code> to load data into Pandas. It is also possible to load from other
formats, but in my experience writing to and from comma- or tab-separated plaintext is
by far the most common approach for datasets that fit in memory.<sup id="fnref:fnote_parquet" role="doc-noteref"><a href="#fn:fnote_parquet" class="footnote">2</a></sup></p>
<p>For example, suppose you had the following data in a csv file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name,age,country,favorite color
steve,7,US,green
jennifer,14,UK,blue
franklin,,UK,black
calvin,22,US,
</code></pre></div></div>
<p>You can copy and paste this into Notepad or whatever text editor you
like<sup id="fnref:fnote_emacs" role="doc-noteref"><a href="#fn:fnote_emacs" class="footnote">3</a></sup>, and save it as <code class="language-plaintext highlighter-rouge">data.csv</code>.</p>
<p>You should be able to</p>
<ul>
<li>Load in data from text, whether it is separated by commas, tabs, or some other
arbitrary character (sometimes things are separated by the “pipe” character <code class="language-plaintext highlighter-rouge">|</code>). In
this case, you can just do <code class="language-plaintext highlighter-rouge">df = pd.read_csv('data.csv')</code> to load it.</li>
<li>Filter for missing data. If you wanted to find the row(s) where the age is missing,
for example, you could do <code class="language-plaintext highlighter-rouge">df[df['age'].isnull()]</code></li>
<li>Filter for data values. For example, to find people from the US, do <code class="language-plaintext highlighter-rouge">df[df['country'] == 'US']</code></li>
<li>Replace missing data; use <code class="language-plaintext highlighter-rouge">df.fillna(0)</code> to replace missing data with zeros. Think for
yourself about how you would want to handle missing data in this case - does it make
sense to replace everything with zeros? What <em>would</em> make sense?</li>
</ul>
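<p>Putting these operations together on the example file above (I’m using <code class="language-plaintext highlighter-rouge">io.StringIO</code>
here so the snippet is self-contained, but <code class="language-plaintext highlighter-rouge">pd.read_csv('data.csv')</code> works the same way):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import io

import pandas as pd

csv_text = """name,age,country,favorite color
steve,7,US,green
jennifer,14,UK,blue
franklin,,UK,black
calvin,22,US,
"""

df = pd.read_csv(io.StringIO(csv_text))

missing_age = df[df['age'].isnull()]   # franklin's row
from_us = df[df['country'] == 'US']    # steve and calvin

# fillna can take a dict, so you can pick a sensible value per column
filled = df.fillna({'age': 0, 'favorite color': 'unknown'})
</code></pre></div></div>
<p>Note that replacing a missing age with zero is probably <em>not</em> what you want in a real
analysis; the point here is just the mechanics.</p>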
<p>Dealing with missing data is, in particular, an important problem, and not one that has
an easy answer. <a href="https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4">Towards Data Science</a> has a decent post on this
subject, but if you’re curious, there’s a lot to read about and learn here.</p>
<p>More advanced topics in pandas-fu include <a href="https://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/">using <code class="language-plaintext highlighter-rouge">groupby</code></a>, joining
dataframes (this is called a “merge” in pandas, but works the same as a SQL join), and
<a href="https://hackernoon.com/reshaping-data-in-python-fa27dda2ff77">reshaping data</a>.</p>
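<p>As a rough sketch of the first two of those topics (the column names and values here
are invented for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3],
                      'region': ['west', 'east', 'west']})
orders = pd.DataFrame({'user_id': [1, 1, 3],
                       'amount': [10.0, 5.0, 8.0]})

# SQL-style join: pandas calls this a merge
joined = users.merge(orders, on='user_id', how='left')

# groupby plus an aggregation: total order amount per region
per_region = joined.groupby('region')['amount'].sum()
</code></pre></div></div>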
<p>As I said before, loading and manipulating data is one of the fundamental tasks of a
data scientist. You should probably be comfortable doing most or all of these tasks if
asked. Pandas can be a bit unintuitive, so I’d recommend practicing if you aren’t
already comfortable with it. Doing slicing and reshaping tasks in numpy is also an
important skill, so make sure you are comfortable with that as well.</p>
<h2 id="visualization">Visualization</h2>
<p>Another essential aspect of data work is visualization. Of course, this is an entire
field unto itself; here, I’ll mostly be focusing on the practical aspects of making
simple plots. If you want to start to learn more about the overarching principles of the
visual representation of data, <a href="https://www.edwardtufte.com/tufte/books_vdqi">Tufte’s book</a> is the classic in the field.</p>
<p>In Python, the fundamental tool used for data visualization is the library
<code class="language-plaintext highlighter-rouge">matplotlib</code>. There exist many other libraries for more complicated visualization tasks,
such as <code class="language-plaintext highlighter-rouge">seaborn</code>, <code class="language-plaintext highlighter-rouge">bokeh</code>, and <code class="language-plaintext highlighter-rouge">plotly</code>, but the only one that you really <em>need</em> to be
comfortable with (in my opinion) is <code class="language-plaintext highlighter-rouge">matplotlib</code>.</p>
<p>You should be comfortable with:</p>
<ul>
<li>plotting two lists against one another</li>
<li>changing the labels on the x- and y-axis of your plot, and adding a title</li>
<li>changing the x- and y-limits of your plot</li>
<li>plotting a bar graph</li>
<li>plotting a histogram</li>
<li>plotting two curves together, labelling them, and adding a legend</li>
</ul>
<p>I won’t go through the details here - I’m sure you can find many good guides to each of
these online. The <a href="https://matplotlib.org/3.1.1/tutorials/introductory/pyplot.html">matplotlib pyplot tutorial</a> is a good place to
start.<sup id="fnref:fnote_pyplot" role="doc-noteref"><a href="#fn:fnote_pyplot" class="footnote">4</a></sup></p>
<p>It’s worth noting that you can plot directly from pandas, by doing <code class="language-plaintext highlighter-rouge">df.plot()</code>. This
just calls out to matplotlib and plots your dataframe; I will often find myself both
plotting from the pandas <code class="language-plaintext highlighter-rouge">DataFrame.plot()</code> method as well as directly using
<code class="language-plaintext highlighter-rouge">pyplot.plot()</code>. They work on the same objects, and so you can use them together to make
more complicated plots with multiple values plotted.</p>
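<p>Here’s a minimal sketch covering several of the items above (the data is invented, and
the <code class="language-plaintext highlighter-rouge">Agg</code> backend is set so it runs without a display; drop that line in a notebook):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3, 4]
ys_linear = [0, 1, 2, 3, 4]
ys_square = [0, 1, 4, 9, 16]

fig, ax = plt.subplots()
ax.plot(xs, ys_linear, label='linear')  # two curves on one axes
ax.plot(xs, ys_square, label='square')
ax.set_xlabel('x')                      # axis labels and title
ax.set_ylabel('y')
ax.set_title('Two curves')
ax.set_xlim(0, 4)                       # axis limits
ax.legend()                             # legend built from the labels
fig.savefig('two_curves.png')
</code></pre></div></div>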
<h1 id="data-structures--algorithms">Data Structures & Algorithms</h1>
<p>Designing and building effective software is predicated on a solid understanding of the
basic data structures that are available, and familiarity with the ways that they are
employed in common algorithms. For me, learning this material opened up the world of
software engineering - it illuminated the inner workings of computer languages. It also
helped me understand the pros and cons of various approaches to problems, in ways that I
wouldn’t have been able to before.</p>
<p>This subject is fundamental to software engineering interviews, but for data scientists,
its importance can vary drastically from role to role. For engineering-heavy roles, this
material can make up half or more of the interview, while for more statistician-oriented
roles, it might only be very lightly touched upon. You will have to use your judgement
to determine to what extent this material is important to you.</p>
<p>I learned this material when I was interviewing by reading the book <a href="https://www.amazon.com/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275/">Data Structures and
Algorithms in Python</a>.<sup id="fnref:fnote_dsa2" role="doc-noteref"><a href="#fn:fnote_dsa2" class="footnote">5</a></sup> It’s really a great book - it has good, clear
explanations of all the important topics, including complexity analysis and some of the
basics of the Python language. I can’t recommend it highly enough if you want to get
more familiar with this material.<sup id="fnref:fnote_dsa" role="doc-noteref"><a href="#fn:fnote_dsa" class="footnote">6</a></sup> You can buy it, or look around online for
the PDF - it shouldn’t be too hard to find.</p>
<h2 id="time-and-space-complexity-analysis">Time and Space Complexity Analysis</h2>
<p>Before you begin writing algorithms, you need to know how to analyze their
complexity. The “complexity” of an algorithm tells you how the amount of time (or space)
that the algorithm takes depends on the size of the input data.</p>
<p>It is formalized using the so-called “big-O” notation. The precise mathematical
definition of \(\mathcal{O}(n)\) is somewhat confusing, so you can just think of it
roughly as meaning that an algorithm that is \(\mathcal{O}(n)\) “scales like \(n\)”; so,
if you double the input size, you double the amount of time it takes. If an algorithm is
\(\mathcal{O}(n^3)\), then, doubling the input size means that you multiply the time it
takes by \(2^3 = 8\).<sup id="fnref:fnote_bigo" role="doc-noteref"><a href="#fn:fnote_bigo" class="footnote">7</a></sup> You can see how even an \(\mathcal{O}(n^2)\) algorithm wouldn’t
work for large data; even if it runs in a reasonable amount of time (say, 5 seconds) for
10,000 points, it would take roughly 1,600 years to run on 1 billion data
points (a factor of \(10^{10}\) more time). Obviously, this is no good.</p>
<p>So complexity analysis is critical. You don’t want to settle for a \(\mathcal{O}(n^2)\)
solution when a \(\mathcal{O}(n)\) or \(\mathcal{O}(n \log n)\) solution is available. I
won’t get into how to do the analysis here, besides saying that I often like to annotate
my loops with their complexity when I’m writing things. For example, here’s a (slow)
approach to finding the largest k (unique) numbers in a list:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_top_k</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">input_list</span><span class="p">):</span>
<span class="n">top_k</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k</span><span class="p">):</span> <span class="c1"># happens k times
</span> <span class="n">remaining</span> <span class="o">=</span> <span class="p">[</span><span class="n">num</span> <span class="k">for</span> <span class="n">num</span> <span class="ow">in</span> <span class="n">input_list</span> <span class="k">if</span> <span class="n">num</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">top_k</span><span class="p">]</span> <span class="c1"># O(n)
</span> <span class="k">if</span> <span class="n">remaining</span><span class="p">:</span>
<span class="n">top_remaining</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">remaining</span><span class="p">)</span> <span class="c1"># O(n)
</span> <span class="n">top_k</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">top_remaining</span><span class="p">)</span> <span class="c1"># O(1)
</span> <span class="k">return</span> <span class="n">top_k</span>
</code></pre></div></div>
<p>I know that the outer loop happens <code class="language-plaintext highlighter-rouge">k</code> times, and since finding the maximum of a list is
\(\mathcal{O}(n)\), the total task is \(\mathcal{O}(nk)\).<sup id="fnref:fnote_asymptotics" role="doc-noteref"><a href="#fn:fnote_asymptotics" class="footnote">8</a></sup> To learn
more about how to do complexity analysis, I’d look at <a href="https://www.amazon.com/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275/">DS&A</a>, <a href="http://www.crackingthecodinginterview.com/">Cracking the
Coding Interview</a>, or just look around online - I’m sure there are plenty of good
resources out there.</p>
<p>You can also consider not just the time of computation, but the amount of memory (space)
that your algorithm uses. This is not quite as common as time-complexity analysis, but
is still important to be able to do.</p>
<p>A very useful resource for anyone studying for a coding interview is the <a href="https://www.bigocheatsheet.com/">big-O cheat
sheet</a>, which shows the complexity of access, search, insertion, and deletion for
various data types, as well as the complexity of searching algorithms, and a lot more. I
often use it as a reference, but of course it’s important that you understand <em>why</em> (for
example) an array has \(\mathcal{O}(n)\) insertion. Just memorizing complexities won’t
help you much.</p>
<h2 id="arrays--hashmaps">Arrays & Hashmaps</h2>
<p>In my opinion, the two essential data structures for a data scientist to know are
the array and the hashmap. In Python, the <code class="language-plaintext highlighter-rouge">list</code> type is an array, while the <code class="language-plaintext highlighter-rouge">dict</code> type
is a hashmap. Since both are used so commonly, you have to know their properties if you
want to be able to design efficient algorithms and do your complexity analysis
correctly.</p>
<p><strong>Arrays</strong> are a data type where a piece of data (like a string) is linked to an index
(in Python, this is an integer, starting with 0). I won’t go too deep into the details
here, but for arrays, the important thing to know is that getting any element of an
array is easy (i.e. doing <code class="language-plaintext highlighter-rouge">mylist[5]</code> is \(\mathcal{O}(1)\), so it doesn’t depend on the
size of the array) but adding elements (particularly in the beginning or middle of the
array) is difficult; doing <code class="language-plaintext highlighter-rouge">mylist.insert(k, 'foo')</code> is \(\mathcal{O}(n-k)\), where
\(k\) is the position you wish to insert at.<sup id="fnref:fnote_linked" role="doc-noteref"><a href="#fn:fnote_linked" class="footnote">9</a></sup></p>
<p>Arrays are what we usually use when we’re building simple, unlabelled collections of
objects in Python. This is fine, since insertion at the end of an array is fast, and
we’re often accessing slices of arrays in a complicated fashion (particularly in
numpy). I generally use arrays by default, without thinking too much about it, and it
generally works out alright.</p>
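<p>You can see this asymmetry directly with the standard library’s <code class="language-plaintext highlighter-rouge">timeit</code> module (a
quick sketch; the exact timings will vary by machine, but the ratio is the point):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import timeit

n = 20_000

# appending at the end is amortized O(1)
t_append = timeit.timeit('l.append(0)', setup='l = []', number=n)

# inserting at the front is O(n): every element must shift over
t_front = timeit.timeit('l.insert(0, 0)', setup='l = []', number=n)

print(f'append: {t_append:.3f}s, front insert: {t_front:.3f}s')
</code></pre></div></div>
<p>The front-insert version comes out dramatically slower as <code class="language-plaintext highlighter-rouge">n</code> grows, which is
exactly the \(\mathcal{O}(n-k)\) behavior described above.</p>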
<p><strong>Hashmaps</strong> also link values to keys, but in this case the key can be anything you
want, rather than having to be an ordered set of integers. In Python, you build them by
specifying the key and the value, like <code class="language-plaintext highlighter-rouge">{'key': 'value'}</code>. Hashmaps are magical in that
accessing elements <em>and</em> adding elements are both
\(\mathcal{O}(1)\).<sup id="fnref:fnote_array_hashmap" role="doc-noteref"><a href="#fn:fnote_array_hashmap" class="footnote">10</a></sup> Why is this cool? Well, say you wanted to
store a bunch of people’s names and ages. You might think to do a list of tuples:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">name_ages</span> <span class="o">=</span> <span class="p">[(</span><span class="s">'Peter'</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span> <span class="p">(</span><span class="s">'Kat'</span><span class="p">,</span> <span class="mi">25</span><span class="p">),</span> <span class="p">(</span><span class="s">'Jeff'</span><span class="p">,</span> <span class="mi">41</span><span class="p">)]</span>
</code></pre></div></div>
<p>Then, if you wanted to find out Jeff’s age, you would have to iterate through the list
and find the correct tuple:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span> <span class="ow">in</span> <span class="n">name_ages</span><span class="p">:</span> <span class="c1"># happens n times
</span> <span class="k">if</span> <span class="n">name</span> <span class="o">==</span> <span class="s">'Jeff'</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">f"Jeff's age is </span><span class="si">{</span><span class="n">age</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<p>This is \(\mathcal{O}(n)\) - not very efficient. With hashmaps, you can just do</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">name_ages</span> <span class="o">=</span> <span class="p">{</span><span class="s">'Peter'</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span> <span class="s">'Kat'</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span> <span class="s">'Jeff'</span><span class="p">:</span> <span class="mi">41</span><span class="p">}</span>
<span class="k">print</span><span class="p">(</span><span class="s">f"Jeff's age is </span><span class="si">{</span><span class="n">name_ages</span><span class="p">[</span><span class="s">'Jeff'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> <span class="c1"># O(1)! Wow!
</span></code></pre></div></div>
<p>It might not be obvious how cool this is until you see how to use it in
problems. <a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a> has lots of good problems on hashmaps,
but I’ll just reproduce some of the classics here. I think it’s worth knowing these,
because they really can give you an intuitive sense of when and how hashmaps are
valuable.</p>
<p>The first classic hashmap algorithm is <strong>counting frequencies of items in a list.</strong> That
is, given a list, you want to know how many times each item appears. You can do this via
the following:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_freqs</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
<span class="n">freqs</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">l</span><span class="p">:</span> <span class="c1"># happens O(n) times
</span> <span class="k">if</span> <span class="n">item</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">freqs</span><span class="p">:</span> <span class="c1"># This check is O(1)! Wow!
</span> <span class="n">freqs</span><span class="p">[</span><span class="n">item</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">freqs</span><span class="p">[</span><span class="n">item</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="c1"># Also O(1)! Wow!
</span> <span class="k">return</span> <span class="n">freqs</span>
</code></pre></div></div>
<p>Try to think of how you’d do this <em>without</em> hashmaps. Probably, you’d sort the list,
and then look at adjacent values. But sorting is, at best, \(\mathcal{O}(n \log n)\). This
solution does it in \(\mathcal{O}(n)\)!</p>
<p>Another classic problem that is solved with hashmaps is to <strong>find all repeated elements
in a list.</strong> This is really just a variant of the last, where you look for elements that
have frequency greater than 1.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_repeated</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">get_freqs</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="k">return</span> <span class="p">[</span><span class="n">item</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">f</span> <span class="k">if</span> <span class="n">f</span><span class="p">[</span><span class="n">item</span><span class="p">]</span> <span class="o">></span> <span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<p>Now, if you only need <em>one</em> repeated element, you can be efficient and just terminate on
the first one you find. For this, we’ll use a <code class="language-plaintext highlighter-rouge">set</code>, which you can think of as a <code class="language-plaintext highlighter-rouge">dict</code>
that stores only keys, with no values. That is to say, <strong>sets are also hashmaps</strong>. The important thing to know is
that adding to them and checking if something is in them are both \(\mathcal{O}(1)\).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_repeated</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
<span class="n">items</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">l</span><span class="p">:</span> <span class="c1"># happens O(n) times
</span> <span class="k">if</span> <span class="n">item</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">items</span><span class="p">:</span> <span class="c1"># This check is O(1)! Wow!
</span> <span class="n">items</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">None</span> <span class="c1"># if this happens, all elements are unique
</span></code></pre></div></div>
<p>The last one we’ll do is a bit trickier. You’re given a list of numbers, and a “target”,
and your task is to find a pair of numbers in the list that add up to the target. Try
and think for yourself how you’d do this - the fact that you’ll use hashmaps is a big hint. You
should be able to do it in \(\mathcal{O}(n)\).</p>
<p>Have you thought about it? When I first encountered this one I had to look up the
answer. But here’s how you do it in \(\mathcal{O}(n)\):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_sum_pair</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">target</span><span class="p">):</span>
<span class="n">nums_set</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">num</span> <span class="ow">in</span> <span class="n">l</span><span class="p">:</span>
<span class="n">other_num</span> <span class="o">=</span> <span class="n">target</span><span class="o">-</span><span class="n">num</span>
<span class="k">if</span> <span class="n">other_num</span> <span class="ow">in</span> <span class="n">nums_set</span><span class="p">:</span>
<span class="k">return</span> <span class="p">(</span><span class="n">num</span><span class="p">,</span> <span class="n">other_num</span><span class="p">)</span>
<span class="n">nums_set</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">num</span><span class="p">)</span> <span class="c1"># no-op if num is already there
</span> <span class="k">return</span> <span class="bp">None</span>
</code></pre></div></div>
<p>Note that <code class="language-plaintext highlighter-rouge">other_num = target-num</code> is the number that you would need to complete the sum
pair; using a hashmap, you can check in \(\mathcal{O}(1)\) if you’ve already seen it!
Wow!</p>
<p>Hopefully you get it - hashmaps are cool. Go on LeetCode, or pop open <a href="https://www.amazon.com/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275/">your favorite
data structures book</a>, or even <a href="http://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a>, and
get some practice with them.</p>
<h2 id="sorting--searching">Sorting & Searching</h2>
<p>Sorting and searching are two of the basic tasks you have to be familiar with for any
coding interview. You can go into a lot of depth with these, but I’ll stick to the
basics here, because that’s what I find most helpful.</p>
<h3 id="sorting">Sorting</h3>
<p><strong>Sorting</strong> is a nice problem in that its statement is fairly
straightforward: given a list of numbers, reorder the list so that every element is less
than or equal to the next. There are a number of approaches to sorting. The naive
approach is called <a href="https://en.wikipedia.org/wiki/Insertion_sort"><strong>insertion sort</strong></a>; it is, for example, what most people
do when sorting a hand of cards. It has some advantages (it’s simple, and fast on
nearly-sorted input), but it is \(\mathcal{O}(n^2)\) in time, and so is far from the most efficient option.</p>
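<p>For concreteness, here is one way insertion sort might be sketched in Python (a minimal version, not tuned for performance):</p>

```python
def insertion_sort(l):
    # Grow a sorted prefix one element at a time, the way you'd sort a
    # hand of cards: take the next element and slide it left until it
    # lands in its place within the already-sorted prefix.
    for i in range(1, len(l)):
        item = l[i]
        j = i - 1
        while j >= 0 and l[j] > item:  # shift larger elements right
            l[j + 1] = l[j]
            j -= 1
        l[j + 1] = item
    return l

print(insertion_sort([3, 1, 4, 1, 5]))  # [1, 1, 3, 4, 5]
```

The nested loops are where the \(\mathcal{O}(n^2)\) comes from; on nearly-sorted input the inner loop barely runs, which is the advantage mentioned above.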
<p>The two most common fast sorting algorithms are <a href="https://en.wikipedia.org/wiki/Quicksort"><strong>quicksort</strong></a> and
<a href="https://en.wikipedia.org/wiki/Merge_sort"><strong>mergesort</strong></a>. They are both \(\mathcal{O}(n \log n)\) in
time,<sup id="fnref:fnote_sort" role="doc-noteref"><a href="#fn:fnote_sort" class="footnote">11</a></sup> and so scale close-to-linearly with the size of the list. I won’t go
into the implementation details here; there are plenty of good discussions of them
available on the internet.</p>
<p>When thinking about sorting, it’s also worth considering space complexity -
can you sort without needing to carry around a second sorted copy of the list? If so,
that’s a significant advantage, especially for larger lists. It’s also worth thinking
about worst-case vs. average performance - how does the algorithm perform on a randomly
shuffled list, and how does it perform on a list specifically designed to take the
maximum number of steps for that algorithm to sort? Quicksort, for example, is actually
\(\mathcal{O}(n^2)\) in the worst case, but is \(\mathcal{O}(n \log n)\) on
average. Again, you can look to the <a href="https://www.bigocheatsheet.com/">big-O cheat sheet</a> to make sure you’re
remembering all your complexities correctly.</p>
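<p>Since the discussion above skips implementation details, here is one minimal mergesort sketch (quicksort is similar in spirit; this version is deliberately simple, not in-place):</p>

```python
def merge_sort(l):
    # Split, sort each half recursively, then merge the two sorted
    # halves in a single linear pass: O(n log n) time overall.
    if len(l) <= 1:
        return l
    mid = len(l) // 2
    left = merge_sort(l[:mid])
    right = merge_sort(l[mid:])
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two
    merged.extend(right[j:])  # extends is non-empty
    return merged

print(merge_sort([5, 2, 9, 1]))  # [1, 2, 5, 9]
```

Note that this version allocates new lists at every level of recursion, which is exactly the space-complexity tradeoff discussed above.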
<h3 id="searching">Searching</h3>
<p>The problem of <strong>searching</strong> is often stated as: <strong>given a sorted list <code class="language-plaintext highlighter-rouge">l</code> and an object
<code class="language-plaintext highlighter-rouge">x</code>, find the index at which <code class="language-plaintext highlighter-rouge">x</code> lives.</strong> (You should immediately ask: what
should I return if <code class="language-plaintext highlighter-rouge">x</code> is not in <code class="language-plaintext highlighter-rouge">l</code>?) The name of the game here is <strong>binary
search</strong>. You basically split the list, then if the number is greater than the split,
search the top; otherwise, search the bottom. This is an example of a <em>recursive
algorithm</em>, so the way it’s written can be a bit opaque to those not used to looking at
recursive code. Once I wrapped my head around it, though, I found it quite elegant. The
important thing to know is that this search is \(\mathcal{O}(\log n)\), which means that
you don’t touch every element in the list - it’s very fast, even for a large list. The
key to this is that the list is already sorted - if it’s not sorted, then you’re out of
luck; you’ve got to check every element to find <code class="language-plaintext highlighter-rouge">x</code>.</p>
<p>There are tons of examples of binary search in Python online, so I won’t put one
here. That said, I have found it interesting to see how thinking in terms of binary
search can help you in a variety of areas.</p>
<p>For example, suppose you had some eggs, and worked in a 40-story building, and wanted to
know the highest floor you could drop the egg off of without it breaking (it’s kind of a
dumb example because the egg would probably break even on the first floor, but pretend
it’s a super-tough egg.) You could drop it from the first floor, and see what
happens. Say it doesn’t break. Then drop it from the 40th, and see what happens. Say it
does break. Then, you bisect and use the midpoint - drop from the 20th floor. If it
breaks here, you next try the 10th; if it doesn’t, you next try the 30th. This allows
you to find the correct floor much faster than trying each floor in succession.</p>
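<p>To make that concrete, here is a sketch of the bisection strategy, with a hypothetical <code class="language-plaintext highlighter-rouge">breaks(floor)</code> predicate standing in for an actual egg drop (the function name and setup are mine, invented for illustration):</p>

```python
def highest_safe_floor(breaks, lo=1, hi=40):
    # Binary-search for the highest floor the egg survives.
    # `breaks` is a hypothetical predicate: breaks(f) is True if the
    # egg breaks when dropped from floor f. We assume it's monotone:
    # once it breaks, it breaks from every higher floor too.
    if breaks(lo):
        return None  # breaks even from the lowest floor
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the range always shrinks
        if breaks(mid):
            hi = mid - 1  # the answer is strictly below mid
        else:
            lo = mid      # floor mid is safe
    return lo

print(highest_safe_floor(lambda f: f > 27))  # 27
```

Each drop halves the remaining range, so you find the answer in \(\mathcal{O}(\log n)\) drops instead of \(\mathcal{O}(n)\).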
<p>Sorting and searching are fundamental algorithms, and have been well studied for
decades. Having a basic fluency in them shows a familiarity with the field of computer
science that many employers like to see. In my opinion, <strong>you should be able to quickly
and easily implement the three sorting algorithms above, and binary search,</strong> in Python,
or whatever your language of choice is.</p>
<h1 id="working-with-sql">Working with SQL</h1>
<p>Finally, let’s talk a bit about SQL. SQL is a tool used to interact with so-called
“relational” databases, which just means that each row in a table has certain values
(columns), and that those values have the same type for each row (that is, the schema is
uniform throughout the table).<sup id="fnref:fnote_nosql" role="doc-noteref"><a href="#fn:fnote_nosql" class="footnote">12</a></sup> SQL is not exactly a single language; it’s more
like a family of languages. There are many “dialects” which all have slight differences,
but they behave the same with regards to core functionality; for example, you can do</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">column</span> <span class="k">FROM</span> <span class="k">table</span> <span class="k">WHERE</span> <span class="n">column</span> <span class="o">=</span> <span class="s1">'value'</span>
</code></pre></div></div>
<p>in any SQL-like language.<sup id="fnref:fnote_ansi" role="doc-noteref"><a href="#fn:fnote_ansi" class="footnote">13</a></sup> Modern data-storage and -access solutions like
Spark and Presto are very different from older databases in their underlying
architecture, but still use a SQL dialect for accessing data.</p>
<p>Solving problems in SQL involves thinking in a quite different way than solving a
similar problem on an array in Python. There is no real notion of iteration, or at least
it’s not easily accessible, so most of the complicated action happens via table joins. I
used <a href="https://sqlzoo.net/">SQLZoo</a>, and particularly the “assessments”, to practice my SQL and get it
up to snuff. LeetCode also has a SQL section (I think they call it “database”).</p>
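<p>As a small illustration of “joins instead of loops” (the table and column names here are invented for the example):</p>

```sql
-- Hypothetical tables: users(id, name) and orders(user_id, amount).
-- "For each user, count their orders and total their spending":
-- a loop in Python, but a join plus GROUP BY in SQL.
SELECT u.name,
       COUNT(o.user_id) AS n_orders,
       SUM(o.amount)    AS total_spent
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.name;
```

The <code class="language-plaintext highlighter-rouge">LEFT JOIN</code> keeps users with no orders at all, which an inner join would silently drop; thinking through that kind of detail is most of the work in SQL problems.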
<p>It’s essential to know SQL as a working data scientist. You’ll almost certainly use it
in your day-to-day activities. That said, it’s not always asked in the interviews, so
you might clarify with the company whether they will ask you SQL questions.</p>
<h2 id="a-note-on-dialects">A Note on Dialects</h2>
<p>There are many dialects of SQL, and changing the dialect changes things like (for
example) how you work with dates. It’s worth asking the company you’re interviewing with
what dialect they want you to know, if they have one in mind. If you’re just writing SQL
on a whiteboard, then I would be surprised if they were picky about this; I would just
say something like “here I’d use <code class="language-plaintext highlighter-rouge">DATE(table.dt_str)</code> or whatever the string-to-date
conversion function is in your dialect”. In this case it’s just details that move
around, but the big picture is generally the same for different dialects.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Coding interviews are stressful. From what I can tell, that’s just the way it is. For
me, the best antidote to that is being well-prepared. I think companies are moving more
towards constructive, cooperative interview formats, and away from the classic Google
brain-teaser kind of questions, which helps with this, but you can still expect to be
challenged during these interviews.</p>
<p>Remember to be kind to yourself. You’ll probably fail many times before you
succeed. That’s fine, and is what happens to almost everyone. Just keep practicing, and
keep learning from your mistakes. Good luck!</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote_py3" role="doc-endnote">
<p>You should be using Python 3 at this point, but also be familiar with the
differences between 2 and 3, and be able to write code in Python 2 if need be. <a href="#fnref:fnote_py3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_parquet" role="doc-endnote">
<p>For “big data” stored in the cloud, an efficient format called Parquet
is the standard. In my experience, however, it’s uncommon to work with parquet files
directly in Pandas; you often read them into a distributed framework like Spark and work
with them in that context. <a href="#fnref:fnote_parquet" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_emacs" role="doc-endnote">
<p>The correct answer is, of course, emacs. <a href="#fnref:fnote_emacs" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_pyplot" role="doc-endnote">
<p><code class="language-plaintext highlighter-rouge">pyplot</code> is an API within matplotlib that was designed in order to
mimic the MATLAB plotting API. It is generally what I use; I begin most of my matplotlib
work with <code class="language-plaintext highlighter-rouge">from matplotlib import pyplot as plt</code>. I only rarely need to <code class="language-plaintext highlighter-rouge">import
matplotlib</code> directly, and that’s generally for configuration work. <a href="#fnref:fnote_pyplot" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_dsa2" role="doc-endnote">
<p>I read the book when preparing for a software engineer interview at
Google, so I picked up a lot more than was necessary for a data science interview. I
still find the material helpful, however, and it’s nice to be able to demonstrate
that you have gone above and beyond in a realm that data scientists sometimes
neglect (efficient software design). <a href="#fnref:fnote_dsa2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_dsa" role="doc-endnote">
<p>It goes well beyond what you’ll need for a data science interview,
however - it gets into tree structures, graphs (and graph traversal algorithms), and
other more advanced topics. I’d recommend focusing on complexity analysis, arrays,
and hashmaps as the most important data structures that a data scientist will use
day-to-day. <a href="#fnref:fnote_dsa" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_bigo" role="doc-endnote">
<p>This is only approximately true, or rather it is <em>asymptotically</em>
true; this scaling law holds in the limit as \(n\rightarrow\infty\). <a href="#fnref:fnote_bigo" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_asymptotics" role="doc-endnote">
<p>It’s a bit weird to use <em>both</em> \(n\) and \(k\) in your
complexity - mathematically, what this means is that we consider them separate
variables, and we can take the limit of either one independently of the
other. If, for example, you knew that \(k = n/4\), so you always wanted the top
quarter of the list, then this would be \(\mathcal{O}(n^2)\), since \(n/4 =
\mathcal{O}(n)\). <a href="#fnref:fnote_asymptotics" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_linked" role="doc-endnote">
<p>I’m glossing over some details here - the numbers I quote above are for
a fixed-size array. So, if you build up an array by adding elements at the end, it
may seem like you get to just do a bunch of \(\mathcal{O}(1)\) <code class="language-plaintext highlighter-rouge">.append</code>s, but in
reality, you have to occasionally resize the array to make more space; a resizing
append costs \(\mathcal{O}(n)\), though the <em>amortized</em> cost per append is still \(\mathcal{O}(1)\). If you want a list-like
type where inserting elements is easy (\(\mathcal{O}(1)\)) but accessing elements is
difficult (\(\mathcal{O}(n)\)), then you want a <em>linked list</em>. Linked lists aren’t
as important for data scientists to use, so I won’t get into them much here. <a href="#fnref:fnote_linked" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_array_hashmap" role="doc-endnote">
<p>You might wonder why we would ever use an array over a hashmap
if hashmaps are strictly superior with respect to their complexity. It’s a good
question. The answer is that arrays take up less space (they don’t have to store the
keys, only the values) and they are much easier to work with in code (they look
cleaner, and are more intuitive for ordered data). Furthermore, if you had a
hashmap that linked integers <code class="language-plaintext highlighter-rouge">0</code> through <code class="language-plaintext highlighter-rouge">10</code> to strings, and you wanted to insert
a new element at key <code class="language-plaintext highlighter-rouge">5</code>, then you’d have to go through what is currently at keys
<code class="language-plaintext highlighter-rouge">5</code> through <code class="language-plaintext highlighter-rouge">10</code>, and increment their keys by one, so you would end up back at an
insertion algorithm just as inefficient as the one arrays have. <a href="#fnref:fnote_array_hashmap" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_sort" role="doc-endnote">
<p>This is true <em>on average</em>; see the section below for a discussion of
average vs. worst-case complexity. <a href="#fnref:fnote_sort" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_nosql" role="doc-endnote">
<p>Non-relational (“NoSQL”) database formats, like HBase, basically
function like giant hashmaps; they have a single “key”, and then the “value” can
contain arbitrary data - you don’t have to have certain columns in there. The
advantage of this is flexibility, but the disadvantage is that sorting and filtering
are slower because the database doesn’t have a pre-defined schema. <a href="#fnref:fnote_nosql" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote_ansi" role="doc-endnote">
<p>Technically, SQL is an ANSI Standard that many different dialects
implement - so, to call yourself a SQL dialect, you must have features defined by
this standard, like the <code class="language-plaintext highlighter-rouge">SELECT</code>, <code class="language-plaintext highlighter-rouge">FROM</code>, and <code class="language-plaintext highlighter-rouge">WHERE</code> clauses shown above. <a href="#fnref:fnote_ansi" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Wills (peter@pwills.com). Part II of my guide to data science interviews, focusing on algorithms, data structures, and general programming knowledge and best practices.

DS Interview Study Guide Part I: Statistics (2019-08-24, http://www.pwills.com/posts/2019/08/24/stats)

<p>As I have gone through a couple rounds of interviews for data scientist
positions, I’ve been compiling notes on what I consider to be the essential
areas of knowledge. I want to make these notes available to the general public;
although there are many blog posts out there that are supposed to help one
prepare for data science interviews, I haven’t found any of them to be very
high-quality.</p>
<p>From my perspective, there are four key subject areas that a data scientist
should feel comfortable with when going into an interview:</p>
<ol>
<li>Statistics (including experimental design)</li>
<li>Machine Learning</li>
<li>Software Engineering (including SQL)</li>
<li>“Soft” Questions</li>
</ol>
<p>I’m going to go through each of these individually. This first post will focus
on statistics. We will go over a number of topics in statistics in no particular
order. Note that <strong>this post will not teach you statistics; it will remind you
of what you should already know.</strong></p>
<p>If you’re utterly unfamiliar with the concepts I’m mentioning, I’d recommend <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/index.htm">this
excellent MIT course on probability & statistics</a> as a good starting point. When I
began interviewing, I had never taken a statistics class before; I worked through the
notes, homeworks, and exams for this course, and at the end had a solid foundation to
learn the specific things that you need to know for these interviews. In my studying, I
also frequently use <a href="https://stats.stackexchange.com">Cross Validated</a>, a website for asking and answering questions
about statistics. It’s good for in-depth discussions of subtle issues in
statistics. Finally, <a href="https://www.goodreads.com/book/show/619590.Bayesian_Data_Analysis">Gelman’s book</a> is the classic in Bayesian inference. If you
have recommendations for good books that cover frequentist statistics in a clear manner,
I’d love to hear them.</p>
<p>These are the notes that I put together in my studying, and I’m sure that there is
plenty of room for additions and corrections. I hope to improve this guide over time;
please let me know in the comments if there’s something you think should be added,
removed, or changed!</p>
<h1 id="the-central-limit-theorem">The Central Limit Theorem</h1>
<p>The Central Limit Theorem is a fundamental tool in statistical analysis. It states
(roughly) that when you add up a bunch of independent and identically distributed random
variables (with finite variance) then their sum will converge to a Gaussian
distribution.<sup id="fnref:fnote1" role="doc-noteref"><a href="#fn:fnote1" class="footnote">1</a></sup></p>
<p>How is this idea useful to a data scientist? Well, one place where we see a sum of
random variables is in a <em>sample mean</em>. One consequence of the central limit theorem is
that the sample mean of a variable with mean \(\mu\) and variance \(\sigma^2\) will
itself be approximately Gaussian, with mean \(\mu\) and variance \(\sigma^2/n\), where \(n\) is the number of
samples.</p>
<p>I’d like to point out that this is pretty surprising. The distribution of the sum of two
random variables is not, in general, trivial to calculate. So it’s kind of awesome that,
if we’re adding up a large enough number of (independent and identically distributed)
random variables, then we <em>do</em>, in fact, have a very easy expression for the
(approximate) distribution of the sum. Even better, we don’t need to know much of
anything about the distribution we’re sampling from, besides its mean and
variance; its other moments, and its general shape, don’t matter for the CLT.</p>
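<p>This is easy to check empirically. Here is a quick simulation (a sketch assuming <code class="language-plaintext highlighter-rouge">numpy</code> is installed): sample means of a very non-Gaussian distribution, the exponential, end up with a standard deviation close to \(\sigma/\sqrt{n}\), just as the CLT predicts.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 10_000

# Exponential with scale 1 has mean 1 and standard deviation 1,
# and is strongly skewed - nothing like a Gaussian.
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

print(sample_means.mean())  # close to the true mean, 1
print(sample_means.std())   # close to 1 / sqrt(500), about 0.045
```

Plotting a histogram of <code class="language-plaintext highlighter-rouge">sample_means</code> shows a nicely symmetric bell curve, despite the skewness of the underlying distribution.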
<p>As we will see below, the simplification that the CLT introduces is the basis of one of
the fundamental hypothesis tests that data scientists perform: testing equality of
sample means. For now, let’s work through an example of the theorem itself.</p>
<h2 id="an-example">An Example</h2>
<p>Suppose that we are sampling a Bernoulli random variable. This is a 0/1 random
variable that is 1 with probability \(p\) and 0 with probability \(1-p\). If we
get the sequence of ten draws \([0,1,1,0,0,0,1,0,1,0]\), then our sample mean is</p>
\[\hat \mu = \frac{1}{10}\sum_{i=1}^{10} x_i = 0.4\]
<p>Of course, this sample mean is itself a random variable - when we report it, we
would like to report an estimate on its variance as well. The central limit
theorem tells us that this will, as \(n\) increases, converge to a Gaussian
distribution. Since the mean of the Bernoulli random variable is \(p\) and its
variance is \(p(1-p)\), we know that the distribution of the sample mean will
converge to a Gaussian with mean \(p\) and variance \(p(1-p)/n\). So we could
say that our estimate of the parameter \(p\) is 0.4 \(\pm\) 0.155. Of course,
we’re playing a bit loose here, since we’re using the estimate \(\hat p\) from
the data, as we don’t actually know the <em>true</em> parameter \(p\).</p>
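<p>You can check this arithmetic directly (plain Python, nothing beyond the standard library):</p>

```python
import math

draws = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # the ten Bernoulli draws above
n = len(draws)

p_hat = sum(draws) / n                   # sample mean (estimate of p)
se = math.sqrt(p_hat * (1 - p_hat) / n)  # CLT-based standard error

print(p_hat)         # 0.4
print(round(se, 3))  # 0.155
```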
<p>Now, a sample size of \(n=10\) is a bit small to be relying on a “large-\(n\)”
result like the CLT. Actually, in this case, we know the exact distribution of
the sample mean, since \(\sum_i x_i\) is binomially distributed with parameters
\(p\) and \(n\).</p>
<h2 id="other-questions-on-the-clt">Other Questions on the CLT</h2>
<p>I find that the CLT comes up more as a piece of context in other questions
than as something that gets asked about directly, but you should be
prepared to answer the following questions.</p>
<ul>
<li>
<p><strong>What is the central limit theorem?</strong> We’ve addressed this above - I doubt
they’ll be expecting a mathematically-correct statement of the theorem, but
you should know the gist of it, along with significant limitations (finite
variance being the major one).</p>
</li>
<li>
<p><strong>When can you <em>not</em> use the CLT?</strong> I think the key thing here is that you
have to be normalizing the data in an appropriate way (dividing by the sample
size), and that the underlying variance must be finite. The answer here can
get very subtle and mathematical, involving modes of convergence for random
variables and all that, but I doubt they will push you to go there, unless
you’re applying for a job specifically as a statistician.</p>
</li>
<li>
<p><strong>Give me an example of the CLT in use.</strong> The classic example here is the
distribution of the sample mean converging to a normal distribution as the
number of samples grows large.</p>
</li>
</ul>
<h1 id="hypothesis-testing">Hypothesis Testing</h1>
<p>Hypothesis testing (also known by the more verbose “null hypothesis significance
testing”) is a huge subject, both in scope and importance. We use statistics to
quantitatively answer questions based on data, and (for better or for worse) null
hypothesis significance testing is one of the primary methods by which we construct
these answers.</p>
<p>I won’t cover the background of NHST here. It’s well-covered in the MIT course; look at
<a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/">the readings</a> to find the relevant sections. Instead of covering the background,
we’ll work through one example of a hypothesis test. It’s simple, but it comes up all the
time in practice, so it’s essential to know. I might go so far as to say that this is
the fundamental example of hypothesis testing in data science.</p>
<h2 id="an-example-1">An Example</h2>
<p>Suppose we have two buttons, one green and one blue. We put them in front of
two different samples of users. For simplicity, let’s say that each sample has
size \(n=100\). We observe that \(k_\text{green} = 57\) users click the green
button, and only \(k_\text{blue} = 48\) click the blue button.</p>
<p>Seems like the green button is better, right? Well, we want to be able to say
how <em>confident</em> we are of this fact. We’ll do this in the language of null
hypothesis significance testing. As you should (hopefully) know, in order to do NHST, we
need a null hypothesis and a test statistic; we need to know the test statistic’s
distribution (under the null hypothesis); and we need to know the probability of
observing a value “at least as extreme” as the observed value according to this
distribution.</p>
<p>I’m going to lay out a table of all the important factors here, and then discuss how we
use them to arrive at our \(p\)-value.</p>
<table>
<thead>
<tr>
<th>Description</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Null Hypothesis</td>
<td>\(p_{blue} - p_{green} \geq 0\)</td>
</tr>
<tr>
<td>Test Statistic</td>
<td>\(\frac{k_\text{blue}}{n} - \frac{k_\text{green}}{n}\)</td>
</tr>
<tr>
<td>Test Statistic’s Distribution</td>
<td>\(N(0, (p_b(1-p_b) + p_g(1-p_g)) / n)\)</td>
</tr>
<tr>
<td>Test Statistic’s Observed Value</td>
<td>-0.09</td>
</tr>
<tr>
<td>\(p\)-value</td>
<td>0.1003</td>
</tr>
</tbody>
</table>
<p>There are a few noteworthy things here. First, we really want to know whether
\(p_g > p_b\), but that’s equivalent to \(p_b-p_g < 0\). Second, we assume that
\(n\) is large enough so that \(k/n\) is approximately normally distributed,
with mean \(\mu = p\) and variance \(\sigma^2 = p(1-p)/n\). Third, since the
difference of two normals is itself a normal, the test statistic’s distribution
is (under the null hypothesis) a normal with mean zero and the variance given
(which is the sum of the two variances of \(k_b/n\) and \(k_g/n\)).</p>
<p>Finally, we don’t actually know \(p_b\) or \(p_g\), so we can’t really compute
the \(p\)-value; what we do is we say that \(k_b/n\) is “close enough” to
\(p_b\) and use it as an approximation. That gives us our final \(p\)-value.</p>
<p>The \(p\)-value was calculated in Python, as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">norm</span>
<span class="n">pb</span> <span class="o">=</span> <span class="mf">0.48</span>
<span class="n">pg</span> <span class="o">=</span> <span class="mf">0.57</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">((</span><span class="n">pb</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">pb</span><span class="p">)</span> <span class="o">+</span> <span class="n">pg</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">pg</span><span class="p">))</span><span class="o">/</span><span class="n">n</span><span class="p">)</span>
<span class="n">norm</span><span class="p">.</span><span class="n">cdf</span><span class="p">(</span><span class="o">-</span><span class="mf">0.09</span><span class="p">,</span> <span class="n">loc</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">sigma</span><span class="p">)</span> <span class="c1"># 0.10034431272089045</span></code></pre></figure>
<p>Calculating the CDF of a normal at \(x=-0.09\) tells us the probability that the test
statistic is less than or equal to \(-0.09\), which is to say the probability that our
test statistic is at least as extreme as the observed value. This probability is
precisely our \(p\)-value.</p>
<p>So what’s the conclusion? Well, oftentimes a significance level is set before the test
is performed; if the \(p\)-value is not below this threshold, then the null hypothesis
is not rejected. Suppose we had set a significance level of 0.05 before the test began -
then, with this data, we would not be able to reject the null hypothesis that the
green button is no more appealing to users than the blue one.</p>
<p>Phew! I went through that pretty quickly, but if you can’t follow the gist of what
I was doing there, I’d recommend you think through it until it is clear to
you. You will be faced with more complicated situations in practice; it’s
important that you begin by understanding the simplest situation inside out.</p>
<h2 id="other-topics-in-hypothesis-testing">Other Topics in Hypothesis Testing</h2>
<p>Some important follow-up questions you should be able to answer:</p>
<ul>
<li>
<p><strong>What are Type I & II errors? What is a situation where you would be more concerned
with Type I error? Vice versa?</strong> These are discussed <a href="https://en.wikipedia.org/wiki/Type_I_and_type_II_errors#Type_I_error">on Wikipedia</a>. Type I error
is false-positive error. You might be very concerned with Type I error if you are
interviewing job candidates; it is very costly to hire the wrong person for the job,
so you really want to avoid false positives. Type II error is false-negative error. If
you are testing for a disease that is deadly but has a simple cure, then you would
certainly NOT want to have a false negative result of the test, since that would
result in an easily-avoidable negative outcome.</p>
</li>
<li>
<p><strong>What is the <em>power</em> of a test? How do you calculate it?</strong> The power of a test is the
probability that you will reject the null hypothesis, given an alternative
hypothesis. Therefore, to calculate the power, you need an alternative hypothesis; in
the example above, this would look like \(p_b-p_g = -0.1\). Although these alternative
hypotheses are often somewhat ad hoc, the power analysis depends critically upon
them. Google will turn up plenty of videos and tutorials on calculating the power of a
test.</p>
</li>
<li>
<p><strong>What is the significance of a test?</strong> This is the same as the
\(p\)-value threshold below which we reject the null
hypothesis. (In)famously, 0.05 has become the de-facto standard throughout
many sciences for significance levels worthy of publication.</p>
</li>
<li>
<p><strong>How would you explain a p-value to a lay person</strong>? Of course, you should
have a solid understanding of the statistical definition of the
\(p\)-value. A generally accepted answer is “a \(p\)-value quantifies the
evidence for a hypothesis - closer to zero means more evidence.” Of course,
this is wrong on a lot of levels - it’s actually quantifying evidence
<em>against</em> the null hypothesis, not <em>for</em> the alternative. For what it’s
worth, I’m not convinced there’s a great answer to that one; it’s an
inherently technical quantity that is frequently misrepresented and abused by
people trying to (falsely) simplify its meaning.</p>
</li>
<li>
<p><strong>If you measure many different test statistics, and get a \(p\)-value for each (all
based on the same null hypothesis), how do you combine them to get an aggregate
\(p\)-value?</strong> This one is more of a bonus question, but it’s worth knowing. It’s
actually not obvious how to do this, and the true \(p\)-value depends on how the tests
depend on each other. However, you can get an upper-bound (worst-case estimate) on the
aggregate \(p\)-value by adding together the different \(p\)-values. The validity of
this bound results from the union bound (a consequence of the inclusion-exclusion principle).</p>
</li>
</ul>
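<p>To make the power bullet above concrete, here is a minimal sketch of a power
calculation for the one-sided button test, assuming (hypothetically) a significance
level of 0.05 and the alternative hypothesis \(p_b - p_g = -0.1\); the particular
values of \(p_b\) and \(p_g\) below are illustrative choices, not part of the
original example:</p>

```python
import numpy as np
from scipy.stats import norm

# Illustrative power calculation, assuming a 0.05 significance level
# and the alternative hypothesis p_b - p_g = -0.1.
n = 100
p_b, p_g = 0.45, 0.55  # hypothetical values consistent with the alternative
sigma = np.sqrt((p_b * (1 - p_b) + p_g * (1 - p_g)) / n)

# We reject the null when the test statistic falls below this critical value.
critical = norm.ppf(0.05, loc=0, scale=sigma)

# Power: probability of falling below the critical value under the alternative.
power = norm.cdf(critical, loc=-0.1, scale=sigma)
print(power)  # roughly 0.4
```

<p>Note how modest the power is here: even with a true difference of 0.1, a sample of
100 per button detects it less than half the time.</p>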
<h1 id="confidence-intervals">Confidence Intervals</h1>
<p>Confidence intervals allow us to state a statistical result as a range, rather than a
single value. If we count that 150 out of 400 people sampled randomly from a city
identify themselves as male, then our best estimate of the fraction of women in the city
is 250/400, or 5/8. But we only looked at 400 people, so it’s reasonable to expect that
the true value might be a bit more or less than 5/8. Confidence intervals allow us to
quantify this width in a statistically rigorous way.</p>
<p>As per usual, we won’t actually introduce the concepts here - I’ll refer you to the
<a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/">readings from the MIT course</a> for an introduction. We’ll focus on working through
an example, and looking at some different approaches.</p>
<h2 id="the-exact-method">The Exact Method</h2>
<p>Suppose that we want to find a 95% confidence interval on the female fraction in the
city discussed above. This corresponds to a significance level of \(\alpha = 0.05\). One way
to get the <strong>exact confidence interval</strong> is to use the CDF of our test statistic, but
substitute in the observed parameter for the true parameter, and then invert it to find
where it hits \(\alpha/2\) and \(1-\alpha/2\). That is, we need to find the value
\(p_l\) that solves the equation</p>
\[CDF\left(n, p_l\right) = \alpha/2\]
<p>and the value \(p_u\) that solves the equation</p>
\[CDF\left(n, p_u\right) = 1 - \alpha/2.\]
<p>In these, \(CDF(n,p)\) is the cumulative distribution function of our test statistic,
assuming that the true value of \(p\) is in fact the observed value \(\hat p\). This is
a bit confusing, so it’s worth clarifying. In our case, the sample statistic is the
sample mean of \(n\) binomial random variables, so this CDF is the CDF of the sample
mean of \(n\) binomial random variables with parameter \(5/8\). Solving the two
equations above would give us our confidence interval \([p_l, p_u]\).</p>
<p>It took me a bit of work to see that solving the above two equations would in fact give
us bounds that satisfy the definition of a \(1-\alpha\) confidence interval, which says
that, were we to run many experiments, we would find that the true value of \(p\) would
fall between \(p_l\) and \(p_u\) with the probability</p>
\[P\left(p_l\leq p \leq p_u\right) = 1-\alpha.\]
<p>If you’re into this sort of thing, I’d suggest you take some time thinking through why
inverting the CDF as above guarantees bounds \([p_l, p_u]\) that solve the above
equation.</p>
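<p>As a sketch of this style of inversion, here is the standard Clopper-Pearson
construction (a close relative of the plug-in method described above) applied to the
250-out-of-400 example, using scipy’s binomial CDF and a root finder; the choice of
bracketing interval for <code class="language-plaintext highlighter-rouge">brentq</code> is an implementation detail:</p>

```python
from scipy.optimize import brentq
from scipy.stats import binom

# Exact (Clopper-Pearson) 95% interval for k = 250 successes out of n = 400.
k, n, alpha = 250, 400, 0.05

# Lower bound: the p at which seeing k or more successes has probability alpha/2.
p_l = brentq(lambda p: 1 - binom.cdf(k - 1, n, p) - alpha / 2, 1e-9, 1 - 1e-9)

# Upper bound: the p at which seeing k or fewer successes has probability alpha/2.
p_u = brentq(lambda p: binom.cdf(k, n, p) - alpha / 2, 1e-9, 1 - 1e-9)

print(p_l, p_u)  # an interval bracketing the point estimate 5/8
```

<p>This is exact in the sense that it uses the true binomial CDF rather than a normal
approximation, at the cost of requiring a numerical inversion.</p>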
<p>Although it is useful for theoretical analysis, I rarely use this method in
practice, because I often do not actually know the true CDF of the statistic
I am measuring. Sometimes I do know the true CDF, but even in such cases, the
next (approximate) method is generally sufficient.</p>
<h2 id="the-approximate-method">The Approximate Method</h2>
<p>If your statistic can be phrased as a sum, then its distribution approaches a normal
distribution.<sup id="fnref:fnote2" role="doc-noteref"><a href="#fn:fnote2" class="footnote">2</a></sup> This means that you can solve the above equations for a normal
CDF rather than the true CDF of the sum (in the case above, a binomial CDF).</p>
<p>How does this help? For a normal distribution, the solutions for the above equations to
find lower and upper bounds are well known. In particular, the interval
\([\mu-\sigma,\mu+\sigma]\), also called a \(1\sigma\)-interval, covers about 68% of the
mass (probability) of the normal PDF, so if we wanted to find a confidence interval of
level \(0.68\), then we know to use the bounds \((\overline x-\sigma, \overline
x+\sigma)\), where \(\overline x\) is our estimate of the true mean \(\mu\).</p>
<p>This sort of result is very powerful, because it saves us from having to do any
inversion by hand. A table below indicates the probability mass contained in various
symmetric intervals on a normal distribution:</p>
<table>
<thead>
<tr>
<th>Interval</th>
<th>Width<sup id="fnref:fnote3" role="doc-noteref"><a href="#fn:fnote3" class="footnote">3</a></sup></th>
<th>Coverage</th>
</tr>
</thead>
<tbody>
<tr>
<td>\([\mu-\sigma,\mu+\sigma]\)</td>
<td>\(1\sigma\)</td>
<td>0.683</td>
</tr>
<tr>
<td>\([\mu-2\sigma,\mu+2\sigma]\)</td>
<td>\(2\sigma\)</td>
<td>0.954</td>
</tr>
<tr>
<td>\([\mu-3\sigma,\mu+3\sigma]\)</td>
<td>\(3\sigma\)</td>
<td>0.997</td>
</tr>
</tbody>
</table>
<p>Let’s think through how we would use this in the above example, where we give a
confidence interval on our estimate of the binomial parameter \(p\).</p>
<p>A binomial distribution has mean \(\mu=np\) and variance \(\sigma^2=np(1-p)\). Since
the sample statistic \(\hat p\) is just the binomial divided by \(n\), it has mean
\(\mu=p\) and variance \(\sigma^2 = p(1-p)/n\). The central limit theorem tells us that
the distribution of \(\hat p\) will converge to a normal with just these parameters.</p>
<p>Suppose we want an (approximate) 95% confidence interval on the percentage of women in
the population of our city; the table above tells us we can just do a two-sigma
interval. (This is not <em>exactly</em> a 95% confidence interval; it’s a bit over, as we see
in the table above). The parameter \(\hat p\) has mean \(\mu= p\) and variance
\(\sigma^2 = p(1-p)/n\).<sup id="fnref:fnote4" role="doc-noteref"><a href="#fn:fnote4" class="footnote">4</a></sup> In our case, \(\hat p=5/8\), so our confidence
interval is \(5/8 \pm \sqrt{15}/80 \approx 0.625 \pm 0.048\). Note that we approximated
\(p\) with our experimental value \(\hat p\); the theoretical framework that allows us
to do this substitution is beyond the scope of this article, but is nicely covered in
the MIT readings (Reading 22, in particular).</p>
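<p>The two-sigma arithmetic above is short enough to check in a few lines; this is just
the calculation from the example, with \(\hat p\) plugged in for \(p\):</p>

```python
import numpy as np

# Two-sigma (approximately 95%) interval for the 250-out-of-400 example.
n, k = 400, 250
p_hat = k / n                             # 5/8
sigma = np.sqrt(p_hat * (1 - p_hat) / n)  # plug in p_hat for the true p
lower, upper = p_hat - 2 * sigma, p_hat + 2 * sigma
print(lower, upper)
```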
<h2 id="the-bootstrap-method">The Bootstrap Method</h2>
<p>The previous approach relies on the accuracy of approximating our statistic’s
distribution by a normal distribution. Bootstrapping is a pragmatic, flexible
approach to calculating confidence intervals, which makes no assumptions on the
underlying statistics we are calculating. We’ll go into more detail on
bootstrapping in general below, so we’ll be pretty brief here.</p>
<p>The basic idea is to repeatedly pull 400 samples <em>with replacement</em> from the sampled
data. For each set of 400 samples, we get an estimate \(\hat p\), and thus can build an
empirical distribution on \(\hat p\). Of course, the CLT indicates that this empirical
distribution should look a lot like a Gaussian distribution with mean \(\mu= p\) and variance
\(\sigma^2 = p(1-p)/n\).</p>
<p>Once you have bootstrapped an empirical distribution for your statistic of interest (in
the example above, this is the percentage of the population that is women), then you can
simply find the \(\alpha/2\) and \(1-\alpha/2\) percentiles, which then become your
confidence interval. Although in this case our empirical distribution is (approximately)
normal, it’s worth realizing that we can reasonably calculate percentiles <em>regardless</em>
of what the empirical distribution is; this is why bootstrapped confidence intervals
are so flexible.</p>
<p>As you’ll see below, the downside of bootstrapping confidence intervals is that
it requires some computation. The amount of computation required can be
anywhere from trivial to daunting, depending on how many samples you want in
your empirical distribution. Another downside is that their statistical interpretation
is not exactly in alignment with the definition of a confidence interval, but I’ll leave
the consideration of that as an exercise for the reader.<sup id="fnref:fnotez" role="doc-noteref"><a href="#fn:fnotez" class="footnote">5</a></sup> <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf">One of the MIT
readings</a> has an in-depth discussion of confidence intervals generated via the
bootstrap method.</p>
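<p>A minimal sketch of a percentile-bootstrap interval for the same example follows;
the number of resamples and the seed are arbitrary choices:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Reconstruct the sample from the example: 250 ones (female) and 150 zeros.
data = np.array([1] * 250 + [0] * 150)

# Resample with replacement many times, recording the statistic each time.
boot = np.array([rng.choice(data, size=data.size, replace=True).mean()
                 for _ in range(10_000)])

# The percentile-bootstrap 95% interval: the 2.5th and 97.5th percentiles
# of the empirical distribution.
lower, upper = np.percentile(boot, [2.5, 97.5])
print(lower, upper)
```

<p>For this statistic the result lands close to the normal-approximation interval, as
the CLT would suggest; the payoff of the bootstrap is that the same recipe works for
statistics with no convenient closed-form distribution.</p>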
<p><strong>Overall, I would recommend using the approximate method when you have good reason to
believe your sample statistic is approximately normal, or bootstrapping otherwise.</strong> Of
course, the central limit theorem can provide some guarantees about the asymptotic
distribution of certain statistics, so it’s worth thinking through whether that applies
to your situation.</p>
<h2 id="other-topics-in-confidence-intervals">Other Topics in Confidence Intervals</h2>
<ul>
<li>
<p><strong>What is the definition of a confidence interval?</strong> This is a bit more technical, but
it’s essential to know that it is <strong>not</strong> “there is a 95% probability that the true
parameter is in this range.” Actually, what it means is that “if you reran the
experiment many times, then 95% of the time, the true value of the parameter you’re
estimating would fall in this range.” It’s worth noting that the <em>range</em> is the random
variable here - the parameter itself (the true percentage of the population that
identifies as female, in our example) is fixed.</p>
</li>
<li>
<p><strong>How would this change if you wanted a <em>one-sided</em> confidence interval?</strong>
This one isn’t too bad - you just solve either \(CDF(n,p_l) = \alpha\) or
\(CDF(n,p_u) = 1-\alpha\) for a lower- or upper-bounded interval,
respectively.</p>
</li>
<li>
<p><strong>What is the relationship between confidence intervals and hypothesis testing?</strong>
There are many ways to answer this question; it’s a good one to ponder in order to get
a deeper understanding of the two topics. One connection is the relationship between
confidence intervals and rejection regions in NHST - Reading 22 in the MIT course
addresses this one nicely.</p>
</li>
</ul>
<h1 id="bootstrapping">Bootstrapping</h1>
<p>Bootstrapping is a technique that allows you to get insight into the quality of your
estimates, based only on the data you have. It’s a key tool in a data scientist’s
toolbag, because we frequently don’t have a clear theoretical understanding of our
statistics, and yet we want to provide uncertainty estimates. To understand how it
works, let’s look through an example.</p>
<p>In the last section, we sampled 400 people in an effort to understand what percentage of
a city’s population identified as female. Since 250 of them identified themselves as
female, our estimate of the ratio for the total population is \(5/8\). This estimate is
itself a random variable; if we had sampled different people, we might have ended up
with a different number. What if we want to know the distribution of this estimate? How
would we go about getting that?</p>
<p>Well, the obvious way is to go out and sample 400 more people, and repeat this over and
over again, until we have many such fractional estimates. But what if we don’t have
access to sampling more people? The natural thing is to think that we’re out of luck -
without the ability to sample further, we can’t actually understand more about the
distribution of our parameter (ignoring, for the moment, that we have lots of
theoretical knowledge about it via the CLT).</p>
<p>The idea behind bootstrapping is simple. Sample from the data you already have, with
replacement, a new sample of 400 people. This will give you an estimate of the female
fraction that is distinct from your original estimate, due to the replacement in your
sampling. You can repeat this process as many times as you like; you will then get an
empirical distribution which approaches the true distribution of the statistic.<sup id="fnref:fnote4:1" role="doc-noteref"><a href="#fn:fnote4" class="footnote">4</a></sup></p>
<p>Bootstrapping has the advantage of being flexible, although it does have its
limitations. Rather than get too far into the weeds, I’ll just point you to the
<a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">Wikipedia article on bootstrapping</a>. There are also tons of resources about this
subject online. Try coding it up for yourself! By the time you’re interviewing, you
should be able to write a bootstrapping algorithm quite easily.</p>
<p><a href="https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/">Machine Learning Mastery</a> has a good introduction to bootstrapping that uses the
scikit-learn API. <a href="https://towardsdatascience.com/an-introduction-to-the-bootstrap-method-58bcb51b4d60">Towards Data Science</a> codes it up directly in NumPy, which is a
useful thing to know how to do. Asking someone to code up a bootstrapping
function would be an entirely reasonable interview question, so it’s something you
should be comfortable doing.</p>
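<p>If you want to try coding it up yourself, here is one shape such a function might
take; the name <code class="language-plaintext highlighter-rouge">bootstrap</code> and its signature are hypothetical, just one reasonable design:</p>

```python
import numpy as np

def bootstrap(data, statistic, n_resamples=10_000, seed=0):
    """Empirical distribution of `statistic` over resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    return np.array([
        statistic(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_resamples)
    ])

# The female-fraction example again: 250 ones and 150 zeros.
sample = [1] * 250 + [0] * 150
dist = bootstrap(sample, np.mean, n_resamples=2000)
print(dist.mean(), dist.std())  # centered near 5/8, spread near sqrt(p(1-p)/n)
```

<p>Passing the statistic in as a function is the key design choice: the same loop then
works for means, medians, correlations, or anything else you can compute on a sample.</p>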
<h2 id="other-topics-in-bootstrapping">Other Topics in Bootstrapping</h2>
<ul>
<li><strong>When would you <em>not</em> want to use bootstrapping?</strong> It might not be feasible when it
is very costly to calculate your sample statistic. To get accurate estimates you’ll
need to calculate your statistic thousands of times, so it might not be feasible if it
takes minutes or hours to calculate a single sample. Also, it is often difficult to
get strong theoretical guarantees about probabilities based on bootstrapping, so if
you need a highly statistically rigorous approach, you might be better served with
something more analytical. Finally, if you know the distribution of your statistic
already (for example, you know from the CLT that it is normally distributed) then you
can get better (more accurate) uncertainty estimates from an analytical approach.</li>
</ul>
<h1 id="linear-regression">Linear Regression</h1>
<p>Regression is the study of the relationship between variables; for example, we
might wish to know how the weight of a person relates to their height. <em>Linear</em>
regression assumes that your input (height, or \(h\)) and output (weight, or
\(w\)) variables are <em>linearly related</em>, with slope \(\beta_1\), intercept
\(\beta_0\), and noise \(\epsilon\).</p>
\[w = \beta_1\cdot h + \beta_0 + \epsilon.\]
<p>A linear regression analysis helps the user discover the \(\beta\)s in the
above equation. This is just the simplest application of LR; in reality, it is
quite flexible and can be used in a number of scenarios.</p>
<p>Linear regression is another large topic that I can’t really do justice to in this
article. Instead, I’ll just go through some of the common topics, and introduce the
questions you should be able to address. As is the case with most of these topics, you
can look at the <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/index.htm">MIT Statistics & Probability course</a> for a solid academic
introduction to the subject. You can also dig through <a href="https://en.wikipedia.org/wiki/Linear_regression">the Wikipedia article</a> to get
a more in-depth picture. The subject is so huge, and there’s so much to learn about it,
that you really can spend as much time as you want digging into it - I’m just going to
gesture at some of the simpler aspects of it.</p>
<h2 id="calculating-a-linear-regression">Calculating a Linear Regression</h2>
<p>Rather than go through an example here, I’ll just refer you to the many available guides
that show you how to do this in code. Of course, you could do it in raw NumPy, solving
the normal equations explicitly, but I’d recommend using scikit-learn or statsmodels, as
they have much nicer interfaces, and give you all sorts of additional information about
your model (\(r^2\), \(p\)-value, etc.)</p>
<p><a href="https://realpython.com/linear-regression-in-python/">Real Python</a> has a good guide to coding this up - see the section “Simple Linear
Regression with scikit-learn.” <a href="https://www.geeksforgeeks.org/linear-regression-python-implementation/">GeeksForGeeks</a> does the solution in raw NumPy; the
equations won’t be meaningful for you until you read up on the normal equation and how
to analytically solve for the optimal LR coefficients. If you want something similar in
R, or Julia, or MATLAB,<sup id="fnref:fnoted" role="doc-noteref"><a href="#fn:fnoted" class="footnote">6</a></sup> then I’m sure it’s out there, you’ll just have to go do
some Googling to find it.</p>
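<p>For the scikit-learn route, a minimal sketch on made-up height/weight data looks
like the following; the coefficients 0.9 and -90 and the noise level are invented
purely for illustration:</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up height (cm) / weight (kg) data: w = 0.9*h - 90 + noise.
h = rng.uniform(150, 200, size=100).reshape(-1, 1)  # column of inputs
w = 0.9 * h[:, 0] - 90 + rng.normal(0, 5, size=100)

model = LinearRegression().fit(h, w)
print(model.coef_[0], model.intercept_)  # near 0.9 and -90
print(model.score(h, w))                 # the r^2 of the fit
```

<p>Note that scikit-learn expects the inputs as a 2-D array (one column per independent
variable), which is why the heights are reshaped above.</p>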
<h2 id="a-statistical-view">A Statistical View</h2>
<p>This subject straddles the boundary between statistics and machine-learning. It has been
quite thoroughly studied from a statistical point of view, and there are some important
results that you should be familiar with when thinking about linear regression from a
statistical frame.<sup id="fnref:fnotec" role="doc-noteref"><a href="#fn:fnotec" class="footnote">7</a></sup></p>
<p>Let’s look back at our foundational model for linear regression. LR assumes
that your input \(x\) and output \(y\) are related via</p>
\[y_i = \beta_1\cdot x_i + \beta_0 + \epsilon_i,\]
<p>where \(\epsilon_i\) are i.i.d., distributed as \(N(0, \sigma^2)\). Since the
\(\epsilon\) are random variables, the \(\beta_j\) are themselves random
variables. One important question is whether there is, in fact, any
relationship between our variables at all. If there is not, then we should see
\(\beta_1\) close to 0,<sup id="fnref:fnoteb" role="doc-noteref"><a href="#fn:fnoteb" class="footnote">8</a></sup> but it will essentially never be exactly zero. One important
statistical technique in LR is <strong>doing a hypothesis test against the null
hypothesis that \(\beta_1 = 0\)</strong>. When a package like statsmodels returns a
“\(p\)-value of the regression”, this is the \(p\)-value they are talking
about.</p>
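<p>One readily available implementation of exactly this test is scipy’s
<code class="language-plaintext highlighter-rouge">linregress</code>; the synthetic data below (true slope 2) is just for illustration:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(size=200)

# linregress returns the p-value for the null hypothesis that the slope is zero.
result = stats.linregress(x, y)
print(result.slope, result.pvalue)  # slope near 2, p-value essentially zero
```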
<p>Like I said before, there is a lot more to know about the statistics of linear
regression than just what I’ve said here. You can learn more about the statistics of LR
by looking at the <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading25.pdf">MIT course notes on the subject</a>, or by digging through your
favorite undergraduate statistics book - most of them should have sections covering it.</p>
<h2 id="validating-your-model">Validating Your Model</h2>
<p>Once you’ve calculated your LR, you’d like to validate it. This is very important to
do - if you’re asked to calculate a linear regression in an interview, you should always
go through the process of validating it after you’ve done the calculation.</p>
<p>I’d generally go through the following steps:</p>
<ul>
<li>If it’s just a simple (one independent variable) linear regression, then plot the two
variables. This should give you a good sense of whether it’s a good idea to use linear
regression in the first place. If you have multiple independent variables, you can
make separate plots for each one.</li>
<li>Look at your \(r^2\) value. Is it reasonably large? Remember, closer to 1 is
better. If it’s small, then doing a linear regression hasn’t helped much.</li>
<li>You can look at the \(p\)-value to see if the slope’s difference from zero is
statistically significant (see the section above). Also, you can have a very
significant \(p\)-value while still having a low \(r^2\), so be cautious in your
interpretation of this one.</li>
<li>You can also look at the RMSE of your model, but this number is not scaled between 0
and 1, so a “good” RMSE is highly dependent on the units of your dependent variable.</li>
<li>Plot your residuals, for each variable. The residual is just the observed output minus
the value predicted by your model, a.k.a. the error of your model. Plotting
each residual isn’t really feasible if you have hundreds of independent
variables, but it’s a good idea if your data is small enough. You should be
looking for “homoskedasticity” - that the variance of the error is uniform
across the range of the independent variable. If it’s not, then certain
things you’ve calculated (for example, the \(p\)-value of your regression)
are no longer valid. You might also see that your errors have a bias that
changes as the \(x_i\) changes; this means that there’s some more complicated
relationship between \(y\) and \(x_i\) that your regression did not pick up.</li>
</ul>
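<p>The residual check in the last bullet can also be done numerically; here is a crude
sketch on synthetic, deliberately well-behaved data (the split of the \(x\) range into
two halves is an arbitrary simplification of a full residual plot):</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 3.0 * x[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)  # observed output minus prediction

# Crude homoskedasticity check: the residual spread should look similar
# in the lower and upper halves of the x range.
low_spread = residuals[x[:, 0] < 5].std()
high_spread = residuals[x[:, 0] >= 5].std()
print(low_spread, high_spread)  # both near 1 for this well-behaved data
```

<p>If the two spreads differed substantially, that would be a warning sign of
heteroskedasticity, and the regression’s \(p\)-value would no longer be trustworthy.</p>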
<p>Some of the questions below address the assumptions of linear regression; you
should be familiar with them, and know how to test for them either before or
after the regression is performed, so that you can be confident that your model
is valid.</p>
<h2 id="basic-questions-on-lr">Basic Questions on LR</h2>
<p>Hopefully you’ve familiarized yourself with the basic ideas behind linear
regression. Here are some conceptual questions you should be able to answer.</p>
<ul>
<li>
<p><strong>How are the \(\beta\)s calculated?</strong> Practically, you let the library
you’re using take care of this. But behind the scenes, generally it’s solving
the so-called “normal equations”, which give you the optimal (highest
\(r^2\)) parameters possible. You can use gradient descent to approximate
the optimal solution when the design matrix is too large to invert; this is
available via the <code class="language-plaintext highlighter-rouge">SGDRegressor</code> model in scikit-learn.</p>
</li>
<li>
<p><strong>How do you decide if you should use linear regression?</strong> The best case is
when the data is 2- or 3-dimensional; then you can just plot the data and see
if it looks like “linear plus noise”. However, if you have lots of
independent variables, this isn’t really an option. In such a case, you
should perform a linear regression analysis, and then look at the errors
to verify that they look normally distributed and homoskedastic (constant
variance).</p>
</li>
<li>
<p><strong>What does the \(r^2\) value of a regression indicate?</strong> The \(r^2\) value
indicates “how much of the variance of the output data is explained by the
regression.” That is, your output data \(y\) has some (sample) variance, just
on its own. Once you discover the linear relationship and subtract it off,
then the remaining error \(y - \beta_0 - \beta_1x\) still has some variance,
but hopefully it’s lower - \(r^2\) is one minus the ratio of the remaining to
the original variance. When \(r^2=1\), then your line is a perfect fit of
the data, and there is no remaining error. It is often used to explain the
“quality” of your fit, although this can be a bit treacherous - see
<a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">Anscombe’s Quartet</a> for examples of very different situations with the
same \(r^2\) value.</p>
</li>
<li>
<p><strong>What are the assumptions you make when doing a linear regression?</strong> The
Wikipedia article <a href="https://en.wikipedia.org/wiki/Linear_regression#Assumptions">addresses this point</a> quite thoroughly. This is worth
knowing, because you don’t just want to jump in and blindly do LR; you want
to be sure it’s actually a reasonable approach.</p>
</li>
<li>
<p><strong>When is it a bad idea to do LR?</strong> When you do linear regression, you’re assuming a
certain relationship between your variables. Just the parameters and output of your
regression won’t tell you whether the data really are appropriate for a linear
model. <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">Anscombe’s Quartet</a> is a particularly striking example of how the output of
a linear regression analysis can look similar but in fact the quality of the analysis
can be radically different. Beyond this, it is a bad idea to do LR whenever the
assumptions of LR are violated by the data; see the above bullet for more info there.</p>
</li>
<li>
<p><strong>Can you do linear regression on a nonlinear relationship?</strong> In many cases,
yes. What we need is for the model to be linear in the parameters \(\beta\);
if, for example, you are comparing distance and time for a constantly
accelerating object \(d = 1/2at^2\), and you want to do regression to
discover the acceleration \(a\), then you can just use \(t^2\) as your
independent variable. The model relating \(d\) and \(t^2\) is linear in the
acceleration \(a\), as required.</p>
</li>
<li>
<p><strong>What does the “linear” in linear regression refer to?</strong> This one might seem
trivial, but it’s a bit of a trick question; the relationship \(y =
2\log(x)\) might not appear linear, but in fact it can be obtained via a
linear regression, by using \(\log(x)\) as the input variables, rather than
\(x\). Of course, for this to work, you need to know ahead of time that you
want to compare against \(\log(x)\), but this can be discovered via
trial-and-error, to some extent. So the “linear” <em>does</em>, as you’d expect,
mean that the relationship between independent and dependent variable is
linear, but you can always <em>change</em> either of them and re-calculate your
regression.</p>
</li>
</ul>
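<p>The normal equations mentioned in the first bullet are easy to sketch directly in
NumPy; the synthetic data (true intercept 1, slope 2) is invented for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

# Design matrix with an intercept column; the normal equations
# (X^T X) beta = X^T y yield the least-squares coefficients directly.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [1.0, 2.0] (intercept, slope)
```

<p>Using <code class="language-plaintext highlighter-rouge">np.linalg.solve</code> rather than explicitly inverting \(X^TX\) is the standard,
numerically safer way to solve the system.</p>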
<h2 id="handling-overfitting">Handling Overfitting</h2>
<p>Overfitting is very important to understand, and is a fundamental challenge in machine
learning and modeling. I’m not going to go into great detail on it here; more
information will be presented in the machine learning section of the guide. There are
some techniques for handling it that are particular to LR, which is what I’ll talk about
here.</p>
<p><a href="https://realpython.com/linear-regression-in-python/">RealPython</a> has good images showing examples of over-fitting. You can
handle it by building into your model a “penalty” on the \(\beta_i\)s; that is,
tell your model “I want low error, <strong>and</strong> I don’t want large coefficients.”
The balance of these preferences is determined by a parameter, often denoted by
\(\lambda\).</p>
<p>Since you have many \(\beta\)s, in general, you have to combine them in some
fashion. Two such ways to calculate the measure of “overall badness” (which I’ll call
\(OB\)) are</p>
\[OB = \sqrt{ \beta_1^2 + \beta_2^2 + \ldots + \beta_n^2 }\]
<p>or</p>
\[OB = |\beta_1| + |\beta_2| + \ldots + |\beta_n|.\]
<p>The first will tend to emphasize outliers; that is, it is more sensitive to
single large \(\beta\)s. The second considers all the \(\beta\)s more
uniformly. If you use the first, it is called “ridge regression”, and if you
use the second it is called “LASSO regression.”</p>
<p>In mathematics, these are the \(\ell_2\) and \(\ell_1\) norms, respectively, of the vectors
of \(\beta\)s; you can in theory use \(\ell_p\) norms for any \(p\), even
\(p=0\) (count the number of non-zero \(\beta\)s to get the overall badness) or
\(p=\infty\) (take the largest \(\beta\) as the overall badness). However, in
practice, LASSO and ridge regression are already implemented in common
packages, so it’s easy to use them right out of the box.</p>
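<p>Here is a minimal sketch of both penalized regressions in scikit-learn, on synthetic
data where only two of five features actually matter; note that scikit-learn calls the
penalty weight (the \(\lambda\) above) <code class="language-plaintext highlighter-rouge">alpha</code>, and the particular values used below
are arbitrary:</p>

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Five features, only the first and fourth of which actually matter.
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # penalizes the l2 norm of the betas
lasso = Lasso(alpha=0.1).fit(X, y)  # penalizes the l1 norm
print(ridge.coef_)
print(lasso.coef_)  # the irrelevant coefficients are pushed toward zero
```

<p>A characteristic difference shows up in the output: LASSO tends to drive irrelevant
coefficients all the way to zero (useful for feature selection), while ridge merely
shrinks them.</p>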
<p>As usual, there is a LOT to learn about how LASSO and ridge regression change your
output, and what kinds of problems they can address (and/or create). I’d highly
recommend searching around the internet to learn more about them if you aren’t already
confident in your understanding of how they work.</p>
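<p>To make the penalty concrete, here is a minimal NumPy sketch (the data is synthetic and invented for illustration) comparing ordinary least squares to the closed-form ridge solution \(\beta = (X^TX + \lambda I)^{-1}X^Ty\); the ridge coefficients come out smaller in norm:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 5 features, 30 samples, with two nearly-identical columns --
# a setting where ordinary least squares produces large, unstable coefficients.
n, d = 30, 5
X = rng.normal(size=(n, d))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)  # near-duplicate feature
y = X @ np.array([1.0, 1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=n)

# Ordinary least squares: beta = (X^T X)^{-1} X^T y
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regression has a closed form: beta = (X^T X + lambda I)^{-1} X^T y
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The penalty shrinks the coefficient vector toward zero
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

<p>In practice you would reach for an off-the-shelf implementation (e.g. scikit-learn’s <code>Ridge</code> and <code>Lasso</code>), which also handle the choice of \(\lambda\) via cross-validation.</p>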
<h2 id="logistic-regression">Logistic Regression</h2>
<p>Logistic regression is a way of modifying linear regression models to get a
classification model. The statistics of logistic regression are, generally speaking, not
as clean as those of linear regression. It will be covered in the machine learning
section, so we won’t discuss it here.</p>
<h1 id="bayesian-inference">Bayesian Inference</h1>
<p>Up until now this guide has primarily focused on frequentist topics in
statistics, such as hypothesis testing and the frequentist approach to
confidence intervals. There is an entire world of Bayesian statistical
inference, which differs significantly from the frequentist approach in both
philosophy and technique. I will only touch on the most basic application of
Bayesian reasoning in this guide.</p>
<p>In this section, I will mostly defer to outside sources, who I think speak more
eloquently on the topic than I can. Some companies (such as Google, or so I’m told) tend
to focus on advanced Bayesian skills in their data science interviews; if you want to
really learn the Bayesian approach, I’d recommend <a href="https://www.goodreads.com/book/show/619590.Bayesian_Data_Analysis">Gelman’s book</a>, which is a
classic in the field.</p>
<h2 id="bayesian-vs-frequentist-statistics">Bayesian vs Frequentist Statistics</h2>
<p>It’s worth being able to clearly discuss the difference in philosophy and approach
between the two schools of statistics. I particularly like the discussion in the MIT
course notes. They state, more or less, that while the Bayesians like to reason from
Bayes theorem</p>
\[P(H|D) = \frac{ P(D|H)P(H)}{P(D)},\]
<p>the frequentist school thinks that “the probability of the hypothesis” is a nonsense
concept - it is not a well-founded probabilistic value, in the sense that there is no
repeatable experiment you can run in which to gather relative frequency counts and
calculate probabilities. Therefore, the frequentists must reason directly from
\(P(D|H)\), the probability of the data given the hypothesis, which is just the
\(p\)-value. The upside of this is that the probabilistic interpretation of \(P(D|H)\)
is clean and unambiguous; the downside is that it is easy to misunderstand, since what
we really think we want is “the probability that the hypothesis is true.”</p>
<p>If you want to know more about this, there are endless discussions of it all over the
internet. Like many such dichotomies (emacs vs. vim, overhand vs underhand toilet paper,
etc.) it is generally overblown - a working statistician should be familiar with, and
comfortable using, both frequentist <em>and</em> Bayesian techniques in their analysis.</p>
<h2 id="basics-of-bayes-theorem">Basics of Bayes Theorem</h2>
<p>Bayes theorem tells us how to update our belief in light of new evidence. You
should be comfortable applying Bayes theorem in order to answer basic
probability questions. The classic example is the “base rate fallacy”:</p>
<p>Consider a routine screening test for a disease. Suppose the frequency of the
disease in the population (base rate) is 0.5%. The test is highly accurate with
a 5% false positive rate and a 10% false negative rate. You take the test and
it comes back positive. What is the probability that you have the disease?</p>
<p>The answer is NOT 0.95, even though the test has a 5% false positive rate. You should be
able to clearly work through this problem, building probability tables and using Bayes
theorem to calculate the final answer. The problem is worked through in the <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading3.pdf">MIT stats
course readings</a> (see Example 10), so I’ll defer to them for the details.</p>
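<p>If you want to sanity-check the arithmetic yourself, the whole computation is a few lines of Python (using just the numbers from the example above):</p>

```python
# Base-rate fallacy example: apply Bayes theorem directly.
base_rate = 0.005        # P(disease)
false_positive = 0.05    # P(positive | no disease)
false_negative = 0.10    # P(negative | disease)

p_pos_given_disease = 1 - false_negative  # sensitivity, 0.90

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * base_rate
         + false_positive * (1 - base_rate))

# Bayes theorem: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * base_rate / p_pos
print(round(p_disease_given_pos, 3))  # 0.083
```

<p>Despite the positive result, the probability you have the disease is only about 8% - the low base rate dominates.</p>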
<h2 id="updating-posteriors--conjugate-priors">Updating Posteriors & Conjugate Priors</h2>
<p>The above approach of calculating out all the probabilities by hand works reasonably well
when there are only a few possible outcomes in the probability space, but it doesn’t
scale well to large (discrete) probability spaces, and won’t work at all in continuous
probability spaces. In such situations, you’re still fundamentally relying on Bayes
theorem, but the way it is applied looks quite different - you end up using sums and
integrals to calculate the relevant terms.</p>
<p>Again, I’ll defer to the <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/">MIT stats course readings</a> for the details - readings 12
and 13 are the relevant ones here.</p>
<p>It’s particularly useful to be familiar with the concept of <strong>conjugate
priors</strong>. In general, updating your priors involves computing an integral,
which as anyone who has taken calculus knows can be a pain in the ass. When
sampling from a distribution and estimating the parameters, there are certain
priors for which the updates based on successive samples work out to be very
simple.</p>
<p>For an example of this, suppose you’re flipping a biased coin and trying to
figure out the bias. This is equivalent to sampling a binomial distribution and
trying to estimate the parameter \(p\). If your prior is uniform (flat across
the interval \([0,1]\)), then after \(N\) flips, \(k\) of which come up heads,
your posterior probability density on \(p\) will be</p>
\[f(p) \propto p^{k}(1-p)^{N-k}.\]
<p>This is called a <strong>\(\beta\) distribution</strong>. It is kind of magical that we can
calculate this without having to do any integrals - this is because the
\(\beta\) distribution is “conjugate to” the binomial distribution. It’s
important that we started out with a uniform distribution as our prior - if we
had chosen an arbitrary prior, the algebra might not have worked out as
nicely. In particular, if we start with a non-\(\beta\) prior, then this trick
won’t work, because our prior will not be conjugate to the binomial distribution.</p>
<p>The other important conjugate pair to know is that of the Gaussian
distribution; it is, in fact, conjugate to itself, so if you estimate the
parameters of a normal distribution, those estimates are themselves normal, and
updating your belief about the parameters based on new draws from the normal
distribution is as simple as doing some algebra.</p>
<p>There are many good resources available online and in textbooks discussing
conjugate priors; <a href="https://en.wikipedia.org/wiki/Conjugate_prior">Wikipedia</a> is a good place to start.</p>
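<p>As a quick sketch of the coin-flip example (assuming the uniform-prior setup above, and using the standard fact that a uniform prior is \(\mathrm{Beta}(1,1)\)), the conjugate update is pure bookkeeping - no integrals required:</p>

```python
# Coin-flip example: a uniform prior on p is Beta(1, 1); after N flips
# with k heads, the conjugate update gives a Beta(k + 1, N - k + 1)
# posterior. The numbers here are invented for illustration.
N, k = 100, 60

alpha, beta_param = 1 + k, 1 + (N - k)   # posterior Beta parameters
posterior_mean = alpha / (alpha + beta_param)
print(posterior_mean)  # 61/102, just under 0.6
```

<p>Each new flip just increments one of the two parameters, which is exactly what makes conjugate priors so convenient for sequential updating.</p>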
<h1 id="maximum-likelihood-estimation">Maximum Likelihood Estimation</h1>
<p>We discussed before the case where you have a bunch of survey data, and want to estimate
the proportion of the population that identifies as female. Statistically speaking,
this proportion is a <em>parameter</em> of the probability distribution over gender identity in
that geographical region. We’ve intuitively been saying that if we see 250 out of
400 respond that they are female, then our best estimate of the proportion is 5/8. Let’s
get a little more formal about why exactly this is our best estimate.</p>
<p>First of all, I’m going to consider a simplified world in which there are only two
genders, male and female. I do this to simplify the statistics, not because it is an
accurate model of the world. In this world, if the <em>true</em> fraction of the population
that identifies as female is 0.6, then there is some non-zero probability that you would
draw a sample of 400 people in which 250 identify as female. We call this the
<em>likelihood</em> of the parameter 0.6. In particular, the binomial distribution tells us
that</p>
\[\mathcal{L}(0.6|n_\text{female}=250) = {400 \choose 250} \,0.6^{250}\, (1-0.6)^{400-250}\]
<p>Of course, I could calculate this for any parameter in \([0,1]\); if I were very far
from 5/8, however, then this likelihood would be very small.</p>
<p>Now, a natural question to ask is “which parameter \(p\) would give us the highest
likelihood?” That is, which parameter best fits our data? That is the
<strong>maximum-likelihood estimate</strong> of the parameter \(p\). The actual calculation of that
maximum involves some calculus and a neat trick involving logarithms, but I’ll refer the
reader <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading10b.pdf">elsewhere</a> for those details. It’s worth noting that the MLE is often our
intuitive “best guess” at the parameter; in this case, as you might anticipate,
\(p=5/8\) maximizes the likelihood of seeing 250 people out of 400 identify as female.</p>
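<p>You can verify this numerically with a simple grid search over the log-likelihood (a sketch, not a substitute for the calculus):</p>

```python
import numpy as np

# Binomial log-likelihood of observing 250 females among 400 respondents,
# evaluated over a grid of candidate parameters p. (The binomial
# coefficient is constant in p, so it can be dropped.)
n, k = 400, 250
p_grid = np.linspace(0.001, 0.999, 9999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_lik)]
print(p_mle)  # very close to 5/8 = 0.625
```

<p>Working on the log scale is the “neat trick” mentioned above: it turns the product of probabilities into a sum, which is both easier to differentiate and numerically stable.</p>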
<p>I won’t give any question here, because I honestly have not seen any in my searching
around. Even so, I think it’s an important concept to be familiar with. Maximum
likelihood estimation often provides a theoretical foundation for our intuitive
estimates of parameters, and it’s helpful to be able to justify yourself in this
framework.</p>
<p>For example, if you’re looking at samples from an exponential distribution, and you want
to identify the parameter \(\lambda\), you might guess that since the mean of an
exponential random variable is \(\mu= 1/\lambda\), a good guess would be \(\lambda
\approx 1/\overline x\), where \(\overline x\) is your sample mean. In fact you would be
correct, and this is the MLE for \(\lambda\); you should be familiar with this way of
thinking about parameter estimation.</p>
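<p>A quick simulation (synthetic data with a known rate, chosen arbitrarily for illustration) confirms that \(1/\overline{x}\) recovers \(\lambda\):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw from an exponential distribution with a known rate, then check
# that the MLE, 1 / sample_mean, recovers it.
true_lambda = 2.0
samples = rng.exponential(scale=1 / true_lambda, size=100_000)

lambda_mle = 1 / samples.mean()
print(lambda_mle)  # close to 2.0
```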
<h1 id="experimental-design">Experimental Design</h1>
<p>Last, but certainly not least, is the large subject of experimental design. This is a
more nebulous topic, and therefore harder to familiarize yourself with quickly, than the
others we’ve discussed so far.</p>
<p>If we have some new feature, we might have reason to think it will be good to include in
our product. For example, Facebook rolled out a “stories” feature some time ago (I
honestly couldn’t tell you what it does, but it’s something that sits on the top of
your newsfeed). However, before they expose this to all their users, they want to put it
out there “in the wild” and see how it performs. So, they run an experiment.</p>
<p>Designing this experiment in a valid way is essential to getting meaningful, informative
results. An interview question at Facebook might be: <strong>How will you analyze if launching
stories is a good idea? What data would you look at?</strong> The discussion of this question
could easily fill a full 45-minute interview session, as there are many nuances and
details to examine.</p>
<p>One basic approach would be to randomly show the “stories” feature to some people, and
not to others, and then see how it affects their behavior. This is an A/B test. Some
questions you should be thinking about are:</p>
<ul>
<li><strong>What metrics will we want to track in order to measure the effect of stories?</strong> For
example, we might measure the time spent on the site, the number of clicks, etc.</li>
<li><strong>How should we randomize the two groups?</strong> Should we randomly choose every time someone
visits the site whether to show them stories or not? Or should we make a choice for
each <em>user</em> and fix that choice? Generally, user-based randomization is preferable,
although sometimes it’s hard to do across devices (think about why this is).</li>
<li><strong>How long should we run the tests? How many people should be in each group?</strong> This
decision is often based on a <em>power calculation</em>, which gives us the probability of
rejecting the null hypothesis, given some alternative hypothesis. I personally am not
a huge fan of these because the alternative hypothesis is usually quite ad-hoc, but it
is the standard, so it’s good to know how to do it. For example, you might demand that
your test be large enough that if including stories increases site visit time by at
least one minute, our A/B test will detect that with 90% probability.</li>
<li><strong>When can we stop the test?</strong> The important thing to note here is that you <strong>cannot</strong>
just stop the test once the results look good - you have to decide beforehand how long
you want it to run.</li>
<li><strong>How will you deal with confounding variables?</strong> What if, due to some technical
difficulty, you end up mostly showing stories to users at a certain time of day, or in
a certain geographical region? There are a variety of approaches here, and I won’t get
into the details, but it’s essential that you be able to answer this concern clearly
and thoroughly.</li>
</ul>
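<p>A power calculation like the one described above can also be done by brute-force simulation, which is often easier to reason about than closed-form formulas. This sketch (synthetic data, with an arbitrary effect size of half a standard deviation) estimates the power of a two-sample \(z\)-test:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated power calculation for a two-group A/B test: if the true
# effect is a 0.5-standard-deviation lift, how often does a two-sample
# z-test at alpha = 0.05 reject the null with 100 users per group?
n_per_group, effect, n_sims = 100, 0.5, 2000
rejections = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(effect, 1.0, n_per_group)
    z = (treatment.mean() - control.mean()) / np.sqrt(
        control.var(ddof=1) / n_per_group + treatment.var(ddof=1) / n_per_group
    )
    rejections += abs(z) > 1.96  # two-sided test at the 5% level
power = rejections / n_sims
print(power)  # typically around 0.94 for these settings
```

<p>To size a real test, you would run this over a range of group sizes and pick the smallest one that achieves your target power (e.g. 90%).</p>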
<p>It’s also worth considering scenarios where you have to analyze data after the fact in
order to perform “experiments”; sometimes you want to know (for example) if the color of
a product has affected how well it sold, and you want to do so using existing sales
data. What limitations might this impose? A key limitation is that of confounding
variables - perhaps the product in red mostly sold in certain geographic regions,
whereas the blue version sold better in other geographic regions. What impact will this
have on your analysis?</p>
<p>There are many other considerations to think about around experimental design. I don’t
have any particular posts that I like; I’d recommend searching around Google to find
more information on the topic.</p>
<p>If you have any friends that do statistics professionally, I’d suggest sketching out a
design for the above experiment and talking through it with them - the ability to think
through an experimental design is something that is best developed over years of
professional experience.</p>
<h1 id="conclusion">Conclusion</h1>
<p>This guide has focused on some of the basic aspects of statistics that get covered in
data science interviews. It is far from exhaustive - different companies focus on
different skills, and will therefore be asking you about different statistical concepts
and techniques. I haven’t discussed time-dependent statistics at all - Markov chains,
time-series analysis, forecasting, and stochastic processes all might be of interest to
employers if they are relevant to the field of work.</p>
<p>Please let me know if you have any corrections to what I’ve said here. I’m far
from a statistician, so I’m sure that I’ve made lots of small (and some large)
mistakes!</p>
<p>Stay tuned for the rest of the study guide, which should be appearing in the
coming months. And finally, best of luck with your job search! It can be a
challenging, and even demoralizing experience; just keep learning, and don’t
let rejection get you down. Happy hunting!</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote1" role="doc-endnote">
<p>Of course, the actual statement is careful about the mode of
convergence, and the fact that it is actually an appropriately-normalized
version of the distribution that converges, and so on. <a href="#fnref:fnote1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote2" role="doc-endnote">
<p>Again, we’re being loose here - it has to have finite variance, and
the convergence is only in a specific sense. <a href="#fnref:fnote2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote3" role="doc-endnote">
<p>I’m being a little loose with definitions here - the width of a
\(2\sigma\) inverval is actually \(4\sigma\), but I think most would still
describe it using the phrase “two-sigma”. <a href="#fnref:fnote3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote4" role="doc-endnote">
<p>As usual, we’re being a bit sloppy - we’re just using the sample variance in
place of the true variance and pretending this is correct. This will work if the
number of samples \(n\) is large. If you need confidence intervals with few (say,
less than 15) samples, I recommend you look into confidence intervals based on the
student-t distribution. <a href="#fnref:fnote4" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:fnote4:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:fnotez" role="doc-endnote">
<p>In doing bootstrapping, we’re really trying to find the distribution of our
statistic \(\hat S\). So, what we find via this method are bounds \((l,u)\) such that
\(P(l\leq \hat S \leq u)\geq C\). How does this relate to the definition of a
confidence interval? This is a somewhat theoretic exercise, but can be helpful in
clarifying your understanding of the more technical aspects of confidence interval
computation. <a href="#fnref:fnotez" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnoted" role="doc-endnote">
<p>Why are you using MATLAB? Stop that. You’re not in school anymore. <a href="#fnref:fnoted" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnotec" role="doc-endnote">
<p>Some of the issues that arise here (for example, over- and
under-fitting) have solutions that are more practical and less theoretical and
statistical in nature - these will be covered in more depth in the machine
learning portion of this guide, and so we don’t go into too much detail in this
section. <a href="#fnref:fnotec" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnoteb" role="doc-endnote">
<p>\(\beta_0\) just represents the difference in the mean of the two
variables, so it could be non-zero even if the two are independent. <a href="#fnref:fnoteb" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comPart I of my guide to data science interviews, focusing on statistics and experimental design.New Paper: Metrics For Graph Comparison2019-07-05T00:00:00+00:002019-07-05T00:00:00+00:00http://www.pwills.com/posts/2019/07/05/metrics-paper<p>I just put a <a href="https://www.biorxiv.org/content/10.1101/611509v1">new paper up on bioRxiv</a>, and so I thought I would share it
here. This was the final paper I wrote for my Ph.D., and it’s the one I’m most proud
of. The paper is called “Metrics for Graph Comparison: a Practitioner’s Guide.”</p>
<h1 id="the-basic-idea">The Basic Idea</h1>
<p>Suppose you have two graphs, or even just a single graph that is changing in time. For
example, you might have a social network between students at a school that evolves as
time passes. Below, we see the social network for a particular French elementary school,
which is evolving as the day passes. Each vertex is a person, and each edge indicates
face-to-face contact.</p>
<p><img src="/assets/images/research/class_graphs.png" alt="Primary School Graphs" /></p>
<p>One important question that we must answer is “how much did the graph change between
times \(t\) and \(t+1\)?” Said another way, how similar are graphs \(G_t\) and
\(G_{t+1}\)? The central subjects of this paper are the many methods available for
comparing graphs.</p>
<p>We study these methods both by looking at empirical examples like the one above, as well
as by doing a large study of the statistics of comparing various random graph
models. Which graph comparison tool can best distinguish an Erdos-Renyi random graph
from a stochastic blockmodel? What about comparing a random graph with fixed degree
distribution to a preferential attachment graph? Using Monte Carlo simulation of the
graphs, we are able to answer these questions and gain insight into the behavior of our
distances when they are used on a variety of different structures and geometries.</p>
<p>One important focus of the paper is on practicality, and so we only look at distances
that are linear or near-linear (i.e. \(O(n)\) or \(O(n \log n)\)) in the number of
vertices in the graph.<sup id="fnref:fnote1" role="doc-noteref"><a href="#fn:fnote1" class="footnote">1</a></sup> More computationally expensive distances may be of
theoretical interest, but for the graphs used in business, which often range upwards of
1 million vertices, they are not feasible to use.</p>
<h1 id="findings">Findings</h1>
<p>There is a lot of nuance in the interpretation of these comparisons - it’s not as
simple as “method X is the best.” The results depend strongly on the structural
differences you wish to learn about in the graph. Do you care about total
connectivity? Then just use a simple edit distance. If you care about the community
structure of a graph, then you should probably use a spectral distance.</p>
<p>That said, we find that spectral methods (which are quite standard, and have been around
for some time) are strong performers all around. They are robust, flexible, and have the
added benefit of easy implementation - fast spectral algorithms are ubiquitous in modern
computing packages such as MATLAB, SciPy, and Julia.</p>
<p>For example, here is a plot showing how well the different distances are able to discern
an Erdos-Renyi random graph from a stochastic blockmodel.</p>
<p><img src="/assets/images/metric_comparison_plot.png" alt="ER_SBM_Comparison" /></p>
<p>Higher numbers mean that the distances can more reliably discern between the two
populations. We see that the adjacency spectral distance \(\lambda^A\) and the
normalized Laplacian spectral distance \(\lambda^{\mathcal L}\) are most reliably able
to pick out the community structure that differentiates between these two models. This
is not surprising, as the spectrum of the graph has a direct interpretation in terms of
vibrational modes, which depend critically upon community structure.</p>
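<p>To illustrate the idea (this is a toy sketch, not the NetComp implementation), an adjacency spectral distance can be computed by comparing the sorted eigenvalue sequences of the two adjacency matrices:</p>

```python
import numpy as np

def spectral_distance(A1, A2):
    """Euclidean distance between sorted adjacency spectra.

    A minimal illustration of a spectral graph distance; assumes both
    graphs have the same number of vertices.
    """
    ev1 = np.sort(np.linalg.eigvalsh(A1))
    ev2 = np.sort(np.linalg.eigvalsh(A2))
    return np.linalg.norm(ev1 - ev2)

# A path on 3 vertices vs. a triangle
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
tri  = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)

print(spectral_distance(path, path))  # 0.0
print(spectral_distance(path, tri))   # positive
```

<p>Because only the eigenvalues are compared, this distance is invariant to vertex relabeling, which is one reason spectral methods are so convenient in practice.</p>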
<p>If you want to know more, check out <a href="https://www.biorxiv.org/content/10.1101/611509v1">the full paper</a>. The above result is just one of
a large collection of findings that we lay out. As I said before, the idea isn’t to come
to a single conclusion; it is to survey the landscape and to compare and contrast these
different tools.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In research, so many people spend so much time developing new methods, and I always
think to myself, “How does this compare to the standard method? Is it actually an
improvement?” This paper attempts to take stock of a number of standard and cutting-edge
methods in graph comparison, and see what works best. After spending some time doing a
theoretical analysis of a particular graph distance metric (see <a href="https://arxiv.org/abs/1707.07362">my previous paper</a>)
I was curious to see how all the tools available compared to one another.</p>
<p>Also, I’ve implemented many of these distances in my Python library <a href="https://www.github.com/peterewills/netcomp">NetComp</a>, which
you can get via <code class="language-plaintext highlighter-rouge">pip install netcomp</code>. Check it out, and feel free to post issues and/or
PRs if you want to add to/modify the library.</p>
<p>Let me know in the comments what you think! Or feel free to email me if you
have more detailed questions about graph metrics. Happy Friday!</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote1" role="doc-endnote">
<p>This is paired with the assumption that the graph is sparse, so the
number of edges is \(O(n \log n)\) <a href="#fnref:fnote1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comA brief discussion of my latest paper, which benchmarks various metrics used to compare complex networks, also known as graphs.Types as Propositions2018-11-30T00:00:00+00:002018-11-30T00:00:00+00:00http://www.pwills.com/posts/2018/11/30/types<p>Some of the most meaningful mathematical realizations that I’ve had have been
unexpected connections between two topics; that is, realizing that two concepts
that first appeared quite distinct are in fact one and the same. In our first
linear algebra courses, we learn that manipulation of matrices is, in fact,
equivalent to solving systems of equations. In quantum mechanics, we see that
<a href="https://en.wikipedia.org/wiki/Observable">physically observable quantities</a> are, mathematically speaking, linear
operators (I still don’t quite grok this one). And, my personal favorite
example, we learn in functional analysis that the linear functionals in the dual
space of a Hilbert space are themselves in perfect correspondence with the
functions in the original space.<sup id="fnref:fnote1" role="doc-noteref"><a href="#fn:fnote1" class="footnote">1</a></sup></p>
<p>Recently, I’ve stumbled upon another such result, which has captured my
attention for a while. The result, often referred to as Curry-Howard
correspondence, is the statement that propositions in a formal logical system
are equivalent to types in the simply typed lambda calculus. Loosely, this means
that <strong>logical statements are equivalent to data types</strong>!</p>
<p>Let’s unpack that a bit; “propositions” are just statements in a logical
system.<sup id="fnref:fnote15" role="doc-noteref"><a href="#fn:fnote15" class="footnote">2</a></sup> In mathematics, for example, one might put forward the
proposition “no even numbers are prime,” or “14 is greater than 18”. Note that
propositions need not be <em>true</em>; in fact, some logical systems support
propositions that cannot even be determined to be true or false.<sup id="fnref:fnote2" role="doc-noteref"><a href="#fn:fnote2" class="footnote">3</a></sup>
“Types” can be thought of as types in a computing language; <code class="language-plaintext highlighter-rouge">Integer</code>, <code class="language-plaintext highlighter-rouge">Boolean</code>,
and so on. We will have much more to say about types as we move forward, but for
now, hold in your mind the conventional notion of types as defined in a language
such as Java or Python (or better yet, Haskell).</p>
<p>How on earth could these two be in correspondence? On the surface, they appear
entirely separate concepts. In this post, I’ll spend some time unpacking what
this equivalence is actually saying, using a simple example. I am far from a
full understanding of it, but as usual, I write about it in the hopes that I’ll
be forced to clarify what I <em>do</em> understand, or even better, be corrected by
someone more knowledgeable than myself.</p>
<p>Speaking of those more knowledgeable than myself, there are various resources
online that I found very helpful in understanding the correspondence:
<a href="https://www.youtube.com/watch?v=IOiZatlZtGU&t=1176s">Philip Wadler’s talk</a> on the subject is a great starting point, and there
are a number of <a href="http://lambda-the-ultimate.org/node/1532">useful</a> <a href="https://stackoverflow.com/questions/2969140/what-are-the-most-interesting-equivalences-arising-from-the-curry-howard-isomorp">discussions</a> <a href="https://stackoverflow.com/questions/2829347/a-question-about-logic-and-the-curry-howard-correspondence">available</a> on StackExchange and
various functional programming forums.</p>
<h2 id="an-example">An Example</h2>
<p>I was confused by the idea of propositions as types when I first encountered it,
and after learning more, I believe that the root of my confusion lies in the
fact that types such as <code class="language-plaintext highlighter-rouge">Integer</code>, <code class="language-plaintext highlighter-rouge">Boolean</code>, and <code class="language-plaintext highlighter-rouge">String</code>, which we are
familiar with from programming, correspond to very trivial propositions, making
them poor examples. We’ll have to introduce something a bit fancier; a
<em>conditional type</em>. For example, <code class="language-plaintext highlighter-rouge">OddInt</code> might be odd Integers, and <code class="language-plaintext highlighter-rouge">PrimeInt</code>
might be prime integers. We’ll approximate these conditional types with custom
classes in Scala. Classes and types are <a href="https://stackoverflow.com/questions/5031640/what-is-the-difference-between-a-class-and-a-type-in-scala-and-java">different beasts</a>, of course, but
we will ignore that distinction in this post.<sup id="fnref:fnote3" role="doc-noteref"><a href="#fn:fnote3" class="footnote">4</a></sup></p>
<p>Let’s consider one conditional type in particular: <code class="language-plaintext highlighter-rouge">BigInteger</code>. This type
(actually a class in this example) is defined as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">BigInteger</span> <span class="o">(</span><span class="k">val</span> <span class="nv">value</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="o">{</span>
<span class="k">private</span> <span class="k">final</span> <span class="k">val</span> <span class="nv">LOWER_BOUND</span> <span class="k">=</span> <span class="mi">10000</span>
<span class="nf">if</span> <span class="o">(</span><span class="n">value</span> <span class="o"><</span> <span class="nc">LOWER_BOUND</span><span class="o">)</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">IllegalArgumentException</span><span class="o">(</span><span class="s">"Too small!"</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">toString</span> <span class="k">=</span> <span class="n">s</span><span class="s">"BigInteger($value)"</span>
<span class="o">}</span></code></pre></figure>
<p>One could then instantiate a <code class="language-plaintext highlighter-rouge">BigInteger</code> as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="nv">big</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">BigInteger</span><span class="o">(</span><span class="mi">10001</span><span class="o">)</span>
<span class="c1">// res0: BigInteger(10001)</span>
<span class="k">val</span> <span class="nv">small</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">BigInteger</span><span class="o">(</span><span class="mi">500</span><span class="o">)</span>
<span class="c1">// java.lang.IllegalArgumentException: Too small!</span></code></pre></figure>
<p>Now the fundamental question: what proposition corresponds to this type? In
simple scenarios like this, the corresponding proposition is that the type can
be <em>inhabited</em>; that is, there exists a value that satisfies that type. For
example, the type <code class="language-plaintext highlighter-rouge">BigInteger</code> corresponds to the claim “there exists an integer
\(i\) for which \( i > 10,000 \)”. Obviously, such an integer exists, and the
fact that we can instantiate this type indicates that it corresponds to a true
proposition. Alternatively, consider a type <code class="language-plaintext highlighter-rouge">WeirdInteger</code>, which is an integer
satisfying <code class="language-plaintext highlighter-rouge">i < 3 && i > 5</code>. We can define the type well enough, but there are
no values which satisfy it; it is an uninhabitable type, and so corresponds to a
false proposition.</p>
<h2 id="functions-and-implication">Functions and Implication</h2>
<p>Let’s make things a little more interesting. In programming languages, there are
not only primitive types like <code class="language-plaintext highlighter-rouge">Integer</code> and <code class="language-plaintext highlighter-rouge">Boolean</code>, but there are also
<strong>function types</strong>, which are the types of functions. For example, in Scala, the
function <code class="language-plaintext highlighter-rouge">def f(x: Int) = x.toString</code> has type <code class="language-plaintext highlighter-rouge">Int => String</code>, which is to say
it is a function that maps integers to strings.</p>
<p>What sort of propositions would <em>functions</em> correspond to? It turns out that
functions naturally map to <em>implication</em>. In some ways, the correspondence here
is very natural. Consider the conditional type <code class="language-plaintext highlighter-rouge">BigInteger</code>, and the conditional
type <code class="language-plaintext highlighter-rouge">BiggerInteger</code>. The definition of the latter should look familiar, from
above:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">BiggerInteger</span> <span class="o">(</span><span class="k">val</span> <span class="nv">value</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="o">{</span>
<span class="k">private</span> <span class="k">final</span> <span class="k">val</span> <span class="nv">LOWER_BOUND</span> <span class="k">=</span> <span class="mi">20000</span>
<span class="nf">if</span> <span class="o">(</span><span class="n">value</span> <span class="o"><</span> <span class="nc">LOWER_BOUND</span><span class="o">)</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">IllegalArgumentException</span><span class="o">(</span><span class="s">"Too small!"</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">toString</span> <span class="k">=</span> <span class="n">s</span><span class="s">"BiggerInteger($value)"</span>
<span class="o">}</span></code></pre></figure>
<p>Now, we can write a function that maps <code class="language-plaintext highlighter-rouge">BigInteger</code> to <code class="language-plaintext highlighter-rouge">BiggerInteger</code>:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">def</span> <span class="nf">makeBigger</span><span class="o">(</span><span class="n">b</span><span class="k">:</span> <span class="kt">BigInteger</span><span class="o">)</span><span class="k">:</span> <span class="kt">BiggerInteger</span> <span class="o">=</span>
<span class="k">new</span> <span class="nc">BiggerInteger</span><span class="o">(</span><span class="nv">b</span><span class="o">.</span><span class="py">value</span> <span class="o">*</span> <span class="mi">2</span><span class="o">)</span></code></pre></figure>
<p>Recall that the proposition corresponding to the type <code class="language-plaintext highlighter-rouge">BigInteger</code> is the
statement “there exists an integer greater than 10,000”, and the proposition
corresponding to <code class="language-plaintext highlighter-rouge">BiggerInteger</code> is the statement “there exists an integer greater than
20,000”; the proposition corresponding to the function type <code class="language-plaintext highlighter-rouge">BigInteger =>
BiggerInteger</code> is then just the statement “the existence of an integer above
10,000 implies the existence of an integer above 20,000”. And note that, as it
should be for an implication, we do not care whether there actually <em>does</em> exist
an integer above 10,000; we simply know that <em>if</em> one exists, then its existence
implies the existence of an integer above 20,000.</p>
<p>To be a bit more explicit, the function that we wrote above can be thought of as
a <strong>proof</strong> of the implication; in particular, if we suppose that there exists
an \(i\) such that \(i > 10,000\), then clearly \(2i > 20,000\), and so
if we let \(j=2i\), then we have proven the existence of a \(j\) such that
\(j > 20,000\). This is what the theoretical computer scientists mean when
they say that “programs are proofs”.</p>
<p>Of course, Scala is not a proof-checking language, and cannot tell during
compilation that the function <code class="language-plaintext highlighter-rouge">makeBigger</code> is valid; we would need a much richer
type system to be able to validate such functions. Consider that the following
function compiles with no problem, although there are no input values for which
it will not throw a (runtime) exception:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">def</span> <span class="nf">wonky</span><span class="o">(</span><span class="n">b</span><span class="k">:</span> <span class="kt">BigInteger</span><span class="o">)</span><span class="k">:</span> <span class="kt">BiggerInteger</span> <span class="o">=</span>
<span class="k">new</span> <span class="nc">BiggerInteger</span><span class="o">(</span><span class="nv">b</span><span class="o">.</span><span class="py">value</span> <span class="o">%</span> <span class="mi">1000</span><span class="o">)</span></code></pre></figure>
<h3 id="wait-what">Wait… what?</h3>
<p>If you think about it a bit more, it’s sort of a weird example; you
could map <em>any</em> type to <code class="language-plaintext highlighter-rouge">BiggerInteger</code>, just by doing <code class="language-plaintext highlighter-rouge">def f[A](a:A):
BiggerInteger = new BiggerInteger(20001)</code>. This is because the proposition that
corresponds to <code class="language-plaintext highlighter-rouge">BiggerInteger</code> is true (the type is inhabitable), and if B is
true, then A implies B for any A at all.</p>
<p>Common languages such as Haskell express only very trivial propositions with
their types; there does exist one uninhabitable type (<code class="language-plaintext highlighter-rouge">Void</code>), but I have not
found much use for it in practice. The benefit of using conditional types for
these examples is that we can explore at least some types which have
corresponding <em>false</em> propositions, such as <code class="language-plaintext highlighter-rouge">WeirdInteger</code>, which are integers
<code class="language-plaintext highlighter-rouge">i</code> which satisfy <code class="language-plaintext highlighter-rouge">i < 3 && i > 5</code>.</p>
<h2 id="in-conclusion">In Conclusion</h2>
<p>Seeing all this, you can begin to get a sense of how computer-assisted proof
techniques might arise out of it. If the fact that a program compiles is
equivalent to the truth of the corresponding proposition, then all we need is a
language with a rich enough type system to express interesting
statements. Examples of languages used in this way include <a href="https://coq.inria.fr/">Coq</a> and
<a href="https://en.wikipedia.org/wiki/Agda_(programming_language)">Agda</a>. A thorough discussion of such languages is beyond both the scope of
this post and my understanding.</p>
<p>I think what keeps me interested in this subject is that it remains quite
opaque to me; I’ve struggled to even come up with these simple (and flawed)
examples of how Curry-Howard correspondence plays out in practice. I hope that
anyone reading this who understands the subject better than I do will leave a
detailed list of my misunderstandings, so that I can better grasp this
mysterious and fascinating topic.</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote1" role="doc-endnote">
<p>This statement is difficult to understand without background in
functional analysis, but it is in fact one of the most beautiful examples of
such an equivalence result. <a href="#fnref:fnote1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote15" role="doc-endnote">
<p>I’m being a bit sloppy here. The type of logic we’re talking about
here is not classical logic, but rather in the sense of <a href="https://en.wikipedia.org/wiki/Natural_deduction">natural deduction</a>. <a href="#fnref:fnote15" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote2" role="doc-endnote">
<p>Such systems are called undecidable; see
<a href="https://en.wikipedia.org/wiki/Decidability_(logic)">the wiki entry on decidability</a> for more information. <a href="#fnref:fnote2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote3" role="doc-endnote">
<p>We won’t be careful about whether the idea of conditional types
presented here corresponds well with conditional types as they are actually
implemented in programming languages such as <a href="https://github.com/Microsoft/TypeScript/pull/21316">Typescript</a>. <a href="#fnref:fnote3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comWhat is the connection between data types and logical propositions? Surprisingly, it runs quite deep. This post explores and illuminates that link.Inverse Transform Sampling in Python2018-06-24T00:00:00+00:002018-06-24T00:00:00+00:00http://www.pwills.com/posts/2018/06/24/sampling<p>When doing data work, we often need to sample random variables. This is easy to
do if one wishes to sample from a Gaussian, or a uniform random variable, or a
variety of other common distributions, but what if we want to sample from an
arbitrary distribution? There is no obvious way to do this within
<code class="language-plaintext highlighter-rouge">scipy.stats</code>. So, I built a small library, <a href="https://www.github.com/peterewills/itsample"><code class="language-plaintext highlighter-rouge">inverse-transform-sample</code></a>,
that allows sampling from arbitrary, user-provided distributions. In use, it
looks like this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">pdf</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># unit Gaussian, not normalized
</span><span class="kn">from</span> <span class="nn">itsample</span> <span class="kn">import</span> <span class="n">sample</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">sample</span><span class="p">(</span><span class="n">pdf</span><span class="p">,</span><span class="mi">1000</span><span class="p">)</span> <span class="c1"># generate 1000 samples from pdf </span></code></pre></figure>
<p>The code is available <a href="https://www.github.com/peterewills/itsample">on GitHub</a>. In this post, I’ll outline the theory of
<a href="https://en.wikipedia.org/wiki/Inverse_transform_sampling">inverse transform sampling</a>, discuss computational details, and outline some
of the challenges faced in implementation.</p>
<h2 id="introduction-to-inverse-transform-sampling">Introduction to Inverse Transform Sampling</h2>
<p>Suppose we have a probability density function \(p(x)\), which has an
associated cumulative distribution function (CDF) \(F(x)\), defined as usual by</p>
\[F(x) = \int_{-\infty}^x p(s)ds.\]
<p>Recall that the cumulative distribution function \(F(x)\) tells us <em>the probability
that a random sample from \(p\) is less than or equal to \(x\)</em>.</p>
<p>Let’s take a second to notice something here. If we knew, for some \(x\), that
\(F(x)=t\), then drawing \(x\) from \(p\) is in some way <strong>equivalent to
drawing \(t\) from a uniform random variable on \([0,1]\)</strong>, since the CDF for
a uniform random variable is \(F_u(t) = t\).<sup id="fnref:fnote1" role="doc-noteref"><a href="#fn:fnote1" class="footnote">1</a></sup></p>
<p>That realization is the basis for inverse transform sampling. The procedure is:</p>
<ol>
<li>Draw a sample \(t\) uniformly from the interval \([0,1]\).</li>
<li>Solve the equation \(F(x)=t\) for \(x\) (invert the CDF).</li>
<li>Return the resulting \(x\) as the sample from \(p\).</li>
</ol>
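<p>The steps above translate almost directly into scipy calls. The following is a minimal sketch, not the <code class="language-plaintext highlighter-rouge">itsample</code> implementation; the example pdf and the root-bracketing interval \([-10, 10]\) are assumptions for illustration:</p>

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

pdf = lambda x: np.exp(-x**2 / 2)  # example: unnormalized unit Gaussian

# Normalize so that the CDF runs from 0 to 1.
norm = quad(pdf, -np.inf, np.inf)[0]
cdf = lambda x: quad(pdf, -np.inf, x)[0] / norm

def sample_one(rng):
    t = rng.uniform()                             # step 1: draw t ~ Uniform[0, 1]
    return brentq(lambda x: cdf(x) - t, -10, 10)  # steps 2-3: solve F(x) = t for x

rng = np.random.default_rng(0)
samples = [sample_one(rng) for _ in range(100)]
```

<p>Note that every evaluation of <code class="language-plaintext highlighter-rouge">cdf</code> triggers a fresh quadrature, which is the performance problem discussed below.</p>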
<h2 id="computational-considerations">Computational Considerations</h2>
<p>Most of the computational work done in the above algorithm comes in at step 2,
in which the CDF is inverted.<sup id="fnref:fnote2" role="doc-noteref"><a href="#fn:fnote2" class="footnote">2</a></sup> Consider Newton’s method, a typical
routine for finding numerical solutions to equations: the approach is iterative,
and so the function to be inverted, in our case the CDF \(F(x)\), is evaluated
many times. Now, in our case, since \(F\) is a (numerically computed) integral
of \(p\), this means that we will have to run our numerical quadrature routine
once for each evaluation of \(F\). Since we need <em>many</em> evaluations of \(F\)
for a single sample, this can lead to a significant slowdown in sampling.</p>
<p>Again, the pain point here is that our CDF \(F(x)\) is slow to evaluate,
because each evaluation requires numerical quadrature. What we need is an
approximation of the CDF that is fast to evaluate, as well as accurate.</p>
<h3 id="chebyshev-approximation-of-the-cdf">Chebyshev Approximation of the CDF</h3>
<p>I snooped around on the internet a bit, and found <a href="https://github.com/scipy/scipy/issues/3747">this feature request</a> for
scipy, which is related to this same issue. Although it never got off the
ground, I found an interesting link to <a href="https://arxiv.org/pdf/1307.1223.pdf">a 2013 paper by Olver & Townsend</a>, in
which they suggest using Chebyshev polynomials to approximate the PDF. The
advantage of this approach is that the integral of a series of Chebyshev
polynomials is known analytically - that is, if we know the Chebyshev expansion
of the PDF, we automatically know the Chebyshev expansion of the CDF as
well. This should allow us to rapidly invert the (Chebyshev approximation of
the) CDF, and thus sample from the distribution efficiently.</p>
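<p>The trick can be sketched with numpy's Chebyshev utilities. This is a rough illustration of the idea rather than the paper's algorithm; the pdf, the truncation interval, and the fixed degree are all assumptions:</p>

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

pdf = lambda x: np.exp(-x**2 / 2)  # example pdf
lo, hi = -6.0, 6.0                 # assumed finite support

# Fit a Chebyshev series to the pdf on [lo, hi]...
xs = np.linspace(lo, hi, 400)
cheb_pdf = Chebyshev.fit(xs, pdf(xs), deg=60, domain=[lo, hi])

# ...then integrate it analytically: the antiderivative of a Chebyshev
# series is itself a Chebyshev series, so no quadrature is needed.
raw_cdf = cheb_pdf.integ()
cheb_cdf = (raw_cdf - raw_cdf(lo)) / (raw_cdf(hi) - raw_cdf(lo))

# Evaluating the approximate CDF is now just polynomial evaluation.
print(cheb_cdf(0.0))  # ~0.5, by symmetry
```

<p>Once the fit is paid for up front, each CDF evaluation costs only a polynomial evaluation instead of a quadrature.</p>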
<h3 id="other-approaches">Other Approaches</h3>
<p>There are also less mathematically sophisticated approaches that immediately
present themselves. One might consider solving \(F(x)=t\) on a grid of \(t\)
values, and then building the function \(F^{-1}(x)\) by interpolation. One
could even simply transform the provided PDF into a histogram, and then use the
functionality built in to <code class="language-plaintext highlighter-rouge">scipy.stats</code> for sampling from a provided histogram
(more on that later). However, due to time constraints,
<code class="language-plaintext highlighter-rouge">inverse-transform-sample</code> only includes the numerical quadrature and Chebyshev
approaches.</p>
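<p>For concreteness, the grid-plus-interpolation idea can be sketched in a few lines (again with an assumed example pdf and bounds; this is not part of the library):</p>

```python
import numpy as np
from scipy.integrate import quad
from scipy.interpolate import interp1d

pdf = lambda x: np.exp(-x**2 / 2)  # example pdf
xs = np.linspace(-6, 6, 400)

# Tabulate the CDF once on the grid, then interpolate the inverse map t -> x.
cdf_vals = np.array([quad(pdf, -6, x)[0] for x in xs])
cdf_vals /= cdf_vals[-1]  # normalize so F runs from 0 to 1
inv_cdf = interp1d(cdf_vals, xs)  # x as a function of F(x)

rng = np.random.default_rng(1)
samples = inv_cdf(rng.uniform(size=1000))
```

<p>All the quadrature cost is paid up front; each sample afterwards is essentially a table lookup.</p>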
<h2 id="implementation-in-python">Implementation in Python</h2>
<p>The implementation of this approach is not horribly sophisticated, but in
exchange it exhibits that wonderful readability characteristic of Python
code. The complexity is the highest in the methods implementing the
Chebyshev-based approach; those without a background in numerical analysis may
wonder, for example, why the function is evaluated on <a href="https://en.wikipedia.org/wiki/Chebyshev_nodes">that particularly strange
set of nodes</a>.</p>
<p>In the quadrature-based approach, the numerical quadrature and root-finding
are both done via the <code class="language-plaintext highlighter-rouge">scipy</code> library (<code class="language-plaintext highlighter-rouge">scipy.integrate.quad</code> and
<code class="language-plaintext highlighter-rouge">scipy.optimize.root</code>, respectively). When using this approach, one can set the
boundaries of the PDF to be infinite, as <code class="language-plaintext highlighter-rouge">scipy.integrate.quad</code> supports
improper integrals. In the <a href="https://github.com/peterewills/itsample/blob/master/example.ipynb">notebook of examples</a>, we show that the samples
generated by this approach do, at least in the eyeball norm, conform to the
provided PDF. As we expected, this approach is slow - it takes about 7 seconds to generate
5,000 samples from a unit normal.</p>
<p>As with the quadrature and root-finding, pre-rolled functionality from <code class="language-plaintext highlighter-rouge">scipy</code> was
used to both compute and evaluate the Chebyshev approximants. When approximating
a PDF using Chebyshev polynomials, finite bounds must be provided. A
user-specified tolerance determines the order of the Chebyshev approximation;
however, rather than computing a true error, we simply use the size of the last
few Chebyshev coefficients as a proxy. Since this
approach differs from the previous one only in the way that the CDF is constructed,
we use the same function <code class="language-plaintext highlighter-rouge">sample</code> for both approaches; an option
<code class="language-plaintext highlighter-rouge">chebyshev=True</code> will generate a Chebyshev approximant of the CDF, rather than
using numerical quadrature.</p>
<p>I hoped that the Chebyshev approach would improve on this by an order of
magnitude or two; however, my hopes were thwarted. The implementation of the
Chebyshev approach is faster by perhaps a factor of 2 or 3, but does not offer
the kind of improvement I had hoped for. What happened? In testing, a single
evaluation of the Chebyshev CDF was not much faster than a single evaluation of
the quadrature CDF. The advantage of the Chebyshev CDF comes when one wishes to
evaluate a long, vectorized set of inputs; in this case, the Chebyshev CDF is
orders of magnitude faster than quadrature. But <code class="language-plaintext highlighter-rouge">scipy.optimize.root</code> does not
appear to take advantage of vectorization, which makes sense - in simple
iteration schemes, the value at which the next iteration occurs depends on the
outcome of the current iteration, so there is not a simple way to vectorize the
algorithm.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I suspect that the reason this feature is absent from large-scale libraries like
<code class="language-plaintext highlighter-rouge">scipy</code> and <code class="language-plaintext highlighter-rouge">numpy</code> is that it is difficult to build a sampler that is both fast
and accurate over a large enough class of PDFs. My approach sacrifices speed;
other approximation schemes may be very fast, but may not provide the accuracy
guarantees needed by some users.</p>
<p>What we’re left with is a library that is useful for generating small numbers
(less than 100,000) of samples. It’s worth noting that in the work of Olver &
Townsend, they seem to be able to use the Chebyshev approach to sample orders of
magnitude faster than my implementation, but sadly their Matlab code is nowhere
to be found in the Matlab library <a href="http://www.chebfun.org/"><code class="language-plaintext highlighter-rouge">chebfun</code></a>, which is the location
advertised in their work. Presumably they implemented their own root-finder, or
Chebyshev approximation scheme, or both. There’s a lot of space for improvement
here, but I simply ran out of time and energy on this one; if you feel inspired,
<a href="https://github.com/peterewills/itsample#contributing">fork the repo</a> and submit a pull request!</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote1" role="doc-endnote">
<p>This is only true for \(t\in [0,1]\). For \(t<0\),
\(F_u(t)=0\), and for \(t>1\), \(F_u(t)=1\). <a href="#fnref:fnote1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote2" role="doc-endnote">
<p>The inverse of the CDF is often called the percentile point function,
or PPF. <a href="#fnref:fnote2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comExplanation of, and code for, a small Python tool for sampling from arbitrary distributions.Algorithmic Musical Genre Classification2018-06-06T00:00:00+00:002018-06-06T00:00:00+00:00http://www.pwills.com/posts/2018/06/06/genre<p>If you are not automatically redirected, please <a href="/portfolio/genre_cls">click here</a></p>
<meta http-equiv="refresh" content="0;url=/portfolio/genre_cls" />Peter Willspeter@pwills.comA summary of a project of mine in which I build an algorithmic classifier that identifies the genre of a piece of music based directly on the waveform.The Meaning of Entropy2018-02-06T00:00:00+00:002018-02-06T00:00:00+00:00http://www.pwills.com/posts/2018/02/06/entropy<p><strong>Entropy</strong> is a word that we see a lot in various forms. Its classical use
comes from thermodynamics: e.g. “the entropy in the universe is always
increasing.” With the recent boom in statistics and machine learning, the word
has also seen a surge in use in information-theoretic contexts: e.g. “minimize
the cross-entropy of the validation set.”</p>
<p>It’s been an ongoing investigation for me, trying to figure out just what the
hell this information-theoretic entropy is all about, and how it connects to
the notion I’m familiar with from statistical mechanics. Reading through the
wonderful book <a href="https://www.amazon.com/Data-Analysis-Bayesian-Devinderjit-Sivia/dp/0198568320">Data Analysis: a Bayesian Tutorial</a> by D. S. Sivia, I
found the first connection between these two notions that really clicked for
me. I’m going to run through the basic argument here, in the hope that
reframing it in my own words will help me understand it more thoroughly.</p>
<h2 id="entropy-in-thermodynamics">Entropy in Thermodynamics</h2>
<p>Let’s start with the more intuitive notion, which is that of thermodynamic
entropy. This notion, when poorly explained, can seem opaque or quixotic;
however, when viewed through the right lens, it is straightforward, and the law
of increasing entropy becomes a highly intuitive result.</p>
<h3 id="counting-microstates">Counting Microstates</h3>
<p>Imagine, if you will, the bedroom of a teenager. We want to talk about the
entropy of two different states: the state of being “messy” and the state of
being “clean.” We will call these <strong>macrostates</strong>; they describe the macroscopic
(large-scale) view of the room. However, there are also many different
microstates. One can resolve these on a variety of scales, but let’s just say
they correspond to the location/position of each individual object in the
room. To review:</p>
<table>
<thead>
<tr>
<th>Type</th>
<th>Definition</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td>Macrostate</td>
<td>Overall Description</td>
<td>“Messy”</td>
</tr>
<tr>
<td>Microstate</td>
<td>Fine-Scale Description</td>
<td>“Underwear on lamp, shoes in bed, etc.”</td>
</tr>
</tbody>
</table>
<h3 id="the-boltzmann-entropy">The Boltzmann Entropy</h3>
<p>One might notice an interesting fact: that there are many more possible
microstates that correspond to “messy” than there are microstates that
correspond to “clean.” <strong>This is exactly what we mean when we say that a messy
room has higher entropy.</strong> In particular, the entropy of a macrostate is <strong>the
log of the number of microstates that correspond to that macrostate.</strong> We call
this the Boltzmann entropy, and denote it by \(S_B\). If there are
\(\Omega\) possible microstates that correspond to the macrostate of being
“messy,” then we define the entropy of this state as<sup id="fnref:fnote2" role="doc-noteref"><a href="#fn:fnote2" class="footnote">1</a></sup></p>
\[S_B(\text{messy}) = \log(\Omega).\]
<p>This is essentially all we need to know here.<sup id="fnref:fnote1" role="doc-noteref"><a href="#fn:fnote1" class="footnote">2</a></sup> The entropy tells us how many
different ways there are to get a certain state. A pyramid of oranges in a
supermarket has lower entropy than the oranges fallen all over the floor,
because there are many configurations of oranges that we would call “oranges all
over the floor,” but very few that we would call “a nicely organized pyramid of
oranges.”</p>
<p>In this context, the law of increasing entropy becomes almost tautological. If
things are moving around in our bedroom at random, and we call <em>most</em> of those
configurations “messy,” then the room will tend towards messiness rather than
cleanliness. We sometimes use the terms “order” and “disorder” to refer to
states of relatively low and high entropy, respectively.</p>
<h2 id="entropy-in-information-theory">Entropy in Information Theory</h2>
<p>One also frequently encounters a notion of entropy in statistics and information
theory. This is called the <em>Shannon entropy</em>, and the motivation for this post
is my persistent puzzlement over the connection between Boltzmann’s notion of
entropy and Shannon’s. Prior to reading <a href="https://www.amazon.com/Data-Analysis-Bayesian-Devinderjit-Sivia/dp/0198568320">D. Sivia’s manual</a>, I only knew
the definition of Shannon entropy, but his work presented such a clear
exposition of the connection to Boltzmann’s ideas that I felt compelled to share it.</p>
<h3 id="permutations-and-probabilities">Permutations and Probabilities</h3>
<p>We’ll work with a thought experiment.<sup id="fnref:fnote3" role="doc-noteref"><a href="#fn:fnote3" class="footnote">3</a></sup> Suppose we have \(N\) subjects
we organize into \(M\) groups, with \(N\gg M\). Let \(n_i\) indicate the
number of subjects that are in the \(i^\text{th}\) group, for
\(i=1,\ldots,M\). Of course,</p>
\[\sum_{i=1}^M n_i = N,\]
<p>and if we choose a person at random, the probability that they are in group
\(i\) is</p>
\[p_i = \frac{n_i}{N}.\]
<p>The <strong>Shannon entropy</strong> of such a discrete distribution is defined as</p>
\[S = -\sum_{i=1}^M p_i\log(p_i)\]
<p>But why? Why \(p\log(p)\)? Let’s look and see.</p>
<p>A macrostate of this system is defined by the sizes of the groups \(n_i\);
equivalently, it is defined by the probability distribution \(p_i\). A microstate of
this system specifies the group of each subject: the statement that
subject number \(j\) is in group \(i\), for each \(j=1,\ldots,N\). How many
microstates correspond to a given macrostate? For the first group, we can fill
it with any of the \(N\) participants, and we must choose \(n_1\) members of
the group, so the number of ways of assigning participants to this group is</p>
\[{N\choose n_1} = \frac{N!}{n_1!(N-n_1)!}\]
<p>For the second group, there are \(N - n_1\) remaining subjects, and we must assign
\(n_2\) of them, and so on. Thus, the total number of ways of arranging the
\(N\) balls into the groups of size \(n_i\) is</p>
\[\Omega = {N\choose n_1}{N-n_1 \choose n_2}\ldots {N-n_1-\ldots-n_{M-1}\choose n_M}.\]
<p>This horrendous list of binomial coefficients can be simplified down to just</p>
\[\Omega = \frac{N!}{n_1!n_2!\ldots n_M!}.\]
<p>The Boltzmann entropy of this macrostate is then</p>
\[S_B = \log(\Omega) = \log(N!) - \sum_{i=1}^M \log(n_i!)\]
<h3 id="from-boltzmann-to-shannon">From Boltzmann to Shannon</h3>
<p><strong>We will now show that the Boltzmann entropy is (approximately) a scaling of the
Shannon entropy</strong>; in particular, \(S_B \approx N\,S\). Things are going to get
slightly complicated in the algebra, but hang on. If you’d prefer, you can take
my word for it, and skip to the next section.</p>
<p>We will use the Stirling approximation \(\log(n!)\approx n\log(n)\)<sup id="fnref:fnote4" role="doc-noteref"><a href="#fn:fnote4" class="footnote">4</a></sup>
to simplify:</p>
\[S_B \approx N\log(N) - \sum_{i=1}^M n_i\log(n_i)\]
<p>Since the probability \(p_i=n_i/N\), we can re-express \(S_B\) in terms of
\(p_i\) via</p>
\[S_B \approx N\log(N)-N\sum_{i=1}^M p_i\log(Np_i)\]
<p>Since \(\sum_ip_i=1\), we have</p>
\[S_B \approx -N\sum_{i=1}^M p_i\log(p_i) = N \, S.\]
<p>Phew! So, the Boltzmann entropy \(S_B\) of having \(N\) subjects in \(M\)
groups with sizes \(n_i\) is (approximately) \(N\) times the Shannon
entropy.</p>
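<p>This approximation is easy to sanity-check numerically. Here is a quick check with arbitrary, made-up group sizes, using the log-gamma function to evaluate the log-factorials exactly:</p>

```python
from math import lgamma, log

# Arbitrary example: N = 1000 subjects split into M = 4 groups.
n = [100, 200, 300, 400]
N = sum(n)

# Exact Boltzmann entropy: log(N!) - sum_i log(n_i!), via lgamma(k + 1) = log(k!).
S_B = lgamma(N + 1) - sum(lgamma(n_i + 1) for n_i in n)

# Shannon entropy of the induced distribution p_i = n_i / N.
S = -sum((n_i / N) * log(n_i / N) for n_i in n)

print(S_B, N * S)  # the two agree to within about one percent
```

<p>The agreement tightens as \(N\) grows, since the Stirling approximation improves with larger arguments.</p>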
<h2 id="who-cares">Who Cares?</h2>
<p>Admittedly, this kind of theoretical revelation will probably not change the way
you deploy cross-entropy in your machine learning projects. It is primarily used
because its gradients behave well, which is important in the stochastic
gradient-descent algorithms favored by modern deep-learning
architectures. However, I personally dislike using tools that I
don’t have a theoretical understanding of; hopefully you now have a better
grip on the theoretical underpinnings of cross entropy, and its relationship to
statistical mechanics.</p>
<!-------------------------------- FOOTER ---------------------------->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fnote2" role="doc-endnote">
<p>Often a constant will be included in this definition, so that
\(S=k_B \log(\Omega)\). This constant is arbitrary, as it simply rescales
the units of our entropy, and it will only serve to get in the way of our
analysis, so we omit it. <a href="#fnref:fnote2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote1" role="doc-endnote">
<p>All we need to know for the purpose of establishing a connection
between thermodynamic and information-theoretic entropy; of course there is
much more to know, and there are many alternative ways of conceptualizing
entropy. However, none of these have ever been intuitive to me in the way
that Boltzmann’s definition of entropy is. <a href="#fnref:fnote1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote3" role="doc-endnote">
<p>We have slightly rephrased Sivia’s presentation to fit our purposes here. <a href="#fnref:fnote3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fnote4" role="doc-endnote">
<p>The most commonly used form of Stirling’s approximation is the more
precise \(\log(n!)\approx n\log(n)-n\), but we use a coarser form here. <a href="#fnref:fnote4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Peter Willspeter@pwills.comWe often talk about entropy, but what does it really mean? How do its various uses relate to one another? How does the entropy of a sequence relate to the entropy of a physical state of matter?