pwills.com

Human-in-the-Loop ML

2023-08-08T00:00:00+00:00

It’s interesting to look back and think about what I’ve learned since I started working as a data scientist back in 2016. In school we’re taught about various algorithms and models, their mathematical properties, and some common use-cases. What we’re not taught is the nuances of actually turning these models into viable products.

My experience in industry has shown me that it can be essential to have a human in the loop for many ML products. In this post, I’ll discuss when you do and don’t need to use human-in-the-loop ML, how humans can be incorporated into ML systems, and the pros and cons of such an approach.

When should you use a human in the loop?

Augmenting your machine systems with human input can have dramatic strengths, but also significant drawbacks. It can often change the properties of your system (e.g. speed, accuracy, explainability, cost) by orders of magnitude, so it merits careful consideration of the pros and cons.

One dimension to consider is the importance of accuracy in your model. Serving ads and recommending content on social media need not be high-accuracy - if even 80% of what you recommend is relevant, the model can still have a strong positive impact on the product. Contrast this with applications like driverless cars. In these cases, system “misses” carry a very high cost, and so having human intervention as an option (or necessity) is often helpful.

Another important consideration is cost and scale. Involving humans in a process is much more costly than a raw machine learning model. It is also much harder to scale; building out a large network of laborers involves complicated financial, logistical, and legal considerations. This is more of a concern for organizations trying to go from 1 to 100 than for those going from 0 to 1, which operate on a smaller, more manageable scale. This suggests a bootstrapping strategy I will discuss later on.

Of course, adding a human into an automated system will have dramatic impacts on latency. A low-latency machine-learning system can return results in milliseconds; most human-based systems will have SLAs¹ on the order of hours. For applications like online ad serving, which must run entirely within the time a webpage loads, human intervention is therefore a non-starter. However, for applications like content moderation, where accuracy trumps latency concerns, human intervention can be very useful. A hybrid approach can also address this concern: the initial recommendation is provided and acted upon quickly, and human “review” of the action then follows within the much slower SLA.

Finally, there is the fuzzier notion of explainability and perception. If your model is customer-facing and there is a high cost for individual model failures (e.g. medical diagnosis models), then the customer will often demand an explanation for why a certain failure occurred. The same holds true for some cybersecurity applications; if an attack passes through, and a customer demands an explanation, human filtering can prevent embarrassing situations where the output of the ML system seems obviously wrong to a human, but we cannot explain why the system made the judgement it did. (This can be more than embarrassing; a handful of such incidents can be enough to drive away customers and hurt the market’s perception of the product.)²

How do you incorporate humans into your ML system?

There are a few different approaches to incorporating humans into an ML system, which each have their own advantages and disadvantages.

Human confirmation before output

One approach is to have an automated system that makes a suggestion to a human agent, who then confirms or modifies the suggestion before it is sent to the user. This is the approach used by Stitch Fix. Their recommendation engine sends recommendations to the stylists, who have final say in selecting what clothing is sent to the customer.

This approach is the “strongest” in terms of human intervention. Each output will be vetted by a human, and therefore the system will achieve maximum accuracy, but will have much slower output time and higher cost. Such an approach will only be useful for systems where the human-in-the-loop is a characteristic aspect of the product (e.g. Stitch Fix) since the approach leads to a system that behaves quite differently from a fully automated ML system.

Human verification after output

An alternative is to have an ML system that generates outputs that are sent to the user, and then verified by human agents some time later. For example, YouTube might run a system that detects whether a video should be removed from the site.³ It can immediately remove the video based on the system’s output, and then later have the video undergo human review. This review could result in the video being reinstated if it is determined to be appropriate.

This system has the benefit that the automated outputs are available immediately; in the example above, videos deemed problematic by the model are removed within seconds. The downside is that there is a time gap where the user is exposed to the system’s (potentially erroneous) output. In the example above, a video removed in error may stay off the platform for a few hours. That said, if your base model is reasonably accurate, then such system misses become rarer and such an approach can be very fruitful.

Most thoughtful system architects will employ partial verification. In the YouTube example above, if the model outputs >99.5% probability that a video should be removed, then that video might not undergo human review; if the model outputs between 80% and 99.5% probability, the video would undergo review.

Partial verification is tuneable; one can reduce cost by lowering the upper threshold (more videos skip review) or increase recall by lowering the lower threshold (more borderline videos get reviewed).
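To make the two thresholds concrete, here is a minimal sketch of the routing logic. The function name, the returned labels, and the choice to take no action below the lower threshold are my own assumptions for illustration; only the 80% and 99.5% cutoffs come from the example above.

```python
def route_video(removal_prob: float, lower: float = 0.80, upper: float = 0.995) -> str:
    """Decide how a model score is handled under partial verification.

    `lower` and `upper` are the two tuneable thresholds from the text.
    """
    if not 0.0 <= removal_prob <= 1.0:
        raise ValueError("removal_prob must be a probability in [0, 1]")
    if removal_prob > upper:
        # The model is confident enough that human review is skipped.
        return "remove_without_review"
    if removal_prob >= lower:
        # Borderline score: a human looks at the video.
        return "send_to_human_review"
    # Assumption: scores below the lower threshold trigger nothing.
    return "no_action"


for score in (0.999, 0.93, 0.40):
    print(score, route_video(score))
```

Lowering `upper` moves videos out of the review queue, and lowering `lower` moves them into it, which is the cost/recall trade-off described above.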

Product Bootstrapping

Another partial approach is to use human intervention to bootstrap your organization. A small startup with a good idea for an ML product might not have enough data for a highly accurate model. For such an organization, the benefits of a high-touch approach outweigh the costs. As the organization scales, they can transition away from the human-in-the-loop model and focus on (mostly) independent automated systems.

Of course, this transition is not a trivial matter. It can often be difficult to remove human verification and maintain system accuracy. However, as I said above, scale gives advantages to machine learning products, and these advantages may make it possible for the system to be freestanding (or at least much closer to freestanding than it was initially).

Oh, the humanity

Up until now, we’ve been discussing the impact of the humans on the system. It would be remiss of me not to discuss the impact of the system on the humans. I don’t have any simple conclusions here, but it’s worth at least raising a few points for consideration.

Working below the API can be challenging. The work is often hourly, with strict SLAs and a focus on productivity metrics. This can generate a high-pressure environment. Depending on the product area, hours can be irregular. The issues with the “gig economy” have been widely discussed in the media, and many of those issues are shared by the kind of systems we’re describing here.

However, there are also many benefits. The work is often remote, and since it is hourly, can be flexible and work with irregular schedules. Many of the stylists at Stitch Fix are mothers that supplement the family income by styling part-time. Before such systems existed, it would be very hard to find work that could be done from home in the three hours between when the baby goes to sleep and when the mother does; ideally, incorporating humans into ML systems can enable such work and be a win-win.

Conclusion

Going from “we could solve this with ML!” to an actually-viable product is often⁴ a bumpy road. Incorporating human feedback into an automated system is a key tool to help ease this transition. I don’t have any easy recommendations here; whether and how you should incorporate human input into your particular product is highly dependent on the product and the market in which it is situated.

But you should consider it. I’ve seen human augmentation assist ML companies at every stage of growth, from pre-seed to post-IPO. It is a tool that, in my opinion, every technology strategist should have in their toolkit.

A service-level agreement “defines the level of service expected by a customer from a supplier”; in this case, the “level of service” refers to the latency of a system. ↩
Even if a model is more accurate than the human-only alternative, explainability can still be an important psychological issue for customers. Consider a driverless car that has accident rates 1/10th those of an average driver; however, when it does crash, it does so seemingly at random. Public perception and adoption of such a product would (I predict) be poor, since when we are in such critical situations, we often rely on explanations to feel safe and in-control. Note that this may be less of an issue for internal-use models, where adoption can be decreed by management, and not driven by user perception. ↩
I’m not saying this is what YouTube actually does. This is just an example. ↩
Read: always. ↩

It’s Been a While…

2023-08-07T00:00:00+00:00

Since the last time I wrote on this site, a lot has changed. A global pandemic has come and (mostly) gone, which has given rise to remote work as a viable option for many in the tech sector, but also pushed forward an economic chain of events that made it harder for companies to get easy funding without demonstrating strong fiscal responsibility.¹ The industry looks very different today than it did back in September of 2019, when I wrote my last post.

What have I been up to in that time? I spent the beginning of the pandemic working remotely for Stitch Fix. During this time, my partner and I moved around the country, spending a month or two in various places in California, Washington, and finally near my childhood home in western New York State. Experiencing the support of being near my family and dear friends of decades, as well as the newfound viability of a remote-first career, led us to purchase a home in the Ecovillage at Ithaca. We’ve been living here for over a year now, and it’s been food for my soul to be near my people and live closer to the land. Swimming in the pond, sweating in the sauna in winter, swimming in the gorges in the summer, sharing meals with the community, going on night walks surrounded by fireflies; it really is a magical place.

Around the time I moved to Ithaca, I also left Stitch Fix to start a new job as Senior ML Engineer with Abnormal Security. I enjoyed the high-impact nature of the role, and the dynamic character of the organization, but after a year I was feeling really burned out, and realized I needed some time to myself. Over the last three months I’ve taken time to meditate, ride my bicycle, and reflect on where I want to go next in my career.

That reflection has coincided with one of the hottest summers on record. Sea ice in the Southern hemisphere is at a record low, and wildfires raging in Canadian forests have frequently blanketed the Eastern US in dangerous levels of smoke. It is now glaringly apparent that there is nowhere we can go to escape the impact of climate change - it will have effects everywhere in the world, even in places like Ithaca that are better insulated against the (local) effects.

Seeing this, and feeling the direct impact it has on my life, I think I need to try and do what I can to contribute. As the summer winds down, I feel ready to dive back into the professional world. To that end, I’m looking for a Senior ML/DS role in an organization that is working to mitigate climate change.

If you think you have an opening where I could be a good fit, or would just like to connect, don’t hesitate to reach out! My email is peter@pwills.com and I’m happy to put some time on the calendar to chat.

To the next adventure!

The Federal Reserve has increased interest rates in order to combat inflation. From this study on the sources of inflation by the US Bureau of Labor Statistics: “So, from this research, the authors find that three main components explain the rise in inflation since 2020: volatility of energy prices, backlogs of work orders for goods and service caused by supply chain issues due to COVID-19, and price changes in the auto-related industries.” ↩

Blogging in Org Mode

2019-09-24T00:00:00+00:00

I recently transitioned from writing my posts directly in markdown to writing them in org mode, a document authoring system built in GNU Emacs. I learned a lot along the way, and also built a new org exporter, ox-jekyll-lite.¹

Org Mode and the Meaning of Life

What is Org Mode?

Laozi said that the Tao that can be told is not the eternal Tao; I think we can safely say the same of org mode. Org mode is many things to many people, but at its core it is a tool for taking notes and organizing lists. Additional functionality allows for simple text markup, links, inline images, rendered \(\LaTeX\) fragments, and so on. You can embed and run code blocks within org files, using the powerful org-babel package. Some people have even written their Ph.D. thesis in org mode. It’s an amazingly powerful tool, with a passionate user base that is constantly expanding its capabilities.

Why Not Markdown?

I like to use org mode for my personal and professional note-taking because it has very good folding features - you can hide all headings besides the one you’re focusing on. You can even “narrow” your buffer, so that only the heading (“subtree”, in org-mode parlance) that you’re working on is present at all.

Org mode also has some nice visual features for writing, such as:

rendering \(\LaTeX\) fragments inline
styling bold, underlined, and italicized text properly
excellent automatic formatting of tables
code syntax highlighting in various languages
display of images inline

I wrote in markdown (using markdown-mode within emacs) for some time, but once I saw what org mode had to offer, I realized that I needed to transfer my blogging over to org. In particular, the Emacs mode markdown-mode doesn’t have a lot of the features that org mode does, such as inline rendering of math and images or well-built text folding. I used org for notes, and I realized that it would be much easier to just write in org instead of trying to get markdown mode to work the way I want it to.

Below is a short clip that shows just some of what org mode has to offer. You’ll want to full-screen it to make the text legible.

Overall, I find the experience of writing in org much more enjoyable than writing in markdown. Plus, I love hacking on emacs, and moving my blogging workflow over to org presented me with an opportunity to do just that! So of course, I couldn’t resist.

Org-Export and Jekyll

Blogging in Jekyll

The primary tool I use to generate my blog is a static-site generator called Jekyll, which is written in Ruby. I wrote a previous post describing my process for setting up my site. Pelican is a similar tool written in Python, and Hugo is a static-site generator written in Go. We’ll talk a bit more about Hugo later.

All of these tools allow the user to write content in simple markdown, with the site generator doing most of the heavy lifting in generating a full static site behind the scenes. In Jekyll, the user provides some basic configuration for each post, like a title, date, and excerpt, and then the theme determines the details of how the text is rendered into fully styled HTML. I use the excellent minimal mistakes theme.

Unfortunately, markdown is not a nicely unified language specification. There are many dialects of markdown, and each has subtle differences, so there is not, in general, one markdown specification to rule them all. For example, so-called “GitHub-flavored markdown”, which renders markdown from READMEs in GitHub repositories, has certain quirks that are not shared by the markdown I write for this site. To further complicate things, the static site generators often have their own quirks - Jekyll requires particularly-formatted front-matter to specify the configuration for each post, which is not part of the general markdown specification.

All that is to say, it wasn’t a trivial task to find something that converted org to the exact markdown that I need for my site. But before we jump into the details there, we should talk a bit about org exporters in general.

Org-Export

Org mode comes packaged with many built-in “exporters”, which convert from the org format to other text formats, including HTML, \(\LaTeX\), iCalendar, and more. It also comes with a backend that converts org to markdown, which I hoped would be all that I needed.

Unfortunately, the built-in ox-md exporter doesn’t work very well, for a few reasons. It falls back on using pure HTML (for example, to generate footnotes) when there are markdown-native ways of accomplishing the same thing. Also, some things don’t work at all - for example, equation exporting won’t work, since markdown requires you to enclose LaTeX with \\[ and \\], whereas HTML only requires a single backslash.²
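To illustrate the delimiter issue, here is a small Python sketch of the escaping step. It is not the code any org exporter uses; the function name is hypothetical, and treating the inline delimiters \( and \) the same way as the display delimiters is my assumption.

```python
def escape_math_delimiters(text: str) -> str:
    """Double the backslash in front of MathJax delimiters.

    Markdown consumes one backslash as an escape character, so the
    delimiter needs two for one to survive into the rendered HTML.
    """
    for delimiter in (r"\[", r"\]", r"\(", r"\)"):
        text = text.replace(delimiter, "\\" + delimiter)
    return text


print(escape_math_delimiters(r"The identity \[e^{i\pi} + 1 = 0\] is famous."))
```

Note that this sketch is not idempotent: running it twice on the same text would escape the delimiters twice.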

A quick search will show that there are many tools built to address this problem. Org exporter backends are designed to be easy to extend, and many users have extended the markdown backend to work with specific static site generators. The most fully developed of these is ox-hugo, which is built to work with the site generator Hugo. This package in particular would be a big source of the transcoding functions I would use, but since it is built to be tightly integrated with Hugo, I couldn’t just use it out of the box.

Elsa Gonsiorowski developed a Jekyll-friendly org exporter, called ox-jekyll-md, which provided the basis for what I would eventually build. She also wrote a blog post about it - if you’re interested in customizing org exporters, I’d recommend giving it a read.

Building `ox-jekyll-lite`

There are some things that ox-jekyll-md does very well, including generating the Jekyll-specific YAML front matter. However, I found that it lacks a few key features:

handling footnotes in a markdown-native way
rendering MathJax delimiters with double backslashes (to make them markdown-compatible)
exporting image links appropriately
exporting link paths relative to the Jekyll root directory

Since these were essential to my blogging workflow, I forked that project and began work on my org exporter, ox-jekyll-lite.

Customizing an Org Export Backend

You can think of an org-export backend as a collection of rules for transforming org files into another text format. For example, how should underlined text be handled? How about code blocks? How about \(\LaTeX\) snippets? Each of these rules is encapsulated by a so-called “transcoding function.”

Org export backends are built to be highly extensible. If you extend ox-md, for example, then you “inherit” all the transcoding functions that it provides, and you can add or replace only the functions you want to. For example, part of ox-jekyll-lite looks like

(org-export-define-derived-backend 'jekyll 'md
  :translate-alist
  '((headline . org-jekyll-lite-headline-offset)
    (inner-template . org-jekyll-lite-inner-template))) 

This tells us that we’re defining a backend named jekyll, which derives from the backend named md (which, if you look, itself derives from the html backend).

In the code above, the translate-alist indicates that this backend handles headline objects via the org-jekyll-lite-headline-offset method, and handles the inner-template object via org-jekyll-lite-inner-template. These functions take in org elements, returning text that will get dumped into the export buffer.

The transcoding function org-jekyll-lite-underline is a particularly simple example:

(defun org-jekyll-lite-underline (underline contents info)
  "Transcode UNDERLINE from Org to Markdown.
CONTENTS is the text with underline markup.  INFO is a plist
holding contextual information."
  (format "<u>%s</u>" contents))

Extending a backend consists of figuring out which elements you want to handle via special logic, then writing the appropriate transcoding functions for each.
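If elisp isn’t your thing, the inheritance idea can be sketched as a Python analogy. Nothing below is part of org-export; the dictionaries, function names, and element types are all hypothetical stand-ins for a backend’s table of transcoding functions.

```python
def md_bold(contents: str) -> str:
    """Stand-in for a transcoder provided by the parent backend."""
    return f"**{contents}**"


def md_underline(contents: str) -> str:
    """Stand-in parent behavior: drop the underline markup entirely."""
    return contents


def jekyll_underline(contents: str) -> str:
    """Replacement transcoder, mirroring the elisp underline example."""
    return f"<u>{contents}</u>"


# A "backend" here is just a mapping from element type to transcoder.
PARENT_BACKEND = {"bold": md_bold, "underline": md_underline}


def derive_backend(parent: dict, overrides: dict) -> dict:
    """Copy the parent's transcoders, then add or replace the given ones."""
    derived = dict(parent)
    derived.update(overrides)
    return derived


def transcode(backend: dict, element_type: str, contents: str) -> str:
    """Look up the transcoder for an element type and apply it."""
    return backend[element_type](contents)


JEKYLL_BACKEND = derive_backend(PARENT_BACKEND, {"underline": jekyll_underline})
print(transcode(JEKYLL_BACKEND, "bold", "inherited"))
print(transcode(JEKYLL_BACKEND, "underline", "overridden"))
```

The derived backend only names what it changes; everything else falls through to the parent, which is the same shape as the translate-alist shown earlier.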

Implementation Details for `ox-jekyll-lite`

Most of the more complicated transcoding functions in ox-jekyll-lite are not written by me. They either come from ox-jekyll-md, or from ox-hugo. For example, I got the transcoder for footnotes, and for \(\LaTeX\) snippets, from ox-hugo.

The most interesting addition that I made was to render file links relative to the root directory of Jekyll, when possible. For example, if you have an image in your assets/images folder, Jekyll wants you to link to it as /assets/images/kitties.jpg, not with the full path relative to the root directory of your computer’s filesystem.

However, when I use C-c C-l (along with Helm) to add a link to an org file, it renders the link with the absolute path.³ It’s important that the link is “correct” for my machine, so that any images can render inline, and so that I can click the links from within my org file. But if the links are relative to my filesystem’s root in the markdown, then they won’t work within the context of my site. So, we need to “fix” the links as we export the post to markdown.

I don’t get too complicated here - I just have the user specify a custom variable org-jekyll-project-root, which then gets pulled off of the beginning of file paths when it is present.

For example, on my machine, this repository is located at ~/code/jekyll/peterewills.github.io/, and so if I link to the file ~/code/jekyll/peterewills.github.io/assets/images/kitties.jpg in my org file, ox-jekyll-lite will, upon export, transform this to a link to /assets/images/kitties.jpg in the markdown output. This approach is nice and simple, but it doesn’t handle relative links, or the situation where you have multiple Jekyll projects.
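The prefix-stripping idea is simple enough to sketch in a few lines of Python. This is an illustration of the logic only, not the elisp in ox-jekyll-lite; the function name and the behavior for paths outside the project (returned unchanged) are my assumptions, and it assumes POSIX-style paths.

```python
import os.path


def jekyll_link_path(path: str, project_root: str) -> str:
    """Rewrite a path inside the Jekyll project as a site-root-relative link.

    Paths that do not live under `project_root` are returned unchanged.
    """
    full_path = os.path.normpath(os.path.expanduser(path))
    root = os.path.normpath(os.path.expanduser(project_root))
    if not full_path.startswith(root + os.sep):
        return path
    relative = os.path.relpath(full_path, root)
    return "/" + relative.replace(os.sep, "/")


root = "~/code/jekyll/peterewills.github.io/"
print(jekyll_link_path(root + "assets/images/kitties.jpg", root))
```

As in the real exporter, a single configured root means that relative links and multiple Jekyll projects are not handled.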

Anyways, if you want to give it a try, you can clone it from GitHub and check it out. You can just load it up and use C-c C-e j J to export an org file to a markdown buffer.

Finally, as a side note, I just have to give a shoutout to the excellent s.el and dash, which make working in elisp infinitely more pleasant. Many thanks to Magnar Sveen for building such nice tools for us all to use.

My Blogging Workflow

Now, my workflow for writing a post is pretty simple.

Have brilliant idea
Make an org file in the _posts directory, named like YYYY-MM-DD-post-name.org
Write brilliant words/equations/cat pictures/etc.
Export to markdown via C-c C-e j j
Commit & push to GitHub
Profit!⁴

The only additional complication, compared to a pure-markdown workflow, is the addition of the export step; other than that, it’s identical. And now I can blog in wonderful, beautiful org mode instead of clunky markdown.

An important caveat for anyone using org and Jekyll; in order to not have Jekyll stumble over the org artifacts, you should add *.org and ltximg to the list of excluded files in your Jekyll _config.yml. You can see mine on GitHub.

Conclusion

If you are just starting to blog, and you love org mode, I’d recommend using Hugo to build your site, so that you can use the excellent ox-hugo. It’s a truly org-centric approach to building a static site, and it’s much more fully-featured than any of the solutions I’ve found in Jekyll or Pelican.

But, you might want to use Jekyll, because it integrates automagically with GitHub pages, or perhaps you just like some of the available themes or whatnot. If that’s the case, then I think ox-jekyll-lite is a reasonable solution for writing your posts in org. It’s lightweight, and you’ll probably have to tweak it to fit your particular needs, but it’s small enough that modifying it shouldn’t be too hard. Also, you can always submit an issue on GitHub and I’ll see if I can help you out.

I hope this post has inspired you to explore more in org mode! It’s a great tool for organizing notes, tracking agendas/calendars/TODO lists, and for general writing.⁵ Happy blogging, and may the org be with you!

As I explain later on, this tool was based on both ox-jekyll-md and ox-hugo. ↩
The double backslash is required because markdown interprets the first backslash as an escape character. ↩
You can see an example of adding a link to an image in the org-mode demo video linked above. ↩
This is actually a lie; I don’t make any money from this site. ↩
There’s also the entire subject of literate programming, in which code is interwoven with documentation, which I think is a really nice paradigm, and for which org is a natural fit. ↩

Your p-values Are Bogus

2019-09-20T00:00:00+00:00

People often use a Gaussian to approximate distributions of sample means. This is generally justified by the central limit theorem, which states that the sample mean of an independent and identically distributed sequence of random variables converges to a normal random variable in distribution.¹ In hypothesis testing, we might use this to calculate a \(p\)-value, which then is used to drive decision making.

I’m going to show that calculating \(p\)-values in this way is actually incorrect, and leads to results that get less accurate as you collect more data! This has substantial implications for those who care about the statistical rigor of their A/B tests, which are often based on Gaussian (normal) approximations.

A Simple Example

Let’s take a very simple example. Let’s say that the prevailing wisdom is that no more than 20% of people like rollerskating. You suspect that the number is in fact much larger, and so you decide to run a statistical test. In this test, you model each person as a Bernoulli random variable with parameter \(p\). The null hypothesis \(H_0\) is that \(p\leq 0.2\). You decide to go out and ask 100 people their opinions on rollerskating.²

You begin gathering data. Unbeknownst to you, it is in fact the case that a full 80% of the population enjoys rollerskating. So, as you randomly ask people if they enjoy rollerskating, you end up getting a lot of “yes” responses. Once you’ve gotten 100 responses, you start analyzing the data.

It turns out that you got 74 “yes” responses, and 26 “no” responses. Since you’re a practiced statistician, you know that you can calculate a \(p\)-value by finding the probability that a binomial random variable with parameter \(p_0=0.2\) would generate a value \(k\geq74\) with \(n=100\). This probability is just

\[p_\text{exact} = \text{Prob}(k\geq 74) = \sum_{k=74}^{n}{n \choose k} p_0^{k} (1-p_0)^{(n-k)}.\]

However, you know that you can approximate a binomial distribution with a Gaussian of mean \(\mu=np_0\) and variance \(\sigma^2=np_0(1-p_0)\), so you decide to calculate an approximate \(p\)-value,

\[p_\text{approx} = \frac{1}{\sqrt{2\pi np_0(1-p_0)}}\int_{74}^\infty \exp\left(-\frac{(k-np_0)^2}{2np_0(1-p_0)}\right)\,dk.\]
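For the specific numbers in this example (\(k=74\), \(n=100\), \(p_0=0.2\)), both quantities can be evaluated directly. The sketch below uses only the Python standard library; the function names are mine, and the Gaussian tail is computed from the complementary error function rather than by numerical integration.

```python
import math


def exact_pvalue(k: int, n: int, p0: float) -> float:
    """P(K >= k) for K ~ Binomial(n, p0), summed term by term."""
    return sum(
        math.comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(k, n + 1)
    )


def normal_pvalue(k: int, n: int, p0: float) -> float:
    """Upper tail beyond k of a Gaussian with mean n*p0 and variance n*p0*(1-p0)."""
    mean = n * p0
    std = math.sqrt(n * p0 * (1 - p0))
    z = (k - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))


p_exact = exact_pvalue(74, 100, 0.2)
p_approx = normal_pvalue(74, 100, 0.2)
print(f"log p_exact  = {math.log(p_exact):.1f}")
print(f"log p_approx = {math.log(p_approx):.1f}")
```

Both values are tiny, so the logs are easier to compare than the raw probabilities.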

However, this approximation is actually incorrect, and will give you progressively worse estimates of \(p_\text{exact}\). Let’s observe this in action.

Python Simulation of Data

We simulate data for values \(n=1\) through \(n=1000\), and compute the corresponding exact and approximate \(p\)-values. We plot the log of the \(p\)-values, since they get very small very quickly.

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import norm, binom

plt.style.use(['classic', 'ggplot'])

p_true = 0.8  # true fraction of people who like rollerskating
p0 = 0.2      # value of p under the null hypothesis
n = 1000
data = binom.rvs(1, p_true, size=n)  # n simulated yes/no responses
# float dtype so that pandas treats the columns as numeric when plotting
p_vals = pd.DataFrame(
    index=range(1, n + 1),
    columns=['true p-value', 'normal approx. p-value'],
    dtype=float,
)

for n0 in range(1, n + 1):
    normal_dev = np.sqrt(n0*p0*(1-p0))
    normal_mean = n0*p0
    k = data[:n0].sum()  # number of "yes" responses among the first n0
    # the "survival function" is 1 - cdf, which is the p-value in our case
    normal_logpval = norm.logsf(k, loc=normal_mean, scale=normal_dev)
    # binom.logsf(k) is log P(K > k), so P(K >= k) needs k - 1
    true_logpval = binom.logsf(k - 1, n=n0, p=p0)
    p_vals.loc[n0, 'true p-value'] = true_logpval
    p_vals.loc[n0, 'normal approx. p-value'] = normal_logpval

# drop any rows where a log p-value came back infinite
p_vals.replace([-np.inf, np.inf], np.nan).dropna().plot(figsize=(8, 6))
plt.xlabel("Number of Samples")
plt.ylabel("Log-p Value")
plt.show()

We have to drop infs because after about \(n=850\) or so, the \(p\)-value actually gets too small for scipy.stats to calculate; it just returns -np.inf.

The resulting plot tells a shocking tale:

The approximation diverges from the exact value! Seeing this, you begin to weep bitterly. Is the Central Limit Theorem invalid? Has your whole life been a lie? It turns out that the answer to the first is a resounding no, and the second… probably also no. But then what is going on here?

Convergence Is Not Enough

The first thing to note is that, mathematically speaking, the two \(p\)-values \(p_\text{exact}\) and \(p_\text{approx}\) do, in fact, converge. That is to say, as we increase the number of samples, their difference is approaching zero:

\[\left| p_\text{exact} - p_\text{approx}\right| \rightarrow 0\]

What I’m arguing, then, is that convergence is not enough.

If it were, then we could just approximate the true \(p\)-value with 0. That is, we could report a \(p\)-value of \(p_\text{approx} = 0\), and claim that since our approximation is converging to the actual value, it should be taken seriously. Obviously, this should not be taken seriously as an approximation.

Our intuitive sense of “convergence”, the sense that \(p_\text{approx}\) is becoming “a better and better approximation of” \(p_\text{exact}\) as we take more samples, corresponds to the percent error converging to zero:

\[\left| \frac{p_\text{approx} - p_\text{exact}}{p_\text{exact}}\right| \rightarrow 0.\]

In terms of asymptotic decay, this is a stronger claim than convergence. Rather than their difference converging to zero, which means it is \(o(1)\), we demand that their difference converge to zero faster than \(p_\text{exact}\),

\[\left| p_\text{exact} - p_\text{approx}\right| = o\left(p_\text{exact}\right).\]
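The distinction between the two kinds of convergence can be checked numerically. The sketch below is a simplified, deterministic version of the experiment above: instead of random data it assumes that exactly 80% of responses are “yes” at each sample size, and it uses only the standard library. The function names are my own.

```python
import math

P0 = 0.2  # value of p under the null hypothesis


def exact_pvalue(k: int, n: int) -> float:
    """P(K >= k) for K ~ Binomial(n, P0)."""
    return sum(
        math.comb(n, j) * P0**j * (1 - P0) ** (n - j) for j in range(k, n + 1)
    )


def normal_pvalue(k: int, n: int) -> float:
    """Gaussian upper tail beyond k, with matching mean and variance."""
    z = (k - n * P0) / math.sqrt(n * P0 * (1 - P0))
    return 0.5 * math.erfc(z / math.sqrt(2))


rows = []
for n in (10, 50, 100, 200):
    k = (4 * n) // 5  # assumed observation: 80% "yes" responses
    p_exact = exact_pvalue(k, n)
    p_approx = normal_pvalue(k, n)
    abs_err = abs(p_exact - p_approx)
    rows.append((n, abs_err, abs_err / p_exact))

for n, abs_err, rel_err in rows:
    print(f"n={n:4d}  absolute error={abs_err:.3e}  relative error={rel_err:.3f}")
```

Under these assumptions the absolute error shrinks rapidly as \(n\) grows, while the relative error does not shrink at all.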

It would also suffice to have an upper bound on the \(p\)-value; that is, if we could say that \(p_\text{exact} < p_\text{approx}\), so \(p_\text{exact}\) is at worst our approximate value \(p_\text{approx}\), and we knew that this held regardless of sample size, then we could report our approximate result knowing that it was at worst a bit conservative. However, as far as I can see, the central limit theorem and other similar convergence results give us no such guarantee.

Implications

What I’ve shown is that for the simple case above, Gaussian approximation is not a strategy that will get you good estimates of the true \(p\)-value, especially for large amounts of data. You will under-estimate your \(p\)-value, and therefore overestimate the strength of evidence you have against the null hypothesis.

Although A/B testing is a slightly more complex scenario, I suspect that the same problem exists in that realm. A refresher on a typical A/B test scenario: you, as the administrator of the test, care about the difference between two sample means. If the samples are from Bernoulli random variables (a good model of click-through rates), then the true distribution of this difference is the distribution of the difference of (scaled) binomial random variables, which is more difficult to write down and work with. Of course, the Gaussian approximation is simple, since the difference of two Gaussians is again a Gaussian.³
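For reference, here is roughly what the Gaussian shortcut looks like in the A/B setting. This is a generic sketch of a one-sided two-sample test with unpooled variances, not a description of any particular A/B testing tool; the function name and the example counts are made up.

```python
import math


def two_sample_normal_pvalue(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """One-sided Gaussian p-value for the null 'rate of B <= rate of A'.

    Each sample mean is treated as Gaussian, so their difference is
    Gaussian with variance equal to the sum of the two variances.
    """
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    var_diff = rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b
    if var_diff == 0:
        raise ValueError("variance estimate is zero; the approximation is undefined")
    z = (rate_b - rate_a) / math.sqrt(var_diff)
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical counts: 100/1000 clicks for variant A, 130/1000 for variant B.
print(two_sample_normal_pvalue(100, 1000, 130, 1000))
```

The convenience is obvious, which is exactly why the accuracy of the tail probability it reports deserves scrutiny.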

Most statistical tests are approximate in this way. For example, the \(\chi^2\) test for goodness of fit is an approximate test. So what are we to make of the fact that this approximation does not guarantee increasingly valid \(p\)-values? Honestly, I don’t know. I’m sure that others have considered this issue, but I’m not familiar with the thinking of the statistical community on it. (As always, please comment if you know something that would help me understand this better.) All I know is that when doing tests like this in the future, I’ll be much more careful about how I report my results.

Afterword: Technical Details

As I said above, the two \(p\)-values do, in fact, converge. However, there is an interesting mathematical twist in that the convergence is not guaranteed by the central limit theorem. It’s a bit beside the point, and quite technical, but I found it so interesting that I thought I should write it up.

As I said, this section isn’t essential to my central argument about the insufficiency of simple convergence; it’s more of an interesting aside.

Limitations of the Central Limit Theorem

To understand the problem, we have to do a deep dive into the details of the central limit theorem. This will get technical. The TL;DR is that since our \(p\)-values are getting smaller, the CLT doesn’t actually guarantee that they will converge.

Suppose we have a sequence of random variables \(X_1, X_2, X_3, \ldots\). These would be, in the example above, the Bernoulli random variables that represent individual people’s responses to your question about rollerskates. Suppose that these random variables are independent and identically distributed, with mean \(\mu\) and finite variance \(\sigma^2\).⁴

Let \(S_n\) be the sample mean of all the \(X_i\) up through \(n\):

\[S_n = \frac{1}{n} \sum_{i=1}^n X_i.\]

We want to say what distribution the sample mean converges to. First, we know it’ll converge to something close to the mean, so let’s subtract that off so that it converges to something close to zero. So now we’re considering \(S_n - \mu\). But we also know that the standard deviation goes down like \(1/\sqrt{n}\), so to get it to converge to something stable, we have to multiply by \(\sqrt{n}\). So now we’re considering the shifted and scaled sample mean \(\sqrt{n}\left(S_n - \mu\right)\).

The central limit theorem states that this converges in distribution to a normal random variable with distribution \(N(0, \sigma^2)\). Notationally, you might see mathematicians write

\[\sqrt{n}\left(S_n-\mu\right)\ \xrightarrow{D} N(0,\sigma^2).\]

What does it mean that they converge in distribution? It means that, for a fixed region, the areas under the respective curves over that region converge. Note that we have to fix the region to get convergence. Let’s look at some pictures. First, note that we can plot the exact distribution of the variable \(\sqrt{n}(S_n-\mu)\); it’s just a binomial random variable, appropriately shifted and scaled. We’ll plot this alongside the normal approximation \(N(0,\sigma^2)\).

The area under the shaded part of the normal converges to the area of the bars in that same shaded region. This is what convergence in distribution means.

Now for the crux. As we gather data, it becomes more and more obvious that our null hypothesis is incorrect - that is, we move further and further out into the tail of the null hypothesis’ distribution for \(S_n\). This is very intuitive - as we gather more data, we expect our \(p\)-value to go down. The \(p\)-value is a tail integral of the distribution, so we expect to be moving further and further into the tail of the distribution.

Here’s a gif, where the shaded region represents the \(p\)-value that we’re calculating:

As we increase \(n\), the region we’re integrating over changes. So we don’t get convergence guarantees from the CLT.
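To make the moving-tail picture concrete, here is a minimal sketch using only the standard library. It assumes a hypothetical null hypothesis of \(p = 1/2\) and a hypothetical observed "yes" rate of 60%; these numbers are for illustration and are not the ones from the example above. The helper names (`exact_tail`, `normal_tail`, `compare`) are likewise made up for this sketch. It prints the exact binomial tail, the Gaussian tail, their absolute gap, and their ratio, so you can watch how each behaves as \(n\) grows.

```python
import math


def exact_tail(n, k):
    """P(X >= k) for X ~ Binomial(n, 1/2), using exact integer arithmetic."""
    favourable = sum(math.comb(n, i) for i in range(k, n + 1))
    return favourable / 2 ** n


def normal_tail(n, k):
    """Gaussian approximation to the same tail, with no continuity correction."""
    z = (k / n - 0.5) / math.sqrt(0.25 / n)
    return 0.5 * math.erfc(z / math.sqrt(2))


def compare(n):
    """Exact and approximate p-values when 60% of n responses are 'yes'."""
    k = 6 * n // 10  # hypothetical observed count; assumes n is a multiple of 10
    return exact_tail(n, k), normal_tail(n, k)


for n in (100, 400, 1600):
    exact, approx = compare(n)
    print(f"n={n:5d}  exact={exact:.3e}  approx={approx:.3e}  "
          f"gap={abs(exact - approx):.1e}  ratio={exact / approx:.3f}")
```

The absolute gap is the quantity that a uniform bound on the CDFs controls; the ratio is the quantity you care about when you report one \(p\)-value in place of the other.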

The Berry-Esseen Theorem

It’s worth noting that there is a stronger statement of convergence that applies specifically to the convergence of the binomial distribution to the corresponding Gaussian. It is called the Berry-Esseen Theorem, and it states that the maximum distance between the cumulative distribution functions of the binomial and the corresponding Gaussian is \(O(n^{-1/2})\). This claim, which is akin to uniform convergence of functions (compare to the pointwise convergence of the CLT) does, in fact, guarantee that our \(p\)-values will converge.
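For reference, one common way to write the bound is sketched below. Here \(F_n\) denotes the cumulative distribution function of the standardized sample mean \(\sqrt{n}(S_n - \mu)/\sigma\), \(\Phi\) is the standard normal cumulative distribution function, and \(C\) is an absolute constant; this notation is introduced here for illustration and is not used elsewhere in the post.

\[\sup_x \left| F_n(x) - \Phi(x) \right| \le \frac{C\,\rho}{\sigma^3 \sqrt{n}}, \qquad \rho = E\left|X_1 - \mu\right|^3.\]

For a Bernoulli random variable with parameter \(p\), the third absolute moment is \(\rho = p(1-p)\left[p^2 + (1-p)^2\right]\), which is finite, so the bound applies. Note that it controls the absolute difference between the two tail areas, not their ratio.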

But, as I’ve said above, this is immaterial, albeit interesting; we know already that the \(p\)-values converge, and we also know that this is not enough for us to be reporting one as an approximation of the other.

So long as the variance of the distribution being sampled is finite. ↩
You should decide this number based on some alternative hypothesis and a power analysis. Also, you should ensure that you are sampling people evenly - going to a park, for example, might bias your sample towards those that enjoy rollerskating. ↩
I haven’t done a numerical test on this scenario because the true distribution (the difference between two scaled binomials) is nontrivial to calculate, and numerical issues arise as we calculate such small \(p\)-values, which SciPy takes care of for us in the above example. But as I said, I would be unsurprised if our Gaussian-approximated \(p\)-values are increasingly poor approximations of the true \(p\)-value as we gather more samples. ↩
In our case, for a single Bernoulli random variable with parameter \(p\), we have \(\mu=p\) and \(\sigma^2=p(1-p)\). ↩

DS Interview Study Guide Part II: Software Engineering

2019-08-29T00:00:00+00:00

This post continues my series on data science interviews. One of the major difficulties of doing data science interviews is that you must show expertise in a wide variety of skills. In particular, I see four key subject areas that you might be asked about during an interview:

Statistics
Software Engineering/Coding
Machine Learning
“Soft” Questions

This post focuses on software engineering & coding. It will be primarily a resource for aggregating content that I think you should be familiar with. I will mostly point to outside sources for technical exposition and practice questions.

I’ll link to these as appropriate throughout the post, but I thought it would be helpful to put up front a list of the primary resources that I’ve used when studying for interviews. Some of my favorites are:

Data Structures and Algorithms in Python, for a good introduction to data structures such as linked lists, arrays, hashmaps, and so on. It can also give you a good sense of how to write idiomatic Python code for building fundamental classes.
SQLZoo for studying SQL and doing practice questions. I particularly like the “assessments”.
Cracking the Coding Interview for lots of practice questions organized by subject, and good general advice for the technical interviewing process.

I also use coding websites like LeetCode to practice various problems. I also look on Glassdoor to see what kinds of problems people have been asked.

As always, I’m working to improve this post, so please do leave comments with feedback.

What Languages Should I Know?

In this section of data science interviews, you are generally asked to implement things in code. So, which language should you do it in? Generally, the best answer is (unsurprisingly) that you should work in Python. The next most popular choice is R; I’m not very familiar with R, so I can’t really speak to its capabilities.

There are a few reasons you should work in Python:

It’s widely adopted within industry.
It has high-quality, popular packages for working with data (see pandas, numpy, scipy, statsmodels, scikit-learn, matplotlib, etc).
It bridges the gap between academic work (e.g. using NumPy to build a fast solver for differential equations) and industrial work (e.g. using Django to build webservices).

This is far from an exhaustive list. Anyways, I mostly work in Python. I think it’s a nice language because it is clear and simple to write.

If you want to use another language, you should make sure that you can do everything you need to - this includes reading & writing data, cleaning/munging data, plotting, implementing statistical and machine learning models, and leveraging basic data types like hashmaps and arrays (more on those later).

I think if you wanted to do your interviews in R it would be fine, so long as you can do the above. I would strongly recommend against languages like MATLAB, which are proprietary and not open-source.

Languages like Java can be tricky since they might not have the data-oriented libraries that Python has. For example, I’ve worked professionally in Scala, and am very comfortable manipulating data via the Spark API within it, but still wouldn’t want to have to use it in an interview; it just isn’t as friendly for general-purpose hacking as Python.

So is Python all you need? Well, not quite. You should also be familiar with SQL for querying databases; we’ll get into that later. I don’t think the dialect you use particularly matters. SQLZoo works with MySQL, which is fine. Familiarity with bash and shell-scripting is useful for a data scientist in their day-to-day work, but generally isn’t asked about in interviews. For the interviews, I’d say if you know one general-purpose language (preferably Python, or R if need be) and SQL, then you’ll be fine.

General Tips for Coding Interviews

Coding interviews are notorious for being high-stress, so it’s important that you practice in a way that will maximize your comfort during the interview itself - you don’t want to add any unnecessary additional stress into an already difficult situation. There are a wide variety of philosophies and approaches to preparing yourself for and executing a successful interview. I’m going to talk about some points that resonate with me, but I’d also recommend reading Cracking the Coding Interview for a good discussion. Of course, this isn’t the final word on the topic - there are endless resources available online that address this.

How to Prepare

When preparing for the interview, make sure to practice in an environment similar to the interview environment. There are a few aspects of this to keep in mind.

Make sure that you replicate the writing environment of the interview. So, if you’ll be coding on a whiteboard, try to get access to a whiteboard to practice. At least practice on a pad of paper, so that you’re comfortable with handwriting code - it’s really quite different than using a text editor. If you’ll be coding in a Google Doc, practice doing that (protip: use a monospaced font). Most places I’ve interviewed at don’t let you evaluate your code to test it, so you have to be prepared for that.
Time yourself! It’s important to make sure you can do these things in a reasonable amount of time. Generally, these things last 45 minutes per “round” (with multiple rounds for on-site interviews). Focus on being efficient at implementing simple ideas, so that you don’t waste a bunch of time with your syntax and things like that.
Practice talking. If you practice by coding silently by yourself, then it might feel strange when you’re in the interview and have to talk through your process. The best is if you can have a friend who is familiar with interviewing play the interviewer, so that you can talk to them, get asked questions, etc. You can also record yourself and just talk to the recorder, so that you get practice externalizing your thoughts.

There are some services online that will do “practice” interviews for you. When I was practicing for a software engineer interview with Google, I used Gainlo for this - they were kind of expensive, but you interview with real Google software engineers, which I found helpful.

However, the interviews for a software engineering position at Google are very standardized in format. I haven’t used any of the services that do this for data science, and since the interviews you’ll face are so varied, I imagine it is harder to do helpful “mock interviews”. If you’ve used any of these services, I’d be very curious to hear about your experience.

Tips for Interviewing

There are some things it’s important to keep in mind as you do the interview itself.

Talk about your thought process. Don’t just sit silently thinking, then go and write something on the board. Let the interviewer into your mind so that they can see how you are thinking about the problem. This is good advice at any point in a technical interview.
Start with a simple solution you have confidence in. If you know that you can quickly write up a suboptimal solution (for a sorting problem, maybe insertion sort), then do that! You can discuss why that solution is sub-optimal, and they will often brainstorm with you about how to improve it. That said, if you are just as confident in writing up something more optimal (say, quicksort) then feel free to jump right to that.
Sketch out your solution before doing real code. This is not necessary, but sometimes for complicated stuff it’s nice to write out your approach in pseudocode before jumping into real code. This can also help with exposing your thought process to the interviewer, and making sure they’re on board with how you’re thinking about it.
Think about edge cases. Suppose they ask you to write a function that sorts a list. What if you’re given an empty list? What if you’re given a list of non-comparable things? (In Python, this might be a list that mixes numbers and strings.) What does your function do in this case? Is that what you want it to do? There’s no right answer here, but you should definitely be thinking about this and asking the interviewer how they want the function to behave on these cases.
Be sure to do a time complexity analysis on your solution. They want to know that you can think about efficiency, so unless they explicitly ask you not to do this, I’d recommend it. We’ll discuss more about what this means below.

For a more thorough discussion of preparation and day-of techniques, I’d recommend Cracking the Coding Interview.

Tips for Coding

There are a few things specifically in how the interviewee writes code that I think are worth mentioning. This kind of stuff usually isn’t a huge deal, but if you write good code, it can show professionalism and help leave a good impression.

Name your variables well. If the variable is the average number of users per region, use avg_users_per_region, or users_per_region, not avg_usr or num_usr. Unlike in mathematics, it’s good to have long, descriptive variable names.
Use built-ins when you can! Python already has functions for sorting, for building cartesian products of lists, for implementing various models (in statsmodels and scikit-learn), and endless other things. It also has some cool data structures already implemented, like the heap and queue. Get to know the itertools module; it has lots of useful stuff. If you can use these built-ins effectively, it demonstrates skill and knowledge without adding much effort on your part.
Break things into functions. If one step of your code is sorting a list, and you can’t use the built-in sorted() function, then write a separate function def sort() before you write your main function. This increases both readability and testability of code, and is essential for real-world software.
Write idiomatic Python. This is a bit less important, but make sure to iterate directly over iterables; don’t do for i in range(len(my_iterable)). Also, familiarize yourself with enumerate and zip and know how to use them. Know how to use list comprehensions, and be aware that you can do a similar thing for dictionaries, sets, and even arguments of functions - for example, you can do max(item for item in l if item % 2 == 0) to find the maximum even number in l. Know how to do string formatting using either .format() or f-strings in Python 3.¹
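A few of the idioms from the list above, sketched in one place. The function names and sample data are made up for illustration.

```python
def describe(names, ages):
    """Pair each name with its age (zip) and number the pairs (enumerate)."""
    return [
        f"{i}: {name} is {age}"
        for i, (name, age) in enumerate(zip(names, ages), start=1)
    ]


def max_even(numbers):
    """Largest even number, using a generator expression as the argument."""
    return max(n for n in numbers if n % 2 == 0)


def squares_by_number(numbers):
    """Dictionary comprehension: map each number to its square."""
    return {n: n * n for n in numbers}


print(describe(["ada", "alan"], [36, 41]))
print(max_even([3, 8, 5, 12, 7]))
print(squares_by_number(range(4)))
```

Note that max_even raises a ValueError if there are no even numbers at all, which is exactly the kind of edge case worth raising with your interviewer.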

I’m only scratching the surface of how to write good code. It helps to read code that others have written to see what you don’t know. You can also look at code in large open-source libraries.

With all that said, let’s move on to some of the content that might be asked about in these interviews.

Working with Data

One of the fundamental tasks of a data scientist is to load, manipulate, clean, and visualize data in various formats. I’ll go through some of the basic tasks that I think you should be able to do, and either include or link to Python implementations. If you work in R, or any other language, you should make sure that you can still do these things in your preferred language.

In Python, the key technologies are the packages pandas (for loading, cleaning, and manipulating data), numpy (for efficiently working with unlabeled numeric data), and matplotlib (for plotting and visualizing data).

Loading & Cleaning Data

This tutorial on DataCamp nicely deals with the basics of using pd.read_csv() to load data into Pandas. It is also possible to load from other formats, but in my experience reading from and writing to comma- or tab-separated plaintext is by far the most common approach for datasets that fit in memory.²

For example, suppose you had the following data in a csv file:

name,age,country,favorite color
steve,7,US,green
jennifer,14,UK,blue
franklin,,UK,black
calvin,22,US,

You can copy and paste this into Notepad or whatever text editor you like³ and save it as data.csv.

You should be able to

Load in data from text, whether it is separated by commas, tabs, or some other arbitrary character (sometimes things are separated by the “pipe” character |). In this case, you can just do df = pd.read_csv('data.csv') to load it.
Filter for missing data. If you wanted to find the row(s) where the age is missing, for example, you could do df[df['age'].isnull()]
Filter for data values. For example, to find people from the US, do df[df['country'] == 'US']
Replace missing data; use df.fillna(0) to replace missing data with zeros. Think for yourself about how you would want to handle missing data in this case - does it make sense to replace everything with zeros? What would make sense?

Dealing with missing data is, in particular, an important problem, and not one that has an easy answer. Towards Data Science has a decent post on this subject, but if you’re curious, there’s a lot to read about and learn here.

More advanced topics in pandas-fu include using groupby, joining dataframes (this is called a “merge” in pandas, but works the same as a SQL join), and reshaping data.
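As a rough sketch of two of these, here is a groupby and a merge on the small dataset from earlier. To keep the snippet self-contained it reads the CSV text from a string rather than from data.csv, and the continents table is invented purely for illustration.

```python
import io

import pandas as pd

csv_text = """name,age,country,favorite color
steve,7,US,green
jennifer,14,UK,blue
franklin,,UK,black
calvin,22,US,
"""
df = pd.read_csv(io.StringIO(csv_text))

# groupby: mean age per country (missing ages are skipped by default)
mean_age = df.groupby("country")["age"].mean()
print(mean_age)

# merge: attach a column from a second, made-up table, like a SQL LEFT JOIN
continents = pd.DataFrame(
    {"country": ["US", "UK"], "continent": ["North America", "Europe"]}
)
merged = df.merge(continents, on="country", how="left")
print(merged)
```

Using how="left" keeps every row of df even if a country were missing from the second table, which is usually what you want when enriching a dataset.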

As I said before, loading and manipulating data is one of the fundamental tasks of a data scientist. You should probably be comfortable doing most or all of these tasks if asked. Pandas can be a bit unintuitive, so I’d recommend practicing if you aren’t already comfortable with it. Doing slicing and reshaping tasks in numpy is also an important skill, so make sure you are comfortable with that as well.

Visualization

Another essential aspect of data work is visualization. Of course, this is an entire field unto itself; here, I’ll mostly be focusing on the practical aspects of making simple plots. If you want to start to learn more about the overarching principles of the visual representation of data, Tufte’s book is the classic in the field.

In Python, the fundamental tool used for data visualization is the library matplotlib. There exist many other libraries for more complicated visualization tasks, such as seaborn, bokeh, and plotly, but the only one that you really need to be comfortable with (in my opinion) is matplotlib.

You should be comfortable with:

plotting two lists against one another
changing the labels on the x- and y-axis of your plot, and adding a title
changing the x- and y-limits of your plot
plotting a bar graph
plotting a histogram
plotting two curves together, labelling them, and adding a legend

I won’t go through the details here - I’m sure you can find many good guides to each of these online. The matplotlib pyplot tutorial is a good place to start.⁴

It’s worth noting that you can plot directly from pandas, by doing df.plot(). This just calls out to matplotlib and plots your dataframe; I will often find myself both plotting from the pandas DataFrame.plot() method as well as directly using pyplot.plot(). They work on the same objects, and so you can use them together to make more complicated plots with multiple values plotted.

Data Structures & Algorithms

Designing and building effective software is predicated on a solid understanding of the basic data structures that are available, and familiarity with the ways that they are employed in common algorithms. For me, learning this material opened up the world of software engineering - it illuminated the inner workings of computer languages. It also helped me understand the pros and cons of various approaches to problems, in ways that I wouldn’t have been able to before.

This subject is fundamental to software engineering interviews, but for data scientists, its importance can vary drastically from role to role. For engineering-heavy roles, this material can make up half or more of the interview, while for more statistician-oriented roles, it might only be very lightly touched upon. You will have to use your judgement to determine to what extent this material is important to you.

I learned this material when I was interviewing by reading the book Data Structures and Algorithms in Python.⁵ It’s really a great book - it has good, clear explanations of all the important topics, including complexity analysis and some of the basics of the Python language. I can’t recommend it highly enough if you want to get more familiar with this material.⁶ You can buy it, or look around online for the PDF - it shouldn’t be too hard to find.

Time and Space Complexity Analysis

Before you begin writing algorithms, you need to know how to analyze their complexity. The “complexity” of an algorithm tells you how the amount of time (or space) that the algorithm takes depends on the size of the input data.

It is formalized using the so-called “big-O” notation. The precise mathematical definition of \(\mathcal{O}(n)\) is somewhat confusing, so you can just think of it roughly as meaning that an algorithm that is \(\mathcal{O}(n)\) “scales like \(n\)”; so, if you double the input size, you double the amount of time it takes. If an algorithm is \(\mathcal{O}(n^3)\), then, doubling the input size means that you multiply the time it takes by \(2^3 = 8\).⁷ You can see how even a \(\mathcal{O}(n^2)\) algorithm wouldn’t work for large data; even if it runs in a reasonable amount of time (say, 5 seconds) for 10,000 points, it would take about 1,600 years to run on 1 billion data points. Obviously, this is no good.
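The back-of-the-envelope scaling argument can be written out as a small calculation. This is only a sketch: it assumes the runtime scales exactly like \(n\) raised to the given power, with no constant overheads, and the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def scaled_runtime(base_seconds, base_n, new_n, exponent):
    """Extrapolate a runtime, assuming time grows like n ** exponent."""
    return base_seconds * (new_n / base_n) ** exponent


# 5 seconds on 10,000 points, extrapolated to 1 billion points
seconds = scaled_runtime(5, 10_000, 1_000_000_000, exponent=2)
years = seconds / SECONDS_PER_YEAR
print(f"{seconds:.1e} seconds, or roughly {years:,.0f} years")
```

Changing exponent to 1 in the same call shows how much more forgiving a linear algorithm is on the same data.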

So complexity analysis is critical. You don’t want to settle for a \(\mathcal{O}(n^2)\) solution when a \(\mathcal{O}(n)\) or \(\mathcal{O}(n \log n)\) solution is available. I won’t get into how to do the analysis here, besides saying that I often like to annotate my loops with their complexity when I’m writing things. For example, here’s a (slow) approach to finding the largest k (unique) numbers in a list:

def get_top_k(k, input_list):
    top_k = []
    for _ in range(k):  # happens k times
        remaining = [num for num in input_list if num not in top_k]  # O(n)
        if remaining:
            top_remaining = max(remaining)  # O(n)
            top_k.append(top_remaining)  # O(1)
    return top_k

I know that the outer loop happens k times, and since finding the maximum of a list is \(\mathcal{O}(n)\), the total task is \(\mathcal{O}(nk)\).⁸ To learn more about how to do complexity analysis, I’d look at DS&A, Cracking the Coding Interview, or just look around online - I’m sure there are plenty of good resources out there.
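For comparison, here is one possible faster approach, in the spirit of using built-ins. It is a sketch rather than the only way to do it: it removes duplicates with a set and then uses heapq.nlargest from the standard library, which keeps a heap of at most k items and so runs in roughly \(\mathcal{O}(n \log k)\). The name get_top_k_fast is made up here.

```python
import heapq


def get_top_k_fast(k, input_list):
    """Return the k largest unique numbers, largest first."""
    unique_numbers = set(input_list)  # O(n) to build
    return heapq.nlargest(k, unique_numbers)  # roughly O(n log k)


print(get_top_k_fast(3, [5, 1, 5, 9, 3, 9, 7]))
```

If k is larger than the number of unique values, this simply returns all of them in descending order.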

You can also consider not just the time of computation, but the amount of memory (space) that your algorithm uses. This is not quite as common as time-complexity analysis, but is still important to be able to do.

A very useful resource for anyone studying for a coding interview is the big-O cheat sheet, which shows the complexity of access, search, insertion, and deletion for various data types, as well as the complexity of searching algorithms, and a lot more. I often use it as a reference, but of course it’s important that you understand why (for example) an array has \(\mathcal{O}(n)\) insertion. Just memorizing complexities won’t help you much.

Arrays & Hashmaps

In my opinion, the two essential data structures for a data scientist to know are the array and the hashmap. In Python, the list type is an array, while the dict type is a hashmap. Since both are used so commonly, you have to know their properties if you want to be able to design efficient algorithms and do your complexity analysis correctly.

Arrays are a data type where a piece of data (like a string) is linked to an index (in Python, this is an integer, starting with 0). I won’t go too deep into the details here, but for arrays, the important thing to know is that getting any element of an array is easy (i.e. doing mylist[5] is \(\mathcal{O}(1)\), so it doesn’t depend on the size of the array) but adding elements (particularly in the beginning or middle of the array) is difficult; doing mylist.insert(k, 'foo') is \(\mathcal{O}(n-k)\), where \(k\) is the position you wish to insert at.⁹

Arrays are what we usually use when we’re building unordered, unlabelled collections of objects in Python. This is fine, since insertion at the end of an array is fast, and we’re often accessing slices of arrays in a complicated fashion (particularly in numpy). I generally use arrays by default, without thinking too much about it, and it generally works out alright.

Hashmaps also link values to keys, but in this case the key can be anything you want, rather than having to be an ordered set of integers. In Python, you build them by specifying the key and the value, like {'key': 'value'}. Hashmaps are magical in that accessing elements and adding elements are both \(\mathcal{O}(1)\).¹⁰ Why is this cool? Well, say you wanted to store a bunch of people’s names and ages. You might think to do a list of tuples:

name_ages = [('Peter', 12), ('Kat', 25), ('Jeff', 41)]

Then, if you wanted to find out Jeff’s age, you would have to iterate through the list and find the correct tuple:

for name, age in name_ages:  # happens n times
    if name == 'Jeff':
        print(f"Jeff's age is {age}")

This is \(\mathcal{O}(n)\) - not very efficient. With hashmaps, you can just do

name_ages = {'Peter': 12, 'Kat': 25, 'Jeff': 41}
print(f"Jeff's age is {name_ages['Jeff']}")  # O(1)! Wow!

It might not be obvious how cool this is until you see how to use it in problems. Cracking the Coding Interview has lots of good problems on hashmaps, but I’ll just reproduce some of the classics here. I think it’s worth knowing these, because they really can give you an intuitive sense of when and how hashmaps are valuable.

The first classic hashmap algorithm is counting frequencies of items in a list. That is, given a list, you want to know how many times each item appears. You can do this via the following:

def get_freqs(l):
    freqs = {}
    for item in l:  # happens O(n) times
        if item not in freqs:  # This check is O(1)! Wow!
            freqs[item] = 1
        else:
            freqs[item] += 1  # Also O(1)! Wow!
    return freqs

Try and think of how you’d do this without hashmaps. Probably, you’d sort the list, and then look at adjacent values. But sorting is, at best, \(\mathcal{O}(n \log n)\). This solution does it in \(\mathcal{O}(n)\)!
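In the spirit of using built-ins when you can, the standard library already ships this pattern as collections.Counter, which is a dict subclass. A minimal sketch (the wrapper name get_freqs_counter is made up for illustration):

```python
from collections import Counter


def get_freqs_counter(items):
    """Count how many times each item appears, in O(n)."""
    return Counter(items)


freqs = get_freqs_counter(["a", "b", "a", "c", "a", "b"])
print(freqs)
print(freqs.most_common(1))  # the single most frequent item and its count
```

In an interview it is worth being able to write the explicit loop and also to mention that Counter exists.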

Another classic problem that is solved with hashmaps is to find all repeated elements in a list. This is really just a variant of the last, where you look for elements that have frequency greater than 1.

def get_repeated(l):
    f = get_freqs(l)
    return [item for item in f if f[item] > 1]

Now, if you only need one repeated element, you can be efficient and just terminate on the first one you find. For this, we’ll use a set, which is just a dict with values of None. That is to say, sets are also hashmaps. The important thing to know is that adding to them and checking if something is in them are both \(\mathcal{O}(1)\).

def get_first_repeated(l):
    items = set()
    for item in l:  # happens O(n) times
        if item not in items:  # This check is O(1)! Wow!
            items.add(item)
        else:
            return item
    return None  # if this happens, all elements are unique

The last one we’ll do is a bit trickier. You’re given a list of numbers, and a “target”, and your task is to find a pair of numbers in the list that add up to the target. Try and think for yourself how you’d do this - the fact you use hashmaps is a big hint. You should be able to do it in \(\mathcal{O}(n)\).

Have you thought about it? When I first encountered this one I had to look up the answer. But here’s how you do it in \(\mathcal{O}(n)\):

def get_sum_pair(l, target):
    nums_set = set()
    for num in l:
        other_num = target-num
        if other_num in nums_set: 
            return (num, other_num)
        nums_set.add(num)   # no-op if num is already there
    return None

Note that other_num = target-num is the number that you would need to complete the sum pair; using a hashmap, you can check in \(\mathcal{O}(1)\) if you’ve already seen it! Wow!

Hopefully you get it - hashmaps are cool. Go on LeetCode, or pop open your favorite data structures book, or even Cracking the Coding Interview, and get some practice with them.

Sorting & Searching

Sorting and searching are two of the basic tasks you have to be familiar with for any coding interview. You can go into a lot of depth with these, but I’ll stick to the basics here, because that’s what I find most helpful.

Sorting

Sorting is a nice problem in that the statement of the problem is fairly straightforward; given a list of numbers, reorder the list so that every element is less than or equal to the next. There are a number of approaches to sorting. The naive approach is called insertion sort; for example, it is what most people do when sorting a hand of cards. It has some advantages, but is \(\mathcal{O}(n^2)\) in time, and so is not the most efficient available.

The two most common fast sorting algorithms are quicksort and mergesort. They are both \(\mathcal{O}(n \log n)\) in time,¹¹ and so scale close-to-linearly with the size of the list. I won’t go into the implementation details here; there are plenty of good discussions of them available on the internet.

When thinking about sorting, it’s also worth considering space complexity - can you sort without needing to carry around a second sorted copy of the list? If so, that’s a significant advantage, especially for larger lists. It’s also worth thinking about worst-case vs. average performance - how does the algorithm perform on a randomly shuffled list, and how does it perform on a list specifically designed to take the maximum number of steps for that algorithm to sort? Quicksort, for example, is actually \(\mathcal{O}(n^2)\) in the worst case, but is \(\mathcal{O}(n \log n)\) on average. Again, you can look to the big-O cheat sheet to make sure you’re remembering all your complexities correctly.
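If you want something concrete to practice against, here is a minimal merge sort sketch. It is one of many possible ways to write it, and it deliberately favors readability: it returns a new list rather than sorting in place, so it uses \(\mathcal{O}(n)\) extra space.

```python
def merge_sort(items):
    """Return a new sorted list; O(n log n) time, O(n) extra space."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves by repeatedly taking the smaller head
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])  # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1, 5, 6]))
```

Using <= when comparing the heads keeps equal elements in their original order, which makes the sort stable.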

Searching

The problem of searching is often stated as given a sorted list l and an object x, find the index at which an element x lives. (You should immediately ask: What should I return if x is not in l?) The name of the game here is binary search. You basically split the list, then if the number is greater than the split, search the top; otherwise, search the bottom. This is an example of a recursive algorithm, so the way it’s written can be a bit opaque to those not used to looking at recursive code. Once I can wrap my head around it, I find it quite elegant. The important thing to know is that this search is \(\mathcal{O}(\log n)\), which means that you don’t touch every element in the list - it’s very fast, even for a large list. The key to this is that the list is already sorted - if it’s not sorted, then you’re out of luck; you’ve got to check every element to find x.

There are tons of examples of binary search in Python online, so I won’t put one here. That said, I have found it interesting to see how thinking in terms of binary search can help you in a variety of areas.

For example, suppose you had some eggs, and worked in a 40-story building, and wanted to know the highest floor you could drop the egg off of without it breaking (it’s kind of a dumb example cause the egg would probably break even on the first floor, but pretend it’s a super-tough egg.) You could drop it from the first floor, and see what happens. Say it doesn’t break. Then drop it from the 40th, and see what happens. Say it does break. Then, you bisect and use the midpoint - drop from the 20th floor. If it breaks here, you next try the 10th - if it doesn’t you next try the 30th. This allows you to find the correct floor much faster than trying each floor in succession.
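The egg experiment can be sketched as bisection over floors. This is an illustration under some assumptions: breaks_at is a hypothetical function standing in for actually dropping an egg, the egg is assumed to break on every floor at or above some threshold and survive below it, and you are assumed to have enough eggs to spare.

```python
def highest_safe_floor(breaks_at, num_floors=40):
    """Highest floor the egg survives a drop from (0 if it always breaks)."""
    safe = 0                   # treat the ground as known-safe
    broken = num_floors + 1    # treat "above the roof" as known-breaking
    while broken - safe > 1:
        mid = (safe + broken) // 2
        if breaks_at(mid):
            broken = mid       # answer is below mid
        else:
            safe = mid         # answer is mid or above
    return safe


drops = []


def breaks_at(floor):
    """Hypothetical egg that breaks from floor 27 upward; records each drop."""
    drops.append(floor)
    return floor >= 27


print(highest_safe_floor(breaks_at), "found after", len(drops), "drops")
```

Each drop halves the range of candidate floors, so 40 floors need at most 6 drops instead of up to 40.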

Sorting and searching are fundamental algorithms, and have been well studied for decades. Having a basic fluency in them shows a familiarity with the field of computer science that many employers like to see. In my opinion, you should be able to quickly and easily implement the three sorting algorithms above, and binary search, in Python, or whatever your language of choice is.

Working with SQL

Finally, let’s talk a bit about SQL. SQL is a tool used to interact with so-called “relational” databases, which just means that each row in a table has certain values (columns), and that those values have the same type for each row (that is, the schema is uniform throughout the table).¹² It is not exactly a language; it’s more like a family of languages. There are many “dialects” which all have slight differences, but they behave the same with regard to core functionality; for example, you can do

SELECT column FROM table WHERE column = 'value'

in any SQL-like language.¹³ Modern data-storage and -access solutions like Spark and Presto are very different from older databases in their underlying architecture, but still use a SQL dialect for accessing data.

Solving problems in SQL involves thinking in a quite different way than solving a similar problem on an array in Python. There is no real notion of iteration, or at least it’s not easily accessible, so most of the complicated action happens via table joins. I used SQLZoo, and particularly the “assessments”, to practice my SQL and get it up to snuff. LeetCode also has a SQL section (I think they call it “database”).
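To make the “joins instead of loops” point concrete, here is a small sketch using Python’s built-in sqlite3 module. The users and orders tables and their contents are made up for illustration; the query totals each user’s orders with a join and a GROUP BY, where array-style code would loop.

```python
import sqlite3

# In-memory database with two small, invented tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'ana'), (2, 'bo');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Total order amount per user: the join pairs each order with its user,
# and GROUP BY does the work a for-loop with a running sum would do.
rows = conn.execute("""
    SELECT u.name, SUM(o.amount) AS total
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()
print(rows)  # [('ana', 15.0), ('bo', 7.5)]
conn.close()
```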

It’s essential to know SQL as a working data scientist. You’ll almost certainly use it in your day-to-day activities. That said, it’s not always asked in the interviews, so you might clarify with the company whether they will ask you SQL questions.

A Note on Dialects

There are many dialects of SQL, and changing the dialect changes things like (for example) how you work with dates. It’s worth asking the company you’re interviewing with what dialect they want you to know, if they have one in mind. If you’re just writing SQL on a whiteboard, then I would be surprised if they were picky about this; I would just say something like “here I’d use DATE(table.dt_str) or whatever the string-to-date conversion function is in your dialect”. In this case it’s just details that move around, but the big picture is generally the same for different dialects.

Conclusion

Coding interviews are stressful. From what I can tell, that’s just the way it is. For me, the best antidote to that is being well-prepared. I think companies are moving more towards constructive, cooperative interview formats, and away from the classic Google brain-teaser kind of questions, which helps with this, but you can still expect to be challenged during these interviews.

Remember to be kind to yourself. You’ll probably fail many times before you succeed. That’s fine, and is what happens to almost everyone. Just keep practicing, and keep learning from your mistakes. Good luck!

You should be using Python 3 at this point, but also be familiar with the differences between 2 and 3, and be able to write code in Python 2 if need be. ↩
For “big data” stored in the cloud, an efficient format called Parquet is the standard. In my experience, however, it’s uncommon to work with parquet files directly in Pandas; you often read them into a distributed framework like Spark and work with them in that context. ↩
The correct answer is, of course, emacs. ↩
pyplot is an API within matplotlib that was designed in order to mimic the MATLAB plotting API. It is generally what I use; I begin most of my matplotlib work with from matplotlib import pyplot as plt. I only rarely need to import matplotlib directly, and that’s generally for configuration work. ↩
I read the book when preparing for a software engineer interview at Google, so I picked up a lot more than was necessary for a data science interview. I still find the material helpful, however, and it’s nice to be able to demonstrate that you have gone above and beyond in a realm that data scientists sometimes neglect (efficient software design). ↩
It goes well beyond what you’ll need for a data science interview, however - it gets into tree structures, graphs (and graph traversal algorithms), and other more advanced topics. I’d recommend focusing on complexity analysis, arrays, and hashmaps as the most important data structures that a data scientist will use day-to-day. ↩
This is only approximately true, or rather it is asymptotically true; this scaling law holds in the limit as \(n\rightarrow\infty\). ↩
It’s a bit weird to use both \(n\) and \(k\) in your complexity - mathematically, what this means is that we consider them separate variables, and we can take the limit of either one independently from the other. If, for example, you knew that \(k = n/4\), so you always wanted the top quarter of the list, then this would be \(\mathcal{O}(n^2)\), since \(n/4 = \mathcal{O}(n)\). ↩
I’m glossing over some details here - the numbers I quote above are for a fixed-size array. So, if you build up an array by adding elements at the end, it may seem like you get to just do a bunch of \(\mathcal{O}(1)\) .appends, but in reality, you have to occasionally resize the array to make more space, and each resize costs \(\mathcal{O}(n)\). Because the array typically grows geometrically, resizes are rare enough that the average (amortized) append time is still \(\mathcal{O}(1)\), but an individual append can be slow. If you want a list-like type where inserting elements is easy (\(\mathcal{O}(1)\)) but accessing elements is difficult (\(\mathcal{O}(n)\)), then you want a linked list. Linked lists aren’t as important for data scientists to use, so I won’t get into them much here. ↩
You might wonder why we would ever use an array over a hashmap if hashmaps are strictly superior with respect to their complexity. It’s a good question. The answer is that arrays take up less space (they don’t have to store the keys, only the values) and they are much easier to work with in code (they look cleaner, and are more intuitive for ordered data). Furthermore, if you had a hashmap that linked integers 0 through 10 to strings, and you wanted to insert a new element at key 5, then you’d have to go through what is currently at keys 5 through 10, and increment their keys by one, so you would end up back at an inefficient insertion algorithm like you have with arrays. ↩
This is true on average; see the section below for a discussion of average vs. worst-case complexity. ↩
Non-relational (“NoSQL”) databases, like HBase, basically function like giant hashmaps; they have a single “key”, and then the “value” can contain arbitrary data - you don’t have to have certain columns in there. The advantage of this is flexibility, but the disadvantage is that sorting and filtering are slower because the database doesn’t have a pre-defined schema. ↩
Technically, SQL is an ANSI Standard that many different dialects implement - so, to call yourself a SQL dialect, you must have features defined by this standard, like the SELECT, FROM, and WHERE clauses shown above. ↩

DS Interview Study Guide Part I: Statistics

2019-08-24T00:00:00+00:00

As I have gone through a couple rounds of interviews for data scientist positions, I’ve been compiling notes on what I consider to be the essential areas of knowledge. I want to make these notes available to the general public; although there are many blog posts out there that are supposed to help one prepare for data science interviews, I haven’t found any of them to be very high-quality.

From my perspective, there are four key subject areas that a data scientist should feel comfortable with when going into an interview:

Statistics (including experimental design)
Machine Learning
Software Engineering (including SQL)
“Soft” Questions

I’m going to go through each of these individually. This first post will focus on statistics. We will go over a number of topics in statistics in no particular order. Note that this post will not teach you statistics; it will remind you of what you should already know.

If you’re utterly unfamiliar with the concepts I’m mentioning, I’d recommend this excellent MIT course on probability & statistics as a good starting point. When I began interviewing, I had never taken a statistics class before; I worked through the notes, homeworks, and exams for this course, and at the end had a solid foundation to learn the specific things that you need to know for these interviews. In my studying, I also frequently use Cross Validated, a website for asking and answering questions about statistics. It’s good for in-depth discussions of subtle issues in statistics. Finally, Gelman’s book is the classic in Bayesian inference. If you have recommendations for good books that cover frequentist statistics in a clear manner, I’d love to hear them.

These are the notes that I put together in my studying, and I’m sure that there is plenty of room for additions and corrections. I hope to improve this guide over time; please let me know in the comments if there’s something you think should be added, removed, or changed!

The Central Limit Theorem

The Central Limit Theorem is a fundamental tool in statistical analysis. It states (roughly) that when you add up a bunch of independent and identically distributed random variables (with finite variance) then their sum will converge to a Gaussian distribution.¹

How is this idea useful to a data scientist? Well, one place where we see a sum of random variables is in a sample mean. One consequence of the central limit theorem is that the sample mean of a variable with mean \(\mu\) and variance \(\sigma^2\) will itself be approximately normally distributed, with mean \(\mu\) and variance \(\sigma^2/n\), where \(n\) is the number of samples.

I’d like to point out that this is pretty surprising. The distribution of the sum of two random variables is not, in general, trivial to calculate. So it’s kind of awesome that, if we’re adding up a large enough number of (independent and identically distributed) random variables, then we do, in fact, have a very easy expression for the (approximate) distribution of the sum. Even better, we don’t need to know much of anything about the distribution we’re sampling from, besides its mean and variance - its other moments, or general shape, don’t matter for the CLT.

As we will see below, the simplification that the CLT introduces is the basis of one of the fundamental hypothesis tests that data scientists perform: testing equality of sample means. For now, let’s work through an example of the theorem itself.

An Example

Suppose that we are sampling a Bernoulli random variable. This is a 0/1 random variable that is 1 with probability \(p\) and 0 with probability \(1-p\). If we get the sequence of ten draws \([0,1,1,0,0,0,1,0,1,0]\), then our sample mean is

\[\hat \mu = \frac{1}{10}\sum_{i=1}^{10} x_i = 0.4\]

Of course, this sample mean is itself a random variable - when we report it, we would like to report an estimate of its variance as well. The central limit theorem tells us that its distribution will, as \(n\) increases, converge to a Gaussian distribution. Since the mean of the Bernoulli random variable is \(p\) and its variance is \(p(1-p)\), we know that the distribution of the sample mean will converge to a Gaussian with mean \(p\) and variance \(p(1-p)/n\). So we could say that our estimate of the parameter \(p\) is 0.4 \(\pm\) 0.155, where 0.155 is one standard deviation, \(\sqrt{0.4 \cdot 0.6/10}\). Of course, we’re playing a bit loose here, since we’re using the estimate \(\hat p\) from the data, as we don’t actually know the true parameter \(p\).

Now, a sample size of \(n=10\) is a bit small to be relying on a “large-\(n\)” result like the CLT. Actually, in this case, we know the exact distribution of the sample mean, since \(\sum_i x_i\) is binomially distributed with parameters \(p\) and \(n\).
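The numbers in this example are easy to check. Here is a minimal sketch, assuming NumPy and SciPy are available; the variable names are mine, and the binom.interval call is only there to illustrate the exact binomial alternative, using the plug-in value \(\hat p\) in place of the unknown \(p\).

```python
import numpy as np
from scipy.stats import binom

draws = np.array([0, 1, 1, 0, 0, 0, 1, 0, 1, 0])
n = len(draws)
p_hat = draws.mean()  # sample mean of the ten draws

# CLT-based standard deviation of the sample mean, with p_hat plugged in for p.
std_err = np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat, round(std_err, 3))

# Exact alternative: the number of ones is Binomial(n, p). This gives a
# central range covering at least 68% of the mass for the count, which we
# rescale to the sample mean; p_hat again stands in for the true p.
low, high = binom.interval(0.68, n, p_hat)
print(low / n, high / n)
```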

Hypothesis Testing

Hypothesis testing (also known by the more verbose “null hypothesis significance testing”) is a huge subject, both in scope and importance. We use statistics to quantitatively answer questions based on data, and (for better or for worse) null hypothesis significance testing is one of the primary methods by which we construct these answers.

I won’t cover the background of NHST here. It’s well-covered in the MIT course; look at the readings to find the relevant sections. Instead of covering the background, we’ll work through one example of a hypothesis test. It’s simple, but it comes up all the time in practice, so it’s essential to know. I might go so far as to say that this is the fundamental example of hypothesis testing in data science.

An Example

Suppose we have two buttons, one green and one blue. We put them in front of two different samples of users. For simplicity, let’s say that each sample has size \(n=100\). We observe that \(k_\text{green} = 57\) users click the green button, and only \(k_\text{blue} = 48\) click the blue button.

Seems like the green button is better, right? Well, we want to be able to say how confident we are of this fact. We’ll do this in the language of null hypothesis significance testing. As you should (hopefully) know, in order to do NHST, we need a null hypothesis and a test statistic; we need to know the test statistic’s distribution (under the null hypothesis); and we need to know the probability of observing a value “at least as extreme” as the observed value according to this distribution.

I’m going to lay out a table of all the important factors here, and then discuss how we use them to arrive at our \(p\)-value.

Description	Value
Null Hypothesis	\(p_{blue} - p_{green} = 0\)
Alternative Hypothesis	\(p_{blue} - p_{green} < 0\)
Test Statistic	\(\frac{k_\text{blue}}{n} - \frac{k_\text{green}}{n}\)
Test Statistic’s Distribution	\(N(0, (p_b(1-p_b) + p_g(1-p_g)) / n)\)
Test Statistic’s Observed Value	-0.09
\(p\)-value	0.1003

There are a few noteworthy things here. First, what we really want to know is whether \(p_g > p_b\), which is equivalent to \(p_b-p_g < 0\); the null hypothesis we test against is that there is no difference, \(p_b - p_g = 0\). Second, we assume that \(n\) is large enough so that \(k/n\) is approximately normally distributed, with mean \(\mu = p\) and variance \(\sigma^2 = p(1-p)/n\). Third, since the difference of two independent normals is itself a normal, the test statistic’s distribution is (under the null hypothesis) a normal with mean zero and the variance given (which is the sum of the two variances of \(k_b/n\) and \(k_g/n\)).

Finally, we don’t actually know \(p_b\) or \(p_g\), so we can’t really compute the \(p\)-value; what we do is we say that \(k_b/n\) is “close enough” to \(p_b\) (and likewise \(k_g/n\) to \(p_g\)) and use it as an approximation. That gives us our final \(p\)-value.

The \(p\)-value was calculated in Python, as follows:

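import numpy as np
from scipy.stats import norm
pb = 0.48  # observed click rate for the blue button
pg = 0.57  # observed click rate for the green button
n = 100    # users in each sample
# standard deviation of the difference in sample proportions
sigma = np.sqrt((pb*(1-pb) + pg*(1-pg))/n)
# probability of a difference of -0.09 or less under the null
norm.cdf(-0.09, loc = 0, scale = sigma) # 0.10034431272089045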

Calculating the CDF of a normal at \(x=-0.09\) tells us the probability that the test statistic is less than or equal to \(-0.09\), which is to say the probability that our test statistic is at least as extreme as the observed value. This probability is precisely our \(p\)-value.

So what’s the conclusion? Well, often times a significance level is set before the test is performed; if the \(p\)-value is not below this threshold, then the null hypothesis is not rejected. Suppose we had set a significance level of 0.05 before the test began - then, with this data, we would not be able to reject the null hypothesis, which is that the buttons are equally appealing to users.

Phew! I went through that pretty quick, but if you can’t follow the gist of what I was doing there, I’d recommend you think through it until it is clear to you. You will be faced with more complicated situations in practice; it’s important that you begin by understanding the most simple situation inside out.

Confidence Intervals

Confidence intervals allow us to state a statistical result as a range, rather than a single value. If we count that 150 out of 400 people sampled randomly from a city identify themselves as male, then our best estimate of the fraction of women in the city is 250/400, or 5/8. But we only looked at 400 people, so it’s reasonable to expect that the true value might be a bit more or less than 5/8. Confidence intervals allow us to quantify this width in a statistically rigorous way.

As per usual, we won’t actually introduce the concepts here - I’ll refer you to the readings from the MIT course for an introduction. We’ll focus on working through an example, and looking at some different approaches.

The Exact Method

Suppose that we want to find a 95% confidence interval on the female fraction in the city discussed above. This corresponds to a significance level of \(\alpha = 0.05\). One way to get the exact confidence interval is to use the CDF of our test statistic, but substitute in the observed parameter for the true parameter, and then invert it to find where it hits \(\alpha/2\) and \(1-\alpha/2\). That is, we need to find the value \(p_l\) that solves the equation

\[CDF\left(n, p_l\right) = \alpha/2\]

and the value \(p_u\) that solves the equation

\[CDF\left(n, p_u\right) = 1 - \alpha/2.\]

In these, \(CDF(n,p)\) is the cumulative distribution function of our test statistic, assuming that the true value of \(p\) is in fact the observed value \(\hat p\). This is a bit confusing, so it’s worth clarifying. In our case, the sample statistic is the sample mean of \(n\) Bernoulli random variables, so this CDF is the CDF of the sample mean of \(n\) Bernoulli random variables with parameter \(5/8\). Solving the two equations above would give us our confidence interval \([p_l, p_u]\).

It took me a bit of work to see that solving the above two equations would in fact give us bounds that satisfy the definitions of a \(1-\alpha\) confidence interval, which says that, were we to run many experiments, we would find that the true value of \(p\) would fall between \(p_l\) and \(p_u\) with the probability

\[P\left(p_l\leq p \leq p_u\right) = 1-\alpha.\]

If you’re into this sort of thing, I’d suggest you take some time thinking through why inverting the CDF as above guarantees bounds \([p_l, p_u]\) that solve the above equation.

Although it is useful for theoretical analysis, I rarely use this method in practice, because I often do not actually know the true CDF of the statistic I am measuring. Sometimes I do know the true CDF, but even in such cases, the next (approximate) method is generally sufficient.
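For the city example, here is a rough sketch of the plug-in inversion described above, assuming SciPy is available. It reads “invert the CDF” as taking the \(\alpha/2\) and \(1-\alpha/2\) quantiles of a binomial with \(p\) fixed at \(\hat p\); that is my interpretation of the recipe, not the only way to build an exact binomial interval.

```python
from scipy.stats import binom

n, k = 400, 250      # sample size and number of women observed
p_hat = k / n
alpha = 0.05

# Quantiles of the count under Binomial(n, p_hat), rescaled to fractions.
# binom.ppf is the inverse of the binomial CDF.
p_l = binom.ppf(alpha / 2, n, p_hat) / n
p_u = binom.ppf(1 - alpha / 2, n, p_hat) / n
print(p_l, p_u)
```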

The Approximate Method

If your statistic can be phrased as a sum, then its distribution approaches a normal distribution.² This means that you can solve the above equations for a normal CDF rather than the true CDF of the sum (in the case above, a binomial CDF).

How does this help? For a normal distribution, the solutions for the above equations to find lower and upper bounds are well known. In particular, the interval \([\mu-\sigma,\mu+\sigma]\), also called a \(1\sigma\)-interval, covers about 68% of the mass (probability) of the normal PDF, so if we wanted to find a confidence interval of level \(0.68\), then we know to use the bounds \((\overline x-\sigma, \overline x+\sigma)\), where \(\overline x\) is our estimate of the true mean \(\mu\).

This sort of result is very powerful, because it saves us from having to do any inversion by hand. A table below indicates the probability mass contained in various symmetric intervals on a normal distribution:

Interval	Width³	Coverage
\([\mu-\sigma,\mu+\sigma]\)	\(1\sigma\)	0.683
\([\mu-2\sigma,\mu+2\sigma]\)	\(2\sigma\)	0.954
\([\mu-3\sigma,\mu+3\sigma]\)	\(3\sigma\)	0.997

Let’s think through how we would use this in the above example, where we give a confidence interval on our estimate of the binomial parameter \(p\).

A binomial distribution has mean \(\mu=np\) and variance \(\sigma^2=np(1-p)\). Since the sample statistic \(\hat p\) is just the binomial divided by \(n\), it has mean \(\mu=p\) and variance \(\sigma^2 = p(1-p)/n\). The central limit theorem tells us that the distribution of \(\hat p\) will converge to a normal with just these parameters.

Suppose we want an (approximate) 95% confidence interval on the percentage of women in the population of our city; the table above tells us we can just do a two-sigma interval. (This is not exactly a 95% confidence interval; it’s a bit over, as we see in the table above). The estimator \(\hat p\) has mean \(\mu= p\) and variance \(\sigma^2 = p(1-p)/n\).⁴ In our case, \(\hat p=5/8\) and \(n=400\), so \(\sigma = \sqrt{(5/8)(3/8)/400} = \sqrt{15}/160 \approx 0.0242\), and our confidence interval is \(5/8 \pm 2\sigma \approx 0.625 \pm 0.048\). Note that we approximated \(p\) with our experimental value \(\hat p\); the theoretical framework that allows us to do this substitution is beyond the scope of this article, but is nicely covered in the MIT readings (Reading 22, in particular).
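The two-sigma interval and the coverage figures in the table can be computed directly. This is a minimal sketch assuming NumPy and SciPy; the last lines use the multiplier of about 1.96 that gives a 95% normal interval, which is my addition rather than something used in the text.

```python
import numpy as np
from scipy.stats import norm

n, k = 400, 250
p_hat = k / n
sigma = np.sqrt(p_hat * (1 - p_hat) / n)  # standard deviation of p_hat

# Two-sigma interval around the estimate.
print(p_hat - 2 * sigma, p_hat + 2 * sigma)

# Coverage of one-, two-, and three-sigma intervals under a normal distribution.
for width in (1, 2, 3):
    print(width, round(norm.cdf(width) - norm.cdf(-width), 3))

# Multiplier for a 95% normal interval (about 1.96 rather than 2).
z = norm.ppf(0.975)
print(p_hat - z * sigma, p_hat + z * sigma)
```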

The Bootstrap Method

The previous approach relies on the accuracy of approximating our statistic’s distribution by a normal distribution. Bootstrapping is a pragmatic, flexible approach to calculating confidence intervals, which makes no assumptions on the underlying statistics we are calculating. We’ll go into more detail on bootstrapping in general below, so we’ll be pretty brief here.

The basic idea is to repeatedly pull 400 samples with replacement from the sampled data. For each set of 400 samples, we get an estimate \(\hat p\), and thus can build an empirical distribution on \(\hat p\). Of course, the CLT indicates that this empirical distribution should look a lot like a Gaussian distribution with mean \(\mu= p\) and variance \(\sigma^2 = p(1-p)/n\).

Once you have bootstrapped an empirical distribution for your statistic of interest (in the example above, this is the percentage of the population that is women), then you can simply find the \(\alpha/2\) and \(1-\alpha/2\) percentiles, which then become your confidence interval. Although in this case our empirical distribution is (approximately) normal, it’s worth realizing that we can reasonably calculate percentiles regardless of what the empirical distribution is; this is why bootstrapping confidence intervals are so flexible.

As you’ll see below, the downside of bootstrapping confidence intervals is that it requires some computation. The amount of computation required can be anywhere from trivial to daunting, depending on how many samples you want in your empirical distribution. Another downside is that their statistical interpretation is not exactly in alignment with the definition of a confidence interval, but I’ll leave the consideration of that as an exercise for the reader.⁵ One of the MIT readings has an in-depth discussion of confidence intervals generated via the bootstrap method.

Overall, I would recommend using the approximate method when you have good reason to believe your sample statistic is approximately normal, or bootstrapping otherwise. Of course, the central limit theorem can provide some guarantees about the asymptotic distribution of certain statistics, so it’s worth thinking through whether that applies to your situation.

Bootstrapping

Bootstrapping is a technique that allows you to get insight into the quality of your estimates, based only on the data you have. It’s a key tool in a data scientist’s toolbag, because we frequently don’t have a clear theoretical understanding of our statistics, and yet we want to provide uncertainty estimates. To understand how it works, let’s look through an example.

In the last section, we sampled 400 people in an effort to understand what percentage of a city’s population identified as female. Since 250 of them identified themselves as female, our estimate of the ratio for the total population is \(5/8\). This estimate is itself a random variable; if we had sampled different people, we might have ended up with a different number. What if we want to know the distribution of this estimate? How would we go about getting that?

Well, the obvious way is to go out and sample 400 more people, and repeat this over and over again, until we have many such fractional estimates. But what if we don’t have access to sampling more people? The natural thing is to think that we’re out of luck - without the ability to sample further, we can’t actually understand more about the distribution of our parameter (ignoring, for the moment, that we have lots of theoretical knowledge about it via the CLT).

The idea behind bootstrapping is simple. Sample from the data you already have, with replacement, a new sample of 400 people. This will generally give you an estimate of the female fraction that is distinct from your original estimate, due to the replacement in your sampling. You can repeat this process as many times as you like; you will then get an empirical distribution which approximates the true distribution of the statistic.⁴

Bootstrapping has the advantage of being flexible, although it does have its limitations. Rather than get too far into the weeds, I’ll just point you to the Wikipedia article on bootstrapping. There are also tons of resources about this subject online. Try coding it up for yourself! By the time you’re interviewing, you should be able to write a bootstrapping algorithm quite easily.

Machine Learning Mastery has a good introduction to bootstrapping that uses the scikit-learn API. Towards Data Science codes it up directly in NumPy, which is a useful thing to know how to do. Asking someone to code up a bootstrapping function would be an entirely reasonable interview question, so it’s something you should be comfortable doing.
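For the city example, a bootstrap in raw NumPy might look like the sketch below. The function name, the 5,000 resamples, and the fixed seed are my choices for illustration; the last step takes the \(\alpha/2\) and \(1-\alpha/2\) percentiles to form a percentile confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# The observed sample: 250 ones (female) and 150 zeros out of 400 people.
sample = np.array([1] * 250 + [0] * 150)


def bootstrap_means(data, n_resamples, rng):
    """Resample `data` with replacement; return the mean of each resample."""
    n = len(data)
    # Each row of idx is one resample of n indices drawn with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    return data[idx].mean(axis=1)


estimates = bootstrap_means(sample, n_resamples=5000, rng=rng)

# Percentile confidence interval at level 1 - alpha.
alpha = 0.05
lower, upper = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(estimates.mean(), lower, upper)
```

The same function works for any statistic if you swap the .mean(axis=1) call for something else, which is where the flexibility comes from.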

Linear Regression

Regression is the study of the relationship between variables; for example, we might wish to know how the weight of a person relates to their height. Linear regression assumes that your input (height, or \(h\)) and output (weight, or \(w\)) variables are linearly related, with slope \(\beta_1\), intercept \(\beta_0\), and noise \(\epsilon\).

\[w = \beta_1\cdot h + \beta_0 + \epsilon.\]

A linear regression analysis helps the user discover the \(\beta\)s in the above equation. This is just the simplest application of LR; in reality, it is quite flexible and can be used in a number of scenarios.

Linear regression is another large topic that I can’t really do justice to in this article. Instead, I’ll just go through some of the common topics, and introduce the questions you should be able to address. As is the case with most of these topics, you can look at the MIT Statistics & Probability course for a solid academic introduction to the subject. You can also dig through the Wikipedia article to get a more in-depth picture. The subject is so huge, and there’s so much to learn about it, that you really can spend as much time as you want digging into it - I’m just going to gesture at some of the simpler aspects of it.

Calculating a Linear Regression

Rather than go through an example here, I’ll just refer you to the many available guides that show you how to do this in code. Of course, you could do it in raw NumPy, solving the normal equations explicitly, but I’d recommend using scikit-learn or statsmodels, as they have much nicer interfaces, and give you all sorts of additional information about your model (\(r^2\), \(p\)-value, etc.)

Real Python has a good guide to coding this up - see the section “Simple Linear Regression with scikit-learn.” GeeksForGeeks does the solution in raw NumPy; the equations won’t be meaningful for you until you read up on the normal equation and how to analytically solve for the optimal LR coefficients. If you want something similar in R, or Julia, or MATLAB,⁶ then I’m sure it’s out there, you’ll just have to go do some Googling to find it.
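If you want to see the raw-NumPy route in one place, here is a minimal sketch on synthetic height and weight data. The true slope, intercept, and noise level are invented for the example, and the design matrix includes a column of ones so that the intercept \(\beta_0\) is fit along with the slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heights (cm) and weights (kg); the coefficients are made up.
h = rng.uniform(150, 200, size=200)
w = 0.9 * h - 90 + rng.normal(0, 5, size=200)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(h), h])

# Normal equations: (X^T X) beta = X^T w.
beta = np.linalg.solve(X.T @ X, X.T @ w)
beta0, beta1 = beta
print(beta0, beta1)

# np.linalg.lstsq solves the same least-squares problem without forming
# X^T X explicitly, which is numerically safer.
beta_lstsq, *_ = np.linalg.lstsq(X, w, rcond=None)
```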

A Statistical View

This subject straddles the boundary between statistics and machine-learning. It has been quite thoroughly studied from a statistical point of view, and there are some important results that you should be familiar with when thinking about linear regression from a statistical frame.⁷

Let’s look back at our foundational model for linear regression. LR assumes that your input \(x\) and output \(y\) are related via

\[y_i = \beta_1\cdot x_i + \beta_0 + \epsilon_i,\]

where \(\epsilon_i\) are i.i.d., distributed as \(N(0, \sigma^2)\). Since the \(\epsilon_i\) are random variables, the estimated \(\beta_j\) are themselves random variables. One important question is whether there is, in fact, any relationship between our variables at all. If there is not, then we should expect \(\beta_1\) to be close to 0,⁸ but its estimate will almost never be exactly zero. One important statistical technique in LR is doing a hypothesis test against the null hypothesis that \(\beta_1 = 0\). When a package like statsmodels returns a “\(p\)-value of the regression”, this is the \(p\)-value they are talking about.
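As a quick illustration of this test, scipy.stats.linregress reports a two-sided \(p\)-value for the null hypothesis that the slope is zero. The data below are synthetic, with one output that really depends on \(x\) and one that is pure noise; the coefficients and seed are arbitrary choices of mine.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)

# Case 1: y really depends on x. Case 2: y is pure noise.
y_related = 2.0 * x + 1.0 + rng.normal(0, 3, size=100)
y_noise = rng.normal(0, 3, size=100)

for label, y in [("related", y_related), ("noise", y_noise)]:
    fit = linregress(x, y)
    # fit.pvalue is for the two-sided test of the null hypothesis slope = 0.
    print(label, fit.slope, fit.stderr, fit.pvalue)
```

In the first case the \(p\)-value should be tiny; in the second, the fitted slope will be small but nonzero, and the \(p\)-value tells you that this is the kind of slope you would expect to see from noise alone.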

Like I said before, there is a lot more to know about the statistics of linear regression than just what I’ve said here. You can learn more about the statistics of LR by looking at the MIT course notes on the subject, or by digging through your favorite undergraduate statistics book - most of them should have sections covering it.

Validating Your Model

Once you’ve calculated your LR, you’d like to validate it. This is very important to do - if you’re asked to calculate a linear regression in an interview, you should always go through the process of validating it after you’ve done the calculation.

I’d generally go through the following steps:

If it’s just a simple (one independent variable) linear regression, then plot the two variables. This should give you a good sense of whether it’s a good idea to use linear regression in the first place. If you have multiple independent variables, you can make separate plots for each one.
Look at your \(r^2\) value. Is it reasonably large? Remember, closer to 1 is better. If it’s small, then doing a linear regression hasn’t helped much.
You can look at the \(p\)-value to see if the slope’s difference from zero is statistically significant (see “A Statistical View” above). Also, you can have a very significant \(p\)-value while still having a low \(r^2\), so be cautious in your interpretation of this one.
You can also look at the RMSE of your model, but this number is not scaled between 0 and 1, so a “good” RMSE is highly dependent on the units of your dependent variable.
Plot your residuals, for each variable. The residual is just the observed output minus the value predicted by your model, a.k.a. the error of your model. Plotting each residual isn’t really feasible if you have hundreds of independent variables, but it’s a good idea if your data is small enough. You should be looking for “homoskedasticity” - that the variance of the error is uniform across the range of the independent variable. If it’s not, then certain things you’ve calculated (for example, the \(p\)-value of your regression) are no longer valid. You might also see that your errors have a bias that changes as the \(x_i\) changes; this means that there’s some more complicated relationship between \(y\) and \(x_i\) that your regression did not pick up.

Some of the questions below address the assumptions of linear regression; you should be familiar with them, and know how to test for them either before or after the regression is performed, so that you can be confident that your model is valid.
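The numerical parts of the checklist above can be sketched in a few lines of NumPy. The data are synthetic and the split-in-half comparison at the end is only a crude stand-in that I am using in place of a residual plot, which remains the better tool.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.5, size=200)  # invented data

# Fit a line; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept
residuals = y - y_pred  # observed minus predicted

# r^2: one minus (residual variance / original variance of y).
r2 = 1 - residuals.var() / y.var()
rmse = np.sqrt(np.mean(residuals ** 2))  # in the units of y
print(r2, rmse)

# Crude homoskedasticity check: the residual spread in the lower and upper
# halves of x should be similar if the error variance is constant.
order = np.argsort(x)
half = len(x) // 2
print(residuals[order[:half]].std(), residuals[order[half:]].std())
```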

Basic Questions on LR

Hopefully you’ve familiarized yourself with the basic ideas behind linear regression. Here are some conceptual questions you should be able to answer.

How are the \(\beta\)s calculated? Practically, you let the library you’re using take care of this. But behind the scenes, generally it’s solving the so-called “normal equations”, which give you the optimal (highest \(r^2\)) parameters possible. You can use gradient descent to approximate the optimal solution when the design matrix is too large to invert; this is available via the SGDRegressor model in scikit-learn.
How do you decide if you should use linear regression? The best case is when the data is 2- or 3-dimensional; then you can just plot the data and see if it looks like “linear plus noise”. However, if you have lots of independent variables, this isn’t really an option. In such a case, you should look perform a linear regression analysis, and then look at the errors to verify that they look normally distributed and homoskedastic (constant variance).
What does the \(r^2\) value of a regression indicate? The \(r^2\) value indicates “how much of the variance of the output data is explained by the regression.” That is, your output data \(y\) has some (sample) variance, just on its own. Once you discover the linear relationship and subtract it off, then the remaining error \(y - \beta_0 - \beta_1x\) still has some variance, but hopefully it’s lower - \(r^2\) is one minus the ratio of the remaining variance to the original variance. When \(r^2=1\), then your line is a perfect fit of the data, and there is no remaining error. It is often used to explain the “quality” of your fit, although this can be a bit treacherous - see Anscombe’s Quartet for examples of very different situations with the same \(r^2\) value.
What are the assumptions you make when doing a linear regression? The Wikipedia article addresses this point quite thoroughly. This is worth knowing, because you don’t just want to jump in and blindly do LR; you want to be sure it’s actually a reasonable approach.
When is it a bad idea to do LR? When you do linear regression, you’re assuming a certain relationship between your variables. Just the parameters and output of your regression won’t tell you whether the data really are appropriate for a linear model. Anscombe’s Quartet is a particularly striking example of how the output of a linear regression analysis can look similar but in fact the quality of the analysis can be radically different. Beyond this, it is a bad idea to do LR whenever the assumptions of LR are violated by the data; see the above bullet for more info there.
Can you do linear regression on a nonlinear relationship? In many cases, yes. What we need is for the model to be linear in the parameters \(\beta\); if, for example, you are comparing distance and time for a constantly accelerating object \(d = \frac{1}{2}at^2\), and you want to do regression to discover the acceleration \(a\), then you can just use \(t^2\) as your independent variable. The model relating \(d\) and \(t^2\) is linear in the acceleration \(a\), as required.
What does the “linear” in linear regression refer to? This one might seem trivial, but it’s a bit of a trick question; the relationship \(y = 2\log(x)\) might not appear linear, but in fact it can be obtained via a linear regression, by using \(\log(x)\) as the input variables, rather than \(x\). Of course, for this to work, you need to know ahead of time that you want to compare against \(\log(x)\), but this can be discovered via trial-and-error, to some extent. So the “linear” does, as you’d expect, mean that the relationship between independent and dependent variable is linear, but you can always change either of them and re-calculate your regression.
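To make a few of the answers above concrete, here is a minimal sketch using only the standard library. It regresses distance on \(t^2\) to recover an acceleration, using the closed-form solution that the normal equations reduce to when there is a single input variable, and then computes \(r^2\) as one minus the ratio of the residual variance to the original variance. The simulated data, the noise level, and the function names are assumptions for illustration.

```python
import random
import statistics


def simple_ols(xs, ys):
    """Solve the normal equations for y = b0 + b1 * x (one input variable)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1


def r_squared(xs, ys, b0, b1):
    """One minus (variance of the residuals) / (variance of y)."""
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return 1 - statistics.pvariance(residuals) / statistics.pvariance(ys)


rng = random.Random(1)
true_a = 9.8  # assumed acceleration used to simulate the data
ts = [0.1 * i for i in range(1, 101)]
ds = [0.5 * true_a * t ** 2 + rng.gauss(0, 1.0) for t in ts]

# The model d = (a / 2) * t^2 is nonlinear in t but linear in t^2,
# so we regress d on the transformed variable t^2.
t_squared = [t ** 2 for t in ts]
b0, b1 = simple_ols(t_squared, ds)
a_hat = 2 * b1  # the slope estimates a / 2

print(f"estimated a = {a_hat:.2f}, r^2 = {r_squared(t_squared, ds, b0, b1):.4f}")
```

With many input variables you would let a library solve the full matrix version of the same equations.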

Handling Overfitting

Overfitting is very important to understand, and is a fundamental challenge in machine learning and modeling. I’m not going to go into great detail on it here; more information will be presented in the machine learning section of the guide. There are some techniques for handling it that are particular to LR, which is what I’ll talk about here.

RealPython has good images showing examples of over-fitting. You can handle it by building into your model a “penalty” on the \(\beta_i\)s; that is, tell your model “I want low error, and I don’t want large coefficients.” The balance of these preferences is determined by a parameter, often denoted by \(\lambda\).

Since you have many \(\beta\)s, in general, you have to combine them in some fashion. Two such ways to calculate the measure of “overall badness” (which I’ll call \(OB\)) are

\[OB = \sqrt{ \beta_1^2 + \beta_2^2 + \ldots + \beta_n^2 }\]

\[OB = |\beta_1| + |\beta_2| + \ldots + |\beta_n|.\]

The first will tend to emphasize outliers; that is, it is more sensitive to single large \(\beta\)s. The second considers all the \(\beta\)s more uniformly. If you use the first, it is called “ridge regression” (in practice, ridge regression penalizes the square of this quantity), and if you use the second it is called “LASSO regression.”

In mathematics, these are the \(\ell_2\) and \(\ell_1\) norms, respectively, of the vector of \(\beta\)s; you can in theory use \(\ell_p\) norms for any \(p\), even \(p=0\) (count the number of non-zero \(\beta\)s to get the overall badness) or \(p=\infty\) (take the largest \(|\beta_i|\) as the overall badness). However, in practice, LASSO and ridge regression are already implemented in common packages, so it’s easy to use them right out of the box.
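The two penalties above are simple to compute directly. This sketch evaluates both for a pair of made-up coefficient vectors, and shows how a penalty would be added to a fit error with weight \(\lambda\); the function names and numbers are hypothetical, and this is not how any particular library implements the regressions.

```python
import math


def l2_penalty(betas):
    """Square root of the sum of squared coefficients (the first formula)."""
    return math.sqrt(sum(b ** 2 for b in betas))


def l1_penalty(betas):
    """Sum of absolute values of the coefficients (the second formula)."""
    return sum(abs(b) for b in betas)


def penalized_loss(squared_error, betas, lam, penalty):
    """Fit error plus lambda times the 'overall badness' of the coefficients."""
    return squared_error + lam * penalty(betas)


spread_out = [1.0, 1.0, 1.0, 1.0]   # four moderate coefficients
one_large = [4.0, 0.0, 0.0, 0.0]    # a single large coefficient

for name, betas in [("spread_out", spread_out), ("one_large", one_large)]:
    print(name, l2_penalty(betas), l1_penalty(betas))
```

Both vectors have the same second penalty (4.0), but the first penalty is twice as large for the vector with a single large coefficient (4.0 versus 2.0), which is the sensitivity to single large \(\beta\)s mentioned above.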

As usual, there is a LOT to learn about how LASSO and ridge regression change your output, and what kinds of problems they can address (and/or create). I’d highly recommend searching around the internet to learn more about them if you aren’t already confident in your understanding of how they work.

Logistic Regression

Logistic regression is a way of modifying linear regression models to get a classification model. The statistics of logistic regression are, generally speaking, not as clean as those of linear regression. It will be covered in the machine learning section, so we won’t discuss it here.

Bayesian Inference

Up until now this guide has primarily focused on frequentist topics in statistics, such as hypothesis testing and the frequentist approach to confidence intervals. There is an entire world of Bayesian statistical inference, which differs significantly from the frequentist approach in both philosophy and technique. I will only touch on the most basic application of Bayesian reasoning in this guide.

In this section, I will mostly defer to outside sources, who I think speak more eloquently on the topic than I can. Some companies (such as Google, or so I’m told) tend to focus on advanced Bayesian skills in their data science interviews; if you want to really learn the Bayesian approach, I’d recommend Gelman’s book, which is a classic in the field.

Bayesian vs Frequentist Statistics

It’s worth being able to clearly discuss the difference in philosophy and approach between the two schools of statistics. I particularly like the discussion in the MIT course notes. They state, more or less, that while the Bayesians like to reason from Bayes theorem

\[P(H|D) = \frac{ P(D|H)P(H)}{P(D)},\]

the frequentist school thinks that “the probability of the hypothesis” is a nonsense concept - it is not a well-founded probabilistic value, in the sense that there is no repeatable experiment you can run in which to gather relative frequency counts and calculate probabilities. Therefore, the frequentists must reason directly from \(P(D|H)\), the probability of the data given the hypothesis; the \(p\)-value is built from this quantity (it is the probability, under the hypothesis, of data at least as extreme as what was observed). The upside of this is that the probabilistic interpretation of \(P(D|H)\) is clean and unambiguous; the downside is that it is easy to misunderstand, since what we really think we want is “the probability that the hypothesis is true.”

If you want to know more about this, there are endless discussions of it all over the internet. Like many such dichotomies (emacs vs. vim, overhand vs underhand toilet paper, etc.) it is generally overblown - a working statistician should be familiar with, and comfortable using, both frequentist and Bayesian techniques in their analysis.

Basics of Bayes Theorem

Bayes theorem tells us how to update our belief in light of new evidence. You should be comfortable applying Bayes theorem in order to answer basic probability questions. The classic example is the “base rate fallacy”:

Consider a routine screening test for a disease. Suppose the frequency of the disease in the population (base rate) is 0.5%. The test is highly accurate with a 5% false positive rate and a 10% false negative rate. You take the test and it comes back positive. What is the probability that you have the disease?

The answer is NOT 0.95, even though the test has a 5% false positive rate. You should be able to clearly work through this problem, building probability tables and using Bayes theorem to calculate the final answer. The problem is worked through in the MIT stats course readings (see Example 10), so I’ll defer to them for the details.
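If you want to check your hand calculation, the arithmetic is short enough to sketch in a few lines. The function name and argument names below are my own; the numbers are the ones from the problem statement.

```python
def posterior_given_positive(base_rate, false_positive_rate, false_negative_rate):
    """P(disease | positive test) via Bayes theorem."""
    p_pos_given_disease = 1 - false_negative_rate  # true positive rate
    p_pos_given_healthy = false_positive_rate
    # Total probability of a positive test (the denominator in Bayes theorem).
    p_pos = (p_pos_given_disease * base_rate
             + p_pos_given_healthy * (1 - base_rate))
    return p_pos_given_disease * base_rate / p_pos


p = posterior_given_positive(base_rate=0.005,
                             false_positive_rate=0.05,
                             false_negative_rate=0.10)
print(f"P(disease | positive) = {p:.3f}")  # about 0.083
```

The low base rate means that most positive tests come from the much larger healthy population, which is why the answer is so far from 0.95.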

Updating Posteriors & Conjugate Priors

The above approach of calculating out all the probabilities by hand works reasonably well when there are only a few possible outcomes in the probability space, but it doesn’t scale well to large (discrete) probability spaces, and won’t work at all in continuous probability spaces. In such situations, you’re still fundamentally relying on Bayes theorem, but the way it is applied looks quite different - you end up using sums and integrals to calculate the relevant terms.

Again, I’ll defer to the MIT stats course readings for the details - readings 12 and 13 are the relevant ones here.

It’s particularly useful to be familiar with the concept of conjugate priors. In general, updating your priors involves computing an integral, which as anyone who has taken calculus knows can be a pain in the ass. When sampling from a distribution and estimating the parameters, there are certain priors for which the updates based on successive samples work out to be very simple.

For an example of this, suppose you’re flipping a biased coin and trying to figure out the bias. This is equivalent to sampling a binomial distribution and trying to estimate the parameter \(p\). If your prior is uniform (flat across the interval \([0,1]\)), then after \(N\) flips, \(k\) of which come up heads, your posterior probability density on \(p\) will be

\[f(p) \propto p^{k}(1-p)^{N-k}.\]

This is called a Beta distribution (here with parameters \(k+1\) and \(N-k+1\)). It is kind of magical that we can calculate this without having to do any integrals - this is because the Beta distribution is “conjugate to” the binomial distribution. It’s important that we started out with a uniform distribution as our prior, which is itself a Beta distribution - if we had chosen an arbitrary prior, the algebra might not have worked out as nicely. In particular, if we start with a non-Beta prior, then this trick won’t work, because our prior will not be conjugate to the binomial distribution.
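Here is a minimal sketch of what “no integrals” looks like in code: with a Beta prior, the whole update is just adding counts to the two parameters. The function names are hypothetical, and I’m using the standard parametrization in which the density of Beta(a, b) is proportional to \(p^{a-1}(1-p)^{b-1}\).

```python
def update_beta(a, b, heads, flips):
    """Posterior Beta(a, b) parameters after observing `heads` in `flips` tosses."""
    return a + heads, b + (flips - heads)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# A uniform prior on [0, 1] is the same thing as Beta(1, 1).
a, b = 1, 1
a, b = update_beta(a, b, heads=7, flips=10)
print(a, b, beta_mean(a, b))  # posterior is Beta(8, 4)

# Updating in two batches gives the same posterior as updating all at once.
batch = update_beta(*update_beta(1, 1, heads=3, flips=4), heads=4, flips=6)
assert batch == (a, b)
```

The posterior Beta(8, 4) has density proportional to \(p^{7}(1-p)^{3}\), matching the formula above with \(k=7\) and \(N=10\).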

The other important conjugate pair to know is that of the Gaussian distribution; it is, in fact, conjugate to itself, so if you put a normal prior on the mean of a normal distribution (with known variance), your posterior on the mean is also normal, and updating your belief about the mean based on new draws from the normal distribution is as simple as doing some algebra.
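The “some algebra” in the Gaussian case can be sketched as follows, under the assumption that the variance of the data is known and only the mean is being estimated. The function name, the prior, and the data are all made up for illustration.

```python
import statistics


def update_normal_mean(prior_mean, prior_var, data, noise_var):
    """Posterior (mean, variance) for the mean of a normal with known variance."""
    n = len(data)
    # Precisions (inverse variances) add; the means are precision-weighted.
    post_precision = 1 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var
                 + n * statistics.fmean(data) / noise_var) / post_precision
    return post_mean, 1 / post_precision


data = [4.8, 5.1, 5.3, 4.9, 5.4]
mean, var = update_normal_mean(prior_mean=0.0, prior_var=100.0,
                               data=data, noise_var=1.0)
print(mean, var)
```

With a vague prior like this one, the posterior mean lands very close to the sample mean, and the posterior variance is close to the noise variance divided by the number of samples.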

There are many good resources available online and in textbooks discussing conjugate priors; Wikipedia is a good place to start.

Maximum Likelihood Estimation

We discussed before the case where you have a bunch of survey data, and want to estimate the proportion of the population that identifies as female. Statistically speaking, this proportion is a parameter of the probability distribution over gender identity in that geographical region. We’ve intuitively been saying that if we see 250 out of 400 respond that they are female, then our best estimate of the proportion is 5/8. Let’s get a little more formal about why exactly this is our best estimate.

First of all, I’m going to consider a simplified world in which there are only two genders, male and female. I do this to simplify the statistics, not because it is an accurate model of the world. In this world, if the true fraction of the population that identifies as female is 0.6, then there is some non-zero probability that you would draw a sample of 400 people in which 250 identify as female. We call this the likelihood of the parameter 0.6. In particular, the binomial distribution tells us that

\[\mathcal{L}(0.6|n_\text{female}=250) = {400 \choose 250} \,0.6^{250}\, (1-0.6)^{400-250}\]

Of course, I could calculate this for any parameter in \([0,1]\); if I were very far from 5/8, however, then this likelihood would be very small.

Now, a natural question to ask is “which parameter \(p\) would give us the highest likelihood?” That is, which parameter best fits our data? That is the maximum-likelihood estimate of the parameter \(p\). The actual calculation of that maximum involves some calculus and a neat trick involving logarithms, but I’ll refer the reader elsewhere for those details. It’s worth noting that the MLE is often our intuitive “best guess” at the parameter; in this case, as you might anticipate, \(p=5/8\) maximizes the likelihood of seeing 250 people out of 400 identify as female.
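Rather than doing the calculus, you can convince yourself numerically. This sketch evaluates the (log) binomial likelihood on a grid of candidate values of \(p\) and picks the largest; the grid resolution and function name are arbitrary choices of mine, and working with logarithms is just a way to avoid multiplying very small numbers.

```python
import math


def log_likelihood(p, successes, n):
    """Log of the binomial likelihood of parameter p given the observed counts."""
    return (math.log(math.comb(n, successes))
            + successes * math.log(p)
            + (n - successes) * math.log(1 - p))


n, n_female = 400, 250
# Evaluate on a grid of candidate parameters strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, n_female, n))
print(p_hat)  # 0.625, i.e. 5/8
```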

I won’t give any question here, because I honestly have not seen any in my searching around. Even so, I think it’s an important concept to be familiar with. Maximum likelihood estimation often provides a theoretical foundation for our intuitive estimates of parameters, and it’s helpful to be able to justify yourself in this framework.

For example, if you’re looking at samples from an exponential distribution, and you want to identify the parameter \(\lambda\), you might guess that since the mean of an exponential random variable is \(\mu= 1/\lambda\), a good guess would be \(\lambda \approx 1/\overline x\), where \(\overline x\) is your sample mean. In fact you would be correct, and this is the MLE for \(\lambda\); you should be familiar with this way of thinking about parameter estimation.
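A quick simulation makes this estimate easy to sanity-check. The true rate, seed, and sample size below are arbitrary assumptions; the point is only that one over the sample mean lands close to the rate used to generate the data.

```python
import random
import statistics

rng = random.Random(42)
true_rate = 2.0  # assumed "unknown" parameter used to simulate the data
samples = [rng.expovariate(true_rate) for _ in range(10_000)]

# The MLE for the rate of an exponential is one over the sample mean.
rate_hat = 1 / statistics.fmean(samples)
print(f"true rate = {true_rate}, estimate = {rate_hat:.3f}")
```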

Experimental Design

Last, but certainly not least, is the large subject of experimental design. This is a more nebulous topic, and therefore harder to familiarize yourself with quickly, than the others we’ve discussed so far.

If we have some new feature, we might have reason to think it will be good to include in our product. For example, Facebook rolled out a “stories” feature some time ago (I honestly couldn’t tell you what it does, but it’s something that sits on the top of your newsfeed). However, before they expose this to all their users, they want to put it out there “in the wild” and see how it performs. So, they run an experiment.

Designing this experiment in a valid way is essential to getting meaningful, informative results. An interview question at Facebook might be: How will you analyze if launching stories is a good idea? What data would you look at? The discussion of this question could easily fill a full 45-minute interview session, as there are many nuances and details to examine.

One basic approach would be to randomly show the “stories” feature to some people, and not to others, and then see how it affects their behavior. This is an A/B test. Some questions you should be thinking about are:

What metrics will we want to track in order to measure the effect of stories? For example, we might measure the time spent on the site, the number of clicks, etc.
How should we randomize the two groups? Should we randomly choose every time someone visits the site whether to show them stories or not? Or should we make a choice for each user and fix that choice? Generally, user-based randomization is preferable, although sometimes it’s hard to do across devices (think about why this is).
How long should we run the tests? How many people should be in each group? This decision is often based on a power calculation, which gives us the probability of rejecting the null hypothesis, given some alternative hypothesis. I personally am not a huge fan of these because the alternative hypothesis is usually quite ad-hoc, but it is the standard, so it’s good to know how to do it. For example, you might demand that your test be large enough that if including stories increases site visit time by at least one minute, our A/B test will detect that with 90% probability.
When can we stop the test? The important thing to note here is that you cannot just stop the test once the results look good - you have to decide beforehand how long you want it to run.
How will you deal with confounding variables? What if, due to some technical difficulty, you end up mostly showing stories to users at a certain time of day, or in a certain geographical region? There are a variety of approaches here, and I won’t get into the details, but it’s essential that you be able to answer this concern clearly and thoroughly.
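The power calculation mentioned in the list above can be sketched with the usual normal approximation for comparing two means. Everything specific here is an assumption: the function name, the two-sided 5% significance level, and especially the five-minute standard deviation of visit time, which you would have to estimate from historical data.

```python
import math
from statistics import NormalDist


def samples_per_group(min_effect, std_dev, alpha=0.05, power=0.9):
    """Approximate group size for a two-sided, two-sample z-test on a mean."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) * std_dev / min_effect) ** 2
    return math.ceil(n)


# Detect a one-minute lift with 90% probability, assuming (hypothetically)
# that visit time has a standard deviation of five minutes.
print(samples_per_group(min_effect=1.0, std_dev=5.0))
```

Halving the effect you want to detect roughly quadruples the required group size, which is why the choice of alternative hypothesis matters so much.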

It’s also worth considering scenarios where you have to analyze data after the fact in order to perform “experiments”; sometimes you want to know (for example) if the color of a product has affected how well it sold, and you want to do so using existing sales data. What limitations might this impose? A key limitation is that of confounding variables - perhaps the product in red mostly sold in certain geographic regions, whereas the blue version sold better in other geographic regions. What impact will this have on your analysis?

There are many other considerations to think about around experimental design. I don’t have any particular posts that I like; I’d recommend searching around Google to find more information on the topic.

If you have any friends that do statistics professionally, I’d suggest sketching out a design for the above experiment and talking through it with them - the ability to think through an experimental design is something that is best developed over years of professional experience.

Conclusion

This guide has focused on some of the basic aspects of statistics that get covered in data science interviews. It is far from exhaustive - different companies focus on different skills, and will therefore be asking you about different statistical concepts and techniques. I haven’t discussed time-dependent statistics at all - Markov chains, time-series analysis, forecasting, and stochastic processes all might be of interest to employers if they are relevant to the field of work.

Please let me know if you have any corrections to what I’ve said here. I’m far from a statistician, so I’m sure that I’ve made lots of small (and some large) mistakes!

Stay tuned for the rest of the study guide, which should be appearing in the coming months. And finally, best of luck with your job search! It can be a challenging, and even demoralizing experience; just keep learning, and don’t let rejection get you down. Happy hunting!

Of course, the actual statement is careful about the mode of convergence, and the fact that it is actually an appropriately-normalized version of the distribution that converges, and so on. ↩
Again, we’re being loose here - it has to have finite variance, and the convergence is only in a specific sense. ↩
I’m being a little loose with definitions here - the width of a \(2\sigma\) interval is actually \(4\sigma\), but I think most would still describe it using the phrase “two-sigma”. ↩
As usual, we’re being a bit sloppy - we’re just using the sample variance in place of the true variance and pretending this is correct. This will work if the number of samples \(n\) is large. If you need confidence intervals with few (say, fewer than 15) samples, I recommend you look into confidence intervals based on the Student’s t distribution. ↩ ↩²
In doing bootstrapping, we’re really trying to find the distribution of our statistic \(\hat S\). So, what we find via this method are bounds \((l,u)\) such that \(P(l\leq \hat S \leq u)\geq C\). How does this relate to the definition of a confidence interval? This is a somewhat theoretic exercise, but can be helpful in clarifying your understanding of the more technical aspects of confidence interval computation. ↩
Why are you using MATLAB? Stop that. You’re not in school anymore. ↩
Some of the issues that arise here (for example, over- and under-fitting) have solutions that are more practical and less theoretical and statistical in nature - these will be covered in more depth in the machine learning portion of this guide, and so we don’t go into too much detail in this section. ↩
\(\beta_0\) just represents the difference in the mean of the two variables, so it could be non-zero even if the two are independent. ↩

New Paper: Metrics For Graph Comparison

2019-07-05T00:00:00+00:00

I just put a new paper up on the arXiv, and so I thought I would share it here. This was the final paper I wrote for my Ph.D., and it’s the one I’m most proud of. The paper is called “Metrics for Graph Comparison: a Practitioner’s Guide.”

The Basic Idea

Suppose you have two graphs, or even just a single graph that is changing in time. For example, you might have a social network between students at a school that evolves as time passes. Below, we see the social network for a particular French elementary school, which is evolving as the day passes. Each vertex is a person, and each edge indicates face-to-face contact.

One important question that we must answer is “how much did the graph change between times \(t\) and \(t+1\)?” Said another way, how similar are graphs \(G_t\) and \(G_{t+1}\)? The central subjects of this paper are the many methods available for comparing graphs.

We study these methods both by looking at empirical examples like the one above, as well as by doing a large study of the statistics of comparing various random graph models. Which graph comparison tool can best distinguish an Erdos-Renyi random graph from a stochastic blockmodel? What about comparing a random graph with fixed degree distribution to a preferential attachment graph? Using Monte Carlo simulation of the graphs, we are able to answer these questions and gain insight into the behavior of our distances when they are used on a variety of different structures and geometries.

One important focus of the paper is on practicality, and so we only look at distances that are linear or near-linear (i.e. \(O(n)\) or \(O(n \log n)\)) in the number of vertices in the graph.¹ More computationally expensive distances may be of theoretical interest, but for the graphs used in business, which often range upwards of 1 million vertices, they are not feasible to use.

Findings

There is a lot of nuance in the interpretation of these comparisons - it’s not as simple as “method X is the best”. The results depend strongly on the structural differences you wish to learn about. Do you care about total connectivity? Then just use a simple edit distance. If you care about the community structure of a graph, then you should probably use a spectral distance.
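To give a feel for the simplest of these, here is a minimal sketch of an edit distance between two undirected graphs on the same vertex set, counting the edges that appear in one graph but not the other. This is a standalone illustration with made-up function names and data, not the NetComp implementation.

```python
def edge_set(edges):
    """Undirected edges as a set of frozensets, so (u, v) equals (v, u)."""
    return {frozenset(edge) for edge in edges}


def edit_distance(edges_a, edges_b):
    """Number of edges to add or remove to turn one graph into the other."""
    return len(edge_set(edges_a) ^ edge_set(edges_b))


g_t = [(1, 2), (2, 3), (3, 4)]
g_next = [(2, 1), (3, 4), (4, 5), (1, 5)]
print(edit_distance(g_t, g_next))  # 3: remove (2, 3), add (4, 5) and (1, 5)
```

A distance like this only sees how many edges changed, not where they changed, which is why it says little about community structure.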

That said, we find that spectral methods (which are quite standard, and have been around for some time) are strong performers all around. They are robust, flexible, and have the added benefit of easy implementation - fast spectral algorithms are ubiquitous in modern computing packages such as MATLAB, SciPy, and Julia.

For example, here is a plot showing how well the different distances are able to discern an Erdos-Renyi random graph from a stochastic blockmodel.

Higher numbers mean that the distances can more reliably discern between the two populations. We see that the adjacency spectral distance \(\lambda^A\) and the normalized Laplacian spectral distance \(\lambda^{\mathcal L}\) are most reliably able to pick out the community structure that differentiates between these two models. This is not surprising, as the spectrum of the graph has a direct interpretation in terms of vibrational modes, which depend critically upon community structure.

If you want to know more, check out the full paper. The above result is just one of a large collection of findings that we lay out. As I said before, the idea isn’t to come to a single conclusion; it is to survey the landscape and to compare and contrast these different tools.

Conclusion

In research, so many people spend so much time developing new methods, and I always think to myself, “How does this compare to the standard method? Is it actually an improvement?” This paper attempts to take stock of a number of standard and cutting-edge methods in graph comparison, and see what works best. After spending some time doing a theoretical analysis of a particular graph distance metric (see my previous paper) I was curious to see how all the tools available compared to one another.

Also, I’ve implemented many of these distances in my Python library NetComp, which you can get via pip install netcomp. Check it out, and feel free to post issues and/or PRs if you want to add to/modify the library.

Let me know in the comments what you think! Or feel free to email me if you have more detailed questions about graph metrics. Happy Friday!

This is paired with the assumption that the graph is sparse, so the number of edges is \(O(n \log n)\) ↩

Types as Propositions

2018-11-30T00:00:00+00:00

Some of the most meaningful mathematical realizations that I’ve had have been unexpected connections between two topics; that is, realizing that two concepts that first appeared quite distinct are in fact one and the same. In our first linear algebra courses, we learn that manipulating matrices is, in fact, equivalent to solving systems of equations. In quantum mechanics, we see that physically observable quantities are, mathematically speaking, linear operators (I still don’t quite grok this one). And, my personal favorite example, we learn in functional analysis that the linear functionals in the dual space of a Hilbert space are themselves in perfect correspondence with the functions in the original space.¹

Recently, I’ve stumbled upon another such result, which has captured my attention for a while. The result, often referred to as the Curry-Howard correspondence, is the statement that propositions in a formal logical system are equivalent to types in the simply typed lambda calculus. Loosely, this means that logical statements are equivalent to data types!

Let’s unpack that a bit; “propositions” are just statements in a logical system.² In mathematics, for example, one might put forward the proposition “no even numbers are prime,” or “14 is greater than 18”. Note that propositions need not be true; in fact, some logical systems support propositions that cannot even be determined to be true or false.³ “Types” can be thought of as types in a computing language; Integer, Boolean, and so on. We will have much more to say about types as we move forward, but for now, hold in your mind the conventional notion of types as defined in a language such as Java or Python (or better yet, Haskell).

How on earth could these two be in correspondence? On the surface, they appear entirely separate concepts. In this post, I’ll spend some time unpacking what this equivalence is actually saying, using a simple example. I am far from a full understanding of it, but as usual, I write about it in the hopes that I’ll be forced to clarify what I do understand, or even better, be corrected by someone more knowledgable than myself.

Speaking of those more knowledgable than myself, there are various resources online that I found very helpful in understanding the correspondence: Philip Wadler’s talk on the subject is a great starting point, and there are a number of useful discussions available on StackExchange and various functional programming forums.

An Example

I was confused by the idea of propositions as types when I first encountered it, and after learning more, I believe that the root of my confusion lies in the fact that types such as Integer, Boolean, and String, which we are familiar with from programming, correspond to very trivial propositions, making them poor examples. We’ll have to introduce something a bit fancier; a conditional type. For example, OddInt might be odd Integers, and PrimeInt might be prime integers. We’ll approximate these conditional types with custom classes in Scala. Classes and types are different beasts, of course, but we will ignore that distinction in this post.⁴

Let’s consider one conditional type in particular: BigInteger. This type (actually a class in this example) is defined as follows:

class BigInteger (val value: Int) {

  private final val LOWER_BOUND = 10000
  
  if (value < LOWER_BOUND) {
    throw new IllegalArgumentException("Too small!")
  }
  
  override def toString = s"BigInteger($value)"

}

One could then instantiate a BigInteger as follows:

val big = new BigInteger(10001)
// big: BigInteger = BigInteger(10001)

val small = new BigInteger(500)
// java.lang.IllegalArgumentException: Too small!

Now the fundamental question: what proposition corresponds to this type? In simple scenarios like this, the corresponding proposition is that the type can be inhabited; that is, there exists a value that satisfies that type. For example, the type BigInteger corresponds to the claim “there exists an integer \(i\) for which \( i > 10,000 \)”. Obviously, such an integer exists, and the fact that we can instantiate this type indicates that it corresponds to a true proposition. Alternatively, consider a type WeirdInteger, which is an integer satisfying i < 3 && i > 5. We can define the type well enough, but there are no values which satisfy it; it is an uninhabitable type, and so corresponds to a false proposition.

Functions and Implication

Let’s make things a little more interesting. In programming languages, there are not only primitive types like Integer and Boolean, but there are also function types, which are the types of functions. For example, in Scala, the function def f(x: Int) = x.toString has type Int => String, which is to say it is a function that maps integers to strings.

What sort of propositions would functions correspond to? It turns out that functions naturally map to implication. In some ways, the correspondence here is very natural. Consider the conditional type BigInteger, and the conditional type BiggerInteger. The definition of the latter should look familiar, from above:

class BiggerInteger (val value: Int) {

  private final val LOWER_BOUND = 20000
  
  if (value < LOWER_BOUND) {
    throw new IllegalArgumentException("Too small!")
  }
  
  override def toString = s"BiggerInteger($value)"

}

Now, we can write a function that maps BigInteger to BiggerInteger:

def makeBigger(b: BigInteger): BiggerInteger = 
  new BiggerInteger(b.value * 2)

Recall that the proposition corresponding to the type BigInteger is the statement “there exists an integer greater than 10,000”, and the proposition corresponding to BiggerInteger is the statement “there exists an integer greater than 20,000”; the proposition corresponding to the function type BigInteger => BiggerInteger is then just the statement “the existence of an integer above 10,000 implies the existence of an integer above 20,000”. And note that, as it should be for an implication, we do not care whether there actually does exist an integer above 10,000; we simply know that if one exists, then its existence implies the existence of an integer above 20,000.

To be a bit more explicit, the function that we wrote above can be thought of as a proof of the implication; in particular, if we suppose that there exists an \(i\) such that \(i > 10,000\), then clearly \(2i > 20,000\), and so if we let \(j=2i\), then we have proven the existence of a \(j\) such that \(j > 20,000\). This is what the theoretical computer scientists mean when they say that “programs are proofs”.

Of course, Scala is not a proof-checking language, and cannot tell during compilation that the function makeBigger is valid; we would need a much richer type system to be able to validate such functions. Consider that the following function compiles with no problem, although there are no input values for which it will not throw a (runtime) exception:

def wonky(b: BigInteger): BiggerInteger = 
  new BiggerInteger(b.value % 1000)

Wait… what?

If you think about it a bit more, it’s sort of a weird example; you could map any type to BiggerInteger, just by doing def f[A](a:A): BiggerInteger = new BiggerInteger(20001). This is because the proposition that corresponds to BiggerInteger is true (the type is inhabitable), and if B is true, then A implies B for any A at all.

Common languages such as Haskell only express very trivial propositions with their types; there does exist one uninhabitable type (Void), but I have not found much use for it in practice. The benefit of using conditional types for these examples is that we can explore at least some types which have corresponding false propositions, such as WeirdInteger, which are integers i which satisfy i < 3 && i > 5.

In Conclusion

Seeing all this, you can begin to get a sense of how computer-assisted proof techniques might arise out of it. If the fact that a program compiles is equivalent to the truth of the corresponding proposition, then all we need is a language with a rich enough type system to express interesting statements. Examples of languages used in this way include Coq and Agda. A thorough discussion of such languages is beyond both the scope of this post and my understanding.

I think what keeps me interested in this subject is that it still remains quite opaque to me; I’ve struggled to even come up with these simple (and flawed) examples of how the Curry-Howard correspondence plays out in practice. I hope that anyone reading this who understands the subject better than I do will leave a detailed list of my misunderstandings, so that I can better grasp this mysterious and fascinating topic.

This statement is difficult to understand without background in functional analysis, but it is in fact one of the most beautiful examples of such an equivalence result. ↩
I’m being a bit sloppy here. The type of logic we’re talking about here is not classical logic, but rather logic in the sense of natural deduction. ↩
Such systems are called undecidable; see the wiki entry on decidability for more information. ↩
We won’t be careful about whether the idea of conditional types presented here corresponds well with conditional types as they are actually implemented in programming languages such as Typescript. ↩

Inverse Transform Sampling in Python

2018-06-24T00:00:00+00:00

When doing data work, we often need to sample random variables. This is easy to do if one wishes to sample from a Gaussian, or a uniform random variable, or a variety of other common distributions, but what if we want to sample from an arbitrary distribution? There is no obvious way to do this within scipy.stats. So, I built a small library, inverse-transform-sample, that allows for sampling from arbitrary user-provided distributions. In use, it looks like this:

import numpy as np
from itsample import sample

pdf = lambda x: np.exp(-x**2/2)  # unit Gaussian, not normalized
samples = sample(pdf, 1000)  # generate 1000 samples from pdf

The code is available on GitHub. In this post, I’ll outline the theory of inverse transform sampling, discuss computational details, and outline some of the challenges faced in implementation.

Introduction to Inverse Transform Sampling

Suppose we have a probability density function \(p(x)\), which has an associated cumulative distribution function (CDF) \(F(x)\), defined as usual by

\[F(x) = \int_{-\infty}^x p(s)ds.\]

Recall that the cumulative distribution function \(F(x)\) tells us the probability that a random sample from \(p\) is less than or equal to \(x\).

Let’s take a second to notice something here. If we knew, for some \(x\), that \(F(x)=t\), then drawing \(x\) from \(p\) is in some way equivalent to drawing \(t\) from a uniform random variable on \([0,1]\), since the CDF for a uniform random variable is \(F_u(t) = t\).¹

That realization is the basis for inverse transform sampling. The procedure is:

1. Draw a sample \(t\) uniformly from the interval \([0,1]\).
2. Solve the equation \(F(x)=t\) for \(x\) (invert the CDF).
3. Return the resulting \(x\) as the sample from \(p\).
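As a concrete illustration of the three steps, here is a minimal sketch for a case where the CDF can be inverted by hand: the exponential distribution with density \(p(x) = \lambda e^{-\lambda x}\) for \(x \geq 0\), for which \(F(x) = 1 - e^{-\lambda x}\) and so \(x = -\log(1-t)/\lambda\). The function name sample_exponential is made up for this example and is not part of inverse-transform-sample.

```python
import math
import random


def sample_exponential(rate, n, seed=None):
    """Draw n samples from p(x) = rate * exp(-rate * x), x >= 0."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        t = rng.random()                 # step 1: t uniform on [0, 1)
        x = -math.log(1.0 - t) / rate    # step 2: solve F(x) = 1 - exp(-rate * x) = t
        samples.append(x)                # step 3: x is a sample from p
    return samples


samples = sample_exponential(rate=2.0, n=100_000, seed=0)
sample_mean = sum(samples) / len(samples)   # should be close to 1 / rate = 0.5
```

For most distributions no such closed-form inverse exists, which is where the numerical work discussed next comes in.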

Computational Considerations

Most of the computational work done in the above algorithm comes in at step 2, in which the CDF is inverted.² Consider Newton’s method, a typical routine for finding numerical solutions to equations: the approach is iterative, and so the function to be inverted, in our case the CDF \(F(x)\), is evaluated many times. Now, in our case, since \(F\) is a (numerically computed) integral of \(p\), this means that we will have to run our numerical quadrature routine once for each evaluation of \(F\). Since we need many evaluations of \(F\) for a single sample, this can lead to a significant slowdown in sampling.

Again, the pain point here is that our CDF \(F(x)\) is slow to evaluate, because each evaluation requires numerical quadrature. What we need is an approximation of the CDF that is fast to evaluate, as well as accurate.
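To make the cost concrete, here is a rough sketch that counts PDF evaluations while inverting a numerically integrated CDF. It is not the library's code: it swaps in composite Simpson quadrature on a truncated interval, and bisection in place of Newton's method, purely to keep the example dependency-free. The names cdf, invert_cdf and pdf_calls, the interval \([-8, 8]\), and the panel count are all assumptions of this sketch.

```python
import math


def pdf(x):
    """Unit Gaussian density (normalized here for simplicity)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


pdf_calls = 0   # running count of PDF evaluations


def cdf(x, lower=-8.0, panels=200):
    """Approximate F(x) with composite Simpson quadrature on [lower, x]."""
    global pdf_calls
    if x <= lower:
        return 0.0
    h = (x - lower) / panels
    total = pdf(lower) + pdf(x)
    for k in range(1, panels):
        total += (4 if k % 2 else 2) * pdf(lower + k * h)
    pdf_calls += panels + 1
    return total * h / 3.0


def invert_cdf(t, lo=-8.0, hi=8.0, tol=1e-8):
    """Solve F(x) = t by bisection; every iteration costs one full quadrature."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x = invert_cdf(0.975)   # 97.5th percentile of a unit normal, roughly 1.96
```

With these settings a single sample costs several thousand PDF evaluations (inspect pdf_calls after the call). A smarter root-finder needs fewer iterations, but each one still pays for a full quadrature.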

Chebyshev Approximation of the CDF

I snooped around on the internet a bit, and found this feature request for scipy, which is related to this same issue. Although it never got off the ground, I found an interesting link to a 2013 paper by Olver & Townsend, in which they suggest using Chebyshev polynomials to approximate the PDF. The advantage of this approach is that the integral of a series of Chebyshev polynomials is known analytically - that is, if we know the Chebyshev expansion of the PDF, we automatically know the Chebyshev expansion of the CDF as well. This should allow us to rapidly invert the (Chebyshev approximation of the) CDF, and thus sample from the distribution efficiently.
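The key property, that the Chebyshev coefficients of the CDF follow directly from those of the PDF, can be sketched in a few lines. This is an illustration under stated assumptions rather than the code from inverse-transform-sample or from Olver & Townsend: the helper names (cheb_coeffs, cheb_integrate, cheb_eval), the fixed degree of 60, and the bounds \([-8, 8]\) are all choices made for this example. For a series \(\sum_k c_k T_k\) on \([-1,1]\), the antiderivative has coefficients \(b_1 = c_0 - c_2/2\) and \(b_k = (c_{k-1} - c_{k+1})/(2k)\) for \(k \geq 2\).

```python
import math


def cheb_coeffs(f, a, b, n):
    """Coefficients c_0..c_n of the degree-n Chebyshev interpolant of f on [a, b]."""
    m = n + 1
    # Chebyshev nodes of the first kind are cos(theta_j); map them onto [a, b]
    thetas = [math.pi * (j + 0.5) / m for j in range(m)]
    values = [f(0.5 * (b - a) * math.cos(th) + 0.5 * (b + a)) for th in thetas]
    coeffs = []
    for k in range(m):
        s = sum(v * math.cos(k * th) for v, th in zip(values, thetas))
        coeffs.append((1.0 if k == 0 else 2.0) * s / m)
    return coeffs


def cheb_integrate(coeffs, a, b):
    """Coefficients of the antiderivative that vanishes at x = a."""
    n = len(coeffs)
    c = list(coeffs) + [0.0, 0.0]   # padding so c[k + 1] is always defined
    out = [0.0] * (n + 1)
    out[1] = c[0] - 0.5 * c[2]
    for k in range(2, n + 1):
        out[k] = (c[k - 1] - c[k + 1]) / (2.0 * k)
    scale = 0.5 * (b - a)           # dx = scale * du for u in [-1, 1]
    out = [scale * v for v in out]
    # pick the constant term so the series is zero at u = -1, where T_k(-1) = (-1)^k
    out[0] = -sum(v * (-1) ** k for k, v in enumerate(out) if k > 0)
    return out


def cheb_eval(coeffs, a, b, x):
    """Evaluate a Chebyshev series on [a, b] at x using Clenshaw's recurrence."""
    u = (2.0 * x - a - b) / (b - a)
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * u * b1 - b2 + c, b1
    return coeffs[0] + u * b1 - b2


def pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


a, b = -8.0, 8.0
pdf_coeffs = cheb_coeffs(pdf, a, b, 60)
cdf_coeffs = cheb_integrate(pdf_coeffs, a, b)
cdf_at_zero = cheb_eval(cdf_coeffs, a, b, 0.0)   # close to 0.5
```

Once cdf_coeffs is built, each CDF evaluation is a short recurrence over the coefficients and involves no quadrature at all.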

Other Approaches

There are also less mathematically sophisticated approaches that immediately present themselves. One might consider solving \(F(x)=t\) on a grid of \(t\) values, and then building the function \(F^{-1}(t)\) by interpolation. One could even simply transform the provided PDF into a histogram, and then use the functionality built in to scipy.stats for sampling from a provided histogram (more on that later). However, due to time constraints, inverse-transform-sample only includes the numerical quadrature and Chebyshev approaches.
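A minimal sketch of the interpolation idea follows. It is a slight variant of what is described above: rather than solving \(F(x)=t\) on a grid of \(t\) values, it tabulates \(F\) on a grid of \(x\) values with the trapezoid rule and then inverts the table by linear interpolation. None of this is in inverse-transform-sample; the name build_inverse_cdf, the grid size, and the finite bounds are assumptions of the example.

```python
import bisect
import math
import random


def build_inverse_cdf(pdf, a, b, n=2000):
    """Tabulate the CDF on a grid over [a, b] and return an interpolated inverse."""
    h = (b - a) / n
    xs = [a + k * h for k in range(n + 1)]
    ps = [pdf(x) for x in xs]
    cdf = [0.0]
    for k in range(n):
        cdf.append(cdf[-1] + 0.5 * h * (ps[k] + ps[k + 1]))   # trapezoid rule
    total = cdf[-1]
    cdf = [c / total for c in cdf]   # normalizing lets the PDF be unnormalized

    def inverse(t):
        # find the grid cell with cdf[i] <= t <= cdf[i + 1], then interpolate linearly
        i = min(max(bisect.bisect_right(cdf, t) - 1, 0), n - 1)
        width = cdf[i + 1] - cdf[i]
        frac = (t - cdf[i]) / width if width > 0 else 0.0
        return xs[i] + frac * h

    return inverse


pdf = lambda x: math.exp(-x * x / 2.0)   # unit Gaussian, not normalized
inverse_cdf = build_inverse_cdf(pdf, -8.0, 8.0)

rng = random.Random(0)
samples = [inverse_cdf(rng.random()) for _ in range(50_000)]
```

All of the PDF evaluations happen once, up front; after that each sample costs a binary search and one interpolation. The price is that accuracy is limited by the grid spacing, and the bounds must be finite.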

Implementation in Python

The implementation of this approach is not horribly sophisticated, but in exchange it exhibits that wonderful readability characteristic of Python code. The complexity is the highest in the methods implementing the Chebyshev-based approach; those without a background in numerical analysis may wonder, for example, why the function is evaluated on that particularly strange set of nodes.

In the quadrature-based approach, the numerical quadrature and root-finding are both done via the scipy library (scipy.integrate.quad and scipy.optimize.root, respectively). When using this approach, one can set the boundaries of the PDF to be infinite, as scipy.integrate.quad supports improper integrals. In the notebook of examples, we show that the samples generated by this approach do, at least in the eyeball norm, conform to the provided PDF. As we expected, this approach is slow - it takes about 7 seconds to generate 5,000 samples from a unit normal.

As with the quadrature and root-finding, pre-rolled functionality from scipy was used to both compute and evaluate the Chebyshev approximants. When approximating a PDF using Chebyshev polynomials, finite bounds must be provided. A user-determined tolerance determines the order of the Chebyshev approximation; however, rather than computing a true error, we simply use the size of the last few Chebyshev coefficients as an approximation. Since this approach differs from the previous one only in the way that the CDF is constructed, we use the same function sample for both approaches; an option chebyshev=True will generate a Chebyshev approximant of the CDF, rather than using numerical quadrature.

I hoped that the Chebyshev approach would improve on the quadrature timing by an order of magnitude or two; however, my hopes were thwarted. The implementation of the Chebyshev approach is faster by perhaps a factor of 2 or 3, but does not offer the kind of improvement I had hoped for. What happened? In testing, a single evaluation of the Chebyshev CDF was not much faster than a single evaluation of the quadrature CDF. The advantage of the Chebyshev CDF comes when one wishes to evaluate a long, vectorized set of inputs; in this case, the Chebyshev CDF is orders of magnitude faster than quadrature. But scipy.optimize.root does not appear to take advantage of vectorization, which makes sense - in simple iteration schemes, the value at which the next iteration occurs depends on the outcome of the current iteration, so there is not a simple way to vectorize the algorithm.

Conclusion

I suspect that the reason this feature is absent from large-scale libraries like scipy and numpy is that it is difficult to build a sampler that is both fast and accurate over a large enough class of PDFs. My approach sacrifices speed; other approximation schemes may be very fast, but may not provide the accuracy guarantees needed by some users.

What we’re left with is a library that is useful for generating small numbers (less than 100,000) of samples. It’s worth noting that in the work of Olver & Townsend, they seem to be able to use the Chebyshev approach to sample orders of magnitude faster than my implementation, but sadly their Matlab code is nowhere to be found in the Matlab library chebfun, which is the location advertised in their work. Presumably they implemented their own root-finder, or Chebyshev approximation scheme, or both. There’s a lot of space for improvement here, but I simply ran out of time and energy on this one; if you feel inspired, fork the repo and submit a pull request!

This is only true for \(t\in [0,1]\). For \(t<0\), \(F_u(t)=0\), and for \(t>1\), \(F_u(t)=1\). ↩
The inverse of the CDF is often called the percent point function, or PPF. ↩

Algorithmic Musical Genre Classification

2018-06-06T00:00:00+00:00

