Posts

Prediction of NYC Taxi Demand

1. Narrative
(TODO) Code published on GitHub.
2. Data
2.1 Taxi
We use the yellow cab trip CSV data available here. The columns that we use are the pickup location (latitude and longitude) and the pickup datetime. The CSVs contain other columns, such as drop-off information, trip distance, and fare; many other analyses are possible with this data. For each year, there are about 170 million records, which amount to about 25 GB.
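As a rough illustration of working with data at this scale (my own sketch, not the project's code), the pickup columns can be read in chunks rather than loading a full year at once; the file name and column names below are placeholders and may differ from the actual TLC schema.

```python
import pandas as pd

# Hypothetical file name; the TLC publishes one CSV per month.
CSV_PATH = "yellow_tripdata_2016-01.csv"

# Column names are assumptions -- check the header of the file you download.
USECOLS = ["tpep_pickup_datetime", "pickup_longitude", "pickup_latitude"]

def iter_pickups(path=CSV_PATH, chunksize=1_000_000):
    """Yield chunks containing only the pickup time and location columns."""
    for chunk in pd.read_csv(
        path,
        usecols=USECOLS,
        parse_dates=["tpep_pickup_datetime"],
        chunksize=chunksize,
    ):
        yield chunk

if __name__ == "__main__":
    total = sum(len(chunk) for chunk in iter_pickups())
    print(f"{total} pickup records")
```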
Read more

Articles---mainly about Streaming Systems

Articles and talks that I referred to while working on the taxi data project. Published here as a note to myself.
Model Serving
- FLIP-23. The document also discusses implementing model training as well as model serving. Two linked documents, "Flink ML Roadmap" and "Flink-MS", are also worth reading.
- Boris Lublinsky's book "Serving Machine Learning Models", and his talk.
Distributed Systems
- Please Stop Calling Database Systems AP or CP.
- Kate Matsudaira on distributed systems.
- Distributed Systems for Fun and Profit.
- Martin Kleppmann's book and interview.
Read more

Keyword Extraction from arXiv - Summary

In the previous articles on arXiv keyword extraction, I focused on the details of setting up an infrastructure to serve the algorithm. Since then, I have written a more complete keyword extraction algorithm and deployed it to Google Cloud Platform. This article gives a more high-level overview of the end product.
Motivation
I built this app to help myself keep learning mathematics. When I was in graduate school, I learned the terminology of my area of study by attending seminars: whenever the speaker used a term that I wasn't familiar with, I would write it down and look it up later.
Read more

Learning Resources

Freely available online resources that I've found useful for learning Statistics, Machine Learning, Distributed Computing, Database Systems, and other CS and SWE topics.
Statistics and Machine Learning
ML
Expositions on machine learning that are freely accessible (like the ones on Coursera) tend to sweep theoretical foundations and mathematical rigor under the carpet; these are resources that don't skimp on the hard stuff.
Stanford CS229 Machine Learning
One-line summary: the course teaches you how to set up a cost function based on a model and data, and how to optimize it.
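As a toy illustration of that one-line summary (my own example, not taken from the course materials): write down a squared-error cost for a linear model and minimize it with plain gradient descent.

```python
import numpy as np

# Toy data; in practice X and y come from your dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def cost(theta, X, y):
    """Mean squared error of the linear model X @ theta."""
    residual = X @ theta - y
    return (residual @ residual) / (2 * len(y))

def gradient(theta, X, y):
    """Gradient of the cost with respect to theta."""
    return X.T @ (X @ theta - y) / len(y)

theta = np.zeros(X.shape[1])
for _ in range(1000):            # plain gradient descent
    theta -= 0.1 * gradient(theta, X, y)

print(cost(theta, X, y), theta)
```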
Read more

Keyword Extraction from arXiv - Part 3

This is the final part of the tutorial. We furnish the app with a UI, and deploy it to Heroku.
5. Build UI
5.1 index.html and main.js
We first create a dropdown list. The selected category is stored in selected, and it is posted to /start when submit is called. Afterwards, the component polls /results.
main.js - categoryDropdown
var categoryDropdown = new Vue({
  el: '#category-dropdown',
  data: { selected: '' },
  methods: {
    submit: function() {
      keywordsResult.
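For context, here is a minimal sketch of the backend side of that exchange, assuming a Flask app in the style of Flask by Example. The route names follow the excerpt (/start plus a polled results endpoint), but the job-id parameter and the in-memory bookkeeping are placeholders of mine, not the tutorial's actual code.

```python
# Sketch (not the tutorial's exact code): POST /start kicks off a
# keyword-extraction job and returns a job id; GET /results/<job_id> is
# polled by the frontend until the result is ready.
from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}  # in the tutorial, a Redis-backed queue plays this role

@app.route("/start", methods=["POST"])
def start():
    category = request.form["selected"]  # category chosen in the dropdown
    job_id = str(len(jobs) + 1)
    jobs[job_id] = {"category": category, "keywords": None}
    # ... enqueue the actual extraction work here ...
    return jsonify({"job_id": job_id}), 202

@app.route("/results/<job_id>")
def results(job_id):
    job = jobs.get(job_id)
    if job is None or job["keywords"] is None:
        return jsonify({"status": "pending"}), 202
    return jsonify({"status": "done", "keywords": job["keywords"]})
```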
Read more

Keyword Extraction from arXiv - Part 2

In Part 1, we developed a keyword extraction algorithm. The next step is to modify the algorithm to use a database. Configuring Postgres is more involved than in Flask by Example, since we need models to store article data. The following diagram shows the architecture of our system. We use the end product of the Flask by Example tutorial as a boilerplate. Complete Parts 1-4 of Flask by Example, or clone the repo and configure Postgres by following these steps:
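The "models to store article data" could look roughly like the following Flask-SQLAlchemy sketch; the table and column names are my own placeholders, not the tutorial's schema.

```python
# Sketch of a SQLAlchemy model for storing article data in Postgres.
# Table and column names are placeholders, not the tutorial's actual schema.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://localhost/arxiv"
db = SQLAlchemy(app)

class Article(db.Model):
    __tablename__ = "articles"

    id = db.Column(db.Integer, primary_key=True)
    arxiv_id = db.Column(db.String(32), unique=True, nullable=False)
    category = db.Column(db.String(32), index=True)  # e.g. "math.PR"
    title = db.Column(db.Text)
    abstract = db.Column(db.Text)

    def __repr__(self):
        return f"<Article {self.arxiv_id}>"
```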
Read more

Keyword Extraction from arXiv - Part 1

This is a tutorial on web development written for people with a statistical analysis, scientific computing, or machine learning background. We start with an algorithm using data that fits comfortably into memory, and modify it to accept a large input. We then set up an infrastructure to serve the resulting algorithm. This tutorial focuses on the infrastructure rather than the algorithm, which will remain rudimentary. The end product is a Heroku deployment of a text summarization algorithm that analyzes articles on arXiv to extract keywords from each research category within mathematics.
Read more

Simple Linear Regression with Heteroskedastic Noise

Introduction
The model we consider is \(Y_i = \alpha + \beta x_i + \epsilon_i\), where \( \epsilon_i \) are uncorrelated and \( \mathbb{V}(\epsilon_i) \) depends on \( i \). We discuss two approaches to estimating \( \alpha, \beta \). Weighted least squares regression leads to the best linear unbiased estimators (BLUE). Also, with stronger assumptions on \( \epsilon_i \), maximum likelihood estimators (MLE) can be found. We begin with a discussion of the homoskedastic case, with an emphasis on the relations between statistical properties of the least squares estimators and the assumptions on \( \epsilon_i \), which is conducive to understanding the heteroskedastic case.
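For reference, the weighted least squares estimators take the standard textbook form below, where the weights \( w_i = 1/\mathbb{V}(\epsilon_i) \) are assumed known, at least up to a constant.

```latex
% Weight each observation by the inverse of its variance, w_i = 1/\sigma_i^2,
% and minimize the weighted sum of squares
\min_{\alpha,\beta} \sum_{i=1}^{n} w_i \left( Y_i - \alpha - \beta x_i \right)^2 ,
% which yields
\hat{\beta} = \frac{\sum_i w_i (x_i - \bar{x}_w)(Y_i - \bar{Y}_w)}
                   {\sum_i w_i (x_i - \bar{x}_w)^2},
\qquad
\hat{\alpha} = \bar{Y}_w - \hat{\beta}\,\bar{x}_w,
% where \bar{x}_w and \bar{Y}_w are the weighted means
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad
\bar{Y}_w = \frac{\sum_i w_i Y_i}{\sum_i w_i}.
```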
Read more

Benchmarking Linear Classifiers

I ran linear classifiers on a credit card fraud dataset. Parallelization, lasso and ridge penalties, grid search. Published on Kaggle.
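A sketch of the kind of setup described (my own illustration, not the published kernel): logistic regression with L1 and L2 penalties, tuned by a grid search whose cross-validation fits run in parallel. The CSV name and the "Class" label column follow the common Kaggle credit card fraud dataset, but are assumptions here.

```python
# Sketch: grid search over L1 (lasso-like) and L2 (ridge-like) penalties for a
# linear classifier, with the candidate fits evaluated in parallel (n_jobs=-1).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("creditcard.csv")            # assumed file name
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    scoring="average_precision",  # fraud data is highly imbalanced
    n_jobs=-1,                    # run candidate fits in parallel
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```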