## Posts

Draft.
1. Narrative (TODO)

Code published on GitHub.
The data is from a Kaggle competition hosted by Avazu.

2. Data

The columns are:

- id
- click
- hour (YYMMDDHH)
- banner_pos
- site/app_id/domain/category
- device variables: id, ip, model, type, conn_type
- C1, C14, C15, …, C21: anonymous features

Some of the anonymous features must be properties of the ad (e.g. id, category, marketer). In most columns, hashed values are given. As the raw values were not salted prior to hashing, the original strings can be recovered.
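
As a small illustration of the hour column, a value in YYMMDDHH form can be parsed with the standard library. This is only a sketch: the helper name `parse_hour` and the sample value are made up for illustration and are not taken from the dataset or the published code.

```python
from datetime import datetime


def parse_hour(raw: str) -> datetime:
    """Parse an hour value given as YYMMDDHH (hypothetical helper)."""
    # %y = two-digit year, %m = month, %d = day, %H = hour on a 24-hour clock
    return datetime.strptime(raw, "%y%m%d%H")


# Illustrative value only, not a row from the dataset.
ts = parse_hour("14102100")
print(ts.isoformat())
```

Once parsed, the timestamp gives direct access to derived features such as `ts.hour` or `ts.weekday()`.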

Read more
“Curse of Dimensionality”, in the sense of Bellman as well as in machine learning.
Geometry

In \( \mathbb{R}^n \) where \( n \) is large:

- The volume of the unit ball goes to 0 as the dimension goes to infinity.
- Most of the volume of an \( n \)-ball is concentrated near its boundary.
- Pick two vectors on the surface of the unit ball independently. The two are nearly orthogonal with high probability.
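
These three facts can be checked numerically. The sketch below uses the standard formula \( \pi^{n/2} / \Gamma(n/2 + 1) \) for the volume of the unit ball, the fact that a ball of radius \( 1 - \varepsilon \) has \( (1 - \varepsilon)^n \) of the unit ball's volume, and normalized Gaussian vectors as uniform points on the sphere. The function names are chosen here for illustration.

```python
import math
import random


def unit_ball_volume(n: int) -> float:
    """Volume of the unit ball in R^n: pi^(n/2) / Gamma(n/2 + 1)."""
    # Work with logarithms so that large n does not overflow Gamma.
    return math.exp((n / 2) * math.log(math.pi) - math.lgamma(n / 2 + 1))


def shell_fraction(n: int, eps: float) -> float:
    """Fraction of the ball's volume within distance eps of the boundary."""
    return 1 - (1 - eps) ** n


def random_unit_vector(n: int, rng: random.Random) -> list:
    """Uniform point on the unit sphere, via normalized Gaussian coordinates."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)
for n in (2, 10, 100, 1000):
    u, w = random_unit_vector(n, rng), random_unit_vector(n, rng)
    cosine = sum(a * b for a, b in zip(u, w))  # cosine of the angle between u and w
    print(n, unit_ball_volume(n), shell_fraction(n, 0.05), cosine)
```

As \( n \) grows, the volume column shrinks toward 0, the shell fraction approaches 1, and the cosine concentrates around 0 with spread on the order of \( 1/\sqrt{n} \).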

Read more
1. Narrative (TODO)

Code published on GitHub.

2. Data

2.1 Taxi

We use the yellow cab trip CSV data available here. The columns that we use are the pickup location (latitude and longitude) and the pickup datetime. The CSVs contain other columns, such as drop-off information, trip distance, and fare, so many other analyses are possible with this data.
For each year, there are about 170 million records, which amount to about 25GB.
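
At this size it is convenient to stream the CSVs row by row and keep only the pickup columns. The following is a minimal sketch using the standard library, not the project's actual code. The column names (`pickup_datetime`, `pickup_latitude`, `pickup_longitude`), the datetime format, and the sample rows are assumptions made for illustration; the real headers should be checked against the files.

```python
import csv
import io
from datetime import datetime


def iter_pickups(lines):
    """Yield (datetime, latitude, longitude) one row at a time.

    Column names and datetime format are assumed, not taken from the real files.
    """
    for row in csv.DictReader(lines):
        when = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        yield when, float(row["pickup_latitude"]), float(row["pickup_longitude"])


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "pickup_datetime,pickup_latitude,pickup_longitude,fare_amount\n"
    "2015-01-01 00:11:33,40.7501,-73.9912,9.5\n"
    "2015-01-01 00:18:24,40.7243,-73.9978,6.0\n"
)
pickups = list(iter_pickups(sample))
print(pickups)
```

Because `iter_pickups` is a generator, passing it an open file handle instead of the in-memory sample processes the file without loading it all into memory.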

Read more
Articles and talks that I referred to while working on the taxi data project. Published here as a note to myself.
Model Serving

- FLIP-23. The document also discusses implementing model training as well as model serving. Two linked documents, “Flink ML Roadmap” and “Flink-MS”, are also worth reading.
- Boris Lublinsky’s book “Serving Machine Learning Models”, and talk.

Distributed Systems

- Please Stop Calling Database Systems AP or CP.
- Kate Matsudaira on distributed systems
- Distributed Systems for Fun and Profit
- Martin Kleppmann’s book and interview.

Read more
In the previous articles on arXiv keyword extraction, I focused on the details of setting up an infrastructure to serve the algorithm. Since then, I have written a more complete keyword extraction algorithm and deployed it to Google Cloud Platform. This article is a more high-level overview of the end product.

Motivation

I built this app to help myself keep learning mathematics. When I was in graduate school, I learned the terminology of my area of study by attending seminars: whenever the speaker used a term that I wasn’t familiar with, I would write it down and look it up later.

Read more
Freely available online resources that I’ve found useful for learning Statistics, Machine Learning, Distributed Computing, Database Systems, and other CS and SWE topics.
Statistics and Machine Learning

ML

Expositions on machine learning that are freely accessible (like the ones on Coursera) tend to sweep theoretical foundations and mathematical rigor under the carpet; these are resources that don’t skimp on the hard stuff.
Stanford CS229 Machine Learning
One-line summary: the course teaches you how to set up a cost function based on a model and data, and figure out how to optimize it.

Read more
This is the final part of the tutorial. We furnish the app with a UI and deploy it to Heroku.

5. Build UI

5.1 index.html and main.js

We first create a dropdown list. The selected category is stored in `selected`, and it is posted to `/start` when `submit` is called. Afterwards, the component polls `/results`.
main.js - categoryDropdown
var categoryDropdown = new Vue({ el: '#category-dropdown', data: { selected: '' }, methods: { submit: function() { keywordsResult.

Read more
In Part 1, we developed a keyword extraction algorithm. The next step is to modify the algorithm to use a database. Configuring Postgres is more involved than in Flask by Example, since we need models to store article data. The following diagram shows the architecture of our system.
We use the end product of the Flask by Example tutorial as boilerplate. Complete Parts 1-4 of Flask by Example, or clone its repo and configure Postgres by following these steps:

Read more
This is a tutorial on web development written for people with a statistical analysis, scientific computing, or machine learning background. We start with an algorithm using data that fits comfortably into memory, and modify it to accept a large input. We then set up an infrastructure to serve the resulting algorithm. This tutorial focuses on the infrastructure rather than the algorithm, which will remain rudimentary. The end product is a Heroku deployment of a text summarization algorithm that analyzes articles on arXiv to extract keywords from each research category within mathematics.

Read more
Introduction

The model we consider is \(Y_i = \alpha + \beta x_i + \epsilon_i\), where the \( \epsilon_i \) are uncorrelated and \( \mathbb{V}(\epsilon_i) \) depends on \( i \). We discuss two approaches to finding estimators of \( \alpha \) and \( \beta \). Weighted least squares regression leads to best linear unbiased estimators (BLUE). With stronger assumptions on \( \epsilon_i \), maximum likelihood estimators (MLE) can also be found. We begin with the homoskedastic case, emphasizing how the statistical properties of the least squares estimators depend on the assumptions on \( \epsilon_i \); this is conducive to understanding the heteroskedastic case.
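
To make the weighted least squares idea concrete, here is a minimal sketch of the standard closed-form estimator for this model. It assumes the variances \( \mathbb{V}(\epsilon_i) \) are known (up to a common constant factor) and uses weights \( w_i = 1 / \mathbb{V}(\epsilon_i) \). The function name and the sample numbers are illustrative and not taken from the article.

```python
def weighted_least_squares(x, y, variances):
    """Estimate (alpha, beta) in Y_i = alpha + beta * x_i + eps_i.

    Assumes V(eps_i) is known up to a constant factor; weights are 1 / V(eps_i).
    """
    w = [1.0 / v for v in variances]
    total = sum(w)
    # Weighted means of x and y.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance of (x, y) and weighted variance of x (unnormalized).
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Made-up data: noisier observations (larger variance) get less weight.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.0]
variances = [1.0, 1.0, 4.0, 9.0]
print(weighted_least_squares(x, y, variances))
```

With equal variances the weights cancel and the estimator reduces to ordinary least squares, which matches the homoskedastic case.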

Read more