Homepage Timothy Kaler
About Me

I'm a PhD student at MIT in the EECS department and a member of the Supertech research group within the Computer Science and Artificial Intelligence Laboratory (CSAIL). In my research I work on engineering high-performance multicore algorithms that have nice properties - such as serial semantics.

My email address is: tfk @ mit.edu. Feel free to email me as long as you're human (bots, be honest!).

Recreation

If you have played Monopoly and are also interested in learning a bit about the way money influences politics then you may like the infographic Monopoly: Congressional Edition which visualizes the money spent by various industries/groups for lobbying.

Do you like chess, lasers, and computers named Joshua? Then you might enjoy watching some game playing programs in action on the Leiserchess Scrimmage Server. Students in MIT's performance engineering course (6.172) have used the scrimmage server to challenge other team's game playing programs to friendly games of Leiserchess as a part of the course's final project.

Once upon a time, I worked as a freelancer and contributor within the vBulletin community designing modifications and new applications for internet forums.

Publications

Optimal Reissue Policies For Reducing Tail-Latency
Tim Kaler, Yuxiong He, Sameh Elnikety
29th ACM Symposium on Parallelism in Algorithms and Architectures, 2017

Interactive services send redundant requests to multiple different replicas to meet stringent tail latency requirements. These additional (reissue) requests mitigate the impact of non-deterministic delays within the system and thus increase the probability of receiving an on-time response. There are two existing approaches of using reissue requests to reduce tail latency. (1) Reissue requests immediately to one or more replicas, which multiplies the load and runs the risk of overloading the system. (2) Reissue requests if not completed after a fixed delay. The delay helps to bound the number of extra reissue requests, but it also reduces the chance for those requests to respond before a tail latency target. We introduce a new family of reissue policies, Single-Time/Random (SingleR), that reissue requests after a delay d with probability q. SingleR employs randomness to bound the reissue rate, while allowing requests to be reissued early enough so they have sufficient time to respond, exploiting the benefits of both immediate and delayed reissue of prior work. We formally prove that SingleR is optimal even when compared to more complex policies that reissue multiple times. To use SingleR for interactive services, we provide efficient algorithms for calculating optimal reissue delay and probability from response time logs through data-driven approach. We apply iterative adaptation for systems with load-dependent queuing delays. The key advantage of this data-driven approach is its wide applicability and effectiveness to systems with various design choices and workload properties. We evaluated SingleR policies thoroughly. We use simulation to illustrate its internals and demonstrate its robustness to a wide range of workloads. We conduct system experiments on the Redis key-value store and Lucene search server. The results show that for utilizations ranging from 40-60%, SingleR reduces the 99th-percentile latency of Redis by 30-70% by reissuing only 2% of requests, and the 99th-percentile latency of Lucene by 15-25% by reissuing 1% only.

A Multicore Path to Connectomics-on-Demand
Alexander Matveev, Yaron Meirovitch, Hayk Saribekyan, Wiktor Jakubiuk, Tim Kaler, Gergely Odor, David Budden, Aleksandar Zlateski, Nir Shavit
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), 2017
*Nominated for Best Paper

The current design trend in large scale machine learning is to use distributed clusters of CPUs and GPUs with MapReduce-style programming. Some are led to believe that this type of horizontal scaling can reduce or even eliminate the need for algorithm development, careful parallelization, and performance engineering. This paper is a case study showing the contrary: that the benefits of algorithms, parallelization, and engineering can sometimes be so vast that it is possible to solve "cluster-scale" problems on a single commodity multicore machine. Connectomics is an emerging area of neurobiology that uses cutting edge machine learning and image processing to extract brain connectivity graphs from electron microscopy images. It has long been assumed that the processing of connectomics data will require mass storage, farms of CPU/GPUs and will take months (if not years) of processing time. We present a high-throughput connectomics-on-demand system that runs on a multicore machine with less than 100 cores and extracts connectomes at the terabyte per hour pace of modern electron microscopes.

Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling (Journal Version)
Tim Kaler, William Hasenplaugh, Tao B. Schardl, Charles E. Leiserson
ACM Transactions on Parallel Computing (TOPC) Volume 3 Issue 1, August 2016

A data-graph computation --- popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi --- is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex's prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This article introduces Prism, a chromatic-scheduling algorithm for executing dynamic data-graph computations. Prism uses a vertex coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. A multibag data structure is used by Prism to maintain a dynamic set of active vertices as an unordered set partitioned by color. We analyze Prism using work-span analysis. Let G = (V, E) be a degree-delta graph colored with x colors, and suppose that Q subset of V is the set of active vertices in a round. Define size(Q)= |Q| + sum_{v in Q} (deg(v)), which is proportional to the space required to store the vertices of Q using a sparse-graph layout. We show that a P-processor execution of Prism performs updates in Q using O(x (lg ( Q/x ) + lg delta ) + lg P span and O(size(Q) + P) work. These theoretical guarantees are matched by good empirical performance. To isolate the effect of the scheduling algorithm on performance, we modified GraphLab to incorporate Prism and studied seven application benchmarks on a 12-core multicore machine. Prism executes the benchmarks 1.2 to 2.1 times faster than GraphLab's nondeterministic lock-based scheduler while providing deterministic behavior. This article also presents Prism-R, a variation of Prism that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. Prism-R satisfies the same theoretical bounds as Prism, but its implementation is more involved, incorporating a multivector data structure to maintain a deterministically ordered set of vertices partitioned by color. Despite its additional complexity, Prism-R is only marginally slower than Prism. On the seven application benchmarks studied, Prism-R incurs a 7% geometric mean overhead relative to Prism.

Cilkstat: A Low-Overhead Sampling-Based Scalability Profiler
Tim Kaler, Rumen Hristov, John Holliman
Unpublished

Cilkstat is a sampling-based parallelism profiler for C++ fork-join parallel programs written using Cilk Plus. Cilkstat analyzes a running program's logical parallelism by performing periodic interrupts to collect samples that are contextualized within the parallel structure of the program using series-parallel (SP) pedigrees. We describe the design and implementation of SP-pedigrees within the GCC5 compiler, and prove bounds on the accuracy of sampling-based parallelism estimates as a function of the sampling interval. Existing sampling profilers are presently used in both serial and parallel programs, but the meta-data available for these tools to use to contextualize samples is generally limited to stack traces, performance counters, and thread identifiers which are often insufficient to identify the series-parallel relationships between collected samples. As a result, existing tools for profiling programs using sampling are unable to quantify an execution's logical parallelism. There exist tools for analyzing the logical parallelism of Cilk programs. However, these tools rely on intrusive instrumentation whose overheads preclude their use in certain contexts: e.g. production systems, programs sensitive to performance dilation, and rapid-iteration performance engineering. Profilers such as Cilkview and Cilkprof determine the logical parallelism of a program through either binary instrumentation or through fine-grained time measurements inserted at link-time. Although both tools accurately measure a program's parallelism, their accuracy comes at the cost of high profiling overheads. Programs instrumented using Cilkview and Cilkprof can run up to 100 times and 6 times slower respectively. Cilkstat uses lightweight interrupt-driven sampling to generate an estimate of a program's parallelism. Over a set of 21 benchmarks, Cilkstat introduces overheads of 1.0--1.08x and produces parallelism estimates that differ, on average, by only 2x when compared to Cilkview and Cilkprof.

Polylogarithmic Fully Retroactive Priority Queues
Erik D. Demaine, Tim Kaler, Quanquan Liu, Aaron Sidford, Adam Yedidia
Proceedings of the 14th Algorithms and Data Structures Symposium (WADS), 2015

Since the introduction of retroactive data structures at SODA 2004 [1], a major open question has been the difference between partial retroactivity (where updates can be made in the past) and full retroactivity (where queries can also be made in the past). In particular, for priority queues, partial retroactivity is possible in O(log m) time per operation on a m-operation timeline, but the best fully retroactive priority queue costs O(\sqrt{m} log m) time per operation. We address both of these eleven-year-old open problems by giving a general logarithmic-overhead transformation from partial to full retroactivity, called "hierarchical checkpointing," provided that the given data structure is "time-fusible" (multiple structures with disjoint timespans can be fused into a timeline supporting queries of the present). As an application, we construct a fully retroactive priority queue which can insert an element, delete the minimum element, and find the minimum element, at any point in time, in O(log^2 m) amortized time per update and O(log^3 m log log m) time per query, using O(m log m) space. Our data structure also supports the operation of determining the time that an element is deleted in O(log^2 m) time

[1] Demaine, E.D., Iacono, J., Langerman, S.: Retroactive data structures. ACM Transactions on Algorithms 3 (2007) 13 Originally in SODA 2003.

Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling
Tim Kaler, William Hasenplaugh, Tao B. Schardl, Charles E. Leiserson
26th ACM Symposium on Parallelism in Algorithms and Architectures, 2014
Introduces Prism, a chromatic-scheduling algorithm for dynamic data-graph computations. Prism vertex-colors the graph so that vertices of the same color can be updated without creating data races. During execution, Prism uses a multibag data structure to maintain sets of active vertices of the same color. We prove theoretical bounds on the work and span of Prism and compare it empirically to existing nondeterministic lock-based schedulers to demonstrate that it performs well both in theory and practice. We then describe an extension to PRISM called PRISM-R that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. PRISM-R satisfies the same theoretical bounds as PRISM, but its implementation is more involved, incorporating a multivector data structure to maintain an ordered set of vertices partitioned by color. (Code, Input)

Ordering Heuristics For Parallel Graph Coloring
William Hasenplaugh, Tim Kaler, Tao B. Schardl, Charles E. Leiserson
26th ACM Symposium on Parallelism in Algorithms and Architectures, 2014
Introduces the largest-log-degree-first (LLF) and smallest-log-degree-last (SLL) ordering heuristics for parallel greedy graph-coloring algorithms, which are inspired by the largest-degree-first (LF) and smallest-degree-last (SL) serial heuristics, respectively. We show that the SLL and LLF heuristics are less vulnerable to adversarial inputs that limit parallelism while still maintaining coloring qualities that are of similar quality, in practice, to those produced by SL and LF. The paper also extends prior analysis of the JP message passing coloring algorithm to handle arbitrary degree graphs and develops an efficient multicore implementation that, on 12 cores, achieves a geometric mean speedup of 6.99 with LLF and 5.37 with SLL.

Cache Efficient Bloom Filters
Tim Kaler
Describes dynamic blocked bloom filters -- a cache efficient bloom filter based on a composition of two existing bloom filter variants: scalable bloom filters and blocked bloom filters. The paper also serves as a survey of these two bloom filter variants and provides an empirical comparison of their space usage and false positive rates with the standard bloom filter.

Thesis Proposal - Optimizing Deterministic Parallel Graph Computation
Tim Kaler, Charles Leiserson
An MEng thesis proposal describing preliminary work improving Graphlab, a parallel framework for machine learning, by making it deterministic and 2x faster for sufficiently large benchmarks.

Spatial Data Structures - Performance Comparison
Tim Kaler, Oscar Moll
An informal survey and performance comparison of simple spatial datastructures supporting 2-D range queries. It also proposes a simple criteria for choosing the best data structure to perform a given rectangular query which could be used by a query optimizer.

Code in the Air: Simplifying Sensing on Smartphones
Tim Kaler, John Patrick Lynch, Timothy Peng, Lenin Ravindranath, Arvind Thiagarajan, Hari Balakrishnan, and Sam Madden
Demo Abstract, Sensys 2010, November 2010