A notable amount of code I’ve written in my professional life has made its way onto GitHub. Here’s a collection of some of my commits that I think are neat. (Disclaimer: I wrote these while employed at Meta, where I am still employed at the time of writing. However, this is my personal space and my employer does not necessarily endorse the things I say in my personal spaces. In particular, I am merely pointing to and summarizing public information below.)
- Switch from mmap to file I/O, halving ExecuTorch llama demo model load time (executorch #4032)
- clamp (executorch #5784)
- Save a branch in IValue::isIntrusivePtr by manually converting a switch-on-enum-returning-bool into bit-vector indexing, with don’t-cares for out-of-bounds enumerators (pytorch #109273) (see the sketch after this list)
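For the isIntrusivePtr item above, here is a minimal sketch of the bit-vector trick. The tag names and values are invented for illustration; this is not PyTorch’s actual tag list. The idea: build a constant mask with one bit per enumerator that should answer true, then replace the switch with a shift and an AND.

```c++
#include <cstdint>

// Invented stand-in for IValue's tag enum; the real list is different.
enum class Tag : uint32_t { None, Int, Double, Tensor, String, Blob };

// One bit per tag that is backed by an intrusive_ptr. Tags that can never
// reach this query are don't-cares: their bits are simply left at zero.
constexpr uint32_t kIntrusivePtrTags =
    (1u << static_cast<uint32_t>(Tag::Tensor)) |
    (1u << static_cast<uint32_t>(Tag::String)) |
    (1u << static_cast<uint32_t>(Tag::Blob));

// switch (tag) { case Tensor: case String: case Blob: return true;
//                default: return false; }
// becomes a single shift-and-mask with no branch:
inline bool isIntrusivePtrTag(Tag t) {
  return (kIntrusivePtrTags >> static_cast<uint32_t>(t)) & 1u;
}
```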
- Lots of overhead reductions in AOTInductor on CPU
- Save 8 bytes of inline storage in the heavily-optimized F14 hash table, replacing a cached bit mask equal to 1UL << chunkCount by packing 1 byte of chunkCount with 7 bytes of table size (folly commit a20494d) (see the sketch after this list)
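A sketch of that packing, with invented names; folly’s actual F14 layout is considerably more involved. Before: two 8-byte fields, a cached bit mask equal to 1UL << chunkCount plus the table size. After: one 8-byte field, recomputing the mask with a shift on demand.

```c++
#include <cstdint>

// chunkCount lives in the top byte; the size (which always fits in 7 bytes)
// lives in the low 56 bits. Saves 8 bytes of inline storage at the cost of
// a shift or mask on each access.
class PackedSizeAndChunkCount {
 public:
  static constexpr unsigned kSizeBits = 56;
  static constexpr uint64_t kSizeMask = (uint64_t(1) << kSizeBits) - 1;

  void setSize(uint64_t size) {
    packed_ = (packed_ & ~kSizeMask) | (size & kSizeMask);
  }
  void setChunkCount(uint8_t chunkCount) {
    packed_ = (packed_ & kSizeMask) | (uint64_t(chunkCount) << kSizeBits);
  }

  uint64_t size() const { return packed_ & kSizeMask; }
  uint8_t chunkCount() const { return uint8_t(packed_ >> kSizeBits); }
  // The old cached field, now recomputed on demand.
  uint64_t chunkMask() const { return uint64_t(1) << chunkCount(); }

 private:
  uint64_t packed_ = 0;
};
```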
- Support aten::to in static runtime without requiring copying (pytorch #67223)
- Change the jit_type subsystem to stop refcounting singletons (pytorch #69579)
- RecordFunction (pytorch #76266)
- Make c10::irange(x) generate the same assembly as a plain for loop (pytorch #86841) (see the irange sketch after this list)
- Work around char*’s ability to alias anything (an exception to the usual strict aliasing rule) causing extra loads in FBGEMM QuantizeAvx2 (FBGEMM #1124) (see the aliasing sketch after this list)
- ModuleNode.sort_types_by_inheritance (cython #5139)

(This year had lots of smaller overhead reductions, including a whole lot of removal of reference counting. Feel free to take a look at all of my 2021 PyTorch GitHub commits.)
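For the irange item, a toy integer range (an invented reduction, not c10’s implementation). One detail that tends to matter for codegen: writing the iterator’s inequality as `value < other.value` rather than a literal `!=` lets the compiler see the same trip-count logic as `for (int64_t i = 0; i < n; ++i)`, including that a non-positive bound means zero iterations.

```c++
#include <cstdint>

struct IntRange {
  struct Iterator {
    int64_t value;
    int64_t operator*() const { return value; }
    Iterator& operator++() { ++value; return *this; }
    // `<` rather than `!=` so the loop shape matches the classic indexed loop.
    bool operator!=(const Iterator& other) const { return value < other.value; }
  };
  int64_t begin_;
  int64_t end_;
  Iterator begin() const { return {begin_}; }
  Iterator end() const { return {end_}; }
};

inline IntRange irange(int64_t n) { return {0, n}; }

int64_t sumSquares(int64_t n) {
  int64_t acc = 0;
  for (const auto i : irange(n)) {  // should compile like the plain loop
    acc += i * i;
  }
  return acc;
}
```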
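The strict-aliasing point deserves an illustration (function and variable names invented; this is not FBGEMM’s code, just the hazard itself). uint8_t is in practice unsigned char, and char-family lvalues may legally alias any object, so the compiler must assume each store through such a pointer might modify data it would otherwise keep in a register.

```c++
#include <cstddef>
#include <cstdint>

// The compiler must assume dst[i] = ... might overwrite *scale, so it
// re-loads *scale on every iteration.
void quantizeNaive(const float* src, uint8_t* dst, size_t n,
                   const float* scale) {
  for (size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<uint8_t>(src[i] * *scale);  // *scale reloaded each time
  }
}

// The usual fix: copy the value into a local before the loop. Nothing can
// alias a local whose address is never taken, so it stays in a register.
void quantizeHoisted(const float* src, uint8_t* dst, size_t n,
                     const float* scale) {
  const float localScale = *scale;
  for (size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<uint8_t>(src[i] * localScale);
  }
}
```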
- Make IValue::toTensor return const Tensor& instead of Tensor by adjusting the IValue internal union to hold a real Tensor instead of lumping it in with other c10::intrusive_ptr cases (pytorch #48824 and #48868)
- Change Tensor’s representation for sizes and strides from a pair of SmallVectors with the same size to a custom implementation that only stores the size once (pytorch #47507 and #47508) (see the SizesAndStrides sketch after this list)
- Use if constexpr instead of a C++14 shim when available. Neat to be able to demonstrate that this makes a difference. (pytorch #51368)
- Add c10::MaybeOwned, a std::borrow::Cow lookalike without the borrow checking. Use it to power Tensor::expect_contiguous(), which is like Tensor::contiguous() except that it’s expected to be a no-op because the Tensor is likely already contiguous, in which case it saves an atomic reference count increment. (pytorch #53317) (see the MaybeOwned sketch after this list)
- c10::ivalue::Tuple (pytorch #64066)
- StorageImpls for memory-planned Tensors in static runtime (pytorch #66130)
- Work around char*’s ability to alias anything (again) causing extra loads in FBGEMM matrix packing (FBGEMM #702)
- Specialize c10::optional (which was later removed when PyTorch upgraded to C++17) to make copy/move trivial for optionals containing 32-bit scalars; this prevents them from being forced to be passed by reference thanks to the Itanium C++ ABI (pytorch #47015) (see the trivially-copyable optional sketch after this list)
- Avoid an atomic reference count increment in intrusive_ptr::make (pytorch #47100) (see the refcount sketch after this list)
- constexpr (I made more size and/or efficiency improvements as well) (fbjni commit aed5c91)
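A sketch of the sizes-and-strides idea (simplified; PyTorch’s actual SizesAndStrides also keeps an inline buffer for small dimension counts). The point: the two arrays always have the same length, so store that length once and carve both arrays out of a single allocation, instead of two SmallVectors each carrying its own size and capacity.

```c++
#include <cstddef>
#include <cstdint>

class SizesAndStrides {
 public:
  explicit SizesAndStrides(uint32_t ndim)
      : ndim_(ndim), storage_(new int64_t[static_cast<size_t>(ndim) * 2]()) {}
  ~SizesAndStrides() { delete[] storage_; }
  SizesAndStrides(const SizesAndStrides&) = delete;
  SizesAndStrides& operator=(const SizesAndStrides&) = delete;

  uint32_t ndim() const { return ndim_; }
  int64_t* sizes() { return storage_; }            // elements [0, ndim)
  int64_t* strides() { return storage_ + ndim_; }  // elements [ndim, 2 * ndim)

 private:
  uint32_t ndim_;    // stored once, shared by both arrays
  int64_t* storage_;
};
```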
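A minimal MaybeOwned sketch (the real c10::MaybeOwned is more complete; this is just the shape of the idea): either borrow a const T& the caller owns, or own a T, so a function like expect_contiguous() can hand back a borrow in the common already-contiguous case and skip the atomic refcount bump that returning Tensor by value costs.

```c++
#include <new>
#include <utility>

template <typename T>
class MaybeOwned {
 public:
  static MaybeOwned borrowed(const T& t) { return MaybeOwned(&t); }
  static MaybeOwned owned(T&& t) { return MaybeOwned(std::move(t)); }

  MaybeOwned(MaybeOwned&& rhs) noexcept : borrow_(rhs.borrow_), dummy_() {
    if (borrow_ == nullptr) {
      ::new (&own_) T(std::move(rhs.own_));
    }
  }
  MaybeOwned(const MaybeOwned&) = delete;
  MaybeOwned& operator=(const MaybeOwned&) = delete;
  MaybeOwned& operator=(MaybeOwned&&) = delete;
  ~MaybeOwned() {
    if (borrow_ == nullptr) {
      own_.~T();
    }
  }

  const T& operator*() const { return borrow_ ? *borrow_ : own_; }
  const T* operator->() const { return borrow_ ? borrow_ : &own_; }

 private:
  explicit MaybeOwned(const T* borrow) : borrow_(borrow), dummy_() {}
  explicit MaybeOwned(T&& t) : borrow_(nullptr), own_(std::move(t)) {}

  const T* borrow_;  // non-null: borrowing; null: own_ is live
  union {
    char dummy_;  // placeholder so the borrowed state constructs nothing
    T own_;
  };
};

// Hypothetical usage, assuming a Tensor with is_contiguous()/contiguous():
//   MaybeOwned<Tensor> c = t.is_contiguous()
//       ? MaybeOwned<Tensor>::borrowed(t)             // no refcount bump
//       : MaybeOwned<Tensor>::owned(t.contiguous());
```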
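The c10::optional item relies on an ABI detail worth spelling out: under the Itanium C++ ABI, a type with a non-trivial copy constructor or destructor must be passed by invisible reference (through caller-allocated memory), never in registers. A sketch of the payoff only; the real change specialized the optional template machinery itself.

```c++
#include <cstdint>
#include <type_traits>

// A trivially copyable stand-in for optional<int32_t>: 8 bytes, implicitly
// defaulted special members, so it can be passed and returned in registers.
struct TrivialOptionalInt32 {
  int32_t value;
  bool hasValue;
};
static_assert(
    std::is_trivially_copyable<TrivialOptionalInt32>::value,
    "must stay trivially copyable to be passed in registers");
```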
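Finally, a sketch of the intrusive_ptr::make trick (simplified; the real c10::intrusive_ptr also tracks weak references): a freshly constructed object is visible to no other thread, so its refcount can start at 1 via a plain initialization instead of starting at 0 and being incremented atomically afterward.

```c++
#include <atomic>
#include <cstdint>
#include <utility>

class RefCounted {
 public:
  RefCounted() : refcount_(1) {}  // creator already owns one reference
  virtual ~RefCounted() = default;

  void retain() { refcount_.fetch_add(1, std::memory_order_relaxed); }
  void release() {
    if (refcount_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete this;
    }
  }

 private:
  std::atomic<uint32_t> refcount_;
};

template <typename T, typename... Args>
T* makeIntrusive(Args&&... args) {
  // No retain() here: the constructor already set the count to 1, so we
  // skip the atomic read-modify-write that retain() would perform.
  return new T(std::forward<Args>(args)...);
}
```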