A notable amount of code I’ve written in my professional life has made its way onto GitHub. Here’s a collection of some of my commits that I think are neat. (Disclaimer: I wrote these while employed at Meta, where I am still employed at the time of writing. However, this is my personal space and my employer does not necessarily endorse the things I say in my personal spaces. In particular, I am merely pointing to and summarizing public information below.)
- Switch from mmap to file I/O, halving ExecuTorch llama demo model load time (executorch #4032)
- clamp (executorch #5784)
- Save a branch in IValue::isIntrusivePtr by manually converting a switch-on-enum-returning-bool into bit-vector indexing, with don’t-cares for out-of-bounds enumerators (pytorch #109273) (see the sketch after this list)
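For the isIntrusivePtr item above, here is a minimal sketch of the bit-vector trick. The tag names and values are invented for illustration; this is not PyTorch’s actual tag list. The idea: build a constant mask with one bit per enumerator that should answer true, then replace the switch with a shift and an AND.

```c++
#include <cstdint>

// Invented stand-in for IValue's tag enum; the real list is different.
enum class Tag : uint32_t { None, Int, Double, Tensor, String, Blob };

// One bit per tag that is backed by an intrusive_ptr. Tags that can never
// reach this query are don't-cares: their bits are simply left at zero.
constexpr uint32_t kIntrusivePtrTags =
    (1u << static_cast<uint32_t>(Tag::Tensor)) |
    (1u << static_cast<uint32_t>(Tag::String)) |
    (1u << static_cast<uint32_t>(Tag::Blob));

// switch (tag) { case Tensor: case String: case Blob: return true;
//                default: return false; }
// becomes a single shift-and-mask with no branch:
inline bool isIntrusivePtrTag(Tag t) {
  return (kIntrusivePtrTags >> static_cast<uint32_t>(t)) & 1u;
}
```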
- Lots of overhead reductions in AOTInductor on CPU
- Save 8 bytes of inline storage in the heavily-optimized F14 hash table, replacing a cached bit mask equal to 1UL << chunkCount by packing 1 byte of chunkCount with 7 bytes of table size (folly commit a20494d) (see the sketch after this list)
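A sketch of that packing, with invented names; folly’s actual F14 layout is considerably more involved. Before: two 8-byte fields, a cached bit mask equal to 1UL << chunkCount plus the table size. After: one 8-byte field, recomputing the mask with a shift on demand.

```c++
#include <cstdint>

// chunkCount lives in the top byte; the size (which always fits in 7 bytes)
// lives in the low 56 bits. Saves 8 bytes of inline storage at the cost of
// a shift or mask on each access.
class PackedSizeAndChunkCount {
 public:
  static constexpr unsigned kSizeBits = 56;
  static constexpr uint64_t kSizeMask = (uint64_t(1) << kSizeBits) - 1;

  void setSize(uint64_t size) {
    packed_ = (packed_ & ~kSizeMask) | (size & kSizeMask);
  }
  void setChunkCount(uint8_t chunkCount) {
    packed_ = (packed_ & kSizeMask) | (uint64_t(chunkCount) << kSizeBits);
  }

  uint64_t size() const { return packed_ & kSizeMask; }
  uint8_t chunkCount() const { return uint8_t(packed_ >> kSizeBits); }
  // The old cached field, now recomputed on demand.
  uint64_t chunkMask() const { return uint64_t(1) << chunkCount(); }

 private:
  uint64_t packed_ = 0;
};
```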
- Support aten::to in static runtime without requiring copying (pytorch #67223)
- Change the jit_type subsystem to stop refcounting singletons (pytorch #69579)
- RecordFunction (pytorch #76266)
- Make c10::irange(x) generate the same assembly as a plain for loop (pytorch #86841) (see the irange sketch after this list)
- Work around char*’s ability to alias anything (an exception to the usual strict aliasing rule) causing extra loads in FBGEMM QuantizeAvx2 (FBGEMM #1124) (see the aliasing sketch after this list)
- ModuleNode.sort_types_by_inheritance (cython #5139)

(This year had lots of smaller overhead reductions, including a whole lot of removal of reference counting. Feel free to take a look at all of my 2021 PyTorch GitHub commits.)
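For the irange item, a toy integer range (an invented reduction, not c10’s implementation). One detail that tends to matter for codegen: writing the iterator’s inequality as `value < other.value` rather than a literal `!=` lets the compiler see the same trip-count logic as `for (int64_t i = 0; i < n; ++i)`, including that a non-positive bound means zero iterations.

```c++
#include <cstdint>

struct IntRange {
  struct Iterator {
    int64_t value;
    int64_t operator*() const { return value; }
    Iterator& operator++() { ++value; return *this; }
    // `<` rather than `!=` so the loop shape matches the classic indexed loop.
    bool operator!=(const Iterator& other) const { return value < other.value; }
  };
  int64_t begin_;
  int64_t end_;
  Iterator begin() const { return {begin_}; }
  Iterator end() const { return {end_}; }
};

inline IntRange irange(int64_t n) { return {0, n}; }

int64_t sumSquares(int64_t n) {
  int64_t acc = 0;
  for (const auto i : irange(n)) {  // should compile like the plain loop
    acc += i * i;
  }
  return acc;
}
```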
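The strict-aliasing point deserves an illustration (function and variable names invented; this is not FBGEMM’s code, just the hazard itself). uint8_t is in practice unsigned char, and char-family lvalues may legally alias any object, so the compiler must assume each store through such a pointer might modify data it would otherwise keep in a register.

```c++
#include <cstddef>
#include <cstdint>

// The compiler must assume dst[i] = ... might overwrite *scale, so it
// re-loads *scale on every iteration.
void quantizeNaive(const float* src, uint8_t* dst, size_t n,
                   const float* scale) {
  for (size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<uint8_t>(src[i] * *scale);  // *scale reloaded each time
  }
}

// The usual fix: copy the value into a local before the loop. Nothing can
// alias a local whose address is never taken, so it stays in a register.
void quantizeHoisted(const float* src, uint8_t* dst, size_t n,
                     const float* scale) {
  const float localScale = *scale;
  for (size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<uint8_t>(src[i] * localScale);
  }
}
```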
- Make IValue::toTensor return const Tensor& instead of Tensor by adjusting the IValue internal union to hold a real Tensor instead of lumping it in with other c10::intrusive_ptr cases (pytorch #48824 and #48868)
- Change Tensor’s representation for sizes and strides from a pair of SmallVectors with the same size to a custom implementation that only stores the size once (pytorch #47507 and #47508) (see the SizesAndStrides sketch after this list)
- Use if constexpr instead of a C++14 shim when available. Neat to be able to demonstrate that this makes a difference. (pytorch #51368)
- Add c10::MaybeOwned, a std::borrow::Cow lookalike without the borrow checking. Use it to power Tensor::expect_contiguous(), which is like Tensor::contiguous() except that it’s expected to be a no-op because the Tensor is likely already contiguous, in which case it saves an atomic reference count increment. (pytorch #53317) (see the MaybeOwned sketch after this list)
- c10::ivalue::Tuple (pytorch #64066)
- StorageImpls for memory-planned Tensors in static runtime (pytorch #66130)
- Work around char*’s ability to alias anything (again) causing extra loads in FBGEMM matrix packing (FBGEMM #702)
- Specialize c10::optional (which was later removed when PyTorch upgraded to C++17) to make copy/move trivial for optionals containing 32-bit scalars; this prevents them from being forced to be passed by reference thanks to the Itanium C++ ABI (pytorch #47015) (see the trivially-copyable optional sketch after this list)
- Avoid an atomic reference count increment in intrusive_ptr::make (pytorch #47100) (see the refcount sketch after this list)
- constexpr (I made more size and/or efficiency improvements as well) (fbjni commit aed5c91)
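A sketch of the sizes-and-strides idea (simplified; PyTorch’s actual SizesAndStrides also keeps an inline buffer for small dimension counts). The point: the two arrays always have the same length, so store that length once and carve both arrays out of a single allocation, instead of two SmallVectors each carrying its own size and capacity.

```c++
#include <cstddef>
#include <cstdint>

class SizesAndStrides {
 public:
  explicit SizesAndStrides(uint32_t ndim)
      : ndim_(ndim), storage_(new int64_t[static_cast<size_t>(ndim) * 2]()) {}
  ~SizesAndStrides() { delete[] storage_; }
  SizesAndStrides(const SizesAndStrides&) = delete;
  SizesAndStrides& operator=(const SizesAndStrides&) = delete;

  uint32_t ndim() const { return ndim_; }
  int64_t* sizes() { return storage_; }            // elements [0, ndim)
  int64_t* strides() { return storage_ + ndim_; }  // elements [ndim, 2 * ndim)

 private:
  uint32_t ndim_;    // stored once, shared by both arrays
  int64_t* storage_;
};
```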
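A minimal MaybeOwned sketch (the real c10::MaybeOwned is more complete; this is just the shape of the idea): either borrow a const T& the caller owns, or own a T, so a function like expect_contiguous() can hand back a borrow in the common already-contiguous case and skip the atomic refcount bump that returning Tensor by value costs.

```c++
#include <new>
#include <utility>

template <typename T>
class MaybeOwned {
 public:
  static MaybeOwned borrowed(const T& t) { return MaybeOwned(&t); }
  static MaybeOwned owned(T&& t) { return MaybeOwned(std::move(t)); }

  MaybeOwned(MaybeOwned&& rhs) noexcept : borrow_(rhs.borrow_), dummy_() {
    if (borrow_ == nullptr) {
      ::new (&own_) T(std::move(rhs.own_));
    }
  }
  MaybeOwned(const MaybeOwned&) = delete;
  MaybeOwned& operator=(const MaybeOwned&) = delete;
  MaybeOwned& operator=(MaybeOwned&&) = delete;
  ~MaybeOwned() {
    if (borrow_ == nullptr) {
      own_.~T();
    }
  }

  const T& operator*() const { return borrow_ ? *borrow_ : own_; }
  const T* operator->() const { return borrow_ ? borrow_ : &own_; }

 private:
  explicit MaybeOwned(const T* borrow) : borrow_(borrow), dummy_() {}
  explicit MaybeOwned(T&& t) : borrow_(nullptr), own_(std::move(t)) {}

  const T* borrow_;  // non-null: borrowing; null: own_ is live
  union {
    char dummy_;  // placeholder so the borrowed state constructs nothing
    T own_;
  };
};

// Hypothetical usage, assuming a Tensor with is_contiguous()/contiguous():
//   MaybeOwned<Tensor> c = t.is_contiguous()
//       ? MaybeOwned<Tensor>::borrowed(t)             // no refcount bump
//       : MaybeOwned<Tensor>::owned(t.contiguous());
```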
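The c10::optional item relies on an ABI detail worth spelling out: under the Itanium C++ ABI, a type with a non-trivial copy constructor or destructor must be passed by invisible reference (through caller-allocated memory), never in registers. A sketch of the payoff only; the real change specialized the optional template machinery itself.

```c++
#include <cstdint>
#include <type_traits>

// A trivially copyable stand-in for optional<int32_t>: 8 bytes, implicitly
// defaulted special members, so it can be passed and returned in registers.
struct TrivialOptionalInt32 {
  int32_t value;
  bool hasValue;
};
static_assert(
    std::is_trivially_copyable<TrivialOptionalInt32>::value,
    "must stay trivially copyable to be passed in registers");
```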
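Finally, a sketch of the intrusive_ptr::make trick (simplified; the real c10::intrusive_ptr also tracks weak references): a freshly constructed object is visible to no other thread, so its refcount can start at 1 via a plain initialization instead of starting at 0 and being incremented atomically afterward.

```c++
#include <atomic>
#include <cstdint>
#include <utility>

class RefCounted {
 public:
  RefCounted() : refcount_(1) {}  // creator already owns one reference
  virtual ~RefCounted() = default;

  void retain() { refcount_.fetch_add(1, std::memory_order_relaxed); }
  void release() {
    if (refcount_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete this;
    }
  }

 private:
  std::atomic<uint32_t> refcount_;
};

template <typename T, typename... Args>
T* makeIntrusive(Args&&... args) {
  // No retain() here: the constructor already set the count to 1, so we
  // skip the atomic read-modify-write that retain() would perform.
  return new T(std::forward<Args>(args)...);
}
```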