A notable amount of code I’ve written in my professional life has made its way onto GitHub. Here’s a collection of some of my commits that I think are neat. (Disclaimer: I wrote these while employed at Meta, where I am still employed at the time of writing. However, this is my personal space, and my employer does not necessarily endorse the things I say in my personal spaces. In particular, I am merely pointing to and summarizing public information below.)
Switch from `mmap` to file I/O, halving ExecuTorch llama demo model load time (executorch #4032).

`clamp` (executorch #5784).

Save a branch in `IValue::isIntrusivePtr` by manually converting a `switch`-on-enum-returning-`bool` to bit-vector indexing, with don’t-care bits for out-of-bounds enumerators (pytorch #109273).
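To illustrate the trick with a toy example (the `Tag` enum and function names here are inventions of mine, not PyTorch’s actual tags): a `switch` that answers a yes/no question about an enum can become a single shift-and-mask against a constant whose bits record the answer per enumerator; bits past the last enumerator are don’t-cares because no valid value reaches them.

```c++
#include <cstdint>
#include <iostream>

// Hypothetical tag enum standing in for IValue's internal tag.
enum class Tag : uint32_t { None, Int, Double, Tensor, String, Tuple };

// Branchy version: the compiler typically emits branches or a jump table.
bool isHeapAllocatedSwitch(Tag t) {
  switch (t) {
    case Tag::Tensor:
    case Tag::String:
    case Tag::Tuple:
      return true;
    default:
      return false;
  }
}

// Branchless version: one bit per enumerator, tested with a shift and mask.
// Bits above the last enumerator are don't-cares: no valid Tag maps there.
bool isHeapAllocatedBitVector(Tag t) {
  constexpr uint32_t kMask =
      (1u << static_cast<uint32_t>(Tag::Tensor)) |
      (1u << static_cast<uint32_t>(Tag::String)) |
      (1u << static_cast<uint32_t>(Tag::Tuple));
  return (kMask >> static_cast<uint32_t>(t)) & 1u;
}

int main() {
  std::cout << isHeapAllocatedSwitch(Tag::String) << ' '
            << isHeapAllocatedBitVector(Tag::String) << '\n';  // 1 1
}
```

The bit-vector form typically compiles to a couple of ALU instructions with no branches at all.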
Lots of overhead reductions in AOTInductor on CPU.

Save 8 bytes of inline storage in the heavily-optimized F14 hash table by replacing a cached bit mask equal to `1UL << chunkCount` with a packed representation: 1 byte of `chunkCount` alongside 7 bytes of table size (folly commit a20494d).
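In isolation, the packing looks something like the sketch below. The class and member names are my own inventions, and storing the chunk count in log2 form in the top byte is my simplification; see the folly commit for the real layout.

```c++
#include <cassert>
#include <cstdint>
#include <iostream>

// Two fields packed into one 64-bit word: the low 7 bytes hold the element
// count, the top byte holds the chunk count in log2 form, so the old cached
// `1UL << chunkCount`-style value can be recomputed on demand instead of
// occupying its own 8-byte member.
class PackedSizeAndChunkShift {
 public:
  static constexpr uint64_t kSizeMask = (uint64_t{1} << 56) - 1;

  void setSize(uint64_t size) {
    assert(size <= kSizeMask);
    word_ = (word_ & ~kSizeMask) | size;
  }
  void setChunkShift(uint8_t shift) {
    word_ = (word_ & kSizeMask) | (uint64_t{shift} << 56);
  }

  uint64_t size() const { return word_ & kSizeMask; }
  uint64_t chunkCount() const { return uint64_t{1} << (word_ >> 56); }

 private:
  uint64_t word_ = 0;  // 8 bytes where size plus a cached value took 16.
};

int main() {
  PackedSizeAndChunkShift s;
  s.setSize(12345);
  s.setChunkShift(4);  // 16 chunks
  std::cout << s.size() << ' ' << s.chunkCount() << '\n';  // 12345 16
}
```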
Support `aten::to` in static runtime without requiring copying (pytorch #67223).

Change the `jit_type` subsystem to stop refcounting singletons (pytorch #69579).

`RecordFunction` (pytorch #76266).

Make `c10::irange(x)` generate the same assembly as the equivalent `for` loop (pytorch #86841).
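For context, a range helper of this kind looks roughly like the following from-scratch miniature (`IntRange` is my invention, not PyTorch’s actual `c10::irange` implementation); the point of the PR was getting the real one to optimize down to the plain counted loop.

```c++
#include <cstdint>
#include <iostream>

// Simplified integer range in the spirit of c10::irange (not the real code).
// The iterator is thin enough that `for (auto i : IntRange(n))` should
// compile to the same loop as `for (int64_t i = 0; i < n; ++i)`.
class IntRange {
 public:
  class Iterator {
   public:
    explicit Iterator(int64_t v) : v_(v) {}
    int64_t operator*() const { return v_; }
    Iterator& operator++() { ++v_; return *this; }
    bool operator!=(const Iterator& rhs) const { return v_ != rhs.v_; }
   private:
    int64_t v_;
  };

  // Clamp negative bounds to zero so the != termination test is safe.
  explicit IntRange(int64_t end) : end_(end < 0 ? 0 : end) {}
  Iterator begin() const { return Iterator(0); }
  Iterator end() const { return Iterator(end_); }

 private:
  int64_t end_;
};

int64_t sumRange(int64_t n) {
  int64_t sum = 0;
  for (int64_t i : IntRange(n)) {  // should lower to a plain counted loop
    sum += i;
  }
  return sum;
}

int main() { std::cout << sumRange(5) << '\n'; }  // 0+1+2+3+4 = 10
```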
Work around `char*`’s ability to alias anything (an exception to the usual strict aliasing rule) causing extra loads in FBGEMM `QuantizeAvx2` (FBGEMM #1124).
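The underlying issue in miniature (the `Params` struct and functions below are made up, not FBGEMM’s code): because stores through a `char`-family pointer may alias any object, the compiler must re-load values it cannot prove unchanged; copying them into locals hoists the loads out of the loop.

```c++
#include <cstddef>
#include <cstdint>
#include <iostream>

// Toy illustration of the strict-aliasing exception for char-like pointers.
struct Params {
  float scale;
};

// uint8_t is typically a typedef for unsigned char, so stores through `out`
// may legally alias *params; the compiler must re-load params->scale on
// every iteration.
void scaleToBytesSlow(const float* in, uint8_t* out, std::size_t n,
                      const Params* params) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<uint8_t>(in[i] * params->scale);
  }
}

// Copying the field to a local tells the compiler it cannot change, so the
// load happens once and stays in a register.
void scaleToBytesFast(const float* in, uint8_t* out, std::size_t n,
                      const Params* params) {
  const float scale = params->scale;  // single load, hoisted out of the loop
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<uint8_t>(in[i] * scale);
  }
}

int main() {
  float in[4] = {1, 2, 3, 4};
  uint8_t out[4];
  Params p{10.0f};
  scaleToBytesFast(in, out, 4, &p);
  std::cout << int(out[3]) << '\n';  // 40
}
```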
`ModuleNode.sort_types_by_inheritance` (cython #5139).

(2021 had lots of smaller overhead reductions, including a whole lot of removal of reference counting. Feel free to take a look at all of my 2021 PyTorch GitHub commits.)
Make `IValue::toTensor` return `const Tensor&` instead of `Tensor` by adjusting the `IValue` internal union to hold a real `Tensor` instead of lumping it in with the other `c10::intrusive_ptr` cases (pytorch #48824 and #48868).
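The shape of that change in a toy value type (everything here, including the `Value` class and the stand-in `Tensor`, is simplified; the real `IValue` handles many more cases):

```c++
#include <cstdint>
#include <iostream>
#include <memory>
#include <new>
#include <utility>

// Stand-in for the real refcounted Tensor.
struct Tensor {
  std::shared_ptr<int> impl;
};

// The union stores a real Tensor member, so toTensor() can hand out a
// const reference instead of constructing a Tensor (which would bump the
// reference count) from a type-erased intrusive_ptr field.
class Value {
 public:
  explicit Value(Tensor t) : isTensor_(true) {
    new (&payload_.tensor) Tensor(std::move(t));
  }
  explicit Value(int64_t i) : isTensor_(false) { payload_.integer = i; }

  ~Value() {
    if (isTensor_) payload_.tensor.~Tensor();  // manual destruction
  }
  Value(const Value&) = delete;
  Value& operator=(const Value&) = delete;

  // No refcount bump: returns a reference into the union.
  // (Assumes the caller has checked the tag.)
  const Tensor& toTensor() const { return payload_.tensor; }

 private:
  union Payload {
    Payload() {}   // non-trivial member, so lifetimes are managed manually
    ~Payload() {}
    Tensor tensor;
    int64_t integer;
  } payload_;
  bool isTensor_;
};

int main() {
  Value v(Tensor{std::make_shared<int>(42)});
  const Tensor& t = v.toTensor();  // no shared_ptr refcount traffic
  std::cout << *t.impl << '\n';
}
```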
Change `Tensor`’s representation of sizes and strides from a pair of `SmallVector`s with the same length to a custom implementation that stores that length only once (pytorch #47507 and #47508).
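A sketch of the idea (a simplified re-creation of mine, not the real class, which is considerably more elaborate): since sizes and strides always have the same length, store that length once and keep both arrays in a single buffer.

```c++
#include <cstdint>
#include <iostream>
#include <vector>

// Sizes live in storage_[0, ndim); strides live in storage_[ndim, 2*ndim).
// One length field serves both, instead of two vectors that each track
// their own length (and capacity).
class PackedSizesAndStrides {
 public:
  explicit PackedSizesAndStrides(std::size_t ndim)
      : ndim_(ndim), storage_(2 * ndim, 0) {}

  std::size_t ndim() const { return ndim_; }
  int64_t* sizesData() { return storage_.data(); }
  int64_t* stridesData() { return storage_.data() + ndim_; }

  int64_t size(std::size_t d) const { return storage_[d]; }
  int64_t stride(std::size_t d) const { return storage_[ndim_ + d]; }

 private:
  std::size_t ndim_;              // stored once, not once per vector
  std::vector<int64_t> storage_;  // both arrays, contiguously
};

int main() {
  PackedSizesAndStrides s(2);
  s.sizesData()[0] = 3;   s.sizesData()[1] = 4;
  s.stridesData()[0] = 4; s.stridesData()[1] = 1;
  std::cout << s.size(0) << 'x' << s.size(1) << '\n';  // 3x4
}
```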
Use `if constexpr` instead of a C++14 shim when available. Neat to be able to demonstrate that this makes a difference (pytorch #51368).
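Here’s the general pattern being replaced, in a generic example of my own rather than the PR’s code: pre-C++17, compile-time branching was often spelled as tag dispatch to overloads; `if constexpr` expresses it directly and discards the dead branch.

```c++
#include <iostream>
#include <type_traits>

// C++14-style shim: dispatch to one of two overloads via a tag type.
template <typename T>
int payloadBytesImpl(std::true_type /*is_pointer*/, T) {
  return sizeof(void*);
}
template <typename T>
int payloadBytesImpl(std::false_type /*is_pointer*/, T v) {
  return sizeof(v);
}
template <typename T>
int payloadBytesCpp14(T v) {
  return payloadBytesImpl(std::is_pointer<T>{}, v);
}

// C++17: the same logic written inline; the untaken branch is discarded at
// compile time, with no extra function template instantiations.
template <typename T>
int payloadBytesCpp17(T v) {
  if constexpr (std::is_pointer<T>::value) {
    return sizeof(void*);
  } else {
    return sizeof(v);
  }
}

int main() {
  int x = 0;
  std::cout << payloadBytesCpp14(&x) << ' ' << payloadBytesCpp17(x) << '\n';
}
```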
Add `c10::MaybeOwned`, a `std::borrow::Cow` lookalike without the borrow checking. Use it to power `Tensor::expect_contiguous()`, which is like `Tensor::contiguous()` except that it’s expected to be a no-op because the `Tensor` is likely already contiguous, in which case it saves an atomic reference count increment (pytorch #53317).
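The concept in miniature (heavily simplified: `MaybeOwnedTensor` and `expectContiguous` are stand-ins of mine, and the real `c10::MaybeOwned<T>` is a template with stricter semantics):

```c++
#include <iostream>
#include <memory>
#include <utility>

// Toy refcounted tensor stand-in.
struct Tensor {
  std::shared_ptr<int> impl;
  bool contiguous = true;
};

// Either borrows a const Tensor& or owns a Tensor, and dereferences to
// whichever one it holds.
class MaybeOwnedTensor {
 public:
  static MaybeOwnedTensor borrowed(const Tensor& t) {
    MaybeOwnedTensor m;
    m.borrow_ = &t;
    return m;
  }
  static MaybeOwnedTensor owned(Tensor&& t) {
    MaybeOwnedTensor m;
    m.own_ = std::move(t);
    return m;
  }
  const Tensor& operator*() const { return borrow_ ? *borrow_ : own_; }

 private:
  MaybeOwnedTensor() = default;
  const Tensor* borrow_ = nullptr;
  Tensor own_;
};

// Like contiguous(), but borrows in the common already-contiguous case,
// skipping the atomic refcount increment that returning a Tensor by value
// would cost.
MaybeOwnedTensor expectContiguous(const Tensor& t) {
  if (t.contiguous) {
    return MaybeOwnedTensor::borrowed(t);  // no refcount traffic
  }
  Tensor copy{std::make_shared<int>(*t.impl), true};  // materialize a copy
  return MaybeOwnedTensor::owned(std::move(copy));
}

int main() {
  Tensor t{std::make_shared<int>(7), true};
  auto m = expectContiguous(t);
  std::cout << (*m).impl.use_count() << '\n';  // 1: t was merely borrowed
}
```

The payoff is that hot-path callers can dereference the wrapper freely while the common case never touches the reference count.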
`c10::ivalue::Tuple` (pytorch #64066).

`StorageImpl`s for memory-planned `Tensor`s in static runtime (pytorch #66130).
Work around `char*`’s ability to alias anything (again) causing extra loads in FBGEMM matrix packing (FBGEMM #702).

Specialize `c10::optional` (which was later removed when PyTorch upgraded to C++17) to make copy and move trivial for optionals containing 32-bit scalars. (This prevents them from being forced to be passed by reference under the Itanium C++ ABI.) (pytorch #47015)
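A sketch of why triviality matters here (the type below is my own, not the actual specialization): when every special member function is trivial, the whole 8-byte object qualifies for register passing under the Itanium C++ ABI; a non-trivial copy constructor would instead force it to be passed by hidden reference to a stack temporary.

```c++
#include <cstdint>
#include <iostream>
#include <type_traits>

// Plain members, all special member functions implicitly defaulted: the
// type is trivially copyable, so an 8-byte argument travels in a register.
struct TrivialOptionalInt32 {
  bool has_value;
  int32_t value;
};

static_assert(std::is_trivially_copyable<TrivialOptionalInt32>::value,
              "must stay trivially copyable to be passed in a register");

int32_t valueOr(TrivialOptionalInt32 o, int32_t fallback) {
  return o.has_value ? o.value : fallback;
}

int main() {
  std::cout << valueOr({true, 5}, -1) << ' '
            << valueOr({false, 0}, -1) << '\n';  // 5 -1
}
```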
Avoid an atomic reference count increment in `intrusive_ptr::make` (pytorch #47100).
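One way to express the idea in a toy smart pointer (greatly simplified, with invented names like `Ptr` and `makeFast`): a freshly allocated object is invisible to other threads, so its count can start at 1 via a plain store instead of an atomic read-modify-write.

```c++
#include <atomic>
#include <iostream>
#include <utility>

// Toy refcounted base and smart pointer (not the real c10 code).
struct RefCounted {
  std::atomic<int> refcount{0};
};

template <typename T>
class Ptr {
 public:
  // Before: construct the object (count 0), then attach to it, paying an
  // atomic read-modify-write to bring the count to 1.
  template <typename... Args>
  static Ptr makeSlow(Args&&... args) {
    T* p = new T(std::forward<Args>(args)...);
    p->refcount.fetch_add(1, std::memory_order_relaxed);  // atomic RMW
    return Ptr(p);
  }

  // After: nothing else can see the object yet, so a plain store suffices.
  template <typename... Args>
  static Ptr makeFast(Args&&... args) {
    T* p = new T(std::forward<Args>(args)...);
    p->refcount.store(1, std::memory_order_relaxed);  // no RMW needed
    return Ptr(p);
  }

  ~Ptr() {
    if (p_ && p_->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete p_;
    }
  }
  Ptr(const Ptr&) = delete;
  Ptr& operator=(const Ptr&) = delete;
  Ptr(Ptr&& o) : p_(o.p_) { o.p_ = nullptr; }

  T* get() const { return p_; }

 private:
  explicit Ptr(T* p) : p_(p) {}
  T* p_ = nullptr;
};

struct Node : RefCounted {
  int x;
  explicit Node(int x_) : x(x_) {}
};

int main() {
  auto a = Ptr<Node>::makeSlow(1);
  auto b = Ptr<Node>::makeFast(3);
  std::cout << a.get()->x + b.get()->x << '\n';  // 4
}
```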
`constexpr`. (I made more size and/or efficiency improvements as well.) (fbjni commit aed5c91)