This is a quick post, but I wanted to document an evolution in my thinking with respect to move operators and move construction. Back when C++11 was released and we were getting used to the concepts, C++ moves were groundbreaking in their ability to greatly accelerate STL containers, which were often forced to invoke copy constructors wholesale due to reallocation (e.g. a std::vector grows in size and copies N elements as a result). A move constructor allowed the programmer to create a “shallow copy,” so to speak, which is much faster than the (presumably deep) default copy. Ergo, to think that avoiding moves entirely might be a performance win sounds somewhat paradoxical. Of course, it isn’t without its caveats, but for me, it’s been well worth it to go all in on possibly never writing a move constructor again.

Let’s dive into just a few of my gripes with move constructors.

Verbosity

This should go without saying, but transitively applying std::move to every constituent member of a class or struct is a huge drag for something that could very nearly be automated. There are also common repetitive patterns and idioms (in a move assignment, check whether a pointer is non-null and reclaim it if so, std::move the remaining members, null out the source so its destructor becomes a no-op). Generally, repetition in code carries a nasty code smell, and this is no different in my opinion.
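As a sketch of the boilerplate, consider a hypothetical resource-owning type (the names here are illustrative, not from any real codebase); every member must be transferred by hand and the source reset so its destructor becomes a no-op:

```cpp
#include <cstddef>
#include <utility>

// Hypothetical type with a few resource-owning members.
struct connection {
    char* buffer = nullptr;
    size_t buffer_len = 0;
    int fd = -1;

    connection() = default;

    // Each member must be moved and the source nulled out by hand;
    // exactly the kind of repetition that could be automated.
    connection(connection&& other) noexcept
        : buffer(std::exchange(other.buffer, nullptr)),
          buffer_len(std::exchange(other.buffer_len, 0)),
          fd(std::exchange(other.fd, -1)) {}

    connection& operator=(connection&& other) noexcept {
        if (this == &other) return *this;
        delete[] buffer;  // reclaim our own allocation first
        buffer = std::exchange(other.buffer, nullptr);
        buffer_len = std::exchange(other.buffer_len, 0);
        fd = std::exchange(other.fd, -1);
        return *this;
    }

    ~connection() { delete[] buffer; }
};
```

Multiply this by every owning type in a codebase and the verbosity complaint writes itself.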

Performance

Invoking a move constructor can often be a deoptimization relative to what you could do yourself. Here’s a simple motivating example to show what I mean.

#include <cstddef>
#include <utility>

struct foo {
    foo() = default;

    foo& operator=(foo&& f) {
        if (this == &f) return *this;

        auto tmp = x;
        x = f.x;
        // Ensure we don't double free
        f.x = tmp;
        return *this;
    }

    ~foo() {
      if (x) delete x;
    }

    int* x = nullptr;
};

void moves(foo* f, size_t count, foo* dest) {
    for (size_t i = 0; i != count; ++i) {
        dest[i] = std::move(f[i]);
    }
}

If you compile this with -O2, you’ll probably see what you expect:

moves(foo*, unsigned long, foo*):       # @moves(foo*, unsigned long, foo*)
        test    rsi, rsi
        je      .LBB0_6
        mov     r8d, esi
        and     r8d, 1
        cmp     rsi, 1
        jne     .LBB0_7
        xor     eax, eax
        test    r8, r8
        jne     .LBB0_4
        jmp     .LBB0_6
.LBB0_7:
        sub     rsi, r8
        xor     eax, eax
.LBB0_8:                                # =>This Inner Loop Header: Depth=1
        cmp     rdx, rdi
        je      .LBB0_10
        mov     r9, qword ptr [rdx + 8*rax]
        mov     rcx, qword ptr [rdi + 8*rax]
        mov     qword ptr [rdx + 8*rax], rcx
        mov     qword ptr [rdi + 8*rax], r9
.LBB0_10:                               #   in Loop: Header=BB0_8 Depth=1
        cmp     rdx, rdi
        je      .LBB0_12
        mov     r9, qword ptr [rdx + 8*rax + 8]
        mov     rcx, qword ptr [rdi + 8*rax + 8]
        mov     qword ptr [rdx + 8*rax + 8], rcx
        mov     qword ptr [rdi + 8*rax + 8], r9
.LBB0_12:                               #   in Loop: Header=BB0_8 Depth=1
        add     rax, 2
        cmp     rsi, rax
        jne     .LBB0_8
        test    r8, r8
        je      .LBB0_6
.LBB0_4:
        cmp     rdx, rdi
        je      .LBB0_6
        mov     rcx, qword ptr [rdx + 8*rax]
        mov     rsi, qword ptr [rdi + 8*rax]
        mov     qword ptr [rdx + 8*rax], rsi
        mov     qword ptr [rdi + 8*rax], rcx
.LBB0_6:
        ret

This looks pretty bad to my eye for such a simple snippet of code. Yes, the loop is unrolled and all, but it’s forced to act element by element and perform a lot of operations that aren’t necessary if we’re just relocating an object from place to place. Of course, for safety purposes, this is how it must be written. The code in the moves function mirrors what you might expect when a std::vector resizes; that is, new memory is allocated, and each element is moved into the new allocation before its destructor is ultimately invoked on the old side. Here’s a bit of equivalent code (with foo renamed to bar) if we only care about supporting relocation.

#include <cstddef>
#include <cstring>

struct bar {
    bar() = default;

    ~bar() {
      if (x) delete x;
    }

    int* x = nullptr;
};

void copies(bar* b, size_t count, bar* dest) {
    std::memcpy(dest, b, sizeof(bar) * count);
}

In contrast to the first example, this will just be a call to memcpy (we’ll touch on memcpy soon). What happened to the objects we copied from? Note that we aren’t invoking any destructors, and semantically, for this code anyway, nothing breaks: had we moved element by element instead, the destructors of the moved-from objects would have been guaranteed no-ops for this struct, so skipping them entirely is equivalent. For completeness, here’s the assembly.

copies(bar*, unsigned long, bar*):      # @copies(bar*, unsigned long, bar*)
        mov     rax, rdi
        lea     rcx, [8*rsi]
        mov     rdi, rdx
        mov     rsi, rax
        mov     rdx, rcx
        jmp     memcpy                  # TAILCALL
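To make the safety claim concrete, here’s a hypothetical, self-contained demo (mine, not from any library): two live bar objects are relocated byte-wise into fresh storage, the sources are abandoned, and only the destination copies are destroyed, so each allocation is freed exactly once. Strictly speaking, the language doesn’t bless reusing the bytes this way (which is part of why the relocation proposals exist), but mainstream compilers compile it as intended.

```cpp
#include <cstdlib>
#include <cstring>

struct bar {
    bar() = default;
    ~bar() { delete x; }  // deleting a null pointer is a no-op
    int* x = nullptr;
};

int relocate_demo() {
    bar src[2];
    src[0].x = new int(1);
    src[1].x = new int(2);

    // Raw destination storage: no constructors run here.
    void* raw = std::malloc(sizeof(src));
    std::memcpy(raw, src, sizeof(src));
    bar* dst = static_cast<bar*>(raw);

    int sum = *dst[0].x + *dst[1].x;

    // The sources must be abandoned, not destroyed; on the stack we
    // simulate that by nulling them so their destructors become no-ops.
    src[0].x = nullptr;
    src[1].x = nullptr;

    dst[0].~bar();  // the only destructor runs that free the ints
    dst[1].~bar();
    std::free(raw);
    return sum;
}
```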

The important intuition here is that if your move turns the destructor into a noop, you’re paying for instructions you do not need. This is something that’s bothered me since C++11 came out, but has only recently gotten more attention with the proposal regarding trivial relocation.
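One way to approximate this today, while the proposals bake, is an opt-in trait plus a bulk-relocate helper. This is my sketch of the idea, not the proposal’s actual API; the names is_trivially_relocatable and uninitialized_relocate are assumptions:

```cpp
#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>
#include <utility>

// Opt-in trait: memcpy to fresh storage plus abandoning the source is
// equivalent to move-construct + destroy. Trivially copyable types
// qualify automatically; other types must specialize this themselves.
template <typename T>
struct is_trivially_relocatable : std::is_trivially_copyable<T> {};

// Relocate count objects from src into uninitialized dest. Afterwards
// the src objects are dead: do not run their destructors.
template <typename T>
void uninitialized_relocate(T* dest, T* src, std::size_t count) {
    if constexpr (is_trivially_relocatable<T>::value) {
        std::memcpy(dest, src, count * sizeof(T));
    } else {
        for (std::size_t i = 0; i != count; ++i) {
            ::new (static_cast<void*>(dest + i)) T(std::move(src[i]));
            src[i].~T();
        }
    }
}
```

A container resize can then call uninitialized_relocate once and free the old block, with no per-element work at all for relocatable types.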

Let’s talk about memcpy

The point of the post isn’t to say that memcpy is necessarily the answer (although it often will be), but that there’s a swath of optimization available that C++’s move semantics don’t allow you to express. Sidestepping them does come at the cost of additional code (and a new class of bugs). If a memcpy is possible, though, we can consider further optimizations: a loop-unrolled SIMD copy (if not already provided by the compiler), or non-temporal moves to avoid cache pollution. Suffice it to say, we skip two function calls per element (a move constructor/assignment and a destructor) and can vectorize the operation.
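As a sketch of the non-temporal idea (SSE2 intrinsics with a plain memcpy fallback; the function name stream_copy is mine), streaming stores write around the cache hierarchy, which helps when the destination won’t be read again soon:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Copy that uses non-temporal (streaming) stores when size and alignment
// permit, avoiding cache pollution on large one-way copies.
void stream_copy(void* dst, const void* src, std::size_t bytes) {
#if defined(__SSE2__)
    if (bytes % 16 == 0 &&
        reinterpret_cast<std::uintptr_t>(dst) % 16 == 0) {
        auto* d = static_cast<__m128i*>(dst);
        auto* s = static_cast<const __m128i*>(src);
        for (std::size_t i = 0; i != bytes / 16; ++i) {
            // _mm_stream_si128 bypasses the cache on the store side
            _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
        }
        _mm_sfence();  // order the streaming stores before returning
        return;
    }
#endif
    std::memcpy(dst, src, bytes);
}
```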

What to do about it?

What I’ve done is avoid move constructors entirely. None of my classes support moves, and they support a copy only if the copy is trivial or in the very rare case that the destructor remains well-defined after a copy. Of course, this rules out the bulk of the STL containers, which is why this approach forces you to avoid them. Provided your containers have iterators that are API compatible, you can still use other parts of the STL without issue (e.g. <algorithm>).
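For instance, when a container’s iterators are plain pointers, <algorithm> works on it unchanged; fixed_buf below is a hypothetical stand-in for such a container, not something from my codebase:

```cpp
#include <algorithm>
#include <cstddef>

// Minimal container exposing pointer iterators: enough for <algorithm>.
template <typename T, std::size_t N>
struct fixed_buf {
    T data[N];
    T* begin() { return data; }
    T* end() { return data + N; }
};
```

std::sort(buf.begin(), buf.end()), std::find, and friends then behave exactly as they would on a std::vector.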

Here’s a simple example of a mostly drop-in vector replacement I’ve been using from one of my own codebases. The main differences are that it also implements the small-buffer optimization and that its growth factor interpolates from 2 initially to 1.5 as it grows. API-wise, it avoids the initializer-list shenanigans and doesn’t support push_back (I’ve never needed it, since copyable things can be passed to emplace_back and use the copy constructor just the same).

#pragma once

#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <utility>

namespace alloy
{
// SBS: Small Buffer Size
template <typename T, size_t SBS = 8> class vector
{
public:
  vector()
  {
    static_assert(SBS > 0, "Small buffer size must be at least 1");
    data_ = sb_;
  }

  // No moves, per the rest of this post, and no implicit memberwise copy
  // (which would alias the heap buffer and double free it).
  vector(const vector&) = delete;
  vector& operator=(const vector&) = delete;

  ~vector()
  {
    clear();

    if (capacity_ > SBS) {
      std::free(data_);
    }
  }

  [[nodiscard]] constexpr T* begin() noexcept
  {
    return data_;
  }

  [[nodiscard]] constexpr T* end() noexcept
  {
    return data_ + size_;
  }

  [[nodiscard]] constexpr const T* cbegin() const noexcept
  {
    return data_;
  }

  [[nodiscard]] constexpr const T* cend() const noexcept
  {
    return data_ + size_;
  }

  [[nodiscard]] constexpr T* rbegin() noexcept
  {
    return end() - 1;
  }

  [[nodiscard]] constexpr T* rend() noexcept
  {
    return begin() - 1;
  }

  [[nodiscard]] constexpr const T* crbegin() const noexcept
  {
    return cend() - 1;
  }

  [[nodiscard]] constexpr const T* crend() const noexcept
  {
    return cbegin() - 1;
  }

  [[nodiscard]] constexpr bool empty() const noexcept
  {
    return size_ == 0;
  }

  [[nodiscard]] constexpr int size() const noexcept
  {
    return size_;
  }

  [[nodiscard]] constexpr int capacity() const noexcept
  {
    return capacity_;
  }

  T& front()
  {
    return data_[0];
  }

  const T& front() const
  {
    return data_[0];
  }

  T& back()
  {
    return data_[size_ - 1];
  }

  const T& back() const
  {
    return data_[size_ - 1];
  }

  T& operator[](size_t index)
  {
    return data_[index];
  }

  const T& operator[](size_t index) const
  {
    return data_[index];
  }

  void reserve(int next_capacity)
  {
    void* next = std::malloc(next_capacity * sizeof(T));
    std::memcpy(next, data_, size_ * sizeof(T));
    if (capacity_ > SBS) {
      std::free(data_);
    }
    data_ = reinterpret_cast<T*>(next);
    capacity_ = next_capacity;
  }

  void clear()
  {
    for (const T& elem : *this) {
      elem.~T();
    }

    size_ = 0;
  }

  void resize(int next_size)
  {
    if (next_size > capacity_) {
      reserve(next_size);
    }

    if (next_size < size_) {
      for (int i = next_size; i != size_; ++i) {
        data_[i].~T();
      }
      size_ = next_size;
    } else {
      int cursor = size_;
      size_ = next_size;
      for (; cursor != size_; ++cursor) {
        new (data_ + cursor) T();
      }
    }
  }

  template <typename... Args> T& emplace_back(Args&&... args)
  {
    if (size_ >= capacity_) {
      // Interpolate the growth factor from 2 initially to 1.5
      auto growth_factor = capacity_ > 1024
        ? 3 * 1024
        : 3 * capacity_ + 4 * (1024 - capacity_);
      reserve(capacity_ * growth_factor / 2048 + 1);
    }

    new (&data_[size_++]) T(std::forward<Args>(args)...);
    return back();
  }

private:
  // Raw storage for the small buffer. A plain T sb_[SBS] would
  // default-construct SBS elements up front and destroy them again in
  // ~vector after clear() already had, so we keep bytes and cast.
  struct small_buffer {
    alignas(T) unsigned char bytes[SBS * sizeof(T)];
    operator T*() { return reinterpret_cast<T*>(bytes); }
  } sb_;
  T* data_;
  int size_ = 0;
  int capacity_ = SBS;
};
} // namespace alloy

Of course, the same approach can apply to many other container types as well. Empirically, I have not observed lifetime bugs from this approach, and my code is a hell of a lot cleaner and less verbose. Part of me sometimes feels like this is the way C++ was meant to be written, and there’s a feeling of liberation in just writing a constructor and a destructor and leaving it at that.