(footprint + performance)
Keep the original runtime-polymorphic implementation as `test/function2.hpp` for comparison and review.
delegate_t<R(A...)> allows fast-path target function invocation.
This statically polymorphic object delegates invocation-specific user template-type data and callbacks,
and allows it to (a sketch follows this list):
- be maintained within the function<R(A...)> instance as a member
- avoid the need for dynamic polymorphism, i.e. a heap-allocated specialization referenced by a base type
- hence support good cache performance
- avoid virtual function table indirection
- utilize constexpr inline function invocation (callbacks)
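A minimal sketch of this manual static polymorphism (hypothetical names and a fixed 32-byte inline buffer for illustration; not the actual jau delegate_t internals):

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// Hypothetical, simplified sketch -- not the actual jau delegate_t.
// The target callable lives inline and is invoked through a plain
// function pointer: no heap allocation, no vtable indirection.
template<typename S> class delegate_t;

template<typename R, typename... A>
class delegate_t<R(A...)> {
  private:
    typedef R (*invoke_fn)(void* data, A... args);

    alignas(std::max_align_t) char data_[32]; // inline target storage
    invoke_fn invoke_;                        // fast-path callback

    delegate_t() noexcept : invoke_(nullptr) {}

  public:
    // Bind a callable; trivially copyable/destructible targets only, for brevity.
    template<typename F>
    static delegate_t bind(F&& f) {
        typedef typename std::decay<F>::type T;
        static_assert(sizeof(T) <= sizeof(data_), "target too large for inline storage");
        delegate_t d;
        ::new (static_cast<void*>(d.data_)) T(std::forward<F>(f));
        d.invoke_ = [](void* data, A... args) -> R {
            return (*static_cast<T*>(data))(args...); // statically resolved per T
        };
        return d;
    }

    R operator()(A... args) { return invoke_(data_, args...); }
};

// Usage: a function<R(A...)> could hold such a delegate by value.
// delegate_t<int(int)> d = delegate_t<int(int)>::bind([](int x) { return x + 1; });
// int r = d(41); // r == 42
```

Storing the target inline trades a fixed footprint cap for removing the heap allocation and the vtable hop.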
Impact:
- Memory footprint reduced (64-bit)
  - lambda, member: 64 -> 48 bytes, i.e. 25.00% less size
  - capval, capref: 56 -> 48 bytes, i.e. 14.29% less size
- Performance (linux-arm64, raspi4, gcc, mean @ 600 samples):
  - member_jaufunc: 7.34998 -> 5.3406 ms, i.e. 27.34% perf. gain
    - becoming faster than member_stdbind_unspec
  - lambda_jaufunc: 6.89633 -> 5.52684 ms, i.e. 19.86% perf. gain
    - aligning most trivial types to similar performance
- Performance differences on a linux-amd64 high-performance machine
  - Less significant than on linux-arm64
  - Probably due to better CPU caching, memory access, branch prediction, etc.
  - member_jaufunc: 1.880 -> 1.848 ms, i.e. ~2% perf. gain (too small to be significant)
  - lambda_jaufunc: 1.871 -> 1.851 ms, i.e. ~1% perf. gain (too small to be significant)
- Lines of code incl. comments:
  - From 1287 -> 1674 lines, i.e. 30% added lines of code
  - Added code is used for the manual static polymorphism, see above.
- Performance methodology
  - New code test: nice -20 ./test_functional2_perf --benchmark-samples 600
  - Old code test: nice -20 ./test_functional1_perf --benchmark-samples 600
+++
Optimization of func::member_target_t<...> using `gcc`
Utilizes the GCC C++ extension: Pointer to Member Function (PMF) conversion to a plain function pointer
- Reduces function pointer size, i.e. PMF 16 bytes (total 24) -> function pointer 8 bytes (total 16)
- Removes the vtable lookup at invocation (performance)
- Passes the object's `this` pointer to the function as its 1st argument
- See [GCC PMF Conversion](https://gcc.gnu.org/onlinedocs/gcc/Bound-member-functions.html#Bound-member-functions) (a minimal example follows)
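For illustration, a minimal example of this extension (type names are hypothetical; compile with g++ and -Wno-pmf-conversions to silence the diagnostic):

```cpp
// GNU C++ extension "Bound member functions", see the URL above.
struct A {
    virtual int hello(int i) { return i + 1; }
};

typedef int (*plain_fn)(A*, int);  // 8 bytes on 64-bit, vs 16 for the PMF

int main() {
    A a;
    int (A::*pmf)(int) = &A::hello;   // pointer to member function (PMF)
    plain_fn f = (plain_fn)(a.*pmf);  // extension: extract the function pointer;
                                      // any vtable lookup happens here, once
    return f(&a, 41) == 42 ? 0 : 1;   // invoke with `this` as the 1st argument
}
```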

findings (deprecated); performance enhancements
cow_darray, cow_vector design changes / notes:
- Aligned all typedef names etc. between both implementations for easier review
- Removed direct element access, as this would only be valid if Value_type is a std::shared_ptr
  and the underlying shared storage has been pulled. Use an iterator instead!
- Introduced the immutable const_iterator (a jau::cow_ro_iterator) and the mutable iterator
  (a jau::cow_rw_iterator). Both hold the underlying storage's iterator; the const_iterator
  additionally holds the lock-free shared snapshot, while the iterator holds the write lock
  and a copy of the storage. This guarantees efficient, std-API-aligned operation while
  keeping the std::shared_ptr references alive (see the iterator sketch after this list).
- Removed for_each_cow: use std::for_each with the new const_iterator or iterator etc.
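The const_iterator idea could be sketched roughly like this (a hypothetical minimal version over std::vector storage; not the actual jau::cow_ro_iterator):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <memory>
#include <vector>

// Hypothetical minimal read-only COW iterator: it co-owns the lock-free
// shared snapshot, so a concurrent writer swapping in new storage can
// never invalidate it, and reading requires no lock at all.
template<typename T>
struct cow_ro_iter {
    typedef std::input_iterator_tag iterator_category;
    typedef T value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const T* pointer;
    typedef const T& reference;

    std::shared_ptr<const std::vector<T>> snapshot; // keeps storage alive
    typename std::vector<T>::const_iterator pos;

    const T& operator*() const { return *pos; }
    cow_ro_iter& operator++() { ++pos; return *this; } // pre-increment, no copy
    bool operator!=(const cow_ro_iter& o) const { return pos != o.pos; }
    bool operator==(const cow_ro_iter& o) const { return pos == o.pos; }
};

// Usage with std::for_each, given a snapshot obtained from the container
// (container_snapshot() is a hypothetical accessor):
// std::shared_ptr<const std::vector<int>> snap = container_snapshot();
// std::for_each(cow_ro_iter<int>{snap, snap->cbegin()},
//               cow_ro_iter<int>{snap, snap->cend()},
//               [](const int& v) { /* lock-free read-only access */ });
```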
Performance changes / notes:
- cow_darray: Use a fixed golden-ratio grow factor in push_back(), reducing excessive copies.
- cow_darray::push_back(..): No copy when size < capacity, just push_back into the underlying
  storage, dramatically reducing copies (see the sketch after this list).
  This is guaranteed correct with cow_darray + darray, as darray increments its end_ iterator
  only after the new element has been added.
- Always use pre-increment/decrement, avoiding the copy made by the post-* variants.
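A rough sketch of the push_back fast path combined with golden-ratio growth (a hypothetical simplification over std::vector storage; the real containers publish new storage under their own locking scheme):

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical COW push_back sketch: clone + grow only when capacity is
// exhausted, otherwise append in place. Correctness of the in-place fast
// path relies on the storage exposing the new element only after it is
// fully constructed (as darray does by bumping end_ last).
template<typename T>
struct cow_sketch {
    std::shared_ptr<std::vector<T>> store = std::make_shared<std::vector<T>>();
    std::mutex write_mtx;

    static std::size_t grow(std::size_t cap) { // golden-ratio grow factor
        const std::size_t next = static_cast<std::size_t>(static_cast<double>(cap) * 1.618);
        return next > cap ? next : cap + 10;   // ensure forward progress
    }

    void push_back(const T& v) {
        std::lock_guard<std::mutex> lock(write_mtx);
        if( store->size() == store->capacity() ) {       // slow path: copy storage
            auto larger = std::make_shared<std::vector<T>>();
            larger->reserve(grow(store->capacity()));
            larger->insert(larger->end(), store->begin(), store->end());
            larger->push_back(v);
            store = larger;                               // publish the new snapshot
        } else {
            store->push_back(v);                          // fast path: no copy
        }
    }
};
```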
cow_vector fixes (learned from working with cow_darray):
- reserve(): Only if new_capacity > capacity does the storage need to be copy-constructed
- operator=(cow_vector&& x): Hold both cow_vectors' write mutexes
- pop_back(): Only act if not empty
+++
Performance seems to fluctuate with the allocator,
and we might want to resolve this with a custom pooled allocator.
This is obvious when comparing the 'empty' samples with the 'reserved' ones,
where the latter reserve the whole memory of the std::vector and jau::darray upfront.
Performance on arm64-raspi4, jau::cow_vector vs jau::cow_darray:
- sequential fill and list O(1): cow_vector is ~30 times slower
  (starting empty)
  (delta due to cow_darray's capacity usage, i.e. fewer copies)
- unique fill and list O(n*n): cow_vector is 2-3 times slower
  (starting empty)
  (most time is spent dereferencing, which costs the same in both)
Performance on arm64-raspi4, std::vector vs jau::darray:
- sequential fill and list iterator O(1): jau::darray is ~0% - 40% slower (sizes 50 .. 1000)
  (starting empty)
  (we may call this almost equal, probably allocator related)
- unique fill and list iterator O(n*n): std::vector is ~0% - 23% slower (sizes 50 .. 1000)
  (starting empty)
  (almost equal; most time is spent dereferencing, which costs the same in both)
+++
Performance on amd64, jau::cow_vector vs jau::cow_darray:
- sequential fill and list O(1): cow_vector is ~38 times slower
  (starting empty)
  (delta due to cow_darray's capacity usage, i.e. fewer copies)
- unique fill and list O(n*n): cow_vector is ~2 times slower
  (starting empty)
  (most time is spent dereferencing, which costs the same in both)
Performance on amd64, std::vector vs jau::darray:
- sequential fill and list iterator O(1): jau::darray is ~0% - 20% slower (sizes 50 .. 1000)
  (starting empty)
  (we may call this almost equal, probably allocator related)
- unique fill and list iterator O(n*n): std::vector is ~0% - 30% slower (sizes 50 .. 1000)
  (starting empty)
  (almost equal; most time is spent dereferencing, which costs the same in both)
+++
Memory ratio (allocation/size), jau::cow_vector vs jau::cow_darray at size:
- 50: 2 vs 1.1
- 100: 2 vs 1.44
- 1000: 2 vs 1.6
- Hence the cow_darray golden-ratio growth factor is more efficient in both size and performance.
Memory ratio (allocation/size), std::vector vs jau::darray at size:
- 50: 1.28 vs 1.10
- 100: 1.28 vs 1.44
- 1000: 1.03 vs 1.60
- Hence the darray golden-ratio growth factor is less efficient at big sizes
  (but it is configurable)

counting_allocator measuring memory footprint (a minimal sketch follows the numbers below)
On arm64, raspi4:
cow_vector<T> uses ~50% more memory than vector<T>.
cow_vector<T> is 9-16 times slower than vector<T> (find):
- 25 elements: 9x slower
- 50 elements: 12x slower
- 100 elements: 15x slower
- 200 elements: 16x slower
- 1000 elements: 10x slower
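A counting allocator along these lines might look like this (a hypothetical `counting_alloc`, not the actual jau counting_allocator API):

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical minimal counting allocator: tallies the bytes requested via
// allocate(), exposing a container's true memory footprint for comparison.
template<typename T>
struct counting_alloc {
    typedef T value_type;
    std::size_t* allocated_bytes; // shared tally, owned by the test

    explicit counting_alloc(std::size_t* tally) noexcept : allocated_bytes(tally) {}
    template<typename U>
    counting_alloc(const counting_alloc<U>& o) noexcept : allocated_bytes(o.allocated_bytes) {}

    T* allocate(std::size_t n) {
        *allocated_bytes += n * sizeof(T);               // count every allocation
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};

template<typename T, typename U>
bool operator==(const counting_alloc<T>& a, const counting_alloc<U>& b) noexcept {
    return a.allocated_bytes == b.allocated_bytes;
}
template<typename T, typename U>
bool operator!=(const counting_alloc<T>& a, const counting_alloc<U>& b) noexcept {
    return !(a == b);
}

// Usage: compare tallies of two container types after identical fills.
// std::size_t bytes = 0;
// std::vector<int, counting_alloc<int>> v( (counting_alloc<int>(&bytes)) );
// for(int i = 0; i < 100; ++i) { v.push_back(i); }
// double ratio = bytes / (100.0 * sizeof(int)); // allocation/size ratio as above
```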
+++
unordered_set<T> uses ~17% more memory than vector<T>
unordered_set<T> performance is <= vector<T> up until 100 elements (find)
- 25 elements: ~97% slower
- 50 elements: ~44% slower
unordered_set<T> performance is > vector<T> (find):
- 100 elements: equal
- 200 elements: ~38% faster
- 1000 elements: ~90% faster