Blazing Matrix Products

(panadestein.github.io)

45 points | by Bogdanp 11 hours ago

5 comments

  • imurray 5 hours ago

    This post is for those interested in high-performance matrix multiplication in BQN (an APL-like array language).

    The main thing I got out of it was the footnotes, in particular https://en.algorithmica.org/hpc/algorithms/matmul/, a really nice post on fast matrix multiplication that is also a chapter of what looks like a good online book.

  • bee_rider 4 hours ago

    Does BQN just natively vectorize the code? I was surprised not to see anything about that.

    • mlochbaum 3 hours ago

      The relevant operations for matrix multiply are leading-axis extension, shown near the end of [0], and Insert +˝ shown in [1]. Both for floats; the leading-axis operation is × but it's the same speed as + with floating-point SIMD. We don't handle these all that well, with needless copying in × and a lot of per-row overhead in +˝, but of course it's way better than scalar evaluation. (A sketch of how the two combine into a matrix product follows the links below.)

      [0] https://mlochbaum.github.io/bencharray/pages/arith.html

      [1] https://mlochbaum.github.io/bencharray/pages/fold.html
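
      As a concrete illustration of how these two operations combine, here is the standard BQN matrix-product idiom (the usual BQNcrate phrase; a sketch added for context, not code from the post):

        MatMul ← +˝∘×⎉1‿∞   # per row of 𝕨: scale the rows of 𝕩, then sum them
        a ← 2‿3⥊↕6          # 2×3 matrix
        b ← 3‿2⥊↕6          # 3×2 matrix
        a MatMul b          # 2×2 matrix product

      ×⎉1‿∞ pairs each row of 𝕨 with all of 𝕩, leading-axis extension scales each row of 𝕩 by the matching element of that row, and Insert +˝ sums the scaled rows.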

      • mlochbaum 3 hours ago

        And the reason +˝ is fairly fast for long rows, despite that page claiming no optimizations, is that ˝ is defined to split its argument into cells, e.g. rows of a matrix, and apply + with those as arguments. So + is able to apply its ordinary vectorization, while it can't in some other situations where it's applied element-wise. This still doesn't make great use of cache and I do have some special code working for floats that does much better with a tiling pattern, but I wanted to improve +˝ for integers along with it and haven't finished those (widening on overflow is complicated).
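
        To make the cell-splitting concrete, a small example (added for illustration, not from the thread); each + here adds whole rows at once, which is what lets it vectorize:

          m ← 3‿4⥊↕12   # 3×4 integer matrix
          +˝ m          # ≡ (0⊏m) + (1⊏m) + (2⊏m), i.e. 12 15 18 21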

    • icen 4 hours ago

      Yes; the CBQN interpreter has a number of specialised vectorised codepaths. It picks a good one for the arrays at runtime.
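
      One way to see the typed fast paths from a CBQN REPL (a rough sketch using the standard •_timed modifier; absolute timings depend on the build and machine):

        i ← ↕1e6        # small-integer array, eligible for an integer SIMD path
        f ← i + 0.5     # float array, dispatched to a float path instead
        +´ •_timed i    # seconds to sum the integers
        +´ •_timed f    # seconds to sum the floats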