Blazing Matrix Products

(panadestein.github.io)

45 points | by Bogdanp 11 hours ago

5 comments

  • imurray 5 hours ago

    This post is for those interested in high-performance matrix multiplication in BQN (an APL-like array language).

    The main thing I got out of it was the footnotes, in particular https://en.algorithmica.org/hpc/algorithms/matmul/, a really nice post on fast matrix multiplication that is also a chapter of what looks like a good online book.

  • bee_rider 4 hours ago

    Does BQN just natively vectorize the code? I was surprised not to see anything about that.

    • mlochbaum 3 hours ago

      The relevant operations for matrix multiply are leading-axis extension, shown near the end of [0], and Insert +˝ shown in [1]. Both for floats; the leading-axis operation is × but it's the same speed as + with floating-point SIMD. We don't handle these all that well, with needless copying in × and a lot of per-row overhead in +˝, but of course it's way better than scalar evaluation. (A sketch of how the two combine into a matrix product follows the links below.)

      [0] https://mlochbaum.github.io/bencharray/pages/arith.html

      [1] https://mlochbaum.github.io/bencharray/pages/fold.html
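
      As a concrete illustration of how these two operations combine, here is the standard BQN matrix-product idiom (the usual BQNcrate phrase; a sketch added for context, not code from the post):

        MatMul ← +˝∘×⎉1‿∞   # per row of 𝕨: scale the rows of 𝕩, then sum them
        a ← 2‿3⥊↕6          # 2×3 matrix
        b ← 3‿2⥊↕6          # 3×2 matrix
        a MatMul b          # 2×2 matrix product

      ×⎉1‿∞ pairs each row of 𝕨 with all of 𝕩, leading-axis extension scales each row of 𝕩 by the matching element of that row, and Insert +˝ sums the scaled rows.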

      • mlochbaum 3 hours ago

        And the reason +˝ is fairly fast for long rows, despite that page claiming no optimizations, is that ˝ is defined to split its argument into cells, e.g. rows of a matrix, and apply + with those as arguments. So + is able to apply its ordinary vectorization, while it can't in some other situations where it's applied element-wise. This still doesn't make great use of cache and I do have some special code working for floats that does much better with a tiling pattern, but I wanted to improve +˝ for integers along with it and haven't finished those (widening on overflow is complicated).
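
        To make the cell-splitting concrete, a small example (added for illustration, not from the thread); each + here adds whole rows at once, which is what lets it vectorize:

          m ← 3‿4⥊↕12   # 3×4 integer matrix
          +˝ m          # ≡ (0⊏m) + (1⊏m) + (2⊏m), i.e. 12 15 18 21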

    • icen 4 hours ago

      Yes; the CBQN interpreter has a number of specialised vectorised codepaths. It picks a good one for the arrays at runtime.
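
      One way to see the typed fast paths from a CBQN REPL (a rough sketch using the standard •_timed modifier; absolute timings depend on the build and machine):

        i ← ↕1e6        # small-integer array, eligible for an integer SIMD path
        f ← i + 0.5     # float array, dispatched to a float path instead
        +´ •_timed i    # seconds to sum the integers
        +´ •_timed f    # seconds to sum the floats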