# Matmul - Accelerated Matrix Multiplication Library

A lightweight, type-safe matrix multiplication library with runtime dispatch and SIMD acceleration.

**Current Implementations:**
- `uint8_t` × `int8_t` → `uint8_t`
- `float` × `float` → `float`
- `double` × `double` → `double`

## Installation

This library can be installed with the [dep](https://github.com/finwo/dep) package manager. If you use [dep-repository](https://github.com/finwo/dep-repository), run `dep add finwo/matmul`; otherwise, add the following line to your `.dep` file:

```
finwo/matmul https://git.finwo.net/lib/matmul.c/archives/heads/main.tar.gz
```

Alternatively, you can include the [matmul.c](src/matmul.c) and [matmul.h](src/matmul.h) files in your project directly.

## Features

- **SIMD Acceleration**: AVX2, AVX512, AVX-VNNI, and AVX512-VNNI with automatic CPU feature detection
- **Multi-Core Parallelization**: OpenMP-based parallel processing across matrix rows
- **Tiled Algorithm**: Cache-blocking for improved locality at large matrix sizes
- **No Dependencies**: Pure C11 implementation with no external libraries (OpenMP support comes from the compiler)
- **Well Tested**: Unit tests verify correctness across all implementations

## Quick Start

### Quantized Multiplication (u8 × i8 → u8)
```c
#include "finwo/matmul.h"
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // 2x3 matrix A (uint8_t)
    uint8_t A[6] = {1, 2, 3, 4, 5, 6};
    // 3x2 matrix B (int8_t)
    int8_t  B[6] = {1, 0, 0, 1, 0, 0};
    // 2x2 result matrix C (uint8_t)
    uint8_t C[4];

    // Multiply A(2x3) * B(3x2) = C(2x2)
    matmul(2, 3, 2, A, B, C, 0.0);  // scale=0: no scaling

    printf("C[0] = %u, C[1] = %u, C[2] = %u, C[3] = %u\n",
           C[0], C[1], C[2], C[3]);

    return 0;
}
```

### Floating Point Multiplication (f32 × f32 → f32)
```c
#include "finwo/matmul.h"
#include <stdio.h>

int main(void) {
    float A[4] = {1.0f, 2.0f, 3.0f, 4.0f}; // 2x2
    float B[4] = {5.0f, 6.0f, 7.0f, 8.0f}; // 2x2
    float C[4];

    matmul(2, 2, 2, A, B, C, 1.0);

    printf("C[0] = %f\n", C[0]);
    return 0;
}
```

Compile with: `cc -o example example.c -lm -fopenmp`

## API Reference

### Generic Macro (Recommended)
```c
matmul(m, n, p, A, B, C, scale);
```
- Automatically selects the correct function based on the types of A, B, and C
- `m`, `n`, `p`: Matrix dimensions; A is m×n, B is n×p, and C is m×p (as in the Quick Start call `matmul(2, 3, 2, ...)` for A(2x3) * B(3x2))
- `scale`: Divide each output element by this value before writing (0 or 1 = no scaling)

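One way a type-generic macro of this kind can be built is with C11 `_Generic`. The following is a minimal sketch under that assumption, not the library's actual macro: `demo_matmul` and the `demo_*` functions are hypothetical stand-ins that return a tag instead of multiplying, and the sketch selects on the type of `C` only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the typed entry points. Each returns a
 * distinct tag so the compile-time selection is observable. */
static int demo_u8_i8_u8(const uint8_t *A, const int8_t *B, uint8_t *C) {
    (void)A; (void)B; (void)C;
    return 1;
}

static int demo_f32_f32_f32(const float *A, const float *B, float *C) {
    (void)A; (void)B; (void)C;
    return 2;
}

static int demo_f64_f64_f64(const double *A, const double *B, double *C) {
    (void)A; (void)B; (void)C;
    return 3;
}

/* _Generic picks one function name from the static type of C, then the
 * trailing argument list calls it. The choice is made by the compiler. */
#define demo_matmul(A, B, C) _Generic((C), \
    uint8_t *: demo_u8_i8_u8,              \
    float   *: demo_f32_f32_f32,           \
    double  *: demo_f64_f64_f64)((A), (B), (C))
```

Passing a `C` of an unlisted type, such as `int *`, is rejected at compile time, and mismatched `A` or `B` pointers are diagnosed by the selected function's prototype.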
### Direct Function Calls
```c
// Implemented type combinations
int matmul_u8_i8_u8(size_t m, size_t n, size_t p,
                    const uint8_t *A, const int8_t *B,
                    uint8_t *C, double scale);

int matmul_f32_f32_f32(size_t m, size_t n, size_t p,
                       const float *A, const float *B,
                       float *C, double scale);

int matmul_f64_f64_f64(size_t m, size_t n, size_t p,
                       const double *A, const double *B,
                       double *C, double scale);

// Scalar and SIMD variants (internal/specialized)
int matmul_scalar_u8_i8_u8(...);
int matmul_avx512vnni_u8_i8_u8(...);
int matmul_scalar_f32_f32_f32(...);
int matmul_avx2_f32_f32_f32(...);
int matmul_avx512_f32_f32_f32(...);
```

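How the public entry points choose among these variants is not spelled out here. Below is a minimal sketch of one common approach, a function pointer resolved on first use. All names (`demo_matmul_f32`, `kernel_scalar`, `kernel_avx2_placeholder`) are hypothetical, the `scale` argument is omitted, row-major storage is assumed, and the "AVX2" kernel is only a placeholder that reuses the scalar loop. `__builtin_cpu_supports` is a GCC/Clang builtin for x86; other targets fall back to the scalar kernel.

```c
#include <stddef.h>

/* Signature shared by every f32 kernel in this sketch. */
typedef int (*f32_kernel)(size_t m, size_t n, size_t p,
                          const float *A, const float *B, float *C);

/* Portable fallback: plain triple loop, row-major storage (assumed). */
static int kernel_scalar(size_t m, size_t n, size_t p,
                         const float *A, const float *B, float *C) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < p; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * p + j];
            C[i * p + j] = acc;
        }
    }
    return 0;
}

/* Placeholder for a vectorized kernel. A real one would use intrinsics;
 * this one forwards to the scalar loop so the sketch stays portable. */
static int kernel_avx2_placeholder(size_t m, size_t n, size_t p,
                                   const float *A, const float *B, float *C) {
    return kernel_scalar(m, n, p, A, B, C);
}

/* Returns nonzero when the running CPU reports AVX2 support. */
static int cpu_has_avx2(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2");
#else
    return 0;
#endif
}

/* Highest-priority kernel the CPU can run, falling back to scalar. */
static f32_kernel select_kernel(void) {
    return cpu_has_avx2() ? kernel_avx2_placeholder : kernel_scalar;
}

/* Entry point: resolves the kernel once, then reuses the cached pointer.
 * The lazy initialization here is not thread-safe; it is only a sketch. */
int demo_matmul_f32(size_t m, size_t n, size_t p,
                    const float *A, const float *B, float *C) {
    static f32_kernel cached = NULL;
    if (cached == NULL)
        cached = select_kernel();
    return cached(m, n, p, A, B, C);
}
```

After the first call, each invocation costs one pointer check and one indirect call, regardless of which kernel was selected.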
### Type Naming Conventions
| Shorthand | C Type     | Description            |
|-----------|------------|------------------------|
| `f32`     | `float`    | 32-bit floating point  |
| `f64`     | `double`   | 64-bit floating point  |
| `i8`      | `int8_t`   | Signed 8-bit integer   |
| `u8`      | `uint8_t`  | Unsigned 8-bit integer |

Function names follow the pattern `matmul_{A_type}_{B_type}_{C_type}`.

### Scale Parameter
The `scale` parameter enables quantization-aware multiplication:
- `scale = 0` or `scale = 1`: No scaling (the raw result is written)
- `scale = 64`: Each output element is divided by 64 before it is written
- Example use: keep A at full scale while treating the `int8_t` values in B as fixed-point numbers covering -2..1.984375 (that is, -128/64..127/64) instead of -128..127

Implementation:
```c
if (scale != 0 && scale != 1) {
    result = result / scale;
}
```

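To make the fixed-point reading concrete, here is a small sketch of a single output element computed with `scale = 64`. The helper names (`scale_to_u8`, `dot_u8_i8`) are hypothetical, and clamping to 0..255 and truncation toward zero when storing are assumptions of this sketch; the library's actual saturation and rounding behavior is not described here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: apply the scale rule shown above, then store the
 * value as uint8_t. Clamping and truncation are assumptions. */
static uint8_t scale_to_u8(int32_t acc, double scale) {
    double v = (double)acc;
    if (scale != 0 && scale != 1) {
        v = v / scale;
    }
    if (v < 0.0)   v = 0.0;
    if (v > 255.0) v = 255.0;
    return (uint8_t)v;
}

/* One output element: dot product of a row of A (uint8_t) with a column
 * of B (int8_t), accumulated in 32 bits, then scaled and stored. */
static uint8_t dot_u8_i8(size_t n, const uint8_t *a, const int8_t *b,
                         double scale) {
    int32_t acc = 0;
    for (size_t k = 0; k < n; k++)
        acc += (int32_t)a[k] * (int32_t)b[k];
    return scale_to_u8(acc, scale);
}
```

With `a = {100, 50}` and `b = {64, -32}`, the accumulator is 100·64 + 50·(-32) = 4800, and 4800 / 64 = 75. That matches 100·1.0 + 50·(-0.5) when the entries of `b` are read as 64/64 = 1.0 and -32/64 = -0.5.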
## Building

```bash
# Compile library and tests
make

# Run tests
./test_matmul

# Run benchmarks (optional, needs ~800MB RAM for 16K×16K)
./benchmark

# Clean build artifacts
make clean
```

Requires a C11 compiler (gcc, clang, MSVC) with OpenMP support.

## Testing

The library includes unit tests verifying correctness across all implementations:
- Run: `./test_matmul`
- Tests verify correctness against the reference scalar implementation
- Output shows PASS/FAIL status for each implementation (scalar, AVX2, AVX512, dispatched)

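The comparison against a scalar reference can be pictured as follows. This is a sketch of the general technique, not the contents of the test suite: `ref_matmul_f32`, `max_abs_diff`, and `outputs_match` are hypothetical names, row-major storage is assumed, and the tolerance is left to the caller because SIMD kernels may sum in a different order than the reference.

```c
#include <stddef.h>

/* Naive reference: C (m x p) = A (m x n) * B (n x p), row-major (assumed). */
static void ref_matmul_f32(size_t m, size_t n, size_t p,
                           const float *A, const float *B, float *C) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < p; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * p + j];
            C[i * p + j] = acc;
        }
    }
}

/* Largest absolute element-wise difference between two buffers. */
static float max_abs_diff(size_t count, const float *x, const float *y) {
    float worst = 0.0f;
    for (size_t i = 0; i < count; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}

/* Returns 1 (PASS) when every element of `got` is within `tol` of `want`,
 * and 0 (FAIL) otherwise. */
static int outputs_match(size_t count, const float *got, const float *want,
                         float tol) {
    return max_abs_diff(count, got, want) <= tol;
}
```

A test would fill `want` with `ref_matmul_f32`, fill `got` with the implementation under test, and report the result of `outputs_match` for each variant.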
## Implementation Notes

- **Automatic dispatch**: On the first call, the library detects CPU features at runtime and selects the fastest available implementation for the given types
- **Dispatch priority**:
    - `u8_i8_u8`: AVX512-VNNI → Scalar
    - `f32_f32_f32`: AVX512 → AVX2 → Scalar
    - `f64_f64_f64`: AVX512 → AVX2 → Scalar
- **Parallelization**: OpenMP `parallel for` with `static` scheduling across row blocks
- **Tiling**: Blocking factors tuned for L1/L2 cache (ib=32/64, jb=64, kb=32/64 depending on SIMD width)

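The parallelization and tiling notes can be illustrated with a scalar sketch. It borrows the blocking factors ib=32, jb=64, kb=32 from the list above, but the loop order, the function name `tiled_matmul_f32`, and the row-major layout are assumptions; the library's real kernels additionally use SIMD instructions.

```c
#include <stddef.h>

/* Blocking factors taken from the note above (ib=32, jb=64, kb=32). */
enum { IB = 32, JB = 64, KB = 32 };

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C (m x p) = A (m x n) * B (n x p); row-major storage is assumed. */
static void tiled_matmul_f32(size_t m, size_t n, size_t p,
                             const float *A, const float *B, float *C) {
    /* Signed block counter: older OpenMP versions need a signed index. */
    ptrdiff_t row_blocks = (ptrdiff_t)((m + IB - 1) / IB);

    /* Each iteration owns a distinct block of rows of C, so threads never
     * write the same element. Without -fopenmp the pragma is ignored. */
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t blk = 0; blk < row_blocks; blk++) {
        size_t i0 = (size_t)blk * IB;
        size_t i1 = min_size(i0 + IB, m);

        /* Clear this block's rows before accumulating into them. */
        for (size_t i = i0; i < i1; i++)
            for (size_t j = 0; j < p; j++)
                C[i * p + j] = 0.0f;

        for (size_t k0 = 0; k0 < n; k0 += KB) {
            size_t k1 = min_size(k0 + KB, n);
            for (size_t j0 = 0; j0 < p; j0 += JB) {
                size_t j1 = min_size(j0 + JB, p);
                /* Multiply an IB x KB tile of A by a KB x JB tile of B. */
                for (size_t i = i0; i < i1; i++) {
                    for (size_t k = k0; k < k1; k++) {
                        float a = A[i * n + k];
                        for (size_t j = j0; j < j1; j++)
                            C[i * p + j] += a * B[k * p + j];
                    }
                }
            }
        }
    }
}
```

Working tile by tile keeps the touched parts of A, B, and C small enough to stay in cache while they are reused, which is the locality benefit the tiling note refers to.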
## License

Licensed under custom terms (Copyright 2026 finwo); see LICENSE.md for full details.

---

*Built with C11. Zero runtime overhead for type dispatch.*