README.md (5163B)
# Matmul - Accelerated Matrix Multiplication Library

A lightweight, type-safe matrix multiplication library with runtime dispatch and SIMD acceleration.

**Current Implementations:**
- `uint8_t` × `int8_t` → `uint8_t`
- `float` × `float` → `float`
- `double` × `double` → `double`

## Installation

This library is installable through the [dep](https://github.com/finwo/dep) package manager. If you use [dep-repository](https://github.com/finwo/dep-repository), install it with `dep add finwo/matmul`; otherwise, add the following line to your `.dep` file:

```
finwo/matmul https://git.finwo.net/lib/matmul.c/archives/heads/main.tar.gz
```

Alternatively, you can include the [matmul.c](src/matmul.c) and [matmul.h](src/matmul.h) files in your project directly.

## Features

- **SIMD Acceleration**: AVX2, AVX512, AVX-VNNI, and AVX512-VNNI with automatic CPU feature detection
- **Multi-Core Parallelization**: OpenMP-based parallel processing across matrix rows
- **Tiled Algorithm**: Cache-blocking for improved locality at large matrix sizes
- **No Dependencies**: Pure C11 implementation with no external libraries
- **Well Tested**: Unit tests verify correctness across all implementations

## Quick Start

### Quantized Multiplication (u8 × i8 → u8)
```c
#include "finwo/matmul.h"
#include <stdio.h>

int main(void) {
    // 2x3 matrix A (uint8_t)
    uint8_t A[6] = {1, 2, 3, 4, 5, 6};
    // 3x2 matrix B (int8_t)
    int8_t B[6] = {1, 0, 0, 1, 0, 0};
    // 2x2 result matrix C (uint8_t)
    uint8_t C[4];

    // Multiply A(2x3) * B(3x2) = C(2x2)
    matmul(2, 3, 2, A, B, C, 0.0);  // scale=0: no scaling

    printf("C[0] = %u, C[1] = %u, C[2] = %u, C[3] = %u\n",
           C[0], C[1], C[2], C[3]);

    return 0;
}
```

### Floating Point Multiplication (f32 × f32 → f32)
```c
#include "finwo/matmul.h"
#include <stdio.h>

int main(void) {
    float A[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // 2x2
    float B[4] = {5.0f, 6.0f, 7.0f, 8.0f};  // 2x2
    float C[4];

    matmul(2, 2, 2, A, B, C, 1.0);

    printf("C[0] = %f\n", C[0]);
    return 0;
}
```

Compile with: `cc -o example example.c -lm -fopenmp`

## API Reference

### Generic Macro (Recommended)
```c
matmul(m, n, p, A, B, C, scale);
```
- Automatically selects the correct function based on the types of A, B, and C
- `scale`: Divide each output element by this value before writing (0 or 1 = no scaling)

### Direct Function Calls
```c
// Implemented type combinations
int matmul_u8_i8_u8(size_t m, size_t n, size_t p,
                    const uint8_t *A, const int8_t *B,
                    uint8_t *C, double scale);

int matmul_f32_f32_f32(size_t m, size_t n, size_t p,
                       const float *A, const float *B,
                       float *C, double scale);

int matmul_f64_f64_f64(size_t m, size_t n, size_t p,
                       const double *A, const double *B,
                       double *C, double scale);

// Scalar and SIMD variants (internal/specialized)
int matmul_scalar_u8_i8_u8(...);
int matmul_avx512vnni_u8_i8_u8(...);
int matmul_scalar_f32_f32_f32(...);
int matmul_avx2_f32_f32_f32(...);
int matmul_avx512_f32_f32_f32(...);
```

### Type Naming Conventions
| Shorthand | C Type    | Description            |
|-----------|-----------|------------------------|
| `f32`     | `float`   | 32-bit floating point  |
| `f64`     | `double`  | 64-bit floating point  |
| `i8`      | `int8_t`  | Signed 8-bit integer   |
| `u8`      | `uint8_t` | Unsigned 8-bit integer |

Function names follow the pattern: `matmul_{A_type}_{B_type}_{C_type}`

### Scale Parameter
The `scale` parameter enables quantization-aware multiplication:
- `scale = 0` or `scale = 1`: No scaling (write raw result)
- `scale = 64`: Divide output by 64 before writing
- Useful for fixed-point emulation: with `scale = 64`, a full-scale A input can be combined with an `int8_t` B input that represents -2..1.984375 instead of -128..127

Implementation:
```c
if (scale != 0 && scale != 1) {
    result = result / scale;
}
```

## Building

```bash
# Compile library and tests
make

# Run tests
./test_matmul

# Run benchmarks (optional, needs ~800MB RAM for 16K×16K)
./benchmark

# Clean build artifacts
make clean
```

Requires a C11 compiler (gcc, clang, MSVC) with OpenMP support.

## Testing

The library includes unit tests verifying correctness across all implementations:
- Run: `./test_matmul`
- Tests verify correctness against the reference scalar implementation
- Output shows PASS/FAIL status for each implementation (scalar, AVX2, AVX512, dispatched)

## Implementation Notes

- **Automatic dispatch**: The first call detects CPU features at runtime and selects the optimal implementation for the given types
- **Dispatch priority**:
  - `u8_i8_u8`: AVX512-VNNI → Scalar
  - `f32_f32_f32`: AVX512 → AVX2 → Scalar
  - `f64_f64_f64`: AVX512 → AVX2 → Scalar
- **Parallelization**: OpenMP `parallel for` with `static` scheduling across row blocks
- **Tiling**: Blocking factors tuned for L1/L2 cache (ib=32/64, jb=64, kb=32/64 depending on SIMD width)

## License

Licensed under custom terms (Copyright 2026 finwo); see LICENSE.md for full details.

---

*Built with C11. Zero runtime overhead for type dispatch.*
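The scale rule from the Scale Parameter section can be illustrated with a plain scalar loop. This is only a sketch under stated assumptions, not the library's implementation: the name `ref_matmul_u8_i8_u8` is hypothetical, row-major storage is assumed (A is m×n, B is n×p, C is m×p), and both the clamp to 0..255 and the truncating conversion are guesses about behavior this README does not specify. The return value of 0 is likewise a placeholder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference loop; not part of the library's API. */
int ref_matmul_u8_i8_u8(size_t m, size_t n, size_t p,
                        const uint8_t *A, const int8_t *B,
                        uint8_t *C, double scale) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < p; j++) {
            /* Accumulate in a wider type than the 8-bit inputs. */
            int32_t acc = 0;
            for (size_t k = 0; k < n; k++) {
                acc += (int32_t)A[i * n + k] * (int32_t)B[k * p + j];
            }

            /* Scale rule as documented: 0 and 1 both mean "no scaling". */
            double result = (double)acc;
            if (scale != 0 && scale != 1) {
                result = result / scale;
            }

            /* Assumption: saturate to the uint8_t range, then truncate. */
            if (result < 0.0)   result = 0.0;
            if (result > 255.0) result = 255.0;
            C[i * p + j] = (uint8_t)result;
        }
    }
    return 0; /* Assumption: 0 signals success. */
}
```

With `scale = 64`, a 1×1 A of `{200}` and a 1×2 B of `{32, -64}` (0.5 and -1.0 in the fixed-point reading) produce 100 and 0 in this sketch; the 0 comes from the assumed clamp, since the unclamped value would be -200.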
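The tiling and parallelization notes under Implementation Notes can be pictured with a small cache-blocked kernel. The sketch below is an assumption-laden illustration rather than the library's code: `tiled_matmul_f32`, `min_sz`, and the constants `IB`, `JB`, `KB` are hypothetical names, the block sizes are simply one of the combinations listed above, storage is assumed row-major, and no SIMD or scaling is applied.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical blocking factors (rows, columns, inner dimension). */
enum { IB = 32, JB = 64, KB = 32 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical tiled kernel: C (m x p) = A (m x n) * B (n x p). */
void tiled_matmul_f32(size_t m, size_t n, size_t p,
                      const float *A, const float *B, float *C) {
    /* Start from zero so that tiles can accumulate into C. */
    memset(C, 0, m * p * sizeof *C);

    /* Each row block touches a disjoint slice of C, so row blocks can
       run in parallel. The pragma is ignored without -fopenmp. */
    #pragma omp parallel for schedule(static)
    for (size_t i0 = 0; i0 < m; i0 += IB) {
        size_t i1 = min_sz(i0 + IB, m);
        for (size_t k0 = 0; k0 < n; k0 += KB) {
            size_t k1 = min_sz(k0 + KB, n);
            for (size_t j0 = 0; j0 < p; j0 += JB) {
                size_t j1 = min_sz(j0 + JB, p);
                /* Multiply one IB x KB tile of A by one KB x JB tile of B. */
                for (size_t i = i0; i < i1; i++) {
                    for (size_t k = k0; k < k1; k++) {
                        float a = A[i * n + k];
                        for (size_t j = j0; j < j1; j++) {
                            C[i * p + j] += a * B[k * p + j];
                        }
                    }
                }
            }
        }
    }
}
```

The innermost loop walks contiguous elements of B and C, which is the access pattern a SIMD variant would typically vectorize; the `min_sz` calls handle matrices whose dimensions are not multiples of the block sizes.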