
Low performance for mixed-type matmul due to lack of pipelining
Mar 23, 2023 · Profiling the two kernels with Nsight Compute seems to confirm that the lack of pipelining and the layout conversions are the issue: there are additional memory transfers through …
How to avoid nans when multiplying by negative infinity? #2673
Nov 16, 2023 · I'm trying to implement attention, and have my masked scores become negative infinity: keys = tl.load(scratchpad_key_ptr + kv_offs, mask=mask, other=-float('inf')) # keys …
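A minimal sketch of why this produces NaNs, written in plain Python rather than Triton so it runs without a GPU. It assumes the padded keys are later multiplied by a query that contains a zero; `dot`, `q`, `k_padded_inf`, `k_padded_zero`, and `valid` are hypothetical names, not taken from the issue. One common workaround, shown here as an assumption rather than as the fix adopted in the thread, is to pad the loaded values with 0 and apply the negative-infinity mask to the score after the multiplication.

```python
import math

NEG_INF = float("-inf")

def dot(q, k):
    # Plain dot product; stands in for the multiply-accumulate inside a kernel.
    return sum(a * b for a, b in zip(q, k))

q = [0.0, 1.0]  # hypothetical query containing a zero

# Padding the masked-out key with -inf: 0.0 * -inf is NaN under IEEE 754,
# and the NaN then propagates through the sum.
k_padded_inf = [NEG_INF, NEG_INF]
bad_score = dot(q, k_padded_inf)

# Padding with 0 instead, then masking the *score* after the product.
k_padded_zero = [0.0, 0.0]
valid = False  # this key position is masked out
ok_score = dot(q, k_padded_zero) if valid else NEG_INF

print(math.isnan(bad_score), ok_score)
```

In a Triton kernel the analogous pattern would be loading with `other=0.0` and masking the scores with `tl.where` afterwards; whether that fits the kernel in the issue depends on details the snippet elides.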
kernels/models/llama/llama/math_ops.py at main - GitHub
[PROTON][CUPTI_PCSAMPLING]: RuntimeError: Failed to execute …
Oct 8, 2024 · - stalled_dispatch_stall - stalled_drain - stalled_imc_miss - stalled_lg_throttle - stalled_long_scoreboard - stalled_math_pipe_throttle - stalled_membar - stalled_mio_throttle - …
feat(examples): implement vllm in triton · Issue #2200 - GitHub
Aug 28, 2023 · vLLM uses paged memory and has kernels that perform the generation part of causal inference. The computation pattern of the generation part - single Q for entire seq len of KV …
Precision issue in Triton kernel: zero gradients for k and v ... - GitHub
Sep 10, 2024 · Description: In our RWKV6 implementation using Triton for CUDA, we've discovered a critical precision-related issue in the fused_recurrent_rwkv6_bwd_kernel_dkv …
[BUG] error load fp32 value from 2D tensor #4351 - GitHub
Jul 18, 2024 · Hello, I am training a model with a Triton kernel, but NaNs appear in the backward pass. I exported the intermediate data and found that tl.load does not return the same value as outside the kernel. …