  1. Low performance for mixed-type matmul due to lack of pipelining

    Mar 23, 2023 · Profiling the two kernels through NSight Compute seems to support that the lack of pipelining and layout conversions are the issue: there are additional memory transfers through …

  2. How to avoid nans when multiplying by negative infinity? #2673

    Nov 16, 2023 · I'm trying to implement attention, and have my masked scores become negative infinity: keys = tl.load(scratchpad_key_ptr + kv_offs, mask=mask, other=-float('inf')) # keys …

  3. kernels/models/llama/llama/math_ops.py at main - GitHub

    Contribute to triton-lang/kernels development by creating an account on GitHub.

  4. [PROTON][CUPTI_PCSAMPLING]: RuntimeError: Failed to execute …

    Oct 8, 2024 · - stalled_dispatch_stall - stalled_drain - stalled_imc_miss - stalled_lg_throttle - stalled_long_scoreboard - stalled_math_pipe_throttle - stalled_membar - stalled_mio_throttle - …

  5. feat(examples): implement vllm in triton · Issue #2200 - GitHub

    Aug 28, 2023 · Vllm uses paged memory and has kernels that perform the generation part of the causal inference. The computation pattern of generation part - single Q for entire seq len of KV …

  6. Precision issue in Triton kernel: zero gradients for k and v ... - GitHub

    Sep 10, 2024 · Description In our RWKV6 implementation using Triton for CUDA, we've discovered a critical precision-related issue in the fused_recurrent_rwkv6_bwd_kernel_dkv …

  7. [BUG] error load fp32 value from 2D tensor #4351 - GitHub

    Jul 18, 2024 · Hello, I am training a model with triton kernel, but it comes NaN in the backward. I export the intermediate data and find that tl.load cannot keep the value as outside the kernel. …