A General-purpose Task-parallel Programming System using Modern C++
Sample codes for my CUDA programming book
🎉 Modern CUDA Learn Notes with PyTorch: CUDA Cores, Tensor Cores, fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, hgemm, sgemv, warp/block reduce, elementwise, softmax, layernorm, rmsnorm.
CUDA Core Compute Libraries
Thin, unified, C++-flavored wrappers for the CUDA APIs
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
TinyChatEngine: On-Device LLM Inference Library
Safe Rust wrapper around the CUDA toolkit
🚀 Your YOLO Deployment Powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.
A simple GPU hash table implemented in CUDA using lock-free techniques
A self-learning tutorial for CUDA High Performance Programming.
This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010
From zero to hero CUDA for accelerating maths and machine learning on GPU.
μ-Cuda, COVER THE LAST MILE OF CUDA. With features: intellisense-friendly, structured launch, automatic cuda graph generation and updating.
An implementation of HIP that works on CPUs, across OSes.
CUDA kernel author's tools
Install CUDA on Windows11 using WSL2
Speed up image preprocessing with CUDA when handling images or running TensorRT inference