I am currently examining the execution of MPI_Send (blocking send) with UCX in an intra-node scenario. At present, the memory copy (ucs_memcpy_relaxed()) is executed in the receiver process (rank), as depicted below.
If the same copy were executed in the sender process instead, as shown below, we could significantly reduce cache-to-cache data transfers and conserve memory bandwidth.
However, I am struggling to find a runtime configuration that would allow me to execute this transfer in the sender process with the hint UCS_ARCH_MEMCPY_NT_DEST and benchmark it. Could anyone provide some guidance or suggestions on this matter?
Thank you in advance for your assistance.
--Arun
Currently, the rkey_ptr protocol always does the memcpy on the receiver. To do the memcpy on the sender, we would need to implement a new variant of this protocol (with an extra control message).
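To make the idea of an extra control message concrete, here is a minimal single-process sketch of what such a handshake might look like. It is not UCX code: the message types (`CTRL_COPY_REQ`, `CTRL_COPY_DONE`), the `ctrl_msg_t` layout, and both helper functions are hypothetical names invented for illustration, and the sketch assumes the receiver's buffer is already mapped into the sender's address space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical control messages; these names are illustrative, not UCX's. */
typedef enum {
    CTRL_NONE,
    CTRL_COPY_REQ,  /* receiver -> sender: "copy into this buffer" */
    CTRL_COPY_DONE  /* sender -> receiver: "the data is in place"  */
} ctrl_type_t;

typedef struct {
    ctrl_type_t type;
    void       *dst;    /* receiver buffer, assumed mapped in the sender */
    size_t      length; /* bytes requested (REQ) or bytes copied (DONE)  */
} ctrl_msg_t;

/* Receiver side: instead of copying, publish where the data should go. */
static ctrl_msg_t receiver_post_copy_req(void *recv_buf, size_t length)
{
    ctrl_msg_t msg = { CTRL_COPY_REQ, recv_buf, length };
    return msg;
}

/* Sender side: perform the copy locally, then answer with COPY_DONE. */
static ctrl_msg_t sender_handle_copy_req(const ctrl_msg_t *req,
                                         const void *send_buf,
                                         size_t send_len)
{
    ctrl_msg_t reply = { CTRL_COPY_DONE, req->dst, 0 };

    assert(req->type == CTRL_COPY_REQ);
    /* Never write past the buffer the receiver advertised. */
    reply.length = (send_len < req->length) ? send_len : req->length;
    /* Plain memcpy here; a non-temporal-store copy could be swapped in. */
    memcpy(req->dst, send_buf, reply.length);
    return reply;
}
```

In a real transport the two messages would travel over the shared-memory channel between processes, and the receiver would complete its request only after seeing the `CTRL_COPY_DONE` reply. That extra round trip is the cost of moving the copy to the sender.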
@arun-chandran-edarath, in case you want more details: I have not thought about this much and am unsure about the performance result, but it might be possible to implement this as a PoC at either:
UCT: src/uct/sm/mm/base/mm_*.c: maybe add a return FIFO that receives aggregated src/dst/len entries, so that the memcpy is performed at the original source (the sender)
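As a rough illustration of the return-FIFO idea, the sketch below queues src/dst/len descriptors and lets the sending side drain them and do the copies. Everything in it (`ret_fifo_t`, `copy_desc_t`, `ret_fifo_push`, `ret_fifo_progress`, the FIFO size) is a hypothetical name chosen for this example and is not taken from the UCT mm sources. It is also single-threaded: a real cross-process FIFO in shared memory would need atomics or memory barriers around the head and tail updates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RET_FIFO_SIZE 8u /* hypothetical depth; must be a power of two */

/* One aggregated copy request: where to read, where to write, how much. */
typedef struct {
    const void *src;
    void       *dst;
    size_t      len;
} copy_desc_t;

typedef struct {
    copy_desc_t elems[RET_FIFO_SIZE];
    unsigned    head; /* next slot to fill (written by the receiver) */
    unsigned    tail; /* next slot to drain (read by the sender)     */
} ret_fifo_t;

/* Receiver side: enqueue a descriptor. Returns 0 on success, -1 if full. */
static int ret_fifo_push(ret_fifo_t *f, const void *src, void *dst, size_t len)
{
    assert(f != NULL);
    if (f->head - f->tail == RET_FIFO_SIZE) {
        return -1; /* FIFO full, caller should retry later */
    }
    copy_desc_t *d = &f->elems[f->head % RET_FIFO_SIZE];
    d->src = src;
    d->dst = dst;
    d->len = len;
    f->head++;
    return 0;
}

/* Sender side: drain all pending descriptors and perform the copies.
 * Returns the number of descriptors processed. */
static unsigned ret_fifo_progress(ret_fifo_t *f)
{
    unsigned count = 0;

    assert(f != NULL);
    while (f->tail != f->head) {
        const copy_desc_t *d = &f->elems[f->tail % RET_FIFO_SIZE];
        memcpy(d->dst, d->src, d->len); /* copy runs at the original source */
        f->tail++;
        count++;
    }
    return count;
}
```

Draining several descriptors per progress call is what "aggregated" is taken to mean here. Whether that batching pays off in practice would have to be measured.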
Thank you for your responses. I would like to clarify if the two suggestions provided are identical:
a) Implementing a new variant of the rkey_ptr protocol (with an extra control message) to perform memcpy on the sender.
b) Using an rndv rtr flow with sm/mm put primitives in UCP.
Could you please provide more specific details or elaborate on these suggestions? Additionally, it would be helpful if you could point me towards the relevant source code files or any examples that I could refer to.