

The TraceEvent instance includes the PTX instruction's internal representation, a bit vector identifying active threads that will execute it, block ID, and a vector of memory addresses referenced by the instruction if it is a load, store, or texture sampler. TraceGenerator::event() is called before the instruction commits but after addresses into memory have been computed, so trace generators such as MemoryChecker may ensure subsequent memory accesses are valid. After the instruction commits, the PTX emulator calls each trace generator's TraceGenerator::postEvent() method, passing the same TraceEvent instance. This gives trace generators the opportunity to observe the result of an instruction and react accordingly.

Listing 30.2 TraceEvent Class Declaration

! \ file ocelot/trace/interface/TraceEvent.
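As a rough illustration of the event()/postEvent() split described above, the following self-contained sketch checks addresses before an instruction commits and counts instructions after it. It is not the declaration from Listing 30.2 and does not use Ocelot's headers: the `TraceEvent`, `Dim3`, `Allocation`, and `BoundsChecker` types, their field names, and the assumption of at most one address per thread are all hypothetical stand-ins chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the fields named in the text; not Ocelot's types.
struct Dim3 { unsigned x, y, z; };

struct TraceEvent {
    std::string instruction;                    // stand-in for the PTX instruction's internal representation
    std::vector<bool> active;                   // one bit per thread: will it execute the instruction?
    Dim3 blockId;                               // block that issued the instruction
    std::vector<std::uint64_t> memoryAddresses; // filled only for loads, stores, and texture samples
};

// Hypothetical description of one device allocation.
struct Allocation { std::uint64_t base; std::uint64_t size; };

class BoundsChecker {
public:
    explicit BoundsChecker(std::vector<Allocation> allocations)
        : allocations_(std::move(allocations)) {}

    // Before commit: addresses are known, but the access has not happened yet.
    void event(const TraceEvent& ev) {
        // Assumption of this sketch: at most one address per thread.
        assert(ev.memoryAddresses.size() <= ev.active.size());
        for (std::uint64_t addr : ev.memoryAddresses) {
            if (!isMapped(addr)) ++violations_;
        }
    }

    // After commit: the same event is seen again, so results can be observed.
    void postEvent(const TraceEvent&) { ++committed_; }

    unsigned violations() const { return violations_; }
    unsigned committed() const { return committed_; }

private:
    // Checks only the first byte of each access; access width is ignored here.
    bool isMapped(std::uint64_t addr) const {
        for (const Allocation& a : allocations_) {
            if (addr >= a.base && addr - a.base < a.size) return true;
        }
        return false;
    }

    std::vector<Allocation> allocations_;
    unsigned violations_ = 0;
    unsigned committed_ = 0;
};
```

With one 256-byte allocation at 0x1000, an event carrying the addresses 0x1000, 0x10fc, and 0x1100 would register a single violation for the last address, which lies one byte past the end of the allocation.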
The entire state of the application is observable, and static analysis of the kernel may be completed at this point. Trace generators may, for instance, count the number of static floating-point arithmetic instructions versus memory instructions or examine the sizes of global memory allocations. As another example, the number of live registers may be counted as well as the number of synchronization points. Trace generators may update their own private data members to store the results of this analysis but should not modify the kernel or the application.

During computation, each instruction is fetched, and a TraceEvent instance is dispatched to attached TraceGenerator instances by calling their TraceGenerator::event() methods.
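The static analysis described above can be sketched as a read-only pass over the kernel that stores its results in private members. This is a minimal illustration under stated assumptions: `OpClass`, `Instruction`, `Kernel`, `StaticMixCounter`, and `makeExampleKernel()` are hypothetical and far simpler than Ocelot's internal PTX representation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, heavily simplified instruction categories.
enum class OpClass { FloatArith, Memory, Barrier, Other };
struct Instruction { OpClass cls; };
struct Kernel { std::vector<Instruction> instructions; };

class StaticMixCounter {
public:
    // The kernel is taken by const reference: the generator observes it
    // but never modifies it, and keeps results in its own members.
    void initialize(const Kernel& kernel) {
        floatOps_ = memoryOps_ = barriers_ = 0;
        for (const Instruction& inst : kernel.instructions) {
            switch (inst.cls) {
                case OpClass::FloatArith: ++floatOps_;  break;
                case OpClass::Memory:     ++memoryOps_; break;
                case OpClass::Barrier:    ++barriers_;  break;
                case OpClass::Other:      break;
            }
        }
    }

    std::size_t floatOps() const { return floatOps_; }
    std::size_t memoryOps() const { return memoryOps_; }
    std::size_t barriers() const { return barriers_; }

    // Static floating-point instructions per static memory instruction.
    double floatToMemoryRatio() const {
        assert(memoryOps_ > 0 && "ratio is undefined without memory instructions");
        return static_cast<double>(floatOps_) / static_cast<double>(memoryOps_);
    }

private:
    std::size_t floatOps_ = 0;
    std::size_t memoryOps_ = 0;
    std::size_t barriers_ = 0;
};

// Builds a six-instruction toy kernel for trying the counter out.
inline Kernel makeExampleKernel() {
    Kernel k;
    for (OpClass c : {OpClass::Memory, OpClass::FloatArith, OpClass::FloatArith,
                      OpClass::Barrier, OpClass::Memory, OpClass::Other}) {
        k.instructions.push_back(Instruction{c});
    }
    return k;
}
```

Calling `initialize()` on the toy kernel yields two floating-point instructions, two memory instructions, and one synchronization point, while the kernel itself is left untouched.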
Trace generators implement event handlers corresponding to each of the three phases. At launch configuration, the values of kernel parameters and grid dimensions are known. These are presented to each trace generator by calling its initialize() method and passing the internal representation of the PTX kernel to be executed as well as constant references to the entire structure of loaded modules and CUDA-managed resources.
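To make the order of handler calls concrete, here is a small self-contained sketch of a driver that walks through the three phases. It only mimics the shape of the interface: `KernelInfo`, the reduced `TraceEvent`, `PhaseRecorder`, and `runKernel()` are hypothetical, and the name `finish()` for the finalization handler is an assumption, since the text above names only initialize(), event(), and postEvent().

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins; the real kernel and event types carry far more state.
struct KernelInfo { std::string name; unsigned gridX, gridY, gridZ; };
struct TraceEvent { unsigned pc; };

class TraceGenerator {
public:
    virtual ~TraceGenerator() = default;
    virtual void initialize(const KernelInfo&) {}  // launch configuration
    virtual void event(const TraceEvent&) {}       // computation, before commit
    virtual void postEvent(const TraceEvent&) {}   // computation, after commit
    virtual void finish() {}                       // finalization (name assumed)
};

// Records every handler call so the sequence can be inspected afterwards.
class PhaseRecorder : public TraceGenerator {
public:
    std::vector<std::string> calls;
    void initialize(const KernelInfo& k) override { calls.push_back("initialize:" + k.name); }
    void event(const TraceEvent& e) override { calls.push_back("event:" + std::to_string(e.pc)); }
    void postEvent(const TraceEvent& e) override { calls.push_back("postEvent:" + std::to_string(e.pc)); }
    void finish() override { calls.push_back("finish"); }
};

// Toy driver: one initialize per launch, an event/postEvent pair per
// instruction, and one finish once the kernel is done.
inline void runKernel(const KernelInfo& kernel, unsigned instructionCount,
                      const std::vector<TraceGenerator*>& generators) {
    for (TraceGenerator* g : generators) {
        assert(g != nullptr);
        g->initialize(kernel);
    }
    for (unsigned pc = 0; pc < instructionCount; ++pc) {
        TraceEvent ev{pc};
        for (TraceGenerator* g : generators) g->event(ev);
        // The instruction would commit here in a real emulator.
        for (TraceGenerator* g : generators) g->postEvent(ev);
    }
    for (TraceGenerator* g : generators) g->finish();
}
```

Running a two-instruction kernel through `runKernel()` with a single `PhaseRecorder` attached produces six recorded calls: one initialize, two event/postEvent pairs, and one finish.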

Figure 30.3. Sequence of TraceGenerator events in execution of PTX kernel.

Execution can be partitioned into three phases: launch configuration, computation, and finalization.

Instances may be added and removed at runtime by CUDA applications using the Ocelot API function ocelot::addTraceGenerator() to monitor specific kernels. Alternatively and less intrusively, they may be added to Ocelot's trace-generators subproject, which constructs and adds them when Ocelot initializes. This method does not require modifications to CUDA application source code and is the approach taken with each of the trace generators discussed in Section 30.3.1.
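The runtime approach, attaching a generator only around the kernels of interest, can be sketched with a toy registry. The sketch deliberately avoids Ocelot's headers: the `sketch` namespace, `Registry`, `KernelLog`, and `monitorOnlySecondKernel()` are hypothetical, and `Registry::add()` merely plays the role that ocelot::addTraceGenerator() plays in a real application.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical stand-in for a trace generator; not Ocelot's class.
struct TraceGenerator {
    virtual ~TraceGenerator() = default;
    virtual void onKernel(const std::string& kernelName) = 0;
};

// Hypothetical registry standing in for the runtime's list of attached generators.
class Registry {
public:
    void add(TraceGenerator& g) {
        // Attaching the same generator twice would double-count its events.
        assert(std::find(generators_.begin(), generators_.end(), &g) == generators_.end());
        generators_.push_back(&g);
    }
    void remove(TraceGenerator& g) {
        generators_.erase(std::remove(generators_.begin(), generators_.end(), &g),
                          generators_.end());
    }
    // Notifies every currently attached generator of a kernel launch.
    void launch(const std::string& kernelName) {
        for (TraceGenerator* g : generators_) g->onKernel(kernelName);
    }

private:
    std::vector<TraceGenerator*> generators_;
};

// Remembers the names of the kernels it was attached for.
struct KernelLog : TraceGenerator {
    std::vector<std::string> seen;
    void onKernel(const std::string& kernelName) override { seen.push_back(kernelName); }
};

// Attaches the generator only around the middle of three launches.
inline std::vector<std::string> monitorOnlySecondKernel() {
    Registry registry;
    KernelLog log;
    registry.launch("setup");     // not monitored
    registry.add(log);
    registry.launch("compute");   // monitored
    registry.remove(log);
    registry.launch("teardown");  // not monitored
    return log.seen;
}

}  // namespace sketch
```

The trade-off mirrors the one in the text: attaching at runtime gives per-kernel control but requires edits to the application, whereas registering the generator once at initialization needs no application changes and observes every kernel.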

Sudhakar Yalamanchili, in GPU Computing Gems Jade Edition, 2012

30.2.2 Extensible Trace Generation Framework

GPU Ocelot's trace generation and analysis framework presents a clear and concise interface to user-extensible trace generators, which are the preferred method to instrument and profile GPU applications. In this section, we will explain how trace generators are invoked by the PTX emulator and discuss the scope of information that is available. Trace generators are derived from the C++ class in Listing 30.1.
