Skip to content
Jin Wang edited this page Mar 27, 2015 · 1 revision

Introduction

Starting with 0.4.46 Ocelot includes a memory checker that will detect when either a CUDA runtime function or a thread in a PTX kernel accesses a memory region not associated with the currently selected device.

Disabling Checking

Memory checking comes on by default after 0.4.46. However, it does introduce a fairly significant performance overhead 1.5-3x slowdown depending on how many memory regions are mapped and how many memory operations your program does. To turn it off

  1. Edit ocelot/executive/implementation/CooperativeThreadArray.cpp
  2. Set CHECK_GLOBAL_ACCESSES to 0

Interpreting Memory Errors

The following example shows a program that makes 3 different kind of memory errors and the output from each when run through Ocelot.

Example CUDA Program

/*!

*/

#include <string>

__global__ void init(int* data)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	data[tid] = tid;
}

void wrongMemcpyDirection()
{
	int* hostA = new int[128];
	memset(hostA, 0, sizeof(int)*128);
	int* deviceA;
	cudaMalloc( (void**) &deviceA, sizeof(int)*128 );
	cudaMemcpy( hostA, deviceA, sizeof(int)*128, cudaMemcpyHostToDevice );
	cudaFree( deviceA );
	delete[] hostA;
}

void hostPointerOnDevice()
{
	int* hostA = new int[128];
	memset(hostA, 0, sizeof(int)*128);
	int* deviceA;
	cudaMalloc( (void**) &deviceA, sizeof(int)*128 );
	cudaMemcpy( deviceA, hostA, sizeof(int)*128, cudaMemcpyHostToDevice );
	init<<< 4, 32 >>>( hostA );
	cudaFree( deviceA );
	delete[] hostA;
}

void bufferOverrun()
{
	int* hostA = new int[128];
	memset(hostA, 0, sizeof(int)*128);
	int* deviceA;
	cudaMalloc( (void**) &deviceA, sizeof(int)*128 );
	cudaMemcpy( deviceA, hostA, sizeof(int)*128, cudaMemcpyHostToDevice );
	init<<< 4, 33 >>>( deviceA );
	cudaFree( deviceA );
	delete[] hostA;	
}

int main( int argc, char** argv )
{
	if( argc > 1 )
	{
		if( std::string(argv[1]) == "0" )
		{
			wrongMemcpyDirection();
		}
		else if( std::string(argv[1]) == "1" )
		{
			hostPointerOnDevice();
		}
		else if( std::string(argv[1]) == "2" )
		{
			bufferOverrun();
		}
	}
	else
	{
		wrongMemcpyDirection();
	}
	
	return 0;
}

Building

How I built the example on my machine.

nvcc --cuda memoryErrors.cu 
g++ -o memoryErrors memoryErrors.cu.cpp `OcelotConfig -l`

Wrong Memcpy Direction Error

Running the first test, tries to do a cudaMemcpyDeviceToHost with the pointers reversed.

./memoryErrors 0

terminate called after throwing an instance of 'hydrazine::Exception'
  what():  Invalid destination 0x1d4da50 ( 512bytes) in host to device memcpy.
Device 0 : Ocelot PTX Emulator
 Nearby Global Variable Allocations
  No Allocations.

 Nearby Device Memory Allocations
  [0x1d4dd20] -  [0x1d4df20] (512 bytes) (global)
   ****0x1d4da50****

Aborted

Host pointer in device function

The second test passes a host pointer to a device function.

==Ocelot== Emulator failed to run kernel "_Z4initPi" with exception: 
==Ocelot== [PC 9] [thread 0] [cta 0] st.global.s32 [%rd4 + 0], %r3 - Global memory address 0x1921a50 of size 4 is out of any allocated or mapped range.
==Ocelot== Memory Map:
==Ocelot== Device 0 : Ocelot PTX Emulator
==Ocelot==  Nearby Global Variable Allocations
==Ocelot==   No Allocations.
==Ocelot== 
==Ocelot==  Nearby Device Memory Allocations
==Ocelot==   [0x1921d20] -  [0x1921f20] (512 bytes) (global)
==Ocelot==    ****0x1921a50****
==Ocelot== 

Buffer Overrun

The final test has device threads write off the end of an allocated region.

==Ocelot== Emulator failed to run kernel "_Z4initPi" with exception: 
==Ocelot== [PC 9] [thread 29] [cta 3] st.global.s32 [%rd4 + 0], %r3 - Global memory address 0xf52f20 of size 4 is out of any allocated or mapped range.
==Ocelot== Memory Map:
==Ocelot== Device 0 : Ocelot PTX Emulator
==Ocelot==  Nearby Global Variable Allocations
==Ocelot==   No Allocations.
==Ocelot== 
==Ocelot==  Nearby Device Memory Allocations
==Ocelot==   [0xf52d20] -  [0xf52f20] (512 bytes) (global)
==Ocelot==    ****0xf52f20****
==Ocelot==