joemc wrote: woah you have a lot of memory allocation going on. allocate it all (or most of it) at once or put it on the stack. i bet you are having alot of cache misses.
Hmm, I don't know which version I have up on the thread here (ugh, this is going to get confusing discussing threading on a forum thread. Wow, I just used three participles in a row.)
In my current version, the SphereManager class performs this operation:
Code:
void CreateTestSpheres(UINT amount) //creates 'amount' of test spheres
{
    srand(GetTickCount());
    m_pSpheresSize = amount;
    m_pSpheres = new Sphere*[amount];
    for(UINT i = 0; i < amount; i++)
    {
        float mass = rand() / 32768.0f * 5.0f;
        SphereTemplate tmp(
            &D3DXVECTOR3(rand() / 32768.0f * 100.0f - 50.0f,
                         rand() / 32768.0f * 100.0f - 50.0f,
                         rand() / 32768.0f * 100.0f - 50.0f),
            &D3DXVECTOR3(0,0,0),
            &D3DXVECTOR3(1,0,0),
            &D3DXVECTOR3(0,1,0),
            mass,//1.0f,
            mass,//1.0f,
            rand() / 32768.0f * .157f * D3DX_PI,
            rand() / 32768.0f * D3DX_PI);
        //CreateSpheres(&tmp, 1);
        m_pSpheres[i] = new Sphere(&tmp);
    }
}
Are you suggesting that instead of making an array of Sphere pointers, I just make an array of Spheres? The purpose of using Sphere pointers was eventual support for class polymorphism, but right now I'm only interested in optimizing the threading. Is there a benefit to having all the objects in adjacent memory locations versus the current setup?
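For comparison, here is a rough sketch of what the "array of Spheres" variant might look like. The Sphere struct below is a hypothetical stand-in with only a position and a mass (not the real class), and Rand01 is a helper name made up for this example. The idea is that std::vector stores its elements contiguously, so all the spheres come from a single allocation and a loop over them walks through adjacent memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for the real Sphere class (assumption: the real
// class has more state; only a position and a mass are kept here).
struct Sphere {
    float x, y, z;
    float mass;
};

// Helper made up for this sketch: a pseudo-random float, roughly in [0, 1].
inline float Rand01() {
    return std::rand() / (RAND_MAX + 1.0f);
}

// Builds 'amount' spheres by value in one contiguous buffer,
// instead of one 'new Sphere' per element.
std::vector<Sphere> CreateTestSpheres(std::size_t amount) {
    std::vector<Sphere> spheres;
    spheres.reserve(amount);  // one heap allocation for every sphere
    for (std::size_t i = 0; i < amount; ++i) {
        Sphere s;
        s.x = Rand01() * 100.0f - 50.0f;
        s.y = Rand01() * 100.0f - 50.0f;
        s.z = Rand01() * 100.0f - 50.0f;
        s.mass = Rand01() * 5.0f;
        spheres.push_back(s);
    }
    assert(spheres.size() == amount);
    return spheres;
}

// A linear pass over the spheres touches adjacent memory,
// which is the access pattern the cache handles best.
float TotalMass(const std::vector<Sphere>& spheres) {
    float total = 0.0f;
    for (std::size_t i = 0; i < spheres.size(); ++i) {
        total += spheres[i].mass;
    }
    return total;
}
```

The trade-off is the one mentioned above: storing Spheres by value rules out putting derived classes in the same container, so this only fits as long as polymorphism isn't needed.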
I was reading up on context switching, and see some of the costliness to changing processes, but does this apply to a single process, free-threaded setup? (Actually, I may be confusing some of the wording used between "processes" and "processors", and between the threading models "apartment" and "free threaded".)
I also began reading up on the models for multiprocessing support, and read a bit about SMP and NUMA. Since this is going to be a DX11 app anyway, and thus running on newer hardware, I'll use the NUMA model. The differences between the two are summed up in this Windows SDK comment:
The traditional model for multiprocessor support is symmetric multiprocessor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.
System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.
In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.
So I can obtain the number of physical cores using GetLogicalProcessorInformation(), and per your suggestion create only as many threads as there are cores. According to the docs, and in line with your statements, there is no guarantee that all the cores will actually be used, because the scheduler makes those decisions, though it is possible to set process and thread affinity to particular processors.
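To make the "one thread per core" idea concrete, here is a minimal portable sketch. Note the swap: it uses std::thread::hardware_concurrency() from standard C++11 rather than GetLogicalProcessorInformation(), so it counts logical processors (not physical cores) and may return 0 when the count is unknown. WorkerCount and IntegrateParallel are names invented for this example, and the per-element update is just a placeholder for the real sphere work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Number of worker threads to start. hardware_concurrency() reports
// logical processors and is allowed to return 0, so fall back to 1.
inline unsigned WorkerCount() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}

// Placeholder update (assumption: the real work is per-sphere physics).
// Splits the range into one contiguous chunk per worker, so no two
// threads ever write to the same element.
void IntegrateParallel(std::vector<float>& positions,
                       const std::vector<float>& velocities,
                       float dt,
                       unsigned workers) {
    assert(positions.size() == velocities.size());
    assert(workers > 0);

    const std::size_t n = positions.size();
    const std::size_t chunk = (n + workers - 1) / workers;  // ceil(n / workers)

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // fewer elements than workers
        pool.emplace_back([&positions, &velocities, dt, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                positions[i] += velocities[i] * dt;
            }
        });
    }
    for (std::thread& t : pool) {
        t.join();  // wait for every chunk to finish
    }
}
```

Calling IntegrateParallel(positions, velocities, dt, WorkerCount()) only sets how many threads exist. Which core each one actually runs on is still up to the scheduler unless affinity is set explicitly.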
Much of it's still over my head. I'd be interested in hearing some comments regarding this.
I was going to convert this into a dx11 compute shader, but I'm stuck!

I cannot get a simple DXTestRender11 class to render anything right now, which has me quite upset as it was supposed to be very easy to do. Oh, well...