The Quaternion

Anything relating to DirectX 9c

Re: The Quaternion

Postby Hieran_Del8 » Fri Jun 25, 2010 5:45 am

Well, I've adjusted the program to be multithreaded, and to set the number of threads depending on how many spheres and procs can run in each thread. The optimum setting is 500 spheres or less with 124751 procs, producing 1 thread. However, I've tried it at 2000 spheres, producing 17 threads. It runs a little better than it did before. I tried 10,000 spheres, which produced 401 threads. Needless to say, it was unbearably slow on my dual core cpu (though still much faster than 2000 objects single threaded!) The more cores a computer has, the better this should run, though at best the gain scales with the number of cores rather than exponentially.

And that is why I will write this as a dx11 compute shader. I can have thousands of parallel floating point computations. 10,000 objects would be easily doable. I wonder at what point my graphics card (ati hd5770) would begin to stutter. 100,000? 8-)

So here's the program. Sorry, but it's incredibly messy now. Lots of sections commented out... The only files that are different from the previous release are Spheres.h, SphereManager.h, and main.cpp.
Attachments
Spheres.zip
DXTestD3D.h, DXTestInput.h, DXTestLibs.h, DXTestRender.h, DXTestSound.h, DXTestWindow.h, HPCounter.h, smartptr.h, Spheres.h, SphereManager.h, main.cpp
(16.25 KiB) Downloaded 9 times
Hieran_Del8
Most Valuble Contributor
 
Posts: 349
Joined: Thu Nov 06, 2008 4:59 am
Location: Marengo, IL

Re: The Quaternion

Postby lp rob1 » Fri Jun 25, 2010 6:39 am

Woo! So much faster than the old one! I managed to get the object count up to 3300 with no stuttering (25fps+). Just. Mind you, the comp started howling like a beast... I suppose that 4.2GHz overclock doesn't help! :D
A Compute Shader for this would be great, but don't leave the CPU with nothing to do! It needs to have its fair share of work. It should have a limit to how much it can take. Might be hard to implement. You are going to have to work out how many calculations in total, then somehow [divide?] it. So the more objects there are, the less objects the CPU gets (same number of calcs though). I hope that made any sense. :D

Eager to test his HD5850 with this app,
lp rob1
lp rob1
Veteren
 
Posts: 124
Joined: Tue Jan 12, 2010 6:34 am
Location: England, UK

Re: The Quaternion

Postby Hieran_Del8 » Fri Jun 25, 2010 9:23 am

lp rob1 wrote:Woo! So much faster than the old one! I managed to get the object count up to 3300 with no stuttering (25fps+). Just. Mind you, the comp started howling like a beast... I suppose that 4.2GHz overclock doesn't help! :D
A Compute Shader for this would be great, but don't leave the CPU with nothing to do! It needs to have its fair share of work. It should have a limit to how much it can take. Might be hard to implement. You are going to have to work out how many calculations in total, then somehow [divide?] it. So the more objects there are, the less objects the CPU gets (same number of calcs though). I hope that made any sense. :D

Eager to test his HD5850 with this app,
lp rob1

Wow! That's really good. My laptop is a dual core 2.0ghz intel centrino. Nothing fancy (and a few years old), but that is awesome that you got such good results. I haven't tested this with my quad core 2.6ghz amd phenom II desktop yet, though I will before I start writing the dx11 compute shader.

Well, I figure the cpu can handle collision detection, as that process will probably be lighter most of the time, though it will operate on the same data sets. I guess to start making way for this I should first write a simple DXTestD3D11 header, and a corresponding DXTestRender11 header--handling all the ugly initialization and rendering setup of dx11. Also, I think I'll start using XnaMath.h instead of d3dxmath. By default, d3dx11.h does not include any math headers, so one could technically use xnamath.h or d3dxmath.h, or both. xnamath has the advantage of aligning the data and using the sse2 instruction set.

That's a good point you brought up: have some of the workload performed on gpu and cpu. The only problem I can see is a duplication of data sets. I'll need to think this one through. Maybe in the next week I'll have something worked out.

Re: The Quaternion

Postby lp rob1 » Sat Jun 26, 2010 11:54 pm

Hieran_Del8 wrote:and using the sse2 instruction set

That's a point... maybe we should try and get some CPU optimizations in? I don't have a clue where to start though. :o

Re: The Quaternion

Postby Hieran_Del8 » Sun Jun 27, 2010 4:20 pm

Cpu optimizations are good, but I'm baffled by the current results. I tried running the program on my quad core desktop, and the performance was not noticeably different from my dual core laptop. I'm thinking the bottleneck is the memory fetch?

In reading about compute shaders, it was recommended to store all input values in a local array so as to instruct the program to keep the entries in gpu cache. Perhaps to see a noticeable improvement in performance, I should store all input values (i.e. 2,000 objects * 4 floats for position and mass = 32k bytes) as well as an output buffer (i.e. 2,000 objects * 3 floats for added accelerations = 24k bytes) for each thread. Combine the results of the threads, then perform the state update. Maybe I should just convert everything to local cache instead of dynamic allocations....
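A minimal sketch of that buffer layout, assuming C++11 and 4-byte floats. The names BodyInput, Accel, InputBytes, OutputBytes and CombinePartials are hypothetical and do not come from the posted Spheres.h or SphereManager.h; the sketch only shows the packed input record, the per-thread output record, and the step that sums the per-thread results before the state update.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical packed records (not from the posted code).
struct BodyInput { float x, y, z, mass; };  // position + mass: 4 floats
struct Accel     { float x, y, z; };        // added acceleration: 3 floats

static_assert(sizeof(BodyInput) == 16, "expected 4 packed 4-byte floats");
static_assert(sizeof(Accel) == 12, "expected 3 packed 4-byte floats");

// Size in bytes of a read-only input snapshot for n bodies.
constexpr std::size_t InputBytes(std::size_t n)  { return n * sizeof(BodyInput); }

// Size in bytes of one thread's private output buffer for n bodies.
constexpr std::size_t OutputBytes(std::size_t n) { return n * sizeof(Accel); }

// Sum every thread's partial accelerations into one buffer of n entries.
// Each worker writes only to its own partial buffer, so no locking is
// needed while the workers run; this combine step runs after they finish.
std::vector<Accel> CombinePartials(const std::vector<std::vector<Accel>>& partials,
                                   std::size_t n)
{
    std::vector<Accel> total(n, Accel{0.0f, 0.0f, 0.0f});
    for (const std::vector<Accel>& p : partials) {
        for (std::size_t i = 0; i < n && i < p.size(); ++i) {
            total[i].x += p[i].x;
            total[i].y += p[i].y;
            total[i].z += p[i].z;
        }
    }
    return total;
}
```

With these records, InputBytes(2000) is 32,000 bytes and OutputBytes(2000) is 24,000 bytes, matching the figures above. Whether a buffer of that size actually stays in cache depends on the particular cpu.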

Re: The Quaternion

Postby joemc » Sat Jul 03, 2010 1:36 pm

really you only want 1 thread per core. check how many cores there are and create that many worker threads. unless they are waiting on something not cpu based like hard drive or network.

woah you have a lot of memory allocation going on. allocate it all (or most of it) at once or put it on the stack. i bet you are having a lot of cache misses.

to do it with less modification to your code you could always make a memory manager class for the sphere that allocates a pool of memory and overload the new operator. but i don't like that idea :)

I will try to get the time to actually understand your code.

edit:
to add to the one thread per core part. if you have more than one thread per core you are just getting extra context switches. ideally you would only want one per physical core, but there is no bias for context switching on the "unused" other half of the hyperthreaded core and it can be given to a stupid process and slow things down. cache misses cost A LOT. make sure nothing is getting swapped out to a page file either.
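A rough sketch of the one-worker-per-core idea, assuming a C++11 compiler with std::thread rather than the Win32 threads the posted program presumably uses. WorkerCount, SplitRange and ParallelFor are hypothetical helper names. Note that std::thread::hardware_concurrency() reports logical processors, so on a hyperthreaded cpu it can be twice the physical core count.

```cpp
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// One worker per logical processor reported by the runtime, never fewer than 1.
inline unsigned WorkerCount()
{
    unsigned n = std::thread::hardware_concurrency();  // may be 0 if unknown
    return n == 0 ? 1u : n;
}

// Split [0, count) into `workers` contiguous ranges of near-equal length.
// `workers` must be at least 1.
inline std::vector<std::pair<std::size_t, std::size_t>>
SplitRange(std::size_t count, unsigned workers)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t base  = count / workers;
    const std::size_t extra = count % workers;  // first `extra` ranges get one more
    std::size_t begin = 0;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t len = base + (w < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}

// Run fn(begin, end) once per range, each on its own thread, then wait for all.
template <typename Fn>
void ParallelFor(std::size_t count, unsigned workers, Fn fn)
{
    std::vector<std::thread> pool;
    for (const auto& r : SplitRange(count, workers))
        pool.emplace_back(fn, r.first, r.second);
    for (std::thread& t : pool)
        t.join();
}
```

Calling ParallelFor(sphereCount, WorkerCount(), work) would then give each worker one contiguous slice of the spheres, instead of creating hundreds of threads for 10,000 objects.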
joemc
Veteren
 
Posts: 127
Joined: Fri Jul 10, 2009 1:29 pm

Re: The Quaternion

Postby Hieran_Del8 » Sun Jul 04, 2010 7:37 am

joemc wrote:woah you have a lot of memory allocation going on. allocate it all (or most of it) at once or put it on the stack. i bet you are having alot of cache misses.

Hmm, I don't know which version I have up on the thread here (ugh, this is going to get confusing discussing threading on a forum thread. Wow, I just used three participles in a row.)

In my current version, the SphereManager class performs this operation:
Code:
   void CreateTestSpheres(UINT amount) //creates 'amount' of test spheres
   {
      srand(GetTickCount());
      m_pSpheresSize = amount;
      m_pSpheres = new Sphere*[amount];

      for(UINT i = 0; i < amount; i++)
      {
         float mass = rand() / 32768.0f * 5.0f;
         SphereTemplate tmp(
            &D3DXVECTOR3(rand() / 32768.0f * 100.0f-50.0f,rand() / 32768.0f * 100.0f-50.0f,rand() / 32768.0f * 100.0f-50.0f),
            &D3DXVECTOR3(0,0,0),
            &D3DXVECTOR3(1,0,0),
            &D3DXVECTOR3(0,1,0),
            mass,//1.0f,
            mass,//1.0f,
            rand() / 32768.0f * .157f * D3DX_PI,
            rand() / 32768.0f * D3DX_PI);
         //CreateSpheres(&tmp, 1);
         m_pSpheres[i] = new Sphere(&tmp);
      }
   }

Do you suggest that instead of making an array of Sphere pointers, I just make an array of Spheres? The purpose of making Sphere pointers was for eventual class polymorphism support, but I'm currently just interested in optimizing threading. Is there a benefit to having all the objects in adjacent memory locations versus the current setup?

I was reading up on context switching, and see some of the costliness to changing processes, but does this apply to a single process, free-threaded setup? (Actually, I may be confusing some of the wording used between "processes" and "processors", and between the threading models "apartment" and "free threaded".)

I also began reading up on the model for multiprocessing support, and read some into SMP and NUMA. Since this is going to be a dx11 app anyway, thus on newer hardware, I'll use the NUMA model. The differences between the two are summed up in this Windows SDK comment:
The traditional model for multiprocessor support is symmetric multiprocessor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.

System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.

In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.

So I can obtain the number of physical cores using GetLogicalProcessorInformation(), and per your suggestion create only a number of threads equal to the number of cores. According to the docs, and in line with your statements, there isn't any guarantee to using all the cores because the scheduler makes the decisions, though it is possible to set process and thread affinity with certain processors.
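One possible way to count physical cores with that call is sketched below. PhysicalCoreCount is a hypothetical helper name. On Windows it counts the RelationProcessorCore entries returned by GetLogicalProcessorInformation(); on other platforms, or if the call fails, it falls back to the logical processor count from std::thread::hardware_concurrency(), which is an assumption of this sketch and not something the Windows docs prescribe.

```cpp
#include <cstddef>
#include <thread>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: physical core count on Windows, logical count elsewhere.
unsigned PhysicalCoreCount()
{
#ifdef _WIN32
    DWORD bytes = 0;
    // First call is expected to fail and report the buffer size it needs.
    GetLogicalProcessorInformation(nullptr, &bytes);
    if (bytes != 0) {
        std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
            bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
        if (!info.empty() && GetLogicalProcessorInformation(info.data(), &bytes)) {
            const std::size_t n = bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
            unsigned cores = 0;
            for (std::size_t i = 0; i < n && i < info.size(); ++i) {
                // One RelationProcessorCore entry is reported per physical core.
                if (info[i].Relationship == RelationProcessorCore)
                    ++cores;
            }
            if (cores > 0)
                return cores;
        }
    }
#endif
    // Fallback: logical processors; may be 0 when the runtime cannot tell.
    const unsigned logical = std::thread::hardware_concurrency();
    return logical == 0 ? 1u : logical;
}
```

The result only says how many worker threads to create. As noted above, the scheduler still decides where they run unless affinity is set explicitly.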

Much of it's still over my head. I'd be interested in hearing some comments regarding this.

I was going to convert this into a dx11 compute shader, but I'm stuck! :x I cannot get a simple DXTestRender11 class to render anything right now, which has me quite upset as it was supposed to be very easy to do. Oh, well...

Re: The Quaternion

Postby joemc » Sun Jul 04, 2010 9:47 am

Is there benefit to having all the objects in adjacent memory locations versus the current setup

a HUGE one if there is processor intensive activity going on.

a processor has an L1, L2, and many times L3 cache. if it is getting a memory location it checks the L1 cache, if it is not there it checks L2 and so on. a register is going to be the quickest. how bad the delay is varies greatly depending upon the processor, but for instance a division might cost 80 clocks, an add might cost 2, and an L2 cache miss might cost 200 clocks. those numbers are all made up but it is a fairly devastating event to have a lot of cache misses. having the objects in adjacent memory or on the stack is going to make it more likely that they are in the cache. not to mention one api call is a lot faster than a bunch to allocate each object.

an array of pointers is going to be an extra level of indirection, but does not cost that much. the Load Effective Address instruction is pretty cheap when it comes to cycles. but if it is not necessary i would cut it out.

a short example of a memory manager:

Code:
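#include <Windows.h>
#include <iostream>
// Minimal pool: hands out consecutive ints from one block allocated up front.
class MemMan
{
private:
  HANDLE hMemory;
  void *pMemory;
  int *pCurrent;
  int *pEnd;   // one past the last int in the pool
public:
  MemMan(int nSize)
  {
   hMemory= GlobalAlloc(GMEM_MOVEABLE | GMEM_ZEROINIT, nSize );
   pMemory = GlobalLock(hMemory);
   pCurrent= (int*)pMemory;
   pEnd = pCurrent + nSize / sizeof(int);
  }

  ~MemMan(void)
  {
    GlobalUnlock(hMemory);   // GlobalUnlock takes the handle, not the locked pointer
    GlobalFree(hMemory);
  }
  int * GetSome()
  {
    if (pCurrent >= pEnd)
      return NULL;           // pool exhausted
    return pCurrent++;       // post-increment so the first slot is handed out too
  }
};

int main()
{
  MemMan myMemory(1024);
  int *a =  myMemory.GetSome();
  int *b =  myMemory.GetSome();
  int *c =  myMemory.GetSome();
  (*a) = 1;
  (*b) = 2;
  (*c) = (*a) + (*b);

  std::cout << (*a) << " + " << (*b) << " = " << (*c);

  std::cin.get();            // keep the console open until Enter is pressed
  return 0;
}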



you could investigate GMEM_FIXED and not needing to lock it. I don't know. MSDN recommends the newer Heap functions, but it is all the same really. Really all this is, is a simple container, the STL vector may be good enough for you, just make sure you call reserve() before you add those thousands of objects. Memory Managers and containers do go different directions when you implement them further.
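For comparison, a small sketch of the STL vector route, assuming the spheres do not need polymorphism. The Sphere struct here is a stand-in with made-up fields, not the class from Spheres.h, and CreateSpheres is a hypothetical name. Calling reserve() first means the vector allocates one contiguous block up front and does not reallocate while the objects are added.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the real Sphere class; the fields are assumptions.
struct Sphere
{
    float position[3];
    float velocity[3];
    float mass;
};

// Build `amount` spheres stored by value in one contiguous allocation.
std::vector<Sphere> CreateSpheres(std::size_t amount)
{
    std::vector<Sphere> spheres;
    spheres.reserve(amount);      // single allocation, no regrowth below
    for (std::size_t i = 0; i < amount; ++i) {
        Sphere s = {};            // zero-initialize all fields
        s.mass = 1.0f;
        spheres.push_back(s);     // copies into the reserved block
    }
    return spheres;
}
```

Iterating such a vector walks memory in order, which is the adjacent layout described above; an array of Sphere pointers filled by separate new calls gives no such guarantee.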

I personally like putting as much on the stack as possible. I think it makes the code cleaner, it's easier, and in many cases it is faster. Not to mention scope cleans it up for you instead of having to worry about it.


this guy really knows what he is talking about : http://www.agner.org/optimize/
read page 91. he bashes the STL a little bit, but he has good reasons, resources, and tests to prove everything he says.

Re: The Quaternion

Postby Hieran_Del8 » Sun Jul 04, 2010 4:12 pm

Thanks for the reply! Yep, it is really over my head right now. So much so that I might just forget about it for right now and focus on converting it to a compute shader, if I can ever get that freaking DXTestRender11 to work!

Also, thanks for the links. They were quite informative. Though I haven't downloaded any of the pdf's, I'll be sure to reference this source when I decide to revisit this topic.
