Cannot start more than 12 threads simultaneously on Xeon E5 2650.

  • Question

  • Hello Everyone.

    We just bought a Lenovo server fitted with a Xeon E5 2650 v4 and were given the task of migrating our server application to it. The application is a real-time image processing task that creates several threads. At first we ran it on an i7-4790 with 6 threads, but the performance was poor. After migrating to the E5, the result is still very unsatisfactory. Even though the E5 has 12 cores and 24 hardware threads, I can only get 12 threads running simultaneously, no matter how many threads I create. I wrote a test project to simulate how the threads work, using Qt 4.8 threads and synchronization objects. The demo follows.

    #include <QtCore/QCoreApplication>
    #include <QThread>
    #include <QSemaphore>
    #include <QMutex>
    #include <Windows.h>

    #define THREAD_NUMBER 20

    QSemaphore m_producer;
    QSemaphore m_customer;
    volatile long long GolbalSum = 0;

    class MyThread: public QThread
    {
    public:
        MyThread(QObject *parent=0);
        ~MyThread();
        void run();

    public:
        long long *m_sum;
        bool m_bExit;
        QMutex m_singleTask;
        QMutex m_TaskDone;
    };

    MyThread::MyThread(QObject *parent/* =0 */):QThread(parent)
    {
        m_sum = 0;
        m_bExit = false;
        m_TaskDone.lock();
        start();
    }

    MyThread::~MyThread()
    {
        m_bExit = true;
        m_singleTask.unlock();
    }

    void MyThread::run()
    {
        m_singleTask.lock();
        m_singleTask.lock();
        while (!m_bExit)
        {
            if (!m_producer.tryAcquire(1, 1))
            {
                continue;
            }
            // handle the task
            for (int i = 0; i < 500000; ++i)
                (*m_sum)++;
            m_customer.release();
            m_singleTask.lock();
        }
    }

    int main(int argc, char *argv[])
    {
        QCoreApplication a(argc, argv);
        int loopCnt = 1000;
        MyThread threads[THREAD_NUMBER];
        long long threadsSum[THREAD_NUMBER];
        LARGE_INTEGER freq_detect, before_detect, after_detect;
        QueryPerformanceFrequency(&freq_detect);
        while(1)
        {
            for (int i = 0; i<THREAD_NUMBER; ++i)
            {
                threads[i].m_sum = &threadsSum[i];
                threads[i].m_singleTask.unlock();
            }
            if (0)//loopCnt-- <= 0)
            {
                for (int i = 0; i<THREAD_NUMBER; ++i)
                {
                    threads[i].m_bExit = true;
                    threads[i].wait();
                }
                break;
            }
            Sleep(8);
            QueryPerformanceCounter(&before_detect);
            m_producer.release(THREAD_NUMBER);
            m_customer.acquire(THREAD_NUMBER);
            QueryPerformanceCounter(&after_detect);
            printf("%f %f %f\t",
                   before_detect.QuadPart*1e3/freq_detect.QuadPart,
                   after_detect.QuadPart*1e3/freq_detect.QuadPart,
                   (after_detect.QuadPart-before_detect.QuadPart)*1e3/freq_detect.QuadPart);
            for (int i=0; i<THREAD_NUMBER; ++i)
            {
                GolbalSum += *threads[i].m_sum;
            }
            printf("Current sum:%I64d\n", GolbalSum);
        }
        //return a.exec();
    }

    When the threads' processing time is shorter than the Windows time slice, half of the CPUs do no work at all; this does not happen on the i7. I have put a great deal of effort into optimizing the multithreaded concurrency with Intel Parallel Studio, but it had no effect.

    Is anyone familiar with multithreaded concurrency? I really need your help. Thanks a lot.

    Wednesday, August 23, 2017 1:14 PM
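
To cross-check the observation outside Qt, the same busy loop can be timed with only the C++11 standard library. The following is a minimal sketch, not the poster's code: the names `BatchResult`, `run_batch`, and `print_scaling` are hypothetical, and it swaps the Qt semaphores and mutexes for plain `std::thread` creation and `join`, so each measurement also includes thread start-up cost and is coarser than the demo's timing.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Result of one timed batch (hypothetical helper type, not part of the demo).
struct BatchResult {
    long long sum;     // total of all per-thread counters
    double elapsedMs;  // wall-clock time for the whole batch, in milliseconds
};

// Start `threadCount` workers, let each count to `iterations` (the same
// busy loop as the demo), wait for all of them, and time the batch.
BatchResult run_batch(int threadCount, int iterations)
{
    assert(threadCount > 0 && iterations >= 0);
    std::vector<long long> counters(threadCount, 0);
    std::vector<std::thread> workers;
    workers.reserve(threadCount);

    const auto before = std::chrono::steady_clock::now();
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            // volatile keeps the compiler from collapsing the loop
            volatile long long local = 0;
            for (int i = 0; i < iterations; ++i)
                local = local + 1;
            counters[t] = local;  // each worker writes only its own slot
        });
    }
    for (std::thread &w : workers)
        w.join();
    const auto after = std::chrono::steady_clock::now();

    BatchResult r;
    r.sum = 0;
    for (long long c : counters)
        r.sum += c;
    r.elapsedMs =
        std::chrono::duration<double, std::milli>(after - before).count();
    return r;
}

// Print the batch time for 1..maxThreads workers so the point where the
// time starts to grow can be compared between machines.
void print_scaling(int maxThreads, int iterations)
{
    std::printf("logical processors reported: %u\n",
                std::thread::hardware_concurrency());
    for (int n = 1; n <= maxThreads; ++n) {
        BatchResult r = run_batch(n, iterations);
        std::printf("%2d threads: %8.3f ms, sum=%lld\n",
                    n, r.elapsedMs, r.sum);
    }
}
```

Calling `print_scaling(24, 500000)` from `main` on both the i7 and the E5 gives one line per thread count. If the batch time stays roughly flat up to some thread count and then climbs, that count is a rough indication of how many workers actually ran in parallel. Raising `iterations` makes the work dominate the thread start-up cost, which should make the trend easier to read.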

All replies

  • Hi Jason, are you using HPC Pack to run your workload?

    Qiufang Shi

    Thursday, August 24, 2017 2:06 AM
  • Hi Qiufang Shi, I just tested on Windows 7 Enterprise.
    Thursday, August 24, 2017 2:46 AM
  • Hi Jason, in that case, you might need to post your question in another forum.

    Qiufang Shi

    Thursday, August 24, 2017 10:37 AM
  • Hi Qiufang Shi, I'm sorry about that. Could you tell me which forum I should post to? Thank you.
    Thursday, August 24, 2017 1:00 PM