Efficiency problem on updating 6 million records

  • Question

  • Hi

    I am having an efficiency problem updating a table.
    The table has 50 fields and 7 different indexes.

    I need to update one of the fields (called "ReferenceNum") on that table ("RecordDataTable") by joining it to a lookup table ("ResultTable"), which has 3 fields (CustomerCode, Flag, ReferenceNum) and 5,000 records.
    The tables are joined on a field called "CustomerCode".

    If the fields match, I update RecordDataTable. I am wondering whether it is possible to make the process quicker. At the moment it takes about 5.5 hours, so I guess it is not very efficient.

    Here is an example of my code.

    UPDATE c
        SET c.ReferenceNum = t.ReferenceNum
        FROM RecordDataTable c
        INNER JOIN ResultTable t
        ON t.CustomerCode = c.CustomerCode
        WHERE c.ReferenceNum = 0 AND t.ReferenceNum <> 0

    Could you give me some advice? Is there any way I could make the query more efficient?

    Many thanks

    Chi
    April 27, 2008 1:32 PM

Answers

  • Yes, there are many ways to optimize:

    1) Try creating a clustered index on ResultTable (CustomerCode, ReferenceNum).

    2) Avoid querying all 6 million records. From your query, I see you only work with rows where c.ReferenceNum = 0 and t.ReferenceNum <> 0, and then set c.ReferenceNum = t.ReferenceNum. Will both ReferenceNum values ever be reset? I am not sure what this column is used for. If they are not going to be reset, you might consider adding another column, a true/false flag indicating which records have already been updated and which have not, and then run the update only against the records that have not yet been updated. (Or even a new table recording which CustomerCode values still need the update; that way you would generally deal with far fewer than 6 million records.)

    3) Create a view on RecordDataTable with only the required columns. As you mention, there are 50 fields in that table, but your query only needs 2 columns: CustomerCode and ReferenceNum.

    4) Consider updating ReferenceNum in RecordDataTable directly whenever you update ResultTable. That way you don't have to run an update with an {inner join, nested loops, hash match, table scan, etc.} between these 2 tables against 6 million records.

    5) Look carefully at your execution plan, study it, and try applying some changes to see if the plan improves.
    April 28, 2008 7:37 AM
  • Yes, you may try updating directly through the view to see if there is any improvement. However, looking at the execution plan is always the best way to optimize the query: see which step has a high cost.

    April 28, 2008 8:26 AM

All Replies

  • Do they have indexes on CustomerCode and ReferenceNum? What does the execution plan say?
    April 27, 2008 5:17 PM
  • Hi Rmiao

    Thanks for your information. I have indexed CustomerCode, but I have not indexed ReferenceNum. I guess I do not need to, as I only update the ReferenceNum field.

    --> 2) The field will not get reset, as I generate a new ResultTable each time I run the process, so that table is truncated and rebuilt.

    << 3) Create a view on RecordDataTable with only the required columns. As you mention, there are 50 fields in that table, but your query only needs 2 columns: CustomerCode and ReferenceNum. >>

    Do you mean I should perform the update on that view? E.g.

    UPDATE vRecordDataTable
    SET ReferenceNum = ResultTable.ReferenceNum
    ...etc?

    << 4) Consider updating ReferenceNum in RecordDataTable directly whenever you update ResultTable. That way you don't have to run an update with an {inner join, nested loops, hash match, table scan, etc.} between these 2 tables against 6 million records. >>

    I will have a try at this.

    I will run my query tonight, study the execution plan, and see how I could improve the query.

    Thanks very much for your help. I will let you know the result later.

    Many thanks

    Chi

    April 28, 2008 7:49 AM
  • >> but I have not indexed ReferenceNum. I guess I do not need to, as I only update the ReferenceNum field.

    Not true, because you use it in the WHERE clause.

    April 28, 2008 6:30 PM
  • Hi Rmiao

    Thanks for your information; I understand now. I tried Ken's suggestion of putting a clustered index on CustomerCode and ReferenceNum. Unfortunately, I still get the same result. I will try updating all the records at the same time instead of updating them by joining the tables together. I will let you know my result in a day or so :)

    Thanks very much for your help.

    Chi
    April 28, 2008 9:25 PM
  • If you don't mind, I also recommend that you run the query and show the execution plan, save a screenshot and host it on some pic/file hosting site, then give us the URL so we can see it. Then we can understand it better.

    April 29, 2008 6:48 AM
  • Hi Ken

    Thanks very much for your help. Is there any way I could export the execution plan, so that I can export the image properly? It is quite long, so it will probably end up in 2 files. I will try again with a different method today; if it still does not work, then I will upload the execution plan.

    Here are some statistics from the execution plan:

    4% on Parallelism (Repartition Streams)
    20% on Hash Match (Aggregate)
    13% on Hash Match (Inner Join)
    46% of the cost on the Clustered Index Scan on CustomerCode and ReferenceNum
    1% on Table Update

    Does that make any sense?

    Do you think the problem is the size of the table?

    I will try updating the records at the same time as I update the ResultTable and see if it works better.

    I will let you know my result, hopefully tomorrow.

    Thanks very much for your help : )

    Chi

    April 29, 2008 7:55 AM
  • Click the "Execution Plan" tab, right-click the white area, and you will see a "Save Execution Plan As..." option; select it and save as a .sqlplan file.

    Well, one solution to optimize the performance is horizontal partitioning of the data; i.e., you might partition it based on "year".

    Do you remember what I said: you could add a flag column to indicate which rows have already had their ReferenceNum modified and which have not. You might also consider moving the new data (meaning rows with ReferenceNum = 0) to another table and updating through that table (adding the updated records back to the original table and deleting them from the "new record" table, and searching the combined result for other purposes by using UNION ALL).

    There are many ways to do optimization. We have to experiment to see the best way for each case.
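    The "new record" table idea above might be sketched as follows; the table name NewRecordTable is hypothetical, and this is one reading of the suggestion rather than a tested script:

    ```sql
    -- Move rows that still need a ReferenceNum into a small working table
    -- (first run only; SELECT INTO creates the table, later runs would use INSERT INTO)
    SELECT *
      INTO NewRecordTable
      FROM RecordDataTable
     WHERE ReferenceNum = 0;

    DELETE FROM RecordDataTable WHERE ReferenceNum = 0;

    -- Update only the small working table from the 5,000-row lookup table
    UPDATE n
       SET n.ReferenceNum = t.ReferenceNum
      FROM NewRecordTable n
     INNER JOIN ResultTable t ON t.CustomerCode = n.CustomerCode
     WHERE t.ReferenceNum <> 0;

    -- Move the now-updated rows back and clear them from the working table
    INSERT INTO RecordDataTable SELECT * FROM NewRecordTable WHERE ReferenceNum <> 0;
    DELETE FROM NewRecordTable WHERE ReferenceNum <> 0;

    -- While rows live in two tables, searches cover both
    SELECT * FROM RecordDataTable
    UNION ALL
    SELECT * FROM NewRecordTable;
    ```

    The point of the split is that the expensive join only ever touches the small set of rows with ReferenceNum = 0, never the full 6 million.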

    April 29, 2008 8:38 AM
  • Hi Ken

    Thanks for your information about how to save the execution plan. Unfortunately, I am not sure why I am not able to save my execution plan in SQL Server Management Studio (the option is just greyed out).

    Any suggestions?

    Thanks

    Chi
    April 29, 2008 11:23 AM
  • I have never seen that happen... I have no idea why Management Studio disables this function for you. Sorry.

    April 30, 2008 9:13 AM
  • >> I am not sure why I am not able to save my execution plan in SQL Server Management Studio. (The option is just greyed out)

    Do you have the proper permissions in SQL?

    April 30, 2008 12:56 PM
  • Hi Rmiao

    Thanks for your reply. I guess I have the proper permissions in SQL, as I am logged in as 'sa'.

    Is there any way I could check whether I have permission to save the execution plan?

    Many thanks

    Chi
    April 30, 2008 3:20 PM
  • It should be OK with sa. What did you try? Did you put the cursor in the execution plan panel and go to File -> Save Execution Plan As in SSMS?

    May 1, 2008 1:26 PM
  • Hi Rmiao

    What I did is click the "Display Estimated Execution Plan" icon at the top. Then I ran my query; after that I saw my execution plan, then I right-clicked on the page and tried to save the execution plan. Unfortunately, this option is greyed out. I tried again with your suggestion, going to File -> Save Execution Plan As, but that option is greyed out as well.

    Thanks very much for your suggestion.

    Chi
    May 1, 2008 2:03 PM
  • Oh... I read somewhere that if you are connecting to SQL 2000 from SSMS 2005, the "Save Execution Plan" option will be greyed out. Is your DB SQL 2000 in your case?

    May 2, 2008 4:40 AM
  • Hi Ken

    Yes, I am connecting to SQL 2000 from SSMS 2005. Do you think that is the reason it does not allow me to save/export the execution plan? Is there any way I could get around it?

    Many thanks

    Chi
    May 2, 2008 7:43 AM
  • I don't have SQL 2000 Enterprise Manager here, but I remember that you can also display the execution plan from the SQL 2000 tools. I am not sure if you can save it, though. You could try it and tell us the result.

    May 2, 2008 8:50 AM
  • Hi

    I tried in MS SQL Query Analyzer. Unfortunately, there is no option to save the execution plan there either : (
    May 2, 2008 9:23 AM
  • Hmm... then maybe you can find some screenshot program that lets you save the long window as one JPG file.

    May 2, 2008 11:25 AM
  • Thanks : )
    May 2, 2008 2:33 PM
  • Any update? Have you changed your program/DB logic yet?

    Are you still trying to capture the execution plan?

    One more note: since you are using SQL 2000, you might consider purchasing an upgrade to SQL 2005 or even SQL 2008. The performance will also improve a bit.

    May 7, 2008 2:10 AM
  • Hi Ken

    Sorry for the late reply. I have changed my program's logic and it works very well; it only takes an hour now.

    I would love to upgrade the database to SQL 2005/2008.
    I am not sure how long it would take. E.g., I think some of the data types are different, and I am not sure whether it will cause any problems with UDFs, stored procedures, etc.

    Do you think I could just attach the MDF file to the new server and it should work fine?

    Thanks very much for your help.

    Chi
    May 7, 2008 7:39 AM
  • Well, could you mark the "這篇文章有用嗎?" ("Was this post helpful?") prompt for this thread as true?

    Also, for the MDF file, I think you should start another thread to ask about the best way to upgrade SQL 2000 to SQL 2008, as there are some SQL MVPs in this forum too. They may give you the best answer.

    May 7, 2008 8:10 AM
  • Hi Ken

    Sorry about forgetting to mark the post as answered. I have marked this post as answered now and ticked the useful replies as well (several of them). I think I did tick a useful reply before; I am not sure why it does not show up after I click to another page and come back. Could you check whether any reply shows as marked useful in this topic, or whether it is a bug in this forum?

    I have made another post with my question about upgrading SQL Server from 2000 to 2005 or 2008 at the link below.

    http://forums.microsoft.com/hongkong/ShowPost.aspx?PostID=3302821&SiteID=82&mode=1

    Thanks very much for your help.

    Chi
    May 7, 2008 8:58 AM
  • Instead of optimizing anything, I would like to review the fundamental problem first: updating a field in one table from the data in another table is not a good approach, because it simply violates the principle of database normalization.

    You are experiencing bad performance primarily because you perform a join of the 2 tables, creating a huge table space in memory. Then you reference this big intermediate result and pick out the values you need to update the target table. You are, in effect, working on a join over millions of records in memory. If that can be completed within 5.5 hours, your server is already not a cheap one.

    If I were handling this problem, I would simply create a view (or a stored procedure, up to you) that selects just the fields I need from RecordDataTable and joins ResultTable on the fly. In this case, you can skip updating RecordDataTable (and the writing to the log file that goes with it). If you really want a persistent table, SELECT the view into another temporary or permanent table (you can even enable the bulk-logged recovery model to further improve performance, but that decreases recoverability).

    Please try whether that meets your requirements; I am interested in any performance improvement.

    Cheers.

    May 16, 2008 2:09 PM
  • Hi Lawrence

    Thanks for your information. I created a view selecting the records that will be updated, e.g. SELECT ... FROM ResultTable WHERE ReferenceNum <> 0,
    and joined to it with the query below.

    UPDATE c
        SET c.ReferenceNum = t.ReferenceNum
        FROM RecordDataTable c
        INNER JOIN vResultTable t
        ON t.CustomerCode = c.CustomerCode
        WHERE c.ReferenceNum = 0

    This takes a bit more time to finish the update than before.

    2) Do you mean I should create a view with SELECT ... FROM RecordDataTable WHERE ReferenceNum = 0 joined to ResultTable, and run the update through that?

    I will test the second approach and see if it works better.

    Do you think it is still not a good approach, since joining the 2 tables together consumes too much processing power and memory?

    Many thanks

    May 16, 2008 3:12 PM
  • Hi Chi,

    I think I have misled you. If you create a view and still run an update query through it, the result will be more or less the same.

    What I meant was: instead of updating one table with values from another, row by row, joining 2 huge tables together, you can just use 1 cross-table query and forget about updating the field altogether.

    Since the value is already in the other table, all you need to do is perform a simple join query to produce a result set. Afterwards, you can use a SELECT INTO query to put the result set into a new table for later use. Since SELECT INTO amounts to a bulk insert, it will be much faster than a row-by-row update.

    For example:

     

    CREATE VIEW vw_PopulateReferenceNum

    AS

    SELECT c.field1, c.field2, c.field3, c.field4, c.CustomerCode, t.ReferenceNum

    FROM RecordDataTable c JOIN ResultTable t ON c.CustomerCode = t.CustomerCode

    WHERE c.ReferenceNum = 0 AND t.ReferenceNum <> 0

    GO

     

    Then, you can execute the view and put its output into a table, or a temp table, for later use:

    SELECT * INTO tbl_NewResult FROM vw_PopulateReferenceNum

     

    In this case, if the ResultTable "t" is not very big, you will probably get a result set of up to 6 million rows from vw_PopulateReferenceNum; bulk-inserting those 6 million records into SQL Server should take significantly less time than the row-by-row update.

     

    You can even create a "phantom" table having the same schema as RecordDataTable, as below:

    1. RecordDataTable -> duplicate schema -> "phantom" table of RecordDataTable

    2. TRUNCATE tbl_phantom

    3. INSERT INTO tbl_phantom SELECT * FROM vw_PopulateReferenceNum

    4. DROP RecordDataTable

    5. Rename tbl_phantom to RecordDataTable
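    The swap in steps 2-5 might look like this in T-SQL; this is a sketch, assuming tbl_phantom already exists with RecordDataTable's schema, and it drops the original table (rather than merely deleting its rows) so that sp_rename does not collide with an existing name:

    ```sql
    -- 2. Empty the phantom copy
    TRUNCATE TABLE tbl_phantom;

    -- 3. Fill it from the view in one bulk operation
    INSERT INTO tbl_phantom SELECT * FROM vw_PopulateReferenceNum;

    -- 4. Remove the original table so its name becomes free
    DROP TABLE RecordDataTable;

    -- 5. Rename the phantom table into place (SQL 2000 syntax)
    EXEC sp_rename 'tbl_phantom', 'RecordDataTable';
    ```

    Note that any indexes, constraints, and permissions on the original table would need to be recreated on the renamed table.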

     

    I hope this helps.

    Cheers.

    May 17, 2008 2:54 PM
  • Hi Lawrence,

    Thank you very much for your help. It was just my misunderstanding before; now I understand a lot more. I will try it out later and let you know when I get the result.

    Thanks a lot.
    May 19, 2008 7:31 AM