Best Answerer
Efficiency problem on updating 6Million records

Question
-
Hi
I am having an efficiency problem updating a table.
The table has 50 fields and 7 different indexes.
I need to update one of the fields (called "ReferenceNum") on that table ("RecordDataTable"), joining another lookup table ("ResultTable") which contains 3 fields and 5,000 records, e.g. CustomerCode, Flag, ReferenceNum.
The tables are joined by a field called "CustomerCode".
If the fields match, then update RecordDataTable. I am wondering if it is possible to make the process quicker. At the moment it takes about 5.5 hours, so I guess it is not very efficient.
Here is an example of my code.
UPDATE RecordDataTable
SET RecordDataTable.ReferenceNum = t.ReferenceNum
FROM ResultTable t
INNER JOIN RecordDataTable c
ON t.CustomerCode = c.CustomerCode
WHERE c.ReferenceNum=0 AND t.ReferenceNum<>0
Could you give me some advice? Is there any way I could make the query more efficient?
Many thanks
Chi
Answer
-
Yes, there are many ways to optimize:
1) You should try creating a clustered index on ResultTable (CustomerCode, ReferenceNum).
2) You should avoid querying all 6 million records. From your query, I see that you only work with rows where c.ReferenceNum = 0 and t.ReferenceNum <> 0, and then set c.ReferenceNum = t.ReferenceNum. Will either ReferenceNum ever be reset? I am not sure what this column is used for. If they are never reset, you might consider adding another column, a true/false flag indicating which records have already been updated and which have not, and then run the update only against the records that have not been updated yet. (Or even a new table storing which CustomerCodes still need the update; by doing this, you should generally be dealing with far fewer than 6 million records.)
3) Create a view on RecordDataTable with only the required columns. As you mention, there are 50 fields in that table, yet your query only needs 2 columns: CustomerCode and ReferenceNum.
4) Consider updating ReferenceNum in RecordDataTable directly at the same time you update ResultTable. By doing this, you do not have to run an update with {inner join, nested loops, hash match, table scan, etc.} between these 2 tables against 6 million records.
5) Look carefully at your execution plan, study it, and try applying changes to see if the execution plan improves.
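Points 1 to 3 above could be sketched roughly like this (the index name, the IsUpdated flag column, and the view name are illustrative choices, not from the thread):

```sql
-- 1) Clustered index on the lookup table, covering the join and filter columns
CREATE CLUSTERED INDEX IX_ResultTable_CustomerCode
    ON ResultTable (CustomerCode, ReferenceNum);

-- 2) A flag column so repeated runs only touch rows not yet updated
--    (IsUpdated is a hypothetical column name)
ALTER TABLE RecordDataTable
    ADD IsUpdated bit NOT NULL DEFAULT 0;

UPDATE c
SET    c.ReferenceNum = t.ReferenceNum,
       c.IsUpdated    = 1
FROM   RecordDataTable c
       INNER JOIN ResultTable t ON t.CustomerCode = c.CustomerCode
WHERE  c.IsUpdated = 0
       AND t.ReferenceNum <> 0;

-- 3) A narrow view exposing only the two columns the update needs
CREATE VIEW vRecordDataSlim
AS
SELECT CustomerCode, ReferenceNum
FROM   RecordDataTable;
```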
All Replies
-
Hi Rmiao
Thanks for your information. I have indexed CustomerCode, but I have not indexed ReferenceNumber. I guess I do not need to, as I only update the ReferenceNumber field.
--> 2) The field will not get reset, as I generate a new ResultTable each time I run this, so the table will be truncated and rebuilt.
<<3) create a view on RecordDateTable with only required columns, as you mention, there are 50 fields in that table, however, all your query is only required 2 columns, customerCode and referenceNum. >>
Do you mean I should perform the update on that view? E.g.
UPDATE vRecordDataTable
SET ReferenceNumber = ResultTable.ReferenceNumber
...etc?
<<4) Cosider to update the ReferenceNum in RecordDataTable directly when you update the ResultTable at the same time, but doing this, you don't have to update with {inner join, nested loops, hash match, table scan...etc} between these 2 tables against 6 millions records.>>
I will give this a try.
I will run my query tonight and study the execution plan to see if there is any way I could improve the query.
Thanks very much for your help.
I will let you know the result later
Many thanks
Chi -
Hi Rmiao
Thanks for your information. I understand now. I tried Ken's suggestion to put a clustered index on CustomerCode and ReferenceNumber; unfortunately, I still get the same result. I will try to update all the records at the same time instead of updating the records by joining the tables together. I will let you know my result in a day or so :)
Thanks very much for your help.
Chi -
Hi Ken
Thanks very much for your help. Is there any way I could export the execution plan so that I can export the image properly? It is quite long, so it will probably be in 2 files. I will try again with a different method today. If it still does not work, then I will upload that execution plan.
Here are some statistics from the execution plan:
4% on Parallelism (Repartition Streams)
20% on Hash Match (Aggregate)
13% on Hash Match (Inner Join)
46% of the cost on the Clustered Index Scan on CustomerCode and ReferenceNumber
1% on Table Update
Does that make any sense?
Do you think the problem is the size of the table?
I will try to update the record at the same time as I update the Result table and see if it works better.
I will let you know my result hopefully tomorrow.
Thanks very much for your help : )
Chi -
Click the "Execution Plan" tab, right click it on the white area, you will see a "save Execution plan as..." selection, select it and save as .sqlplan file.
Well, one solutino to optimize the performance, is horizontal partitioning the data, i.e. you may partition it based on "year".
Do you remember what I said, you may adding a flag column to indicate which data are already modified the ReferenceNumber and which data are not. You may also consider to moving new data (means with ReferenceNumber = 0) to another table, then update thought this table (by adding record to original table and delete from this "new record" table, and saerching the result for other propose by using union all).
There are many ways to do optimization. We have to try to see the best way for difference case.
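The "new record" table idea above might look something like this sketch (RecordDataTable_New is a hypothetical name for the staging table, not from the thread):

```sql
-- Rows that still need a ReferenceNumber live in a small staging table,
-- so the update only ever joins against new rows, not all 6 million
UPDATE n
SET    n.ReferenceNum = t.ReferenceNum
FROM   RecordDataTable_New n
       INNER JOIN ResultTable t ON t.CustomerCode = n.CustomerCode
WHERE  t.ReferenceNum <> 0;

-- Move the now-updated rows back into the main table
INSERT INTO RecordDataTable
SELECT * FROM RecordDataTable_New WHERE ReferenceNum <> 0;

DELETE FROM RecordDataTable_New WHERE ReferenceNum <> 0;

-- Queries that need the full data set read both tables together
SELECT CustomerCode, ReferenceNum FROM RecordDataTable
UNION ALL
SELECT CustomerCode, ReferenceNum FROM RecordDataTable_New;
```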
-
Hi Rmiao
What I did was click the "Display Estimated Execution Plan" icon at the top. Then I ran my query, and after that I saw my execution plan. Then I right-clicked on the page and tried to save the execution plan; unfortunately, that option is greyed out. I tried again with your suggestion, going to File -> Save Execution Plan As, but that option is greyed out as well.
Thanks very much for your suggestion.
Chi -
Any update? Have you changed your program/DB logic yet?
Are you still trying to capture the execution plan?
One more note: since you are using SQL 2000, you may consider purchasing an upgrade to SQL 2005 or even SQL 2008. Performance will also improve a bit.
-
Hi Ken
Sorry for the late reply. I have changed my program's logic and it works very well; it only takes an hour now.
I would love to upgrade the database to SQL 2005 / 2008.
I am not sure how long it would take; e.g. I think some of the data types are different, and I am not sure if it will cause any problems with UDFs, stored procedures, etc.
Do you think I could just attach the MDF file to the new server and it should work fine?
Thanks very much for your help.
Chi -
Hi Ken
Sorry about forgetting to mark the post as answered. I have marked this post as answered now and ticked the useful replies as well (several of them). I think I did tick a useful reply before; I am not sure why it did not show up after I clicked to another page and came back. Could you see if any reply in this topic is marked as useful, or whether it is a bug in this forum?
I have made another post with my question about upgrading SQL Server from 2000 to 2005 or 2008 at the link below.
http://forums.microsoft.com/hongkong/ShowPost.aspx?PostID=3302821&SiteID=82&mode=1
Thanks very much for your help.
Chi -
Instead of optimizing anything, I would like to review the fundamental problem first: this is not a good way to update a table field according to data from another table, because it simply violates the principles of database normalization.
You are experiencing bad performance primarily because you perform a join of 2 tables, creating huge table space in memory. You then reference this big table and pick out the values you need to update the target table. You are, in effect, working on a join query over multi-million-record tables in memory. If that can be completed within 5.5 hours, your server is already not a cheap one.
If I were handling this problem, I would simply create a view (or stored procedure, up to you), selecting just the fields I need from RecordDataTable and joining ResultTable on the fly. In this case, you can skip updating RecordDataTable (and writing to the log file). If you really want a persistent table, select the view into another temporary or permanent table (or you can even enable bulk-logged mode to further enhance performance, though that decreases recoverability).
Please try whether that can meet your requirements; I am interested in any performance enhancement.
Cheers.
-
hi Lawrence
Thanks for your information. I created a view selecting the records that will be updated, e.g. selecting from ResultTable where ReferenceNum <> 0,
and joined the tables with the query below.
UPDATE RecordDataTable
SET RecordDataTable.ReferenceNum = t.ReferenceNum
FROM vResultTable t
INNER JOIN RecordDataTable c
ON t.CustomerCode = c.CustomerCode
WHERE c.ReferenceNum=0
This takes a bit more time to finish the update than before.
2) Do you mean I should create a view selecting from RecordDataTable where ReferenceNum = 0 and join ResultTable for the update?
I will try to test out the second approach and see if it works better.
Do you think it is still not a good approach, since joining the 2 tables together consumes too much processing power and memory?
Many thanks -
Hi Chi,
I think I may have misled you. If you create a view and use an update query, the result will be more or less the same.
What I meant was: instead of updating a table using values from another table, row by row, joining 2 huge tables together, you can just use 1 query to do a cross-table query and forget about updating the field you wanted to update.
Since the value is already in another table, all you need to do is perform a simple join query to produce a result set. Afterwards, you can use a SELECT INTO query to put the result set into a new table for later use. Since SELECT INTO amounts to a bulk insert, it will be much faster than a row-by-row update.
For example:
CREATE VIEW vw_PopulateReferenceNum
AS
SELECT c.field1, c.field2, c.field3, c.field4, c.customercode, t.ReferenceNum
FROM RecordDataTable c JOIN ResultTable t ON c.CustomerCode = t.CustomerCode
WHERE c.ReferenceNum = 0 AND t.ReferenceNum <> 0
GO
Then, you can execute the view and put it into a table, or temp table for later use:
SELECT * INTO tbl_NewResult FROM vw_PopulateReferenceNum
In this case, if the ResultTable "t" is not very big, you will probably get a 6-million-row result set from vw_PopulateReferenceNum; bulk inserting those 6 million records into SQL Server should take significantly less time.
You can even create a "phantom" table, which has the same schema as RecordDataTable, as below:
1. RecordDataTable -> duplicate schema -> "phantom" table of RecordDataTable
2. TRUNCATE tbl_phantom
3. SELECT * INTO tbl_phantom FROM vw_PopulateReferenceNum
4. DELETE RecordDataTable
5. rename tbl_phantom to RecordDataTable
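The five steps above might be sketched as follows. Note two assumptions on my part: step 3's SELECT INTO requires the target table not to exist, so after a TRUNCATE an INSERT ... SELECT is used instead, and step 4's DELETE is read as dropping the old table, since the rename in step 5 needs the name to be free (sp_rename is the SQL 2000 way to rename a table):

```sql
-- 1./2. Reuse the phantom table, emptied between runs
TRUNCATE TABLE tbl_phantom;

-- 3. Populate it from the view (INSERT ... SELECT, since the table exists)
INSERT INTO tbl_phantom
SELECT * FROM vw_PopulateReferenceNum;

-- 4. Remove the old table so its name becomes available
DROP TABLE RecordDataTable;

-- 5. Rename the phantom table into place
EXEC sp_rename 'tbl_phantom', 'RecordDataTable';
```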
Hope this can help.
Cheers.