“Bulk” update in mysql with the use of temporary table
You need to insert or update a large number of rows (at least hundreds of thousands) from an external data source (either from other database or file) into an existing mysql table but you are not allowed to truncate the table first because data source contains less number of columns than the existing table. The external source has a column that will be checked against the table’s primary key to determine if a record is new or not. For new records, using bulk insertion is pretty easy. Updating existing rows is quite a different story as there’s not yet an official way to do “bulk” updating. Prepare/Execute method might help a bit but still the performance is not satisfactory.
Through a project I completed recently, I’ve learned a new way to solve the above problem, and the basic idea is to use bulk insertion to accomplish the goal of bulk updates, with the help of mysql temporary table. Let’s assume the table we are going to insert/update has the following structure:
tbl_A ( userid char(20) not null, email varchar(250) not null, url varchar(250) , flag_a tiny int, primary key(userid) )
The basic steps:
1) Before entering the loop, create a temporary table tbl_A_tmp
create temporary table if not exists tbl_A_tmp ( userid char(20) not null, email varchar(250) not null, url varchar(250) , flag_a tiny int )
Please note this table doesn’t have to have exactly the same data structure as tbl_A, as long as it holds the columns that will be used in step 5 and it’s also a good idea to remove any indexes to speed up insertion (to the temporary table).
2) Use delete or truncate statement to delete any existing rows in tbl_A_tmp
3) Inside the loop, save new records to an array list_new and existing records to an array list_existing, when num_of_elements(list_new) >=BATCHCOUNT, do batch insertion to tbl_A directly, when num_of_elements(list_existing) >=BATCHCOUNT, do batch insertion to tbl_A_tmp, clear list_new or list_existing after each batch insertion
4) When looping is done, Check remaining number of elements in both list_new and list_existing, perform batch insertions for any remaining records to the corresponding tables
4) Add unique or primary key to tbl_A_tmp, without this the next step will be very slow
5) Perform cross-table update like the following
update tbl_A dest, tbl_A_tmp src set dest.email=src.email, dest.url=src.url, dest.flag_a=src.flag_a where dest_A.userid=src.userid
The reason why temporary table is used is because of
a) speed, a temporary table is usually created in the memory and will be converted to use disk storage if the temporary table becomes too large (hence data loss is prevented)
b) convenience, when the db link is disconnected,temporary tables are dropped automatically
How much performance gain do I get by using this method? 45 minutes (bulk insertion for new records + row-by-row updates) vs. less than 2 minutes (bulk insertion for new records + temporary table method for existing records).