Recently, one of our clients asked us to speed up one of their service applications, which processes data for reporting purposes. The application runs periodically and checks whether new files have been uploaded to the server. If so, it moves those files to an appropriate location for further processing. Once the files are there, the application validates, converts and uploads them one by one to a SQL Server database using the bulk import feature. It then runs a stored procedure that inserts the data into other tables so it can be used for reporting.
Every time this application ran, it took a few minutes to complete the process. The client felt that this was too long and asked us to optimize it. The problem was that the application processed files one by one instead of in parallel. Since there were no dependencies between files, the application was a perfect case for parallel processing. The question was how to restructure the flow so that multiple files could be processed in parallel using threads.
Keep in mind that when you decide to use threads, you have to consider how to make the code thread-safe so that race conditions cannot occur. We also had to consider how to keep the threads alive so we could reuse them to process multiple files.
Think of the threads as your team members. You assign tasks to your team members and tell them to check with you once they are done and before they go home. When a team member comes to you and asks "I am done, can I go home?", you check your list of tasks to see whether anything else should be finished that day. If so, you assign another task to that team member and ask them to check with you again before going home. Once your list of tasks is taken care of, you tell your team members that they are done and can go home. Team members work best when they do not have to wait on each other for anything: they can work independently and finish their assigned tasks.
Let’s get back to the technology. We created a thread-safe collection that holds the list of files identified as new. Then we created a function that checks the collection for pending files and returns the first one. Threads use this function to ask whether there is another file they can process. Each thread calls the function as soon as it finishes processing its current file. When the function returns nothing, the thread terminates itself.
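The pattern described above can be sketched in Python as follows. This is a minimal illustration rather than the code from the client's application: the names `next_file`, `worker` and `process_all`, the `handle` callback and the default of four threads are all assumptions made for the example, and the standard library's `queue.Queue` stands in for the thread-safe collection.

```python
import queue
import threading


def next_file(pending):
    """Return the next pending file, or None when the collection is empty."""
    try:
        return pending.get_nowait()
    except queue.Empty:
        return None


def worker(pending, handle, processed, lock):
    """Keep asking for files until next_file() returns None, then exit."""
    while True:
        path = next_file(pending)
        if path is None:
            return  # nothing left to do: the thread ends itself
        handle(path)  # hypothetical per-file processing callback
        with lock:  # guard the shared result list
            processed.append(path)


def process_all(files, handle, thread_count=4):
    """Fill the collection once, then let a fixed set of threads drain it."""
    pending = queue.Queue()
    for path in files:
        pending.put(path)
    processed, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(pending, handle, processed, lock))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has found the collection empty
    return processed
```

Because the collection is filled before any thread starts, an empty collection reliably means "no more work", so a thread can safely exit the first time `next_file` returns `None`. The order of the returned list depends on thread scheduling and should not be relied on.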
We restructured the code so that each thread processes a file from beginning to end. Once a file is assigned to a thread, the thread takes it through the different stages (copy, validate, import to database, etc.). The intent was to have each thread work independently and finish each assigned file completely.
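As a rough sketch of what "one thread, one file, all stages" might look like, the function below takes a single file through copy, validate and import without handing it to anyone else. The stage bodies are placeholders invented for this example: `validate` only rejects empty files, and `import_to_database` merely counts lines where the real application would run the bulk import and the stored procedure.

```python
import shutil
from pathlib import Path


def validate(path):
    """Placeholder check (assumption): accept only non-empty files."""
    if path.stat().st_size == 0:
        raise ValueError(f"{path.name} is empty")


def import_to_database(path):
    """Placeholder for the bulk import; pretends each line is one row."""
    with path.open() as handle:
        return sum(1 for _ in handle)


def process_file(source, work_dir):
    """Take one file through every stage; return (name, rows, error)."""
    source = Path(source)
    try:
        target = Path(work_dir) / source.name
        shutil.copy2(source, target)        # stage 1: copy to the working location
        validate(target)                    # stage 2: validate
        rows = import_to_database(target)   # stage 3: import
        return (source.name, rows, None)
    except (OSError, ValueError) as exc:
        # A failure in one file is reported, not raised, so the calling
        # thread can move on to the next file.
        return (source.name, 0, str(exc))
```

Since the function touches only its own file and shares no state, several threads can call it at the same time on different files without any locking.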
Once we implemented this approach, the speed of the application improved by over 60%. This was a significant improvement for minimal changes to the code. So every time you plan a new application, spend some time thinking about how you can split the process in such a way that you can use threads efficiently.