
I have an Access database with a table of around 3,500,000 rows. One of its columns is a 10-digit 'Phone_No'. Performing a search on such a huge table would take too long, so I decided to split that one table into multiple tables based on the first 5 digits of the 'Phone_No'.

For example, if I have a 'Phone_No' value of 1234500000, I will create a new table named 12345 and insert into it all rows whose 'Phone_No' starts with 12345, i.e. values like 1234500001, 1234500002, etc.

This way, if I have to search for a 'Phone_No', I just need to run 'SELECT * FROM 12345 WHERE Phone_No = 1234500000'. This narrows the search down to a comparatively small dataset.

The sample code is shown below:

        int iRow = 0;
        int iRowCount;

        string strIndex;
        string strOutputTableName;

        DataTable dtOutput;

        DataRow dr;

        try
        {
            // Get staging row count
            iRowCount = dtStaging.Rows.Count;
            dsOutput = new DataSet();

            // Iterate each row for splitting
            while (dtStaging.Rows.Count > 0)
            {
                // Increment row count
                iRow++;

                // Get current Datarow
                dr = dtStaging.Rows[0];

                // Get column to be indexed by splitting
                strIndex = dr["Phone_No"].ToString();

                // Set output table name
                strOutputTableName = strIndex.Substring(0, 5);

                try
                {
                    // Create new datatable 
                    dtOutput = new DataTable(strOutputTableName);

                    // Add columns
                    dtOutput.Columns.Add("Phone_No", typeof(string));
                    dtOutput.Columns.Add("Subscriber_Name", typeof(string));
                    dtOutput.Columns.Add("Address_Line1", typeof(string));
                    dtOutput.Columns.Add("Address_Line2", typeof(string));
                    dtOutput.Columns.Add("Address_Line3", typeof(string));
                    dtOutput.Columns.Add("City", typeof(string));
                    dtOutput.Columns.Add("Pin_Code", typeof(string));
                    dtOutput.Columns.Add("SIM_Activation_Date", typeof(DateTime));

                    // Add datatable to dataset
                    dsOutput.Tables.Add(dtOutput);
                }
                // catch table already exists error and proceed
                catch
                {

                }

                // Import current datarow from staging to output table
                dsOutput.Tables[strOutputTableName].ImportRow(dr);

                // Report progress
                bwSplit.ReportProgress((iRow * 100) / iRowCount);

                // Remove current datarow to release some memory
                dtStaging.Rows.RemoveAt(0);
            }
        }
        catch
        {
            throw;
        }
        finally
        {
            dtStaging = null;
        }

Once my DataSet dsOutput is prepared, I insert it into another Access database file. The problem is that this splitting process takes ages. I am looking for an optimized version of the above code, or any different approach that would make the splitting process faster.

Mufaddal Gulshan
  • You shouldn't have to resort to splitting the table; an `INDEX` on the `Phone_No` would solve this for almost all RDBMS. – Kenney Nov 29 '15 at 13:47
  • Besides what @Kenney mentioned, if you continue down this path, you should take a look at Parallel.For – Camilo Terevinto Nov 29 '15 at 13:48
  • Are you keeping a dataset with 3500000 rows in memory just to split it in subtables that at the end are just the sum of the first dataset? – Steve Nov 29 '15 at 13:49
  • @Kenney This is a table for only one service provider. I have 11 such tables, some even having approx. 100,000,000 records. – Mufaddal Gulshan Nov 29 '15 at 13:52
  • @Steve I am processing these records in a batch of 500,000 – Mufaddal Gulshan Nov 29 '15 at 13:53
  • @cFrozenDeath I'll look into the method – Mufaddal Gulshan Nov 29 '15 at 13:54
  • Not sure if Access is the right tool for this task – Steve Nov 29 '15 at 13:55
  • @Steve Access is not the right tool at all. 100 million records is a lot; SQL Server/Oracle/MySQL should be used for this – Camilo Terevinto Nov 29 '15 at 13:55
  • You might end up with 100,000 tables to hold phone numbers; that's crazy! Just use indexing, and a "proper" database instead of Access. What happens when you then need to search on a different column? You will need to run a separate query for every table. – DavidG Nov 29 '15 at 13:57
  • @cFrozenDeath I receive the data in Access format only from the service providers. I don't mind using SQLite, but SQL Server/Oracle/MySQL is out of the question. – Mufaddal Gulshan Nov 29 '15 at 13:58
  • @cFrozenDeath I was being ironic. However, in smaller applications with a single user (or limited sharing), Access is better for some jobs. In this scenario, a robust server-side solution with advanced indexing capabilities is a must – Steve Nov 29 '15 at 13:58
  • @DavidG I won't end up with 100,000,000 tables to hold phone numbers. All numbers with the same first five digits will be inserted into one table. – Mufaddal Gulshan Nov 29 '15 at 14:00
  • @MufaddalGulshan Yes, but there are a possible 100,000 variations of numbers to make up 5 digits, hence why you might end up with 100,000 tables. – DavidG Nov 29 '15 at 14:01
  • @DavidG I am open to suggestions. Please let me know any other approach. – Mufaddal Gulshan Nov 29 '15 at 14:05
  • @MufaddalGulshan Several people already did that: use indexes and a database technology that is suited to the job. – DavidG Nov 29 '15 at 14:06
  • *"Performing a search on such a huge table [with 3.5 million rows] would take too long."* - Is that based on actual testing or past experience with similar datasets, or is it just a prior assumption you have made? If it is the latter then test first before investing more effort in this alternative approach. – Gord Thompson Nov 29 '15 at 14:56
  • Further to my previous comment: I just tested with a 170MB .accdb on a network share. The file contained a table with 3.5 million rows whose primary key was the 10-digit phone number (`Text(10)`). Performing a SELECT to retrieve the row containing a specific phone number (e.g., `SELECT CompanyName FROM PhoneNumbers WHERE PhoneNumber='4036854444'`) consistently took less than 0.1 seconds. – Gord Thompson Nov 29 '15 at 15:59
  • @GordThompson Thanks for trying it out. I think I'll make separate database files for each service provider and make phone number the primary key. – Mufaddal Gulshan Nov 29 '15 at 16:08

1 Answer

As stated in several comments to the question, indexing is the first thing to look at when performance is an issue.
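In Access SQL, adding such an index is a one-line DDL statement. As a sketch (the table and column names below are illustrative, not taken from the question; a primary key, as used in the test below, creates a unique index automatically):

```sql
CREATE INDEX idx_phone_no
ON Subscribers (Phone_No);
```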

I just ran a test with a 170MB .accdb on a network share. The file contained a table with 3.5 million rows whose primary key was the 10-digit phone number (Text(10)). Performing a SELECT to retrieve the row containing a specific phone number, e.g.,

SELECT CompanyName FROM PhoneNumbers WHERE PhoneNumber='4036854444' 

consistently took less than 0.1 seconds. To me, that does not seem slow enough to warrant any sort of scheme to segment the data into multiple tables.

For more information on performance considerations for applications using an Access database, see

C# program querying an Access database in a network folder takes longer than querying a local copy
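For reference, an indexed lookup from C# over OLE DB might look like the following sketch. The file path, table name, and column names are assumptions for illustration; OleDb uses positional `?` placeholders, so parameter names are cosmetic.

```csharp
using System;
using System.Data.OleDb;

class PhoneLookup
{
    static void Main()
    {
        // Illustrative connection string for an .accdb file (path is an assumption)
        string connStr = @"Provider=Microsoft.ACE.OLEDB.12.0;" +
                         @"Data Source=C:\Data\Phones.accdb;";

        using (var conn = new OleDbConnection(connStr))
        using (var cmd = new OleDbCommand(
            "SELECT Subscriber_Name FROM PhoneNumbers WHERE Phone_No = ?", conn))
        {
            // Positional parameter; the name "@p1" is not significant for OleDb
            cmd.Parameters.AddWithValue("@p1", "4036854444");

            conn.Open();
            // ExecuteScalar returns the first column of the first row, or null
            object result = cmd.ExecuteScalar();
            Console.WriteLine(result ?? "not found");
        }
    }
}
```

With an index (or primary key) on the phone-number column, this single-row lookup should stay fast regardless of table size, which is exactly what the timing test above showed.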

Gord Thompson