I am a student working on a placement for the summer. I have been given the task of handling data entry from Excel into a SQL Server database for surveys that were carried out over a number of years. The task is outlined below:

There are three tables: a main event, an individual event and an individual. An event has many individual events, and an individual event has many individuals. My code deals with just the last two tables.

I read two files: a list of all individual events in one, and a list of all individuals in the other. Each individual's data tells me which individual event it is associated with.

My code basically reads an individual event, then looks through the second file for any associated individuals. For each line in the individuals file, if it is associated, it is inserted into the proper table; otherwise it is written to a new file. Once the whole file has been traversed, the new file is copied over the old one, removing the data already entered into the database.

This copying across has knocked a good 3 minutes off the execution time compared with simply re-reading the full individuals file again and again. But is there a better approach? My execution time for the sample data is ~47 seconds, and ideally I'd like it lower.

Any advice, however trivial, would be appreciated.

EDIT: This is a cut-down version of the code I am using:

<?php
//not shown:
//connect to database 
//input event data
//get the id of the event
//open files
$s_handle = fopen($_FILES['surveyfile']['tmp_name'],'r');//open survey file
copy($_FILES['cocklefile']['tmp_name'],'file1.csv');//make copy of the cockle file
//read files
$s_csv = fgetcsv($s_handle,0,',');//0 = no line-length limit

//read lines and print lines
// then input data via sql

while (! feof($s_handle))
{
    $s_csv[count($s_csv)] = '';//append a trailing empty column
    foreach($s_csv as &$val)//pad null columns; by reference so the change sticks in the array
    {
        if(!isset($val))
            $val = '';
    }
    unset($val);//break the reference
    $grid_no = $s_csv[0];
    $sub_loc = $s_csv[1];
    /*
    .define more variables
    .*/
    

    $sql = "INSERT INTO indipendant_event" 
        ."(parent_id,grid_number,sub_location,....)"
        ."VALUES ("
        ."'{$event_id}',"
        ."'{$grid_no}',"
        //...
        .");";

    if (!odbc_exec($con,$sql))
    {
        echo "WARNING: SQL INSERT INTO fssbur.cockle_quadrat FAILED. PHP.";
    }
    //get ID
    $sql = "SELECT MAX(ind_event_id)"
    ."FROM independant_event";
    $return =  odbc_exec($con,$sql);
    $ind_event_id = odbc_result($return, 1);
    
    //insert individuals
    $c_2 = fopen('file2.csv','w');//create file c_2 to write to 
    $c_1 = fopen('file1.csv','r');//open the data to read
    $c_csv = fgetcsv($c_1,0,',');//get the first line of data
    while(! feof($c_1))
    {
        
        for($i=0;$i<9;$i++)//make sure theres a value in each column
        {
            if(!isset($c_csv[$i]))
            $c_csv[$i] = '';
        }
        //give values meaningful names
        $stat_no = $c_csv[0];
        $sample_method = $c_csv[1];
        //....
        
        //check whether the current line corresponds to the current station
        if (strcmp(strtolower($stat_no),strtolower($grid_no))==0)
        {
            $sql = "INSERT INTO fssbur2.cockle"
                ."(parent_id,sampling_method,shell_height,shell_width,age,weight,alive,discarded,damage)"
                ."VALUES("
                ."'{$ind_event_id}',"
                ."'{$sample_method}',"
                //...
                ."'{$damage}');";
            //write data if it corresponds
            if (!odbc_exec($con,$sql))
            {
                echo "WARNING: SQL INSERT INTO fssbur.cockle FAILED. PHP.";
            }     
            $c_csv = fgetcsv($c_1,0,',');
        }
        else//no correspondence: keep the line for a later event
        {
            fputcsv($c_2,$c_csv);//write line to the new file
            $c_csv = fgetcsv($c_1,0,',');//get new line
        }
    }//end while, now gone through all individuals, and filled c_2 with the unused data
    fclose($c_1);//close files
    fclose($c_2);
    copy('file2.csv','file1.csv');//copy new file to old, removing used data
    $s_csv = fgetcsv($s_handle,0,',');
}//end while

//close file
fclose($s_handle);
?>
Aido
    Please show some code. Have you tried anything to improve the process? Have you used a profiler to measure the execution time? – Gordon Jul 04 '11 at 09:32
  • @Gordon I avoided posting code as it is quite lengthy. I simply used microtime() to see how long it took. I will post some code asap – Aido Jul 04 '11 at 09:37
  • Code uploaded. Thanks for the answers so far! – Aido Jul 04 '11 at 10:50

2 Answers


I may not have fully understood the process, but why not just insert the entire CSV into your database table? This might seem like wasted work, but it will likely pay off. Once you have done the initial import, finding any individual associated with an event should be much faster, as the DBMS can use an index to speed up these lookups (compared with your file-based linear traversal). To be precise: your "individual" table will presumably have a foreign key into your "individual_event" table. As long as you create an index on this foreign key, lookups will be significantly faster (it's possible that simply declaring the field a foreign key causes SQL Server to index it automatically, but I can't say for sure; I don't really use MSSQL).
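To sketch the idea (the table, column, and index names below are illustrative, not the asker's actual schema):

-- Illustrative parent table of events.
CREATE TABLE individual_event (
    individual_event_id INT IDENTITY(1,1) PRIMARY KEY,
    grid_number         VARCHAR(50)
    -- ...remaining event columns
);

-- Illustrative child table holding every imported CSV row,
-- with a foreign key pointing at its parent event.
CREATE TABLE individual (
    individual_id       INT IDENTITY(1,1) PRIMARY KEY,
    individual_event_id INT NOT NULL
        REFERENCES individual_event (individual_event_id),
    sampling_method     VARCHAR(50),
    shell_height        DECIMAL(6,2)
    -- ...remaining CSV columns
);

-- SQL Server does not create an index on a foreign key automatically,
-- so add one explicitly for fast parent-to-children lookups.
CREATE INDEX IX_individual_event ON individual (individual_event_id);

-- With the whole CSV loaded, finding the individuals for one event
-- becomes an index seek instead of re-reading the file.
SELECT *
FROM individual
WHERE individual_event_id = 42;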

As an aside, how many records are we talking about? If we are dealing with thousands of records, it's definitely reasonable to expect this kind of thing to run in a couple of seconds.

PhilDin
  • I had not thought of that approach, although I am using auto-generated primary keys as some events have the same name. Do you know if it is relatively easy to assign the proper keys after they are in the database? [My PHP is stronger than my SQL.] The test data is about 300 lines in the first file and 3000 in the second, so effectively 300*3000 comparisons – Aido Jul 04 '11 at 09:50

You can create a temporary database with the data from the files and then use the temporary tables to bring the data into the new form. This will probably be faster, especially if you need to do lookups and flag entries as processed.
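As a rough sketch (the staging table names, columns, and file paths are assumptions, not the real schema), the idea is to load both CSV files into staging tables and then let one set-based statement do the matching:

-- Load both CSVs into staging tables first (BULK INSERT shown here;
-- a plain PHP insert loop works too). Paths and names are placeholders.
CREATE TABLE staging_event      (grid_number VARCHAR(50), sub_location VARCHAR(50) /* ... */);
CREATE TABLE staging_individual (station_no  VARCHAR(50), sampling_method VARCHAR(50) /* ... */);

BULK INSERT staging_event      FROM 'C:\import\events.csv'      WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
BULK INSERT staging_individual FROM 'C:\import\individuals.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- One join then replaces the per-event re-reading of the individuals file:
-- each staged individual is matched to its event row by station/grid number,
-- mirroring the case-insensitive comparison in the question's code.
INSERT INTO fssbur2.cockle (parent_id, sampling_method /* , ... */)
SELECT e.ind_event_id, i.sampling_method /* , ... */
FROM staging_individual AS i
JOIN independent_event  AS e
  ON LOWER(i.station_no) = LOWER(e.grid_number);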

hakre
  • Just what I was thinking: load into a temp table and use a SQL select/join to match the individuals to the events. If there are a lot of entries in the files (several thousand), consider using the "MERGE" statement. – James Anderson Jul 04 '11 at 09:42
  • I have not implemented this yet but it has pointed me in the right direction. Thanks! – Aido Jul 04 '11 at 10:59