3

I have two tables in two different databases. Both tables hold a huge amount of data, but they share the same columns.

When I run the code below I hit a performance problem: even though the employee table has only 2 records (the department table has 100,000 records), the comparison takes more than 10 minutes.

Is there any way I can improve the performance and make it faster?

EmplTbl = cur.execute("select A , B , C from EmployeeTable where EmplName in ('A','B')") 
for line in EmplTbl:
    EmplData.append(line)

DeptTbl = cur.execute("select A , B , C from DeptTable") 

for line in DeptTbl:
    DeptData.append(line)

for Empl in EmplData:
    DeptResult = all(Empl in DeptData for elm in DeptData)
    if DeptResult:
        print("Yes")
    else:
        print("No")
Arya
  • 528
  • 6
  • 26
  • 2
    Two questions: first, what are you actually trying to achieve? Second, is the database SQL? If so, why not use a join? – Derlin Aug 13 '19 at 13:41
  • Provide a snippet of your dataframes & the expected output; that is easier to work with than a text explanation – moys Aug 13 '19 at 13:42
  • Just curious, is `DepData` a dataframe? Or just a normal list? If it's a dataframe (which I suspect) it is not recommended to iterate through the df like that. – Axois Aug 13 '19 at 13:45
  • 1
    @Derlin Thanks a lot for the reply. The database is Oracle and I am trying to get the exact-match records between both tables on all 3 columns. The tables are in different databases, that's why I can't use a join. – Arya Aug 13 '19 at 13:47
  • As already suggested, if you are looking for performance, I would try to achieve the result on the database side, with a left join, right join, outer join, etc. – Valentin B. Aug 13 '19 at 13:47
  • @ValentinB. : Maybe SQLs `MINUS` is the way to go here. – Matthias Aug 13 '19 at 13:53
  • @Axois I am capturing it into a list – Arya Aug 13 '19 at 13:53
  • @ValentinB. Yes, but the two tables are in different databases. That's why I am capturing them into lists and trying to compare. – Arya Aug 13 '19 at 13:54
  • Ah this is indeed a problem. Can you sql dump both tables and import them in a third database ? – Valentin B. Aug 13 '19 at 13:57
  • @ValentinB. Probably not. Is there any way I can make the comparison of two lists with the same columns much faster? – Arya Aug 13 '19 at 14:02
  • if there are only a few rows (e.g. less than a few million) in one table just dump that one, pull it into a temp table in the other database and do it with SQL. SQL is much easier than Python for ad-hoc queries like this where you basically have everything in a RDBMS – Sam Mason Aug 13 '19 at 14:08
  • @SamMason Thanks a lot for the reply. I may have millions of rows and there is no limit on the data. Is there any other way without creating temp tables? Please suggest. – Arya Aug 13 '19 at 14:16
  • you are only pulling out two names in your example! this seems like a small number, why not just use that as a constraint on what you copy across or at least just put the same constraint on the other query – Sam Mason Aug 13 '19 at 14:20
  • @SamMason Yes Sam, I just gave an example to show that I face this issue even with 2 records (I tested whether the performance problem also happens with only a few records). – Arya Aug 13 '19 at 14:28
  • just noticed that you're doing `all(Empl in DeptData for elm in DeptData)`. why isn't this just `Empl in DeptData`? – Sam Mason Aug 13 '19 at 14:33
  • @SamMason I am trying to compare all 3 columns for an exact match and return the matching rows from the employee table – Arya Aug 13 '19 at 18:19

4 Answers

3

If the tables are almost identical it would be faster to compare hashes of chunks of data, and then only compare all the data for the chunks with a difference.

I bet most of the run time was spent transferring and transforming data. Reading 100,000 rows from an Employee table probably only takes a few seconds in the database. Using the function DBMS_SQLHASH.GETHASH, Oracle can quickly generate hashes for a large set of data. (You will probably need to have a DBA run grant execute on sys.dbms_sqlhash to your_user;)

For example, imagine these two tables (in reality they are much larger, and on separate databases):

create table EmployeeTable1 as
select 1 a, 2 b, 3 c, 'abcdefg' EmplName from dual union all
select 1 a, 2 b, 3 c, 'bcdefg'  EmplName from dual union all
select 1 a, 2 b, 3 c, 'cdefg'   EmplName from dual;

create table EmployeeTable2 as
select 1 a, 2 b, 3 c, 'abcdefg' EmplName from dual union all
select 1 a, 2 b, 3 c, 'bcdefg'  EmplName from dual union all
select 9 a, 9 b, 9 c, 'cdefg'   EmplName from dual;

Generate a hash for each first letter of the employee names:

--Table 1 hashes:
select 'a', dbms_sqlhash.gethash('select a,b,c,EmplName from EmployeeTable1 where EmplName like ''a%'' order by 1,2,3', 3) from dual union all
select 'b', dbms_sqlhash.gethash('select a,b,c,EmplName from EmployeeTable1 where EmplName like ''b%'' order by 1,2,3', 3) from dual union all
select 'c', dbms_sqlhash.gethash('select a,b,c,EmplName from EmployeeTable1 where EmplName like ''c%'' order by 1,2,3', 3) from dual;

a   923920839BFE25A44303718523CBFE1CEBB11053
b   355CB0FFAEBB60ECE2E81F3C9502F2F58A23F8BC
c   F2D94D7CC0C82329E576CD867CDC52D933C37C2C <-- DIFFERENT


--Table 2 hashes:
select 'a', dbms_sqlhash.gethash('select a,b,c,EmplName from EmployeeTable2 where EmplName like ''a%'' order by 1,2,3', 3) from dual union all
select 'b', dbms_sqlhash.gethash('select a,b,c,EmplName from EmployeeTable2 where EmplName like ''b%'' order by 1,2,3', 3) from dual union all
select 'c', dbms_sqlhash.gethash('select a,b,c,EmplName from EmployeeTable2 where EmplName like ''c%'' order by 1,2,3', 3) from dual;

a   923920839BFE25A44303718523CBFE1CEBB11053
b   355CB0FFAEBB60ECE2E81F3C9502F2F58A23F8BC
c   6B7B1D374568B353E9A37EB35B4508B6AE665F8A <-- DIFFERENT

The Python program only has to compare the hashes, and can quickly discover that "a" and "b" are identical, and the difference is in employees that start with "c". Then the program only has to compare all the details for a smaller set of results.

Unfortunately, this solution requires more coding, as you have to build loops in Python, and construct multiple SQL statements. And the solution will be slower if your tables are wildly different, and you'll need to play around with the data to find the right chunk size.
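
Here is a rough sketch of what those Python loops could look like. It is not from the original answer: it assumes cx_Oracle connections conn1 and conn2 to the two databases, the connect strings and table names are illustrative, and the one-letter chunking mirrors the example tables above.

import string

import cx_Oracle  # assumed driver; any DB-API connection would work similarly

def chunk_hash(conn, table, letter):
    # hash one chunk: all rows whose EmplName starts with `letter`
    sql = ("select dbms_sqlhash.gethash("
           "'select a,b,c,EmplName from {t} where EmplName like "
           "''{l}%'' order by 1,2,3', 3) from dual").format(t=table, l=letter)
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchone()[0]

def chunk_rows(conn, table, letter):
    # pull the full rows for a chunk; only done when the hashes differ
    cur = conn.cursor()
    cur.execute("select a,b,c,EmplName from {t} where EmplName like :pfx "
                "order by 1,2,3".format(t=table), pfx=letter + '%')
    return cur.fetchall()

conn1 = cx_Oracle.connect("user1/pwd1@db1")  # illustrative connect strings
conn2 = cx_Oracle.connect("user2/pwd2@db2")

for letter in string.ascii_lowercase:        # one chunk per first letter
    if chunk_hash(conn1, "EmployeeTable1", letter) == \
       chunk_hash(conn2, "EmployeeTable2", letter):
        continue                             # identical chunk, nothing to transfer
    # only a differing chunk gets pulled across the network and diffed in Python
    rows1 = set(chunk_rows(conn1, "EmployeeTable1", letter))
    rows2 = set(chunk_rows(conn2, "EmployeeTable2", letter))
    print(letter, "only in table 1:", rows1 - rows2)
    print(letter, "only in table 2:", rows2 - rows1)

In practice the two connections would point at the two different databases, and the LIKE predicate would be tuned so that each chunk is large enough to be worth the round trip.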

Jon Heller
  • 34,999
  • 6
  • 74
  • 132
  • 1
    In your opinion, would it not be sensible to add ORDER BY clauses to the SELECT ... FROM EmployeeTable1 statements? (ie to the "sqltext" parameter of dbms_sqlhash.gethash())? – stefan Aug 14 '19 at 07:10
  • @stefan You're right, thanks for the reminder. An `ORDER BY` is necessary or the results won't be deterministic. I've been burned by that before, I should have remembered. – Jon Heller Aug 14 '19 at 12:17
  • @JonHeller Thanks a lot for the reply. Is there any way to create the hash value in Python itself, without a dependency on the DB? – Arya Aug 15 '19 at 18:47
  • @Arya Creating the hash in Python wouldn't help because then you'd still have to transfer all the data to the program. The point of the hashes is to significantly reduce network traffic and overhead, so they need to be generated at the source. – Jon Heller Aug 15 '19 at 20:06
1

your code is doing quadratically more work than you expect. your line:

DeptResult = all(Empl in DeptData for elm in DeptData)

is implicitly doing something like:

DeptResult = True
for elm in DeptData:               # outer pass, from the generator expression
    found = False
    for tmp in DeptData:           # inner pass, from the `in` membership test
        if Empl == tmp:
            found = True
    DeptResult = DeptResult and found

i.e. you're making two nested passes over DeptData when you only need one, so it takes len(DeptData) ** 2 operations. with 100,000 department rows that's about 1e10 comparisons per employee, which would indeed take a very long time to complete

I'd rewrite this as:

cur.execute("select A , B , C from DeptTable")
dept_entries = set(cur)

cur.execute("select A , B , C from EmployeeTable where EmplName in ('A','B')") 
for empl in cur:
  if empl in dept_entries:
    print(empl, 'Yes')
  else:
    print(empl, 'No')

note that Python database connectors don't normally return result rows from their execute method; you should call one of their fetch* methods or iterate over the cursor instead. I don't use Oracle, but other posts suggest its driver follows the DB-API standard, so your code's reliance on the return value of execute is likely broken

putting DeptTable into a set means that each membership lookup is now O(1), and hence empl in dept_entries is very cheap

note: it might be worth going over some tutorials on how tuple equality works in Python, as well as data structures like set and maybe even just basic iteration
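
As a quick illustration (not part of the original answer), the reason a set of rows works here is that rows fetched from a cursor are plain tuples, which compare element by element and hash consistently:

# rows fetched from a cursor are plain tuples, so they compare column-by-column
empl_row = (1, 2, 3)
dept_rows = {(1, 2, 3), (4, 5, 6)}      # set of (A, B, C) tuples

print(empl_row == (1, 2, 3))            # True: all three columns match
print(empl_row in dept_rows)            # True: constant-time hash lookup
print((1, 2, 9) in dept_rows)           # False: any differing column fails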

Sam Mason
  • 15,216
  • 1
  • 41
  • 60
  • Hi Sam, thanks a lot for the reply. I tried this solution but it was still taking a long time. Although "Empl" has only 2 records and they have matches in the department table, it took a lot of time to produce output. – Arya Aug 15 '19 at 18:50
-1

Consider a pure SQL solution to avoid the multiple Python loops. Specifically, run a LEFT JOIN and test for NULLs in all three matched columns:

select e.A as emp_A, e.B as emp_B, e.C as emp_C, 
       d.A as dept_A, d.B as dept_B, d.C as dept_C,
       case
           when not (d.A is null or d.B is null or d.C is null) then 'Yes'
           else 'No'
       end as DeptResult
from EmployeeTable e
left join DeptTable d 
    on e.A = d.A and e.B = d.B and e.C = d.C
where e.EmplName in ('A', 'B')
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • I am looking for a solution in Python, not in SQL. – Arya Aug 13 '19 at 17:55
  • Curious why? Once again, this solution is **one** SQL call avoiding *ALL* loops. Otherwise, consider a vectorized `pandas` solution (doing the same thing with a merge and if/then logic). Your above comments appeared to be open to SQL. I can delete this. But by the way, I am curious - does this query work? – Parfait Aug 13 '19 at 18:09
-1

Pure SQL:

(select A, B, C from DeptTable
 minus
 select A, B, C from EmployeeTable)

union

(select A, B, C from EmployeeTable
 minus
 select A, B, C from DeptTable)
Needle file
  • 508
  • 3
  • 7