0

I am trying to setup a PANDAS project that I can use to compare and return the differences in excel and csv files over time. Currently I load the excel/csv files into pandas and assign them a version column. I assign them a "Version" column because in my last step, I want the program to create me a file containing only what has changed in the "new" version file so that I do not have to update the entire database, only the data points that have changed.

old = pd.read_excel(landdata20201122.xlsx')
new = pd.read_excel(landdata20210105.xlsx')
old['version'] = "old"
new['version'] = "new"

I merge the sheets into one, I then drop duplicate rows based on all the columns in the original files. I have to subset the data because if the program looks at my added version column, it will not be seen as a duplicate row. Statement is listed below

df2 = df1.drop_duplicates(subset=["UWI", "Current DOI Partners", "Encumbrances", "Lease Expiry Date", "Mineral Leases", "Operator", "Attached Land Rights", "Surface Leases"])
df2.shape

I am wondering if there is a quicker way to subset the data, basically the way I currently have it setup, I have to list each column title. Some of my sheets have 100+ columns, so it is a lot of work when I only want it to negate 1 column. Is there a way that I can populate all the column titles and remove the ones I do not want looked at? Or is there a way to enter the columns I DO NOT want compared in the drop duplicates command instead of entering all the columns except one?

If I can just list the columns I do not want to compare, I will be able to use the same script for much more of the data that I am working with as I will not have to edit the drop_duplicates statement each time I compare sheets.

Any help is appreciated, thank you in advance!

TBergy
  • 3
  • 1

1 Answers1

0

If I've understood well:

  1. Store the headers in a list.
  2. Remove the names you don't want by hand.
  3. Inside the subset of drop_duplicates(), place the list.

In case that the columns you want to remove are more than those you want to keep, add by hand all the wanted columns in the list. With a list, you won't need to write them every time.

How to iterate a list:

list=['first', 'second', 'third']
for i in list:
    print(i)

# Output: 'first', 'second', 'third'
LoukasPap
  • 1,244
  • 1
  • 8
  • 17
  • Thank you for the tip, this gets me closer. Will I not I have to compile and edit the list every time I get new data to compare though? This is still easier than typing them all out, but Is there a way to automatically negate the "Version" column from the list? I am still having to take manual extra steps to the list instead of having it automated when I load the files in to compare – TBergy Jan 06 '21 at 20:09
  • Thanks Again. To build on my above comment, I created a list and used the drop function to drop the column names not required. I no longer need to type any heading out. – TBergy Jan 06 '21 at 20:29
  • Sorry, I just saw the message. So, you solved your problem? – LoukasPap Jan 06 '21 at 21:56