I wrote a Python script to dedupe a CSV and I think it's 90% working. I could really use some help troubleshooting one issue.
The code is supposed to find duplicates by comparing firstname, lastname, and email. Duplicates should be written to a dupes.csv file, and uniques should be written to deduplicated.csv, but that's not what's happening.
For example: if a row shows up in original.csv 10 times, the code writes a1 to deduplicated.csv, and writes a2 - a10 to dupes.csv.
This is incorrect. a1 - a10 should all be written to the dupes.csv file, leaving only the truly unique rows in deduplicated.csv.
Another strange behavior: a2 - a10 are getting written to dupes.csv twice!
I appreciate any feedback. This is my first professional Python script and I'm feeling pretty disheartened.
Here is the code:
import csv

def read_csv(filename):
    the_file = open(filename, 'r', encoding='latin1')
    the_reader = csv.reader(the_file, dialect='excel')
    table = []
    # as long as the table row has values, add it to the table
    for row in the_reader:
        if len(row) > 0:
            table.append(tuple(row))
    the_file.close()
    return table

def create_file(table, filename):
    join_file = open(filename, 'w+', encoding='latin1')
    for row in table:
        line = ""
        # build the new row - don't add a comma on the last item, add the last item separately
        for i in range(len(row) - 1):
            line += row[i] + ","
        line += row[-1]
        # adds the string to the new file
        join_file.write(line + '\n')
    join_file.close()

def main():
    original = read_csv('contact.csv')
    print('finished read')
    # holds duplicate values
    dupes = []
    # holds all of the values without duplicates
    dedup = set()
    # pairs, to know if we have seen a match before
    pairs = set()
    for row in original:
        #if row in dupes:
        #    dupes.append(row)
        if (row[4], row[5], row[19]) in pairs:
            dupes.append(row)
        else:
            pairs.add((row[4], row[5], row[19]))
            dedup.add(row)
    print('finished first parse')
    # go through and add in 1 more of each duplicate
    seen = set()
    for row in dupes:
        if row in seen:
            continue
        else:
            dupes.append(row)
            seen.add(row)
    print('writing files')
    create_file(dupes, 'duplicate_leads.csv')
    create_file(dedup, 'deduplicated_leads.csv')

if __name__ == '__main__':
    main()
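For reference, the desired split (every copy of a duplicated row, a1 through a10, routed to the dupes file, and only truly unique rows kept) can be done in two passes: first count each key, then route rows by count. This is a minimal sketch, assuming the same key columns (indexes 4, 5, and 19) and rows stored as tuples, as in read_csv above; the function name split_dupes is invented here:

```python
from collections import Counter

def split_dupes(rows, key_cols=(4, 5, 19)):
    """Two-pass split: count each key, then send EVERY copy of a
    duplicated key (including the first one seen) to dupes."""
    counts = Counter(tuple(row[i] for i in key_cols) for row in rows)
    dupes, uniques = [], []
    for row in rows:
        key = tuple(row[i] for i in key_cols)
        (dupes if counts[key] > 1 else uniques).append(row)
    return dupes, uniques
```

Because the counting pass finishes before any row is routed, the first occurrence of a duplicated key is no longer mistaken for a unique row, and nothing is appended to a list while it is being iterated.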
You should use the pandas module for this; it's extremely fast, and easier than rolling your own.
import pandas as pd

x = pd.read_csv('contact.csv')
# use the names of the columns you want to check
duplicates = x.duplicated(['row4', 'row5', 'row19'], keep=False)
# write duplicates
x[duplicates].to_csv('duplicates.csv')
# write uniques
x[~duplicates].to_csv('uniques.csv')
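To see why keep=False is the right flag here: it marks every copy of a duplicated key as True, rather than sparing the first occurrence the way the default keep='first' does. A toy sketch (column names invented for illustration):

```python
import pandas as pd

# Toy frame standing in for contact.csv; column names are placeholders.
df = pd.DataFrame({
    'first': ['a', 'a', 'x'],
    'last':  ['b', 'b', 'y'],
    'email': ['c', 'c', 'z'],
})

# keep=False marks EVERY copy of a duplicated key, which is exactly
# the behavior asked for above (a1 - a10 all go to the dupes file).
mask = df.duplicated(['first', 'last', 'email'], keep=False)
print(mask.tolist())  # → [True, True, False]
```

Indexing with mask and ~mask then gives the duplicated and unique rows respectively.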