I wrote a Python script to dedupe a CSV and I think it's 90% working. Could really use some help troubleshooting one issue.


The code is supposed to find duplicates by comparing the first name, last name, and email fields. Duplicates should be written to a dupes.csv file, and uniques should be written to deduplicated.csv, but that's not what's happening.

Example:

If a row shows up in original.csv 10 times, the code writes a1 to deduplicated.csv and writes a2-a10 to dupes.csv.

This is incorrect. All of a1-a10 should be written to the dupes.csv file, leaving only the unique rows in deduplicated.csv.

Another strange behavior: a2-a10 are each getting written to dupes.csv twice!

I'd appreciate any feedback. This is my first professional Python script, and I'm feeling pretty disheartened.

Here's the code:

import csv

def read_csv(filename):
    the_file = open(filename, 'r', encoding='latin1')
    the_reader = csv.reader(the_file, dialect='excel')
    table = []
    # as long as the row has values, add it to the table
    for row in the_reader:
        if len(row) > 0:
            table.append(tuple(row))
    the_file.close()
    return table


def create_file(table, filename):
    join_file = open(filename, 'w+', encoding='latin1')
    for row in table:
        line = ""
        # build the new row - don't put a comma after the last item, add the last item separately
        for i in range(len(row) - 1):
            line += row[i] + ","
        line += row[-1]
        # add the string to the new file
        join_file.write(line + '\n')
    join_file.close()


def main():
    original = read_csv('contact.csv')

    print('finished read')
    # holds the duplicate values
    dupes = []
    # holds all of the values without duplicates
    dedup = set()
    # pairs so we know if we have seen a match before
    pairs = set()
    for row in original:
        #if row in dupes:
            #dupes.append(row)
        if (row[4], row[5], row[19]) in pairs:
            dupes.append(row)
        else:
            pairs.add((row[4], row[5], row[19]))
            dedup.add(row)

    print('finished first parse')
    # go through and add in one more of each duplicate
    seen = set()
    for row in dupes:
        if row in seen:
            continue
        else:
            dupes.append(row)
            seen.add(row)

    print('writing files')
    create_file(dupes, 'duplicate_leads.csv')
    create_file(dedup, 'deduplicated_leads.csv')

if __name__ == '__main__':
    main()
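From the description, two things seem to be going on. In the first pass, the first occurrence of each (row[4], row[5], row[19]) key goes straight into dedup and is never revisited, which is why a1 stays in deduplicated_leads.csv even after a2-a10 turn up. In the second loop you append to dupes while iterating over it, and because seen is keyed on the whole row tuple, rows that differ outside the three match fields never hit the continue, so each existing dupe most likely gets appended (and later written) a second time. A simpler structure is to count keys in one pass and classify rows in a second. Here is a minimal standard-library sketch, assuming the same column indexes (4, 5, 19), encoding, and file names as your script:

import csv
from collections import Counter

# column indexes taken from your script; adjust if your layout differs
KEY_COLS = (4, 5, 19)

def key_of(row):
    return tuple(row[i] for i in KEY_COLS)

with open('contact.csv', 'r', encoding='latin1', newline='') as f:
    rows = [row for row in csv.reader(f, dialect='excel') if row]

# first pass: count how many times each key appears
counts = Counter(key_of(row) for row in rows)

# second pass: every row whose key appears more than once is a duplicate
# (including the first copy); everything else is unique
dupes = [row for row in rows if counts[key_of(row)] > 1]
uniques = [row for row in rows if counts[key_of(row)] == 1]

with open('duplicate_leads.csv', 'w', encoding='latin1', newline='') as f:
    csv.writer(f).writerows(dupes)

with open('deduplicated_leads.csv', 'w', encoding='latin1', newline='') as f:
    csv.writer(f).writerows(uniques)

This writes every copy of a duplicated key to duplicate_leads.csv, and only rows whose key appears exactly once to deduplicated_leads.csv, which matches the behavior you described wanting.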

You should use the pandas module for this; it's extremely fast, and easier than rolling your own.

import pandas as pd

x = pd.read_csv('contact.csv')

# use the names of the columns you want to check
duplicates = x.duplicated(['row4', 'row5', 'row19'], keep=False)

x[duplicates].to_csv('duplicates.csv')   # write the duplicates
x[~duplicates].to_csv('uniques.csv')     # write the uniques
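Note that keep=False marks every occurrence of a duplicated key, which is exactly what you want: all copies go to the duplicates file and only true uniques remain. Since your files are latin1-encoded and you probably don't want pandas' numeric index written as an extra column, you may also want to pass encoding= and index=False. A sketch, assuming the columns are actually named firstname, lastname, and email in the header row:

import pandas as pd

x = pd.read_csv('contact.csv', encoding='latin1')

# assumed header names - replace with whatever your first name / last name / email columns are really called
duplicates = x.duplicated(['firstname', 'lastname', 'email'], keep=False)

x[duplicates].to_csv('duplicates.csv', index=False, encoding='latin1')   # all copies of duplicated rows
x[~duplicates].to_csv('uniques.csv', index=False, encoding='latin1')     # rows whose key appears only once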
