How to: Compare Two CSV Files in Python

In data analysis, comparing CSV files is a typical operation, but without the proper tools and methods, it can be challenging.

We’ll look at some useful approaches to comparing two CSV files in Python in this blog post.

Three approaches will be covered. Moreover, we’ll share potential problems and their fixes that can emerge during the comparison process.

This post will provide you with the knowledge and abilities to efficiently compare CSV files and find discrepancies, regardless of your level of Python programming experience.

Advertising links are marked with *. We receive a small commission on sales, nothing changes for you.

The Quick Answer: Use Python to compare two CSV files and display the differences.

How to: Compare Two CSV Files in Python

To compare two CSV files in Python, the files must first be read and parsed before being compared row by row and differences between the files are found.

The most pythonic technique is iterating through the rows of both files at once using the built-in zip() and enumerate() functions. Another approach involves using the csv-diff external module, which offers more sophisticated comparison options like disregarding particular columns or rows.

Moreover, Pandas DataFrames can read and compare CSV files, giving us more possibilities for data analysis and modification.

In the chapters that follow, we’ll delve more deeply into each of these approaches and look at workable options for comparing CSV files in Python.

3 Common Solutions to Compare Two CSV Files in Python

There are a number of strategies and methods that may be applied when comparing CSV files in Python to find differences between the files.

We’ll look at three useful strategies and methods for comparing CSV files in Python in this chapter.

Method 1: Compare Two CSV Files Using the Most Pythonic Solution

This technique entails iterating through the rows of the two files at the same time using the built-in zip() and enumerate() functions.

The procedures are as follows:

  • The csv module should be used to open the two CSV files and store the rows in two different lists.
  • Using zip, iterate through the rows of both files at once ().
  • Compare the values of each cell in each row, then keep track of the differences in a list for later analysis.
  • Analyze the differences list to find the disparities between the files after comparing all the rows.
  • For smaller CSV files, this approach is simple and effective. However, it might not be the greatest choice for larger files because it might cause speed problems.

Example:

import csv

# Open the two CSV files and store their rows in two separate lists
with open('file1.csv') as f1, open('file2.csv') as f2:
    csv1 = list(csv.reader(f1))
    csv2 = list(csv.reader(f2))

# Iterate over the rows of both files simultaneously using zip()
for i, (row1, row2) in enumerate(zip(csv1, csv2)):
    # For each row, compare the values of each cell and store the differences in a list for further analysis
    diff = [i, []]
    for j, (cell1, cell2) in enumerate(zip(row1, row2)):
        if cell1 != cell2:
            diff[1].append(j)
    # Once all the rows have been compared, analyze the differences list to identify the discrepancies between the files
    if diff[1]:
        print(f'Difference found in row {diff[0]}: cells {diff[1]}')

Method 2: Compare Two CSV Files Using csv-diff – An External Module

An extra module called csv-diff offers more comparison options, including supporting various delimiters and quotation characters as well as disregarding certain columns or rows.

Use it as follows:

  • Using pip or a different package manager, install the csv-diff module.
  • Create two distinct lists of dictionaries by using the csv module to open the two CSV files.
  • Send these lists and any other optional parameters to the csv diff function of the csv-diff library (such as column and row ignore lists).
  • The list of differences that the csv diff function gives can be examined and worked with further.
  • CSV file comparison is made easier thanks to csv-diff, which is especially helpful for bigger files. The installation of an external module is necessary, though.

Example:

import csv_diff

# Open the two CSV files using the csv module and create two separate lists of dictionaries
with open('file1.csv') as f1, open('file2.csv') as f2:
    csv1 = list(csv.DictReader(f1))
    csv2 = list(csv.DictReader(f2))

# Pass these lists to the csv_diff function of csv-diff, along with any optional parameters
diff = csv_diff.diff(csv1, csv2, key='id', delimiter=',', quotechar='"')

# The csv_diff function returns a list of differences, which can be further analyzed and manipulated
if diff:
    print('Differences found:')
    for d in diff:
        print(d)

Method 3: Using Pandas DataFrames

For reading, modifying, and comparing CSV files, Pandas DataFrames provide a robust and adaptable solution.

Using them is as follows:

  • The two CSV files should be read into separate dataframes using the read_csv() function after importing the pandas module.
  • Use the dataframes’ compare() function to compare them.
  • The difference between the two files is emphasized in the dataframe that the compare() method produces.
  • This dataframe is capable of additional analysis and customization.
  • When comparing CSV files, Pandas DataFrames are a fantastic alternative because they provide a wide range of possibilities for data analysis and customization. But, compared to the other approaches, they could be trickier to use.

Example:

import pandas as pd

# Read the two CSV files into separate dataframes
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

# Use the compare() method of the dataframes to compare them
df_diff = df1.compare(df2)

# Further analysis and manipulation can be performed on this dataframe
if not df_diff.empty:
    # Highlight the differences using color coding
    styled_diff = df_diff.style.set_properties(subset=['column'], **{'background-color': 'yellow'})
    # Print the styled differences
    print(styled_diff)
    # Write the results to a new CSV file
    df_diff.to_csv('diff.csv', index=False)
    # Visualize the differences using a bar chart
    df_diff.abs().sum().plot(kind='bar')

Conclusion

Python makes it simple to compare CSV files, which is a common activity in data analysis. In this article, we looked at three distinct approaches, including leveraging pandas dataframes, external modules, and built-in functions, for comparing CSV files in Python.

Identifying differences, color-coding changes to emphasize them, publishing results to a new CSV file, displaying differences using graphs and charts, and advanced approaches for comparing big CSV files were also covered in our discussion of practical CSV file comparison techniques.

You can compare CSV files in Python successfully and learn a lot about your data by using the methods and solutions covered in this blog post. Python offers robust and adaptable capabilities for working with CSV files and conducting data analysis, whether you’re working with little or huge datasets.

Advertising links are marked with *. We receive a small commission on sales, nothing changes for you.