Python for Beginners: Compare Pandas DataFrames in Python

We use dataframes to handle tabular data in Python. Sometimes, we need to compare two dataframes record by record according to the values in their columns. In this article, we will discuss how to compare two dataframes in Python.

How to Compare Two DataFrames in Python?

To compare two pandas dataframes in Python, you can use the compare() method. However, the compare() method is only available in pandas version 1.1.0 or later. Therefore, if the code in this tutorial doesn’t work for you, you should check the version of the pandas module on your machine. For this, you can execute the following code.

import pandas as pd
pd.__version__

If the pandas version on your machine is older than 1.1.0, you can upgrade it using pip as shown below.

pip3 install pandas --upgrade

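If the pip3 command is not available on your machine, you can use python3 -m pip instead of pip3 in the above command. Pandas 1.1.0 and later versions require Python 3, so the compare() method cannot be used with Python 2.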
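
The version check above can also be done inside a script. The following is a minimal sketch; the version_tuple() helper is a hypothetical name introduced only for this illustration and is not part of pandas.

```python
import re
import pandas as pd

def version_tuple(version_string):
    """Turn a string such as '1.5.3' into a tuple such as (1, 5, 3)."""
    numbers = []
    for piece in version_string.split(".")[:3]:
        # Keep only the leading digits so that a piece like '0rc1' becomes 0.
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)

# Checking for the method itself is the most direct test.
has_compare = hasattr(pd.DataFrame, "compare")

print("pandas version:", pd.__version__)
print("compare() available:", has_compare)
```

If the second line prints False, you need to upgrade pandas before running the examples in this article.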

The compare() Method

The compare() method, when invoked on a dataframe object, takes the second dataframe as its first input argument and three optional input arguments. The syntax for the compare() method is as follows.

df1.compare(df2, align_axis=1, keep_shape=False, keep_equal=False)

Here, 

  • df1 is the first dataframe.
  • The parameter df2 denotes the second dataframe to which df1 is to be compared.
  • The align_axis parameter decides the axis along which the comparison results are laid out. By default, it has the value 1, which means that the self and other values are placed side by side in adjacent columns. If the value 0 is assigned to the align_axis parameter, the self and other values are stacked in alternating rows instead.
  • The keep_shape parameter decides whether the output keeps every row and column of the input dataframes or only those that contain at least one difference. It has the default value False, which means that rows and columns in which all the values are equal are dropped from the resultant dataframe. To keep all the rows and columns, you can pass the value True to the keep_shape parameter.
  • The keep_equal parameter decides how equal values are displayed. It has the default value False, which means that a cell whose values are equal in both dataframes is shown as NaN in the resultant dataframe. To keep the original values instead of NaN, you can assign the value True to the keep_equal parameter.
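
To see the effect of each parameter at a glance, you can try them on two very small dataframes. The following sketch uses made-up dataframes named left and right that differ in a single cell. These names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Two tiny dataframes that differ in one cell (row 1 of column "b").
left = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
right = pd.DataFrame({"a": [1, 2, 3], "b": [4, 0, 6]})

# Defaults: only the differing row and column are kept.
default_diff = left.compare(right)

# keep_shape=True keeps every row and column; equal cells become NaN.
full_shape = left.compare(right, keep_shape=True)

# keep_equal=True additionally shows the original values instead of NaN.
full_values = left.compare(right, keep_shape=True, keep_equal=True)

# align_axis=0 stacks the self and other values in rows.
stacked = left.compare(right, align_axis=0)

for name, frame in [("default", default_diff), ("keep_shape", full_shape),
                    ("keep_equal", full_values), ("align_axis=0", stacked)]:
    print(name, frame.shape)
```

With the default arguments, the result has one row and two columns (self and other for column b). With keep_shape=True, it has three rows and four columns.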

Compare Pandas DataFrames Column-wise

To compare the dataframes so that the output values are organized horizontally, you can simply invoke the compare() method on the first dataframe and pass the second dataframe as the input argument as shown in the following example.

import pandas as pd
myDicts1=[{"Roll":1,"Maths":100, "Physics":87, "Chemistry": 82},
        {"Roll":2,"Maths":75, "Physics":100, "Chemistry": 90},
        {"Roll":3,"Maths":87, "Physics":84, "Chemistry": 76},
        {"Roll":4,"Maths":100, "Physics":100, "Chemistry": 90},
        {"Roll":5,"Maths":90, "Physics":87, "Chemistry": 84},
        {"Roll":6,"Maths":79, "Physics":75, "Chemistry": 72}]
df1=pd.DataFrame(myDicts1)
print("The first dataframe is:")
print(df1)
myDicts2=[{"Roll":1,"Maths":95, "Physics":92, "Chemistry": 75},
        {"Roll":2,"Maths":73, "Physics":98, "Chemistry": 90},
        {"Roll":3,"Maths":88, "Physics":85, "Chemistry": 76},
        {"Roll":4,"Maths":100, "Physics":99, "Chemistry": 90},
        {"Roll":5,"Maths":90, "Physics":70, "Chemistry": 96},
        {"Roll":6,"Maths":89, "Physics":75, "Chemistry": 72}]
df2=pd.DataFrame(myDicts2)
print("The second dataframe is:")
print(df2)
output_df=df1.compare(df2)
print("The output dataframe is:")
print(output_df)

Output:

The first dataframe is:
   Roll  Maths  Physics  Chemistry
0     1    100       87         82
1     2     75      100         90
2     3     87       84         76
3     4    100      100         90
4     5     90       87         84
5     6     79       75         72
The second dataframe is:
   Roll  Maths  Physics  Chemistry
0     1     95       92         75
1     2     73       98         90
2     3     88       85         76
3     4    100       99         90
4     5     90       70         96
5     6     89       75         72
The output dataframe is:
   Maths       Physics       Chemistry      
    self other    self other      self other
0  100.0  95.0    87.0  92.0      82.0  75.0
1   75.0  73.0   100.0  98.0       NaN   NaN
2   87.0  88.0    84.0  85.0       NaN   NaN
3    NaN   NaN   100.0  99.0       NaN   NaN
4    NaN   NaN    87.0  70.0      84.0  96.0
5   79.0  89.0     NaN   NaN       NaN   NaN

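In the above output, the Roll column has the same values in both dataframes. Hence, this column is dropped from the output. Because the remaining equal cells are filled with NaN, the integer marks are displayed as floating-point numbers. To display all the columns in the resultant dataframe, you can assign the value True to the keep_shape parameter as follows.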

import pandas as pd
myDicts1=[{"Roll":1,"Maths":100, "Physics":87, "Chemistry": 82},
        {"Roll":2,"Maths":75, "Physics":100, "Chemistry": 90},
        {"Roll":3,"Maths":87, "Physics":84, "Chemistry": 76},
        {"Roll":4,"Maths":100, "Physics":100, "Chemistry": 90},
        {"Roll":5,"Maths":90, "Physics":87, "Chemistry": 84},
        {"Roll":6,"Maths":79, "Physics":75, "Chemistry": 72}]
df1=pd.DataFrame(myDicts1)
print("The first dataframe is:")
print(df1)
myDicts2=[{"Roll":1,"Maths":95, "Physics":92, "Chemistry": 75},
        {"Roll":2,"Maths":73, "Physics":98, "Chemistry": 90},
        {"Roll":3,"Maths":88, "Physics":85, "Chemistry": 76},
        {"Roll":4,"Maths":100, "Physics":99, "Chemistry": 90},
        {"Roll":5,"Maths":90, "Physics":70, "Chemistry": 96},
        {"Roll":6,"Maths":89, "Physics":75, "Chemistry": 72}]
df2=pd.DataFrame(myDicts2)
print("The second dataframe is:")
print(df2)
output_df=df1.compare(df2,keep_shape=True)
print("The output dataframe is:")
print(output_df)

Output:

The first dataframe is:
   Roll  Maths  Physics  Chemistry
0     1    100       87         82
1     2     75      100         90
2     3     87       84         76
3     4    100      100         90
4     5     90       87         84
5     6     79       75         72
The second dataframe is:
   Roll  Maths  Physics  Chemistry
0     1     95       92         75
1     2     73       98         90
2     3     88       85         76
3     4    100       99         90
4     5     90       70         96
5     6     89       75         72
The output dataframe is:
  Roll        Maths       Physics       Chemistry      
  self other   self other    self other      self other
0  NaN   NaN  100.0  95.0    87.0  92.0      82.0  75.0
1  NaN   NaN   75.0  73.0   100.0  98.0       NaN   NaN
2  NaN   NaN   87.0  88.0    84.0  85.0       NaN   NaN
3  NaN   NaN    NaN   NaN   100.0  99.0       NaN   NaN
4  NaN   NaN    NaN   NaN    87.0  70.0      84.0  96.0
5  NaN   NaN   79.0  89.0     NaN   NaN       NaN   NaN

To keep the original values for the columns that have equal values instead of NaN, you can assign the value True to the keep_equal parameter as shown below.

import pandas as pd
myDicts1=[{"Roll":1,"Maths":100, "Physics":87, "Chemistry": 82},
        {"Roll":2,"Maths":75, "Physics":100, "Chemistry": 90},
        {"Roll":3,"Maths":87, "Physics":84, "Chemistry": 76},
        {"Roll":4,"Maths":100, "Physics":100, "Chemistry": 90},
        {"Roll":5,"Maths":90, "Physics":87, "Chemistry": 84},
        {"Roll":6,"Maths":79, "Physics":75, "Chemistry": 72}]
df1=pd.DataFrame(myDicts1)
print("The first dataframe is:")
print(df1)
myDicts2=[{"Roll":1,"Maths":95, "Physics":92, "Chemistry": 75},
        {"Roll":2,"Maths":73, "Physics":98, "Chemistry": 90},
        {"Roll":3,"Maths":88, "Physics":85, "Chemistry": 76},
        {"Roll":4,"Maths":100, "Physics":99, "Chemistry": 90},
        {"Roll":5,"Maths":90, "Physics":70, "Chemistry": 96},
        {"Roll":6,"Maths":89, "Physics":75, "Chemistry": 72}]
df2=pd.DataFrame(myDicts2)
print("The second dataframe is:")
print(df2)
output_df=df1.compare(df2,keep_shape=True, keep_equal=True)
print("The output dataframe is:")
print(output_df)

Output:

The first dataframe is:
   Roll  Maths  Physics  Chemistry
0     1    100       87         82
1     2     75      100         90
2     3     87       84         76
3     4    100      100         90
4     5     90       87         84
5     6     79       75         72
The second dataframe is:
   Roll  Maths  Physics  Chemistry
0     1     95       92         75
1     2     73       98         90
2     3     88       85         76
3     4    100       99         90
4     5     90       70         96
5     6     89       75         72
The output dataframe is:
  Roll       Maths       Physics       Chemistry      
  self other  self other    self other      self other
0    1     1   100    95      87    92        82    75
1    2     2    75    73     100    98        90    90
2    3     3    87    88      84    85        76    76
3    4     4   100   100     100    99        90    90
4    5     5    90    90      87    70        84    96
5    6     6    79    89      75    75        72    72
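
The output dataframe has two levels of column labels: the original column name on top, and self or other below it. The following sketch shows one possible way to read values from such a result. It uses a shortened version of the marks data, and variable names such as maths_change are assumptions made for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Roll": [1, 2, 3], "Maths": [100, 75, 87]})
df2 = pd.DataFrame({"Roll": [1, 2, 3], "Maths": [95, 75, 88]})

output_df = df1.compare(df2, keep_shape=True, keep_equal=True)

# Selecting the top-level label gives a dataframe with the columns self and other.
maths = output_df["Maths"]

# Difference between the second and the first dataframe for each record.
maths_change = maths["other"] - maths["self"]
print(maths_change)

# xs() picks one label from the second level, here every "self" column.
self_values = output_df.xs("self", axis=1, level=1)
print(self_values)
```

Here, maths_change contains the values -5, 0, and 1, and self_values contains the same columns and values as df1.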

You should remember that the dataframes can be compared only if they are identically labeled. In other words, the dataframes that are being compared should have the same row index and the same column names in the same order. If the dataframes have a different number of columns, or an equal number of columns in a different order, the compare() method raises a ValueError exception.
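
The following sketch illustrates this restriction with two small dataframes whose columns are in a different order. Reordering the columns of the second dataframe before the comparison is one possible workaround. It assumes that both dataframes contain the same column names.

```python
import pandas as pd

df1 = pd.DataFrame({"Roll": [1, 2], "Maths": [100, 75]})
# Same columns as df1, but in a different order.
df2 = pd.DataFrame({"Maths": [95, 75], "Roll": [1, 2]})

try:
    df1.compare(df2)
    error_message = None
except ValueError as error:
    error_message = str(error)
print("Error:", error_message)

# Reorder the columns of df2 to match df1, then compare again.
aligned_df2 = df2[df1.columns]
output_df = df1.compare(aligned_df2)
print(output_df)
```

After the columns are reordered, the comparison succeeds and reports the single difference in the Maths column of the first row.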

Compare DataFrames Row-wise in Python

To show the output after comparing the dataframes row-wise, you can assign the value 0 to the align_axis parameter as shown below. In this case, the self and other values for each record are stacked in consecutive rows.

import pandas as pd
myDicts1=[{"Roll":1,"Maths":100, "Physics":87, "Chemistry": 82},
        {"Roll":2,"Maths":75, "Physics":100, "Chemistry": 90},
        {"Roll":3,"Maths":87, "Physics":84, "Chemistry": 76},
        {"Roll":4,"Maths":100, "Physics":100, "Chemistry": 90},
        {"Roll":5,"Maths":90, "Physics":87, "Chemistry": 84},
        {"Roll":6,"Maths":79, "Physics":75, "Chemistry": 72}]
df1=pd.DataFrame(myDicts1)
print("The first dataframe is:")
print(df1)
myDicts2=[{"Roll":1,"Maths":95, "Physics":92, "Chemistry": 75},
        {"Roll":2,"Maths":73, "Physics":98, "Chemistry": 90},
        {"Roll":3,"Maths":88, "Physics":85, "Chemistry": 76},
        {"Roll":4,"Maths":100, "Physics":99, "Chemistry": 90},
        {"Roll":5,"Maths":90, "Physics":70, "Chemistry": 96},
        {"Roll":6,"Maths":89, "Physics":75, "Chemistry": 72}]
df2=pd.DataFrame(myDicts2)
print("The second dataframe is:")
print(df2)
output_df=df1.compare(df2,keep_shape=True, keep_equal=True, align_axis=0)
print("The output dataframe is:")
print(output_df)

Output:

The first dataframe is:
   Roll  Maths  Physics  Chemistry
0     1    100       87         82
1     2     75      100         90
2     3     87       84         76
3     4    100      100         90
4     5     90       87         84
5     6     79       75         72
The second dataframe is:
   Roll  Maths  Physics  Chemistry
0     1     95       92         75
1     2     73       98         90
2     3     88       85         76
3     4    100       99         90
4     5     90       70         96
5     6     89       75         72
The output dataframe is:
         Roll  Maths  Physics  Chemistry
0 self      1    100       87         82
  other     1     95       92         75
1 self      2     75      100         90
  other     2     73       98         90
2 self      3     87       84         76
  other     3     88       85         76
3 self      4    100      100         90
  other     4    100       99         90
4 self      5     90       87         84
  other     5     90       70         96
5 self      6     79       75         72
  other     6     89       75         72
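
When the output is stacked row-wise, the row index has two levels: the original row label, and self or other. The following sketch shows one possible way to use this layout to count how many values changed in each column. It uses a shortened version of the marks data and assumes that the input dataframes contain no missing values of their own.

```python
import pandas as pd

df1 = pd.DataFrame({"Roll": [1, 2, 3],
                    "Maths": [100, 75, 87],
                    "Physics": [87, 100, 84]})
df2 = pd.DataFrame({"Roll": [1, 2, 3],
                    "Maths": [95, 75, 88],
                    "Physics": [87, 98, 84]})

# Default keep_shape and keep_equal: equal cells are shown as NaN.
stacked = df1.compare(df2, align_axis=0)
print(stacked)

# Pick every "other" row from the second level of the row index.
other_rows = stacked.xs("other", level=1)

# A cell that is not NaN marks a value that differs between the dataframes.
changes_per_column = other_rows.notna().sum()
print(changes_per_column)
```

In this example, two values differ in the Maths column and one value differs in the Physics column. The Roll column does not appear because it is identical in both dataframes.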

Conclusion

In this article, we have discussed how to compare two dataframes in Python. To learn more about Python programming, you can read this article on dictionary comprehension in Python. You might also like this article on list comprehension in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

The post Compare Pandas DataFrames in Python appeared first on PythonForBeginners.com.


January 06, 2023 at 07:30PM