Python for Beginners: Drop Rows With Nan Values in a Pandas Dataframe

Post content copied from Planet Python.


Handling NaN values is a tedious task during data cleaning. In this article, we will discuss different ways to drop rows with NaN values from a pandas dataframe using the dropna() method.

The dropna() Method

The dropna() method can be used to drop rows having nan values in a pandas dataframe. It has the following syntax.

DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)

Here, 

  • The axis parameter decides whether rows or columns that contain NaN values are dropped. By default, axis is set to 0, so rows with NaN values are dropped when the dropna() method is executed on the dataframe. Setting axis to 1 drops columns instead.
  • The “how” parameter determines whether a row is dropped only when all of its values are NaN (“all”) or as soon as it has at least one NaN value (“any”). When “how” isn’t specified, dropna() behaves as if it were set to “any”. Due to this, even if a single NaN value is present, the row will be deleted from the dataframe.
  • The thresh parameter sets the minimum number of non-NaN values that a row must have in order to be kept. For instance, if you want to delete a row if it has fewer than n non-null values, you can pass the number n to the thresh parameter.
  • The subset parameter is used when we want to check for NaN values in only specific columns in each row. By default, the subset parameter is set to None. Hence, the dropna() method searches for NaN values in all the columns. If you want it to search for NaN values in only a specific column in each row, you can pass the column name to the subset parameter. To check for NaN values in two or more columns, you can pass the list of column names to the subset parameter.
  • The inplace parameter decides whether we get a new dataframe after the drop operation or modify the original dataframe. When inplace is set to False, which is its default value, the original dataframe isn’t changed and the dropna() method returns the modified dataframe after execution. To modify the original dataframe, you can set inplace to True.
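The examples in this article all use the default axis. As a minimal sketch of how the axis parameter changes the result, the following code runs dropna() on a small made-up dataframe (the column names A, B, and C and their values are assumptions for illustration, not part of grade2.csv).

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe (hypothetical data, not taken from grade2.csv).
df = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0],
        "B": [4.0, 5.0, 6.0],
        "C": [np.nan, np.nan, 9.0],
    }
)

# axis=0 (the default) drops every row that contains a NaN value.
rows_dropped = df.dropna(axis=0)

# axis=1 drops every column that contains a NaN value instead.
columns_dropped = df.dropna(axis=1)

print(rows_dropped)
print(columns_dropped)
```

Here, only the last row has no NaN values, and only column B has no NaN values, so each call keeps a single row or a single column respectively.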

Drop Rows Having NaN Values in Any Column in a Dataframe

To drop rows from a pandas dataframe that have NaN values in any of the columns, you can directly invoke the dropna() method on the input dataframe. After execution, it returns a new dataframe from which every row containing a NaN value has been removed. You can observe this in the following example.

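import pandas as pd
# Load the sample data used throughout this article.
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# Drop every row that has a NaN value in at least one column.
df = df.dropna()
print("After dropping NaN values:")
print(df)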

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
2     3.0  33.0         NaN    NaN   NaN
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
6     NaN   NaN         NaN    NaN   NaN
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
9     NaN   NaN         NaN    NaN   NaN
10    3.0  15.0      Lokesh   88.0     A
After dropping NaN values:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
3     3.0  34.0         Amy   88.0     A
5     3.0  27.0      Aditya   55.0     C
7     3.0  23.0  Radheshyam   78.0     B
10    3.0  15.0      Lokesh   88.0     A

In the above example, the input dataframe contains many rows with NaN values. Once we invoke the dropna() method on the input dataframe, it returns a dataframe that has no null values in it.
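If you don't have the grade2.csv file, the following sketch rebuilds the same data in memory so that the example can be run as is. The values are retyped from the printed output above, which is an assumption about the file's contents. The sketch also uses isna() and any() to list the rows that dropna() is about to remove.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Data retyped from the printed output above; the real grade2.csv is not needed.
rows = [
    [2, 27, "Harsh", 55, "C"],
    [2, 23, "Clara", 78, "B"],
    [3, 33, nan, nan, nan],
    [3, 34, "Amy", 88, "A"],
    [3, 15, nan, 78, "B"],
    [3, 27, "Aditya", 55, "C"],
    [nan, nan, nan, nan, nan],
    [3, 23, "Radheshyam", 78, "B"],
    [3, 11, "Bobby", 50, nan],
    [nan, nan, nan, nan, nan],
    [3, 15, "Lokesh", 88, "A"],
]
df = pd.DataFrame(rows, columns=["Class", "Roll", "Name", "Marks", "Grade"])

# True for every row that contains at least one NaN value.
has_nan = df.isna().any(axis=1)
print("Rows that will be dropped:", df.index[has_nan].tolist())

# dropna() with default arguments removes exactly those rows.
cleaned = df.dropna()
print("Rows that are kept:", cleaned.index.tolist())
```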

Drop Rows Having NaN Values in All the Columns in a Dataframe

By default, the dropna() method drops a row from a dataframe if it has a NaN value in at least one column. If you want to drop a row only if it has NaN values in all the columns, you can set the “how” parameter in the dropna() method to “all”. After this, a row is dropped from the dataframe only when all of its columns contain NaN values.

import pandas as pd
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# Drop only the rows in which every column is NaN.
df = df.dropna(how="all")
print("After dropping NaN values:")
print(df)

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
2     3.0  33.0         NaN    NaN   NaN
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
6     NaN   NaN         NaN    NaN   NaN
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
9     NaN   NaN         NaN    NaN   NaN
10    3.0  15.0      Lokesh   88.0     A
After dropping NaN values:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
2     3.0  33.0         NaN    NaN   NaN
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
10    3.0  15.0      Lokesh   88.0     A

In this example, we have set the how parameter to "all" in the dropna() method. Due to this, only those rows are deleted from the input dataframe where all the values are NaN. Thus, only the two rows having NaN values in all the columns (rows 6 and 9) are dropped from the input dataframe, instead of the five rows dropped in the previous example.
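To see what how="all" checks for, you can build the same filter by hand with a boolean mask. The following is a minimal sketch on a five-row excerpt of the data shown above (rows 0, 2, 4, 6, and 8, retyped from the printed output, which is an assumption about the file's contents).

```python
import numpy as np
import pandas as pd

nan = np.nan
# Five-row excerpt retyped from the output above, keeping the original row labels.
df = pd.DataFrame(
    [
        [2, 27, "Harsh", 55, "C"],
        [3, 33, nan, nan, nan],
        [3, 15, nan, 78, "B"],
        [nan, nan, nan, nan, nan],
        [3, 11, "Bobby", 50, nan],
    ],
    columns=["Class", "Roll", "Name", "Marks", "Grade"],
    index=[0, 2, 4, 6, 8],
)

# True only for rows in which every column is NaN.
all_nan = df.isna().all(axis=1)

by_mask = df[~all_nan]             # manual filter
by_dropna = df.dropna(how="all")   # same filter through dropna()

print(by_dropna.index.tolist())
print(by_mask.equals(by_dropna))
```

Only row 6 is entirely NaN in this excerpt, so both approaches keep the other four rows.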

Drop Rows Having Fewer Than N Non-null Values

Instead of one or all, you might also want to have control over the number of NaN values in each row. For this, you can specify the minimum number of non-null values in each row in the output dataframe using the thresh parameter in the dropna() method. After this, each row in the output dataframe returned by the dropna() method will contain at least N non-null values. Here, N is the number passed as an input argument to the thresh parameter. You can observe this in the following example.

import pandas as pd
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# Keep only the rows that have at least 4 non-null values.
df = df.dropna(thresh=4)
print("After dropping NaN values:")
print(df)

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
2     3.0  33.0         NaN    NaN   NaN
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
6     NaN   NaN         NaN    NaN   NaN
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
9     NaN   NaN         NaN    NaN   NaN
10    3.0  15.0      Lokesh   88.0     A
After dropping NaN values:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
10    3.0  15.0      Lokesh   88.0     A

In this example, we have specified thresh=4 in the dropna() method. Due to this, only those rows that have fewer than 4 non-null values are dropped from the input dataframe. A row that has a null value but still has at least 4 non-null values, such as rows 4 and 8, isn’t dropped from the dataframe.
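A quick way to predict which rows survive a given thresh value is to count the non-null values in each row yourself. The following sketch does this with notna() and sum() on a five-row excerpt of the data shown above (retyped from the printed output, which is an assumption about the file's contents).

```python
import numpy as np
import pandas as pd

nan = np.nan
# Five-row excerpt retyped from the output above, keeping the original row labels.
df = pd.DataFrame(
    [
        [2, 27, "Harsh", 55, "C"],
        [3, 33, nan, nan, nan],
        [3, 15, nan, 78, "B"],
        [nan, nan, nan, nan, nan],
        [3, 11, "Bobby", 50, nan],
    ],
    columns=["Class", "Roll", "Name", "Marks", "Grade"],
    index=[0, 2, 4, 6, 8],
)

# Number of non-null values in each row.
non_null_counts = df.notna().sum(axis=1)
print(non_null_counts)

# thresh=4 keeps exactly the rows whose count is 4 or more.
kept = df.dropna(thresh=4)
expected = non_null_counts[non_null_counts >= 4].index.tolist()
print(kept.index.tolist() == expected)
```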

Drop Rows Having at Least N Null Values in Pandas Dataframe

Instead of keeping rows that have at least N non-null values, you might want to drop all the rows from the input dataframe that have at least N null values. For this, we will first find the number of columns in the input dataframe using the columns attribute and the len() function. A row with at least N null values has at most (number of columns - N) non-null values. Hence, every row that we want to keep must have at least (number of columns - N + 1) non-null values. We will pass this number to the thresh parameter in the dropna() method.

After execution of the dropna() method, we will get the output dataframe after dropping all the rows having at least N null values. You can observe this in the following example.

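import pandas as pd
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# Drop rows that have N or more null values.
N = 3
number_of_columns = len(df.columns)
# A row is kept only if it has at least (number_of_columns - N + 1) non-null values.
df = df.dropna(thresh=number_of_columns - N + 1)
print("After dropping NaN values:")
print(df)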

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
2     3.0  33.0         NaN    NaN   NaN
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
6     NaN   NaN         NaN    NaN   NaN
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
9     NaN   NaN         NaN    NaN   NaN
10    3.0  15.0      Lokesh   88.0     A
After dropping NaN values:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
10    3.0  15.0      Lokesh   88.0     A

This example is just a variation of the previous example. To drop rows having at least N null values, you need to preserve rows having (number of columns - N + 1) or more non-null values. Here, N is 3 and the dataframe has 5 columns, so thresh is set to 3 and the rows with three or more null values are dropped.
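The arithmetic above is easy to get wrong by one, so it can help to wrap it in a small function. The following is a sketch only: the function name drop_rows_with_at_least_n_nulls is a hypothetical helper and not part of pandas, and the data is a five-row excerpt retyped from the output above.

```python
import numpy as np
import pandas as pd


def drop_rows_with_at_least_n_nulls(frame, n):
    """Hypothetical helper: drop rows that have n or more NaN values."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Rows to keep need at least (number of columns - n + 1) non-null values.
    return frame.dropna(thresh=len(frame.columns) - n + 1)


nan = np.nan
# Five-row excerpt retyped from the output above, keeping the original row labels.
df = pd.DataFrame(
    [
        [2, 27, "Harsh", 55, "C"],
        [3, 33, nan, nan, nan],
        [3, 15, nan, 78, "B"],
        [nan, nan, nan, nan, nan],
        [3, 11, "Bobby", 50, nan],
    ],
    columns=["Class", "Roll", "Name", "Marks", "Grade"],
    index=[0, 2, 4, 6, 8],
)

result = drop_rows_with_at_least_n_nulls(df, 3)

# Cross-check against a direct count of null values per row.
null_counts = df.isna().sum(axis=1)
check = df[null_counts < 3]

print(result.index.tolist())
print(result.equals(check))
```

Counting the null values directly, as in the cross-check, gives the same rows and may be easier to read when the threshold is expressed in terms of missing values.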

Drop Rows Having NaN Values in Specific Columns in Pandas

By default, the dropna() method searches for NaN values in all the columns in each row. If you want to drop rows from a dataframe only if they have null values in specific columns, you can use the subset parameter in the dropna() method.

The subset parameter in the dropna() method takes a list of column names as its input argument. After this, the dropna() method checks only the specified columns and drops the rows that have a null value in any of them. You can observe this in the following example.

import pandas as pd
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# Check for NaN values only in the Class, Roll, and Marks columns.
df = df.dropna(subset=["Class", "Roll", "Marks"])
print("After dropping NaN values:")
print(df)

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
2     3.0  33.0         NaN    NaN   NaN
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
6     NaN   NaN         NaN    NaN   NaN
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
9     NaN   NaN         NaN    NaN   NaN
10    3.0  15.0      Lokesh   88.0     A
After dropping NaN values:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
10    3.0  15.0      Lokesh   88.0     A

In this example, we have passed the list ["Class", "Roll", "Marks"] to the subset parameter in the dropna() method. Due to this, the dropna() method searches for NaN values in only these columns of the dataframe. Any row having a NaN value in any of these columns is dropped from the dataframe after execution of the dropna() method. If a row has non-null values in all of these columns, it won’t be dropped from the dataframe even if it has NaN values in other columns.
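The subset check can also be written as a boolean mask, which makes it clear that only the listed columns are inspected. The following sketch compares the two approaches on a five-row excerpt of the data shown above (retyped from the printed output, which is an assumption about the file's contents). It also passes a one-element list to check a single column.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Five-row excerpt retyped from the output above, keeping the original row labels.
df = pd.DataFrame(
    [
        [2, 27, "Harsh", 55, "C"],
        [3, 33, nan, nan, nan],
        [3, 15, nan, 78, "B"],
        [nan, nan, nan, nan, nan],
        [3, 11, "Bobby", 50, nan],
    ],
    columns=["Class", "Roll", "Name", "Marks", "Grade"],
    index=[0, 2, 4, 6, 8],
)

required = ["Class", "Roll", "Marks"]

# Drop rows that have a NaN value in any of the required columns.
by_dropna = df.dropna(subset=required)

# The same filter as a mask: keep rows where all required columns are non-null.
by_mask = df[df[required].notna().all(axis=1)]

# A one-element list checks a single column.
name_present = df.dropna(subset=["Name"])

print(by_dropna.index.tolist())
print(by_mask.equals(by_dropna))
print(name_present.index.tolist())
```

Rows 4 and 8 survive the first call because their NaN values are in the Name and Grade columns, which are not in the subset.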

Drop Rows With NaN Values Inplace From a Pandas Dataframe

In all the examples in the previous sections, the dropna() method doesn’t modify the input dataframe. Every time, it returns a new dataframe, which we assigned back to the variable df. To modify the input dataframe itself by dropping NaN values, you can use the inplace parameter in the dropna() method. When the inplace parameter is set to True, the dropna() method modifies the original dataframe instead of creating a new one. You can observe this in the following example.

import pandas as pd
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# Modify df itself instead of assigning the returned dataframe.
df.dropna(inplace=True)
print("After dropping NaN values:")
print(df)

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
2     3.0  33.0         NaN    NaN   NaN
3     3.0  34.0         Amy   88.0     A
4     3.0  15.0         NaN   78.0     B
5     3.0  27.0      Aditya   55.0     C
6     NaN   NaN         NaN    NaN   NaN
7     3.0  23.0  Radheshyam   78.0     B
8     3.0  11.0       Bobby   50.0   NaN
9     NaN   NaN         NaN    NaN   NaN
10    3.0  15.0      Lokesh   88.0     A
After dropping NaN values:
    Class  Roll        Name  Marks Grade
0     2.0  27.0       Harsh   55.0     C
1     2.0  23.0       Clara   78.0     B
3     3.0  34.0         Amy   88.0     A
5     3.0  27.0      Aditya   55.0     C
7     3.0  23.0  Radheshyam   78.0     B
10    3.0  15.0      Lokesh   88.0     A

In this example, we have set the inplace parameter to True in the dropna() method. Hence, the dropna() method modifies the original dataframe instead of creating a new one.
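The difference between the two styles becomes visible when a second variable refers to the same dataframe. The following sketch uses a five-row excerpt of the data shown above (retyped from the printed output, which is an assumption about the file's contents); the helper make_frame() and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan


def make_frame():
    """Hypothetical helper: rebuild a five-row excerpt of the sample data."""
    return pd.DataFrame(
        [
            [2, 27, "Harsh", 55, "C"],
            [3, 33, nan, nan, nan],
            [3, 15, nan, 78, "B"],
            [nan, nan, nan, nan, nan],
            [3, 11, "Bobby", 50, nan],
        ],
        columns=["Class", "Roll", "Name", "Marks", "Grade"],
        index=[0, 2, 4, 6, 8],
    )


# Reassignment: df1 now names a new dataframe, alias1 still sees all five rows.
df1 = make_frame()
alias1 = df1
df1 = df1.dropna()

# In place: the object itself changes, so alias2 sees the change as well.
df2 = make_frame()
alias2 = df2
df2.dropna(inplace=True)

print(len(alias1), len(df1))
print(len(alias2), len(df2))
```

With inplace=True, the result lives in the original dataframe, so it is safer not to rely on the value returned by the call.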

Conclusion

In this article, we have discussed different ways to drop rows with NaN values from a pandas dataframe using the dropna() method.

To know more about the pandas module, you can read this article on how to sort a pandas dataframe. You might also like this article on how to drop columns from a pandas dataframe.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

The post Drop Rows With Nan Values in a Pandas Dataframe appeared first on PythonForBeginners.com.


December 16, 2022 at 07:30PM