Coding Ref

How to drop duplicate columns in Pandas

To drop duplicate columns in a Pandas DataFrame, you can use the following code:

main.py
df = df.loc[:,~df.columns.duplicated()].copy()

Here is how this code works:

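  • The df.columns attribute returns the column labels of the DataFrame as an Index object.
  • The duplicated() method is called on this Index. It returns a boolean array that is True for every label that repeats an earlier label, and False otherwise. The first occurrence of each label is not marked as a duplicate, and only the labels are compared, not the values stored in the columns.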
  • The ~ operator is used to invert the logical values returned by the duplicated() method. This means that the ~df.columns.duplicated() expression will evaluate to True for columns that are not duplicates, and False for columns that are duplicates.
  • The loc attribute is used to subset the DataFrame, using the ~df.columns.duplicated() expression as a filter to select only the columns that are not duplicates.
  • The copy() method is called on the resulting DataFrame to create a new DataFrame object with the duplicate columns removed.
  • The resulting DataFrame is assigned to the df variable, overwriting the original DataFrame.
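
The steps above can be traced on a small example. This is a minimal sketch: the DataFrame, its column labels, and the names mask and deduped are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the column label "a" appears twice.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# True marks a label that repeats an earlier one; the first occurrence stays False.
mask = df.columns.duplicated()

# Keep only the columns whose label has not been seen before.
deduped = df.loc[:, ~mask].copy()

print(mask.tolist())          # [False, False, True]
print(list(deduped.columns))  # ['a', 'b']
```

Only the first "a" column survives, so deduped["a"] holds the values 1 and 4. The second "a" column is dropped even though its values differ, because the comparison looks at labels only.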

How to remove duplicated indexes

Use the following code to remove rows whose index label duplicates an earlier one.

main.py
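df = df.loc[~df.index.duplicated(),:].copy()

Here is how this code works:

  • The df.index attribute returns the row labels of the DataFrame as an Index object.
  • The duplicated() method is called on this Index. It returns a boolean array that is True for every row label that repeats an earlier label, and False otherwise. The first occurrence of each label is not marked as a duplicate, and only the labels are compared, not the values stored in the rows.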
  • The ~ operator is used to invert the logical values returned by the duplicated() method. This means that the ~df.index.duplicated() expression will evaluate to True for rows that are not duplicates, and False for rows that are duplicates.
  • The loc attribute is used to subset the DataFrame, using the ~df.index.duplicated() expression as a filter to select only the rows that are not duplicates.
  • The copy() method is called on the resulting DataFrame to create a new DataFrame object with the duplicate rows removed.
  • The resulting DataFrame is assigned to the df variable, overwriting the original DataFrame.
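
The same idea can be sketched for row labels. The data and the names mask, deduped and deduped_last are illustrative assumptions. The sketch also shows the keep parameter of duplicated(), which defaults to "first" and can be set to "last" to retain the final occurrence of each label instead.

```python
import pandas as pd

# Hypothetical data: the index label "x" appears twice.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["x", "y", "x"])

# Default keep="first": the second "x" is flagged as a duplicate.
mask = df.index.duplicated()
deduped = df.loc[~mask, :].copy()

# keep="last": the first "x" is flagged instead, so the last one is retained.
deduped_last = df.loc[~df.index.duplicated(keep="last"), :].copy()

print(deduped["value"].tolist())       # [10, 20]
print(deduped_last["value"].tolist())  # [20, 30]
```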

How to remove duplicate columns by checking values

Use the following code to remove columns whose values duplicate those of an earlier column, regardless of their labels:

main.py
df = df.loc[:,~df.T.duplicated()].copy()

Here is how this code works:

  • The T attribute transposes the DataFrame, so that each original column becomes a row.
  • The duplicated() method is called on the transposed DataFrame. It returns a boolean Series, indexed by the original column labels, that is True for every column whose values are all identical to those of an earlier column, and False otherwise.
  • The ~ operator is used to invert the logical values returned by the duplicated() method. This means that the ~df.T.duplicated() expression will evaluate to True for columns that are not duplicates, and False for columns that are duplicates.
  • The loc attribute is used to subset the DataFrame, using the ~df.T.duplicated() expression as a filter to select only the columns that are not duplicates.
  • The copy() method is called on the resulting DataFrame to create a new DataFrame object with the duplicate columns removed.
  • The resulting DataFrame is assigned to the df variable, overwriting the original DataFrame.
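
The sketch below applies a transpose-based comparison to a small, made-up DataFrame: transposing turns each column into a row, so DataFrame.duplicated() compares whole columns value by value. The column names, the values, and the names mask and deduped are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: "c" holds the same values as "a" under a different label,
# while "b" holds the same numbers in a different order.
df = pd.DataFrame({"a": [1, 2], "b": [2, 1], "c": [1, 2]})

# After transposing, each original column is a row, so duplicated()
# flags a column only when every value matches an earlier column.
mask = df.T.duplicated()

# Keep the columns that are not value-for-value copies of an earlier column.
deduped = df.loc[:, ~mask].copy()

print(mask.tolist())          # [False, False, True]
print(list(deduped.columns))  # ['a', 'b']
```

Column "b" is kept because its values appear in a different order from those of "a". Note that transposing a DataFrame with mixed column types converts the values to the object dtype, which can be slow on large DataFrames.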
