What does factorize() do in Pandas?

The factorize function in Pandas encodes categorical data as integer codes, assigning one code to each distinct value.

Called as pd.factorize, it takes a sequence such as a Series containing categorical data and returns a tuple with two elements: a NumPy array of integer codes, and the unique values in order of first appearance. When the input is a Series, the unique values are returned as an Index object.

For example, consider the following Series object:

main.py
import pandas as pd

s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana'])

This Series has five elements but only three distinct values: 'apple', 'banana', and 'orange'.

To encode the values in this Series as numerical values, you could do the following:

main.py
import pandas as pd

s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana'])

# Encode the values as numerical values
encoded, categories = pd.factorize(s)

# Print the array of integer codes
print(encoded)

# Print the categories Index
print(categories)

In the code above, pd.factorize is applied to the Series and encodes each value as an integer. The result is a tuple containing two elements: a NumPy array of integer codes, and an Index object containing the unique values in the original Series.

In this case, the array of codes is [0, 1, 0, 2, 1]: 'apple' is encoded as 0, 'banana' as 1, and 'orange' as 2, following the order in which each value first appears. The Index contains the unique values 'apple', 'banana', and 'orange', so each code is the position of its label in the Index.
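Because each code is a position in the Index, the original labels can be recovered from the two return values. The following is a minimal sketch of that round trip; the variable name decoded is only illustrative.

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana'])
encoded, categories = pd.factorize(s)

# encoded is a NumPy array of integer codes, not a Series
print(type(encoded).__name__)   # ndarray
print(encoded.tolist())         # [0, 1, 0, 2, 1]
print(categories.tolist())      # ['apple', 'banana', 'orange']

# Look up each code in the Index to recover the original labels
decoded = categories.take(encoded)
print(decoded.tolist())         # ['apple', 'banana', 'apple', 'orange', 'banana']
```

This lookup assumes the input has no missing values; a missing value is encoded as -1, which take would treat as "last position" rather than as missing.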

The factorize method is commonly used to encode categorical data in a Pandas DataFrame.

For example, if you had a DataFrame with a column of categorical data, you could use the factorize method to encode the values in that column as numerical values, as shown in the following example:

main.py
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['apple', 'banana', 'apple', 'orange', 'banana'],
    'C': [100, 200, 300, 400, 500]
})

# Encode the values in column B as numerical values
df['B'], categories = pd.factorize(df['B'])

# Print the resulting DataFrame
print(df)

# Print the categories Index
print(categories)

In the code above, the factorize method is applied to the B column of the DataFrame, which encodes the values in that column as numerical values.

The assignment replaces the B column of the existing DataFrame with the integer codes (no new DataFrame is created), and categories holds an Index object containing the unique values from the original B column.

In this case, the two print calls produce the following output:

output
   A  B    C
0  1  0  100
1  2  1  200
2  3  0  300
3  4  2  400
4  5  1  500
Index(['apple', 'banana', 'orange'], dtype='object')
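Overwriting the column discards the original labels from the DataFrame. As a hedged variation on the example above, the sketch below writes the codes to a separate column (the name B_code is an assumption chosen for illustration) and also shows two documented behaviours of pd.factorize: sort=True orders the unique values before assigning codes, and missing values are encoded as -1 and left out of the unique values.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['banana', 'apple', None, 'banana'],
})

# Write the codes to a new column (hypothetical name) so the labels in B are kept.
# sort=True assigns codes in sorted order of the unique values.
df['B_code'], categories = pd.factorize(df['B'], sort=True)

print(df['B_code'].tolist())  # [1, 0, -1, 1]
print(categories.tolist())    # ['apple', 'banana']
```

Without sort=True, codes follow the order of first appearance, so 'banana' would be 0 and 'apple' would be 1 here.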
