The factorize function in Pandas encodes categorical data as integer codes.
It takes a one-dimensional sequence of values, such as a Series, and returns a tuple with two elements: a NumPy array of integer codes (one per input value), and an Index object containing the unique values in the original Series, in order of first appearance.
For example, consider the following Series object:
import pandas as pd
s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana'])
This Series has five elements but only three distinct values: 'apple', 'banana', and 'orange'.
To encode these values as integers, you could do the following:
import pandas as pd
s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana'])
# Encode the values as numerical values
encoded, categories = pd.factorize(s)
# Print the encoded values (a NumPy array of integer codes)
print(encoded)
# Print the categories Index
print(categories)
In the code above, pd.factorize is applied to the Series and encodes each value as an integer code. The result is a tuple containing two elements: a NumPy array of the encoded values, and an Index object containing the unique values in the original Series.
In this case, the array of encoded values is [0, 1, 0, 2, 1], which indicates that 'apple' is encoded as 0, 'banana' as 1, and 'orange' as 2. The resulting Index contains the unique values in the original Series: 'apple', 'banana', and 'orange'.
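Because each code is a position in the Index of unique values, the encoding can be reversed. The following is a minimal sketch of that round trip, reusing the Series from above; the variable name decoded is chosen here for illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana'])
encoded, categories = pd.factorize(s)

# The codes come back as a NumPy array, the unique values as an Index
print(type(encoded))
print(type(categories))

# Each code is a position in the Index, so looking the codes up
# in the Index recovers the original labels
decoded = categories.take(encoded)
print(list(decoded))
```

This only works as written when the Series has no missing values; see the note on missing values further down if your data contains them.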
The factorize function is commonly used to encode categorical columns in a Pandas DataFrame.
For example, if you had a DataFrame with a column of categorical data, you could use factorize to encode the values in that column as integers, as shown in the following example:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['apple', 'banana', 'apple', 'orange', 'banana'],
    'C': [100, 200, 300, 400, 500]
})
# Encode the values in column B as numerical values
df['B'], categories = pd.factorize(df['B'])
# Print the resulting DataFrame
print(df)
# Print the categories Index
print(categories)
In the code above, factorize is applied to the B column of the DataFrame. The integer codes are assigned back to df['B'], which replaces the original strings in the existing DataFrame (no new DataFrame is created), and the unique values from the original B column are returned as an Index object.
In this case, the output is:
   A  B    C
0  1  0  100
1  2  1  200
2  3  0  300
3  4  2  400
4  5  1  500
Index(['apple', 'banana', 'orange'], dtype='object')
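Overwriting column B discards the original strings. If you would rather keep them, one option is to store the codes in a separate column. The sketch below does that, and also illustrates two behaviors of factorize worth knowing: missing values receive the code -1 by default, and sort=True orders the unique values before codes are assigned. The column name B_code and the sample data are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['banana', 'apple', None, 'banana'],
})

# sort=True sorts the unique values, so 'apple' gets 0 and 'banana' gets 1
# regardless of which one appears first in the column
codes, categories = pd.factorize(df['B'], sort=True)

# Keep the original column and store the codes next to it.
# 'B_code' is a hypothetical column name used only in this sketch.
df['B_code'] = codes

# The missing value in row 2 is encoded as -1 and is not listed in categories
print(df)
print(categories)
```

Since -1 is not a valid position in the Index, rows with that code should be filtered out or handled separately before mapping codes back to labels.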