Label encoding is the process of converting a categorical variable into integer codes, with one integer assigned to each distinct category. In Python, we can perform label encoding using the LabelEncoder class from the scikit-learn library.
Here is an example of how to perform label encoding in Python:
    # Importing the necessary libraries
    from sklearn.preprocessing import LabelEncoder

    # Creating a list of categorical variables
    categories = ['blue', 'red', 'green', 'yellow', 'blue', 'green', 'red']

    # Creating an instance of LabelEncoder
    label_encoder = LabelEncoder()

    # Fitting the label encoder to the categorical data
    label_encoder.fit(categories)

    # Transforming the categorical data into numerical data
    encoded_categories = label_encoder.transform(categories)

    # Printing the encoded categories
    print(encoded_categories)
Output:
[0 2 1 3 0 1 2]
In the above example, we created a list of categorical variables called categories. We then created an instance of LabelEncoder() and fit it to the categories list using the fit() method. Next, we transformed the categories list into numerical data using the transform() method of the LabelEncoder() object and stored the result in a variable called encoded_categories. Finally, we printed encoded_categories to display the resulting numerical encoding of the original categorical data.
Note that LabelEncoder assigns codes according to the sorted order of the unique labels, which for strings is alphabetical: blue is assigned 0, green is assigned 1, red is assigned 2, and yellow is assigned 3.
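To inspect the mapping directly rather than infer it from the output, the fitted encoder can be queried. The following is a minimal sketch that reuses the same list; the variable names mapping and decoded_categories are illustrative and not part of the original example.

```python
from sklearn.preprocessing import LabelEncoder

categories = ['blue', 'red', 'green', 'yellow', 'blue', 'green', 'red']

label_encoder = LabelEncoder()
# fit_transform() combines fit() and transform() in a single call
encoded_categories = label_encoder.fit_transform(categories)

# classes_ holds the sorted unique labels; each label's position is its code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2, 'yellow': 3}

# inverse_transform() maps the integer codes back to the original labels
decoded_categories = label_encoder.inverse_transform(encoded_categories)
print(decoded_categories.tolist())
```

Being able to round-trip the codes is useful when a model predicts integers that need to be reported as category names.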
Understanding Nominal Scale:
Nominal Scale is a measurement scale that classifies variables into two or more categories, where the categories are mutually exclusive and unordered. In other words, variables that are measured on a nominal scale cannot be ranked or ordered. Examples of variables measured on a nominal scale include gender, ethnicity, eye color, religion, and country of origin.
The key features of a nominal scale are:
- The categories are mutually exclusive: Each observation can only belong to one category, and the categories cannot overlap.
- The categories are unordered: The categories have no natural order or ranking. For example, when categorizing eye color as blue, brown, or green, there is no inherent order to these categories.
- The categories are qualitative: Nominal variables are qualitative rather than quantitative. This means that we cannot perform mathematical operations such as addition or subtraction on them.
In statistical analysis, nominal data is typically analyzed using frequency tables or cross-tabulations, which display the number or percentage of observations in each category. Nominal data can also be encoded as numerical values using techniques such as label encoding or one-hot encoding, but this does not imply any ordering or ranking of the categories.
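As a rough illustration of the two ideas above, the sketch below builds a frequency table and a one-hot encoding with pandas. The eye_color sample is made up for this example.

```python
import pandas as pd

# Hypothetical sample of a nominal variable
eye_color = pd.Series(
    ['blue', 'brown', 'green', 'brown', 'blue', 'brown'], name='eye_color'
)

# Frequency table: number of observations in each category
counts = eye_color.value_counts()
print(counts)

# One-hot encoding: one indicator column per category, so no order is implied
one_hot = pd.get_dummies(eye_color, dtype=int)
print(one_hot)
```

One-hot encoding is often preferred for nominal features because integer codes from label encoding can be read by some models as if larger codes meant "more" of something.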
Understanding Ordinal Scale:
Ordinal Scale is a measurement scale used to rank or order the observations of a variable based on a particular attribute or characteristic. In other words, ordinal data are measurements that can be ranked or ordered, but the distance between the categories is not known or not meaningful. Examples of variables measured on an ordinal scale include socioeconomic status, education level, and satisfaction ratings.
The key features of an ordinal scale are:
- The categories are ordered: The categories have a natural order or ranking. For example, when categorizing education level as high school, college, or graduate school, there is a clear order to these categories.
- The categories are not equally spaced: The distance or interval between categories is not known or meaningful. For example, the difference in satisfaction level between “very satisfied” and “somewhat satisfied” may not be the same as the difference between “somewhat satisfied” and “neutral”.
- The categories are qualitative: Ordinal variables are qualitative rather than quantitative. This means that we cannot perform mathematical operations such as addition or subtraction on them.
In statistical analysis, ordinal data is typically analyzed using non-parametric tests such as the Mann-Whitney U test or the Kruskal-Wallis test. Ordinal data can also be encoded as numerical values using techniques such as label encoding or assigning numerical scores to the categories, but care must be taken to ensure that the ordering is preserved and that the distance between the categories is not overstated.
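One way to preserve the ordering when encoding is to state it explicitly instead of relying on alphabetical order. The sketch below uses an ordered pandas Categorical; the education sample and the levels list are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample of an ordinal variable
education = pd.Series(['college', 'high school', 'graduate school', 'college'])

# Explicit order from lowest to highest (an assumption for this sketch)
levels = ['high school', 'college', 'graduate school']

# An ordered Categorical assigns codes by position in the levels list
ordered_education = pd.Categorical(education, categories=levels, ordered=True)
ordinal_codes = pd.Series(ordered_education.codes)
print(ordinal_codes.tolist())  # [1, 0, 2, 1]

# For contrast: default alphabetical codes do not follow the real ranking
alphabetical_codes = education.astype('category').cat.codes
print(alphabetical_codes.tolist())  # [0, 2, 1, 0]
```

The codes 0, 1, 2 record rank only. The gap between adjacent codes should not be read as an equal distance between categories.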
Label Encoding using Category codes:
In Python, the pandas library provides a convenient way to perform label encoding using category codes. The cat.codes attribute of a categorical pandas Series returns the integer code of each value.
Here is an example of how to perform label encoding using category codes in Python:
    # Importing the necessary libraries
    import pandas as pd

    # Creating a Pandas Series of categorical variables
    categories = pd.Series(['blue', 'red', 'green', 'yellow', 'blue', 'green', 'red'])

    # Encoding the categorical variables using category codes
    encoded_categories = categories.astype('category').cat.codes

    # Printing the encoded categories
    print(encoded_categories)
Output:
    0    0
    1    2
    2    1
    3    3
    4    0
    5    1
    6    2
    dtype: int8
In the above example, we created a pandas Series of categorical variables called categories. We then converted the Series to the categorical data type using the astype() method and read its cat.codes attribute to obtain the numerical encoding. Finally, we printed the resulting encoded categories. The left column of the output is the Series index and the right column is the code.
Note that the resulting codes are the same as in the previous example using scikit-learn's LabelEncoder, because pandas also sorts the inferred categories. The difference is that the encoding takes a single expression here and returns a pandas Series rather than a NumPy array.
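Two details of category codes are worth seeing in code: the category-to-code mapping can be read from cat.categories, and missing values receive the code -1. The sketch below uses a small made-up Series containing one missing value.

```python
import pandas as pd

# Hypothetical data with one missing value
categories = pd.Series(['blue', 'red', None, 'green', 'blue'])
as_category = categories.astype('category')

# cat.categories lists the categories; each one's position is its code
mapping = {label: code for code, label in enumerate(as_category.cat.categories)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# Missing values are not a category and are encoded as -1
codes = as_category.cat.codes
print(codes.tolist())  # [0, 2, -1, 1, 0]
```

The -1 code needs handling before the data reaches a model, since it would otherwise look like just another category value.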