Chapter 4_ Working With Dates And Categorical Data
I. Working with dates
1. Importing/Converting datetimes
When importing, use parse_dates. If the data is already loaded, use pd.to_datetime.
# Option A: At Import
rides = pd.read_csv('capital-onebike.csv',
parse_dates=['Start date', 'End date'])
# Option B: Manual Conversion
# Note: Ensure no spaces in the format string (e.g., "%H:%M:%S")
rides['Start date'] = pd.to_datetime(rides['Start date'],
format="%Y-%m-%d %H:%M:%S")
2. Operations
To access datetime properties, use the
.dtaccessor. Subtracting two datetimes creates a Timedelta.
# Calculate duration (Result is a Timedelta object)
rides['duration'] = rides['End date'] - rides['Start date']
# Convert Timedelta to a numeric value (seconds)
rides['duration'].dt.total_seconds().head(5)
# Extract specific parts
divorce["marriage_year"] = divorce["marriage_date"].dt.year
divorce["marriage_day_name"] = divorce["marriage_date"].dt.day_name() # e.g., "Sunday"
# Aggregations on Timedeltas
rides['duration'].sum()
rides['duration'].mean()
rides['duration'].sum() / timedelta(days=91)
# Check for anomalies (Negative duration usually implies a Daylight Savings shift)
negative_durations = rides[rides['duration'].dt.total_seconds() < 0]
3. Resampling Data .resample()
.resample()is like.groupby(), but specifically for time frequencies.
# 'D' = Day, 'W' = Week, 'ME' = Month End, 'YE' = Year End
rides.resample('ME', on='Start date')['duration'].mean()
"""
Output:
Start date
2017-10-31 1886.454
2017-11-30 854.175
2017-12-31 635.101
Freq: ME, Name: Duration seconds, dtype: float64
"""
4. Grouping and Resampling
You can combine standard grouping with time-based resampling to see trends across categories.
# Median duration per month, split by Member type
grouped = rides.groupby('Member type')\
.resample("ME", on="Start date")['duration']\
.median()
"""
Output:
Member type Start date
Casual 2017-10-31 1636.0
2017-11-30 1159.5
2017-12-31 850.0
Member 2017-10-31 671.0
2017-11-30 655.0
2017-12-31 387.5
Name: Duration, dtype: float64
"""
5. Timezones & Daylight Saving Time (DST)
By default, Python datetimes are naive (no offset). Moving them to aware requires
.dt.tz_localize().
Note: During DST “Fall Back” events, the same time occurs twice. This causes an
AmbiguousTimeError.
# rides['Start date'] contains timezone-naive datetimes
# (timestamps without timezone information)
# Add a timezone using tz_localize()
rides['Start date'].dt.tz_localize("America/New_York")
# Assign the localized result back to the column
rides['Start date'] = rides['Start date']\
.dt.tz_localize("America/New_York")
"""
Error:
AmbiguousTimeError → happens during DST change
when the same local time occurs twice.
"""
# Handle ambiguous timestamps
# ambiguous='NaT' → unclear times become NaT (Not a Time)
rides['Start date'] = rides['Start date']\
.dt.tz_localize("America/New_York", ambiguous='NaT')
rides['End date'] = rides['End date']\
.dt.tz_localize("America/New_York", ambiguous='NaT')
# NaT = missing/invalid datetime value
For more details, check the Naive datetime section in the Python datetime chapter.
6. plot results
Pandas handles the x-axis formatting automatically when you plot a Series with a datetime index.
# line plot by default:
rides\
.resample ('D', on = 'Start date')\
['Duration seconds']\
.mean()\
.plot()
II. Working with categorical data
1. What does it mean to be “categorical”?
Categorical
- Finite number of groups (or categories)
- These categories are usually fixed or known (eye color, hair color, etc.)
- Known as qualitative data
Numerical
- Known as quantitative data
- Expressed using a numerical value
- Is usually a measurement (height, weight, IQ, etc.)
2. Ordinal vs. nominal variables
Ordinal
- Categorical variables that have a natural order
- Strongly-Disagree(1) Disagree(2) Neutral(3) Agree(4) Strongly-Agree(5)
Nominal
- Categorical variables that cannot be placed into a natural order
- Blue Green Red Yellow Purple
3. dtypes: categorical
Default dtype:
adult["Marital Status"].dtype
"""
Output: (Object type)
dtype ('0')
"""
Set as categorical:
adult["Marital Status"] = adult["Marital Status"].astype("category") # this saves memory
adult["Marital Status"].dtype
"""
Output:
CategoricalDtype (categories=[' Divorced', ' Married-AF-spouse',
' Married-civ-spouse', ' Married-spouse-absent', ' Never-married',
' Separated', ' Widowed' ], ordered=False)
"""
Memory saving
adult = pd.read_csv("data/adult.csv")
adult["Marital Status"].nbytes # 260488
adult["Marital Status"] = adult["Marital Status"].astype("category")
adult["Marital Status"].nbytes # 32617
4. Creating a categorical Series
my_data = ["A", "A", "C", "B", "C", "A"]
my_series1 = pd.Series(my_data, dtype="category")
print(my_series1)
"""
Output:
0 A
1 A
2 C
...
dtype: category
Categories (3, object): [A, B, C]
"""
my_data = ["A", "A", "C", "B", "C", "A"]
# this way allows us to tell pandas that the categories have a logical order.
my_series2 = pd.Categorical(my_data, categories=["C", "B", "A"], ordered=True)
my_series2
"""
Output:
[A, A, C, B, C, A]
Categories (3, object): [C < B < A]
"""
4.1 Specify dtypes when reading data
# 1. Create a dictionary:
adult_dtypes = {"Marital Status": "category"}
# 2. Set the dtype parameter:
adult = pd.read_csv("data/adult.csv", dtype=adult_dtypes)
# 3. Check the dtype :
adult["Marital Status"].dtype
5. Setting category variables
5.1 The .cat accessor object
Series.cat.method_name
Common parameters:
- new_categories : a list of categories
- inplace, ordered
5.2 Set categories .cat.set_categories()
dogs["coat"] = dogs["coat"].cat.set_categories(
new_categories=["short", "medium", "long"]
)
- Values not found in
'new_categories'will be coerced to NaN.
Check value counts:
dogs["coat"].value_counts(dropna=False)
"""
Output:
short 1972
medium 565
NaN 220
Long 180
"""
Setting Order
dogs["coat"] = dogs["coat"].cat.set_categories(
new_categories=["short", "medium", "long"],
ordered=True
)
dogs ["coat"].head(3)
"""
Output:
0 short
1 short
2 short
Name: coat, dtype: category
Categories (3, object): ['short' < 'medium' < 'long' ]
"""
Equality (==, !=): Works on all categorical data.
Inequality (<, >, <=, >=): Works only if
ordered=True.*Arithmetic (+, -, , /): Never works directly; requires conversion to numerical codes first.
5.3 Missing categories
dogs["likes_people"].value_counts(dropna=False)
"""
yes 1991
NaN 938
no 8
"""
A NaN could mean:
- Truly unknown (we didn’t check)
- Not sure (dog likes “some” people)
5.4 Adding categories .cat.add_categories()
Add categories
dogs["Likes_people"] = dogs["likes_people"].astype("category")
dogs["likes_people"] = dogs["likes_people"].cat.add_categories(
new_categories=["did not check", "could not tell"]
)
- Does not change the value of any data in the DataFrame
- Categories not listed in this method are left alone
Check categories
dogs["likes_people"].cat.categories
# Output: Index(['no', 'yes', 'did not check' , 'could not tell'], dtype='object')
dogs["Likes_people"].value_counts(dropna=False)
"""
Output:
yes 1991
NaN 938
no 8
could not tell 0
did not check 0
"""
5.5 Removing categories .cat.remove_categories()
dogs ["coat"] = dogs["coat"].astype ("category")
dogs ["coat"] = dogs["coat"].cat.remove_categories(removals=["wirehaired"])
- Values matching categories listed are set to NaN
Check the categories
dogs["coat"].cat.categories
# Output: Index(['long', 'medium' , 'short'], dtype='object' )
6. Updating/collapsing categories
6.1 Renaming categories cat.rename_categories()
# Make a dictionary:
my_changes = {"Unknown Mix": "Unknown"}
# Rename the category:
dogs["breed"] = dogs["breed"].cat.rename_categories(my_changes)
Renaming categories with a function
# Update multiple categories:
dogs ['sex'] = dogs['sex'].cat.rename_categories( lambda c: c.title() )
Common replacement issues
- Must use new category names
# Does not work! "Unknown" already exists
use_new_categories = {"Unknown Mix": "Unknown"}
- Cannot collapse two categories into one
# Does not work! New names must be unique
cannot_repeat_categories = {
"Unknown Mix": "Unknown",
"Mixed Breed": "Unknown"
}
6.2 Collapsing categories .replace()
# Create a dictionary and use .replace :
update_colors = {
"black and brown": "black",
"black and tan": "black",
"black and white": "black",
}
dogs["main_color"] = dogs["color"].replace(update_colors)
# Check the Series data type:
dogs["main_color"].dtype # Output: dtype('O')
Convert back to categorical
dogs ["main_color"] = dogs["main_color"].astype("category")
dogs["main_color"].cat.categories
"""
Output:
Index(['apricot', 'black', 'brown', 'brown and white' , 'dotted' , 'golden' ,
'gray', 'gray and black', 'gray and white', 'red', 'red and white',
'sable', 'saddle back', 'spotty', 'striped', 'tricolor', 'white',
'wild boar', 'yellow', 'yellow-brown' ],
dtype='object')
"""
6.3 Reordering categories .cat.reorder_categories()
dogs ['coat' ] = dogs["coat"].cat.reorder_categories(
new_categories = ['short', 'medium', 'wirehaired', 'long' ],
ordered=True
)
# Using inplace:
dogs["coat"].cat.reorder_categories(
new_categories = ['short', 'medium', 'wirehaired', 'long' ],
ordered=True,
inplace=True
)
Note: Category sequence is preserved in GroupBy output for both
orderedandunorderedtypes;'ordered=True'only affects comparison logic.
7. Categorical pitfalls
Using categories can be frustrating
-
Using the .str accessor object to manipulate data converts the Series to an object.
-
The
'.apply()' / '.replace()' ..methods outputs a new Series as an object. -
The common methods of adding, removing, replacing, or setting categories do not all handle missing categories the same way.
-
NumPy functions generally do not work with categorical Series.
Solutions
7.1 Check and convert
# Check
used_cars["color"] = used_cars["color"].astype("category")
used_cars["color"] = used_cars["color"].str.upper()
print(used_cars["color"].dtype) # object
# Convert
used_cars["color"] = used_cars["color"].astype("category")
print(used_cars["color"].dtype) # category
7.2 Looking for missing values
Anytime you are updating categories, whether that is setting, adding, or removing, use value-counts to make sure the changes you made worked as intended.
used_cars["color"] = used_cars["color"].astype("category")
used_cars["color"].cat.set_categories(["black", "silver", "blue"], inplace=True)
used_cars["color"].value_counts(dropna=False)
"""
Output:
NaN 18172
black 7705
silver 6852
blue 5802
Name: color, dtype: int64
we see that over 18,000 entries have become NaN values. If this was not intended, we may need to use a different method for updating the categories.
"""
7.3 Using NumPy arrays
used_cars['number_of_photos' ] = used_cars['number_of_photos'].astype("category")
used_cars['number_of_photos' ].sum() # <--- Gives an Error
# TypeError: Categorical cannot perform the operation sum
used_cars['number_of_photos'].astype(int).sum()
Note:
# .str converts the column to an array
used_cars["color"].str.contains("red")
8. Label encoding
The basics:
- Codes each category as an integer from
Othroughn - 1, wherenis the number of categories. - A
-1code is reserved for any missing values - Can save on memory
- Often used in surveys
The drawback:
- Is not the best encoding method for machine learning (see next lesson)
8.1 Creating codes .cat.codes
Convert to categorical and sort by manufacturer name
used_cars['manufacturer_name'] = used_cars['manufacturer_name'].astype("category")
Use .cat.codes
# Generates integer codes for categorical data.
# If the category is ordered, codes follow that specific sequence;
# otherwise, they are assigned based on alphabetical order.
used_cars['manufacturer_code'] = used_cars['manufacturer_name'].cat.codes
Check Output
print(used_cars[['manufacturer_name', 'manufacturer_code']])
manufacturer_name manufacturer_code
0 Subaru 45
1 Subaru 45
2 Subaru 45
... ... ...
38526 Chrysler 8
38527 Chrysler 8
8.2 Code books / data dictionaries
label encoding is often used in surveys. The responses and their corresponding codes are often kept in a code book or a data dictionary. Consider this variable from the American Housing Survey, where a 1 represents YES and a 2 represents NO, for if a house was built in the last four years
Creating a code book
codes = used_cars['manufacturer_name'].cat.codes
categories = used_cars['manufacturer_name']
name_map = dict(zip(codes, categories))
print(name_map)
"""
Ouput:
{45: 'Subaru',
24: 'LADA',
12: 'Dodge',
...
}
"""
Creating the codes
used_cars['manufacturer_code'] = used_cars['manufacturer_name'].cat.codes
# Reverting to previous values:
used_cars['manufacturer_code'].map(name_map)
8.3 Boolean coding
Create a boolean coding:
used_cars["van_code"] = np.where(
used_cars["body_type"].str.contains("van", regex=False), 1, 0)
used_cars["van_code"].value_counts()
"""
Output:
0 34115
1 4416
Name: van_code, dtype: int64
"""
9. One-Hot Encoding
Using Label Encoding can cause models to assume a mathematical relationship (e.g., $3 > 0$) where none exists. To prevent the model from thinking “Gasoline” is “greater than” “Diesel,” we use One-Hot Encoding to create binary columns for each category.
9.1 Basic Implementation
import pandas as pd
# 1. Convert ALL categorical variables in the dataframe:
# This automatically identifies 'object' or 'category' types.
used_cars_oneHot = pd.get_dummies(used_cars)
# 2. Choosing specific columns:
# Note: Usually, you pass the whole DF and specify 'columns' to keep other data intact.
used_cars_oneHot = pd.get_dummies(used_cars, columns=['odometer_values', 'color'])
# 3. Customizing prefixes:
# If you set prefix="", you get "_black", "_blue".
# If you want NO underscore at all, you must specify prefix_sep="".
used_cars_oneHot = pd.get_dummies(used_cars, columns=['color'], prefix="cat", prefix_sep="_")
9.2 The Dummy Variable Trap
df_encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
The problem: This refers to Multicollinearity. If you have two categories, “Male” and “Female,” and you know someone is NOT Male (0), then they MUST be Female (1). Including both columns provides redundant information. In mathematical terms, one column can be perfectly predicted from the others, which can cause issues for certain models like Linear Regression (it makes the inversion of the feature matrix impossible or unstable).
9.2 Key Reminders
- NaN Values: By default,
pd.get_dummiesignoresNaN. If a row isNaN, all resulting dummy columns for that feature will be0. If you want a specific column for missing data, usedummy_na=True. - Curse of Dimensionality: Using one-hot on a column with 1,000 unique values will add 1,000 new columns, which may lead to overfitting.
10. Etc
10.1 Cross-Tabulation pd.crosstab()
A Crosstab (or contingency table) is a powerful tool used to identify how observations occur in combination across multiple categorical variables. Instead of looking at a single column’s frequency, a crosstab displays the intersection of categories, helping you uncover hidden relationships, dependencies, or data imbalances.
Using pd.crosstab(), you can analyze simple frequencies or even aggregate numeric values (like salaries) across categories.
import pandas as pd
# 1. Simple Frequency Count
# Relationship between Company Size and Experience level
print(pd.crosstab(salaries["Company_Size"], salaries["Experience"]))
# 2. Crosstab with Aggregation
# Calculate the average Salary_USD for each Job Category within each Company Size
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"],
values=salaries["Salary_USD"], aggfunc="mean"))
"""
Output 1:
Experience EN EX MI SE
Company_Size
L 24 7 49 44
M 25 9 58 136
S 18 1 21 15
Output 2:
Company_Size L M S
Job_Category
Data Analytics 112851.749 95912.685 53741.877
Data Engineering 118939.035 121287.061 86927.136
Data Science 96489.520 116044.456 62241.749
Machine Learning 140779.492 100794.237 78812.586
Managerial 190551.449 150713.628 31484.700
Other 92873.911 89750.579 69871.248
"""