Feb 25, 2026

Chapter 0: Introduction to pandas

Achraf Salimi

I. Introduction

1. Pandas, NumPy, and Matplotlib — How They Fit Together

Pandas builds on NumPy and integrates with Matplotlib:

NumPy provides fast, multidimensional array objects (ndarray), which pandas uses to store and manipulate data.
Matplotlib provides the plotting layer that pandas uses for data visualization.
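
To make the NumPy relationship concrete, here is a minimal sketch. The `prices` Series and its values are hypothetical, chosen only for illustration; it shows that the values of a numeric Series can be exposed as an ndarray and that NumPy functions accept a Series directly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, used only for illustration.
prices = pd.Series([1.5, 2.0, 3.25])

# The values of a numeric Series can be exposed as a NumPy ndarray.
values = prices.to_numpy()
print(type(values))   # <class 'numpy.ndarray'>
print(values.dtype)   # float64

# NumPy functions accept a Series directly and return a Series.
roots = np.sqrt(prices)
print(type(roots).__name__)  # Series
```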

2. Rectangular Data in Pandas

In Pandas, rectangular (tabular) data is represented by a DataFrame object.

This concept exists in nearly every data analysis ecosystem:

Pandas (Python): DataFrame
R: data.frame
SQL: database tables

3. Why Use a DataFrame If Pandas Uses NumPy Under the Hood?

Question:
If Pandas uses NumPy ndarrays internally, what is the real utility of a DataFrame?

3.1 ✅ Answer

A NumPy array is homogeneous: all of its elements share one data type, which is what makes it fast. A pandas DataFrame, by contrast, is heterogeneous.

A DataFrame acts as a high-level container for multiple NumPy arrays, allowing:

Each column to hold its own data type.
Columns of different types to sit side by side in a single table.
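
The contrast can be sketched as follows. The names, ages, and weights are hypothetical sample data; the point is only to compare what a single NumPy array and a DataFrame do with mixed types.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
names = ["Rex", "Milo", "Luna"]
ages = [3, 5, 2]
weights = [30.5, 8.2, 22.0]

# One NumPy array must pick a single dtype, so mixing strings and
# numbers turns every element into a string.
arr = np.array([names, ages, weights])
print(arr.dtype.kind)  # U (Unicode string)

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({"name": names, "age": ages, "weight_kg": weights})
print(df["age"].dtype)        # int64
print(df["weight_kg"].dtype)  # float64
```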

3.2 Beyond Data Types

DataFrames also provide rich metadata and high-level features, including:

Column labels and row indexes
SQL-like operations (joins, merges, group-bys)
Built-in missing data handling (NaN)

All of this would be extremely difficult and error-prone to implement using raw NumPy arrays alone.
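
A short sketch of these features in action, assuming two small hypothetical tables (`dogs` and `owners`) invented for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical sample tables.
dogs = pd.DataFrame({
    "name": ["Rex", "Milo", "Luna"],
    "breed": ["Labrador", "Beagle", "Labrador"],
    "age": [3.0, np.nan, 2.0],  # one missing value
})
owners = pd.DataFrame({
    "name": ["Rex", "Luna"],
    "owner": ["Sam", "Ana"],
})

# Label-based selection: row label 0, column label "breed".
print(dogs.loc[0, "breed"])  # Labrador

# SQL-like left join on the shared "name" column.
# Milo has no match in `owners`, so his owner is filled with a missing value.
merged = dogs.merge(owners, on="name", how="left")

# Group-by aggregation; mean() skips NaN by default.
mean_age = dogs.groupby("breed")["age"].mean()
print(mean_age["Labrador"])  # 2.5
```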

4. Pandas Series

A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.

4.1 Create a simple Pandas Series from a list:

import pandas as pd

a = [1, 7, 2]

# If nothing else is specified, the values are labeled with their position, starting at 0 (0, 1, ..., n-1).
myvar = pd.Series(a)

print(myvar)
"""
Output:
    0    1
    1    7
    2    2
    dtype: int64
"""
print(myvar[0]) # 1



# Create Labels
myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)
"""
Output:
    x    1
    y    7
    z    2
    dtype: int64
"""
print(myvar["y"]) # 7

4.2 Key/Value Objects as Series

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)
"""
Output:
    day1    420
    day2    380
    day3    390
    dtype: int64
"""



# To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
"""
Output:
    day1    420
    day2    380
    dtype: int64
"""

Note: The keys of the dictionary become the labels.
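
One related behavior worth knowing, sketched here with a hypothetical label "day4" that is not a key in the dictionary: a requested label with no matching key produces a missing value, and the Series becomes float64 so that it can hold NaN.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that does not exist in the dictionary.
partial = pd.Series(calories, index=["day1", "day4"])

print(partial)
"""
Output:
    day1    420.0
    day4      NaN
    dtype: float64
"""
```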

II. Note

1. Difference between `np.logical_and()`, `np.logical_or()`, `np.logical_and.reduce()`, `np.logical_or.reduce()`, and `&` / `|`

import numpy as np
import pandas as pd

# `df` is assumed to be an existing DataFrame with numeric columns 'A', 'B', and 'C'.

# ⚠️ WARNING: Never use 'and' or 'or' with DataFrames/Series.
# They need a single True/False rather than an array of them, so pandas raises
# "ValueError: The truth value of a Series is ambiguous".
# Use the element-wise operators (&, |, ~) or NumPy logical functions instead.

# 1. THE BITWISE WAY (Standard Pandas)
# & is the vectorized equivalent of 'and'
# | is the vectorized equivalent of 'or'
# ~ is the vectorized equivalent of 'not'
mask = (df['A'] > 5) & (df['B'] < 10) 

# 2. THE NUMPY EQUIVALENTS (Functional)
# np.logical_and()  <==>  &
# np.logical_or()   <==>  |
# np.logical_not()  <==>  ~
mask = np.logical_and(df['A'] > 5, df['B'] < 10)

# 3. SCALING (Best for 3+ conditions)
conds = [df['A'] > 5, df['B'] < 10, df['C'] == 0]
mask = np.logical_and.reduce(conds)

df[mask]


import numpy as np
import pandas as pd

# `df` is assumed to be an existing DataFrame with columns 'A', 'B', 'C', and 'D'.

# 1. THE STANDARD (Best for 2-3 conditions)
# Note: Parentheses are MANDATORY because & binds more tightly than comparisons such as > and <
mask = (df['A'] > 5) & (df['B'] < 10) & (df['C'] == 0)

# 2. THE LIMITATION (Binary only)
# np.logical_and(cond1, cond2, cond3)  # ❌ Combines only 2 inputs; the third positional argument is treated as `out`, not as a condition
mask = np.logical_and(df['A'] > 5, df['B'] < 10)  # ✅ Works

# 3. THE PRO SCALABLE WAY (Best for long lists)
# Works like a loop: cond1 AND cond2 AND cond3...
cond_list = [
    df['A'] > 5, 
    df['B'] < 10, 
    df['C'] == 0, 
    df['D'] != 'Pending'
]
mask = np.logical_and.reduce(cond_list)

df[mask]
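
The snippets above can be tried end to end with a small table. This is a minimal sketch: the DataFrame and its values are hypothetical, and it only illustrates that `and` raises an error while `&` and `np.logical_and.reduce()` produce the same mask.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "A": [6, 2, 9, 7],
    "B": [3, 4, 12, 8],
    "C": [0, 0, 0, 1],
})

# Python's `and` asks for one truth value for the whole Series,
# which pandas refuses to guess.
error_name = None
try:
    (df["A"] > 5) and (df["B"] < 10)
except ValueError as exc:
    error_name = type(exc).__name__
print(error_name)  # ValueError

conds = [df["A"] > 5, df["B"] < 10, df["C"] == 0]

# Chained & and reduce() give the same element-wise result.
chained = conds[0] & conds[1] & conds[2]
reduced = np.logical_and.reduce(conds)  # a NumPy boolean array

print(reduced.tolist())  # [True, False, False, False]
print(len(df[reduced]))  # 1
```

Note that `np.logical_and.reduce()` returns a plain NumPy array rather than a Series, which is still accepted as a boolean mask by `df[...]`.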

2. Usage of `.any()` and `.all()` for `DataFrames` and `Series`

On a Series (one column), each method returns a single True/False.

.any(): Is at least one value True?
.all(): Are all values True?

# Returns True if there is at least one missing value in 'age'
df["age"].isna().any()

On a DataFrame (full table), each method returns a Series of True/False values (by default, one per column). In effect, it reduces the data by one dimension.

# Returns a Series showing which COLUMNS have at least one NaN
df.isna().any()

# Output looks like:
# name     False
# age       True
# breed     True
# dtype: bool


# Returns a bool answering the question: "Does my entire dataset have ANY missing values anywhere?"
df.isna().any().any() 

# Translation: 
# 1. First .any() -> Checks each column (Returns a Series)
# 2. Second .any() -> Checks that Series (Returns a single Boolean)
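
The examples above focus on `.any()`. As a complement, here is a small sketch of `.all()` and of the `axis=1` option, which reduces across columns and gives one result per row. The DataFrame is hypothetical sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Rex", "Milo", "Luna"],
    "age": [3.0, np.nan, 2.0],
    "breed": ["Labrador", None, "Poodle"],
})

# .all() on a Series: is every value present?
print(df["name"].notna().all())  # True
print(df["age"].notna().all())   # False

# axis=1 checks each ROW instead of each column.
rows_with_nan = df.isna().any(axis=1)
print(rows_with_nan.tolist())  # [False, True, False]

# Keep only the rows that have no missing values.
complete_rows = df[~rows_with_nan]
print(len(complete_rows))  # 2
```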