Predict Income with UCI Adult Dataset: Machine Learning Tutorial

Introduction

Machine learning is everywhere—from Netflix recommendations to AI-powered job screening. One classic dataset that teaches the fundamentals of binary classification is the UCI Adult dataset, also known as the Census Income dataset. In this tutorial, you'll learn how to select an appropriate model, train it on this dataset, and make predictions. Whether you're a student tackling a new assignment or a budding data scientist, this guide will help you understand the process end-to-end.

Think of it like scouting players for a fantasy sports team: you have historical stats (age, education, occupation) and you want to predict if a player will be a "high performer" (income >50K). By the end, you'll have a model that can classify new individuals.

Understanding the Dataset

The UCI Adult dataset contains 48,842 instances with 14 features, distributed as a training file (adult.data, 32,561 rows) and a test file (adult.test, 16,281 rows). This tutorial loads only adult.data. The target variable is income, with two classes: >50K and <=50K. Features include numeric attributes such as age, hours-per-week, and capital-gain, and categorical ones such as workclass, education, marital-status, occupation, and sex.

This dataset is perfect for practicing data preprocessing, feature engineering, and model selection. It's also a great way to understand bias in AI—for example, historical income disparities may be reflected in the data.
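
One quick way to see such a disparity is to compare the share of >50K labels across groups. The sketch below uses a handful of made-up rows (the `sample` frame is hypothetical, not real census records) only to show the groupby pattern; on the real data you would run the same expression on the loaded DataFrame.

```python
import pandas as pd

# Hypothetical toy rows, not real census records.
sample = pd.DataFrame({
    'sex':    ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female'],
    'income': ['>50K', '>50K', '<=50K', '<=50K', '>50K', '<=50K', '<=50K', '<=50K'],
})

# Share of rows labelled >50K within each group.
high_income_rate = (sample['income'] == '>50K').groupby(sample['sex']).mean()
print(high_income_rate)
```

A large gap between groups in the labels themselves is a signal that a model trained on them may reproduce that gap.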

Step 1: Load and Explore the Data

First, import necessary libraries and load the dataset. You can download it from the UCI repository or use pandas to read it directly.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age','workclass','fnlwgt','education','education-num','marital-status',
           'occupation','relationship','race','sex','capital-gain','capital-loss',
           'hours-per-week','native-country','income']
# skipinitialspace=True strips the space after each comma, so missing values appear as '?'
df = pd.read_csv(url, header=None, names=columns, na_values='?', skipinitialspace=True)

Check for missing values and basic statistics:

print(df.isnull().sum())
print(df.describe())

You'll notice missing values in workclass, occupation, and native-country. Decide how to handle them—drop rows or impute with the mode.
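
The two options can be compared on a small example. This is a minimal sketch on a hypothetical `toy` frame (the values are invented for illustration); in a real project the mode would ideally be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a few rows of the Adult data.
toy = pd.DataFrame({
    'age': [39, 50, 38, 53, 28],
    'workclass': ['Private', np.nan, 'Private', 'State-gov', np.nan],
    'occupation': ['Sales', 'Sales', np.nan, 'Tech-support', 'Sales'],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each categorical column with its most frequent value.
imputed = toy.copy()
for col in ['workclass', 'occupation']:
    mode_value = imputed[col].mode()[0]  # mode() skips NaN by default
    imputed[col] = imputed[col].fillna(mode_value)

print(len(dropped), 'rows left after dropping')
print(imputed)
```

Dropping is simpler but discards whole rows; imputing keeps them at the cost of inflating the most common category.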

Step 2: Data Preprocessing

Clean the data: drop missing values (or impute), encode categorical variables, and scale numeric features.

df.dropna(inplace=True)

# Encode target variable
# Strip whitespace first so the comparison works whether or not the loader removed it
df['income'] = (df['income'].str.strip() == '>50K').astype(int)

# Separate features and target
X = df.drop('income', axis=1)
y = df['income']

# Identify categorical and numeric columns
cat_cols = X.select_dtypes(include=['object']).columns
num_cols = X.select_dtypes(include=['int64', 'float64']).columns

# Encode categoricals, keeping one fitted encoder per column for reuse on new data
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

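Now split into training and testing sets (e.g., 80/20), then scale the numeric features. Fitting the scaler on the training split only keeps test-set statistics from leaking into training.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# Scale numeric features: fit on the training split, then apply to both splits
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])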
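
Integer codes from LabelEncoder impose an arbitrary order on categories. A common alternative, sketched here as an option rather than as this tutorial's method, is to one-hot encode the categorical columns inside a ColumnTransformer. The `toy_X` and `toy_y` data below are hypothetical and use only three of the dataset's columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows; the real features would come from df.drop('income', axis=1).
toy_X = pd.DataFrame({
    'age': [25, 38, 28, 44, 18, 34, 29, 63],
    'hours-per-week': [40, 50, 40, 40, 30, 30, 40, 32],
    'workclass': ['Private', 'Private', 'Local-gov', 'Private',
                  'Private', 'State-gov', 'Private', 'Self-emp-not-inc'],
})
toy_y = pd.Series([0, 0, 1, 1, 0, 0, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=0)

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'hours-per-week']),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['workclass']),
])

X_tr_enc = preprocess.fit_transform(X_tr)  # learn means, scales, and categories
X_te_enc = preprocess.transform(X_te)      # reuse them on the held-out rows
print(X_tr_enc.shape, X_te_enc.shape)
```

Because the transformer is fitted once and then reused, the training and test matrices always have the same columns.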

Step 3: Model Selection

For binary classification, several models work well. We'll compare three popular ones: Logistic Regression, Random Forest, and XGBoost. Think of it like choosing a gaming strategy: Logistic Regression is a simple, fast baseline (like a quick attack), Random Forest is robust and handles non-linearity (like a balanced team), and XGBoost is powerful but requires tuning (like a pro-level combo).

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss')
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f'{name}: Accuracy = {acc:.4f}')

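Typically, Random Forest and XGBoost outperform Logistic Regression on this dataset, since tree ensembles capture non-linear effects and feature interactions that a linear model misses unless they are engineered by hand.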
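
Accuracy alone can flatter a model here, because roughly three quarters of the rows in adult.data belong to the <=50K class. A sketch of a sanity check is to compare against a majority-class baseline and to report F1 alongside accuracy. The data below is synthetic (generated with make_classification to mimic that imbalance), so the printed numbers say nothing about the Adult dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly a 76/24 class split (not the Adult data).
X_syn, y_syn = make_classification(n_samples=2000, n_features=10, n_informative=5,
                                   weights=[0.76, 0.24], random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0)

candidates = {
    'Majority baseline': DummyClassifier(strategy='most_frequent'),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}
scores = {}
for name, clf in candidates.items():
    clf.fit(Xs_train, ys_train)
    pred = clf.predict(Xs_test)
    scores[name] = (accuracy_score(ys_test, pred),
                    f1_score(ys_test, pred, zero_division=0))
    print(f'{name}: accuracy={scores[name][0]:.3f}, F1={scores[name][1]:.3f}')
```

The baseline never predicts the positive class, so its F1 is zero even though its accuracy looks respectable; a useful model has to beat it on both.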

Step 4: Hyperparameter Tuning

To improve performance, tune hyperparameters using GridSearchCV. For example, for Random Forest:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}
rf = RandomForestClassifier(random_state=42)  # fixed seed so runs are comparable
grid = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print('Best params:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)
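
Grid search cost grows multiplicatively with each parameter list, so it helps to count the fits before launching. The sketch below repeats the grid above and counts its combinations with scikit-learn's ParameterGrid; the extra fit assumes GridSearchCV's default refit=True.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
}
n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 2 combinations
cv_folds = 5
total_fits = n_candidates * cv_folds + 1       # +1 for the final refit on the training set
print(n_candidates, 'candidates,', total_fits, 'model fits in total')
```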

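After tuning, evaluate the best estimator once on the held-out test set. With its default settings, GridSearchCV has already refit that estimator on the full training split.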

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
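
The precision and recall columns in that report come straight from the confusion matrix. As a worked example on hypothetical labels (the `y_true` and `y_hat` arrays are invented for illustration), the positive-class figures can be reproduced by hand:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (1 means >50K).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
precision = tp / (tp + fp)  # of the predicted >50K, how many were right
recall = tp / (tp + fn)     # of the actual >50K, how many were found
accuracy = (tp + tn) / len(y_true)

print(f'precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}')
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

Here the model is right 70% of the time overall yet finds only half of the high earners, which is the kind of gap the report makes visible.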

Step 5: Making Predictions on New Data

Once you have a trained model, you can predict income for new individuals. For example:

new_data = pd.DataFrame([{
    'age': 35,
    'workclass': 'Private',
    'fnlwgt': 200000,
    'education': 'Bachelors',
    'education-num': 13,
    'marital-status': 'Never-married',
    'occupation': 'Prof-specialty',
    'relationship': 'Not-in-family',
    'race': 'White',
    'sex': 'Male',
    'capital-gain': 0,
    'capital-loss': 0,
    'hours-per-week': 40,
    'native-country': 'United-States'
}])
# Preprocess new_data (encode, scale) with the transformers fitted during training,
# and store the result in new_data_processed
# ...
prediction = best_model.predict(new_data_processed)
print('Predicted income class:', '>50K' if prediction[0] == 1 else '<=50K')

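Remember to apply the same preprocessing steps (encoding and scaling) using the transformers fitted on the training data. Call transform, not fit_transform, on new rows; refitting would map categories and scales differently from what the model saw during training.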
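
One way to keep that discipline is a small helper that fits the transformers only when asked. The following is a sketch under stated assumptions, not the tutorial's exact pipeline: the `train` frame is hypothetical, it uses a subset of columns, the `preprocess_rows` helper is an illustrative name, and it swaps in OrdinalEncoder (which can map unseen categories to a sentinel value) in place of LabelEncoder.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy training rows with a subset of the Adult columns.
train = pd.DataFrame({
    'age': [25, 52, 31, 45, 23, 60, 38, 41],
    'hours-per-week': [40, 45, 50, 60, 20, 40, 40, 55],
    'education': ['HS-grad', 'Masters', 'Bachelors', 'Doctorate',
                  'HS-grad', 'Bachelors', 'HS-grad', 'Masters'],
    'income': [0, 1, 1, 1, 0, 1, 0, 1],
})
num_features = ['age', 'hours-per-week']
cat_features = ['education']

# Unseen categories are encoded as -1 instead of raising an error.
fitted_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
fitted_scaler = StandardScaler()

def preprocess_rows(frame, fit=False):
    """Encode and scale raw rows; fit the transformers only when fit=True."""
    if fit:
        nums = fitted_scaler.fit_transform(frame[num_features])
        cats = fitted_encoder.fit_transform(frame[cat_features])
    else:
        nums = fitted_scaler.transform(frame[num_features])
        cats = fitted_encoder.transform(frame[cat_features])
    return pd.DataFrame(np.hstack([nums, cats]),
                        columns=num_features + cat_features, index=frame.index)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(preprocess_rows(train, fit=True), train['income'])

# 'Assoc-voc' was never seen during training, so it is encoded as -1.
new_row = pd.DataFrame([{'age': 35, 'hours-per-week': 40, 'education': 'Assoc-voc'}])
processed_new = preprocess_rows(new_row)
pred = clf.predict(processed_new)
print('Predicted income class:', '>50K' if pred[0] == 1 else '<=50K')
```

Wrapping the same steps in a scikit-learn Pipeline would achieve this with less bookkeeping; the explicit helper is used here only to make the fit/transform distinction visible.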

Conclusion

In this tutorial, you learned how to train a classification model on the UCI Adult dataset to predict income. This pipeline—data exploration, preprocessing, model selection, tuning, and prediction—is applicable to many real-world classification problems. Whether you're analyzing census data, building a credit scoring model, or even predicting customer churn, these skills are foundational.

As AI continues to shape our world, understanding how models make decisions is crucial. The Adult dataset also reminds us to be mindful of bias: historical inequalities can be encoded in data, so always evaluate fairness. Happy modeling!