• K-Means

    It's considered one of the most classical machine learning models for data clustering and is widely used due to its simplicity

  • Principal Components Analysis

    It's one of the most famous models used for dimensionality reduction, whose key idea is to capture the maximum amount of variance in the data with a reduced number of features


Tuesday, December 27, 2022

Feature Selection

Feature selection, also known as variable or attribute selection, is the process of selecting a subset of the most relevant or useful features for use in model construction. Instead of creating new features, feature selection aims to identify the most discriminative features from the original dataset and discard the irrelevant or redundant ones.

Feature selection can be seen as the combination of a search technique that proposes new feature subsets with an evaluation measure that scores them. Feature selection techniques can be based on various criteria, such as statistical measures (like correlation analysis or mutual information scores) or machine learning algorithms. By reducing the feature space, feature selection simplifies the data representation, enhances interpretability, and potentially improves the performance of machine learning models by removing noise and irrelevant information.

According to their relationship with learning methods, feature selection techniques can be classified into three different classes: filter, wrapper and embedded methods.


Filter Methods

Filter models are based on the association between each feature and the variable to predict (class variable). These methods employ ranks based on statistical measures and are independent of the machine learning model, because they operate on the features themselves without considering their relationship with the specific model's internal mechanism. This means that filter methods can be applied with any algorithm or modeling technique.

One of the advantages of filter methods is their efficiency in terms of time consumption. They are computationally inexpensive and can handle large-scale datasets efficiently, making them an excellent choice for the data preprocessing step. They are faster compared to wrapper methods or embedded methods, which involve repeatedly training and evaluating the model.

Filter methods typically assign a score to each feature, reflecting its relevance or importance. These scores are calculated using statistical measures such as the Pearson correlation coefficient, the chi-squared test and the mutual information score; a short example using these measures is sketched after the list below.

  • Pearson correlation coefficient: It represents the linear relationship between each feature and the class variable. The resulting values lie in the range [-1, 1], indicating the strength and direction of the correlation, so features with high absolute correlation coefficients (close to 1 or -1) are considered more relevant.
  • Mutual Information: It measures the amount of information shared between two variables. In the context of feature selection, it quantifies the amount of information that a feature provides about the class variable. Features with high mutual information scores are considered more informative and are likely to have a strong relationship with the class variable.
  • Chi-Square Test: This method is specifically used for categorical variables. It measures the dependence between each categorical feature and the class variable using the chi-square statistic. Higher chi-square values indicate a stronger association between the feature and the class variable.
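
The sketch below is a minimal illustration of these three measures using scikit-learn; the Iris dataset and the choice of keeping two features are arbitrary, and any feature matrix X with a class variable y could be used instead.

# Minimal sketch of filter-based feature scoring (illustrative dataset and parameters)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Pearson correlation between each feature and the class variable
pearson_scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]

# Mutual information between each feature and the class variable
mi_scores = mutual_info_classif(X, y, random_state=0)

# Chi-squared statistic (requires non-negative features, e.g. counts or frequencies)
chi2_scores, p_values = chi2(X, y)

# Keep the two features with the highest mutual information scores
X_selected = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
print(pearson_scores, mi_scores, chi2_scores, X_selected.shape)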


Wrapper Methods

In simple words, wrapper methods can be seen as "brute force" search methods: they select subsets of features by evaluating the performance of a machine learning model. They try different subsets of features to train a specific machine learning model, and the subset with the best generalization performance is selected. Unlike filter methods, which are independent of the machine learning model, wrapper methods select features based on the specific model's performance.

Wrapper methods have several advantages, such as handling complex interactions between features and selecting features that are important for a specific model rather than for the dataset as a whole. However, they can be computationally expensive and may overfit the training data if the number of features is too large.

The subsets are usually selected from a large feature pool using search algorithms such as forward selection, backward elimination, exhaustive search, recursive feature elimination, etc. These algorithms explore different combinations of features and evaluate their impact on the model's performance.

  • Forward Selection: This algorithm starts with an empty set of features and iteratively adds one feature at a time. At each step, it evaluates the performance of the model using a specific criterion (e.g. accuracy) and selects the feature that improves the performance the most (a sketch of forward selection with scikit-learn follows this list).
  • Backward Elimination: It begins with a set of all available features and removes one feature at a time in each iteration. The algorithm evaluates the model's performance after removing each feature and eliminates the one that has the least impact on the model's performance.
  • Exhaustive Search: It evaluates all possible feature subsets, systematically tries every combination of features and measures the model's performance with each subset. This method guarantees finding the optimal subset of features in terms of performance but is computationally expensive, especially for large feature spaces.
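
As a hedged example, the following sketch uses scikit-learn's SequentialFeatureSelector to run forward selection; the logistic regression estimator, the Iris data and the target of two features are arbitrary choices for illustration.

# Minimal sketch of forward selection (wrapper method); estimator and dataset are illustrative
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Start from an empty set and greedily add the feature that most improves
# cross-validated accuracy, stopping when two features have been selected
selector = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction="forward", scoring="accuracy", cv=5)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features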

Embedded Methods

Embedded models combine the advantages of both filter and wrapper methods. In these, the features are selected during the training process of a specific machine learning model, and the feature selection ends automatically when the training process concludes.

Embedded methods may initially appear similar to wrapper methods, because both select features based on the learning procedure of a machine learning model. However, the biggest difference lies in how they perform the feature selection in relation to the training process: wrapper methods select features iteratively based on an evaluation metric, while embedded methods perform feature selection and train the machine learning algorithm in parallel.

Embedded methods typically involve adding a penalty term to the loss function during model training, which encourages the model to select only the most important features. This also means that embedded methods are more continuous and thus don't suffer as much from high variability compared with filter or wrapper methods. Some examples of algorithms commonly used as embedded methods are:

  • LASSO: LASSO stands for Least Absolute Shrinkage and Selection Operator. It's a linear regression method that adds a regularization term to the loss function. It applies an L1 penalty, resulting in some coefficients being exactly zero. If a coefficient is zero, the corresponding feature is discarded, while the features corresponding to non-zero coefficients are selected as important features. LASSO is widely used for feature selection and can handle both continuous and categorical variables (see the sketch after this list).
  • Ridge Regression: It is a linear regression method similar to LASSO, but ridge regression applies an L2 penalty instead of L1, which shrinks the coefficients towards zero without eliminating them entirely. Ridge regression helps control multicollinearity and reduces the impact of less important features, promoting more stable and reliable feature selection, but it is not preferable when the data contains a huge number of features of which only a few are important.
  • Elastic Net: It combines the L1 (LASSO) and L2 (Ridge Regression) regularization penalties to overcome their individual limitations. It aims to strike a balance between feature selection and coefficient shrinkage, which is particularly useful when dealing with highly correlated features.
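
As a minimal sketch of the LASSO case, the snippet below fits a LASSO model with scikit-learn and keeps only the features with non-zero coefficients; the diabetes dataset and the regularization strength alpha=0.1 are arbitrary illustrative choices.

# Minimal sketch of embedded feature selection with LASSO (illustrative dataset and alpha)
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# The L1 penalty drives some coefficients exactly to zero
lasso = Lasso(alpha=0.1).fit(X, y)

# Keep only the features whose coefficients are non-zero
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)
print(lasso.coef_)
print(X_selected.shape)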

 


Tuesday, December 20, 2022

Dimensionality Reduction

Dimensionality reduction is an important task in machine learning which facilitates the analysis, compression and visualization of high-dimensional data. The point is that most of the time machine learning algorithms (e.g. methods for clustering, classification, regression, etc.) don't need all the features from the dataset to get great results; in fact, in some cases they work better with a reduced dataset. Dimensionality reduction has a long history as a method for data visualization and for extracting low-dimensional features.

The primary objective of dimensionality reduction, as the name implies, is to reduce the number of features in the data while retaining as much relevant information as possible. This type of technique is useful in various scenarios:

  • Data and Model Interpretation: Dimensionality reduction methods can help simplify the data and model for users and stakeholders. By reducing the quantity of variables, we can focus on the key factors that influence the outcomes. This makes it easier to understand and interpret the model's behavior, as well as communicate the results to non-technical audiences.

  • Decrease Training Time: Dimensionality reduction can contribute to reducing the training time of machine learning models. When working with high-dimensional datasets, the training process can become computationally expensive and time-consuming. By removing irrelevant or redundant information and features, we reduce the complexity of the data and the model, which speeds up the training process. This is particularly beneficial when working with real-time applications, where efficiency is crucial.
  • Variance Reduction: Dimensionality reduction techniques can reinforce the generalization of machine learning models by reducing variance. High-dimensional datasets often contain noise and irrelevant features that can lead to overfitting, which occurs when the model memorizes the data instead of learning patterns. By focusing on the most informative and discriminative features, we improve the model's ability to generalize to unseen data. In this way, the machine learning model can make more robust and reliable predictions.
  • Data Compression: Dimensionality reduction methods can be applied to compress the data in situations where storage or computational resources are limited. By reducing the number of features, we also reduce the memory footprint required to store the data and the computational resources needed to process it. This compression is beneficial in scenarios such as big data analysis, where efficient storage and processing are critical for scalability and performance.


There are two different techniques for dimensionality reduction: feature selection and feature extraction.

  • Feature Selection: It focuses on selecting a subset of the original features that are most relevant and informative for the given task. Instead of creating new features, feature selection aims to identify the most discriminative features from the original dataset. These methods can be based on various criteria, such as statistical measures like correlation analysis and mutual information scores, or machine learning algorithms.
  • Feature extraction: On the other hand, feature extraction involves transforming the original features into a new set of features with reduced dimensionality. Rather than selecting a subset from the original features, feature extraction techniques create new features that capture the most important information from the original data (a short PCA sketch is shown after this list).
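
As a minimal sketch of feature extraction, the snippet below projects the Iris data onto its two principal components with scikit-learn; the dataset and the number of components are arbitrary illustrative choices.

# Minimal sketch of feature extraction with PCA (illustrative dataset and n_components)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 original features onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each new feature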

It's important to note that not all dimensionality reduction methods work equally well for all problems; the choice of technique depends on the specific characteristics of the dataset and the objectives of the analysis. Experimentation and evaluation of different dimensionality reduction methods are necessary to find the most suitable approach for a given task.



Monday, December 12, 2022

K-Means Algorithm


It is one of the simplest and most efficient clustering algorithms proposed in the data clustering literature. It is a flat (or partitional) algorithm, which means that the dataset is divided into multiple groups in one shot based on the characteristics and similarity of the data; you can read more about the types of clustering algorithms in Distance-Based Algorithms for Clustering.

The k-Means algorithm starts by choosing k representative points as initial centroids, for example by sampling k points at random from the dataset. Each point is then assigned to the closest centroid according to a specific distance function. Once the groups are formed, the centroid of each cluster is updated, replacing it with the mean of the points in that cluster. The whole process is described in the next image, which provides an outline of the basic k-Means algorithm.


Implementing the Algorithm in Python

The process of selecting the number of clusters K may differ for each dataset. In this post the number K will be taken as arbitrary, but in a future post I'll talk about how to correctly select K.

Importing Libraries and Generating Data

First, we import the necessary libraries and generate a synthetic dataset using the make_blobs function, a convenient tool for creating a dataset with a specified number of samples and clusters.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Choosing parameters
Num_points = 10000
k = 5
np.random.seed(0)

# Creating dataset
X, y_true = make_blobs(n_samples = Num_points, centers = k, cluster_std=0.8, random_state = 3)

The numpy and matplotlib libraries are used for numerical computations and plotting, respectively. In this case, we generate a dataset with 10000 points and 5 clusters. The X variable is an array that contains the points of the dataset, and the y_true variable is another array that contains the cluster to which each point belongs.

Defining Helper Functions

Here, we define two helper functions: getCluster and K_means

# Assigning one point to closest centroid
def getCluster(point,Centroids):
    Distances = []
    for j in range(len(Centroids)):
        Dif = point - Centroids[j]
        Distances.append(sum(Dif**2))       
    Cluster = Distances.index(min(Distances))     
    return Cluster
  
# K-means algorithm
def K_means(X, k):
    # Initializing the k centroids by sampling k random points from the dataset
    Centroids = np.array(X[np.random.choice(len(X), k, replace=False)])
    # Repeating the assignment and update steps 30 times
    for m in range(30):
        # Assigning each point to the closest centroid
        Clusters = np.array([getCluster(point, Centroids) for point in X])
        # Updating each centroid as the mean of the points assigned to it
        for i in range(len(Centroids)):
            Centroids[i] = np.mean(X[Clusters == i], axis=0)
    return Centroids, Clusters

The getCluster function takes a data point from the dataset and the array of centroids as input. It calculates the squared Euclidean distance between the data point and each centroid (the square root is omitted because it doesn't change which centroid is closest) and returns the index of the closest centroid.

The K_means function performs the K-means algorithm on the given dataset X with k clusters. The algorithm consists of the following steps:

  • Initialization: The k centroids are randomly chosen from the data points.
  • Iteration: The algorithm repeats a fixed number of times (30 in this case).
  • Assignment: Each data point is assigned to the closest centroid using the getCluster function.
  • Update: The centroids are updated by calculating the mean of the data points assigned to each centroid.

Applying K-means Algorithm

Here, we apply the K-means algorithm to the generated dataset using the previously programmed K_means function.

# Applying K-means algorithm
Centroids, Labels = K_means(X, k)

Another option is to import the KMeans model from the sklearn library.

from sklearn.cluster import KMeans
# Applying K-means algorithm
kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
Labels = kmeans.labels_
Centroids = kmeans.cluster_centers_

In both cases, the resulting variables are the centroids and labels (cluster assignments) arrays.

Plotting the Results

In order to have a visual interpretation of the algorithm, we plot the original dataset and the dataset after applying the K-means algorithm

# Defining figure and axes
f, axs = plt.subplots(1,2, figsize=(14,5), facecolor='#F5F5F5')

# Plotting original dataset
axs[0].set_facecolor('#F5F5F5')
axs[0].scatter(X[:, 0], X[:, 1], s=50) 
axs[0].set_title("Original Points")

# Plotting clustered dataset
axs[1].set_facecolor('#F5F5F5')
scatter = axs[1].scatter(X[:, 0], X[:, 1], c=Labels, s=50, cmap='spring' ) 
axs[1].legend(handles=scatter.legend_elements()[0], labels=range(k), title = 'Clusters')
axs[1].scatter(Centroids[:, 0], Centroids[:, 1], c='black', s=200, alpha=0.5)
axs[1].set_title("Clustered Points")
plt.show()

We create a figure with two subplots using plt.subplots. The first subplot (axs[0]) shows the original dataset, where each point represents a data sample. The second subplot (axs[1]) displays the dataset after applying the K-means algorithm, this time we color the points based on their assigned cluster labels and the centroids are marked with black circles. These graphs are shown below.

Below is a gif that contains the entire clustering process, in other words, it contains the changes of the centroids for each iteration.

As can be seen, the algorithm performed very well and found the centroids that cluster the dataset into 5 groups by iteration 18.

In conclusion, the K-means algorithm is a widely used and efficient clustering technique in data science. It offers simplicity, scalability, and fast convergence, making it suitable for various applications. Despite the challenges of selecting the optimal number of clusters and initializing centroids, K-means remains a valuable tool for uncovering patterns and gaining insights from datasets. Its ease of implementation and ability to handle large datasets make it a good choice for data scientists who need efficient clustering solutions.


Friday, December 2, 2022

Distance-Based Algorithms for Clustering

The differences among clustering algorithms, as explained in the Data Clustering post, lie in how "similarity" is defined. In probabilistic models, the main idea is to model the data with a generative process in which a specific distribution is assumed (e.g. Gaussian); the parameters of the model are then estimated using the Expectation-Maximization (EM) algorithm. Once the parameters are found, the generative probabilities (fit probabilities) of the underlying data points are calculated.

Many times, different kinds of generative models can be reduced to distance-based algorithms, because the mixture components often use a distance function within the probability distribution. For example, the Gaussian distribution represents probabilities in terms of the Euclidean distance from the mean of the mixture component. Distance-based methods are often more desirable than probabilistic models because of their simplicity and easy implementation, and they can generally be divided into two types: flat and hierarchical algorithms.

Flat methods: The data is divided into different clusters in one shot using partitioning representatives, optimizing specific objective functions (generally distance functions) by iteratively improving the quality of the partitions. In each iteration, the data points are assigned to their closest partitioning representatives, and then each representative is adjusted according to the data points assigned to its cluster; in this way the quality of the partitions improves with each iteration.

  • k-Means: This algorithm employs the Euclidean distance to calculate the distance between the data points and the partitioning representative of each cluster, which corresponds to the mean value of the cluster. This method is considered one of the simplest and most classical models for data clustering and is widely used because of its simplicity.
  • k-Medians: The operation of this algorithm is equivalent to k-Means, but the partitioning representative is the median along each dimension. This provides several advantages, such as stability in the presence of outliers and certain types of noise, because the median is less sensitive than the mean to extreme values in the data (a tiny numpy sketch after this list contrasts the two update steps).
  • k-Medoids: In this algorithm, contrary to k-Means and k-Medians, the partitioning representative is sampled from the original data. Such techniques are useful in cases where the data points are arbitrary objects, for example when a set of discrete sequence objects needs to be clustered, since it may not make sense to talk about a mean or median.
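
The tiny numpy sketch below contrasts the update step of k-Means and k-Medians on a toy cluster that contains an outlier; the points are made up purely for illustration.

# Toy comparison of the k-Means and k-Medians update steps (made-up points)
import numpy as np

cluster_points = np.array([[1.0, 2.0], [2.0, 3.0], [9.0, 4.0]])  # one cluster with an outlier

kmeans_rep = cluster_points.mean(axis=0)          # mean along each dimension: [4., 3.]
kmedians_rep = np.median(cluster_points, axis=0)  # median along each dimension: [2., 3.]

# The mean is pulled towards the outlier, while the median stays close to the bulk of the points
print(kmeans_rep, kmedians_rep)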

Hierarchical methods: The clusters are represented hierarchically through a dendrogram (a binary tree-based data structure) at varying levels of granularity. Hierarchical algorithms were developed to overcome some disadvantages of flat models, which generally require a predefined parameter K (the number of clusters) to obtain a solution and are often nondeterministic. Hierarchical methods can be categorized into agglomerative and divisive algorithms.

  • Agglomerative: These methods start by taking singleton clusters (clusters with only one data object) at the bottom level, then continue by merging two clusters at a time to build a bottom-up hierarchy. There are a variety of options for how these clusters are merged, which provide different levels of quality and efficiency, for example single-linkage, all-pairs linkage, sampled-linkage and centroid-linkage clustering (a short scikit-learn sketch follows this list).
  • Divisive: These work in the opposite way to agglomerative methods: they start with all the data objects in one huge cluster and split it continuously into two groups, producing a top-down hierarchy. Divisive algorithms are generally more efficient than agglomerative clustering, especially when there is no need to generate a complete hierarchy all the way down to individual leaves, in other words, when it's not necessary to have a perfectly balanced tree in terms of node depths.
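
As a minimal sketch, the snippet below runs agglomerative clustering with scikit-learn; the synthetic blobs, the number of clusters and the single-linkage criterion are arbitrary illustrative choices.

# Minimal sketch of agglomerative (bottom-up) clustering; data and linkage are illustrative
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Repeatedly merge the two closest clusters (single linkage) until 3 clusters remain
model = AgglomerativeClustering(n_clusters=3, linkage="single")
labels = model.fit_predict(X)
print(labels[:10])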



Thursday, November 24, 2022

Data Clustering

The problem of data clustering has been extensively examined in the fields of data mining and machine learning because of its applications to summarization, segmentation and target marketing. When there is no labeled information, clustering is considered one of the best tools to interpret data. In a few words, the problem of clustering may be described as:

Given an unlabeled dataset, a clustering algorithm partitions it into a set of groups whose members are as similar as possible

There are a huge number of ways to define "similarity": for example, it can be defined using the Euclidean distance between points, where the closest points are most similar to each other; or based on a probabilistic generative mechanism, where similarity is related to the probability of the points coming from the same distribution. The most common applications of clustering models are:

Collaborative Filtering: It can be used to group similar users or items based on their rating preferences. Once these clusters are identified, collaborative filtering techniques can be used within each cluster to make personalized recommendations. For example, Netflix uses a recommendation engine which clusters users with similar viewing habits and provides personalized show suggestions.

Customer Segmentation: Similarly to collaborative filtering, it creates groups of similar customers in the data; the difference is that instead of using rating information, arbitrary attributes about the objects can be used. For example, in an e-commerce campaign, the company divides the customer base into distinct segments based on various attributes such as age, gender, purchasing behavior, and preferences.

Data Summarization: Some clustering models are related to dimensionality reduction methods, which can be considered a form of data summarization, and may be employed to group similar documents together, allowing for topic modeling and the creation of compact data representations.

Multimedia Data Analysis: It can be used to identify patterns and group similar items based on visual or audio features. It is well suited for data compression and retrieval, since it reduces redundancy and simplifies complex data. In image processing, clustering algorithms like K-means can be used to segment regions or identify similar images. In audio analysis, clustering can help to identify music genres.

Social Network Analysis: It determines the communities in a specific social network, which provides an important understanding of consumers and active people. It also has applications in social network summarization, which is essential in personalized recommendations, anomaly detection, node identification, etc. For example, in a social media platform, clustering algorithms can be employed to identify communities with similar interests, connections, or interactions.

Before implementing clustering methods, it's necessary to preprocess the dataset. In particular, the feature selection phase is an important step, needed in order to improve the results and quality of the underlying clustering. Not all features carry the same amount of information for finding the clusters, because some may be noisier than others; for this reason it's important to include a preprocessing phase in which noisy and irrelevant features are removed. Dimensionality reduction algorithms can be employed for this task.


About Me

I am a Physics Engineer who graduated with academic excellence as the first in my generation. I have experience programming in several languages, like C++, Matlab and especially Python; using the last two I have worked on projects in the areas of image and signal processing, as well as machine learning and data analysis.
