Facial Expression Recognition for Remote Psychological Consultation

This project focused on developing a facial expression recognition (FER) system tailored for a psychology consultancy center offering remote therapy services. The goal was to help psychologists monitor facial expressions of patients in real time during online sessions, especially for remote, elderly, or disabled individuals who may not attend in-person consultations. The system supports clinicians in creating richer electronic health documents (EHDs) and delivering more personalized and trackable mental health care.

Overview

My Role

I led the end-to-end development of the system as a Machine Learning Engineer. My responsibilities included:

  • Evaluating and selecting facial landmark detection strategies
  • Designing a low-latency FER model to run in real time
  • Collecting and preprocessing facial data for training and validation
  • Training and evaluating models using TensorFlow, OpenCV, MediaPipe, and Scikit-learn
  • Developing a REST API for remote model inference in the second version
  • Integrating the model into a real-time communication system using WebRTC and WebSocket technology in the final version
  • Collaborating with cross-functional teams and addressing both technical and ethical considerations

Collaborators

This project was developed in close collaboration with:
Stakeholder – provided the overall vision and defined priorities based on client needs
Project Manager (PM) – coordinated timelines, deliverables, and feedback loops
Business Analyst (BA) – translated clinical requirements into technical specifications
Front-End Engineer – integrated the ML model into the UI and handled client-server interaction
Clinical Psychologist – advised on facial expression relevance and provided feedback on model outputs

Duration

Total Duration: 11 months across 3 development phases:

2022 (6 months): Initial desktop-based version. Focused on research, model design, local testing, and choosing optimal facial landmarks to reduce latency and jitter.
2023 (2 months): Improved REST API integration and model performance based on stakeholder feedback.
2024 (3 months): Final WebRTC-based version for real-time deployment, with infrastructure to support future features such as voice-based emotion cues.

Client

A private psychology consultancy center committed to enhancing accessibility and personalization in mental health services through digital tools.

Completed Date

Completed: Final version delivered in 2024.

Briefing

In response to the growing need for remote mental health services, this project developed a facial expression recognition (FER) system to help clinicians interpret patients’ emotional cues during virtual sessions. Recognizing the importance of nonverbal communication in therapy, the system enables real-time emotion analysis while maintaining strict privacy through user consent. It supports more personalized care and enhances electronic mental health records. Developed over three phases from 2022 to 2024, the project addressed challenges in model accuracy and latency, ultimately delivering a WebRTC-powered platform optimized for ethical, responsive, and accessible psychological support.

Outcomes

I developed a functional, real-time facial expression recognition (FER) system integrated into a remote therapy platform, enabling clinicians to observe and interpret patients’ emotional cues during virtual sessions. To ensure smooth performance, I optimized the model by selectively choosing facial landmarks, which significantly reduced latency and jitter. The system also supports secure, consent-based data sharing, giving both patients and therapists confidence in its ethical use. By providing a data-driven emotional insight layer, the platform empowers psychologists to deliver more informed and personalized care. Additionally, it contributes to the center’s vision of comprehensive digital mental health documentation, with infrastructure ready for future integration of voice-based emotion analysis and multi-modal assessment tools.

Problem Statements – The 4 Ws

As a Machine Learning Engineer, I applied the “4 Ws – Problem Statements” framework to clearly define and communicate the core challenge driving this project. This approach helped structure the technical and ethical scope of our work, ensuring the development of a solution that was both clinically relevant and technically feasible within a real-time, privacy-conscious context.

Who is affected?

Psychologists and their patients, especially those in rural areas, with physical disabilities, or facing mobility issues that prevent in-person visits.

What is the problem?

Remote therapy lacks the full spectrum of emotional cues, especially subtle facial expressions that are critical in assessing mental state and engagement.

Where does the problem occur?

During video-based remote psychological consultations where only basic video calls are used, with no automated emotional or facial expression tracking.

Why is it important?

Emotional expression is a key indicator in therapy. Without tools to assess it remotely, psychologists face reduced diagnostic accuracy and therapeutic effectiveness, limiting the quality of care for remote patients.

Reflection

The lead psychologist shared:

“Even slight asymmetries in a patient’s expression, like a tensed jaw or a furrowed brow, can tell us volumes. Having a tool that doesn’t overlook these nuances is a game-changer.”

Working on this project was both technically rewarding and personally impactful. Although I had prior experience in machine learning and computer vision, applying these skills in the context of mental health introduced new challenges and responsibilities. I had to consider not just accuracy, but also usability, patient dignity, and real-world conditions like limited bandwidth and inconsistent video quality. These factors pushed me to refine the model with a focus on subtle expression cues and emotional nuance, rather than broad emotion categories. Transitioning from a desktop prototype to a real-time WebRTC-enabled platform also sharpened my systems thinking. I had to optimize for low latency, data privacy, and seamless integration. Ultimately, the project’s ability to support underserved patients is what makes it most meaningful to me.

Process

From a machine learning engineering perspective, this project was not just about training a model; it was about building an end-to-end, real-time system that could reliably interpret facial expressions in a sensitive healthcare context. I structured my approach around the Double Diamond model, adapting each phase to the realities of ML product development: exploratory research, technical scoping, model iteration, and deployment.

Steps

The work moved through the four Double Diamond phases, Discover, Define, Develop, and Deliver, described below.

Discover

In this phase, I focused on understanding user needs and assessing technical feasibility. Together with the psychologist, I identified which facial features were most clinically relevant. I reviewed recent research in facial expression recognition and explored available tools, ultimately selecting MediaPipe for its speed and reliability in real-time landmark tracking. My early tests evaluated its robustness under different lighting conditions and camera angles to ensure suitability for clinical use.

Key output:
A prioritized list of target expressions, acceptable inference latency thresholds, and a narrowed list of viable detection strategies.
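
As an illustration of this kind of early feasibility test, the sketch below pulls FaceMesh landmarks from a webcam feed with MediaPipe and OpenCV and reports nose-tip movement as a rough jitter signal. The camera index, frame count, and jitter metric are assumptions for this example, not the project's actual evaluation scripts.

```python
import cv2
import mediapipe as mp
import numpy as np

# Minimal feasibility check: track FaceMesh landmarks for a few hundred frames
# and report frame-to-frame nose-tip movement as a rough jitter indicator.
face_mesh = mp.solutions.face_mesh.FaceMesh(
    static_image_mode=False,
    max_num_faces=1,
    refine_landmarks=True,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

cap = cv2.VideoCapture(0)  # assumed default webcam
prev_nose, jitter = None, []
for _ in range(300):
    ok, frame = cap.read()
    if not ok:
        break
    results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        continue
    lm = results.multi_face_landmarks[0].landmark
    nose = np.array([lm[1].x, lm[1].y])  # landmark 1 is the FaceMesh nose tip
    if prev_nose is not None:
        jitter.append(np.linalg.norm(nose - prev_nose))
    prev_nose = nose

cap.release()
face_mesh.close()
if jitter:
    print(f"mean jitter: {np.mean(jitter):.5f} (normalized image coordinates)")
```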

Define

The insights from discovery were converted into concrete ML system specifications. I defined model constraints such as:

  • Frame-rate compatibility for real-time video (≥ 15 FPS)
  • Latency budget (< 300ms for inference and streaming combined)
  • Minimum detection accuracy on emotion classes

Working with the business analyst and psychologist, I scoped model output formats (e.g., emotion class + confidence + optional landmark overlays) and established API design patterns for future integration.
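
A minimal sketch of what such a per-frame output record could look like is shown below. The field names and label values are hypothetical; the actual API contract defined with the business analyst is not reproduced here.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple


@dataclass
class EmotionPrediction:
    """One per-frame prediction as it might be serialized for the client."""
    emotion: str                                   # e.g. "neutral", "happy", "sad"
    confidence: float                              # probability of the top class
    timestamp_ms: int                              # capture time, used for session replay
    landmarks: Optional[List[Tuple[float, float]]] = None  # optional overlay points


prediction = EmotionPrediction(emotion="neutral",
                               confidence=0.87,
                               timestamp_ms=int(time.time() * 1000))
print(json.dumps(asdict(prediction)))
```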

Key output:
A technically viable FER architecture and a scalable inference pipeline design aligned with stakeholder needs.

Develop

This was the most iterative phase, consisting of three major engineering cycles:

  • Desktop prototype — I implemented and tested various landmark detectors, optimized the feature set to reduce dimensionality and noise, and trained the initial model using labeled expression data. All testing was performed locally to control variables and measure jitter under different loads.
  • Remote-ready version — I built a REST API around the model using Flask + Nginx, containerized the app with Docker, and ran evaluations on remote servers to simulate real-world environments.
  • Real-time system — I transitioned to WebRTC + WebSocket for live video communication, integrated coturn for NAT traversal, and optimized the backend for concurrency and stable real-time performance.

Throughout development, I used OpenCV and custom logging tools to visualize and validate frame-level predictions, and I benchmarked performance under constrained bandwidth conditions.
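
The remote-ready cycle centered on a small stateless inference endpoint. A simplified sketch of that idea is given below; the route name, form field, and `infer_emotion` helper are illustrative placeholders rather than the project's actual code.

```python
import cv2
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)


def infer_emotion(frame):
    """Placeholder for the landmark-extraction + TensorFlow classification pipeline."""
    return "neutral", 0.0


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a single encoded image frame in a multipart form field named "frame"
    upload = request.files.get("frame")
    if upload is None:
        return jsonify({"error": "missing 'frame' field"}), 400
    buffer = np.frombuffer(upload.read(), dtype=np.uint8)
    frame = cv2.imdecode(buffer, cv2.IMREAD_COLOR)
    if frame is None:
        return jsonify({"error": "could not decode image"}), 400
    emotion, confidence = infer_emotion(frame)
    return jsonify({"emotion": emotion, "confidence": confidence})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```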

Key output:
A robust and modular FER system, with pluggable components for emotion recognition, streaming, and future expansion.

Deliver

The final product was deployed in a staging environment for user testing and clinician feedback. I documented the entire system, from model training scripts to API contracts, and delivered the full pipeline alongside suggestions for future enhancements such as voice-based sentiment analysis, attention tracking, and emotion timeline visualizations.

The system was built to be lightweight, ethical-by-design, and easy to integrate into the consultancy’s digital health infrastructure.

Key output:
A production-ready FER platform with real-time capabilities, designed to scale with future multimodal upgrades.

Technical Architecture and Workflow

The FER system was architected as a modular pipeline optimized for low latency, real-time inference, and ease of deployment. Below is an overview of its core components and how they interact:

Client Side

Video Capture: Runs in the browser or native client using WebRTC.
Preprocessing (optional): Lightweight filtering or resizing handled on-device to reduce upstream bandwidth.
Consent Mechanism: Users must explicitly grant permission before data is processed.

Network Layer

REST API (second version): Previously used for stateless image uploads and testing; later phased out in favor of WebRTC.
WebRTC + coturn (final version): Enables peer-to-peer, low-latency video streaming; coturn provides NAT traversal and fallback relay.
WebSocket API: Handles real-time control messages and session metadata.
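
To illustrate the control-message role of the WebSocket channel, here is a minimal sketch using the Python `websockets` package (recent versions); the message schema, including the consent flag, is an assumption for this example rather than the production protocol.

```python
import asyncio
import json

import websockets


async def control_channel(websocket):
    # Hypothetical control messages, e.g. {"type": "start_session", "consent": true}
    async for raw in websocket:
        msg = json.loads(raw)
        if msg.get("type") == "start_session" and msg.get("consent") is True:
            await websocket.send(json.dumps({"type": "session_ack", "ok": True}))
        else:
            await websocket.send(json.dumps({"type": "error",
                                             "detail": "explicit consent required"}))


async def main():
    async with websockets.serve(control_channel, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled


if __name__ == "__main__":
    asyncio.run(main())
```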

Backend (Server Side)

Frame Receiver Module: Extracts frames from video streams and queues them for inference.
Facial Landmark Detection: Implemented using MediaPipe for its efficiency and cross-platform support.
Feature Selection Engine: Filters and transforms landmark data, reducing dimensions and jitter.
ML Inference Engine: TensorFlow model classifies emotion states from extracted features.
Analytics Layer: Stores timestamped emotion predictions and optional landmarks for session replay and documentation.
Monitoring & Logging: Nginx logs, custom Plotly dashboards, and Matplotlib for performance visualization.
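
The core of this pipeline, landmark filtering followed by classification, can be sketched as below. The landmark indices, label set, and model path are illustrative assumptions, not the production configuration.

```python
import numpy as np
import tensorflow as tf

# Illustrative subset of FaceMesh indices around the brows, eyes, and mouth;
# the production feature set was chosen empirically to minimize jitter.
SELECTED_LANDMARKS = [70, 63, 105, 66, 107, 336, 296, 334, 293, 300,  # brows
                      33, 159, 145, 263, 386, 374,                    # eye contours
                      61, 291, 13, 14, 0, 17]                         # mouth region

LABELS = ("neutral", "happy", "sad", "surprised", "angry")  # assumed label set
model = tf.keras.models.load_model("fer_model.h5")          # assumed model path


def landmarks_to_features(landmarks):
    """Flatten the selected (x, y) coordinates and normalize them to the face box."""
    pts = np.array([(landmarks[i].x, landmarks[i].y) for i in SELECTED_LANDMARKS],
                   dtype=np.float32)
    pts -= pts.min(axis=0)            # translate to the top-left of the face
    pts /= (pts.max(axis=0) + 1e-6)   # scale invariance reduces camera jitter
    return pts.reshape(1, -1)


def classify(landmarks):
    """Return the top emotion label and its confidence for one set of landmarks."""
    probs = model.predict(landmarks_to_features(landmarks), verbose=0)[0]
    top = int(np.argmax(probs))
    return LABELS[top], float(probs[top])
```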

Deployment

Containerized with Docker
Reverse proxy with Nginx
Prepared for integration with client’s electronic health documentation system via secured REST endpoints

Challenges & Solutions

A project of this scope came with several technical and ethical challenges. Here’s how I addressed them:

Balancing Latency vs. Accuracy

Facial expression recognition models are often heavy and computationally expensive, but the goal here was real-time inference.

Solution:

  • Selected lightweight detectors (MediaPipe) for landmark extraction
  • Reduced the feature space by selecting only the most expressive facial landmarks
  • Batched inference where possible, and tested on multiple hardware setups to fine-tune the latency threshold
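
To make the latency budget concrete, a simple benchmarking helper along these lines can flag frames that exceed the 300 ms target. The function and variable names are illustrative, and `pipeline` stands in for the full capture-to-prediction path.

```python
import time

import numpy as np


def benchmark_latency(pipeline, frames, budget_ms=300.0):
    """Run the per-frame pipeline over sample frames and report latency vs. budget."""
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        pipeline(frame)  # landmarks -> feature selection -> inference
        latencies.append((time.perf_counter() - start) * 1000.0)
    lat = np.asarray(latencies)
    over_budget = int((lat > budget_ms).sum())
    print(f"mean {lat.mean():.1f} ms | p95 {np.percentile(lat, 95):.1f} ms | "
          f"{over_budget}/{len(lat)} frames over the {budget_ms:.0f} ms budget")
```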

DEMO

“FER Portfolio Demo – Hosam Zolfonoon Portfolio”

This video presents a demo of my Facial Expression Recognition (FER) system for remote psychological consultations. Using MediaPipe, WebRTC, and WebSocket, it delivers real-time facial landmark tracking and emotion inference with explicit user consent. Designed to help clinicians capture subtle cues in virtual therapy, the system supports patients with mobility or accessibility challenges. For access or collaboration inquiries, contact: contact@hosamzolfonoon.pt.

Conclusion

This project is a reflection of what machine learning can achieve when it’s applied with both technical precision and human empathy. By developing a facial expression recognition system tailored for remote psychological consultation, I was able to blend model performance with real-world usability, ensuring that the solution worked not just in the lab, but in the lives of people who need care the most.

From low-latency inference design to WebRTC-based communication, and from subtle emotional detection to ethical deployment, this system stands as an example of full-stack ML engineering in the service of social good. Each phase, from discovery to delivery, pushed me to think not only as a technologist, but as a contributor to a larger vision: making mental health support more accessible, responsive, and data-informed.

Let’s Connect

Are you working on a project that bridges machine learning, real-time systems, or digital health?
Whether you’re building something innovative, looking for a technical collaborator, or just want to exchange ideas, I’d love to hear from you.
Feel free to reach out for a chat about projects, collaborations, or research.
Email me at: contact@hosamzolfonoon.pt
Let’s build technology that truly makes a difference.