publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2025
- [UNDER REVIEW] Enabling Motion-Based Sampling for Energy-Efficient Machine Vision. Raúl Taranco and Antonio González. Under review, 2025
- IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline. Raúl Taranco, José-María Arnau, and Antonio González. In Proceedings of the 31st International Symposium on High-Performance Computer Architecture (HPCA), Mar 2025
@inproceedings{tarancoIRIS2025,
  address = {Las Vegas, NV, USA},
  title = {IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline},
  copyright = {All rights reserved},
  isbn = {},
  shorttitle = {IRIS},
  url = {},
  doi = {},
  booktitle = {Proceedings of the 31st International Symposium on High-Performance Computer Architecture (HPCA)},
  series = {HPCA '25},
  publisher = {Association for Computing Machinery},
  author = {Taranco, Raúl and Arnau, José-María and González, Antonio},
  year = {2025},
  month = mar,
  pages = {Accepted for publication}
}
2024
- [PhD] Architectural Strategies to Enhance the Latency and Energy Efficiency of Mobile Continuous Visual Localization Systems. Raúl Taranco. Aug 2024
The emergence of new applications such as autonomous machines (e.g., robots or self-driving cars) and XR (Extended Reality) promises to revolutionize how society interacts with technology in the rapidly advancing digital era. These technologies, deployed on edge devices, often rely on mobile or embedded SoCs (Systems-on-a-Chip) operating CV (Continuous Vision) pipelines that periodically capture and analyze environmental light. A typical CV SoC comprises two main components: a frontend for image capture and a backend for processing vision algorithms. The frontend usually includes an off-chip camera sensor and an ISP (Image Signal Processor), which processes the pixel stream, converting raw sensor data into high-quality images. The backend, comprising components such as the CPU, GPU, or specialized accelerators, analyzes the images stored in the main memory's framebuffer to extract perception insights and enable advanced decision-making.

Existing research identifies visual localization, object detection, and tracking as the primary bottlenecks in these emerging applications. These algorithms face two principal challenges when deployed in mobile CV systems: latency and energy consumption. For example, an XR headset uses visual localization to track head motion for accurate frame rendering, where latency can cause discomfort. In self-driving cars, localization ensures centimeter-level precision, with delays compromising safety, especially at higher speeds. Additionally, high energy consumption limits the operation of battery-powered mobile systems.

This thesis embarks on a strategic journey to elevate mobile CV systems' performance and energy efficiency through several key contributions. We begin by analyzing a state-of-the-art visual localization engine on a CPU. The localization engine processes camera images, extracting and tracking features to estimate camera pose. Our evaluations reveal feature extraction as the primary bottleneck, accounting for 60% to 90% of total localization latency. Next, we investigate highly specialized hardware accelerator designs for image processing.

The first contribution is LOCATOR (Low-power ORB Accelerator for Autonomous Cars), a hardware accelerator designed for ORB (Oriented FAST and Rotated BRIEF) feature extraction. LOCATOR processes image tiles with two parallel pipelines for feature detection and description, employing techniques like optimal static bank access patterns, caching mechanisms, and selective port replication. These optimizations yield a 16.8× speedup for ORB feature extraction, a 1.9× end-to-end speedup, and a 2.1× energy reduction per frame compared to a baseline mobile CPU.

Realizing the need for more programmable and versatile solutions, our second contribution, SLIDEX (Sliding Window Extension for Image Processing), introduces a domain-specific vector ISA extension for CPUs. SLIDEX exploits the sliding window processing model, interpreting vector registers as overlapping windows to maximize data-level parallelism. SLIDEX reduces data access and movement, enhancing tasks like 2D convolutions and stencil operations, resulting in a 1.2× speedup and up to 19% energy reduction.

The third contribution, δLTA (δon't Look Twice, It's Alright), decouples camera frame sampling from backend processing. δLTA allows the frontend to identify and skip redundant image regions, focusing processing only on significant changes. δLTA reduces unnecessary memory accesses and redundant computations, lowering localization tail and average latency by 7.2% and 15.2%, respectively, and energy consumption by 17%.

Finally, IRIS (Image Region ISP-Software Prioritization) repurposes computations performed by the frontend ISP, segmenting and prioritizing image regions based on detail and motion. IRIS allows the backend to process relevant regions first, reducing tail latency by up to 9%, average latency by up to 20%, and energy consumption by up to 16%, without additional overhead.
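As a rough sanity check on the figures above, the per-stage and end-to-end speedups can be related through Amdahl's law. The sketch below is illustrative only: the accelerated-time fractions are hypothetical assumptions (the thesis reports 60% to 90% of localization latency for feature extraction, not end-to-end fractions).

```python
# Illustrative Amdahl's-law estimate relating a stage speedup to the end-to-end
# speedup. The fractions below are hypothetical, not measurements from the thesis.

def amdahl_speedup(fraction_accelerated: float, stage_speedup: float) -> float:
    """Overall speedup when `fraction_accelerated` of the runtime is sped up
    by `stage_speedup` and the rest is left unchanged."""
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / stage_speedup)

if __name__ == "__main__":
    stage_speedup = 16.8  # LOCATOR's reported ORB feature-extraction speedup
    for frac in (0.25, 0.50, 0.75, 0.90):
        print(f"accelerated fraction {frac:.2f} -> "
              f"end-to-end speedup {amdahl_speedup(frac, stage_speedup):.2f}x")
    # With roughly half of the end-to-end time in the accelerated stage, the
    # bound is ~1.9x, consistent with the 1.9x end-to-end figure quoted above.
```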
@phdthesis{taranco2024thesis,
  author = {Taranco, Raúl},
  title = {Architectural Strategies to Enhance the Latency and Energy Efficiency of Mobile Continuous Visual Localization Systems},
  school = {Universitat Politècnica de Catalunya (UPC)},
  year = {2024},
  month = aug,
  type = {PhD Thesis},
  doi = {10.5821/dissertation-2117-415317},
  handle = {http://hdl.handle.net/2117/415317}
}
- SLIDEX: A Novel Architecture for Sliding Window Processing. Raúl Taranco, José-María Arnau, and Antonio González. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS), Jun 2024
Efficient image processing is increasingly crucial in constrained embedded and real-time platforms, especially in emerging applications such as Autonomous Driving (AD) or Augmented/Virtual Reality (AR/VR). A commonality among most image processing operations is their reliance on primitives like convolutions and stencil operations, which typically utilize a sliding window dataflow. Many existing implementations are domain-specific, lacking generality, or are programmable at the cost of sacrificing performance and energy efficiency. Among the latter, CPU-based platforms, which typically rely on Vector Processing Units (VPUs), often miss critical optimization opportunities, particularly those arising from the overlapping nature of the mentioned windowed image processing operations. In response, we propose SLIDEX, a novel high-performance and energy-efficient vector ISA extension to exploit Sliding Window Processing (SWP) in conventional CPUs. SWP extends the conventional vector SIMD execution model, treating vector registers like variable-sized groups of overlapping pixel windows. A SLIDEX-enabled VPU processes multiple windows simultaneously, maximizing the Data Level Parallelism (DLP) achievable per instruction while maintaining the same vector length. Furthermore, it significantly reduces the need for data access, movement, and alignment, decreasing memory and register file accesses compared to traditional SIMD designs. To support SLIDEX, we introduce a cost-effective microarchitecture designed for easy integration into existing VPUs with minimal modifications. We demonstrate the efficacy of SLIDEX by testing it on a state-of-the-art visual localization task critical in AD and AR/VR. The results are compelling: SLIDEX achieves significant speedups in vital tasks such as 2D convolutions for image filtering and stencil operations for feature extraction, leading to an overall speedup of ∼1.2× and up to 19% energy reduction compared to traditional vector extensions.
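As an informal illustration of the sliding-window dataflow described above, the NumPy sketch below contrasts gathering a fresh window per output with exposing every overlapping window of a single loaded vector at once. The array sizes and 3-tap filter are arbitrary assumptions; this models the dataflow idea only, not the SLIDEX ISA or its microarchitecture.

```python
# Conceptual illustration of sliding-window processing (SWP) vs. element-wise
# processing for a 1D convolution/stencil. Dataflow idea only, not SLIDEX itself.
import numpy as np

pixels = np.arange(16, dtype=np.float32)               # a "vector register" worth of pixels
taps = np.array([0.25, 0.5, 0.25], dtype=np.float32)   # arbitrary 3-tap filter
W = taps.size

# Element-wise view: each output re-gathers its own window of inputs.
out_elementwise = np.array(
    [pixels[i:i + W] @ taps for i in range(pixels.size - W + 1)]
)

# Overlapping-window view: one strided view exposes every window of the loaded
# vector at once, so a single vectorized operation covers all window positions.
windows = np.lib.stride_tricks.sliding_window_view(pixels, W)  # shape (14, 3), no copy
out_swp = windows @ taps

assert np.allclose(out_elementwise, out_swp)
```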
@inproceedings{tarancoSLIDEXNovelArchitecture2024,
  address = {New York, NY, USA},
  title = {{SLIDEX}: {A} {Novel} {Architecture} for {Sliding} {Window} {Processing}},
  copyright = {All rights reserved},
  isbn = {9798400706103},
  shorttitle = {{SLIDEX}},
  url = {https://dl.acm.org/doi/10.1145/3650200.3656613},
  doi = {10.1145/3650200.3656613},
  booktitle = {Proceedings of the 38th {ACM} {International} {Conference} on {Supercomputing} (ICS)},
  series = {ICS '24},
  publisher = {Association for Computing Machinery},
  author = {Taranco, Raúl and Arnau, José-María and González, Antonio},
  month = jun,
  year = {2024},
  keywords = {convolution, image processing, ISA, SIMD, sliding-window, stencil},
  pages = {312--323}
}
2023
- δLTA: Decoupling Camera Sampling from Processing to Avoid Redundant Computations in the Vision Pipeline. Raúl Taranco, José-María Arnau, and Antonio González. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2023
Continuous Vision (CV) systems are essential for emerging applications like Autonomous Driving (AD) and Augmented/Virtual Reality (AR/VR). A standard CV System-on-a-Chip (SoC) pipeline includes a frontend for image capture and a backend for executing vision algorithms. The frontend typically captures successive similar images with gradual positional and orientational variations. As a result, many regions between consecutive frames yield nearly identical results when processed in the backend. Despite this, current systems process every image region at the camera's sampling rate, overlooking the fact that the actual rate of change in these regions could be significantly lower. In this work, we introduce δLTA (δon't Look Twice, it's Alright), a novel frontend that decouples camera frame sampling from backend processing by extending the camera with the ability to discard redundant image regions before they enter subsequent CV pipeline stages. δLTA informs the backend about the image regions that have notably changed, allowing it to focus solely on processing these distinctive areas and reusing previous results to approximate the outcome for similar ones. As a result, the backend processes each image region using different processing rates based on its temporal variation. δLTA features a new Image Signal Processing (ISP) design providing similarity filtering functionality, seamlessly integrated with other ISP stages to incur zero-latency overhead in the worst-case scenario. It also offers an interface for frontend-backend collaboration to fine-tune similarity filtering based on the application requirements. To illustrate the benefits of this novel approach, we apply it to a state-of-the-art CV localization application, typically employed in AD and AR/VR. We show that δLTA removes a significant fraction of unneeded frontend and backend memory accesses and redundant backend computations, which reduces the application latency by 15.22% and its energy consumption by 17%.
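The decoupling idea can be pictured with a small software sketch: split each frame into tiles, flag the tiles that changed since the previous frame, and hand only those to the backend. The tile size, difference metric, and threshold below are hypothetical, and δLTA performs this filtering inside the ISP rather than in software.

```python
# Toy region-level similarity filter: split a frame into tiles, compare each tile
# against the previous frame, and hand only the changed tiles to the backend.
# Tile size, metric, and threshold are illustrative choices, not deltaLTA's design.
import numpy as np

TILE = 32          # tile side in pixels (hypothetical)
THRESHOLD = 4.0    # mean absolute difference above which a tile is "changed"

def changed_tiles(prev: np.ndarray, curr: np.ndarray):
    """Yield (y, x, tile) for tiles of `curr` that differ enough from `prev`."""
    h, w = curr.shape
    for y in range(0, h, TILE):
        for x in range(0, w, TILE):
            p = prev[y:y + TILE, x:x + TILE].astype(np.float32)
            c = curr[y:y + TILE, x:x + TILE].astype(np.float32)
            if np.abs(c - p).mean() > THRESHOLD:
                yield y, x, curr[y:y + TILE, x:x + TILE]

# Usage: the backend processes only the changed tiles and reuses cached results
# (e.g., previously extracted features) for the tiles that were filtered out.
prev_frame = np.zeros((480, 640), dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[100:140, 200:260] += 50          # simulate motion in one region
todo = list(changed_tiles(prev_frame, curr_frame))
print(f"{len(todo)} of {(480 // TILE) * (640 // TILE)} tiles need reprocessing")
```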
@inproceedings{tarancoDLTADecouplingCamera2023,
  address = {Toronto, Canada},
  title = {$\delta${LTA}: {Decoupling} {Camera} {Sampling} from {Processing} to {Avoid} {Redundant} {Computations} in the {Vision} {Pipeline}},
  copyright = {All rights reserved},
  isbn = {9798400703294},
  shorttitle = {$\delta${LTA}},
  url = {https://dl.acm.org/doi/10.1145/3613424.3614261},
  doi = {10.1145/3613424.3614261},
  booktitle = {Proceedings of the 56th {Annual} {IEEE}/{ACM} {International} {Symposium} on {Microarchitecture} (MICRO)},
  series = {MICRO '23},
  publisher = {Association for Computing Machinery},
  author = {Taranco, Raúl and Arnau, José-María and González, Antonio},
  month = dec,
  year = {2023},
  keywords = {Computation Reuse, Image Signal Processor, Image Similarity},
  pages = {1029--1043}
}
- SLIDEX: Sliding Window Extension for Image Processing. Raúl Taranco, José-María Arnau, and Antonio González. In 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), Dec 2023
With the rising need for efficient image processing in emerging applications such as Autonomous Driving (AD) and Augmented/Virtual Reality (AR/VR), many existing solutions do not meet their performance and energy efficiency requirements or are domain-specific and lack generality. In this work, we introduce SLIDEX, a novel ISA extension that leverages Sliding Window Processing (SWP) to bridge this gap in the image processing domain. SWP is a novel SIMD model that exposes to the programmer and natively manipulates vector registers as groups of overlapped windows of pixels to exploit the sliding-window dataflow found in convolutions and other stencil operations. SWP amplifies the available Data Level Parallelism (DLP) and reduces memory and register file accesses. We evaluated SLIDEX benefits in the critical image processing task of a state-of-the-art visual localization system widely used in AD and AR/VR. SLIDEX obtains a ∼1.2× overall speedup and 22% energy reduction.
@inproceedings{tarancoSLIDEXSlidingWindow2023,
  address = {Vienna, Austria},
  title = {{SLIDEX}: {Sliding} {Window} {Extension} for {Image} {Processing}},
  copyright = {All rights reserved},
  shorttitle = {{SLIDEX}},
  url = {https://ieeexplore.ieee.org/document/10364589?signout=success},
  doi = {10.1109/PACT58117.2023.00039},
  booktitle = {2023 32nd {International} {Conference} on {Parallel} {Architectures} and {Compilation} {Techniques} (PACT)},
  series = {PACT '23},
  author = {Taranco, Raúl and Arnau, José-María and González, Antonio},
  year = {2023},
  keywords = {Visualization, Location awareness, convolution, Image processing, Energy efficiency, Parallel processing, image processing, Registers, Bridges, sliding-window},
  pages = {332--334}
}
- LOCATOR: Low-power ORB accelerator for autonomous cars. Raúl Taranco, José-María Arnau, and Antonio González. Journal of Parallel and Distributed Computing (JPDC), Dec 2023
Simultaneous Localization And Mapping (SLAM) is crucial for autonomous navigation. ORB-SLAM is a state-of-the-art Visual SLAM system based on cameras used for self-driving cars. In this paper, we propose a high-performance, energy-efficient, and functionally accurate hardware accelerator for ORB-SLAM, focusing on its most time-consuming stage: Oriented FAST and Rotated BRIEF (ORB) feature extraction. The Rotated BRIEF (rBRIEF) descriptor generation is the main bottleneck in ORB computation, as it exhibits highly irregular access patterns to local on-chip memories, causing a high performance penalty due to bank conflicts. We introduce a technique to find an optimal static pattern to perform parallel accesses to banks based on a genetic algorithm. Furthermore, we propose the combination of an rBRIEF pixel duplication cache, selective port replication, and pipelining to reduce latency without compromising cost. The accelerator achieves a reduction in energy consumption of 14597× and 9609×, with respect to high-end CPU and GPU platforms, respectively.
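To give a flavor of the offline search mentioned above, the toy sketch below uses a genetic algorithm to order a fixed set of accesses so that each group issued in parallel hits as few duplicate banks as possible. The bank count, group width, random offsets, and GA parameters are all hypothetical and far simpler than the paper's rBRIEF scheduling; the sketch only illustrates the kind of optimization involved.

```python
# Toy genetic algorithm: order a fixed list of memory offsets so that each group
# of LANES simultaneous accesses touches as few duplicate banks as possible.
# Bank mapping, group width, and offsets are illustrative, not LOCATOR's design.
import random

BANKS, LANES, N_ACCESSES = 8, 4, 64
random.seed(0)
offsets = [random.randrange(1024) for _ in range(N_ACCESSES)]  # stand-in for rBRIEF pattern points

def conflicts(order):
    """Count extra serialized accesses caused by same-bank hits within each group."""
    total = 0
    for i in range(0, len(order), LANES):
        banks = [offsets[j] % BANKS for j in order[i:i + LANES]]
        total += sum(banks.count(b) - 1 for b in set(banks))
    return total

def mutate(order):
    """Swap two positions in the access order."""
    a, b = random.sample(range(len(order)), 2)
    child = order[:]
    child[a], child[b] = child[b], child[a]
    return child

# Simple elitist evolution over permutations of the access order.
population = [random.sample(range(N_ACCESSES), N_ACCESSES) for _ in range(50)]
for _ in range(200):
    population.sort(key=conflicts)
    parents = population[:10]
    population = parents + [mutate(random.choice(parents)) for _ in range(40)]

best = min(population, key=conflicts)
print("conflicts before:", conflicts(list(range(N_ACCESSES))), "after:", conflicts(best))
```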
@article{TARANCO202332,
  title = {LOCATOR: Low-power ORB accelerator for autonomous cars},
  journal = {Journal of Parallel and Distributed Computing (JPDC)},
  volume = {174},
  pages = {32--45},
  year = {2023},
  issn = {0743-7315},
  doi = {10.1016/j.jpdc.2022.12.005},
  url = {https://www.sciencedirect.com/science/article/pii/S0743731522002507},
  author = {Taranco, Raúl and Arnau, José-María and González, Antonio},
  keywords = {ORB, ORB-SLAM, Hardware accelerator}
}
2022
- Sliding Window Support for Image Processing in Autonomous Vehicles. Raúl Taranco, José-María Arnau, and Antonio González. In Workshop on Compute Platforms for Autonomous Vehicles (CAV), held in conjunction with the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2022
Camera-based autonomous driving extensively manipulates images for object detection, object tracking, or camera-based localization tasks. Therefore, efficient and fast image processing is crucial in those systems. Unfortunately, current solutions either do not meet AD's constraints for real-time performance and energy efficiency or are domain-specific and, thus, not general [14]. In this work, we introduce Sliding Window Processing (SWP), a SIMD execution model that natively operates on sliding windows of image pixels. We illustrate the benefits of SWP through a novel ISA extension called SLIDEX that achieves high performance and energy efficiency while maintaining programmability. We demonstrate the benefits of SLIDEX for the image processing tasks of ORB-SLAM [17], [18], a state-of-the-art camera-based localization system. SLIDEX achieves an average end-to-end speedup of 1.65× and 1.2× compared to equivalent scalar and vector baselines, respectively. Compared with the vector implementation, our solution reduces the end-to-end energy consumption by 22% on average.
@inproceedings{taranco2022slidex,
  title = {{Sliding Window Support for Image Processing in Autonomous Vehicles}},
  author = {Taranco, Raúl and Arnau, José-María and González, Antonio},
  year = {2022},
  month = oct,
  booktitle = {Workshop on Compute Platforms for Autonomous Vehicles (CAV), held in conjunction with 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)},
  url = {https://sites.google.com/g.harvard.edu/cav-micro22/},
  urldate = {2022-09-04}
}
2021
- A Low-Power Hardware Accelerator for ORB Feature Extraction in Self-Driving Cars. Raúl Taranco, José-María Arnau, and Antonio González. In 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Oct 2021
Simultaneous Localization And Mapping (SLAM) is a key component for autonomous navigation. SLAM consists of building and creating a map of an unknown environment while keeping track of the exploring agent’s location within it. An effective implementation of SLAM presents important challenges due to real-time inherent constraints and energy consumption. ORB-SLAM is a state-of-the-art Visual SLAM system based on cameras that can be used for self-driving cars. In this paper, we propose a high-performance, energy-efficient and functionally accurate hardware accelerator for ORB-SLAM, focusing on its most time-consuming stage: Oriented FAST and Rotated BRIEF (ORB) feature extraction. We identify the BRIEF descriptor generation as the main bottleneck, as it exhibits highly irregular access patterns to local on-chip memories, causing a high performance penalty due to bank conflicts. We propose a genetic algorithm to generate an optimal memory access pattern offline, which greatly simplifies the hardware while minimizing bank conflicts in the computation of the BRIEF descriptor. Compared with a CPU system, the accelerator achieves 8x speedup and 1957x reduction in power dissipation.
@inproceedings{tarancoLowPowerHardwareAccelerator2021a,
  title = {A {Low}-{Power} {Hardware} {Accelerator} for {ORB} {Feature} {Extraction} in {Self}-{Driving} {Cars}},
  url = {https://ieeexplore.ieee.org/document/9651662},
  doi = {10.1109/SBAC-PAD53543.2021.00013},
  urldate = {2025-03-11},
  booktitle = {2021 {IEEE} 33rd {International} {Symposium} on {Computer} {Architecture} and {High} {Performance} {Computing} ({SBAC}-{PAD})},
  series = {SBAC-PAD '21},
  author = {Taranco, Raúl and Arnau, José-María and González, Antonio},
  month = oct,
  year = {2021},
  keywords = {Autonomous automobiles, Feature extraction, hardware accelerator, ORB, ORB-SLAM, Power dissipation, Real-time systems, Simultaneous localization and mapping, Software, Visualization},
  pages = {11--21}
}