Algorithms, Design Methods, and Many-core Execution Platform for Low-Power Massive Data-Rate Video and Image Processing

Artemis 2013 GA 621439

Project Description


Advanced image and video processing systems are becoming a crucial and resource-consuming part of embedded applications in many sectors. ALMARVI aims to facilitate the transition from a vertically structured market to a horizontally structured market. In particular, it focuses on reducing overall system design cost and time-to-market, enabling low-cost solutions for high-volume markets in different industrial domains, creating new market opportunities, and supporting SMEs.

The demonstrators developed under this project for the healthcare, security/surveillance/monitoring, and mobile use cases will directly lead to marketable applications and products in their relevant domains. Integrated releases of the image/video processing algorithm libraries, reference design tools and platforms, and system software stack solutions will be made available along with their evaluation for the demonstrated use cases. Cross-domain applicability will reduce fragmentation, thus increasing the market share of the European supplier industry.

Societal Impact

The project provides the core of solutions to big societal challenges such as affordable healthcare and wellbeing, green and safe transportation, and reduced power consumption.

1. Enable cross-domain re-use and interoperability for different product categories and application domains, thus promoting cross-fertilization and reuse of technology results.

2. Facilitate predictable system and product properties, and robust solutions.

3. Develop joint hardware-software techniques for resource and power management, while still providing massive data-rate processing and supporting interoperability over cross-domain platforms.


1. Reduce the cost of system design by 20% to 30% through modularity, flexible interfacing, adaptive architecture, an execution platform with well-developed tool chains, adaptability, and run-time configurability.

2. Reduce development cycles by 25% to 35% through seamless scalability and integration of hardware and software components, cross-domain component reuse, a cross-domain system software stack, design tools, and an understanding of the relevant system layers.

3. Manage a complexity increase with a 30% to 60% effort reduction through novel algorithms, architecture, design tools, execution platforms, and a system software stack with run-time adaptive resource and power management techniques.

4. Reduce effort and time for re-validation and re-certification by 15% to 20% through incremental cycles of design, development, testing, integration, and validation.

5. Achieve 20% to 50% cross-sectoral reusability of embedded systems through a system architecture that accounts for the common requirements of different sectors and application domains.

The key is to leverage the properties of image/video content while jointly adapting algorithms and hardware, in order to achieve a much higher potential for power savings and to enable massive data-rate processing. At the Application Layer, the goal is to adapt algorithms to the architectures. At the System Software Stack Layer, the adaptive run-time system allocates resources in an energy-efficient way to different applications executing simultaneously. At the Hardware Layer, ALMARVI's many-core execution platform provides the compute capabilities for diverse image/video processing applications.

Work Packages
Start time: 01.04.2014
Duration: 36 months
Budget: EUR 8.789 M
Countries involved: Netherlands, Turkey, Czech Republic, Finland


About ALMARVI project
Open software
  • SuRmob : The Institute of Information Theory and Automation (UTIA) introduced SuRmob, an application for mobile devices (Android) that acquires super-resolved images. Super-resolution is a mathematical technique that combines multiple low-resolution input images into one image of higher resolution. UTIA provides an efficient implementation that runs on mobile devices equipped with digital cameras.
  • vfTasks open-source parallelization library: Vector Fabrics (VF) maintains vfTasks, a library with a C API that provides the following features:
      • Management of worker thread pools
      • Inter-thread streaming communication channels
      • 2-D synchronization for parallelized loops

vfTasks depends only on libc and the pthreads library. The latter can, however, easily be replaced with a custom threading and memory allocation solution, allowing vfTasks to be ported to an embedded CPU or DSP. vfTasks is developed by VF and complements Vector Fabrics' Pareon product, which helps to parallelize C/C++ applications.
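
The combination of a worker thread pool and a streaming channel described above can be illustrated with a minimal sketch. It is written in Python with the standard `threading` and `queue` modules and does not use or reproduce the vfTasks C API; the names `run_pipeline`, `stage`, `num_workers`, and `capacity` are assumptions made for this illustration only.

```python
import queue
import threading

_DONE = object()  # sentinel that tells a worker the stream has ended


def run_pipeline(items, stage, num_workers=2, capacity=4):
    """Stream items through `stage` using a pool of worker threads.

    Hypothetical helper for illustration; not part of vfTasks.
    """
    inbox = queue.Queue(maxsize=capacity)  # bounded channel: producer blocks when full
    outbox = queue.Queue()                 # results, tagged with their input index

    def worker():
        while True:
            msg = inbox.get()
            if msg is _DONE:
                break
            index, value = msg
            outbox.put((index, stage(value)))

    pool = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in pool:
        thread.start()

    count = 0
    for index, value in enumerate(items):
        inbox.put((index, value))
        count += 1
    for _ in pool:
        inbox.put(_DONE)  # one sentinel per worker
    for thread in pool:
        thread.join()

    # All workers have finished, so the output queue is complete.
    results = [None] * count
    while not outbox.empty():
        index, value = outbox.get()
        results[index] = value
    return results
```

Calling `run_pipeline(range(5), lambda x: x * x)` squares each input on the pool and returns the results in input order. The bounded input queue is the design choice worth noting: it applies back-pressure to the producer, which is the usual reason to prefer a streaming channel over an unbounded work list.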

  • TCE : TUT maintains the TTA-based Co-Design Environment (TCE), a toolset for designing and programming customized processors based on the Transport Triggered Architecture (TTA). The toolset provides a complete retargetable co-design flow from high-level language programs down to synthesizable processor RTL (VHDL and Verilog backends supported) and parallel program binaries. Processor customization points include the register files, function units, supported operations, and the interconnection network.
  • POCL : TUT uses Portable Computing Language (pocl), which aims to become an MIT-licensed open source implementation of the OpenCL standard that can be easily adapted to new targets and devices, both homogeneous CPUs and heterogeneous GPUs/accelerators. It uses Clang as an OpenCL C frontend and LLVM for the kernel compiler implementation and as a portability layer. Thus, if your desired target has an LLVM backend, it should be able to get OpenCL support easily by using pocl. The goal is to improve performance portability using a kernel compiler that can generate multi-work-item work-group functions that exploit various types of parallel hardware resources: VLIW, superscalar, SIMD, SIMT, multicore, multithread.
  • rVEX : TUDelft maintains ρ-VEX, a reconfigurable and extensible Very-Long Instruction Word (VLIW) processor. It is part of the overall "Liquid Architectures" research theme within the Computer Engineering Lab at TU Delft, The Netherlands. The ρ-VEX processor architecture is based on the VEX ISA. The main concept of the design is to dynamically adapt the hardware to match the requirements of the applications and the operating environment. In this manner, resource utilization can be improved for energy savings or increased performance, e.g., by executing additional programs on the "freed" resources. Consequently, the design can be seen as one large wide-issue (up to 8) VLIW processor or as several 2-issue VLIW cores. The designs have also been used in several courses given at TU Delft, and this material can be made available to professors at other institutes upon request.
  • SDF3 : TUE developed and maintains the SDF3 toolchain. Synchronous dataflow (SDF) is a modelling formalism that allows design-time analysis of multiprocessor applications. SDF3 provides tool support for SDF-based analysis of throughput and latency, and it provides solutions to binding and scheduling questions on multiprocessors. Recent versions support Cyclo-Static Dataflow (CSDF) and Finite-State-Machine-based Scenario-Aware Dataflow (FSMSADF).
  • TRACE : TRACE is a Gantt chart visualization tool capable of presenting (large sets of) activities on resources, and the dependencies between them, as a function of time. Moreover, it allows visualizing multi-dimensional design spaces for easy comparison of design options. Various recent TRACE features (e.g., critical path analysis and distance analysis) were developed as part of joint activities between TUE and TNO-ESI and are being used in the ALMARVI project.
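
The design-time analysis that SDF permits, as mentioned for SDF3 above, typically starts from the balance equations of the graph: for every edge, the tokens produced per iteration must equal the tokens consumed. The following minimal sketch solves these equations to obtain the repetition vector and reports inconsistent graphs. It does not use SDF3 or its file formats; the function name `repetition_vector`, the edge tuple layout, and the assumption of a connected graph are choices made for this illustration only.

```python
from fractions import Fraction
from math import gcd


def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src] * produce == q[dst] * consume.

    `edges` is a list of (src, dst, produce_rate, consume_rate) tuples.
    Returns {actor: firings per iteration}, or None if the graph is
    inconsistent. Assumes a connected graph (illustrative sketch only).
    """
    adjacency = {actor: [] for actor in actors}
    for src, dst, produce, consume in edges:
        adjacency[src].append((dst, Fraction(produce, consume)))
        adjacency[dst].append((src, Fraction(consume, produce)))

    # Propagate relative firing rates from an arbitrary start actor.
    rates = {}
    for start in actors:
        if start in rates:
            continue
        rates[start] = Fraction(1)
        stack = [start]
        while stack:
            actor = stack.pop()
            for neighbour, ratio in adjacency[actor]:
                expected = rates[actor] * ratio
                if neighbour not in rates:
                    rates[neighbour] = expected
                    stack.append(neighbour)
                elif rates[neighbour] != expected:
                    return None  # balance equations have no non-zero solution

    # Scale the fractions to the smallest positive integers.
    scale = 1
    for rate in rates.values():
        scale = scale * rate.denominator // gcd(scale, rate.denominator)
    counts = {actor: int(rate * scale) for actor, rate in rates.items()}
    common = 0
    for count in counts.values():
        common = gcd(common, count)
    return {actor: count // common for actor, count in counts.items()}
```

For a chain in which A produces 2 tokens per firing, B consumes 3 and produces 1, and C consumes 2, the edges are `("A", "B", 2, 3)` and `("B", "C", 1, 2)`, and the function returns three firings of A, two of B, and one of C per iteration. Throughput and latency analysis build on this vector, since it fixes how often each actor fires in one graph iteration.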
Publications
  1. A. Brandon, J. Hoozemans, J. Van Straten, S. Wong, “Exploring ILP and TLP on a Polymorphic VLIW Processor”, to appear in the proceedings of the 30th International Conference on Architecture of Computing Systems, Vienna, Austria, 2017.
  2. J. Hoozemans, R. Heij, J. Van Straten, S. Wong, “VLIW-based FPGA computational fabric with streaming memory hierarchy for medical imaging applications”, to appear in the proceedings of the 13th International Symposium on Applied Reconfigurable Computing, Delft, the Netherlands, 2017.
  3. F. Sroubek, J. Kamenicky, and Y. M. Lu, “Decomposition of space-variant blur in image deconvolution”, in IEEE Signal Processing Letters, vol. 23, no. 3, pp. 346-350, 2016.
  4. E.P. van Horssen, A.R.B. Behrouzian, D. Goswami, D. Antunes, T. Basten and M. Heemels, “Performance analysis and controller improvement for linear systems with (m,k)-firm data losses”, in Proc. European Control Conference, ECC, Aalborg, Denmark, 2016.
  5. M. Buyukmihci, V.E. Levent, A.E. Guzel, O. Ates, M. Tosun, T. Akgun, C. Erbas, S. Gören, H.F. Ugurdag, "Output Domain Downscaler", in Proc. Intl. Symp. on Computer and Information Sciences (ISCIS), Krakow, Poland, 2016.
  6. A.E. Guzel, V.E. Levent, M. Tosun, M.A. Ozkan, T. Akgun, D. Buyukaydin, C. Erbas, H.F. Ugurdag, “Using High-Level Synthesis for Rapid Design of Video Processing Pipes”, in Proc. of East-West Design & Test Symposium (EWDTS), Yerevan, Armenia, 2016.
  7. M. Koskela, T. Viitanen, P. Jääskeläinen, and J. Takala, “Half-Precision Floating-Point Ray Traversal,” in Proc. Joint Conf. Comput. Vision Imaging Comput. Graphics Theory Appl., Rome, Italy, 2016.
  8. M. Hendriks, J. Verriet, T. Basten, B. Theelen, M. Brassé, and L. Somers, “Analyzing execution traces: critical-path analysis and distance analysis”, accepted for publication in Springer International Journal on Software Tools for Technology Transfer, 2016.
  9. Šroubek Filip, Kamenický Jan, Lu Y. M. “Decomposition of Space-Variant Blur in Image Deconvolution”  IEEE Signal Processing Letters vol.23, 3,  pp. 346-350, 2016.
  10. Hadi Alizadeh Ara, Marc Geilen, Twan Basten, Amir Behrouzian, Martijn Hendriks and Dip Goswami, “Tight Temporal bounds for dataflow applications mapped onto shared resources”, accepted for publication and presentation in the proceedings of the 11th IEEE International Symposium on Industrial Embedded Systems, 23-25 May 2016.
  11. Amir Behrouzian, Dip Goswami, Marc Geilen, Martijn Hendriks, Hadi Alizadeh Ara, Eelco Horssen, Maurice Heemels and Twan Basten, “Sample-Drop Firmness Analysis of TDMA-Scheduled Control Applications”, accepted for publication and presentation in the proceedings of the 11th IEEE International Symposium on Industrial Embedded Systems, 23-25 May 2016.
  12. J. Kotera, B. Zitová and F. Šroubek, “PSF accuracy measure for evaluation of blur estimation algorithms”, Proc. of IEEE International Conference on Image Processing (ICIP), Quebec City, 2015.
  13. A. A. C. Brandon, J. J. Hoozemans, J. Van Straten, A. F Lorenzon, A. L. Sartor, A. C. S. Beck, S. Wong, “A Sparse VLIW Instruction Encoding Scheme Compatible with Generic Binaries” in Proc. International Conference on ReConFigurable Computing and FPGAs (ReConFig), Mayan Riviera, Mexico, 2015.
  14. J. J. Hoozemans, J. Johansen, J. Van Straten, A. A. C. Brandon, S. Wong, “Multiple Contexts in a Multi-ported VLIW Register File Implementation” in Proc. International Conference on ReConFigurable Computing and FPGAs (ReConFig), Mayan Riviera, Mexico, 2015.
  15. T. Äijö, P. Jääskeläinen, T. Elomaa, H. Kultala, and J. Takala, “Integer Linear Programming Based Scheduling for Transport Triggered Architecture,” ACM Trans. Architecture and Code Optimization, Vol. 12, Issue 4, pp. 59:1-59:22, 2015.
  16. P. Jääskeläinen, C.S. de La Lama, E. Schnetter, K. Raiskila, J. Takala and H. Berg: “pocl: A Performance-Portable OpenCL Implementation,” Int. J. Parallel Programming, Vol. 43, Issue 5, pp. 752 – 785, 2015.
  17. H. Yviquel, A. Sanchez, P. Jääskeläinen, J. Takala, and M. Raulet, “Embedded Multi-Core Systems Dedicated to Dynamic Dataflow Programs,” J. Signal Processing Systems, Vol. 80, Issue 1, pp. 121 – 136, 2015.
  18. P. Jääskeläinen, H. Kultala, T. Viitanen, and J. Takala, “Code Density and Energy Efficiency of Exposed Datapath Architectures,” J. Signal Processing Systems, Vol. 80, Issue 1, pp. 49-64, 2015, doi:
  19. V. Korhonen, P. Jääskeläinen, M. Koskela, T. Viitanen, and J. Takala, “Rapid Customization of Image Processors Using Halide,” in Proc. IEEE Global Conf. Signal Inf. Process., Orlando, FL, USA, 2015.
  20. J. Glossner, P. Blinzer, and J. Takala, “HSA-Enabled DSPs and Accelerators,” in Proc. IEEE Global Conf. Signal Inf. Process., Orlando, FL, USA, 2015.
  21. T. Viitanen, M. Koskela, P. Jääskeläinen, H. Kultala, and J. Takala, “MergeTree: A HLBVH Constructor for Mobile Systems,” in ACM SIGGRAPH Asia, Kobe, Japan, 2015.
  22. H. Kultala, J. Multanen, P. Jääskeläinen, and J. Takala, “Impact of Operand Sharing to the Processor Energy Efficiency,” in Proc. CSI Int. Symp. Comput. Arch. & Digital Syst., Tehran, Iran, 2015.
  23. M. Koskela, T. Viitanen, P. Jääskeläinen, J. Takala, and K. Cameron, “Using Half Floating-Point Numbers for Storing Bounding Volume Hierarchies,” in Computer Graphics International Conference, Strasbourg, France, 2015.
  24. J. Kotera, F. Sroubek and B. Zitova, "PSF accuracy measure for evaluation of blur estimation algorithms", in International Conference on Image Processing (ICIP), Canada, 2015, (accepted for publication)
  25. I. Szentandrási, M. Zachariáš, J. Tinka, M. Dubská, J. Sochor and A. Herout, “INCAST”, in International Symposium on Mixed and Augmented Reality (ISMAR), Fukuoka, Japan, 2015
  26. M. J. Turnquist, M. Hiienkari, J. Makipaa and L. Koskinen, “A Fully Integrated Self-Oscillating Switched-Capacitor DC-DC Converter for Near-Threshold Loads” in Asian Solid-State Circuits Conference (A-SSCC), 2015 (accepted for publication)
  27. M. Hradiš, J. Kotera, P. Zemčík and F. Šroubek, “Convolutional Neural Networks for Direct Text Deblurring”, in Proceedings of The British Machine Vision Association and Society for Pattern Recognition BMVC, Swansea, UK, 2015
  28. B. Braithwaite, H. Niska, I. Pöllänen, T. Ikonen, K. Haataja, P. Toivanen, and T. Tolonen, “Optimized Curve Design for Image Analysis Using Localized Geodesic Distance Transformations”, In IS&T SPIE Electronic Imaging, California, USA, 2015
  29. A. R. B. Behrouzian, D. Goswami, T. Basten, M. Geilen and H. Alizadeh Ara, “Multi-Constraint Multi-Processor Resource Allocation”, in International conference on Embedded Computer Systems: Architecture, Modeling, and Simulation (SAMOS), Samos, Greece, 2015.
  30. D. Buyukaydin and T. Akgun, “GPU Implementation of an Anisotropic Huber-L1 Dense Optical Flow Algorithm Using OpenCL”, in International conference on Embedded Computer Systems: Architecture, Modeling, and Simulation (SAMOS), Samos, Greece, 2015.
  31. J. Multanen, T. Viitanen, H. Linjamäki, H. Kultala, P. Jääskeläinen, J. Takala, L. Koskinen, J. Simonsson, H. Berg, K. Raiskila and T. Zetterman, “Power Optimizations for Transport Triggered SIMD Processors”, in International conference on Embedded Computer Systems: Architecture, Modeling, and Simulation (SAMOS), Samos, Greece, 2015.
  32. J. Hannuksela, M. Niskanen and M. Turtinen, “Performance evaluation of image noise reduction computing on a mobile platform”, in International conference on Embedded Computer Systems: Architecture, Modeling, and Simulation (SAMOS), Samos, Greece, 2015.
  33. J. Kadlec, “Video Chain Demonstrator on Xilinx Kintex7 FPGA with EdkDSP Floating Point Accelerators”, in International conference on Embedded Computer Systems: Architecture, Modeling, and Simulation (SAMOS), Samos, Greece, 2015.
  34. I. Pöllänen, B. Braithwaite, K. Haataja, T. Ikonen and P. Toivanen, “Current Analysis Approaches and Performance Needs for Whole Slide Image Processing in Breast Cancer Diagnostics”, in International conference on Embedded Computer Systems: Architecture, Modeling, and Simulation (SAMOS), Samos, Greece, 2015.
  35. J. Hoozemans, S. Wong and Z. Al-Ars, “Using VLIW Softcore Processors for Image Processing Applications”, in International conference on Embedded Computer Systems: Architecture, Modeling, and Simulation (SAMOS), Samos, Greece, 2015.
  36. T. Viitanen, H. Kultala, P. Jääskeläinen and J. Takala, "Heuristics for greedy transport triggered architecture interconnect exploration", in International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), India, 2014
  37. I. Pöllänen, B. Braithwaite, T. Ikonen, H. Niska, K. Haataja, P. Toivanen, and T. Tolonen, “Computer-Aided Breast Cancer Histopathological Diagnosis – Comparative Analysis of three DTOCS-based Features: SWDTOCS, SW-WDTOCS, and SW-3-4-DTOCS”, In 4th International Conference on Image Processing Theory, Tools, and Applications (IPTA), France, 2014
  38. H. Kultala, T. Viitanen, P. Jääskeläinen, J. Helkala, J. Takala, "Compiler optimizations for code density of variable length instructions", In IEEE Workshop on Signal Processing Systems (SiPS), Ireland, 2014
  39. T. Ikonen, H. Niska, B. Braithwaite, I. Pöllänen, K. Haataja, P. Toivanen, J. Isola, and T. Tolonen, “Computer-Assisted Image Analysis of Histopathological Breast Cancer Images Using Step-DTOCS”, In 14th International Conference on Hybrid Intelligent Systems (HIS), Kuwait, 2014
  40. D. Goswami, D. Müller-Gritschneder, T. Basten, U. Schlichtmann and S. Chakraborty, “Fault-tolerant Embedded Control Systems for Unreliable Hardware,” In International Symposium on Integrated Circuits (ISIC), Singapore, 2014
  41. I. Zliobaite, J. Hollmén, J. Teittinen and L. Koskinen, “Towards hardware-driven design of low-energy algorithms for data analysis” in ACM SIGMOD Record archive, 2014
  42. K. van Gend, “Cut Power Consumption by 5x Without Losing Performance”, in LinuxCon, Düsseldorf, Germany, 2014
Public Deliverables

  • D 1.3: Cross-Layer Models for estimating System Properties/Parameters
  • D 2.4: Parallel and Power-Aware Image Segmentation Algorithms (Architecture and Design)
  • D 2.5: Parallel Object Recognition and Tracking, Motion Analysis Algorithms (Architecture and Design)
  • D 2.7: Parallel Image Enhancement, Restoration, and Fusion Algorithms (Architecture and Design)
  • D 3.3: Abstracting heterogeneous hardware architectures
  • D 3.5: Scalability, quality and usability of the execution platform
  • D 4.3: Design Space Exploration
  • D 4.6: Integrated System Software Stack
  • D 5.7: Evaluation of the ALMARVI Demonstrators
  • D 6.4: Progress Efficiency Report-1
  • D 6.5: Progress Efficiency Report-2
  • D 6.9: Project final report
  • D 7.1: ALMARVI Project Website
  • D 7.3: ALMARVI Dissemination plan and strategies
  • D 7.6: Dissemination Report (Intermediate)
  • D 7.7: Dissemination Report (Final)
  • D 7.8: ALMARVI Project Booklet
  • D 7.9: Standardisation Efforts
