Parallel Programming on Embedded MPSoCs
Practical Work S9 EII
2019-2020
Karol Desnos, Florian Arrestier

Content
Assignment
1 PThreads
  1.1 Image Processing Algorithm for Parallelization
    1.1.A Visual Studio 2017 Setup
    1.1.B The Sobel Filter
  1.2 Parallelization with PThreads
    1.2.A Fork-Join
    1.2.B Master-Slave Threads and Semaphores
    1.2.C Asymmetric parallelism
2 OpenMP
  2.1 Use OpenMP on an X86 Target
    2.1.A OpenMP Hello World
    2.1.B Sobel Parallelization
  2.2 Use OpenMP on a C6678 Target
    2.2.A Build Environment Setup
    2.2.B Sobel parallelization
3 Dataflow programming with PREESM
  3.1 Preesm Setup
  3.2 Online tutorials
Project Assignment
1 Objective
2 SqueezeNet Deep Neural Network
  2.A Tell me what I am!
  2.B Given SqueezeNet Deep Neural Network Implementation
3 Technical Assignment
4 Organization and deadlines
5 Warnings and Recommendations

2 INSA Parallel Programming on Embedded MPSoCsS9 EIIPW

Assignment Instructions
To complete this assignment, you will have to write a short report, in English, presenting the results of the 3 PWs. The purpose of this report is to show that you understand what you did during the practical sessions, and why you did it.
The following assignments contain many suggestions and questions whose answers should be written in the report. Nevertheless, it is important to understand that these questions should not be answered linearly, one after the other. Instead, you must write a well-structured document, as you would do for presenting the result of a project or an internship.
The report length is limited to 5 pages. Only the body of the report, which comprises text paragraphs (introduction, development, and conclusion), the illustrations, and the appendices, is restricted to 5 pages. The covers, the flyleaf, and the outline of the report are not counted in the 5-page limit. Only one report should be submitted for each pair of students. Please make sure that the two names appear on the report.
Deadline
The deadline for submitting the report is one week after the 6th lab session, i.e. one week after the last session dedicated to PW3. Check the moodle website of the course for the exact deadline.

PW 1
PThreads
Prerequisites:
C language and Advanced C language course.
Real-time systems course.
Attend the lecture, read and understand the lecture notes.
Read and understand the whole assignment before the practical session.
Understand the C code of the Sobel filter before the practical session (cf. Section 1.1.B).
Objectives (4h):
Understand a simple image filtering algorithm: the Sobel filter.
Parallelize an application on an x86 Central Processing Unit (CPU) with PThreads.
Introduction
In this assignment, you will use the pthread Application Programming Interface (API) to parallelize an image processing algorithm. Whatever parallel programming API is used, understanding how an algorithm works is the key to its successful parallelization. This statement is particularly true for the pthreads API, where most parallelism is explicitly specified by the developer. Hence, to understand how the image processing algorithm works, you will have to study it before the practical session. In the second part of this practical assignment, you will test two strategies to parallelize the application on a multicore X86 CPU.
1.1 Image Processing Algorithm for Parallelization
The purpose of this first part of the PW is to set up the build environment and compile the project that will be used as a basis for parallelization.
1.1.A Visual Studio 2017 Setup
The following procedure explains how to set up, build and test the project that will be completed in this assignment. This procedure has been tested only on Windows 7 with the Visual Studio 2017 Integrated Development Environment (IDE).
Although, with moderate modifications, this project would be compatible with other operating systems and IDEs, support will be given only for Visual Studio 2017 on Windows 7. No compiler-related support will be given on personal laptops during lab sessions.

1. Download the pthreadlab.7z archive from the Moodle website of the course: http://moodle.insa-rennes.fr/course/view.php?id=355.
2. Extract the content of the archive in a dedicated directory.
CMake is a utility tool whose purpose is to ease the portability of complex C/C++ applications by generating projects for most popular IDEs (Code::Blocks, Visual Studio, Makefile, QT Creator, ...) on major operating systems (Linux, Windows, Mac OS). To achieve this purpose, source code files and project dependencies are specified in a configuration file, called CMakeLists.txt, using a specific description language. When CMake is launched, it automatically generates a project for a specified IDE, where all dependencies on third-party libraries are configured.
3. Run pthreadlab/CMakeVS2017.bat to launch the CMake tool for the project. This will both generate the Visual Studio project and copy all the DLLs required for its execution.
4. Open the generated project pthreadlab/bin/pthreadlab.sln with Visual Studio.
5. In the Solution Explorer of Visual Studio, right-click on the pthreadlab project, and select Set as StartUp Project.
6. Run the application. In Visual Studio 2017, when measuring the performance of the application, run it using the Debug > Start without debugging (Ctrl+F5) option. This considerably lowers the overhead of Visual Studio on your application performance, especially in multithreaded cases.
In its current state, the application continuously reads and displays the frames of the given video file. The number of frames per second (fps) read by the application is displayed in the console. Note down the fps of the current application, as it will serve as a baseline for future analysis of application performance and for applying Amdahl's Law.
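As a reminder, Amdahl's Law bounds the speedup reachable with n threads when only a fraction p of the sequential execution time can be parallelized: S(n) = 1 / ((1 - p) + p / n). The sketch below turns this formula into C so that the baseline fps can be compared with later measurements. The function names and the parameters are illustrative assumptions and are not part of the lab archive.

```c
/* Upper bound on the speedup given by Amdahl's Law.
 * p: fraction of the sequential run time that can be parallelized (0..1)
 * n: number of threads (>= 1)
 * Hypothetical helper, not part of the lab archive. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Best frame rate that can be expected from the sequential fps baseline. */
double predicted_fps(double sequential_fps, double p, int n)
{
    return sequential_fps * amdahl_speedup(p, n);
}
```

For instance, if profiling suggested that 80% of the time per frame is spent in code that can be parallelized, 4 threads could not yield more than a 2.5x speedup, whatever the implementation.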
1.1.B The Sobel Filter
Figure 1.1: Sobel filtering example. (a) Original image. (b) Filtered image.

The Sobel filter is an image transformation widely used in image processing applications to detect the edges within a 2-dimensional picture.
YUV files located in the dat folder contain uncompressed video streams encoded using the YUV colorspace. In images encoded in the YUV colorspace, each pixel is coded with three integer values: a luma value Y for the brightness of the pixel, and two chrominance values U and V for the color of the pixel. In all videos available on the Moodle website of the course, the YUV420 format is used. In this image format, each luma value is associated with a unique pixel, but each U or V value is associated with a block of 2-by-2 = 4 pixels. In this assignment, you are only required to process the luma component of the pixels, which is read from the YUV file by the readYUV function and stored in an array of bytes (i.e. unsigned char) of size height*width.
The application of the Sobel filter consists of convolving the Y component of the original image with 2 matrices to obtain two intermediate images. These two images are then assembled to form the final image. Figure 1.1 presents an image and the result of the application of the Sobel filter.
Formally, with:
A, the Y component of the original image, of dimension height x width,
Gx and Gy, the two intermediate images,
G, the result of the filtering operation,
*, the 2-dimensional convolution operation,
the application of the Sobel filter is defined as follows:
\[ G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * A \quad \text{and} \quad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * A \]
\[ \forall (i, j) \in [0, \mathit{height}) \times [0, \mathit{width}), \quad G(i, j) = \sqrt{G_x(i, j)^2 + G_y(i, j)^2} \]
The pseudocode for the Sobel filtering algorithm is given in Algorithm 1. The missing terms can easily be found in the previous equations.

ALGORITHM 1: Sobel filter
Input: Y, the luma component of the input image; height and width, the dimensions of Y
Output: YSobel, the filter result
for each row in [1, height-2] do
    for each col in [1, width-2] do
        gx := -Y[row-1,col-1] - 2*Y[row,col-1] - Y[row+1,col-1] + 3 terms
        gy := -Y[row-1,col-1] - 2*Y[row-1,col] - Y[row-1,col+1] + 3 terms
        YSobel[row,col] := sqrt(gx*gx + gy*gy)
    endfor
    YSobel[row,0] := 255
    YSobel[row,width-1] := 255
endfor
Fill first row of YSobel with 255
Fill last row of YSobel with 255

Note: When running the application with the release profile, use the Debug > Start without debugging option to avoid any performance degradation.
7. Add a call to the sobel function to the program built at step 6. The code of the function is provided in the archive downloaded from Moodle.
8. At this step, you have obtained the sequential version of the application. From now on, note down the characteristics of each version of your program (performance, number of lines of code, CPU/memory usage, ...) for comparison with future versions.

9. In main.c, change the call to the MD5Update function to apply it to the output of the Sobel filter. The hash computed by this function, written in the md5.txt file when running the application, will be used to make sure that the result produced by the algorithm is not altered when parallelizing the application. Keep a copy of this file for future comparisons.
In the next sections, you will attempt to parallelize this application using the pthread API.
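For reference, the sequential filter of Algorithm 1 can be sketched in C as follows. This is only an illustration, not the sobel function provided in the archive: the function name, the parameter order, the sign convention of the kernels and the clamping of results to 255 are assumptions. An integer square root is used in place of sqrt so that the sketch does not depend on the math library.

```c
#include <string.h>

/* Integer square root clamped to 255, so the result always fits in a byte. */
static unsigned char clamped_isqrt(unsigned v)
{
    unsigned r = 0;
    while (r < 255 && (r + 1) * (r + 1) <= v)
        r++;
    return (unsigned char)r;
}

/* Sequential Sobel filter on the luma plane, following Algorithm 1.
 * y and ySobel are height*width byte arrays stored row by row.
 * Hypothetical signature; the function of the lab archive may differ. */
void sobel_sketch(int width, int height,
                  const unsigned char *y, unsigned char *ySobel)
{
    for (int row = 1; row < height - 1; row++) {
        const unsigned char *up  = y + (row - 1) * width;
        const unsigned char *mid = y + row * width;
        const unsigned char *dn  = y + (row + 1) * width;
        for (int col = 1; col < width - 1; col++) {
            /* Horizontal and vertical gradients (3x3 neighbourhood). */
            int gx = -up[col - 1] - 2 * mid[col - 1] - dn[col - 1]
                     + up[col + 1] + 2 * mid[col + 1] + dn[col + 1];
            int gy = -up[col - 1] - 2 * up[col] - up[col + 1]
                     + dn[col - 1] + 2 * dn[col] + dn[col + 1];
            ySobel[row * width + col] =
                clamped_isqrt((unsigned)(gx * gx + gy * gy));
        }
        ySobel[row * width] = 255;               /* left border  */
        ySobel[row * width + width - 1] = 255;   /* right border */
    }
    memset(ySobel, 255, (size_t)width);                                 /* first row */
    memset(ySobel + (size_t)(height - 1) * width, 255, (size_t)width);  /* last row  */
}
```

Each output pixel only reads the input image and writes a single location of ySobel, which is the property that the parallelization strategies of the next sections rely on.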
1.2 Parallelization with PThreads
In this section, you will parallelize the Sobel filter using pthreads. You will successively implement two different parallelization strategies: fork-join and master-slave.
1.2.A Fork-Join
In this parallelization strategy, for each frame of the video, i.e. for each execution of the while(!stopThreads) loop, you will create and terminate threads that each apply the sobel filter to a part of the input image.
The purpose of this first parallelization strategy is to learn how to create, start and end threads, and how to pass arguments to threads. This strategy is first implemented with two threads before being generalized to an arbitrary number of threads.
With 2 Threads
10. Before creating threads, separate the unique call to the sobel function into two calls, each processing half of the image. Carefully select how the image is divided for future parallel processing. Do not create any new image buffer when parallelizing this application: the y and ySobel buffers are sufficient.
11. Check that the result of the algorithm is identical to the original one. Have a look at the generated md5.txt file to confirm your observations. You may need to move some part of the sobel function into the main function to preserve the output.
12. Create a struct to store all the arguments needed to call the sobel function from another thread.
13. Move one of the two sobel function calls to a secondary thread. Check the performance of the parallelized application. When evaluating the performance of your implementation, it is a good idea to open the performance monitor of your operating system to check the properties of your process: percentage of CPU time used, number of threads of the process, etc. Unfortunately, due to an unknown number of threads being launched by the SDL library, the number of threads indicated can hardly be trusted.
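Steps 12 and 13 can be sketched as follows. A pthread entry point receives a single void* argument, hence the struct that groups everything a worker needs. In this sketch, which is only one possible arrangement, a pixel inversion stands in for the sobel call, and all names (band_args_t, process_band, process_frame_2threads) are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* Arguments of one worker: a band of rows [first_row, last_row) of the frame. */
typedef struct {
    int width;
    int first_row;
    int last_row;
    const unsigned char *in;   /* shared input frame, read only        */
    unsigned char *out;        /* shared output frame, disjoint bands  */
} band_args_t;

/* Stand-in for the sobel call: inverts each pixel of the band. */
static void process_band(const band_args_t *a)
{
    for (int row = a->first_row; row < a->last_row; row++)
        for (int col = 0; col < a->width; col++)
            a->out[row * a->width + col] =
                (unsigned char)(255 - a->in[row * a->width + col]);
}

/* Thread entry point: unpacks the struct passed through void*. */
static void *band_thread(void *arg)
{
    process_band((const band_args_t *)arg);
    return NULL;
}

/* Fork-join over one frame: a secondary thread handles the top half,
 * the calling thread handles the bottom half, then the thread is joined.
 * Returns 0 on success, -1 if the thread could not be created. */
int process_frame_2threads(int width, int height,
                           const unsigned char *in, unsigned char *out)
{
    band_args_t top    = { width, 0, height / 2, in, out };
    band_args_t bottom = { width, height / 2, height, in, out };
    pthread_t worker;

    if (pthread_create(&worker, NULL, band_thread, &top) != 0)
        return -1;
    process_band(&bottom);         /* the main thread does its own share */
    pthread_join(worker, NULL);    /* join before using the output frame */
    return 0;
}
```

The two bands write disjoint rows of the output, so no lock is needed. With the real filter, each band also reads one input row above and below its own rows, which is harmless because the input frame is never written.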
With n Threads
Note: Keep a copy of the current version of the application, as it will also be used as a base version in the next practical sessions. We strongly advise you to use a Git project for this purpose.
14. Define a preprocessor variable NBTHREADS whose value corresponds to the total number of threads, including the main thread, used for computing the sobel function.
15. Analyze the application performance with a variable number of threads. Was this performance predictable?
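One possible way to generalize the fork-join version to NBTHREADS threads is sketched below. It assumes that the frame is cut into horizontal bands of nearly equal height, and it again uses a pixel inversion as a stand-in for the sobel call. The helper band_bounds and the type slice_t are hypothetical names, and the default value of 4 is arbitrary.

```c
#include <pthread.h>

#ifndef NBTHREADS
#define NBTHREADS 4   /* total number of threads, main thread included */
#endif

typedef struct {
    int first_row, last_row;   /* rows [first_row, last_row) of the frame */
    int width;
    const unsigned char *in;
    unsigned char *out;
} slice_t;

/* Bounds of band `id` when `height` rows are split into `n` near-equal bands. */
void band_bounds(int id, int n, int height, int *first, int *last)
{
    *first = (int)((long long)id * height / n);
    *last  = (int)((long long)(id + 1) * height / n);
}

/* Stand-in for the sobel call on one band (pixel inversion). */
static void *slice_worker(void *arg)
{
    const slice_t *s = (const slice_t *)arg;
    for (int i = s->first_row * s->width; i < s->last_row * s->width; i++)
        s->out[i] = (unsigned char)(255 - s->in[i]);
    return NULL;
}

void process_frame_nthreads(int width, int height,
                            const unsigned char *in, unsigned char *out)
{
    pthread_t threads[NBTHREADS];
    slice_t slices[NBTHREADS];
    int started[NBTHREADS] = { 0 };

    for (int id = 0; id < NBTHREADS; id++) {
        slices[id].width = width;
        slices[id].in = in;
        slices[id].out = out;
        band_bounds(id, NBTHREADS, height,
                    &slices[id].first_row, &slices[id].last_row);
    }
    /* Fork: bands 1..NBTHREADS-1 go to secondary threads. */
    for (int id = 1; id < NBTHREADS; id++) {
        started[id] = (pthread_create(&threads[id], NULL,
                                      slice_worker, &slices[id]) == 0);
        if (!started[id])
            slice_worker(&slices[id]);   /* fall back to the calling thread */
    }
    slice_worker(&slices[0]);            /* band 0 is computed by the main thread */
    /* Join: wait for every thread that was actually started. */
    for (int id = 1; id < NBTHREADS; id++)
        if (started[id])
            pthread_join(threads[id], NULL);
}
```

Because threads are created and joined for every frame, the cost of thread creation is paid at each iteration of the main loop, which is worth keeping in mind when interpreting the measurements of step 15.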

1.2.B Master-Slave Threads and Semaphores
In this parallelization strategy, which is presented in the OpenMP section of the lecture notes, the main thread, called the master thread, is still responsible for reading and displaying YUV video frames as well as performing part of the computation. Each secondary thread, called a slave thread, will be created at the initialization of the application, and will contain its own while(!stopThreads) loop. A thread synchronization mechanism will be used when needed.
16. Create NBTHREADS slave threads implementing the behavior described above. The distribution of computations remains identical to the one implemented in the previous fork-join implementation. The only differences are that secondary threads are started at the initialization of the application and not within the main loop, and that each secondary thread now calls the sobel function within a while(!stopThreads) loop. Do not implement any synchronization mechanism yet. Does your program work? Comment on CPU usage and instrument your code with printfs to measure approximately how much unnecessary processing is done.
17. Use the pthread_barrier mechanism to synchronize your threads. Don't forget to free/destroy all elements from the pthread library used by your program on its completion. How many synchronization points are required to keep the application fully functional? Study the impact of the synchronization mechanism on application performance.
1.2.C Asymmetric parallelism
In the previous sections, the processing performed in parallel was symmetrically (i.e. equally) distributed among threads. Hence, for each loop iteration, while the main thread is reading or displaying a video frame, all other threads are in an idle state, waiting for the next synchronization.
Using a profiling tool, you will first measure the amount of time that is wasted waiting for the next synchronization. Then, you will increase application performance by finding a better distribution of computations among threads.
Application Profiling
Visual Studio integrates several profiling methods that offer a tradeoff between the accuracy of measurements and the overhead on performance of the profiled application:
CPU Sampling: With a predefined sampling rate, the execution points of an application are captured by the profiling framework. Based on a statistical analysis of these samples, the framework extrapolates the execution time of each function of the profiled application. The overhead of this profiling method is very low, but the accuracy of the results is limited, especially for applications and functions with a short runtime.
Instrumentation: During the build process of the application, extra code is added to the application to trigger accurate measurements of application performance. The overhead of this profiling method is high, but accuracy of the result is very high. This is the method you will use in this lab.
Follow these steps to enable application profiling in the Release configuration:
18. In the Solution Explorer of Visual Studio, right-click on the pthreadlab project, and select Properties. Select the Release configuration and set the following properties:
In Configuration Properties > C/C++ > General, set Debug Information Format to Program Database (/Zi).
In Configuration Properties > Linker > Debugging, set Generate Debug Info to Yes (/DEBUG).
In Configuration Properties > Linker > Advanced, set Profile to Yes (/PROFILE).
These configurations will be lost each time the Visual Studio project is regenerated with CMake.
19. In the menu bar, launch Analyze > Performance and Diagnostics. Check that the pthreadlab project is the Analysis Target, and that the Performance Wizard is selected among Available Tools. Then, click Start. In the Performance Wizard, select Instrumentation, then click Next. Again, select only the pthreadlab project and click Next. Finally, check the Launch profiling after the wizard finishes option and click Finish. Stop the application after a few seconds (about 10 s) to complete the profiling.
20. Navigate through the profiling report to see how much time is spent in each function. Analyze this result to find how much time is wasted for slave threads. These results can easily be represented with a diagram.
Asymmetric implementation using double buffering technique
The objective of this section is to change the program implementation to make it possible for slave threads to keep performing computations while the master thread is reading and displaying images.
To allow the slave and master threads to work in parallel, you must first make sure that they work on different data. Otherwise, there is a great chance that the main thread would display half-processed pictures, and the slave threads would process frames partially read from a file. To ensure data integrity, you will implement a double-buffering technique, also called software pipelining.
The principle of double-buffering is to allocate twice an element (variable, array, struct, ...) of your program whose integrity must be preserved. This technique is illustrated in Figure 1.2, where a and b are the two buffers allocated for an element of a program. First, a thread of your program fills a with data. Then, while a second thread of your program begins processing a, the first thread simultaneously fills b with new data. When the processing of a and the filling of b are both completed, the buffers are exchanged: the first thread begins filling a with new values while the second thread starts processing b.
[Figure 1.2: Double-buffering illustration. Thread 0 alternately fills buffer a and buffer b, while Thread 1 processes the buffer filled during the previous step; the two threads synchronize (sync) between successive steps.]
21. By analyzing profiling information, find a distribution of the sobel, read, and display functions among threads that fairly balances the computational load for NBTHREADS equal to the number of cores of your CPU.
22. Using the double-buffering technique, implement the chosen distribution scheme. Analyze the performance of the resulting application.

PW 2
OpenMP
Prerequisites:
Programming in C Language.
Attend the lecture, read and understand the lecture notes.
Read and understand the whole assignment before the practical session.
C code from PW 1.
Objectives (4h):
Parallelize an application on an x86 Central Processing Unit (CPU) with OpenMP.
Parallelize an application on a multicore Digital Signal Processor (DSP) with OpenMP.
Assess the efficiency of OpenMP on both architectures.
Introduction
In this assignment, you will use OpenMP to parallelize an image processing algorithm on two multicore targets: an X86 Central Processing Unit (CPU) and a multicore Digital Signal Processor (DSP) from Texas Instruments. On the multicore X86 target, you will use the different OpenMP directives to parallelize the application, assess their utility, and measure the performance gain brought by each of these primitives. On the multicore DSP target, you will port the code of the X86 application and adapt it when required. Then you will measure and analyze the application performance on this embedded architecture.
2.1 Use OpenMP on an X86 Target
In this section, you will first set up a basic project compiled with OpenMP. Then, you will use this project as a basis to parallelize the sobel application studied in PW1.
2.1.A OpenMP Hello World
The following procedure explains how to set up, build and test a simple Hello World program containing OpenMP directives.
1. Download the openmplab.7z archive from the Moodle website of the course: http://moodle.insa-rennes.fr/course/view.php?id=355.
2. Extract the content of the archive in a dedicated directory.
3. Copy the content of the dat and lib folders from the pthread lab.

4. Run openmplab/CMakeVS2017.bat to launch the CMake tool for the project. This will both verify the OpenMP capability of the installed compiler and generate the Visual Studio project with the adequate compiler flags.
5. Open the generated project openmplab/bin/openmplab.sln with Visual Studio.
6. In the Solution Explorer of Visual Studio, right-click on the openmplab project, and select Set as StartUp Project.
7. Compile and run the application. Check that several threads were automatically generated by OpenMP. Was the number of generated threads predictable?
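For comparison with the program of the archive, a Hello World with a parallel region may look like the sketch below. It is not the file shipped in openmplab.7z; the function name hello_openmp is hypothetical. The _OPENMP guards are an assumption added so that the same file still builds, and runs with a single thread, when OpenMP support is disabled in the compiler.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Prints one line per thread of the team and returns the team size. */
int hello_openmp(void)
{
    int team_size = 1;
#pragma omp parallel
    {
#ifdef _OPENMP
        int id = omp_get_thread_num();
#pragma omp single
        team_size = omp_get_num_threads();   /* written by one thread only */
#else
        int id = 0;
#endif
        printf("Hello World from thread %d\n", id);
    }
    return team_size;
}
```

When no num_threads clause is given, the team size is chosen by the OpenMP runtime; it commonly matches the number of logical cores, and can usually be overridden with the OMP_NUM_THREADS environment variable.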
2.1.B Sobel Parallelization
The starting point for parallelizing the sobel application is the code that was obtained at step 8 of PW1.
OpenMP parallel section
8. Copy all source code files of the sobel application into openmplab/src and openmplab/include. Launch CMake, then build and run the application to check that it is complete.
9. In the C file containing the code of the sobel function, enclose the function code in a #pragma omp parallel section with no additional clause. Leave the declaration of loop iteration variables outside of the parallel section. Compile and run the application in both Debug and Release configurations. Note your observations on the program behavior and performance.
10. Add clauses to the #pragma omp parallel directive to correct the program behavior.
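The effect of a data-scope clause on loop counters declared outside the parallel region can be illustrated as follows. This is a sketch under assumptions, not the expected answer to step 10: a pixel inversion stands in for the filter, the function name is hypothetical, and the rows are interleaved by hand among the threads so that two threads never write the same pixel.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

void invert_rows_parallel(int width, int height,
                          const unsigned char *in, unsigned char *out)
{
    int row, col;   /* declared outside the parallel region, as in step 9 */

    /* Without private(row, col), all threads would share, and corrupt,
     * the same two loop counters. */
#pragma omp parallel private(row, col)
    {
        int id = 0, team = 1;
#ifdef _OPENMP
        id = omp_get_thread_num();
        team = omp_get_num_threads();
#endif
        /* Interleaved rows: thread id handles rows id, id + team, ... */
        for (row = id; row < height; row += team)
            for (col = 0; col < width; col++)
                out[row * width + col] =
                    (unsigned char)(255 - in[row * width + col]);
    }
}
```

Variables declared inside the region (id and team here) are private to each thread by construction; only the variables declared before the pragma need an explicit clause.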
Who does what? Let's increase code verbosity!
11. Define a VERBOSE preprocessor variable. Using this variable, add printfs that can be deactivated to your code to identify which thread produces which part of the output. What is currently happening? To prevent the console from being flooded with printfs, you can temporarily reduce the size of the processed frames in yuvRead.h and/or add a call to system("PAUSE") in your code.
12. Add breakpoints to your code, and confirm results from previous steps with the debugger.
OpenMP forloop parallelization
13. Parallelize the sobel loops with the appropriate OpenMP directive, with no optional OpenMP clause. Which loop provides the best performance when parallelized?
14. Using the previously defined VERBOSE preprocessor variable, check again which thread produces which output.
15. Add a clause to the #pragma omp parallel section to change the number of threads created by OpenMP. Study the performance of your application for a variable number of threads.
16. Study the impact on performance of additional clauses and OpenMP directives seen during the lecture. Keep the directives and clauses providing the best performance. How does this performance compare with the pthread implementations from PW1?
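A loop-level parallelization with an explicit number of threads may be written as in the sketch below. It is given as an illustration of the directive syntax only: the stand-in kernel (pixel inversion), the function name, the choice of the outer loop and the schedule(static) clause are assumptions, and the best combination for the sobel function is precisely what steps 13 to 16 ask you to determine.

```c
/* Rows are distributed among nb_threads threads by the OpenMP runtime. */
void invert_parallel_for(int width, int height,
                         const unsigned char *in, unsigned char *out,
                         int nb_threads)
{
    if (nb_threads < 1)
        nb_threads = 1;   /* num_threads requires a positive value */

#pragma omp parallel for num_threads(nb_threads) schedule(static)
    for (int row = 0; row < height; row++)
        for (int col = 0; col < width; col++)
            out[row * width + col] =
                (unsigned char)(255 - in[row * width + col]);
}
```

The loop counters are declared inside the for statements, which makes them private to each thread without any data-scope clause.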

2.2 Use OpenMP on a C6678 Target
An important goal of OpenMP is to provide a portable model for describing parallel applications. Hence, porting an existing OpenMP code to an architecture supporting OpenMP is supposedly as simple as rebuilding the application for the new target. In this section, you will first set up a basic OpenMP project for the C6678 multicore DSP from Texas Instruments. Then, you will compile, run and assess the performance of the sobel code parallelized with OpenMP on this architecture (1).
2.2.A Build Environment Setup
The following procedure explains how to set up, build and test a simple Hello World program containing OpenMP directives on the C6678 platform. In this procedure, it is assumed that Code Composer Studio v6 (CCSv6) is already installed with all the plugins necessary to compile and run code with OpenMP on the C6678 target (2). This assumption holds for all computers in the PW rooms.
Building OpenMP HelloWorld in CCSv6
17. In CCSv6, open the CCSv6 Project creation wizard by clicking on Menu Bar > File > New > CCS Project. In the opened wizard, set the following configurations on the first page:
Target: C66xx Multicore DSP, TMS320C6678
Connection: Texas Instruments XDS100v1 USB Debug Probe
Project name: sobelopenmp
Location: Select the location you want.
Compiler version: TI v8.1.0
Output Format (Advanced Settings): eabi (ELF)
Project templates and examples: Empty RTSC Project
On the next page:
Product and Repositories: Only select the following products with the specified version:
IPC 1.24,
MCSDK PDK TMS320C6678 1.1.2.6,
OpenMP Runtime 2.x library,
SYS/BIOS 6.33.06.50,
XDC 3.23.04.60
Target: ti.targets.elf.C66
Platform: ti.runtime.openmp.platforms.evm6678
Build-profile: release
18. In the Project Explorer, right-click on the created project, and open the Properties.
In the Build > C6000 Compiler > Advanced Options > Advanced Optimizations tab, check the Enable support for OpenMP 3.0 option.
In the Build > C6000 Linker > File Search Path tab, check the Search libraries in priority order option.
Repeat these operations for the Release configuration.

(1) The SYS/BIOS operating system running on the C6678 target also offers a parallel programming Application Programming Interface (API) based on threads. Unfortunately, this API is not compliant with the pthread standard. For this reason, manual thread programming on the C6678 is not covered in this course.
(2) CCSv6 and OpenMP installation procedure: http://processors.wiki.ti.com/index.php/Porting_OpenMP_2.x_to_KeyStone_1
19. During the build process, CCSv6 uses a configuration file to configure the different libraries that are linked by the application. Copy the files ompconfig.cfg and omphello.c into your project from C:\ti\openmp_dsp_2_01_16_03\packages\examples\hello.
20. HAMMER TIME: build your project.
Load, Run and Debug
21. In the Project Explorer, open sobelopenmp/targetConfigs/TMS320C6678.ccxml. Open the Advanced tab of the editor. In the left part of this tab, select the C66xx_0 element. In the right part of the tab, set the initialization script to the following file: C:\ti\ccsv6\ccs_base\emulation\boards\evmc6678l\gel\evmc6678l.gel.
Repeat this operation for the 7 remaining cores. Save and close the editor.
22. Before connecting the board to your computer and plugging in its power supply, make sure that the board is in No Boot mode. To do so, simply set the switches of the board as listed in the DIP switch settings below (3).
Connect your board, first to the PC (USB), then to the power supply.
23. To launch the debugger: Open the Target Configuration view by clicking on Menu Bar > Windows > Show View > Target Configuration. In the Target Configuration view, right-click on the ccxml file in Projects > sobelopenmp and select Launch Selected Configuration. The CCS Debug perspective should open automatically.
24. To connect the board: In the Debug tab in the top-left corner of the CCS Debug perspective, select the 8 cores of the architecture. Right-click on the selected cores and select Group Cores. Right-click on Group 1 and select Connect Target. This will connect CCSv6 to the 8 cores of the architecture and automatically launch the Global Default Setup script from the GEL file for each core. The connection may fail; in such a case, unplug and reset the board, restart CCSv6 and try again.
25. Open the Load Program wizard by clicking on Menu Bar > Run > Load > Load Program. In the wizard, click on the Browse Project button and select the Debug/sobelopenmp.out binary. Click on OK twice and wait for the completion of the loading process, which may last a few tens of seconds per core.
26. Click on the play button to launch the execution. Check that the application behaves correctly.
(3) More information on boot mode: http://processors.wiki.ti.com/index.php/TMDXEVM6678L_EVM_Hardware_Setup#Boot_Mode_Dip_Switch_Settings

DIP switch settings for the No Boot mode:
Switch     Setting
DIP SW3    off, on, on, on
DIP SW4    on, on, on, on
DIP SW5    on, on, on, on
DIP SW6    on, on, on, on

2.2.B Sobel parallelization
Application specific configurations
In order to build the sobel application, a few extra configurations are needed.
27. Edit the ompconfig.cfg configuration file. Add the following content:
    /* Allows usage of the Timestamp_get32() function */
    var Timestamp = xdc.useModule('xdc.runtime.Timestamp');

    /* Uninitialized memory section for loading the input video */
    Program.sectMap[".myInputVideoMem"] = new Program.SectionSpec();
    Program.sectMap[".myInputVideoMem"].loadSegment = "DDR3";
    Program.sectMap[".myInputVideoMem"].type = "NOINIT";

    /* Support for %f in printfs */
    var System = xdc.useModule('xdc.runtime.System');
    var SysStd = xdc.useModule('xdc.runtime.SysStd');
    System.SupportProxy = SysStd;
    System.extendedFormats = "%f%$S";
Sequential code validation
Before parallelizing the sobel application on the C6678 target, it is important to make sure that the sequential code is working.
28. Copy into your CCSv6 project the C files main.c, sobel.h, and sobel.c from the sobel application that was obtained at step 8 of PW1. Also add the C6678-specific yuvRead.h and yuvRead.c files that can be downloaded from the Moodle website of the course. Remove all includes and function calls related to display features from your code. Build the application and load it on the target.
29. Before running the application, you need to load the video in the memory of the target. Open the map file generated in the sobelopenmp/Debug directory. Find the address in memory where the section .myInputVideoMem is allocated.
30. Pause all executions in the CCS Debug perspective. Select one of the cores in the Debug tab. Open the Load Memory wizard by clicking on Menu Bar > Debug > Tools > Load Memory. In the opened wizard, select the akiyocif.dat file that can be downloaded from the Moodle website. Check the box to use the header information contained in the file and click on Next. On the new page, set the address according to your observation in the .map file. Click on Finish; the load may take a few minutes.
31. Start the application and check the console to see if the code is running. In case of an exception, pause the execution of Core 0, and click on Menu Bar > Tools > RTOS Object View (ROV). In the newly opened view, select sobelopenmp.out > BIOS and open the Scan for errors tab. Information on the encountered exception should be available here. Call a teacher if you need help to understand the issue or need hints to correct it.
32. In order to check that the video was correctly loaded and processed, open the Image Analyzer by clicking on Menu Bar > Tools > Image Analyzer. Pause the execution of Core 0 by putting breakpoints in the main function. Open the Properties view and set the properties as follows, starting with the image format:
Property                     Value
Image format                 YUV
Number of pixels per line    2048
Number of lines              858
Data format                  Planar
Tiled
Resolution                   4:2:0
Y Pixel stride               1
Y mask                       0xFF
Y Line stride                2048
U Pixel stride               1
U mask                       0xFF
U Line stride                1024
V Pixel stride               1
V mask                       0xFF
V Line stride                1024
Alpha Pixel stride           0
Alpha mask                   0x00000000
Alpha Line stride            0
Image source                 Connected Device
YUV Start address            buffer name
Read data as                 8 bit data
Back to the Image tab, click on the Refresh button. You should now see an image from the video you loaded on the board.
33. Compile and run the application in the Release configuration to measure the sequential performance of the application on this target.
Speedup assessment
34. Copy the sobel OpenMP code written for the X86 target. Replace the OpenMP include with the following: #include <ti/runtime/openmp/omp.h>. Compile and run on the C6678 target using the Debug configuration. Use the Image Analyzer to check the produced results.
35. Using the Image Analyzer in the Debug configuration, study the impact of functionally incorrect OpenMP data scope clauses for loop indexes.
36. Study the performance of the application for a variable number of OpenMP threads. Compile and run in the Release configuration. Make some observations.
37. In the Release configuration, study the impact on performance of default, but functionally correct, OpenMP data scope clauses for variables accessed within the parallel OpenMP section.

PW 3
Dataflow programming with PREESM
Prerequisites:
Programming in C Language.
Attend the lecture, read and understand the lecture notes.
Read and understand the whole assignment before the practical session.
Objectives (4h):
Parallelize an application on an x86 CPU with PREESM.
Parallelize an application on a Multicore DSP with PREESM.
Pipeline an application with PREESM.
Assess the efficiency of PREESM on both architectures.
Introduction
In this assignment, you will use the PREESM dataflow programming framework to parallelize an image processing algorithm on two multicore targets: an X86 CPU and a multicore DSP from Texas Instruments.
3.1 Preesm Setup
PREESM is an open-source rapid prototyping framework developed as a set of plugins for the Eclipse Integrated Development Environment (IDE). A prepackaged version of PREESM for Windows is available on srvenseii.educ.insapublicTP201720185EIIPPEM, accessible through a shortcut on the desktop of lab room computers. Simply download and unzip this archive to begin the online tutorials.
Prepackaged versions of PREESM for your personal computer are available at: https://github.com/preesm/preesm/releases.
3.2 Online tutorials
The three following tutorials should be completed during the two lab sessions:
1. https://preesm.github.io/tutos/parasobel/
2. https://preesm.github.io/tutos/mpsoccodegen/
3. https://preesm.github.io/tutos/softwarepipeline/

Please note that CMake is already installed on the computers in the lab session rooms. Do not follow the introduction tutorial during lab sessions, as its main purpose is to guide users through the PREESM installation process. When setting up C projects, copy the libpthread2.10.0 and libSDL2.x.y from lab 1.

PROJECT
Project Assignment
1 Objective
The objective of this project is to parallelize a deep learning application on the Texas Instruments 8-core C6678 EVM. To achieve this purpose, you will have to use the OpenMP parallelization technique studied during the laboratories of this course.
2 SqueezeNet Deep Neural Network
2.A Tell me what I am!
Classification of content contained in images is often necessary when trying to automate certain behavior of programs. Though it may not appear to be necessary to detect that you are holding a banana in your hand, it is useful to be sure not to mistake it for a gun [1]. Autonomous vehicles are another example of use cases for image classification techniques, as they rely on camera sensors to understand their environment. Thus, it is important to accurately detect and classify people, traffic signs, etc.
For less than a decade, deep convolutional neural networks have proven to be very well suited for this particular task. In this project, we will consider the SqueezeNet architecture. Figure .1 shows the detailed architecture of this neural network. The SqueezeNet neural network is composed of multiple layers (e.g. conv1 layer, maxpool1 layer, etc.) with different types of behavior. For more details about the behavior of each layer, please refer to the links given hereafter.
https://arxiv.org/abs/1602.07360
https://en.wikipedia.org/wiki/ImageNet
https://www.kaggle.com/c/imagenet-object-localization-challenge
http://machinelearninguru.com/computer_vision/basics/convolution/convolution_layer.html
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
2.B Given SqueezeNet Deep Neural Network Implementation
A sequential reference implementation of the SqueezeNet deep neural network that you will parallelize during this project is available on the Moodle website of this course: http://moodle.insa-rennes.fr/course/view.php?id=355.
[1] Especially if you are living in a certain country.

Figure .1: SqueezeNet network architecture.

In this project, we are not interested in the training part of the network; we use pre-trained weights and biases. Thus, you will only parallelize the inference part of the network. The SqueezeNet network has been trained on images with a fixed resolution of 224 by 224 pixels. You can however use any image size you wish, as it will be resized automatically to fit the required resolution.
The implemented algorithm performs the following steps for a given input image:
1. Resize the input image I to the default resolution of 224 x 224 needed by the SqueezeNet network.
2. Load the pre-trained weights and biases.
3. Do the inference, i.e. perform all the convolutions in all layers of the network.
4. Decode the prediction result.
5. Create an MD5 hash of the prediction result. This hash will be used to make sure that the result produced by the algorithm is not altered when parallelizing the application.
All steps marked with a symbol are good candidates for parallelization.
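To illustrate how the inference step could be parallelized, here is a minimal sketch of one convolution layer with an OpenMP parallel loop over the output channels. The function name conv2d_omp, the memory layout (input[c][y][x], weights[k][c][r][s], output[k][y][x]) and the absence of stride and padding are assumptions made for this example; the reference implementation provided for the project may be organized differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical convolution layer (stride 1, no padding).
 * Assumed contiguous layouts: input[c][y][x], weights[k][c][r][s],
 * output[k][y][x]. Not the project's reference code. */
void conv2d_omp(const float *input, const float *weights, const float *bias,
                float *output, int in_c, int in_h, int in_w,
                int out_c, int ksize, int nb_threads)
{
    const int out_h = in_h - ksize + 1;
    const int out_w = in_w - ksize + 1;
    int k;

    assert(out_h > 0 && out_w > 0 && nb_threads >= 1);

    /* Each output channel k is computed by exactly one thread, so no two
     * threads ever write to the same output element. */
    #pragma omp parallel for num_threads(nb_threads)
    for (k = 0; k < out_c; k++) {
        for (int y = 0; y < out_h; y++) {
            for (int x = 0; x < out_w; x++) {
                float acc = bias[k];
                /* Accumulation order is the same as in a sequential run. */
                for (int c = 0; c < in_c; c++)
                    for (int r = 0; r < ksize; r++)
                        for (int s = 0; s < ksize; s++)
                            acc += input[((size_t)c * in_h + y + r) * in_w + x + s]
                                 * weights[(((size_t)k * in_c + c) * ksize + r) * ksize + s];
                output[((size_t)k * out_h + y) * out_w + x] = acc;
            }
        }
    }
}
```

Because every output element is accumulated by a single thread in the sequential order, this decomposition produces the same floating-point values as a sequential run of the same loop nest, which matters when the result is validated with an MD5 hash. Splitting the accumulation itself across threads (for example with a reduction) could change the rounding and therefore the hash.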
3 Technical Assignment
In this project, the objective is to use one of the parallel programming techniques studied in this course to parallelize the given deep neural network application on a desktop computer, and to port it to the Texas Instruments 8-core C6678 EVM. During the evaluation, the demonstrated prototype has to be implemented using OpenMP on X86 and on the C6678.
Evaluation criteria for the demonstration are the following:
1. Proper functioning of the parallelized application on X86, with a variable degree of parallelism.
2. Proper functioning of the sequential application on the C6678.
3. Proper functioning of the parallelized application on the C6678, with a variable degree of parallelism.
4. Maximum speedup achieved compared to the sequential version on X86.
5. Maximum speedup achieved compared to the sequential version on C6678.
4 Organization and deadlines
To complete this project, each group of 4 students should complete the following assignments:
Demonstration: All evaluation criteria should be demonstrated to the pedagogical team during the last project lab session. If not demonstrated spontaneously before, the demonstration will be made at the teachers' request during the last half-hour of the last project lab session. No demonstration will be possible after the scheduled end time of the last lab session.
Early spontaneous demonstration: Validation of criteria 1 or 2 is possible, at the students' request, before the last project lab session. Early validation will grant a bonus of 1 point for each criterion.

Best Performance Awards: To spice up the project, the groups achieving the best performance on X86 and on the C6678 will be awarded 1 extra point each. Performance in frames per second (fps) will be measured on a PC from the practical session room, on different images, and only if the produced result has the correct MD5. As we are working with a fixed input image, the fps will be evaluated by repeating the inference a given number of times and averaging the computation time.
Report: Before February 2nd at 11:55 pm (23h55), you must upload on Moodle:
A project report of 5 pages maximum (appendices included). This report, in English, should explain and justify the design choices made to parallelize the application and optimize its performance. The report should also assess the experimental speedup obtained when parallelizing the application on a variable number of cores. Any other information showing the work achieved for this project should also be included in the report.
A compressed archive of your solution. This archive should contain all files and projects necessary to open, compile and run your solution on Visual Studio 2013 or Code::Blocks, and on Code Composer Studio. Code organization, comments and clarity will, as always, be taken into account in the grading.
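The fps measurement and the speedup assessment mentioned above can be sketched as follows for a desktop target. All names (wall_seconds, mean_seconds, dummy_inference, etc.) are hypothetical, and dummy_inference is only a placeholder standing in for one inference. The sketch relies on the C11 function timespec_get for wall-clock time; this is an assumption about the desktop toolchain, and the C6678 would need a target-specific timer instead.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. CPU time (clock())
 * is avoided on purpose: it may add up the time of all threads. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Mean duration of one call to run_once(), averaged over `repeats` calls. */
double mean_seconds(void (*run_once)(void), int repeats)
{
    assert(run_once != NULL && repeats > 0);
    double start = wall_seconds();
    for (int n = 0; n < repeats; n++)
        run_once();
    return (wall_seconds() - start) / repeats;
}

/* Frames per second for a given mean time per inference (in seconds). */
double frames_per_second(double mean_s) { return 1.0 / mean_s; }

/* Speedup of a parallel run relative to the sequential run. */
double speedup(double t_seq, double t_par) { return t_seq / t_par; }

/* Placeholder workload standing in for one inference (hypothetical). */
void dummy_inference(void)
{
    volatile double sink = 0.0;
    for (int i = 0; i < 100000; i++)
        sink += i * 0.5;
}
```

A typical use would be to call mean_seconds once with the sequential inference and once per thread count with the parallel one, then report speedup(t_seq, t_par) for each configuration, after checking that the MD5 of the result is unchanged.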
5 Warnings and Recommendations
A few things to keep in mind throughout this project:

The deadline for the report (February 2nd at 11:55 pm) is a hard deadline and the submission server will lock itself automatically at this time! Any report and/or code returned after this deadline, or any corrupted archive, will lead to a 2-point penalty on the final grade. Do not wait until the last minute to submit your work.
The C6678 boards will only be accessible during the scheduled 8 hours of the project. Since the scheduled 8 hours will most likely not be sufficient to complete the whole project, autonomous work is expected from you. This work is to be carried out in groups of at most 4 persons.
To access the computers of the lab room (with Code Composer Studio, but without the C6678 boards) outside the scheduled lab sessions, contact Samir Keddar in advance at Samir.Keddar@insa-rennes.fr.