{"id":8716,"date":"2024-04-17T08:30:34","date_gmt":"2024-04-17T15:30:34","guid":{"rendered":"https:\/\/visualgdb.com\/w\/?p=8716"},"modified":"2024-04-17T08:30:34","modified_gmt":"2024-04-17T15:30:34","slug":"optimizing-stm32-usb-high-speed-performance-with-tracing","status":"publish","type":"post","link":"https:\/\/visualgdb.com\/tutorials\/tracing\/embedded\/usb\/optimizing\/","title":{"rendered":"Optimizing STM32 USB High-Speed Performance with Tracing"},"content":{"rendered":"<p>This tutorial shows how to use several optimization techniques to get the maximum performance out of the high-speed USB interface on the STM32F7-Discovery board. We will show how raise the throughput in a single-direction mode more than twice the baseline, and how to use the linker-time optimization (LTO) to considerably drop the CPU utilization by the USB stack.<\/p>\n<p>We will use VisualGDB&#8217;s <a href=\"https:\/\/visualgdb.com\/documentation\/tracing\/\">software tracing<\/a> to get the cycle-accurate timings between different parts of the USB communication and the <a href=\"https:\/\/visualgdb.com\/documentation\/embedded\/memoryexplorer\/\">Memory Explorer<\/a> to understand the effects of the linker-time optimization on the generated code.<\/p>\n<p>If you are just looking for the final numbers, scroll back to the end of the tutorial or click <a href=\"#conclusion\">here<\/a>.<\/p>\n<p>Before you begin, install VisualGDB 6.0 or later.<\/p>\n<ol>\n<li>Start Visual Studio and Open the VisualGDB Embedded Project Wizard:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/01-emb.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8717\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/01-emb.png\" alt=\"\" width=\"1014\" height=\"675\" \/><\/a><\/li>\n<li>Enter the name and location for your project:<br \/>\n<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/02-prjname.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8718\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/02-prjname.png\" alt=\"\" width=\"1014\" height=\"675\" \/><\/a><\/li>\n<li>Proceed with creating a new embedded application using Advanced CMake: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/03-cmake-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8719\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/03-cmake-1.png\" alt=\"\" width=\"886\" height=\"706\" \/><\/a><\/li>\n<li>Choose the latest ARM toolchain and select your device from the list. In this tutorial we will use the STM32F7-Discovery board that uses the STM32F746NG device: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/04-device-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8720\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/04-device-1.png\" alt=\"\" width=\"886\" height=\"706\" \/><\/a><\/li>\n<li>Select the <strong>USB Communications Device<\/strong> sample on the next page and click &#8220;Next&#8221;:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/05-commdev.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8721\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/05-commdev.png\" alt=\"\" width=\"886\" height=\"706\" \/><\/a><\/li>\n<li>Connect your on-board ST-Link debugger to the USB port and make sure VisualGDB recognizes it. Then ensure you have the software tracing and FLASH memory patching enabled and click &#8220;Finish&#8221; to generate the project: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/06-stlink.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8722\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/06-stlink.png\" alt=\"\" width=\"914\" height=\"706\" \/><\/a><\/li>\n<li>VisualGDB will create a sample project, however it will use the default clock settings that may not match your board, so the project may not run correctly:<br \/>\n<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/07-clock.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8723\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/07-clock.png\" alt=\"\" width=\"1194\" height=\"761\" \/><\/a><\/li>\n<li>To fix this, create another project based on the <strong>STM32CubeMX Samples -&gt; USB_Device -&gt; MSC_Standalone -&gt; [board name]-HS<\/strong>:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/08-msc.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8724\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/08-msc.png\" alt=\"\" width=\"886\" height=\"693\" \/><\/a><\/li>\n<li>Then, copy the following from the cloned MSC sample into the main project:<br \/>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 50%;\">Entity<\/td>\n<td style=\"width: 50%;\">Functionality<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\"><strong>SystemClock_Config()<\/strong> function<\/td>\n<td style=\"width: 50%;\">Configures PLL to match the on-board crystal<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">Entire <strong>usbd_conf.c <\/strong>file<strong><br \/>\n<\/strong><\/td>\n<td style=\"width: 50%;\">Handles low-level aspects of the USB configuration<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">Entire <strong>system_stm32f7xx.c<\/strong> file<\/td>\n<td style=\"width: 50%;\">Enables optimal FLASH and power settings<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">Entire <strong>stm32f7xx_hal_conf.h<\/strong> file<\/td>\n<td style=\"width: 50%;\">Specifies correct on-board crystal frequency and other settings<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>If usbd_conf.c doesn&#8217;t build due to missing main.h, replace #include &lt;main.h&gt; with the following code:<\/p>\n<pre>#include &lt;usbd_core.h&gt;\r\n#include &lt;usbd_cdc.h&gt;<\/pre>\n<pre><a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/09-conf.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8725\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/09-conf.png\" alt=\"\" width=\"1194\" height=\"761\" \/><\/a>\r\n<\/pre>\n<p>You can see the exact changes needed for the STM32F7Discovery board in <a href=\"https:\/\/github.com\/sysprogs\/tutorials\/commit\/56e3f1c653815d4da6527ac66c04ffc77eb0da3b\">this commit<\/a>.<\/li>\n<li>Finally, ensure that the USB Device Speed is in <strong>VisualGDB Project Properties -&gt; Embedded Frameworks<\/strong> is set to USB 2.0 High Speed:<br \/>\n<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/10-hs.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8726\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/10-hs.png\" alt=\"\" width=\"1194\" height=\"761\" \/><\/a><\/li>\n<li>Press F5 to start debugging and ensure the USB-HS port on the board is connected to your computer. A new USB-based COM port will appear, echoing any characters typed into it:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/11-result.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8727\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/11-result.png\" alt=\"\" width=\"674\" height=\"440\" \/><\/a><\/li>\n<li>Now we will modify the <strong>main()<\/strong> function to unconditionally send a stream of data via the USB port. Replace the code after the call to <strong>USBD_Start()<\/strong> with this:\n<pre>static char data[512];\r\nchar byte;\r\nwhile (!USBD_Device.pClassData)\r\n    asm(\"nop\");\r\nfor (;;)\r\n    VCP_write(&amp;data, sizeof(data));<\/pre>\n<p>Then, create a basic C# program that will continuously read the data and show the throughput:<\/p>\n<pre>static void Main(string[] args)\r\n{\r\n\tSerialPort port = new SerialPort(\"COM20\");\r\n\tport.Open();\r\n\r\n\tvar buf = new byte[65536];\r\n\tfor (int i = 0; i &lt; buf.Length; i++)\r\n\t\tbuf[i] = (byte)i;\r\n\r\n\tlong bytesRead = 0;\r\n\tDateTime start = DateTime.Now;\r\n\r\n\tfor (; ; )\r\n\t{\r\n\t\tint done = port.Read(buf, 0, buf.Length);\r\n\t\tbytesRead += done;\r\n\t\tdouble msec = (DateTime.Now - start).TotalMilliseconds;\r\n\t\tdouble bytesPerSecond = (bytesRead * 1000) \/ msec;\r\n\t\tdouble kbPerSecond = bytesPerSecond \/ 1024;\r\n\t\tConsole.Write($\"\\rRead: {bytesRead \/ 1024}KB; \" +\r\n\t\t\t$\"Average speed: {kbPerSecond:f0} KB\/sec\");\r\n\t}\r\n}<\/pre>\n<\/li>\n<li>Replace <strong>COM20<\/strong> with the COM port number on your machine and run the benchmark program to measure the speed:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/12-benchmark.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8730\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/12-benchmark.png\" alt=\"\" width=\"829\" height=\"128\" \/><\/a>In our experiments, the unoptimized (Debug) build yielded about 4 MB\/sec.<\/li>\n<li>Switch the solution configuration to <strong>RelWithDebInfo <\/strong>and try measuring the speed again:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/13-rel.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8731\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/13-rel.png\" alt=\"\" width=\"1194\" height=\"761\" \/><\/a>In our experiments, this raised the USB throughput to about 7.7 MB\/sec.<\/li>\n<li>Note that the default STM32 USB configuration does not use DMA for the USB high-speed endpoint. You can override by setting <strong>dma_enable<\/strong> to 1 in the <strong>USBD_LL_Init()<\/strong> function in <strong>usbd_conf.c<\/strong>:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/14-dma.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8732\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/14-dma.png\" alt=\"\" width=\"1194\" height=\"761\" \/><\/a>This brings the throughput up to 9.1 MB\/s.<\/li>\n<li>Now we will use the VisualGDB&#8217;s <a href=\"https:\/\/visualgdb.com\/documentation\/tracing\/\">Software Tracing<\/a> to understand the exact timing of the USB communication and find bottlenecks. Sending large amounts of data via the bulk endpoint would normally involve a loop like this:\n<ul>\n<li>Main code calls <strong>VCP_write() <\/strong>that starts waiting for the USB endpoint to become ready<\/li>\n<li><strong>VCP_write()<\/strong> asks the STM32 HAL library to send the packet, which computes various headers, and ultimately starts a DMA transaction<\/li>\n<li>Once the transfer is complete, the USB hardware raises an interrupt<\/li>\n<li>The USB interrupt handler checks various conditions, marks the USB endpoint as ready, and the cycle repeats<\/li>\n<\/ul>\n<p>With tracepoints we can measure the exact time between different parts of this loop, and see how much optimization potential there is. Use Code Explorer in the <strong>Outgoing Calls<\/strong> view to see all functions called from <strong>VCP_write():<\/strong><br \/>\n<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/15-tree.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8733\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/15-tree.png\" alt=\"\" width=\"1194\" height=\"766\" \/><\/a><\/li>\n<li><strong>USB_EPStartXfer()<\/strong> looks like the one that would actually start the transfer. Go to it, locate the line writing <strong>USBx_INEP(epnum)-&gt;DIEPDMA<\/strong> and set a tracepoint there recording the <strong>Cortex-M cycle counter<\/strong>:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/16-dma.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8734\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/16-dma.png\" alt=\"\" width=\"1194\" height=\"766\" \/><\/a><\/li>\n<li>Add another tracepoint in the beginning of the <strong>OTG_HS_IRQHandler()<\/strong> function. You can use the &#8220;new tracepoint&#8221; button in the Tracepoints window to quickly locate it by name: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/17-irqhandler.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8735\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/17-irqhandler.png\" alt=\"\" width=\"1194\" height=\"766\" \/><\/a><\/li>\n<li>Note how <strong>VCP_write()<\/strong> waits for <strong>TxState<\/strong> to be cleared before sending the next packet. Use the <strong>Find All References<\/strong> command to find the code in <strong>USBD_CDC_DataIn()<\/strong> that clears it, and set another tracepoint there: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/18-handled.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8736\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/18-handled.png\" alt=\"\" width=\"1194\" height=\"766\" \/><\/a><\/li>\n<li>Now you can start debugging, run the benchmark tool and see how the tracepoints are generating events:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/19-data-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8737\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/19-data-1.png\" alt=\"\" width=\"1194\" height=\"763\" \/><\/a>As the USB HS packets are very fast, it will produce trace events faster than the debugger can read, so most events will be dropped. In this example it is not a problem, since we only need a small sample of events to check the timing, but you could change this behavior via <strong>VisualGDB Project Properties -&gt; Software Tracing<\/strong> to capture all events at the expense of slowing down the program.<\/li>\n<li>Stop the debugging session. VisualGDB will create a trace report file containing all recorded events. Click &#8220;Replay trace&#8221; to start viewing it: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/20-replay.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8738\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/20-replay.png\" alt=\"\" width=\"1194\" height=\"764\" \/><\/a><\/li>\n<li>Locate any of the events in the <strong>Tracepoints<\/strong> view, right-click on the performance counter in the <strong>Trace Data<\/strong> view and select &#8220;Plot in a new graph view&#8221;:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/21-plot.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8739\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/21-plot.png\" alt=\"\" width=\"1194\" height=\"762\" \/><\/a>You could also manually create a new graph, or table view and just drag the event there to begin plotting it.<\/li>\n<li>Right-click on the &#8220;Cortex-M Cycle Counter&#8221; line and select &#8220;add events from similar tracepoints&#8221;. VisualGDB will show all other tracepoints that capture the cycle counter: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/22-others.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8740\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/22-others.png\" alt=\"\" width=\"1194\" height=\"764\" \/><\/a><\/li>\n<li>Switch the view from <strong>Graph<\/strong> to <strong>Table<\/strong> and then convert the cycle counter column to &#8220;time&#8221;: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/23-data.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8741\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/23-data.png\" alt=\"\" width=\"1194\" height=\"763\" \/><\/a><\/li>\n<li>Enter the frequency of the system clock so that VisualGDB could translate the cycle counts to real time (you can find it out by checking the <strong>SystemCoreClock<\/strong> variable):<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/24-216m.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8742\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/24-216m.png\" alt=\"\" width=\"1194\" height=\"763\" \/><\/a><\/li>\n<li>Now you will see the exact timing of various parts of the USB loop: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/25-perf.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8743\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/25-perf.png\" alt=\"\" width=\"1194\" height=\"762\" \/><\/a><\/li>\n<li>Note that we are getting two <strong>OTG_HS_IRQHandler()<\/strong> calls per packet. This happens because the STM32 CDC implementation sends a zero-length packet after each 512-byte packet to indicate that no further packets are coming for now. We don&#8217;t need this while streaming a continuous data set, so we will temporarily disable it by patching the <strong>usbd_cdc.c<\/strong> file (<strong>note that this file is a part of the STM32 BSP and editing will affect all your STM32 projects. If you don&#8217;t want this, convert the project into a stand-alone one before<\/strong>): <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/nozlp.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8754\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/nozlp.png\" alt=\"\" width=\"1194\" height=\"766\" \/><\/a><\/li>\n<li>Now throughput increased to <strong>15.2MB\/s<\/strong> and the timing looks like this:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/timing.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8755\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/timing.png\" alt=\"\" width=\"1194\" height=\"766\" \/><\/a><\/li>\n<li>The tracepoint itself takes about <strong>512<\/strong> clock cycles, or <strong>2.4 uS<\/strong> (you can find it out by tracing a few sequential <strong>asm(&#8220;nop&#8221;)<\/strong> lines), so subtracting the tracepoint overhead, the timing looks like this:<br \/>\n<table style=\"border-collapse: collapse; width: 100%; height: 72px;\">\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"width: 33.3333%; height: 24px;\">Operation<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">Time<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">Fraction of total period (59.7 uS)<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 33.3333%; height: 24px;\">Start of DMA -&gt; interrupt<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">47.6 uS<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">80%<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 33.3333%; height: 24px;\">Interrupt -&gt; DataIn reset<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">4.6 uS<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">8%<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 33.3333%;\">DataIn reset -&gt; next DMA start<\/td>\n<td style=\"width: 33.3333%;\">7.5 uS<\/td>\n<td style=\"width: 33.3333%;\">12%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note that we are looking at just one sample. In this example (total period of 59.7 uS per 512-byte packet) the throughput would be 16750K * 512 = ~8.5MB\/sec, which is way smaller than the recorded <strong>15.2MB\/s<\/strong>. In order to get really accurate timing between subsequent events, you would need to average the numbers between 100+ periods:<\/p>\n<ol>\n<li>Switch the column from <strong>Time<\/strong> to <strong>Value<\/strong> again<\/li>\n<li>Export the table to a CSV file<\/li>\n<li>Open it in Excel and add a formula for computing the difference between events of interest (e.g. compare every 3rd row to the one 3 rows above)<\/li>\n<li>Filter out too large values (when the events were truncated)<\/li>\n<li>Compute the average difference over a large enough span of rows<\/li>\n<\/ol>\n<p>In this example, averaging the period over 126 rows gives 7281 cycles without overhead = 29666 packets\/second = <strong>15.19MB\/s<\/strong>, that closely matches the measured throughput.<\/p>\n<p>We can also compute the absolute maximum theoretical throughput if the software overhead was completely non-existent by measuring the average time between <strong>DMA Start<\/strong> event and the subsequent <strong>OTG_HS_IRQHandler()<\/strong>, which averages to about 4639 cycles, or <strong>23.8MB\/sec<\/strong>.<\/li>\n<li>Based on the timing from a single period, the device spends about 80% of the time sending the actual data, and the remaining 20% going through various functions in the USB framework to update various states and decide to send another packet. On average, the software delay builds up to about <strong>37%<\/strong> of the time (<strong>1000+1671<\/strong> out of\u00a0 <strong>7281<\/strong> clock cycles per period). If you open the Embedded Memory Explorer in the Offline Disassembly mode, and look into any of the STM32 USB handler functions, you can clearly see why:<br \/>\n<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/26-call.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8744\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/26-call.png\" alt=\"\" width=\"1194\" height=\"763\" \/><\/a><\/li>\n<li>A significant portion of the function code involves calls to other functions and saving\/restoring various registers:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/27-prologue.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8745\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/27-prologue.png\" alt=\"\" width=\"1194\" height=\"763\" \/><\/a>Let&#8217;s look at an example how this could affect the performance:\n<ul>\n<li>Assume a function called <strong>ComputeDistance()<\/strong> calls functions called <strong>subtract()<\/strong> and <strong>sqrt()<\/strong>. At the time <strong>ComputeDistance()<\/strong> is compiled, it doesn&#8217;t really know anything about the called functions, so it has to assume they follow the standard ARM calling convention. I.e. they can destroy the contents of the volatile registers (<strong>r0-r3<\/strong>) and must preserve the value of non-volatile registers.<\/li>\n<li>Even if <strong>subtract() <\/strong>is very simple and doesn&#8217;t actually touch most of the volatile registers, <strong>ComputeDistance()<\/strong> must copy them elsewhere before calling it, if it needs the values stored there.<\/li>\n<li>Similarly, <strong>sqrt()<\/strong> doesn&#8217;t know that <strong>ComputeDistance()<\/strong> doesn&#8217;t care about the volatile registers at the point it calls <strong>sqrt(), <\/strong>so it must save them into the memory.<\/li>\n<\/ul>\n<p>If the functions were declared in the same source file, and the compiler chose to inline <strong>subtract()<\/strong> and <strong>sqrt()<\/strong>, this would not be a problem. However, the ST&#8217;s USB functions are split between multiple source files, and inlining doesn&#8217;t work for them.<\/li>\n<li>LTO (link-time optimization) fixes that limitation. It delays the actual code generation until the very last stage (linking) and lets the compiler inline functions between different source files. Add &#8220;-flto;-O3&#8221; to the additional CFLAGS of the BSP:<br \/>\n<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/28-lto.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8746\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/28-lto.png\" alt=\"\" width=\"1194\" height=\"761\" \/><\/a>This will automatically inject a <strong>bsp_compile_flags()<\/strong> statement into <strong>CMakeLists.txt<\/strong>.<\/li>\n<li>LTO will reduce the debuggability of the code (many functions won&#8217;t be steppable anymore), so it&#8217;s wise to only enable it on release builds. You can do it by adding a condition to <strong>CMakeLists.txt<\/strong> as shown below:<br \/>\n<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/29-relonly.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8747\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/29-relonly.png\" alt=\"\" width=\"1194\" height=\"761\" \/><\/a><\/li>\n<li>Build the project and look into the Embedded Memory Explorer. See how most of the interrupt-handling logic got inlined into <strong>OTG_HS_IRQHandler()<\/strong>, massively increasing its size:\u00a0 <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/30-handler.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8748\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/30-handler.png\" alt=\"\" width=\"1194\" height=\"763\" \/><\/a><\/li>\n<li>If you look at the disassembly, you will see much fewer (if any) calls to other functions:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/31-inlined.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8749\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/31-inlined.png\" alt=\"\" width=\"1194\" height=\"763\" \/><\/a><\/li>\n<li>Run another tracing session to measure the performance:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/32-perf-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8757\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/32-perf-1.png\" alt=\"\" width=\"1194\" height=\"766\" \/><\/a>Note how the time between the <strong>IRQ handler<\/strong> and <strong>DataIn Handled<\/strong> reduced considerably, as well as the time between <strong>DataIn Handler<\/strong> and <strong>DMA Start. <\/strong>The average time of the operations in clock cycles, excluding the tracepoint overhead, are shown below:<br \/>\n<table style=\"border-collapse: collapse; width: 100%; height: 72px;\">\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"width: 33.3333%; height: 24px;\">Operation<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">Without LTO<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">With LTO<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 33.3333%; height: 24px;\">IRQ handler to DataIn processing<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">1001<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">540 (-46%)<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 33.3333%; height: 24px;\">DataIn processing to start of DMA transfer<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">1650<\/td>\n<td style=\"width: 33.3333%; height: 24px;\">856(-48%)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Inlining the USB interrupt handling logic reduced the overall software overhead almost twice.<\/li>\n<li>Once you have done gathering the tracing data, you can export the tracepoints to a reusable set using a context menu command or a button the toolbar:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/33-export.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8751\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/33-export.png\" alt=\"\" width=\"1194\" height=\"766\" \/><\/a><\/li>\n<li>Enter the name of the set, save it on the solution scope, and ensure the &#8220;delete tracepoints&#8221; checkbox is checked: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/34-name.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8752\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/34-name.png\" alt=\"\" width=\"786\" height=\"367\" \/><\/a><\/li>\n<li>Now the tracepoint list is empty again, so you won&#8217;t lose any performance trying to record every single event: <a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/35-notp.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8753\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/35-notp.png\" alt=\"\" width=\"1194\" height=\"766\" \/><\/a><\/li>\n<li>You can re-apply previously saved tracepoints via the &#8220;import a saved tracepoint set&#8221; button to easily re-run the same experiments on a different version of the code:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/import.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8759\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/import.png\" alt=\"\" width=\"786\" height=\"593\" \/><\/a>Note that although using the LTO reduced the interrupt handling overhead almost twice, it only increased the throughput by about 3% (15.2MB\/s to <strong>15.7 MB\/s<\/strong>), suggesting that the USB hardware has an internal packet queue, so reducing interrupt handling delays further will not bring much performance increase. The good news is that it has considerably reduced the CPU utilization by the USB stack, as we will now demonstrate.<\/li>\n<li>Add the following code to the <strong>usb_cdc_if.c<\/strong> file:\n<pre>static volatile long g_IdleCounter;\r\n\r\nvoid __attribute__((noinline)) benchmark()\r\n{\r\n    g_IdleCounter++;\r\n}<\/pre>\n<p>Then, modify the wait loop in <strong>VCP_write()<\/strong> to repeatedly call <strong>benchmark()<\/strong> while it is waiting for the USB interface to become ready.<\/p>\n<pre>while (pCDC-&gt;TxState) {\r\n    benchmark();\r\n}<\/pre>\n<\/li>\n<li>You can now start another debugging session (without tracepoints), use Live Watch to plot the value of <strong>g_IdleCounter<\/strong> and see how fast it changes by selecting a subrange of it:<a href=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/rate-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-8761\" src=\"https:\/\/visualgdb.com\/w\/wp-content\/uploads\/2024\/02\/rate-1.png\" alt=\"\" width=\"1231\" height=\"793\" \/><\/a>Live Watch works by sampling the memory in the background and does not introduce any overhead, so the number of times <strong>benchmark()<\/strong> got called is proportional to the amount of meaningful work the CPU could otherwise do. Marking <strong>benchmark()<\/strong> with <strong>__attribute__((noinline))<\/strong> ensures that it won&#8217;t get affected by LTO.<br \/>\nIn our experiments, the rate without any USB load was 3.7M iterations\/second, the rate without LTO was 2.46M\/sec (CPU utilization of 35%), and enabling LTO raised the rate to 3.1M\/second (17% utilization).<\/li>\n<\/ol>\n<h3><a id=\"conclusion\"><\/a>Conclusion<\/h3>\n<p>The table below summarizes different optimization techniques applied to the STM32 USB driver on STM32F7-Discovery and their effect on the throughput and CPU utilization with 512-byte packets:<\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 33.3333%;\">Variant<\/td>\n<td style=\"width: 33.3333%;\">Throughput<\/td>\n<td style=\"width: 33.3333%;\">CPU utilization<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 33.3333%;\">Debug build<\/td>\n<td style=\"width: 33.3333%;\">4.2 MB\/s<\/td>\n<td style=\"width: 33.3333%;\">Not measured<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 33.3333%;\">Release build<\/td>\n<td style=\"width: 33.3333%;\">7.7 MB\/s<\/td>\n<td style=\"width: 33.3333%;\">60%<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 33.3333%;\">Enabled DMA<\/td>\n<td style=\"width: 33.3333%;\">9.1 MB\/s (+18%)<\/td>\n<td style=\"width: 33.3333%;\">47%<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 33.3333%;\">Disabled ZLPs<\/td>\n<td style=\"width: 33.3333%;\">15.2 MB\/s (+67%)<\/td>\n<td style=\"width: 33.3333%;\">35%<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 33.3333%;\">Enabled LTO<\/td>\n<td style=\"width: 33.3333%;\">15.7 MB\/s (+3%)<\/td>\n<td style=\"width: 33.3333%;\">17%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n","protected":false},"excerpt":{"rendered":"<p>This tutorial shows how to use several optimization techniques to get the maximum performance out of the high-speed USB interface<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[197,247],"tags":[248,204,71],"_links":{"self":[{"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/posts\/8716"}],"collection":[{"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/comments?post=8716"}],"version-history":[{"count":6,"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/posts\/8716\/revisions"}],"predecessor-version":[{"id":8763,"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/posts\/8716\/revisions\/8763"}],"wp:attachment":[{"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/media?parent=8716"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/categories?post=8716"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/visualgdb.com\/w\/wp-json\/wp\/v2\/tags?post=8716"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}