Profiling Linux C++ Code with Visual Studio

This tutorial shows how to profile C++ code using Visual Studio, valgrind and VisualGDB. We will show how to import the JSON for Modern C++ parser (as of 31 October 2017) into a Visual Studio project, build it under Linux, run a benchmark, quickly identify the code consuming most of the time and optimize it. We will increase the performance of the release build by over 1.5x by locating and eliminating a few bottlenecks.

The techniques shown in this tutorial can work with any other C/C++ code as well; we are using the JSON parser to simply demonstrate the functionality on a real-world project rather than on a synthetic example.

Before you begin, install Visual Studio and VisualGDB 5.3 or later. Also prepare a Linux machine that will be used to build and profile the JSON benchmark.

  1. Start Visual Studio and open the VisualGDB Linux Project Wizard:01-newproj
  2. On the first page of the wizard select “Import a project -> Import a CMake project -> Use the advanced CMake Project Subsystem”:02-import
  3. On the next page select the Linux machine you want to use for building:03-machine
  4. Connect to your Linux machine over SSH and clone the https://github.com/nlohmann/json repository to a local directory:04-clone
  5. Get back to VisualGDB Project Wizard and specify the directory where you cloned the repository:05-dir
  6. Although VisualGDB can access the remote files directly over SSH, it is recommended to copy them locally if you want to use the profiler. This will help VisualGDB adjust the function locations reported by valgrind based on the actual function declarations:06-access
  7. Proceed with the default settings on the last page:07-cmake
  8. Press “Finish” to create the project. VisualGDB will automatically detect the targets declared in the original CMakeLists.txt file and show them in Solution Explorer:08-prj
  9. The original CMakeLists.txt files don’t contain the targets for the benchmark. You can add it manually by right-clicking on the .vgdbcmake project and selecting “Add->New Items->Executable”. Add an executable called “benchmark” in the “benchmarks” subdirectory:10-bench
  10. VisualGDB will automatically create the CMakeLists.txt file in that subdirectory. Remove the auto-generated source file and reference the benchmarks.cpp file instead:11-simple
  11. Add “thirdparty/benchpress” and “thirdparty/cxxopts” to the include directories for the new target:12-dirs
  12. Add “pthread” to the linked library names:13-pthread
  13. To make profiling faster, we will comment out all benchmarks except “parse canada.json” and focus on optimizing the parser for this set. Build the target by selecting “Build Target(s)” in Solution Explorer:15-build
  14. Ensure you have the RelWithDebInfo configuration selected. The regular Release configuration doesn’t generate debug symbols, so VisualGDB won’t be able to highlight performance of individual code lines:16-built
  15. Open Debug Properties for the target and set the working directory to the “benchmarks” subdirectory:17-workdir
  16. Run the program without debugger by pressing Ctrl-F5 and take note of the parsing time:18-time
  17. Now we will begin optimizing the program performance. Select Analyze->Analyze Performance with VisualGDB:19-analyzeEnsure that the “benchmark” target is selected as the startup target, as VisualGDB always profiles the startup target.
  18. Select “Profile without debugging” and press “OK”:20-start
  19. VisualGDB will launch the benchmark under profiler and begin collecting performance data:21-recording
  20. Once the profiling is complete, VisualGDB will automatically display function call tree sorted by the inclusive time. Expand the critical path until you see the “strtod” function. The function is responsible for parsing string representations of the floating point numbers (e.g. “1.2345”):22-critical
  21. You can enter the total time observed without profiling in the “total time” field so that VisualGDB will try to estimate how much time each function would take under normal circumstances (without profiler). Note that this estimate is not 100% accurate as VisualGDB (and valgrind) counts CPU instructions that don’t always translate 1-to-1 to CPU cycles:23-times
  22. A quick research shows that the strtod() function is indeed relatively slow as it handles the advanced ‘E’ syntax. We will address this by replacing it with a simpler implementation handling normal cases that will fall back to calling strtod() if it encounters the ‘e’ character (the implementation is based on this one). Create a file called “Optimizations.h” with the following contents:
    #pragma once
    #include <sys/types.h>
    #include <string>
    #include <vector>
     
    double strtod_fast(const char *start, char **end)
    {
        const char *p = start;
        double r = 0.0;
        bool neg = false;
        if (*p == '-') {
            neg = true;
            ++p;
        }
        
        while (*p >= '0' && *p <= '9') {
            r = (r * 10.0) + (*p - '0');
            ++p;
        }
        if (*p == '.') {
            double f = 0.0;
            double coef = 1.0;
            ++p;
            while (*p >= '0' && *p <= '9') {
                f = (f * 10.0) + (*p - '0');
                ++p;
                coef /= 10.0;
            }
            r += f * coef;
        }
        if (neg) {
            r = -r;
        }
        
        if ((*p | 0x20) == 'e')
            return strtod(start, end);
        
        if (end)
            *end = const_cast<char *>(p);
        return r;
    }
  23. Include the file from json.hpp and replace the call to std::strtod() with a call to strtod_fast():24-fast
  24. Build the program and run it without debugging. See how the execution time got ~20% faster: 25-newtime
  25. Doing another profiling session will quickly reveal that strtod_fast() is indeed much faster than strtod() and now takes only 8% of the total program run time:26-newfunc
  26. You can double-click on the function in the report and then click “annotate lines” to highlight performance of each individual line:27-linesSee how most of the time is spent parsing digits after the decimal point. If we wanted to optimize this function further, we could use a lookup table to handle 2 characters at once or try first converting all digits to an integral value and then dividing it in on FP operation.
  27. Now we will show a few more optimizations. Enable the “show individual lines” switch and navigate to the “scan_number()” method. See that the “yytext.push_back()” line is now taking almost 20% of the entire program run time:28-pushback
  28. Double-click on the line to open the corresponding source code. Note how yytext is defined as a string:29-yytext
  29. As appending a character to it is considerably slowing down our program, we can replace it with a custom container optimized for quick appending characters one-by-one:
    • The custom container will not be null-terminated until its contents is explicitly requested
    • It will initially allocate a large chunk of data for the string to avoid reallocating it when appending characters

    Add the following code to Optimizations.h:

    class FastAppendableString
    {
    protected:
        char *m_pData;
        mutable char *m_pEnd;
        char *m_pEndOfAlloc;
        mutable bool m_NullTerminated = false;
        
        void DoAppendChar(char ch)
        {
            if (m_pEnd >= m_pEndOfAlloc)
                asm("int3");
            *m_pEnd++ = ch;
        }
        
        void EnsureRawForm() const
        {
            if (m_NullTerminated)
            {
                m_pEnd--;
                m_NullTerminated = false;
            }
        }
        
    public:
        FastAppendableString()
        {
            m_pData = (char *)malloc(4096);
            m_pEnd = m_pData;
            m_pEndOfAlloc = m_pData + 4096;
        }
        
        ~FastAppendableString()
        {
            free(m_pData);
        }
        
        FastAppendableString(const FastAppendableString &) = delete;
        void operator=(const FastAppendableString &) = delete;
        
        const char *data()
        {
            if (!m_NullTerminated)
            {
                DoAppendChar(0);
                m_NullTerminated = true;
            }
            return m_pData;
        }
        
        const size_t size()
        {
            if (m_NullTerminated)
                return m_pEnd - m_pData - 1;
            else
                return m_pEnd - m_pData;
        }
        
        void clear()
        {
            m_pEnd = m_pData;
            m_NullTerminated = false;
        }
        
        void push_back(char ch)
        {
            EnsureRawForm();
            DoAppendChar(ch);
        }
        
        operator std::string()
        {
            return data();
        }
    };

    Then replace the declaration of yytext to use the new custom container type: 30-string

  30. Run the program again and see another 12% increase in the performance:31-time3Note that we have intentionally left the reallocation case unimplemented (it will trigger a breakpoint via the “int3” instruction). Adding code to reallocate the buffer will slightly reduce the performance as it will force the GCC to save the registers modified by realloc() before checking whether a call to realloc() is needed. This could be resolved by making a non-inlineable function saving all CPU registers on entry and calling it via inline assembly.
  31. The JSON parser example shown in this tutorial contains another relatively easy optimization opportunity. Switch to the “All functions” view and sort the functions by exclusive time. See how the emplace_back() method is taking 10% of the time. It is called from 4 different locations and Valgrind normally doesn’t distinguish which of them contributes the most to the execution time:32-emplace
  32. To easily distinguish between different invocation paths of emplace_back(), run another profiling session enabling the “distinguish up to 1000 parent frames” option:33-stack
  33. Now you can expand the [all contexts] node and see that 8% of the total program runtime is spent on the emplace_back() call from lexer::get():34-advstack
  34. This scenario is similar to the previous one (slow character-by-character appending):35-tokenstring
  35. Add another optimized container to Optimizations.h and edit token_string to use it:
    class FastAppendableString2 : public FastAppendableString
    {
    public:
        void pop_back()
        {
            EnsureRawForm();
            if (m_pEnd >= m_pData)
                m_pEnd--;            
        }
        
        const char *begin() const
        {
            EnsureRawForm();
            return m_pData;
        }
        
        const char *end() const
        {
            return m_pEnd;
        }
    };
  36. Run the program again. See how we have reduced the runtime by another 7.5% (35% down from the original version):36-time4Note that profiling helps you optimize the program for a specific scenario. E.g. one of 3 optimizations in this tutorial relied on heavy use of floating-point numbers in the dataset and would not give much improvement for data sets containing mostly integral numbers and strings. Remember to always use real data sets (or proportionally scaled down subsets) when you profile your code to get the best final results.