xiph – Corey James Blog

Hello and welcome to my blog!

In this post, I will write about my final project for SPO 600, my software portability and optimization class, and review the work I have done with the FLAC project. If you didn’t read the first post in this series, you can find it here. As a quick review of the last post: for SPO 600, I was tasked with finding and optimizing an open-source project. Knowing I wanted to work with audio, I chose the open-source project FLAC (Free Lossless Audio Codec). After further investigation, I made a plan for optimizing the FLAC library. In this post, I will go over the implementation and results of that optimization.

Execution of my plan:

In the previous blog post, I laid out a strategy for the completion of my optimization. It turned out to be quite helpful. I was able to follow it, and I completed the optimization just as planned. Below is the strategy I made and some notes on how I completed each task.

Research the required pre-processor directives that I will need to run the Aarch64 code inside the FLAC library conditionally.
By reading the FLAC code, I determined that the pre-processor directives used in this project rely on macros defined by the configure script. After reading the “configure.ac” file, the online AutoTools documentation and the “configure.ac” file from the OPUS project (also created by Xiph, the same people who created FLAC), I worked out how to check whether the user has an aarch64 CPU and whether the arm-neon intrinsics are available. After confirming that I am on an aarch64 machine, I define “FLAC__CPU_AARCH64”, and after confirming that the arm-neon intrinsics are available, I define “FLAC__HAS_NEONINTRIN”.
Test the pre-processor directives with some code that will cause a fault, so I know it is working.
Since the FLAC project already has optimizations for other platforms, I needed to follow the same pattern as the previous contributors. To test the pre-processor directives, I had to add the new architecture to the function selection logic. To do that, I added code to “src/libFLAC/cpu.c”, “src/libFLAC/include/private/cpu.h”, “src/libFLAC/include/private/lpc.h” and “src/libFLAC/stream_encoder.c”. Once the function selection logic was done, I was ready to test. I did not end up needing to cause any faults to confirm that the pre-processor directives were working. Instead, I used a printf statement and a copy of the vanilla C version of the auto-correlation function. When I ran the program, I could see the printf messages, and I also used perf to confirm that the new function was being called.
Examine the codebase to find out precisely where I need to put the pre-processor directives, and check whether I need to change the build instructions.
I ended up doing this as part of steps one and two, since the FLAC project doesn’t use pre-defined macros for its pre-processor directives. Instead, it uses macros defined by the configure script, so I had to examine the codebase and implement the changes before I could run a test.
Configure the makefile to build the new file that I am adding.
I added one file to the project, called lpc_intrin_neon.c. For the compiler to build it, I added it to the list of source files inside “src/libFLAC/Makefile.am”.
I am going to focus on the “FLAC__lpc_compute_autocorrelation” function and translate it into aarch64 intrinsics. I will use the existing C and x86 intrinsic code to help me with the translation.
Success! It took some time, but I was able to translate the x86 code into aarch64. I did this using the Intel and arm-neon online documentation. I also got help by googling a specific x86 intrinsic and asking which arm-neon instruction does the same or a similar thing. For a few intrinsics, there was no direct replacement. Specifically, there is no direct equivalent of the x86 shuffle in arm-neon, so I had to read up on how shuffle works on x86 and reproduce it using multiple arm-neon instructions. I ended up creating inline functions for the shuffles to make the code cleaner and easier to write.
Test my optimization: re-run the test that I performed on the original code with my optimized version and see whether I have improved the performance on the aarch64 platform.
I tested with two aarch64 machines. The first machine has a faster single-thread performance with 8 threads. The second machine has 24 threads but slower single-thread performance. On the first machine, initially, the autocorrelation function took 26.11 percent of the runtime. After the optimizations, the autocorrelation function took 12.64 percent of the time. On the second machine, initially, the autocorrelation function took 52.41 percent of the runtime. After the optimizations, the autocorrelation function took 14.78 percent of the time. I also tested the optimizations on an x86 machine to confirm that the changes did not affect that architecture.
As a stretch goal, depending on how hard it is to write the aarch64 intrinsics, I would like to translate the full “lpc.c” file into aarch64 intrinsics.
I didn’t end up translating the full “lpc.c” file, but I did translate all versions of the autocorrelation function. There are four versions, and the encoder calls the one that matches the lag: lag 4, lag 8, lag 12 or lag 16.
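
To give a feel for the shuffle workaround mentioned in the notes above, here is a minimal sketch. It is an illustration, not the code from the patch. It assumes the x86 shuffles being replaced are a one-lane rotation, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)), and a broadcast, _mm_shuffle_ps(v, v, 0). The names rotate_lanes_up, broadcast_lane0, check_shuffles and SKETCH_USE_NEON are made up for this example, and the guard uses compiler-provided macros so the file builds on any machine; inside FLAC the guard would test FLAC__CPU_AARCH64 and FLAC__HAS_NEONINTRIN instead.

```c
#include <assert.h>
#include <string.h>

/* Assumption: compiler macros stand in for the configure-defined
   FLAC__CPU_AARCH64 / FLAC__HAS_NEONINTRIN so this builds anywhere. */
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_USE_NEON 1
#else
#define SKETCH_USE_NEON 0
#endif

/* Hypothetical helper: rotate lanes up by one, {a,b,c,d} -> {d,a,b,c}.
   Same result as _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)) on x86. */
static inline void rotate_lanes_up(const float in[4], float out[4])
{
#if SKETCH_USE_NEON
    float32x4_t v = vld1q_f32(in);
    /* vextq_f32(v, v, 3) takes four lanes from v:v starting at lane 3. */
    vst1q_f32(out, vextq_f32(v, v, 3));
#else
    /* Scalar fallback; the temporary keeps it safe when in == out. */
    float tmp[4] = { in[3], in[0], in[1], in[2] };
    memcpy(out, tmp, sizeof tmp);
#endif
}

/* Hypothetical helper: copy lane 0 into every lane, {a,b,c,d} -> {a,a,a,a}.
   Same result as _mm_shuffle_ps(v, v, 0) on x86. */
static inline void broadcast_lane0(const float in[4], float out[4])
{
#if SKETCH_USE_NEON
    vst1q_f32(out, vdupq_n_f32(in[0]));
#else
    float a = in[0];
    out[0] = out[1] = out[2] = out[3] = a;
#endif
}

/* Small self-check that exercises both helpers on either code path. */
void check_shuffles(void)
{
    const float in[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float out[4];

    rotate_lanes_up(in, out);
    assert(out[0] == 4.0f && out[1] == 1.0f && out[2] == 2.0f && out[3] == 3.0f);

    broadcast_lane0(in, out);
    assert(out[0] == 1.0f && out[1] == 1.0f && out[2] == 1.0f && out[3] == 1.0f);
}
```

Calling check_shuffles() from a main function runs the same assertions on x86 (scalar path) and on aarch64 (NEON path), which is a cheap way to confirm that a replacement shuffle behaves like the original before using it in real code.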

Full Results:

The following results are not averaged, but I did run these tests multiple times with similar results. The numbers below are from a few of the many tests I performed.

Aarch64 Machine 1:

TOTAL RUNTIME OF THE TEST BEFORE OPTIMIZATION:

real    0m51.784s
user    0m49.356s
sys     0m2.349s

TOTAL RUNTIME OF THE TEST AFTER OPTIMIZATION:

real    0m43.503s
user    0m40.950s
sys     0m2.470s

PERF REPORT BEFORE OPTIMIZATION (First 20 Lines):

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 208K of event 'cycles:u'
# Event count (approx.): 98509947650
#
# Overhead  Command   Shared Object           Symbol
# ........  ........  ......................  ...................................................................................
#
     26.11%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__lpc_compute_autocorrelation
     25.54%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__fixed_compute_best_predictor_wide
     11.35%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__bitwriter_write_rice_signed_block
      9.45%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__MD5Transform
      5.95%  lt-flac   lt-flac                 [.] format_input
      5.60%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      3.42%  lt-flac   libFLAC.so.8.3.0        [.] precompute_partition_info_sums_
      2.34%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__MD5Accumulate
      2.21%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__crc16

PERF REPORT AFTER OPTIMIZATION (First 20 Lines):

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 175K of event 'cycles:u'
# Event count (approx.): 81871492155
#
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ...................................................................................
#
     30.58%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide
     13.36%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
     12.64%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation_intrin_neon_lag_12
     11.71%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform
      7.16%  lt-flac  lt-flac             [.] format_input
      5.18%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      4.16%  lt-flac  libFLAC.so.8.3.0    [.] precompute_partition_info_sums_
      3.00%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Accumulate
      2.62%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__crc16
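
One thing to keep in mind when reading the two reports above: each percentage is a share of that report's own total, so 26.11% and 12.64% are not directly comparable. A rough way to compare them is to multiply each share by the approximate event count of its report. The sketch below does that arithmetic. The function names are made up for this example, and it assumes that the lag_12 variant accounts for all of the autocorrelation work in the optimized run and that perf's sampled figures are accurate enough for an estimate.

```c
#include <assert.h>
#include <stdio.h>

/* Figures copied from the two Machine 1 perf reports. */
static const double BEFORE_PERCENT = 26.11;
static const double BEFORE_EVENTS  = 98509947650.0;
static const double AFTER_PERCENT  = 12.64;
static const double AFTER_EVENTS   = 81871492155.0;

/* Hypothetical helper: approximate cycles attributed to one symbol,
   from its overhead percentage and the report's total event count. */
static double symbol_cycles(double overhead_percent, double event_count)
{
    return overhead_percent / 100.0 * event_count;
}

/* Ratio of estimated autocorrelation cycles before vs. after. */
double autocorrelation_speedup(void)
{
    double before = symbol_cycles(BEFORE_PERCENT, BEFORE_EVENTS);
    double after  = symbol_cycles(AFTER_PERCENT, AFTER_EVENTS);
    assert(after > 0.0); /* guard the division */
    return before / after;
}

/* Print the estimate in a readable form. */
void print_autocorrelation_estimate(void)
{
    printf("before: %.0f cycles\n", symbol_cycles(BEFORE_PERCENT, BEFORE_EVENTS));
    printf("after:  %.0f cycles\n", symbol_cycles(AFTER_PERCENT, AFTER_EVENTS));
    printf("speedup: %.2fx\n", autocorrelation_speedup());
}
```

By this estimate the autocorrelation function went from about 25.7 billion cycles to about 10.3 billion, roughly a 2.5x improvement for the function itself, while the total event count for the whole run dropped by about 17%.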

Aarch64 Machine 2: