
I want to optimize this code to run faster, so I started by parallelizing the for loops, but the code is still not very fast. I get a variable frame rate, with a maximum of about 23 FPS. The code manipulates the histogram of a given image. Is there a faster and more efficient way of doing it?

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

// Function declaration
void hist(cv::Mat& image);

int main() {
    // Read the image
    cv::Mat image = cv::imread("image.jpg");
    if (image.empty()) {
        std::cerr << "Error: Could not open or find the image!" << std::endl;
        return -1;
    }

    // Call the hist function
    hist(image);

    // Display the result
    cv::imshow("Result", image);
    cv::waitKey(0);
    return 0;
}

void hist(cv::Mat& image) {
    int filterFactor = 1;
    long int N = image.rows * image.cols;

    std::vector<int> h_b(256, 0);
    std::vector<int> h_g(256, 0);
    std::vector<int> h_r(256, 0);

    // Calculate histograms for B, G, and R channels in parallel
    cv::parallel_for_(cv::Range(0, image.rows), [&](const cv::Range& range) {
        for (int i = range.start; i < range.end; ++i) {
            const cv::Vec3b* row = image.ptr<cv::Vec3b>(i);
            for (int j = 0; j < image.cols; ++j) {
                const cv::Vec3b& pxi = row[j];
                h_b[pxi[0]]++;
                h_g[pxi[1]]++;
                h_r[pxi[2]]++;
            }
        }
    });

    // Accumulate histograms
    for (int i = 1; i < 256; ++i) {
        h_b[i] += filterFactor * h_b[i - 1];
        h_g[i] += filterFactor * h_g[i - 1];
        h_r[i] += filterFactor * h_r[i - 1];
    }

    // Find vmin and vmax for B, G, and R channels
    auto find_vmin_vmax = [&](const std::vector<int>& hist, int& vmin, int& vmax) {
        vmin = 0;
        vmax = 255 - 1;
        while (hist[vmin + 1] <= N * 3 / 100) { vmin++; }
        while (hist[vmax - 1] > (N - (N / 100) * 3)) { vmax--; }
        if (vmax < 255 - 1) { vmax++; }
    };

    int vmin_b, vmin_g, vmin_r, vmax_b, vmax_g, vmax_r;
    find_vmin_vmax(h_b, vmin_b, vmax_b);
    find_vmin_vmax(h_g, vmin_g, vmax_g);
    find_vmin_vmax(h_r, vmin_r, vmax_r);

    // Apply stretching in parallel
    cv::parallel_for_(cv::Range(0, image.rows), [&](const cv::Range& range) {
        for (int i = range.start; i < range.end; ++i) {
            cv::Vec3b* row = image.ptr<cv::Vec3b>(i);
            for (int j = 0; j < image.cols; ++j) {
                cv::Vec3b& pxi = row[j];
                pxi[0] = std::max(std::min(static_cast<int>(pxi[0]), vmax_b), vmin_b);
                pxi[1] = std::max(std::min(static_cast<int>(pxi[1]), vmax_g), vmin_g);
                pxi[2] = std::max(std::min(static_cast<int>(pxi[2]), vmax_r), vmin_r);

            }
        }
    });

    // Apply scaling in parallel 
    cv::parallel_for_(cv::Range(0, image.rows), [&](const cv::Range& range) {
        for (int i = range.start; i < range.end; ++i) {
            cv::Vec3b* row = image.ptr<cv::Vec3b>(i);
            for (int j = 0; j < image.cols; ++j) {
                cv::Vec3b& pxi = row[j];
                pxi[0] = cv::saturate_cast<uchar>((pxi[0] - vmin_b) * 255 / (vmax_b - vmin_b));
                pxi[1] = cv::saturate_cast<uchar>((pxi[1] - vmin_g) * 255 / (vmax_g - vmin_g));
                pxi[2] = cv::saturate_cast<uchar>((pxi[2] - vmin_r) * 255 / (vmax_r - vmin_r));
            }
        }
    });
}

2 Answers


profiling

This submission is about performance, yet it includes no profiling measurements. All we are told is that hist() completes within 44 msec.

temporal relationships

In your use case you call hist(image1) and then hist(image2). Does the use case impose any structure, so that image2's pixels are somehow related to image1's? For example, could we construct a smallish bbox or set of rasters which contains all pixels that changed in this frame? Is the caller aware of such constraints, and could it pass them in? Or is it feasible for hist() to compute them?

approximation

Do you really need to re-compute the exact parameters for every frame? Would it suffice to recycle parameters that are one or a couple of frames out of date?

single frame delay

Compute precise results, delayed by one frame. Background thread asynchronously works on frame N-1, and publishes its results by the time frame N arrives. We use those parameters to synchronously stretch and scale frame N.
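A scheme like this could be sketched as follows. This is my own minimal illustration, not the OP's code (the names `StretchParams` and `ParamSlot` are mine): a lock-free double buffer lets the background thread publish the parameters it computed for frame N-1, while the render thread reads whatever is newest.

```cpp
#include <atomic>

// Hypothetical stretch parameters -- in the OP's code these would be
// the per-channel vmin/vmax values.
struct StretchParams { int vmin; int vmax; };

// Double-buffered slot: one background writer, any number of readers.
// The writer fills the inactive buffer, then flips `ready` to point at it.
struct ParamSlot {
    StretchParams buf[2] = {{0, 255}, {0, 255}};
    std::atomic<int> ready{0};

    void publish(const StretchParams& p) {
        int next = 1 - ready.load(std::memory_order_relaxed);
        buf[next] = p;                                 // write the inactive buffer
        ready.store(next, std::memory_order_release);  // then make it visible
    }
    StretchParams current() const {
        return buf[ready.load(std::memory_order_acquire)];
    }
};
```

The background thread calls publish() when it finishes frame N-1's histogram; the main loop calls current() and applies those one-frame-stale parameters to frame N.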

spatial relationships

Does one raster by-and-large resemble its adjacent rasters? Pick a stride, so you only compute parameters for every 2nd or every 16th raster, and take that approximation as "good enough".
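A sketch of the stride idea, using a plain single-channel row-major buffer so it stands alone (the function name and layout are mine, not OpenCV's):

```cpp
#include <cstdint>
#include <vector>

// Approximate histogram from every `stride`-th row only. `img` is a
// hypothetical single-channel image stored row-major, rows x cols.
std::vector<int> strided_hist(const std::vector<uint8_t>& img,
                              int rows, int cols, int stride) {
    std::vector<int> h(256, 0);
    for (int i = 0; i < rows; i += stride)
        for (int j = 0; j < cols; ++j)
            h[img[i * cols + j]]++;
    return h;
}
```

With stride = 16 this reads one sixteenth of the pixels; the percentile thresholds derived from it should change very little if adjacent rasters resemble one another.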

stretch + scale

It's not clear that you've maxed out your cores. It seems possible that L3 cache or main-memory bandwidth is the bottleneck here. You didn't share profiling data, so it's hard to tell.

You stretch using all cores, and then you separately scale using all cores. We're reading and writing every pixel in both instances, dirtying cache lines and writing them out.

Prefer to stretch + scale within a given iteration, as a combined operation. This potentially doubles your throughput, if memory bus bandwidth was the limiting factor.
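The fused per-pixel operation could look like this sketch (my own helper, not from the post): clip and scale in one expression, so each pixel is read and written exactly once.

```cpp
#include <algorithm>
#include <cstdint>

// Clip to [vmin, vmax], then stretch that range to [0, 255] -- the work
// of the OP's two loops fused into one per-pixel expression.
// Assumes vmax > vmin.
inline uint8_t stretch_scale(uint8_t v, int vmin, int vmax) {
    int clipped = std::clamp(static_cast<int>(v), vmin, vmax);
    return static_cast<uint8_t>((clipped - vmin) * 255 / (vmax - vmin));
}
```

A single cv::parallel_for_ over the rows would then apply this once per channel, halving the memory traffic versus two passes.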

lookup table

I have no idea whether this would be a win, given that arithmetic is cheap, so time it and see.

You currently compute a histogram table plus some other parameters. Consider computing that plus three 256-element lookup tables, so we're doing no arithmetic during stretch + scale.
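For instance, one 256-entry table per channel could fold the clip and the scale into a single load per pixel (a sketch under my own naming; assumes vmax > vmin):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Precompute clip + scale for every possible byte value, so the
// per-pixel work during the image pass is just lut[v].
std::array<uint8_t, 256> make_lut(int vmin, int vmax) {
    std::array<uint8_t, 256> lut{};
    for (int v = 0; v < 256; ++v) {
        int c = std::clamp(v, vmin, vmax);
        lut[v] = static_cast<uint8_t>((c - vmin) * 255 / (vmax - vmin));
    }
    return lut;
}
```

With OpenCV, the three per-channel tables can be interleaved into one 1x256 three-channel cv::Mat and applied with cv::LUT(), which is itself parallelized.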

aesthetics

One benefit of three lookup tables is you could compare them from frame to frame. And you could do exponential smoothing on them, to avoid visually jarring abrupt changes.

Suppose we have tables available from several recent frames. Compute smoothed lookup value
v[i] = .5 * v[i-1] + .25 * v[i-2] + .125 * v[i-3] + .125 * v[i-4].
Ideally it's an infinite sum, but then you have to approximate the tail end at some point; here I used 1/8th to do that. And there's no need to store lots of old tables, as we can simply divide old value by 2 then do an addition.
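In recurrence form that blending could be sketched like so (names mine; keeping the running table in float avoids rounding drift). With alpha = 0.5 this reproduces the .5, .25, .125, ... weighting above without storing any old tables:

```cpp
#include <array>
#include <cstdint>

// Exponential smoothing of the lookup table:
//   smoothed[i] = alpha * fresh[i] + (1 - alpha) * smoothed[i]
// The running table already carries the contribution of every
// earlier frame, so only one table needs to be stored.
void smooth_lut(std::array<float, 256>& smoothed,
                const std::array<uint8_t, 256>& fresh, float alpha) {
    for (int i = 0; i < 256; ++i)
        smoothed[i] = alpha * fresh[i] + (1.0f - alpha) * smoothed[i];
}
```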

Such a smoothing approach adapts nicely to having a background thread compute histograms at a lower rate than once per frame.

Additionally, that thread might focus on "histogram of even rasters" for one frame, and then find histogram of odd numbered rasters on next frame, if we believe there's spatio-temporal structure in the frames.

  • The smoothing of lookup tables is neat. You really only need to have the previous table available, and compute, say, new = new*0.25 + prev*0.75. Because prev is the previously smoothed table, it already contains information from all the previous frames. The 0.25 and 0.75 sum to 1, and the split determines the speed with which new information is adopted into the table. Commented Jul 11 at 0:00
  • @CrisLuengo Yeah, I was suggesting a smoothing \$\alpha\$ of .5, you're suggesting .25, and even smaller values like .1 might be suitable for some use cases. It's unclear whether the OP is examining a sequence of synthetic video game images, sports images with frequent rapid cuts as we follow the tennis ball, or some other image source. Any way you slice it we're "cheating", since we don't compute what the OP's code computes, but hey, life is full of tradeoffs. And they start with stakeholders revealing what's important vs. what's a nice-to-have.
    – J_H
    Commented Jul 11 at 0:14
  • For readability and maintainability I would use 1/2 ... 1/8 instead of .5 ... .125. The compiler will compute the values for you. Commented Jul 11 at 12:23
  • I didn’t mean to talk about the constants. I was referring to your statement “Ideally it's an infinite sum, but then you have to approximate the tail end at some point”. You only need to refer to the previously computed result. The recurrence y(t) = α x(t) + β y(t-1) creates a sequence where the output y at a point t depends on the input x at t and all previous time points. If α + β = 1, it’s a causal smoothing filter. Just fill in y(t-1) = α x(t-1) + β y(t-2) in that expression, and repeat forever… Commented Jul 11 at 14:57
  • Thank you so much @J_H, I implemented the spatial relationships and the lookup table and I get a 100x performance improvement. That was a very helpful answer. Now I need to implement the smoothing function, since, as you anticipated, I see jumps every now and then.
    – Ja_cpp
    Commented Jul 21 at 9:31

There is one major issue with the code, and then some performance issues (the other answer already touched on some of these).

Major Issue

In the histogram computation, you loop over pixels in parallel, but update a single set of arrays.

Parallel threads updating the same array invariably leads to race conditions. Here, one thread reads a value, increments it by one, then writes it back. Say another thread does the same thing to the same value, at the same time. It reads the value before the first thread writes, and it writes it out after the first thread writes. The second thread thus clobbered (overwrote) the first one’s work.

And because writing is typically done in a larger unit than a single int, the chance of this type of race condition happening is larger than you’d think.

Either create the histogram in a single thread, or have each thread build its own histogram from its own section of the image, then combine the histograms at the end.
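The second option might look like the following self-contained sketch (using std::thread instead of cv::parallel_for_ so it compiles without OpenCV; the structure carries over directly). Each thread counts into its own local table over its own rows, and the locals are summed serially at the end, so no two threads ever write the same counter:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Race-free parallel histogram over a hypothetical single-channel
// row-major image: thread t handles rows t, t+nthreads, t+2*nthreads, ...
std::vector<int> parallel_hist(const std::vector<uint8_t>& img,
                               int rows, int cols, int nthreads) {
    std::vector<std::vector<int>> locals(nthreads, std::vector<int>(256, 0));
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            for (int i = t; i < rows; i += nthreads)
                for (int j = 0; j < cols; ++j)
                    locals[t][img[i * cols + j]]++;   // private table, no races
        });
    }
    for (auto& th : pool) th.join();
    std::vector<int> h(256, 0);                       // serial merge
    for (auto& l : locals)
        for (int v = 0; v < 256; ++v) h[v] += l[v];
    return h;
}
```

With cv::parallel_for_ the same idea applies: give each Range its own local histogram and merge the locals under a mutex (or after the call returns).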

I would keep it simple, have this function be single-threaded, and run it in parallel, each thread processing a different frame.

Performance Issues

You have two loops, “stretching” and “scaling”. The “stretching” loop actually just clips the data, the second loop stretches (== scales). But the second loop also clips, since you use cv::saturate_cast<uchar>(). So, that first loop is completely superfluous, leave it out!

As the other answer pointed out, a single loop through data is much faster than two loops through data. Try to do as much of your processing as possible in each loop, so you don’t loop more often than necessary. But in this particular case, the work done in the first loop is not needed at all.
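To see why the clipping pass is redundant: out-of-range inputs already land on 0 or 255 after scaling plus saturation. A small sketch, where saturate_u8 mimics what cv::saturate_cast<uchar> does for an int argument:

```cpp
#include <algorithm>
#include <cstdint>

// What cv::saturate_cast<uchar>(int) does: clamp into [0, 255].
inline uint8_t saturate_u8(int v) {
    return static_cast<uint8_t>(std::clamp(v, 0, 255));
}

// Scaling alone: v < vmin yields a negative value (saturates to 0),
// v > vmax yields a value above 255 (saturates to 255), so a prior
// clip to [vmin, vmax] changes nothing. Assumes vmax > vmin.
inline uint8_t scale_only(uint8_t v, int vmin, int vmax) {
    return saturate_u8((v - vmin) * 255 / (vmax - vmin));
}
```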

  • Thank you for the "Performance Issues" solutions; that works very well. But when I make the function single-threaded I get worse performance, as the image is 4K resolution.
    – Ja_cpp
    Commented Jul 21 at 9:33
  • @Ja_cpp Where I suggest you write it single-threaded, it is because your multi-threaded histogram creation is bugged. It is not doing the right thing. It cannot work correctly. I also told you how to create a histogram using multiple threads. Do that if you must. But don’t use your current code, it is broken. Commented Jul 21 at 18:00
  • Yes, I agree. I just retried a single-threaded loop to see whether I really need to spend time correcting the bugs in the multi-threaded one. Thank you once again.
    – Ja_cpp
    Commented Jul 22 at 7:38
