Why many GPU reviews are not fit for purpose
Recently AMD and NVIDIA received some flak regarding the performance of their cards when gaming, specifically around an issue called micro-stutter. Some great work by the likes of The Tech Report managed to capture and document the results in a clear and useful way. Both manufacturers had problems, although the impact varied across different cards and games. To be fair though, both AMD and NVIDIA have accepted the problem and gone about addressing it.
It is also fair to say that in the future a number of sites will adopt newer tests which are able to highlight this particular issue when it occurs. The question remains though: why wasn’t this flagged as an issue sooner? It had, after all, been happening for some time.
In what might seem like an unrelated, ongoing discussion, we have on occasion taken to our forum to discuss our testing processes, our results and how they compare to other publications. Behind the scenes we have also discussed similar matters with AMD and NVIDIA a number of times. In the recent run up to the GTX TITAN launch it became clear that there was a real problem in the graphics review industry, and that this problem was a significant contributing factor in micro-stuttering (or frame latency issues) going unsolved and bothering gamers across the globe for so long.
The simple fact is that manufacturers and consumers have been too accepting of reviews, not caring about more than the basics of the results presented to them or how the figures were achieved and recorded. Add to that the fact there are a huge number of review sites out there and, to be frank, many are filled with lazy writers who will do the minimum required to get an article out… and editors who only care about hits (money) and not the quality of the work. On occasion the same lazy writers are also the money-focused editors and that is never a good mix. That’s before we even consider those who fabricate testing and plagiarise others’ work.
There are of course sites/magazines who do offer quality reviews, but they are in the minority (a minority I hope that we are part of).
Anyway, today we are going to look at GPU reviews specifically (we may venture into other categories in the future as motherboard reviews, for example, are also a mess across the industry), taking a look at what WE feel makes for a quality review and why. This isn’t about singling out a particular company, publication or writer, but hopefully it can educate both PR reps and consumers about what makes a good review and how that can help them, whether in getting the best purchase for their needs… or maximising sales. Maybe it will make some other writers think about their processes too…
Test System and Practices
There are a few aspects, even before testing begins, which can impact the results in a review. The first is the choice of components/configuration, the second the tests used. The configuration of a system doesn’t just mean the components, it also means the drivers, and this is one of the key areas where a lazy reviewer will often take shortcuts. Across the industry there is a wide range of sites that regularly recycle results for months on end. This means taking a set of benchmarks from, say, October and then reusing the same figures when they test again in December, January etc., while the new products being added are tested on a different driver, either a newer one or an outdated one. So how can that impact performance?
Driver Performance Changes
NOTE: For the purposes of this article we are sticking to games which have been available throughout the lifespan of our oldest drivers, where possible.
In this first test we have taken Battlefield 3 and are testing it on our i7-3960X test system (more on that later) with a Radeon 7970. We tested first with Catalyst 12.9 Beta from September 2012 and then Catalyst 13.3 Beta which was released last week (March 2013). The difference in frames per second, on the same test system, just by updating a 6-month-old driver is 14fps (on average). That takes our 7970 from an average of 61fps right up to 75fps when testing at 1920×1080 on Ultra detail. Framerates increase throughout the test of course, with the important minimum framerate figure rising 12fps from 46 to 58.
This isn’t just a Battlefield 3 specific change though and it isn’t always a change for the positive… take Skyrim as an example…
In this test we took Skyrim, again at 1920×1080 on our test system and set it to Ultra with no other changes. Running on 12.9 (September 2012) we averaged 127fps, on 13.3beta (March 2013) we averaged 116fps.
In summary, different drivers can make a significant difference to testing even the same card and this is not specific to AMD.
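To put those driver changes in perspective, here is a minimal sketch of the arithmetic in Python. It uses only the framerates quoted above; the `fps_change` helper is simply an illustrative name of our own and not part of any benchmarking tool.

```python
def fps_change(old_fps, new_fps):
    """Return (absolute change in fps, percentage change relative to the old figure)."""
    delta = new_fps - old_fps
    return delta, 100.0 * delta / old_fps

# Figures quoted in this article: Catalyst 12.9 Beta vs Catalyst 13.3 Beta
results = {
    "Battlefield 3 average": (61, 75),
    "Battlefield 3 minimum": (46, 58),
    "Skyrim average": (127, 116),
}

for name, (old, new) in results.items():
    delta, pct = fps_change(old, new)
    print(f"{name}: {old} -> {new} fps ({delta:+d} fps, {pct:+.1f}%)")
```

In percentage terms that is roughly a 23% gain for the Battlefield 3 average and close to a 9% loss in Skyrim, from nothing more than a driver update.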
Moving to the second point we mentioned above, the tests chosen can have an impact on results, but how significant? Before we look at the figures achieved in some different tests it is worth noting that there are 3 main ways to test "gaming" performance currently. First are synthetic tests such as 3DMark. Then there are inbuilt benchmarks/timedemos where a game plays a pre-defined section using its engine. Then there is real world gameplay using (usually) FRAPS to record performance.
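For anyone unfamiliar with the third approach, the idea is that a recording tool logs a timestamp for each frame drawn and the framerate figures are worked out from that log afterwards. The sketch below assumes a plain list of cumulative frame timestamps in milliseconds; that input format, the `fps_summary` name and the sample data are all our own assumptions for illustration rather than the output of any specific tool.

```python
from collections import Counter

def fps_summary(timestamps_ms):
    """Average fps and lowest per-second fps from cumulative frame timestamps.

    timestamps_ms: the time (in ms from the start of the recording) at which
    each frame was shown. This is a hypothetical input format.
    """
    duration_s = (timestamps_ms[-1] - timestamps_ms[0]) / 1000.0
    average_fps = (len(timestamps_ms) - 1) / duration_s

    # Count how many frames landed in each whole second of the run
    frames_per_second = Counter(int(t // 1000) for t in timestamps_ms)
    # Ignore the final second, which is only partially recorded
    last_second = int(timestamps_ms[-1] // 1000)
    full_seconds = [n for s, n in frames_per_second.items() if s < last_second]
    if not full_seconds:
        raise ValueError("need at least one full second of recording")
    return average_fps, min(full_seconds)

# Made-up three second recording: 10 ms frames, then 25 ms frames, then 10 ms frames
timestamps = []
t = 0
for frame_ms, count in [(10, 100), (25, 40), (10, 100)]:
    for _ in range(count):
        timestamps.append(t)
        t += frame_ms
timestamps.append(t)  # timestamp that closes the final frame

average, minimum = fps_summary(timestamps)
print(f"average {average:.0f} fps, minimum {minimum} fps")
```

With this invented data the run averages 80fps but dips to 40fps for a full second, which is exactly the kind of drop a single average hides.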
Without doubt 3DMark and the like have their uses. Timedemos on the other hand, in our opinion, have two hugely detrimental aspects. Firstly they allow reviewers to be lazy. You click a few buttons, walk away and when you come back the program gives you a figure. You pop that in a graph… done. That means a reviewer doesn’t actually see (or feel) the performance of the game or any graphical issues. Secondly, and more importantly, inbuilt benchmarks rarely represent the framerates consumers will get when playing.
Let’s see by how much.
In our first "timedemo" test we have Shogun 2 and at our tested settings the inbuilt bench gives us an average of 83fps, dropping as low as 62fps (Catalyst 12.9). When we then look at some typical gameplay we see that the framerate a consumer would actually get is an average of 56fps, which is 27fps lower (again on 12.9). Mixing it up a little by moving to Catalyst 13.3, we see that the new driver gives us a nice increase in performance but overall the gap remains, with the timedemo averaging 87fps and real gameplay sitting at 66fps, a 21fps difference. It gets worse though, as in the most demanding sections of Shogun 2, the battles, the average is actually 58fps. That is a massive 29fps below the timedemo.
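As a rough sketch of how far the timedemo overstates things, the Shogun 2 averages above can be compared directly. The `timedemo_gap` helper is a name of our own, and we are assuming here that the battle figure was recorded on Catalyst 13.3, since 87 minus 58 matches the 29fps gap quoted.

```python
def timedemo_gap(timedemo_fps, gameplay_fps):
    """Return (fps by which the timedemo exceeds gameplay, that gap as a % of gameplay fps)."""
    gap = timedemo_fps - gameplay_fps
    return gap, 100.0 * gap / gameplay_fps

# Shogun 2 average framerates quoted above
cases = [
    ("Catalyst 12.9, general gameplay", 83, 56),
    ("Catalyst 13.3, general gameplay", 87, 66),
    ("Catalyst 13.3, battles (driver assumed)", 87, 58),
]

for label, demo, game in cases:
    gap, pct = timedemo_gap(demo, game)
    print(f"{label}: timedemo {demo} vs gameplay {game} fps "
          f"({gap} fps higher, {pct:.0f}% overstated)")
```

Put another way, in battles the timedemo figure is around half as high again as what you would actually see on screen.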
Again this is not specific to AMD, nor is it specific to Shogun 2… and it doesn’t always mean our results are worse than the timedemo. Here is Hitman: Absolution on a GTX 690 with our test system.
In this test not only is the timedemo showing lower performance than we would get in the actual game, this particular bench is so far off that our performance more than doubles when actually gaming.
So overall we have seen above that the choice of test and the configuration used has a significant impact on results but configuration comes in two parts when thinking about GPU testing. The other side is the hardware used.
Hardware Used – The impact
Shown below we have a common scenario, the use of two different CPUs for gaming tests. This isn’t a random choice, it is based on a similar scenario to a GTX Titan review we read where that card (in fact multiple TITANs) was tested on an i5 CPU. Does it matter? To look at high end GPU performance on different CPUs here is a high end i5 vs a high end i7 in Hitman and Max Payne 3.
Testing real world gameplay in Hitman shows us that the average difference in framerate between an i5-3570K and i7-3960X is 15fps on a high end GPU and that rises to nearly 20fps on minimum framerate.
In Max Payne 3 we again see a difference in performance between two different CPUs when the rest of the components are the same (excluding motherboard). This time the average framerate increases by 10 frames per second on the i7, the minimum by 16fps.
Impacting the outcome of a review
By this point in the article we have seen that the drivers used, components used and tests used can all have a significant impact on the results in a review. We could of course engineer a scenario based on this which combines all three aspects for a massive difference in performance but it is actually more beneficial to step back a little. Let’s return to that first test of Battlefield 3 and imagine a common review where we are comparing something like a 7970 with a new GTX 680 (maybe a manufacturer releases a new cooler design on their own PCB). If we take our old 7970 Catalyst 12.9 results… they are less than 6 months old after all /sarcasm… and compare them to our new 680 testing, the GTX has a massive average fps advantage in Battlefield 3. That could send consumers down a purchase route. If however we retest the two cards with the latest drivers, the result shows the Radeon to be much more competitive on average framerates, with a win on minimum. That could completely change the purchase decision and, once again, this can work both ways, with different games and driver configurations swinging the results from favouring AMD to NVIDIA being more competitive.
How we do things and why.
We like to think that when it comes to GPU testing (all testing really) we take a common sense approach. At the core of that is the thought "what experience will this GPU, or its competitors, give me as an end user if I buy it today?".
To get that answer we want to know what it will do for us in the latest games… in actual gameplay. We also want to know what the card’s specific features will offer us. We need to know that the performance we are seeing is accurate and not impacted by factors outside the GPU too.
As an example let’s take our GTX Titan review. Here is our test system (which was also used for this article):
ASRock Fatal1ty X79 Champion
Intel Core i7-3960X @ 5.0GHz
4x 4GB Corsair Vengeance DDR3-2133
Corsair H100 Liquid Cooler
Corsair HX860i PSU
Corsair Force GT 240GB SSD
Corsair Vengeance Keyboard, Mouse and Headset
As it happens a big chunk of the components in that review are Corsair branded, that’s part of an upcoming article, but the key point is suitability. There is absolutely a valid argument that this test system isn’t necessarily suited to, for example, budget GPU reviews but that said there is another side which would say it absolutely is because it removes any potential bottleneck.
Anyway, with this test system the choice seems pretty simple. Someone buying a $999 GPU probably isn’t going to run it on an i5… they are likely going to have an SSD too… and so on. Past that we also tested with the latest drivers (at the time of writing) which were 314.09 for Titan, 314.07 for GTX 680 and Catalyst 13.2 Beta6 for the AMD configs. All on Windows 8 64-bit, fully patched, and of course the motherboard BIOS was up to date. This is standard practice: we always test every product as if it was new for each graphics card review, even though it massively increases the time to test compared with using old results.
Of course, in addition to a test system which was up to date and relevant to the product’s target audience, we also tested real world gameplay in the latest games on a mix of engines and game genres:
Aliens: Colonial Marines
Far Cry 3
Assassin’s Creed 3
StarCraft 2: Heart of the Swarm (Beta)
Star Wars The Old Republic
We are not completely against synthetic testing though so the new 3DMark was included too… as was the recently released Heaven 4.0.
It’s in tests like the games listed above and the new synthetic benchmarks that new testing really benefits reviews. Through using old games and old results a reviewer locks themselves into using old tests.
Past that though, a modern GPU is not just about framerates… GTX Titan for example is also aimed at GPU compute performance (we tested that, real world and synthetic) and supports features like TXAA/FXAA (we tested those in Mass Effect 3 and Assassin’s Creed 3), PhysX (Borderlands 2), Media Playback (HD content acceleration on Blu-ray or streaming) and 3D Vision (Skyrim).
Compare that to other reviews you have read of GTX Titan and consider these areas which can often indicate use of old, inaccurate results.
While you do, here are the warning signs:
- Don’t be blinded by a comparison with 20 other GPUs. It is unlikely they were all tested on the latest hardware/drivers and we know from above the impact of that!
- Look for new tests, the latest games (or synthetic tests) signify new testing. (It’s nice to see some old favourites mixed in though, they have value to some readers).
- Look for the latest drivers being used on ALL products. You wouldn’t buy a card and install old drivers would you?
- Look for detailed test specifications and an explanation of how testing was performed. Is the hardware relevant for the product being tested? Has the reviewer retested multiple times to remove any spurious results?
- Look to see if real gameplay is stated, or if it is timedemos. And are the settings relevant… for example Eyefinity and Surround included for High End GPUs? Anti-Aliasing etc? Basically are the reviews testing at settings end users would game at?
- Look for examples which prove testing was real world… for example screenshots of gameplay (and that they are different to past articles).
- Look for charts showing minimum framerates across time.
- Look for charts showing frame latency, this is now just as important as framerate.
- Look for more than just gaming… power/thermal etc should also be there but extra tests are also worthy of inclusion. After all many people watch streaming HD content on their PC and modern browsers include GPU acceleration to assist the CPU. We want that to perform great too, right?
- Look for transparency. A quality review will have all the answers in it… everything done will be detailed because the reviewer should be proud to show the quality of their work, the effort put into it.
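To see why a frame latency chart tells you something an average framerate cannot, consider the sketch below. The two frame time traces are invented for illustration (they are not measurements from any card), and summarising them with a 99th percentile is just one possible approach; the nearest-rank method and the helper names are our own choices.

```python
import math

def average_fps(frame_times_ms):
    """Average framerate over a run, from per-frame render times in ms."""
    return 1000.0 * len(frame_times_ms) / sum(frame_times_ms)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical traces: both are 100 frames totalling 2000 ms
smooth = [20] * 100               # every frame takes 20 ms
stutter = [15] * 90 + [65] * 10   # mostly quick frames with occasional long ones

for name, trace in [("smooth", smooth), ("stutter", stutter)]:
    print(f"{name}: {average_fps(trace):.0f} fps average, "
          f"99th percentile frame time {percentile(trace, 99)} ms")
```

Both traces average 50fps, so a bar chart of average framerate would show them as identical, yet the second one spends a tenth of its frames at 65ms, which is what a gamer feels as stutter.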
What can we all take away from this?
We absolutely get that some new review sites cannot afford the latest high end kit but for established sites using the wrong hardware is inexcusable. Speak to your editor, have them speak to manufacturers or buy in hardware to build an appropriate system. It may seem crazy but the review we mentioned, the one using an i5 to test multiple Titans, was actually from one of the largest sites in its region… if they can’t spare one of the high end i7 CPUs that they previously tested (and yes, Intel tend to let larger sites keep them for future tests/comparisons) then there is something seriously wrong there. And that’s on the admin side, before we consider the impact on results from the lower spec CPU, or even that Z77 has potentially limited PCIe bandwidth for multi-GPU tests compared to X79.
When it comes to testing… have some pride. Think about the service you are providing and always put the reader first. What do they need/want and how do you provide it? (Almost always the answer is hard work).
In the same way that it is inexcusable for review sites to provide sub-standard testing it is ridiculous that those providing the samples don’t have more understanding and control over their product coverage. Not that a manufacturer should be able to say "you wrote a negative review, no samples in future". More that PR reps should understand how quality testing benefits them. Passing (in this case) a GPU to a site who cannot adequately test it helps no one. As one example we had a conversation recently with a rep who stated that sales of a particular high end product hadn’t been great in a region so that impacted future product sampling. Is it any wonder that this is the case when the majority of reviews in that region did the bare minimum to test the previous product?… and failed to cover more than the basic features? Why would anyone buy the product compared to another if they couldn’t see anything other than comparison framerates which were out of date? (As a side issue, next time round that same company sampled sites with, they assume, large readership rather than quality testing… kind of a case of hoping enough people would read to get some sales… but that’s a whole other article!)
So when it comes to sampling, if you work in the industry, look for examples of reviews where the writer is providing quality, relevant, thorough and up-to-date testing. If you don’t know what that is… see above and below. If you are worried about viewing figures, remove that concern by sharing the best content with your social media users, on your company website or via links on online stores for your product.
You are the most important stakeholder in this entire scenario. You pay the bills of the review sites who you visit. You keep manufacturers in business with purchases. It sadly isn’t enough to assume that each is doing all they can. Don’t just stare at graphs, pay attention to the entire article… read it and when you see a quality review, spread the word. If you don’t understand something or think that additional tests would be worthwhile join the forum/comments system and ask questions or comment. Any worthwhile site will love constructive comments/questions. (That’s another warning sign… if sites moderate their comments on articles what are they trying to hide and what valid comments are you not seeing?) Just as important as reading, commenting and getting involved in the production of better content… when you see a sub-standard article question it and spread the word of that too.
Why do we have frame latency/micro stutter graphs in our reviews now? As much as we would like to think we are all knowing and read everything, it was a post on our forum which highlighted it after a recent GPU review and upon investigating further we added it. That’s how things should work.
Basically, demand more of your content providers… look to get what you need and more from a product review.
Make no mistake, we are in no way saying our GTX Titan review is perfection, though we do stand by our methods of relevant hardware, real world testing, testing of all features, using recent games, up-to-date drivers and retesting for every review… there is always room for improvement and we are constantly looking for ways to enhance product reviews, but it is clear that this industry can improve, A LOT. We all have a part to play in that, but from a reviewer’s point of view our goal should always be to provide the most thorough, accurate and relevant content possible… at a minimum.