Literal candidates for constants
Literals are hard-coded values in the source. Once they are written, they can’t be changed without altering the source code. This is always fine initially, while building the first implementations. Yet, later, such literals may have to be managed from different points of the code. This is where constants come into play.
The hard part of defining constants is deciding when a literal is a good candidate. Depending on the maturity of the code and its connection with other remote parts of the application, defining a constant is a delicate decision : the need may never arise, or it may simply be too early to create a constant. Let’s see how we can detect some candidates with two strategies : frequency and patterns.
Literal frequency
The first approach to convert a literal to a constant is to observe its duplications. Once a literal is used multiple times throughout the code, there is a need for it to be disambiguated. Here are three octal integers, used in Exakat’s code :
- 0755 : 23 times
- 0700 : 3 times
- 0777 : 1 time
Octal integers usually have only one usage in PHP : setting the privileges when creating a new folder. Here, 0755 is the classic default value. Its usage is easy to understand, and its frequency makes it a good candidate for refactoring into a global constant.
0777 is used only once, so it might be a lone code. Indeed, it is part of an analysis called Keep Files Access Restricted, which tracks folders created with loose privileges. This literal is therefore different and, unlike 0755, it should not be turned into a constant.
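A frequency count like the one above can be approximated with PHP's tokenizer. The following is only a minimal sketch, not Exakat's implementation: countOctalLiterals() is a hypothetical helper, it only recognizes the legacy notation with a leading 0 (the 0o prefix of PHP 8.1 is ignored), and it works on a single string of source code.

```php
<?php
// Count how often each octal integer literal appears in a piece of PHP source.
// countOctalLiterals() is a hypothetical helper, not part of Exakat.
function countOctalLiterals(string $source): array
{
    $counts = [];
    foreach (token_get_all($source) as $token) {
        // Integer literals are reported as T_LNUMBER tokens.
        if (!is_array($token) || $token[0] !== T_LNUMBER) {
            continue;
        }
        // Legacy octal notation: a leading 0 followed by octal digits only.
        if (preg_match('/^0[0-7_]+$/', $token[1]) === 1) {
            $counts[$token[1]] = ($counts[$token[1]] ?? 0) + 1;
        }
    }
    arsort($counts); // most frequent literal first
    return $counts;
}

$code = '<?php mkdir($a, 0755); mkdir($b, 0755); mkdir($c, 0700); chmod($d, 0777);';
print_r(countOctalLiterals($code));
```

On this small sample, 0755 comes first with a count of 2, followed by the two literals that appear once.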
Constant disambiguation
Finally, the frequency approach applies to 0700. It looks like a stricter version of 0755, even though the literal is distinct from 0755. Manual review is needed here, and it could lead to merging it with the newly defined constant for 0755. That’s how consistency across the code is improved.
This is a process called disambiguation, where the same intent was applied with different literals, and consistency was lost in the process. Here, we have an example of literals which are different, yet could be merged under the same constant value.
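As an illustration, here is what the refactoring could look like once the review is done. This is a sketch under assumptions: the constant names and the createFolder() helper are hypothetical, and keeping a separate constant for 0700 (rather than merging it with 0755) is only one possible outcome of the manual review.

```php
<?php
// Hypothetical constant names, chosen for illustration only.
const DIR_MODE_DEFAULT = 0755; // owner: rwx, group and others: r-x
const DIR_MODE_PRIVATE = 0700; // owner: rwx, nobody else

// Hypothetical helper: every folder creation goes through a named mode.
function createFolder(string $path, int $mode = DIR_MODE_DEFAULT): bool
{
    // Reuse an existing folder, otherwise create it (recursively) with the given mode.
    return is_dir($path) || mkdir($path, $mode, true);
}
```

If the review concludes that both literals carry the same intent, DIR_MODE_PRIVATE disappears and all calls rely on the default.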
In fact, the same disambiguation could also happen with high-frequency literals. Take true, 0, 1, 10 or 256, for example. Those are used on various occasions, and they might serve distinct purposes.
For example, Exakat uses the string 'none' to configure some properties : class constant visibility, method visibility, block of code presence, or baseline existence. While it is semantically valid to use this word in each of those situations, it is recommended to distinguish the situations with different constant names.
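A possible way to express this distinction is sketched below. The constant names and the describeBaseline() function are hypothetical and are not taken from Exakat's code : the point is only that each usage gets its own name, even though the value is identical.

```php
<?php
// Hypothetical names: one constant per usage, all sharing the value 'none'.
const CONSTANT_VISIBILITY_NONE = 'none';
const METHOD_VISIBILITY_NONE   = 'none';
const BLOCK_NONE               = 'none';
const BASELINE_NONE            = 'none';

// Hypothetical function: the constant name tells which 'none' is meant.
function describeBaseline(string $baseline): string
{
    return $baseline === BASELINE_NONE ? 'no baseline' : "baseline: $baseline";
}

print describeBaseline(BASELINE_NONE);
```

With separate names, changing the value for one usage later no longer affects the three others.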
Code pattern usage
The disambiguation process leaves the programmer in a grey zone : when the frequency of a literal is low, it is not complex to detect its meaning, and it is easy to refactor the literal into a constant.
When the literal is used in many different situations, sorting the various cases is difficult. One strategy to sort them is to use pattern discovery : detect a recurring pattern in the way some literals are used, and then group all those usages together.
Literals as messages
Literals may be used to carry a simple message or state. The state is set in one part of the code, then it is later retrieved to identify it. This is the case in the following code :
<?php
// The message and its display flag are set in one place...
$message = 'Hello world'; // placeholder value
$visible = 'yes';

foo($message, $visible);

// ...and the flag is checked later, by comparing it to the same literal.
function foo($message, $visible) {
    if ($visible == 'yes') {
        print $message;
    }
}
?>
The display of the message is configured with the assignment of ‘yes’, then it is later checked again with ==.
Also, note that this could have been a boolean, or even a boolean with a constant named ‘YES’. The string helps the human coder understand the semantics of the literal. A constant would also do that work : host a literal, and give a readable description of it in the code.
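The constant version could look like the sketch below. DISPLAY_YES and DISPLAY_NO are hypothetical names, and the 'no' value is an assumption, as the original code only shows 'yes'.

```php
<?php
// DISPLAY_YES and DISPLAY_NO are hypothetical names for the message literals.
const DISPLAY_YES = 'yes';
const DISPLAY_NO  = 'no'; // assumed counterpart, not shown in the original code

function foo(string $message, string $visible): void
{
    // The comparison now relies on the same constant as the assignment.
    if ($visible === DISPLAY_YES) {
        print $message;
    }
}

foo('Hello', DISPLAY_YES);  // displays the message
foo('Hidden', DISPLAY_NO);  // displays nothing
```

A typo in the constant name is reported by PHP, while a typo in the literal silently hides the message.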
Code pattern
At that point, a pattern is easy to identify. The literal value has to be set, and the same literal has to be part of a comparison.
Following the container for the literal is possible in the above illustration, since the assignment and the comparison are close. Yet, those messages are often set up to enable communication between remote or unrelated parts of the code. So, the container of those messages can be ignored, and the search still provides interesting results.
This leads to recognizable patterns like the following (applied to the Yoast SEO WordPress plugin) :
- 1
$clean = '1'
$site->$state_slug === '1'
$type === '1'
- full
$image['size'] = 'full'
$image_size = 'full'
$size === 'full'
- on
$val = 'on'
$_SERVER['HTTPS'] === 'on'
$meta_data['wpseo_noindex_author'] === 'on'
$value === 'on'
- wpseo_manage_options
$submenu_pages[$index][3] = 'wpseo_manage_options'
$capability === 'wpseo_manage_options'
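The pattern behind this list can be sketched with PHP's tokenizer. This is a simplified illustration, not the actual Exakat rule : findMessageLiterals() is a hypothetical helper, it only looks at the token right before and right after each string literal, it ignores the container entirely, and it misses comparisons where the literal is written on the left-hand side.

```php
<?php
// findMessageLiterals() is a hypothetical helper: it lists string literals that are
// both assigned with = and compared with == or ===, whatever the container is.
function findMessageLiterals(string $source): array
{
    // Keep only meaningful tokens, so that neighbours can be inspected by index.
    $tokens = array_values(array_filter(
        token_get_all($source),
        fn ($t) => !is_array($t) || !in_array($t[0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true)
    ));

    $assigned = [];
    $compared = [];
    foreach ($tokens as $i => $token) {
        if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
            continue;
        }
        $previous = $tokens[$i - 1] ?? null;
        $next     = $tokens[$i + 1] ?? null;
        // Skip literals that are only a part of a larger string (concatenation).
        if ($previous === '.' || $next === '.') {
            continue;
        }
        if ($previous === '=') {
            $assigned[$token[1]] = true;
        } elseif (is_array($previous) && in_array($previous[0], [T_IS_EQUAL, T_IS_IDENTICAL], true)) {
            $compared[$token[1]] = true;
        }
    }
    // A candidate is a literal seen both in an assignment and in a comparison.
    return array_keys(array_intersect_key($assigned, $compared));
}

$code = <<<'CODE'
<?php
$val = 'on';
if ($_SERVER['HTTPS'] === 'on') { $scheme = 'https' . '://'; }
$size = 'full';
CODE;
print_r(findMessageLiterals($code));
```

On this sample, only 'on' is reported : 'full' is assigned but never compared, and 'https' is skipped because it is concatenated.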
Interesting takes and limitations
The shorter (or smaller) the literal, the better. When the size goes up, it becomes harder to write in the code, and error prone. This leads to the adoption of a constant, to minimize those errors.
Some of the usages of the literals are completely distinct and unrelated. The semantics of the containers may be helpful to spot false positives : $_SERVER['HTTPS'] is read-only, and $clean, $type, $site->$state_slug look very unrelated.
Some of the literals are later combined with other values, through concatenation or by being appended to an array. For example, a file extension .php or a network protocol https. Those should be skipped, as we are looking for a token, aka a piece of string that is used as a whole, not as a part of anything else (this is not illustrated here).
Conclusion
To avoid the magic number effect, setting up constants makes the code a lot more readable, while remaining surprisingly efficient to run. Constants make it possible to write meaningful code while still applying a machine-useful literal.
We have also shown how code patterns emerge in the code, and how static analysis may take advantage of them. The process is simple : identify some common behavior, turn it into an analysis rule, then run it on several pieces of code. And then, the most important step : review the results, both the ones that were expected and the other cases. Other cases are critical, as they help us understand other patterns.
One way to go further with this pattern is to identify enumerations. Even before PHP 8.1, enumerations group several constants together as a consistent group. Values are tested and mutually exclusive within the group. This often means using switch or match, or even in_array(), with a special array that groups all those enumeration cases. Could we suggest enumerations while spotting those usages? What would be the false positives?
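To make the target of such a suggestion concrete, here is a small sketch of a native enumeration, which requires PHP 8.1 or later. ImageSize, its cases and describeSize() are assumptions built around the 'full' literal seen earlier; they are not taken from any analyzed code.

```php
<?php
// Requires PHP 8.1 or later. ImageSize and its cases are illustrative assumptions.
enum ImageSize: string
{
    case Full      = 'full';
    case Medium    = 'medium';
    case Thumbnail = 'thumbnail';
}

function describeSize(ImageSize $size): string
{
    // All cases are covered; an unhandled case would throw an UnhandledMatchError.
    return match ($size) {
        ImageSize::Full => 'original dimensions',
        ImageSize::Medium, ImageSize::Thumbnail => 'resized copy',
    };
}

// tryFrom() returns null when the string is not one of the cases.
$size = ImageSize::tryFrom('full');
print $size === null ? 'unknown size' : describeSize($size);
```

The literals that used to be assigned and compared as strings become cases of one type, so the group is visible in the code itself.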
Code patterns are such a rabbit hole.