Author: Reshama Shaikh
Introduction
Use the function check_scalar for parameter validation. For a given parameter, the validation function checks that the value has an acceptable data type and that it falls within the allowed range of values (interval).
- References Issue #21927 (@reshamas)
- References Issue #20724: “Use check_scalar for parameters validation” (with notes by @glemaitre, @jjerphan, @genvalen)
- References PR #20723. “MNT use check_scalar to validate scalar in AffinityPropagation”. This is an example PR by @glemaitre.
A helper function exists in scikit-learn which validates a scalar value: sklearn.utils.check_scalar (documentation).
It is used to validate parameters of classes (and functions). Most of the current classes in scikit-learn do not use this helper function. We want to refactor the code so that it uses this standard helper, which will give consistent error types and messages.
Steps
Below, I go through an example, step by step.
Go to working directory
pwd
▶ pwd
/Users/reshamashaikh/software-build/scikit-learn
(base)
~/software-build/scikit-learn main ✔
Activate virtual environment
conda activate sklearndev
▶ conda activate sklearndev
(sklearndev)
~/software-build/scikit-learn main ✔
Sync local repo with the GitHub repo, main branch
git pull upstream main
git push origin main
▶ git pull upstream main
From github.com:scikit-learn/scikit-learn
* branch main -> FETCH_HEAD
Already up to date.
(sklearndev)
~/software-build/scikit-learn main ✔ 1d
▶ git push origin main
Everything up-to-date
(sklearndev)
~/software-build/scikit-learn main ✔ 1d
▶
Create a new working branch, from main branch
git checkout main
git checkout -b xscalar_glm
▶ git checkout main
Already on 'main'
Your branch is up to date with 'origin/main'.
(sklearndev)
~/software-build/scikit-learn main ✔ 1d
▶ git checkout -b xscalar_glm
Switched to a new branch 'xscalar_glm'
(sklearndev)
~/software-build/scikit-learn xscalar_glm ✔ 1d
▶
Identify a class to implement check_scalar function
To find an algorithm which may need to implement the check_scalar function, I searched the repo scikit-learn/scikit-learn for max_iter, as a start. I found a constructor that has scalar numeric parameters.
I found:
- File: sklearn/linear_model/_glm/glm.py
- Associated test: sklearn/linear_model/_glm/tests/test_glm.py
Identify the scalar numeric parameters
For glm.py, I found four classes in the file:
GeneralizedLinearRegressor
PoissonRegressor
GammaRegressor
TweedieRegressor
I will begin work on the first one, GeneralizedLinearRegressor. Also, for each parameter I will look at the minimum and maximum values. If minimum and maximum values are missing, I will add them, as well as the boundary conditions.
Within the class GeneralizedLinearRegressor, I identify the following scalar numeric parameters:
- alpha, value range: [0.0, inf)
- max_iter, value range: [1, inf)
- tol, value range: (0.0, inf)
- verbose, value range: [0, inf) (the default value is 0, so 0 must be allowed)
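The interval notation matters because a square bracket includes the boundary and a round bracket excludes it. The sketch below encodes three of the ranges listed above as (min_val, max_val, include_boundaries) triples and tests membership. The PARAM_RANGES table and the in_range function are hypothetical names for this illustration, with the triple layout chosen to echo the arguments that check_scalar takes.

```python
import math

# Assumed encoding: (min_val, max_val, include_boundaries).
# None means "no bound on that side".
PARAM_RANGES = {
    "alpha": (0.0, None, "left"),     # [0.0, inf)
    "max_iter": (1, None, "left"),    # [1, inf)
    "tol": (0.0, None, "neither"),    # (0.0, inf)
}


def in_range(value, min_val, max_val, include_boundaries):
    """Return True if value lies in the interval described by the arguments."""
    lower = -math.inf if min_val is None else min_val
    upper = math.inf if max_val is None else max_val
    left_closed = include_boundaries in ("left", "both")
    right_closed = include_boundaries in ("right", "both")
    above = value >= lower if left_closed else value > lower
    below = value <= upper if right_closed else value < upper
    return above and below


print(in_range(0.0, *PARAM_RANGES["alpha"]))  # boundary included
print(in_range(0.0, *PARAM_RANGES["tol"]))    # boundary excluded
```

So alpha=0.0 is acceptable while tol=0.0 is not, even though both ranges start at 0.0.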
Tests
Tests and validation
Parameter validation checks are added in order to catch any invalid parameter values passed into the estimator before the algorithm is run. If no parameter validation exists, we are left to the mercy of the algorithm. For instance, if the algorithm receives a negative number for the maximum number of iterations, it may fail deep inside the solver with an unhelpful error, or it may run without raising anything at all.
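To make the "mercy of the algorithm" point concrete, here is a toy loop, not scikit-learn code, in which an invalid max_iter raises nothing. The function toy_fit is hypothetical and stands in for any iterative solver.

```python
def toy_fit(max_iter):
    """Hypothetical iterative routine with no parameter validation."""
    n_iter = 0
    estimate = 0.0
    for _ in range(max_iter):
        # Each step moves the estimate halfway towards the target value 1.0.
        estimate += 0.5 * (1.0 - estimate)
        n_iter += 1
    return estimate, n_iter


print(toy_fit(10))  # runs ten update steps
print(toy_fit(-1))  # runs zero steps and returns the initial guess
```

A caller who passes -1 gets back an untrained result and no hint that anything went wrong, which is the situation an up-front check prevents.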
scikit-learn has thorough validation checks. With the use of the helper function check_scalar, these validation checks can be refactored for greater consistency and readability.
Tests are added to make sure that parameter validation checks behave correctly. In the case of creating tests for check_scalar, the tests check that the check_scalar validation raises a ValueError or a TypeError where appropriate, and that the error message returned is as expected.
If no tests exist for the parameter validation, add tests. Note that even if the tests do not exist, the validation definitely does.
See if tests exist
In the file test_glm.py, I see that the following test exists. It checks 5 invalid inputs, but expects the same ValueError message for all of them:
@pytest.mark.parametrize("max_iter", ["not a number", 0, -1, 5.5, [1]])
def test_glm_max_iter_argument(max_iter):
    """Test GLM for invalid max_iter argument."""
    y = np.array([1, 2])
    X = np.array([[1], [2]])
    glm = GeneralizedLinearRegressor(max_iter=max_iter)
    with pytest.raises(ValueError, match="must be a positive integer"):
        glm.fit(X, y)
In this case, these are invalid values for max_iter: ["not a number", 0, -1, 5.5, [1]]
- “not a number”: invalid type (string), should be integer
- 5.5: invalid type (float), should be integer
- [1]: invalid type (list), should be integer
- 0: iterations should be > 0
- -1: iterations should be > 0
So, here we have 5 tests to run. And, our tests should give informative error messages.
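The five cases split into two groups: wrong type and wrong value. The sketch below sorts them using numbers.Integral, the same target type that appears in the updated test further down. It assumes the type is checked first and the range second, and expected_error is a hypothetical helper written only for this illustration.

```python
import numbers

invalid_values = ["not a number", 0, -1, 5.5, [1]]


def expected_error(value):
    """Return the error class a type-then-range check would raise for max_iter."""
    if not isinstance(value, numbers.Integral):
        return TypeError   # wrong type: str, float, list, ...
    if value < 1:
        return ValueError  # right type, but outside [1, inf)
    return None            # valid value


for value in invalid_values:
    print(repr(value), "->", expected_error(value).__name__)
```

Note that 5.5 lands in the TypeError group, not the ValueError group: a float is rejected for its type before its value is ever compared with the bound.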
In the glm.py file, I temporarily comment out whatever checks exist for valid values (validation) of max_iter.
# if not isinstance(self.max_iter, numbers.Integral) or self.max_iter <= 0:
#     raise ValueError(
#         "Maximum number of iteration must be a positive "
#         "integer;"
#         " got (max_iter={0!r})".format(self.max_iter)
#     )
Then, I run the existing test test_glm_max_iter_argument:
pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_max_iter_argument -vsl
I see that 5 tests have failed:
max_iter='not a number'
> if n_iterations >= maxiter:
E TypeError: '>=' not supported between instances of 'int' and 'str'
../../miniforge3/envs/sklearndev/lib/python3.9/site-packages/scipy/optimize/lbfgsb.py:367: TypeError
max_iter=0
> glm.fit(X, y)
E Failed: DID NOT RAISE <class 'ValueError'>
sklearn/linear_model/_glm/tests/test_glm.py:150: Failed
max_iter=-1
> glm.fit(X, y)
E Failed: DID NOT RAISE <class 'ValueError'>
sklearn/linear_model/_glm/tests/test_glm.py:150: Failed
max_iter=5.5
> glm.fit(X, y)
E Failed: DID NOT RAISE <class 'ValueError'>
sklearn/linear_model/_glm/tests/test_glm.py:150: Failed
max_iter=[1]
> if n_iterations >= maxiter:
E TypeError: '>=' not supported between instances of 'int' and 'list'
../../miniforge3/envs/sklearndev/lib/python3.9/site-packages/scipy/optimize/lbfgsb.py:367: TypeError
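The two kinds of failure above can be reproduced with plain Python comparisons, without SciPy. The sketch mirrors the n_iterations >= maxiter comparison shown in the traceback; the helper compare is made up for illustration.

```python
def compare(n_iterations, maxiter):
    """Mimic the comparison from the traceback; return its result or the error text."""
    try:
        return n_iterations >= maxiter
    except TypeError as exc:
        return f"TypeError: {exc}"


for maxiter in ["not a number", 0, -1, 5.5, [1]]:
    print(repr(maxiter), "->", compare(0, maxiter))
```

Strings and lists fail with a TypeError from the comparison itself, while 0, -1 and 5.5 compare without complaint, which is consistent with the DID NOT RAISE results.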
Add parametrized tests
The tests must fail before adding validation. This is an example of how we will add a parametrized test:
Current:
@pytest.mark.parametrize("max_iter", ["not a number", 0, -1, 5.5, [1]])
def test_glm_max_iter_argument(max_iter):
    """Test GLM for invalid max_iter argument."""
    y = np.array([1, 2])
    X = np.array([[1], [2]])
    glm = GeneralizedLinearRegressor(max_iter=max_iter)
    with pytest.raises(ValueError, match="must be a positive integer"):
        glm.fit(X, y)
We will update the test as follows:
@pytest.mark.parametrize(
    "params, err_type, err_msg",
    [
        ({"max_iter": 0}, ValueError, "max_iter == 0, must be >= 1"),
        ({"max_iter": -1}, ValueError, "max_iter == -1, must be >= 1"),
        (
            {"max_iter": "not a number"},
            TypeError,
            "max_iter must be an instance of <class 'numbers.Integral'>, not <class"
            " 'str'>",
        ),
        (
            {"max_iter": [1]},
            TypeError,
            "max_iter must be an instance of <class 'numbers.Integral'>,"
            " not <class 'list'>",
        ),
        (
            {"max_iter": 5.5},
            TypeError,
            "max_iter must be an instance of <class 'numbers.Integral'>,"
            " not <class 'float'>",
        ),
    ],
)
def test_glm_scalar_argument(params, err_type, err_msg):
    """Test GLM for invalid max_iter argument."""
    y = np.array([1, 2])
    X = np.array([[1], [2]])
    glm = GeneralizedLinearRegressor(**params)
    with pytest.raises(err_type, match=err_msg):
        glm.fit(X, y)
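One detail worth knowing when writing err_msg: pytest.raises treats match as a regular expression and searches for it in the string form of the exception. The stdlib-only sketch below imitates that behaviour with re.search. The function raises_matching is a hypothetical stand-in, not pytest's implementation, and the message raised by bad_max_iter is only modelled on the ones in the test above.

```python
import re


def raises_matching(func, err_type, pattern):
    """Return True if func() raises err_type with a message that pattern matches."""
    try:
        func()
    except err_type as exc:
        return re.search(pattern, str(exc)) is not None
    return False


def bad_max_iter():
    raise ValueError("max_iter == 0, must be >= 1.")


# A partial message is enough, because re.search looks anywhere in the text.
print(raises_matching(bad_max_iter, ValueError, "must be >= 1"))
# re.escape guards against regex metacharacters in a literal message.
print(raises_matching(bad_max_iter, ValueError, re.escape("max_iter == 0, must be >= 1.")))
# An outdated message no longer matches.
print(raises_matching(bad_max_iter, ValueError, "must be a positive integer"))
```

This is why the old expectation "must be a positive integer" stops matching once the validation produces the new message format.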
I run the tests.
Note: I have renamed the test function.
pytest sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument
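The tests fail, as expected, because the check_scalar validation has not been added yet. For example, max_iter=5.5 still raises the old ValueError where the updated test expects a TypeError: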
E ValueError: Maximum number of iteration must be a positive integer; got (max_iter=5.5)
sklearn/linear_model/_glm/glm.py:232: ValueError
==================================================== 5 failed in 0.59s =====================================================
(sklearndev)
Add and run validation
Next, in the glm.py file, I do two things:
- Import the needed function:
from ...utils import check_scalar
- Add the check_scalar function in the def fit function. The function here checks that max_iter:
  - is an integer
  - has a minimum value of 1
  - has no maximum value
  - is within this range: [1, inf). Note that no upper bound is specified.
check_scalar(
    self.max_iter,
    name="max_iter",
    target_type=numbers.Integral,
    min_val=1,
    max_val=None,
    include_boundaries="left",
)
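To show where such a call sits in an estimator, here is a self-contained sketch that does not import scikit-learn. ToyRegressor and validate_scalar are hypothetical names. validate_scalar is a simplified stand-in for check_scalar that only supports a lower bound, with messages modelled on the ones expected in the test above. The point is the placement: the parameter is stored untouched in __init__ and validated at the start of fit.

```python
import numbers


def validate_scalar(x, name, target_type, min_val):
    """Simplified stand-in for check_scalar: type check first, then lower bound."""
    if not isinstance(x, target_type):
        raise TypeError(f"{name} must be an instance of {target_type}, not {type(x)}.")
    if x < min_val:
        raise ValueError(f"{name} == {x}, must be >= {min_val}.")
    return x


class ToyRegressor:
    """Hypothetical estimator illustrating validation inside fit."""

    def __init__(self, max_iter=100):
        # No validation here: the constructor only stores the parameter.
        self.max_iter = max_iter

    def fit(self, X, y):
        # Validate before any computation starts.
        validate_scalar(self.max_iter, "max_iter", numbers.Integral, min_val=1)
        self.n_iter_ = self.max_iter  # placeholder for the real fitting work
        return self


try:
    ToyRegressor(max_iter=0).fit([[1], [2]], [1, 2])
except ValueError as exc:
    print(exc)
```

Constructing ToyRegressor(max_iter=0) succeeds; the error only appears when fit is called, which is also when the tests in this post trigger it.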
Confirm tests are passing!
After doing the above, we see that all 5 tests are now passing:
~/software-build/scikit-learn xscalar_glm ✔ 8d
▶ pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_scalar_argument -vsl
=========================================================== test session starts ============================================================
platform darwin -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/reshamashaikh/miniforge3/envs/sklearndev/bin/python
cachedir: .pytest_cache
rootdir: /Users/reshamashaikh/software-build/scikit-learn, configfile: setup.cfg
plugins: cov-3.0.0
collected 78 items / 73 deselected / 5 selected
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params0-ValueError-max_iter == 0, must be >= 1] PASSED
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params1-ValueError-max_iter == -1, must be >= 1] PASSED
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params2-TypeError-max_iter must be an instance of <class 'numbers.Integral'>, not <class 'str'>] PASSED
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params3-TypeError-max_iter must be an instance of <class 'numbers.Integral'>, not <class 'list'>] PASSED
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params4-TypeError-max_iter must be an instance of <class 'numbers.Integral'>, not <class 'float'>] PASSED
===================================================== 5 passed, 73 deselected in 0.23s =====================================================
(sklearndev)
Reminders
When submitting the pull request (PR):
- Label PR with prefix “MAINT”
- A changelog entry is not required
Resources
Rebuild source code
If tests are failing, I may need to rebuild the source code, using the command below:
pip install -e . --no-build-isolation -v
Run full test suite in sklearn
Running the full suite of tests takes about 20 minutes on my computer.
pytest sklearn
There is example output of the tests in 2021-12-12-pytest_sklearn_output.md
E AssertionError:
E This test fails because scikit-learn has been built without OpenMP.
E This is not recommended since some estimators will run in sequential
E mode instead of leveraging thread-based parallelism.
E
E You can find instructions to build scikit-learn with OpenMP at this
E address:
E
E https://scikit-learn.org/dev/developers/advanced_installation.html
E
E You can skip this test by setting the environment variable
E SKLEARN_SKIP_OPENMP_TEST to any value.
E
E assert False
E + where False = _openmp_parallelism_enabled()
sklearn/tests/test_build.py:33: AssertionError
===== 1 failed, 25839 passed, 205 skipped, 250 xfailed, 62 xpassed, 2290 warnings in 1002.24s (0:16:42) ======
(sklearndev)
~/software-build/scikit-learn xscalar_glm ✔
Running Individual Tests
Typically, to run the full test suite, I would type pytest sklearn, which takes about 20 minutes.
Individual tests can be run in a couple of ways, using the syntax below:
pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_max_iter_argument -vsl
pytest sklearn/linear_model/_glm/tests/test_glm.py::test_glm_max_iter_argument
This is the output observed after running the test.
▶ pytest sklearn/linear_model/_glm/tests/test_glm.py::test_glm_max_iter_argument
=================================================== test session starts ====================================================
platform darwin -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /Users/reshamashaikh/software-build/scikit-learn, configfile: setup.cfg
plugins: cov-3.0.0
collected 5 items
sklearn/linear_model/_glm/tests/test_glm.py ..... [100%]
==================================================== 5 passed in 0.17s =====================================================
(sklearndev)
~/software-build/scikit-learn xscalar_glm ✔
Because I consolidated some existing tests and added new ones, I renamed the test. To run the renamed test, I would use:
pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_scalar_argument -vsl
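The -k option used above selects tests by keyword. As a rough mental model, a single keyword behaves like a case-insensitive substring match on the test name. The sketch below is a deliberate simplification (real pytest also supports boolean expressions and matches other names, such as modules and classes), and select is a hypothetical function, not part of pytest.

```python
test_names = [
    "test_glm_max_iter_argument",
    "test_glm_scalar_argument",
]


def select(keyword, names):
    """Simplified model of -k with one keyword: case-insensitive substring match."""
    return [name for name in names if keyword.lower() in name.lower()]


print(select("test_glm_scalar_argument", test_names))
print(select("max_iter", test_names))
# Not a contiguous substring of either name, so nothing is selected.
print(select("test_max_iter_argument", test_names))
```

A keyword has to appear contiguously in the test name, so dropping a word from the middle of a name selects nothing.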
Acknowledgements
- Guillaume Lemaitre @glemaitre
- Julien Jerphanion @jjerphan
- Thomas J. Fan @thomasjpfan
- Genesis Valencia @genvalen
Part 2: PoissonRegressor
- Virtual environment activated: conda activate sklearndev
- Identify class to work on: PoissonRegressor
- Working with this file: sklearn/linear_model/_glm/glm.py
- Working with associated test: sklearn/linear_model/_glm/tests/test_glm.py
- Create working branch from main branch:
  git checkout main
  git pull upstream main
  git checkout -b xscalar_poissonreg
- Identify scalar numerical parameters and the valid range of values for the class PoissonRegressor:
  - alpha, value range: [0.0, inf)
  - max_iter, value range: [1, inf)
  - tol, value range: (0.0, inf)
  - verbose, value range: [0, inf)
- Add parameter interval ranges to the docstring:
  - alpha: Values should be in the range [0.0, inf).
  - max_iter: Values should be in the range [1, inf).
  - tol: Values should be in the range (0.0, inf).
  - verbose: Values should be in the range [0, inf).
- Run tests: pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_scalar_argument -vsl
- There is no def fit for class PoissonRegressor; it inherits fit from GeneralizedLinearRegressor.
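Because PoissonRegressor defines no fit of its own, the fit it runs is the inherited one, so validation placed in the parent's fit covers it too. The sketch below shows that mechanism with two hypothetical classes, BaseGLM and ToyPoisson. It is plain Python and makes no claim about the actual class bodies in scikit-learn.

```python
class BaseGLM:
    """Hypothetical parent class whose fit validates max_iter."""

    def __init__(self, max_iter=100):
        self.max_iter = max_iter

    def fit(self, X, y):
        if not isinstance(self.max_iter, int):
            raise TypeError("max_iter must be an integer")
        if self.max_iter < 1:
            raise ValueError("max_iter must be >= 1")
        return self


class ToyPoisson(BaseGLM):
    """Hypothetical subclass that defines no fit of its own."""


print("fit" in vars(ToyPoisson))      # no fit defined on the subclass itself
print(ToyPoisson.fit is BaseGLM.fit)  # the inherited method is the one used

try:
    ToyPoisson(max_iter=0).fit([[1], [2]], [1, 2])
except ValueError as exc:
    print(exc)
```

In practice this means the existing test_glm_scalar_argument style of test can be pointed at the subclass without adding a second copy of the validation.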