Pythoscope Proposal

This is the original proposal that appeared on the testing-in-python mailing list on August 18th, 2008.

Signed off by:
Titus Brown
Grig Gheorghiu
Paul Hildebrandt
Michal Kwiatkowski

Our mission statement

To create an easily customizable and extensible open source tool that will automatically, or semi-automatically, generate unit tests for legacy systems written in Python.

Slogan ;-)

Pythoscope. Your way out of The (Lack of) Testing Death Spiral[1].

Milestones

The milestones listed below are there to give you a general idea of where we stand and where we want to go. Having said that, we plan to work on the system the agile way, with requirements getting fleshed out (and, undoubtedly, numerous problems popping up) as we go. We definitely want to keep our goals realistic and to arrive quickly at the point where our work is helpful to at least some of the real projects out there. We hope to work closely with the Python testing community in order to keep the project on the right track.

A rather tentative schedule for the milestones follows. We want to complete milestone 2 pretty quickly, to start working with code as soon as possible. Our plan is to complete milestone 6 in about a month and then start working on what now looks like the hardest problem: side effects.

  • Milestone 1 (The proposal): done
  • Milestone 2 (Architecture): August 20th
  • Milestone 3 (Static analysis): August 31st
  • Milestone 4 (Dynamic analysis): September 7th
  • Milestone 5 (Setup & teardown): September 14th
  • Milestone 6 (Side effects): September 21st

Milestone 1: Write a proposal and introduce the project to the Python community.

At the time of this writing, this milestone has just been completed. :-)

Milestone 2: Decide on an initial architecture.

In terms of architecture I basically see the system divided into two parts. The first part's responsibility is to collect and analyze information about the legacy code and store it on disk. After that the second component jumps in and uses this information to generate unit tests. This separation is nice in many ways. First of all, it clearly isolates responsibilities. Second, it allows us to rerun the parts independently. So, whether we want the tool to gather new information from recently changed source code or to start from scratch with unit test stubs for some old class, we can do it without touching the other part of the system.

This separation should be understood mostly at the conceptual level. Both parts will surely share some library code, and they may even end up being invoked through the same script, using an appropriate command-line flag. The distinction is important because we may end up using the collected information for things other than unit test generation, like, for example, a powerful source code browser, a debugger, or a refactoring tool. That is a possible, but not certain, future. For now we'll focus our attention on test generation, because we feel this is the area of the Python toolbox that needs improvement the most.

The information collector will accept directories, files and points of entry (see the dynamic code analysis description in milestone 4) to produce a comprehensive catalog of information about the legacy system. This includes things like the names of modules, classes, methods and functions, the types of values passed and returned during execution, exceptions raised, side effects invoked, and more, depending on the needs of the test generator. This is the part of the system that will require the most work. The dynamic nature of Python, while it gives us a lot of leverage and freedom, introduces specific challenges related to code analysis. This will be a fun project, of that I'm sure. :-)
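
To make the split more concrete, here is a tiny sketch of how the two halves could hand information to each other through a file on disk. The file name, the pickle format and the shape of the catalog are illustrative assumptions only, not a decided interface:

    # Illustrative only: the collector writes its catalog to disk, the
    # generator reads it back later.  Names and format are placeholders.
    import pickle

    def save_catalog(catalog, path=".pythoscope-info"):
        """Collector side: persist everything learned about the legacy code."""
        with open(path, "wb") as storage:
            pickle.dump(catalog, storage)

    def load_catalog(path=".pythoscope-info"):
        """Generator side: reread the catalog without re-analyzing the code."""
        with open(path, "rb") as storage:
            return pickle.load(storage)

    # A catalog could start out as simple as a nested dictionary:
    #   {'parser.py': {'classes': {'Parser': ['parse', 'reset']},
    #                  'functions': ['main']}}

Because the catalog lives on disk, either half can be rerun on its own: refresh the information for recently changed modules, or regenerate stubs from data gathered earlier.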

Milestone 3: Explore the test generation idea, ignoring the problem of side effects for now, using static analysis only.

In this milestone we'll breathe life into the project. We'll focus on implementing the foundations of the architecture. The analyzer will only look at the code statically, with just enough code to support test stub generation. The generator will write test stubs to the designated output directory for the specified modules, classes, methods and stand-alone functions. The main way to interact with the scripts will be to pass arguments to them, although we should keep the door open for the later addition of configuration files. I feel those aren't necessary at such an early stage. Once we get a better idea of what should be configurable and to what extent, we'll think about configuration files.

Quite a few tools have explored Python source code analysis. They include:

  • source code style checkers, like pylint[2], PyChecker[3], and PyFlakes[4]
  • CodeInvestigator[5] debugger
  • Python4Ply[6], lexer and parser for Python
  • cyclomatic complexity analyzer[7]
  • type annotators, like PyPy type annotator[8] and annotate script[9]
  • Cheesecake codeparser.py[10]

So we have plenty of examples of using the stdlib's (now deprecated) compiler module[11]. Any thoughts on whether we should use the new _ast module[12] are appreciated. My guess is that we shouldn't, because we don't want to force people into using the latest stable version of Python. They may not have a choice while working on their legacy applications.
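
For a rough idea of the kind of static pass involved, here is a minimal sketch that walks a module's syntax tree and turns the result into unittest stubs. It uses the newer ast module purely for brevity (a compiler-module visitor would look much the same), and every name in it is a placeholder rather than a decided interface:

    # Illustrative sketch: collect top-level classes, methods and functions
    # from a module's source, then emit failing unittest stubs for them.
    import ast

    def collect_info(source):
        """Return a catalog of top-level classes, methods and functions."""
        tree = ast.parse(source)
        info = {'functions': [], 'classes': {}}
        for node in tree.body:
            if isinstance(node, ast.FunctionDef):
                info['functions'].append(node.name)
            elif isinstance(node, ast.ClassDef):
                info['classes'][node.name] = [child.name for child in node.body
                                              if isinstance(child, ast.FunctionDef)]
        return info

    def generate_stubs(info):
        """Turn the catalog into the text of a unittest module full of stubs."""
        lines = ["import unittest", ""]
        for klass, methods in sorted(info['classes'].items()):
            lines.append("class Test%s(unittest.TestCase):" % klass)
            for method in methods:
                lines.append("    def test_%s(self):" % method)
                lines.append("        assert False  # TODO: write this test")
            lines.append("")
        for function in info['functions']:
            lines.append("class Test_%s(unittest.TestCase):" % function)
            lines.append("    def test_%s(self):" % function)
            lines.append("        assert False  # TODO: write this test")
            lines.append("")
        return "\n".join(lines)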

Milestone 4: Dynamic code analysis, generation of basic test cases.

In this milestone we'll enter the realm of dynamic code tracking and analysis. Using points of entry provided by the user we'll execute real code and gather information about the values passed and returned (both normally and through exceptions). From that we'll generate actual characterization test cases, with all the values gathered during the run. Initially we'll stick with basic Python types, and maybe simple derivatives, leaving complicated object creation for the next milestone.

We have a smaller, but strong, set of examples of code tracing. Those include coverage tools like coverage[13] and figleaf[14], as well as the type annotators mentioned earlier. I have also developed a proof-of-concept tool called ifrit, available here: http://pythoscope.org/local--files/download/ifrit-r28.tar.gz .
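
To give a feel for what the dynamic part involves, here is a minimal tracing sketch built on sys.settrace; the structure of the recorded entries and all of the names are illustrative assumptions only:

    # Illustrative only: record (function name, arguments, return value)
    # for every Python call made while executing a point of entry.
    import sys

    call_log = []

    def make_local_tracer(record):
        def local_tracer(frame, event, arg):
            if event == 'return':
                record['returned'] = arg
            return local_tracer
        return local_tracer

    def global_tracer(frame, event, arg):
        if event == 'call':
            record = {'name': frame.f_code.co_name,
                      'args': dict(frame.f_locals),   # arguments at call time
                      'returned': None}
            call_log.append(record)
            return make_local_tracer(record)

    def run_point_of_entry(entry_point, *args, **kwargs):
        """Execute user-provided code under the tracer and return the log."""
        sys.settrace(global_tracer)
        try:
            entry_point(*args, **kwargs)
        finally:
            sys.settrace(None)
        return call_log

Records like these, filtered down to the objects under test, are exactly the raw material that characterization test cases can be generated from.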

Milestone 5: Code with setup (still no side effects).

During the dynamic run we'll get actual values, not necessarily with any idea of how to build them. We'll need a good quasi-serialization mechanism in place that will basically take a live object and turn it into code that creates it. In this milestone we'll focus on creating that mechanism in order to generate test cases with proper setup and teardown. Generated code will include mocks, as depending on what we're testing we may not need to create real objects. We'll probably use one of the existing mocking solutions, since there are so many of them. ;-)
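
As a very rough illustration of what quasi-serialization could look like (the naive constructor-with-keyword-arguments fallback is purely an assumption, not the planned mechanism):

    # Illustrative sketch: turn a live object into a string of Python code
    # that recreates it inside a test's setup.  Not a real design.
    def construction_code(obj):
        basic = (int, float, str, bool, type(None))
        if isinstance(obj, basic):
            return repr(obj)              # literals reproduce themselves
        if isinstance(obj, (list, tuple, set, dict)):
            return repr(obj)              # good enough while contents are basic
        # Fall back to calling the class constructor with the object's
        # attributes -- a naive guess that a later milestone would have to
        # refine, or replace with a mock object.
        klass = obj.__class__.__name__
        attrs = ", ".join("%s=%s" % (name, construction_code(value))
                          for name, value in sorted(vars(obj).items()))
        return "%s(%s)" % (klass, attrs)

A real mechanism would additionally have to deal with cycles, objects that can't be rebuilt from their attributes, and the decision of when a mock is preferable to a real object.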

Milestone 6: Differentiate between pure and "side effecty" code.

As we slowly move into the domain of code with side effects, the first thing we'll need is a method to differentiate "pure" code from "side effecty" code. For all practical purposes we'll treat any code that doesn't affect the outside world in any way as pure. Whether it assigns values or destructively modifies its local variables doesn't matter to us, as long as it keeps those operations encapsulated. Sample sources of side effects we're interested in are global/class variables, the file system, databases, and I/O operations. Finding out whether a module is import-safe may also be a problem we spend some time on.
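
Just to make the distinction concrete, a purity check could start out as naively as the sketch below, which flags a function when its syntax tree contains a global statement or a call into a few well-known I/O entry points. The heuristic and the lists of names are placeholders for illustration only:

    # Naive illustration of a purity check; the lists of suspicious names
    # are placeholders, not a real design.
    import ast

    SUSPICIOUS_CALLS = {'open', 'input', 'raw_input', 'print'}
    SUSPICIOUS_MODULES = {'os', 'sys', 'socket', 'shutil'}

    def looks_side_effecty(function_source):
        tree = ast.parse(function_source)
        for node in ast.walk(tree):
            if isinstance(node, ast.Global):
                return True
            if isinstance(node, ast.Call):
                func = node.func
                if isinstance(func, ast.Name) and func.id in SUSPICIOUS_CALLS:
                    return True
                if isinstance(func, ast.Attribute) and \
                        isinstance(func.value, ast.Name) and \
                        func.value.id in SUSPICIOUS_MODULES:
                    return True
        return False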

Future: Tackle the different kinds of side effects.

We'll try to handle each different type of side effect one by one. This will probably include coding around some common Python libraries that deal with those side effects. Ideally we'll want to come up with a simple interface for describing external state and the side effects related to it, in a way that lets you configure the system to your specific project's needs. Practically every legacy system out there has its own custom database/network/you-name-it library. We would like to make the process of customizing the tool to specific projects as painless as possible.

Future: Quickcheck-style fuzzy testing, deriving code contracts.

Another idea that is worth exploring is taking information about values passed around and deriving Eiffel-style[15] contracts for methods and functions. It would work like this:

  1. Generate random input of some chosen type. We could use function contract information gathered earlier, but if that's not available we can continue anyway. Not only the values of the arguments should vary, but their number as well (important for testing functions with optional arguments).
  2. Call the function with generated input.
  3. Record the result.
  4. Generate a test case based on this.

Test cases don't have to be generated immediately; I'd rather see the runs grouped by result (into equivalence classes[16]) and turned into separate test cases.
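
A minimal sketch of such a fuzzing loop might look as follows; the random input generator, the number of runs and the grouping key are all assumptions made just for illustration:

    # Illustrative quickcheck-style loop: throw random inputs at a function,
    # record what comes back, and group the runs by result.
    import random

    def random_argument():
        """Very crude random input of a randomly chosen basic type."""
        return random.choice([
            random.randint(-1000, 1000),
            random.random(),
            "".join(random.choice("abc ") for _ in range(random.randint(0, 10))),
            None,
        ])

    def fuzz(function, runs=100, max_args=3):
        classes = {}   # outcome -> list of argument tuples that produced it
        for _ in range(runs):
            args = tuple(random_argument()
                         for _ in range(random.randint(0, max_args)))  # vary arity too
            try:
                outcome = ('returned', repr(function(*args)))
            except Exception as error:
                outcome = ('raised', type(error).__name__)
            classes.setdefault(outcome, []).append(args)
        return classes

    # Each key of the returned dictionary is a candidate equivalence class;
    # one test case per class, with a representative input, could then be
    # generated.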

Using this method we'll be able to come up with new test cases without any user interaction, possibly going beyond normal system usage and capturing "accidental" system behavior, which I'm guessing could be a real time saver for legacy systems.

Design for testability obstacle

One of the greatest conceptual problems seems to be the fact that to make testing possible, be it hand-written or generated, the system has to be designed for testability. Michael Feathers' book "Working Effectively with Legacy Code" is all about that topic. It suggests a number of code refactorings that one can use to get a given piece of code under a test harness. The book uses C++, Java and C# as examples, and fortunately a lot of the problems listed don't apply to Python. The dynamic nature of Python, duck typing, the lack of real private variables and other so-called "unsafe" features give us a lot of leverage here. We'll be dealing with this problem starting from milestone 5 ("code with setup"), so we should know pretty early whether my worries are justified. Any thoughts on the issue would be most appreciated.

References

[1] See The Death Spiral blog post: http://ivory.idyll.org/blog/mar-08/software-quality-death-spiral.html
[2] http://www.logilab.org/857
[3] http://pychecker.sourceforge.net/
[4] http://divmod.org/trac/wiki/DivmodPyflakes
[5] http://codeinvestigator.googlepages.com/main
[6] http://www.dalkescientific.com/Python/python4ply.html
[7] http://www.traceback.org/2008/03/31/measuring-cyclomatic-complexity-of-python-code/
[8] http://codespeak.net/pypy/dist/pypy/doc/translation.html#annotator
[9] http://www.partiallydisassembled.net/blog/?item=166
[10] http://pycheesecake.org/browser/trunk/cheesecake/codeparser.py
[11] http://docs.python.org/lib/module-compiler.html
[12] Only for Python 2.5 and higher, http://docs.python.org/dev/library/_ast
[13] http://nedbatchelder.com/code/modules/coverage.html
[14] http://darcs.idyll.org/~t/projects/figleaf/doc/
[15] See http://en.wikipedia.org/wiki/Design_by_contract
[16] Equivalence classes: groups of inputs that should result in the same output or that should exercise the same logic in the system. By organizing inputs in this manner we can focus tests on boundary values of those classes.