Tutorial II: Handling Type System (part 1):

Parsing files are useful, but we quickly need to do some type checking on our input to do some advanced DSL. To handle this problem, the package pyrser.type_system provide what we need.

1- Type semantics

We provide some classes to describe types.

pyrser.type_system.Symbol: A Symbol represents a thing in our language.

pyrser.type_system.Signature: A Signature is an abstract type common to Val, Var and Fun. It is the common denominator of the typing system and provides the capability to get a string representation of a symbol.

pyrser.type_system.Val: A Val represents a litteral value in our language.

pyrser.type_system.Var: A Var represents a named variable in our language.

pyrser.type_system.Fun: A Val represents a named function in our language.

pyrser.type_system.Scope: A Scope represents a scope or a type (ADT or Abstract Data Type).

We could notice that a Scope could be of three kind:

FREE: This is a standalone scope.

LINKED: This scope is connected to a parent scope. So type resolution is forwarded to parent if it failed.

EMBEDDED: This scope is a subscope of a parent scope. An embedded scope is seen as an extension of the parent scope. So when we iterate thru symbols in this scope we reached also symbols present in the parent scope. This is useful in certain case but problematic for other (typically pyrser.type_system.Scope.get_by_params()).

Basically we could use the package like this:

from pyrser.type_system import *

t1 = Type('int')
t2 = Type('double')
var = Var('var1', 'int')
f1 = Fun('fun1', 'int', [])
f2 = Fun('fun2', 'int', ['char'])
f3 = Fun('fun3', 'int', ['int', 'double'])
scope = Scope(sig=[t1, t2, var, f1, f2, f3])
print(str(scope))

And produce the following output:

scope :
    type double
    fun fun1 : () -> int
    fun fun2 : (char) -> int
    fun fun3 : (int, double) -> int
    type int
    var var1 : int
...

We’re actually generating the signatures of one variable and three functions and add them to an unnamed pyrser.type_system.Scope, thus creating an anonymous scope that could be our language’s global scope. This is the reason why we instantiate the pyrser.type_system.Scope object using the keyword sig (also second positionnal argument): by not giving a first parameter which is an identifier naming the scope, we anonymize it. If we wanted to name it, we could have created it as follows:

scope = Scope('namespace1', [t1, t2, var, f1, f2, f3])

Or, after creating the object, we can attribute the proper name:

scope.set_name('namespace1')
print(str(scope))

that would produce the output:

...
scope namespace1 :
    type namespace1.double
    fun namespace1.fun1 : () -> int
    fun namespace1.fun2 : (char) -> int
    fun namespace1.fun3 : (int, double) -> int
    type namespace1.int
    var namespace1.var1 : int
...

Now our functions and vars are automatically decorated to be part of the namespace. We could inspect the internal names used by our symbols:

print(repr(scope))

We get all internal names of our signatures:

...
{'namespace1_double': {}, 'namespace1_fun1_int': namespace1_fun1_int, 'namespace1_int': {}, 'namespace1_fun2_char_int': namespace1_fun2_char_int, 'namespace1_fun3_int_double_int': namespace1_fun3_int_double_int, 'namespace1_var1_int': namespace1_var1_int}

2- Type operations

With the previous classes, we got the basic abstraction to implement a name-based type system with functions/variables overloads.

In fact, the class pyrser.type_system.Scope provides what we need for basic type operations.

Let’s take a classical scope with few overloads of a function f:

scope = Scope(sig=[Fun('f', 'void', ['int']), Fun('f', 'int', ['int', 'double', 'char']), Fun('f', 'double', ['int', 'juju'])])
scope |= Scope(sig=[Fun('f', 'double', ['char', 'double', 'double'])])

Then add some locals variables (with possible overloads):

t1 = Type('int')
t2 = Type('double')
p11 = Val('14', 'int')
p12 = Val('14', 'double')
p21 = Var('b', 'int')
p22 = Var('b', 'double')
p31 = Var('c', 'int')
p32 = Var('c', 'double')
p33 = Var('c', 'char')
a = Scope(sig=[p11, p12])
scope.update([t1, t2, p11, p12, p21, p22, p31, p32, p33])
print(str(scope))

We get this setting:

scope :
    val $1 (14) : int
    val $2 (14) : double
    var b : double
    var b : int
    var c : char
    var c : double
    var c : int
    type double
    fun f : (char, double, double) -> double
    fun f : (int, double, char) -> int
    fun f : (int, juju) -> double
    fun f : (int) -> void
    type int

scope :
...

We could easily infer what is the type of f,a,b,c in the sentence f(a, b, c). In order to do this, we must first retrieve all the possible signatures for each parameter. Then, we need to retrieve all possible signatures for the given function and filter them with the set of signatures for each parameter, leaving only the plausible overloads for us to check.

Since a is already at hands (a literal value should always be represented by a scope containing all the possible type overloads), we first need to get all possible signature for b:

b = scope.get_by_symbol_name('b')
print(str(b))

We get:

...
        var b : double
        resolution :
            'double': <weakref at 0x7f628fc748b8; to 'Type' at 0x7f628fce8f28> (double)
    
    evalctx :
        var b : int
        resolution :
            'int': <weakref at 0x7f628fc74908; to 'Type' at 0x7f62925b5198> (int)
    

...

As you may have understood, pyrser.type_system.Scope.get_by_symbol_name() returns a sub-set of the Scope instance itself. Thus, we get another Scope, on which we can operate further.

We do the same for c. After that, we choose only functions called f, with these sets of parameters:

(fun, param) = scope.get_by_symbol_name('f').get_by_params([a, b, c])
print(str(fun))

And we only got:

...
        fun f : (int, double, char) -> int
        resolution :
            'char': Unresolved
            'double': <weakref at 0x7f7d15dd08b8; to 'Type' at 0x7f7d15e45f28> (double)
            'int': <weakref at 0x7f7d15dd0908; to 'Type' at 0x7f7d18707128> (int)
    

...

As we can see, some types (int and double) are resolved to a pyrser.type_system.Type , while char is left unresolved. This is because no declaration exists for the type char within our scope. Indeed, the type system tried to retrieve the types associated to the different parameters of a resolved function.

On another note, pyrser.type_system.Scope.get_by_symbol_name() also returns the Scope containing the different sets of parameters that must be used for each overload:

...
    evalctx :
        fun f : (char, double, double) -> double
        resolution :
            'char': Unresolved
            'double': <weakref at 0x7fb830f188b8; to 'Type' at 0x7fb830f8cf28> (double)
...

Here, we got a unique overload so the type checking resolved the types to the proper function.

3 - Type mangling

Now that we know how to look for a signature within a scope, we may want to have a bit more control about how the unique identifiers are generated for the signatures. Indeed, the whole typing system is based on a few classes which provide the unique identifiers. Modifying how those identifiers are generated can allow us to enable or disable function overload for a toy language, for instance.

Remember, in the first section of this tutorial, we had the following code:

from pyrser.type_system import *

t1 = Type('int')
t2 = Type('double')
var = Var('var1', 'int')
f1 = Fun('fun1', 'int', [])
f2 = Fun('fun2', 'int', ['char'])
f3 = Fun('fun3', 'int', ['int', 'double'])
scope = Scope(sig=[t1, t2, var, f1, f2, f3])
print(str(scope))

which displayed the signatures as a list:

...
{'namespace1_double': {}, 'namespace1_fun1_int': namespace1_fun1_int, 'namespace1_int': {}, 'namespace1_fun2_char_int': namespace1_fun2_char_int, 'namespace1_fun3_int_double_int': namespace1_fun3_int_double_int, 'namespace1_var1_int': namespace1_var1_int}

It is actually the Symbol class that controls how those unique signature identifiers are generated. The pyrser.type_system.Symbol class actually looks like this:

    def show_name(self) -> str:
        """
        Return a convenient name for type checking
        """
        return ".".join(self.get_scope_names())

    # to overload for new language
    def internal_name(self) -> str:
        """
        Returns the namespace's internal_name.
        """
        unq = "_".join(self.get_scope_names())
        return unq

And the implementation of the pyrser.type_system.Fun class is the following:

    def internal_name(self):
        """
        Return the unique internal name
        """
        unq = super().internal_name()
        if self.tparams is not None:
            unq += "_" + "_".join(self.tparams)
        if self.tret is not None:
            unq += "_" + self.tret
        return unq

If we follow properly how the internal_name method of the pyrser.type_system.Fun class works, we can see that the higher level class (pyrser.type_system.Fun in our case) can use internally it’s parent class’s internal_name method. That part is actually up to the implementor, as it could also define a wholly different mangling method.

In reality, three classes express the different typing concepts that enter into account when trying to generate unique signature identifiers. Those are the concepts of Value, Variable and Function, which classes are respectively the classes Val, Var and Fun. So in order to re-define the mangling for your own language, you may need to redefine up to four classes: pyrser.type_system.Symbol, pyrser.type_system.Val, pyrser.type_system.Var and pyrser.type_system.Fun.

Now, let us try to define a mangling fit for a language that would not support any overloading for a given symbol, meaning that a variable could not have the same name as a function:

from pyrser.type_system import *

class MySymbol(Symbol):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def show_name(self):
        return '.'.join(self.get_scope_names()) + '.' + self.name

    def internal_name(self):
        return '_'.join(self.get_scope_names()) + '.' + self.name


class MyFun(MySymbol, Fun):

    def __init__(self, *args, **kwargs):
        MySymbol.__init__(self, *args, **kwargs)
        Fun.__init__(self, *args, **kwargs)

    def show_name(self):
        paramstr = ''
        if self.tparams is not None:
            paramstr = ', '.join(self.tparams)
        return super().show_name() + '(' + paramstr + ')'


class MyVar(MySymbol, Var):

    def __init__(self, *args, **kwargs):
        MySymbol.__init__(self, *args, **kwargs)
        Var.__init__(self, *args, **kwargs)


t1 = Type('int')
t2 = Type('char')
t3 = Type('double')

Note that MyVar only re-uses MySymbol‘s show_name and internal_name methods. Thus, we can see that using the show_name (used mostly when printing out an object for display purposes), we can differentiate MyFun from MyVar, even though the internal_name is the same for both classes. So now, the unique identifier being the same, the typing system won’t allow having more than one unique name registered, and thus prevents us from registering both a variable and a function having the same namespaces and name.

We can try out the following piece of code:

v2 = MyFun('G', 'int')
val = Scope(sig=[v1, v2])
f1 = MyFun('fun1', 'int', [])
f2 = MyFun('fun2', 'int', ['char'])
f3 = MyFun('fun3', 'int', ['int', 'double'])
scope = Scope(sig=[t1, t2, t3, val, f1, f2, f3])
print(str(scope))

print(repr(scope))

Which yields the following output, where we can see that the mangling was handled by our code:

scope :
    scope :
        fun G.G() : () -> int
    
    type char
    type double
    fun fun1.fun1() : () -> int
    fun fun2.fun2(char) : (char) -> int
    fun fun3.fun3(int, double) : (int, double) -> int
    type int

{'': {'G.G': G.G}, 'int': {}, 'char': {}, 'fun3.fun3': fun3.fun3, 'fun1.fun1': fun1.fun1, 'double': {}, 'fun2.fun2': fun2.fun2}

4 - Type resolution and disambiguation

In most languages, the typing system can encounter situations where the type is not as obvious as a one on one match. Indeed, a lot of languages have to resolve (either following a standard resolution model or yielding an error) situations where multiples signatures match the one we are looking for. As we just saw, since we can redefine the unique internal identifier generation for the typing system’s classes, depending on the method used, we could fall more or less easily in one of those situations.

For instance, let’s assume that our mangling supports function overloads, like the C++ language does. Then, let’s assume the following symbols to have been declared in a fictive language that we’re trying to type-check:

from pyrser.type_system import *

c = Type('char')
i = Type('int')
d = Type('double')
b = Type('bignum')
f1 = Fun('fun', 'bignum', ['char'])
f2 = Fun('fun', 'int',  ['char'])
f3 = Fun('fun', 'char', ['int', 'double'])
f4 = Fun('fun', 'bignum',  ['double'])
scope = Scope('foo', [c, i, d, b, f1, f2, f3, f4])

As a pre-requisite, let us assume that a number litteral in our language can be typed in multiple ways. For instance, a number litteral can be typed as a character, an integer, or as a big number. Then, when some parsed code will contain a litteral, the following set of pyrser.type_system.Val will be built:

l1 = Val(13, 'char')
l2 = Val(13, 'int')
l3 = Val(13, 'bignum')
literal = Scope(sig=[l1, l2, l3])

Now, we have a user input, where the written code is a function call to fun, with a number litteral as a parameter, that we could translate to the following typing code:

overloads = scope.get_by_symbol_name('fun')
print(str(overloads))

Now, we display the following scope:

scope foo :
    evalctx :
        fun foo.fun : (char) -> bignum
        resolution :
            'bignum': <weakref at 0x7f27573579a8; to 'Type' at 0x7f2759c94f28> (foo.bignum)
            'char': <weakref at 0x7f2757357958; to 'Type' at 0x7f2759c8ee80> (foo.char)
    
    evalctx :
        fun foo.fun : (char) -> int
        resolution :
            'char': <weakref at 0x7f2757357958; to 'Type' at 0x7f2759c8ee80> (foo.char)
            'int': <weakref at 0x7f2757357908; to 'Type' at 0x7f2759c8ef28> (foo.int)
    
    evalctx :
        fun foo.fun : (double) -> bignum
        resolution :
            'bignum': <weakref at 0x7f27573579a8; to 'Type' at 0x7f2759c94f28> (foo.bignum)
            'double': <weakref at 0x7f27573579f8; to 'Type' at 0x7f2759c94518> (foo.double)
    
    evalctx :
        fun foo.fun : (int, double) -> char
        resolution :
            'char': <weakref at 0x7f2757357958; to 'Type' at 0x7f2759c8ee80> (foo.char)
            'double': <weakref at 0x7f27573579f8; to 'Type' at 0x7f2759c94518> (foo.double)
...

Here, since the overloads list contains more than one item, it may be easily resolvable by using the get_by_params (returning a tuple of a scope and a list of scopes) on the overloads pyrser.type_system.Scope:

(fun, param) = overloads.get_by_params([literal])
print(str(fun))
print(str(param))

Indeed, pyrser.type_system.Scope.get_by_params() takes care of matching the available signatures with the multiple sets for each parameter. Now, if the fun scope contains more than one signature, it means that we have an unresolved type. That could mean a lot of differents things, but for now, let’s try to reduce this choice.

If we only got one signature in the resulting fun scope, then the typing system would have validated the types of the input, and we could go on and fiddle with the generation. Let us see what an unresolved func and param look like:

...
scope foo :
    evalctx :
        fun foo.fun : (char) -> bignum
        resolution :
            'bignum': <weakref at 0x7f56df9f79f8; to 'Type' at 0x7f56e2334f28> (foo.bignum)
            'char': <weakref at 0x7f56df9f7908; to 'Type' at 0x7f56e232ee80> (foo.char)
    
    evalctx :
        fun foo.fun : (char) -> int
        resolution :
            'char': <weakref at 0x7f56df9f7908; to 'Type' at 0x7f56e232ee80> (foo.char)
            'int': <weakref at 0x7f56df9f7958; to 'Type' at 0x7f56e232ef28> (foo.int)
    

...

In this case, we can see that using the literal as parameter was not enough to resolve the type of the function we want to use, but we can see a difference between the two: the return type. So we can filter once again over the return type:

byret = fun.get_by_return_type('bignum')
print(str(byret))

and we then get the following output:

...
scope foo :
    evalctx :
        fun foo.fun : (char) -> bignum
        resolution :
            'bignum': <weakref at 0x7f894e2d79f8; to 'Type' at 0x7f8950c14f28> (foo.bignum)
...

As we can see, by using this last filter, we could identify an unique function signature matching our user input. Alas, in some cases, it’s not as easy. Indeed, in some languages you might have polymorphic types, that the Scope class cannot resolve itself. It requires the help of another typing module: the pyrser.type_system.inference.Inference. Sometimes, even the inference module cannot resolve something, and then we fall in the case of an error, that will be up to us to notify to the user.

See Tuto III to dive deeper into the usage of pyrser, and see how to add mechanisms for the typing system to have a more powerful resolver, adding Type coercion, and using the Inference module.