Tutorial II: Handling Type System (part 1):¶
Parsing files are useful, but we quickly need to do some type checking on our input to do some advanced DSL.
To handle this problem, the package pyrser.type_system
provide what we need.
1- Type semantics¶
We provide some classes to describe types.
pyrser.type_system.Symbol
: A Symbol represents a thing in our
language.
pyrser.type_system.Signature
: A Signature is an abstract type common
to Val
, Var
and Fun
. It is the common denominator of the typing
system and provides the capability to get a string representation of a symbol.
pyrser.type_system.Val
: A Val represents a litteral value in our
language.
pyrser.type_system.Var
: A Var represents a named variable in our
language.
pyrser.type_system.Fun
: A Val represents a named function in our
language.
pyrser.type_system.Scope
: A Scope represents a scope or a type (ADT
or Abstract Data Type).
We could notice that a Scope could be of three kind:
FREE: This is a standalone scope.
LINKED: This scope is connected to a parent scope. So type resolution is forwarded to parent if it failed.
EMBEDDED: This scope is a subscope of a parent scope. An embedded scope is seen as an extension of the parent scope. So when we iterate thru symbols in this scope we reached also symbols present in the parent scope. This is useful in certain case but problematic for other (typically
pyrser.type_system.Scope.get_by_params()
).
Basically we could use the package like this:
from pyrser.type_system import *
t1 = Type('int')
t2 = Type('double')
var = Var('var1', 'int')
f1 = Fun('fun1', 'int', [])
f2 = Fun('fun2', 'int', ['char'])
f3 = Fun('fun3', 'int', ['int', 'double'])
scope = Scope(sig=[t1, t2, var, f1, f2, f3])
print(str(scope))
And produce the following output:
scope :
type double
fun fun1 : () -> int
fun fun2 : (char) -> int
fun fun3 : (int, double) -> int
type int
var var1 : int
...
We’re actually generating the signatures of one variable and three functions and
add them to an unnamed pyrser.type_system.Scope
, thus creating an
anonymous scope that could be our language’s global scope. This is the reason
why we instantiate the pyrser.type_system.Scope
object using the
keyword sig
(also second positionnal argument): by not giving a first
parameter which is an identifier naming the scope, we anonymize it. If we
wanted to name it, we could have created it as follows:
scope = Scope('namespace1', [t1, t2, var, f1, f2, f3])
Or, after creating the object, we can attribute the proper name:
scope.set_name('namespace1') print(str(scope))
that would produce the output:
...
scope namespace1 :
type namespace1.double
fun namespace1.fun1 : () -> int
fun namespace1.fun2 : (char) -> int
fun namespace1.fun3 : (int, double) -> int
type namespace1.int
var namespace1.var1 : int
...
Now our functions and vars are automatically decorated to be part of the namespace. We could inspect the internal names used by our symbols:
print(repr(scope))
We get all internal names of our signatures:
...
{'namespace1_double': {}, 'namespace1_fun1_int': namespace1_fun1_int, 'namespace1_int': {}, 'namespace1_fun2_char_int': namespace1_fun2_char_int, 'namespace1_fun3_int_double_int': namespace1_fun3_int_double_int, 'namespace1_var1_int': namespace1_var1_int}
2- Type operations¶
With the previous classes, we got the basic abstraction to implement a name-based type system with functions/variables overloads.
In fact, the class pyrser.type_system.Scope
provides what we need for basic type operations.
Let’s take a classical scope with few overloads of a function f
:
scope = Scope(sig=[Fun('f', 'void', ['int']), Fun('f', 'int', ['int', 'double', 'char']), Fun('f', 'double', ['int', 'juju'])])
scope |= Scope(sig=[Fun('f', 'double', ['char', 'double', 'double'])])
Then add some locals variables (with possible overloads):
t1 = Type('int')
t2 = Type('double')
p11 = Val('14', 'int')
p12 = Val('14', 'double')
p21 = Var('b', 'int')
p22 = Var('b', 'double')
p31 = Var('c', 'int')
p32 = Var('c', 'double')
p33 = Var('c', 'char')
a = Scope(sig=[p11, p12])
scope.update([t1, t2, p11, p12, p21, p22, p31, p32, p33])
print(str(scope))
We get this setting:
scope :
val $1 (14) : int
val $2 (14) : double
var b : double
var b : int
var c : char
var c : double
var c : int
type double
fun f : (char, double, double) -> double
fun f : (int, double, char) -> int
fun f : (int, juju) -> double
fun f : (int) -> void
type int
scope :
...
We could easily infer what is the type of f,a,b,c in the sentence
f(a, b, c)
. In order to do this, we must first retrieve all the possible
signatures for each parameter. Then, we need to retrieve all possible
signatures for the given function and filter them with the set of signatures
for each parameter, leaving only the plausible overloads for us to check.
Since a
is already at hands (a literal value should always be represented
by a scope containing all the possible type overloads), we first need to get
all possible signature for b
:
b = scope.get_by_symbol_name('b')
print(str(b))
We get:
...
var b : double
resolution :
'double': <weakref at 0x7f628fc748b8; to 'Type' at 0x7f628fce8f28> (double)
evalctx :
var b : int
resolution :
'int': <weakref at 0x7f628fc74908; to 'Type' at 0x7f62925b5198> (int)
...
As you may have understood,
pyrser.type_system.Scope.get_by_symbol_name()
returns a sub-set of
the Scope instance itself. Thus, we get another Scope, on which we can operate
further.
We do the same for c
. After that, we choose only functions called f, with
these sets of parameters:
(fun, param) = scope.get_by_symbol_name('f').get_by_params([a, b, c])
print(str(fun))
And we only got:
...
fun f : (int, double, char) -> int
resolution :
'char': Unresolved
'double': <weakref at 0x7f7d15dd08b8; to 'Type' at 0x7f7d15e45f28> (double)
'int': <weakref at 0x7f7d15dd0908; to 'Type' at 0x7f7d18707128> (int)
...
As we can see, some types (int
and double
) are resolved to a
pyrser.type_system.Type
, while char
is left unresolved. This
is because no declaration exists for the type char
within our scope.
Indeed, the type system tried to retrieve the types associated to the different
parameters of a resolved function.
On another note, pyrser.type_system.Scope.get_by_symbol_name()
also returns the Scope containing the different sets of parameters that must be
used for each overload:
...
evalctx :
fun f : (char, double, double) -> double
resolution :
'char': Unresolved
'double': <weakref at 0x7fb830f188b8; to 'Type' at 0x7fb830f8cf28> (double)
...
Here, we got a unique overload so the type checking resolved the types to the proper function.
3 - Type mangling¶
Now that we know how to look for a signature within a scope, we may want to have a bit more control about how the unique identifiers are generated for the signatures. Indeed, the whole typing system is based on a few classes which provide the unique identifiers. Modifying how those identifiers are generated can allow us to enable or disable function overload for a toy language, for instance.
Remember, in the first section of this tutorial, we had the following code:
from pyrser.type_system import *
t1 = Type('int')
t2 = Type('double')
var = Var('var1', 'int')
f1 = Fun('fun1', 'int', [])
f2 = Fun('fun2', 'int', ['char'])
f3 = Fun('fun3', 'int', ['int', 'double'])
scope = Scope(sig=[t1, t2, var, f1, f2, f3])
print(str(scope))
which displayed the signatures as a list:
...
{'namespace1_double': {}, 'namespace1_fun1_int': namespace1_fun1_int, 'namespace1_int': {}, 'namespace1_fun2_char_int': namespace1_fun2_char_int, 'namespace1_fun3_int_double_int': namespace1_fun3_int_double_int, 'namespace1_var1_int': namespace1_var1_int}
It is actually the Symbol class that controls how those unique signature
identifiers are generated. The pyrser.type_system.Symbol
class
actually looks like this:
def show_name(self) -> str:
"""
Return a convenient name for type checking
"""
return ".".join(self.get_scope_names())
# to overload for new language
def internal_name(self) -> str:
"""
Returns the namespace's internal_name.
"""
unq = "_".join(self.get_scope_names())
return unq
And the implementation of the pyrser.type_system.Fun
class is the
following:
def internal_name(self):
"""
Return the unique internal name
"""
unq = super().internal_name()
if self.tparams is not None:
unq += "_" + "_".join(self.tparams)
if self.tret is not None:
unq += "_" + self.tret
return unq
If we follow properly how the internal_name
method of the
pyrser.type_system.Fun
class works, we can see that the higher level
class (pyrser.type_system.Fun
in our case) can use internally it’s
parent class’s internal_name
method. That part is actually up to the
implementor, as it could also define a wholly different mangling method.
In reality, three classes express the different typing concepts that enter into
account when trying to generate unique signature identifiers. Those are the
concepts of Value, Variable and Function, which classes are respectively the
classes Val, Var and Fun. So in order to re-define the mangling for your own
language, you may need to redefine up to four classes:
pyrser.type_system.Symbol
, pyrser.type_system.Val
,
pyrser.type_system.Var
and pyrser.type_system.Fun
.
Now, let us try to define a mangling fit for a language that would not support any overloading for a given symbol, meaning that a variable could not have the same name as a function:
from pyrser.type_system import *
class MySymbol(Symbol):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def show_name(self):
return '.'.join(self.get_scope_names()) + '.' + self.name
def internal_name(self):
return '_'.join(self.get_scope_names()) + '.' + self.name
class MyFun(MySymbol, Fun):
def __init__(self, *args, **kwargs):
MySymbol.__init__(self, *args, **kwargs)
Fun.__init__(self, *args, **kwargs)
def show_name(self):
paramstr = ''
if self.tparams is not None:
paramstr = ', '.join(self.tparams)
return super().show_name() + '(' + paramstr + ')'
class MyVar(MySymbol, Var):
def __init__(self, *args, **kwargs):
MySymbol.__init__(self, *args, **kwargs)
Var.__init__(self, *args, **kwargs)
t1 = Type('int')
t2 = Type('char')
t3 = Type('double')
Note that MyVar
only re-uses MySymbol
‘s show_name
and
internal_name
methods. Thus, we can see that using the show_name
(used
mostly when printing out an object for display purposes), we can differentiate
MyFun
from MyVar
, even though the internal_name
is the same for
both classes. So now, the unique identifier being the same, the typing system
won’t allow having more than one unique name registered, and thus prevents us
from registering both a variable and a function having the same namespaces and
name.
We can try out the following piece of code:
v2 = MyFun('G', 'int')
val = Scope(sig=[v1, v2])
f1 = MyFun('fun1', 'int', [])
f2 = MyFun('fun2', 'int', ['char'])
f3 = MyFun('fun3', 'int', ['int', 'double'])
scope = Scope(sig=[t1, t2, t3, val, f1, f2, f3])
print(str(scope))
print(repr(scope))
Which yields the following output, where we can see that the mangling was handled by our code:
scope :
scope :
fun G.G() : () -> int
type char
type double
fun fun1.fun1() : () -> int
fun fun2.fun2(char) : (char) -> int
fun fun3.fun3(int, double) : (int, double) -> int
type int
{'': {'G.G': G.G}, 'int': {}, 'char': {}, 'fun3.fun3': fun3.fun3, 'fun1.fun1': fun1.fun1, 'double': {}, 'fun2.fun2': fun2.fun2}
4 - Type resolution and disambiguation¶
In most languages, the typing system can encounter situations where the type is not as obvious as a one on one match. Indeed, a lot of languages have to resolve (either following a standard resolution model or yielding an error) situations where multiples signatures match the one we are looking for. As we just saw, since we can redefine the unique internal identifier generation for the typing system’s classes, depending on the method used, we could fall more or less easily in one of those situations.
For instance, let’s assume that our mangling supports function overloads, like the C++ language does. Then, let’s assume the following symbols to have been declared in a fictive language that we’re trying to type-check:
from pyrser.type_system import * c = Type('char') i = Type('int') d = Type('double') b = Type('bignum') f1 = Fun('fun', 'bignum', ['char']) f2 = Fun('fun', 'int', ['char']) f3 = Fun('fun', 'char', ['int', 'double']) f4 = Fun('fun', 'bignum', ['double']) scope = Scope('foo', [c, i, d, b, f1, f2, f3, f4])
As a pre-requisite, let us assume that a number litteral in our language can be
typed in multiple ways. For instance, a number litteral can be typed as a
character, an integer, or as a big number. Then, when some parsed code will
contain a litteral, the following set of
pyrser.type_system.Val
will be built:
l1 = Val(13, 'char') l2 = Val(13, 'int') l3 = Val(13, 'bignum') literal = Scope(sig=[l1, l2, l3])
Now, we have a user input, where the written code is a function call to
fun
, with a number litteral as a parameter, that we could translate to the
following typing code:
overloads = scope.get_by_symbol_name('fun') print(str(overloads))
Now, we display the following scope:
scope foo :
evalctx :
fun foo.fun : (char) -> bignum
resolution :
'bignum': <weakref at 0x7f27573579a8; to 'Type' at 0x7f2759c94f28> (foo.bignum)
'char': <weakref at 0x7f2757357958; to 'Type' at 0x7f2759c8ee80> (foo.char)
evalctx :
fun foo.fun : (char) -> int
resolution :
'char': <weakref at 0x7f2757357958; to 'Type' at 0x7f2759c8ee80> (foo.char)
'int': <weakref at 0x7f2757357908; to 'Type' at 0x7f2759c8ef28> (foo.int)
evalctx :
fun foo.fun : (double) -> bignum
resolution :
'bignum': <weakref at 0x7f27573579a8; to 'Type' at 0x7f2759c94f28> (foo.bignum)
'double': <weakref at 0x7f27573579f8; to 'Type' at 0x7f2759c94518> (foo.double)
evalctx :
fun foo.fun : (int, double) -> char
resolution :
'char': <weakref at 0x7f2757357958; to 'Type' at 0x7f2759c8ee80> (foo.char)
'double': <weakref at 0x7f27573579f8; to 'Type' at 0x7f2759c94518> (foo.double)
...
Here, since the overloads list contains more than one item, it may be easily
resolvable by using the get_by_params (returning a tuple of a scope and a list
of scopes) on the overloads pyrser.type_system.Scope
:
(fun, param) = overloads.get_by_params([literal])
print(str(fun))
print(str(param))
Indeed, pyrser.type_system.Scope.get_by_params()
takes care of
matching the available signatures with the multiple sets for each parameter.
Now, if the fun
scope contains more than one signature, it means that we
have an unresolved type. That could mean a lot of differents things, but for
now, let’s try to reduce this choice.
If we only got one signature in the resulting fun
scope, then the typing
system would have validated the types of the input, and we could go on and fiddle
with the generation. Let us see what an unresolved func and param look like:
...
scope foo :
evalctx :
fun foo.fun : (char) -> bignum
resolution :
'bignum': <weakref at 0x7f56df9f79f8; to 'Type' at 0x7f56e2334f28> (foo.bignum)
'char': <weakref at 0x7f56df9f7908; to 'Type' at 0x7f56e232ee80> (foo.char)
evalctx :
fun foo.fun : (char) -> int
resolution :
'char': <weakref at 0x7f56df9f7908; to 'Type' at 0x7f56e232ee80> (foo.char)
'int': <weakref at 0x7f56df9f7958; to 'Type' at 0x7f56e232ef28> (foo.int)
...
In this case, we can see that using the literal as parameter was not enough to resolve the type of the function we want to use, but we can see a difference between the two: the return type. So we can filter once again over the return type:
byret = fun.get_by_return_type('bignum')
print(str(byret))
and we then get the following output:
...
scope foo :
evalctx :
fun foo.fun : (char) -> bignum
resolution :
'bignum': <weakref at 0x7f894e2d79f8; to 'Type' at 0x7f8950c14f28> (foo.bignum)
...
As we can see, by using this last filter, we could identify an unique function
signature matching our user input. Alas, in some cases, it’s not as easy.
Indeed, in some languages you might have polymorphic types, that the Scope
class cannot resolve itself. It requires the help of another typing module: the
pyrser.type_system.inference.Inference
. Sometimes, even the inference
module cannot resolve something, and then we fall in the case of an error, that
will be up to us to notify to the user.
See Tuto III
to dive deeper into the usage of pyrser, and see how to add
mechanisms for the typing system to have a more powerful resolver, adding Type
coercion, and using the Inference module.