Lexer in Python for custom programming language (Lexical Analyzer)

Badger · Apr 5, 2019

Basically, I am currently creating my own language using Python. I have started on the lexer, and right now it tokenizes the source by splitting on spaces. However, I want it to be able to read both of these forms:

Code:

name="Jack";
name = "Jack";

My lexer does not work for the first 'name' variable declaration, but does work for the second. How can I make it so either can work? Not sure if anyone has done any lexers here, but maybe somebody has some insight.

I also am wondering how lexers can take something like

Code:

name = ((5))

and treat it as completely valid code, so long as each opening parenthesis has a corresponding closing parenthesis.

Ally · May 7, 2019

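I understand your approach to writing your own language, but tokenizing by spaces simply won't work for most cases. Your first example shows why: name="Jack"; contains no spaces at all, so splitting on spaces hands you the whole statement as a single token. A lexer needs to read the source character by character (or pattern by pattern) and decide where each token ends, treating whitespace as optional filler between tokens.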
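As a rough sketch of what that could look like in Python, here is a small regex-driven tokenizer. The token names and patterns (STRING, IDENT and so on) are my own guesses at what your language needs, not anything from your code, so adapt them freely. It follows the named-group approach shown in the tokenizer example in the `re` module documentation.

```python
import re

# Hypothetical token set for illustration only; a real language needs its own.
TOKEN_SPEC = [
    ("STRING", r'"[^"\n]*"'),    # double-quoted string, no escape handling
    ("NUMBER", r"\d+"),          # integer literal
    ("IDENT",  r"[A-Za-z_]\w*"), # variable or function name
    ("EQUALS", r"="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),          # whitespace is matched, then thrown away
    ("ERROR",  r"."),            # anything else is an unexpected character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (kind, text) pairs for the given source string."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {text!r} at offset {match.start()}")
        tokens.append((kind, text))
    return tokens


print(tokenize('name="Jack";'))
print(tokenize('name = "Jack";'))
```

Both calls produce the same four tokens (IDENT, EQUALS, STRING, SEMI), because whitespace is skipped rather than used as the separator.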

In the beginning, your "language" will have to be written in a rigid format dictated by your lexer's rules, such as the restriction you've already hit from splitting on spaces. The other obvious thing to note is that lines will differ in format depending on what they do. What if you're declaring a variable, or invoking a function/method?

Wikipedia is only slightly insightful. Educational institutions may provide instructions, and blogs can very much help, even if they're targeted at a particular language.

Rather than focus on a full-on language, get your hands dirty with some esolang parsing techniques. Brainfuck is a very simple esolang, and an interpreter for it is easy to write yourself. Go on, go do it yourself.

What I'm trying to say is that you could instead start by using symbols to represent instructions. In Brainfuck, single characters are used, but you don't have to do that! I understand my answer is largely redundant and, on the surface, counterproductive, but believe me, it helps. Learn some techniques and explore. Half the fun of writing interpreters and analysers is the exploration. Don't expect to go for a full-blown language first up. That's a recipe for disaster.

Now regarding your second question:

.Badger said:
name = ((5))

When parsing things like brackets, you can generally keep a stack such that when an open bracket is found, you push the index to the stack, and when a close bracket is found, you can pop it. Anything between those two indexes is within the brackets.

Observe, I'll parse ((5)), pseudo-code ish:

Code:

Loop from index 0 to 4,
if element at index is ( then push index to stack
if element at index is ) then store current index as endIndex and pop the stack as startIndex
the resulting region is startIndex + 1 to endIndex - 1
add the region to a list

The resulting regions will be 5 and (5). Having a double bracket like that is redundant though, so you could probably apply more processing to eliminate duplicate resulting values (such as 5 and (5)).
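For reference, the pseudo-code above might look something like this in Python. The function name `bracket_regions` is just something I made up for the example. It also raises an error when a bracket has no partner, which covers the "each opening parenthesis needs a closing one" check. Note that it works on raw characters, so it would miscount brackets that appear inside string literals; running it over lexer tokens instead would avoid that.

```python
def bracket_regions(text):
    """Return the text inside each matched pair of parentheses, innermost first."""
    stack = []    # indices of '(' characters that have not been closed yet
    regions = []
    for index, char in enumerate(text):
        if char == "(":
            stack.append(index)
        elif char == ")":
            if not stack:
                raise SyntaxError(f"unmatched ')' at index {index}")
            start = stack.pop()
            # Region runs from start + 1 to index - 1 inclusive.
            regions.append(text[start + 1:index])
    if stack:
        raise SyntaxError(f"unmatched '(' at index {stack[-1]}")
    return regions


print(bracket_regions("((5))"))  # ['5', '(5)']
```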

Edit: If you're into mathematical operations too, look into infix and postfix expressions (postfix is also known as Reverse Polish Notation). Techniques for converting between them and evaluating them are available on Wikipedia and a multitude of other places.
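To give a feel for why postfix is convenient, here is a minimal sketch of a postfix evaluator. The names `OPERATORS` and `evaluate_postfix` are hypothetical, and it assumes the tokens are already split up and that every operand is a number. Operands go onto a stack, and each operator pops its two operands and pushes the result, so no brackets or precedence rules are needed.

```python
import operator

# Hypothetical operator table for the example; extend as needed.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate_postfix(tokens):
    """Evaluate a list of postfix tokens such as ['3', '4', '2', '*', '+']."""
    stack = []
    for token in tokens:
        if token in OPERATORS:
            if len(stack) < 2:
                raise ValueError(f"operator {token!r} is missing an operand")
            right = stack.pop()   # the most recently pushed value is the right operand
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


# The infix expression 3 + 4 * 2 is written 3 4 2 * + in postfix.
print(evaluate_postfix("3 4 2 * +".split()))  # 11.0
```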
