Regular expressions: the cause of, and solution to, all of life’s problems.

Ahem. A few years ago I wrote a regular expression to approximate the number of syllables in a word.
Today I dug it out and decided it was not good enough! Instead of just approximating the number of syllables, let’s try to make it match each syllable in full!

The result follows. It’s in Python re module syntax, which is pretty much PCRE. You’ll notice it leans on variable-length lookaheads such as (?!(?:s){0,1}$) (huzzah); the lookbehinds are all fixed-width, because Python’s re module insists on that (boo). Also, I apologise for some bad constructs, but I wrote it several years ago. All {0,1}s should be ‘?’ and all x|y should be [xy], but when your regexes get this big you get wary of changing them.
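To make the lookaround point concrete, here is a small check against Python's re module. It is only a sketch: the sample words and the variable names are made up for illustration, and only the e(?!(?:s){0,1}$) fragment is taken from the pattern itself.

```python
import re

# A variable-length lookahead compiles fine. This is the "e that is not at the
# end of a word or of a plural" test used in the big pattern.
not_final_e = re.compile(r"e(?!(?:s){0,1}$)")
final_e_note = bool(not_final_e.search("note"))    # only e is word-final
final_e_notes = bool(not_final_e.search("notes"))  # only e is in a plural ending
inner_e_enter = bool(not_final_e.search("enter"))  # first e is followed by more letters
print(final_e_note, final_e_notes, inner_e_enter)

# A variable-length lookbehind, on the other hand, is rejected by re.
try:
    re.compile(r"(?<=a+)b")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)

# The two "bad constructs" are a matter of style: on these samples the tidy
# spellings accept exactly the same strings.
samples = ["e", "es", "ed", "ess", ""]
same_optional = all(
    bool(re.fullmatch(r"e(?:s){0,1}", s)) == bool(re.fullmatch(r"es?", s))
    for s in samples
)
same_class = all(
    bool(re.fullmatch(r"e(?:s|d)", s)) == bool(re.fullmatch(r"e[sd]", s))
    for s in samples
)
print(same_optional, same_class)
```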

You have to lowercase your input and split it with re.split('[^a-z]', text), because the regex operates at the word level. Then you do an re.findall() on each word with the re.X/re.VERBOSE flag set.
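Those two steps can be sketched as follows. To keep the snippet runnable by itself it uses a deliberately tiny stand-in pattern (TOY_SYLLABLE), which is not the real regex; the helper names split_words and syllables and the sample sentence are also made up for illustration.

```python
import re

# Stand-in pattern (NOT the real one): consonants, a vowel run, at most one
# closing consonant, plus the same catch-all fallback as the full regex.
TOY_SYLLABLE = re.compile(
    r"""
    [^aeiou]*   # leading consonants
    [aeiou]+    # the vowel run
    [^aeiou]?   # at most one closing consonant
    |
    .+          # anything left over
    """,
    re.VERBOSE,
)


def split_words(text):
    """Lowercase the text and split on anything that is not a-z."""
    # re.split leaves empty strings between adjacent separators, so drop them.
    return [w for w in re.split(r"[^a-z]", text.lower()) if w]


def syllables(word, pattern=TOY_SYLLABLE):
    """Return the consecutive chunks the pattern matches in one word."""
    return pattern.findall(word)


sample = "Norrington returns, seemingly unharmed."
words = split_words(sample)
for word in words:
    print(word.ljust(16), syllables(word))
```

Passing the compiled full pattern as the pattern argument gives the real splits, and len(syllables(word)) gets you back to the original goal of a syllable count.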

(?:
  [bcdfghjklmnpqrstvwxz]*
  (?:
        [aeiou]+[^aeiou]e(?=[aeious])
        |
        #Easy vowels first, each may be repeated up to 3 times.
        (?:
            #i may not be part of ing, eg 'trying' is two syllables but here would be counted as one. 
            # ing is addressed below.
            # likewise ia, io and ua are generally pronounced as two syllables (deviation, period, usual).
            # These are excluded here to keep them out of the repetition qualifier.
            a|i(?!:ng|a|o)|o|u(?!a|se)|y   
            |oe

            #e is not so easy, and to avoid the repetition qualifier it is also addressed underneath.
            # match e if not on the end of a word like 'note', or plural forms, like 'notes'. 
            | (?:e(?!(?:s){0,1}$) )  
            
        ){1,3}
        
        # now we address the possibility of an 'e' on the end of a word that does qualify as a syllable,
        # for example 'maybe', or 'couple'
        #Do match e at the end if preceding letters were not such that a silent e is likely: 
        # >1 consonants EXCEPTION: r two letters before the e, as in 'barge', 'nce' as in variance 
        #     errrr shouldn't the last subpattern be a look AHEAD?
        | (?<=[^aeiour][^aeiou])(?:e[sd])(?:[\s]|$)(?<!nce)
     
        #and we rejected e when on the end of word or a plural, but what about voices or aces.
        |ces$

        #short words ending in e or a, like 'me', 'the', 'sea', etc. 
        | ^[a-z]{1,2}[ea]$

        #excluded exceptions from earlier. Note non-consumed following letters to avoid overlapping.
        
        | ing    | i(?=a)  
        | i(?=o) | u(?=a)(?!are)
        
  )
  (?:
    (?:
      # bah this needs refactoring
      bb|dd|ff|gg|ll|mm|nn|ss|tt
      |ch|sh|ght?|lt|th|rn|st|ft|nd|rt|rd|gn$|nt|ct|(?<=[aeiou])xp|lf
      |ng|ck|rst|lp|ld|rp
      |
      [bcdfghjklmnpqrstvwxz]?
    )?
    (?:s$)?
  )?
)
|.+
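One thing worth checking before reusing the pattern is the i(?!:ng|a|o) line. (?!: is not a special group opener, so this is a negative lookahead whose first alternative is the literal text ':ng'. Going by the comment above that line, (?!ng|a|o) may have been what was intended. The sketch below only shows the difference on a few made-up fragments; the name colon_moved is hypothetical, and swapping it into the full pattern could change how words are split, so treat it as something to test rather than a drop-in fix.

```python
import re

# As written: the lookahead rejects ":ng", "a" or "o" after the i.
as_written = re.compile(r"i(?!:ng|a|o)")
# Hypothetical variant: the lookahead rejects "ng", "a" or "o" after the i.
colon_moved = re.compile(r"i(?!ng|a|o)")

results = {}
for fragment in ("ing", "ia", "io", "it"):
    results[fragment] = (
        bool(as_written.match(fragment)),
        bool(colon_moved.match(fragment)),
    )
    print(fragment, results[fragment])
```

Only the 'ing' fragment behaves differently between the two: the pattern as written accepts the i there, the variant does not.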

I ran it on an excerpt of a Pirates of the Caribbean article from Wikipedia, and it split the words as follows:

unintelligible  ['un', 'int', 'ell', 'ig', 'ib', 'le']
relinquishes    ['rel', 'in', 'quish', 'es']
intelligence    ['int', 'ell', 'ig', 'en', 'ce']
negotiations    ['neg', 'ot', 'i', 'at', 'i', 'ons']
interceptor     ['int', 'er', 'cep', 'tor']
encouraging     ['en', 'cour', 'ag', 'ing']
intelligent     ['int', 'ell', 'ig', 'ent']
calculating     ['cal', 'cul', 'at', 'ing']
vocabulary      ['voc', 'ab', 'ul', 'ar', 'y']
lieutenant      ['lieut', 'en', 'ant']
previously      ['prev', 'i', 'ous', 'ly']
deciphered      ['dec', 'ip', 'her', 'ed']
acquiesces      ['ac', 'quies', 'ces']
norrington      ['nor', 'ring', 'ton']
desiderata      ['des', 'id', 'er', 'at', 'a']
admiration      ['ad', 'mir', 'at', 'i', 'on']
strategies      ['strat', 'eg', 'i', 'es']
intentions      ['int', 'ent', 'i', 'ons']
proclaimed      ['proc', 'laim', 'ed']
grappling       ['grap', 'pling']
elizabeth       ['el', 'iz', 'ab', 'eth']
evidenced       ['ev', 'id', 'en', 'ced']
advantage       ['ad', 'vant', 'ag', 'e']
returning       ['ret', 'urn', 'ing']
seemingly       ['seem', 'ing', 'ly']
commodore       ['comm', 'od', 'or', 'e']
actuality       ['act', 'u', 'al', 'it', 'y']
consensus       ['con', 'sen', 'sus']
announces       ['ann', 'oun', 'ces']
swordsman       ['sword', 'sman']
suggested       ['sugg', 'est', 'ed']
resulting       ['res', 'ult', 'ing']
determine       ['det', 'er', 'min', 'e']
indicated       ['ind', 'ic', 'at', 'ed']
persuades       ['per', 'suades']
murderous       ['murd', 'er', 'ous']
negotiate       ['neg', 'ot', 'i', 'at', 'e']
extremely       ['ex', 'trem', 'el', 'y']
mutinied        ['mut', 'in', 'ied']
stabbing        ['stabb', 'ing']
explains        ['exp', 'lains']
attitude        ['att', 'it', 'ud', 'e']
immortal        ['imm', 'ort', 'al']
contrast        ['cont', 'rast']
although        ['alt', 'hough']

Not bad?
