regular expressions: the cause of and solution to, all of life’s problems.
uhum, a few years ago I wrote a regular expression to try to approximate the number of syllables in a word.
Today I dug it out and decided it was not good enough! Instead of approximating the number of syllables, let’s try to make it match the full syllables!
The result follows. It’s in Python re module syntax, which is pretty much PCRE but you’ll notice python lets you do non-fixed length assertions (huzzah), which PCRE does not (boo). Also I apologise for some bad constructs but I wrote it several years ago. All {0,1}s should be ‘?’ and all x|y should be [xy], but when your regexes get this big you get wary of changing them.
You have to split your input by re.split(‘[^a-z]‘) because the regex operates at the word level. Then you do an re.findall() with the re.X/re.VERBOSE flag set.
(?:
[bcdfghjklmnpqrstvwxz]*
(?:
[aeiou]+[^aeiou]e(?=[aeious])
|
#Easy vowels first, each may be repeated up to 3 times.
(?:
#i may not be part of ing, eg 'trying' is two syllables but here would be counted as one.
# ing is addressed below.
# likewise ia iod ua are generally pronounced as two syllables (deviation, period, usual).
# These are excluded here to remove from the repetition qualifier.
a|i(?!:ng|a|o)|o|u(?!a|se)|y
|oe
#e is not so easy, and to avoid the repeatition qualifier it is also addressed underneath.
# match e if not on the end of a word like 'note', or plural forms, like 'notes'.
| (?:e(?!(?:s){0,1}$) )
){1,3}
# now we address the possibility of 'e' being on the end of a word and does qualify for a syllable,
# for example 'maybe', or 'couple'
#Do match e at the end if preceding letters were not such that a silent e is likely:
# >1 consonants EXCEPTION: r two letters before the e, as in 'barge', 'nce' as in variance
# errrr shouldn't the last subpattern be a look AHEAD?
| (?<=[^aeiour][^aeiou])(?:e[sd])(?:[\s]|$)(?<!nce)
#and we rejected e when on the end of word or a plural, but what about voices or aces.
|ces$
#short words ending in e or a, like 'me', 'the', 'sea', etc.
| ^[a-z]{1,2}[ea]$
#excluded exceptions from earlier. Note non-consumed following letters to avoid overlapping.
| ing | i(?=a)
| i(?=o) | u(?=a)(?!are)
)
(?:
(?:
# bah this needs refactoring
bb|dd|ff|gg|ll|mm|nn|ss|tt
|ch|sh|ght?|lt|th|rn|st|ft|nd|rt|rd|gn$|nt|ct|(?<=[aeiou])xp|lf
|ng|ck|rst|lp|ld|rp
|
[bcdfghjklmnpqrstvwxz]?
)?
(?:s$)?
)?
)
|.+
I ran it on an excerpt of a Pirates Of the Caribbean article from Wikipedia and it split the words as follows:
unintelligible ['un', 'int', 'ell', 'ig', 'ib', 'le'] relinquishes ['rel', 'in', 'quish', 'es'] intelligence ['int', 'ell', 'ig', 'en', 'ce'] negotiations ['neg', 'ot', 'i', 'at', 'i', 'ons'] interceptor ['int', 'er', 'cep', 'tor'] encouraging ['en', 'cour', 'ag', 'ing'] intelligent ['int', 'ell', 'ig', 'ent'] calculating ['cal', 'cul', 'at', 'ing'] vocabulary ['voc', 'ab', 'ul', 'ar', 'y'] lieutenant ['lieut', 'en', 'ant'] previously ['prev', 'i', 'ous', 'ly'] deciphered ['dec', 'ip', 'her', 'ed'] acquiesces ['ac', 'quies', 'ces'] norrington ['nor', 'ring', 'ton'] desiderata ['des', 'id', 'er', 'at', 'a'] admiration ['ad', 'mir', 'at', 'i', 'on'] strategies ['strat', 'eg', 'i', 'es'] intentions ['int', 'ent', 'i', 'ons'] proclaimed ['proc', 'laim', 'ed'] grappling ['grap', 'pling'] elizabeth ['el', 'iz', 'ab', 'eth'] evidenced ['ev', 'id', 'en', 'ced'] advantage ['ad', 'vant', 'ag', 'e'] returning ['ret', 'urn', 'ing'] seemingly ['seem', 'ing', 'ly'] commodore ['comm', 'od', 'or', 'e'] actuality ['act', 'u', 'al', 'it', 'y'] consensus ['con', 'sen', 'sus'] announces ['ann', 'oun', 'ces'] swordsman ['sword', 'sman'] suggested ['sugg', 'est', 'ed'] resulting ['res', 'ult', 'ing'] determine ['det', 'er', 'min', 'e'] indicated ['ind', 'ic', 'at', 'ed'] persuades ['per', 'suades'] murderous ['murd', 'er', 'ous'] negotiate ['neg', 'ot', 'i', 'at', 'e'] extremely ['ex', 'trem', 'el', 'y'] mutinied ['mut', 'in', 'ied'] stabbing ['stabb', 'ing'] explains ['exp', 'lains'] attitude ['att', 'it', 'ud', 'e'] immortal ['imm', 'ort', 'al'] contrast ['cont', 'rast'] although ['alt', 'hough']
Not bad?
let’s do programming
Brainfuck interpretor in Python. Kind of fun.
I hereby declare this code to be free under the WTFPL.
#!/usr/bin/env python
import sys
def bf(code):
input_ptr = 0
tape_ptr = 0
tape = [0] * 30000
while input_ptr < len(code):
char = code[input_ptr]
if char == '>' : tape_ptr += 1
elif char == '<' : tape_ptr -= 1
elif char == '+' : tape[tape_ptr] += 1
elif char == '-' : tape[tape_ptr] -= 1
elif char == '.' : sys.stdout.write(chr(tape[tape_ptr]))
elif char == ',' :
try: tape[tape_ptr] = ord(raw_input()[0])
except: continue
elif char == '[' and not tape[tape_ptr] or char == ']' and tape[tape_ptr]:
stack = 0
while 0 <= input_ptr < len(code):
input_ptr += 1 if char == '[' else -1
if code[input_ptr] == char: stack+=1
elif code[input_ptr] == '[' or code[input_ptr] == ']':
if stack: stack -= 1
else: break
if stack: raise Exception("Mismatched brackets")
input_ptr += 1
if __name__ == '__main__':
code = ""
if len(sys.argv) < 2 or sys.argv[1] == '-':
code = sys.stdin.read()
else: code = sys.argv[1]
bf(code)
Usage:
$ ./bf '++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.' Hello World!
yes, it’s a brilliantly stupid programming language. You get a tape memory, which you can traverse left and right with < and > respectively. You can increment and decrement the current tape element with ‘+’ and ‘-’. IO is ‘,’ and ‘.’, and [...] denotes a while loop. The loop’s condition is that the current tape element is non-zero. And that’s all there is to it.
IPs as evidence
I am reading a lot of things to do with file sharing and flawed evidence lately. One of the common misconceptions is that IPs can be arbitrarily spoofed. This isn’t really true. You can spoof your IP but you won’t receive any packets, so it’s kind of a pointless exercise unless you’re doing something cunning. It’s not practical for web usage. It’s very hard to do it most of the time because the data transfer protocol you use will most likely be built on TCP which requires a forwards-backwards handshaking procedure to establish a connection first, and it’s hard to bluff your way through the handshake if you don’t know what the other person’s sending back. Or as nmap says: TCP Sequence Prediction: Difficulty=204 (Good luck!).
But I think this is all missing the point a bit. The IP in the first place is such flimsy and meaningless evidence. There are too many humans in the chain who stand to gain from just making them up and it’s trivial to do so. Between a law firm and a ‘monitoring’ firm, you’ve got two separate entities whose entire business depends on being able to come up with a set number of IP addresses per month. It isn’t a difficult task to add on an extra 10% to ensure you get a nice bonus. It isn’t even a difficult task to fictitiously generate 95% of your data.
If I was doing this, my first thoughts would be like so: I’d probably proceed by recompiling a torrent client to write a list of IPs for each torrent to a text file (filtering by UK IPs belonging to ISPs who don’t protect their customers), then if I was a greedy cretin I’d most likely bump up the number like so:
#!/usr/bin/env python
import commands
import random
import re
ip32 = lambda n: sum([int(x)<<8*(3-i) for i, x in enumerate( (n.split(".")) ) ])
def random_ip_from_subnet(network, subnet):
ip = 0
for i in range(32)[::-1]:
snbit = (subnet >> i ) & 1
ip = ip << 1
ip |= (network >> i) & 1 if snbit else random.randint(0, 1)
return ip
users = [] # read user list in here as list of (ip, date) tuples
newusers = []
for ip, date in users:
newusers += [(ip, date)]
subnet = commands.getoutput("whois {0} | grep route | awk {{'print $2 '}}".format(ip)).trim()
if re.match("^(\d{,3}.){3}\d{,3}/\d+$", subnet):
[network, mask] = subnet.split("/")
networkip = ip32(network)
maskip = 2**(32-int(mask))-1
for i in xrange(3):
fakeip = random_ip_from_subnet(networkip, maskip)
fakedate = date + (random.random() - 0.5) * (60*60*24*28)
newusers += [(fakeip, fakedate)]
# write newusers to file here.
(note: untested code)
And there we go, I’ve just increased my income by 300% in about 30 lines of code. Each IP I generate is on the same ISP as one of the ones I’ve really captured, so I know that I’m not going to have to submit multiple court orders. Some of these will likely not be allocated at the time of the alleged infringement, so the ISP won’t be able to identify them to a user, but IPv4 is oversubscribed so we’re not just taking wild stabs in the dark. A lot of these will be allocated. I’ve looked through ACS:Law’s spreadsheets and on one of the BSkyB ones, 1387 out of 5699 records are ‘unknown’. On another 1325 out of 5358 are unknown. What’s going on here? It doesn’t say why they’re unknown, but it’s a massive failure rate.
The basic idea of this code is that it takes a known IP, it then looks up that IP’s WHOIS record which tells us what range (subnet) it’s on. Then we start generating IPs at random from that range. We also randomise the date within a 14 day range (although if I was making a serious effort, I’d then truncate the hours/minutes/seconds and instead choose time of day randomly from a non-linear distribution to favour evenings during the week (negatively skewed normal distribution, mean 8pm) and daytime at weekends — this would take an extra 10 lines or so but I’d have to read up on both the random and date modules so I didn’t bother). Also if I was making a serious effort I’d drop or regenerate IPs whose last byte is 0 or 0xFF.
When someone says “I’ve got an IP address” the correct response is not “alright here’s a court order to get the ISP to tell you who it belongs to so you can send them a letter demanding £500″; the correct response is “so?”.
anagram generator in python
it helps me with the crosswords :p
def ramnaga(letters, word = ""):
if len(letters) == 0:
if is_word(word):
return [word]
else: return []
words = []
for i in range(len(letters)):
newletters = letters[0:i] + letters[i+1:]
l = letters[i]
words += ramnaga(newletters, word + l)
return words
this would help of course if wordpress didn’t mess up the indendation [edit: ahh sourcecode mode] [edit2: untested modification, watch out!].
implementation of is_word() is left as an exercise to the reader. You obviously need a wordlist (*nix users can look in /usr/share/dict/words. Windows users can scour the internet because their OS is rubbish and doesn’t come with anything useful. I recommend using the ‘set’ object to store it because it automatically removes duplicates and it provides O(1) membership testing, which is all you care about). If you remove the is_word condition, it’s just a standard permutation generator but it’ll use a tonne of memory (easily 350MiB on a 10 character word) storing gibberish before you can filter it.
On my CPU it takes about 25 seconds on a 10 character word and about 250 on 11 letters. It’s impractical above that. If you reimplented it in C you might squeeze a 12 letter word out, but there’s a limit to how fast you can get an inherently O(n!) algorithm. I’d be interested to know if you could use some kind of coding theory to figure out permutations based on hamming distances, or something. It seems then that your complexity would be in terms of the size of the dictionary?
call as follows:
>>> words = list(set( ramnaga("ramnaga") )) # list(set()) to uniqueify elements
# and convert back to a list
>>> words
['anagram']
written in about 4 minutes due to the extremely rare good luck of getting a recursive problem right first time.