UPDATE: I have also started using the Regex module to handle some basic if/else branching. See my most recent post on this.

I noticed a few days back that Yahoo Pipes had added a regex module (I think this is new) and, wonder of wonders, they’re actually Perl-compatible regexes. (Aside: remember back in the day when regexes were a core element of the Perl basher’s repertoire? Now every language has a module for PCREs? Yeah.. I thought so, punks!) Anyhow.. I decided to give it a whirl and created a new Technorati search pipe, same as the old one, but this time I append the domain-ey part of the link to the end of the title. I only want the important part of the domain, so instead of “http://www.slashdot.org/something/here” I just want “slashdot” appended in square brackets. The goal was to get titles that looked like this:

‘Fair Use Bill Introduced To Change DMCA’ [teleread]

where we pull ‘teleread’ from the url, which in this case was:

http://www.teleread.org/blog/?p=6240

It turns out this is super simple. I just added one new regex module to the pipe and gave it a few rules. The first one just replaces the end of the title with this:

--==${link}

All that does is add the full url of the link to the title, with the weird series of symbols “--==” in front of it. So, if that were the only rule, the title of a post from slashdot might end up looking like this: “Some Fancy Title --==http://www.slashdot.org/some/fancy/url”.

That’s clearly not what we want, so the second rule uses a moderately complex match:

--==http://(?:www\.)?([^/]+)(?:\.\w{2,4})/.+?$

So it starts matching when it finds our weird string “--==” and goes on to parse the link. The link begins with http:// - then we match an optional “www.”, because we don’t want to have that in our string (remember we just want the good part of the domain). Then the key part happens, where the capturing group matches everything that isn’t a slash. That’s the good part of the domain. The last part of the domain, the .com, .info, .org part, gets matched by the last grouping “\.\w{2,4}”, which says match a dot and then 2 to 4 word characters (letters, digits or underscore). Because that grouping has to match too, the capture gives back the final dotted piece and keeps only what comes before it. Once that’s done, we match the rest of the url (the path part) and replace the whole match with [$1], which is what we captured in that key part of the regular expression, the domain, surrounded by square brackets. Very importantly, notice that not only does this trim the link down to just the domain, but it gets rid of our delimiter, “--==”.
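To make the first two rules concrete, here is a rough sketch of them in Python’s re module rather than in Pipes itself. The function name tag_title is made up for illustration, and I’m assuming the first rule puts a single space before the delimiter, since the example titles show one.

```python
import re

# Rule 2 pattern, copied from the post.
DOMAIN_RULE = re.compile(r"--==http://(?:www\.)?([^/]+)(?:\.\w{2,4})/.+?$")


def tag_title(title, link):
    """Hypothetical helper: emulate rules 1 and 2 of the regex module."""
    # Rule 1: replace the end of the title with " --==<link>".
    # (The leading space is an assumption; a function is used as the
    # replacement so backslashes in the link are not treated as escapes.)
    stashed = re.sub(r"\Z", lambda m: " --==" + link, title)
    # Rule 2: swap the delimiter plus the whole url for "[domain]".
    return DOMAIN_RULE.sub(r"[\1]", stashed)


if __name__ == "__main__":
    print(tag_title("Some Fancy Title", "http://www.slashdot.org/some/fancy/url"))
```

Run against the teleread link from above, this should give a title ending in [teleread]. An https link doesn’t match the second rule at all, so the delimiter and full url stay in the title.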

And the last bit is just a cleanup. If that second rule, just above, didn’t actually match anything (for some reason it failed, perhaps it’s a secure blog and the link started with https instead of http), we’d be left with an ugly title. So, we match everything from our delimiter “--==” to the end of the string and strip it, just in case.
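The cleanup rule might look something like this in Python. The post doesn’t give the exact pattern, so the one below is my guess at it: it also eats any whitespace before the delimiter so the title doesn’t end with a stray space. The name strip_leftover is made up for the example.

```python
import re

# Assumed cleanup pattern: optional whitespace, the delimiter, then
# everything up to the end of the string.
CLEANUP_RULE = re.compile(r"\s*--==.*$")


def strip_leftover(title):
    """Hypothetical helper: drop a delimiter and url that rule 2 left behind."""
    return CLEANUP_RULE.sub("", title)


if __name__ == "__main__":
    print(strip_leftover("Secure Post --==https://secure.example.org/entry"))
    print(strip_leftover("Some Fancy Title [slashdot]"))
```

A title that the second rule already handled contains no delimiter, so the cleanup leaves it alone.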

And there you have it!

I think the regex module is pretty powerful. There’s a stunning number of interesting things you can do with PCREs. I suspect that you could use this to maintain a certain type of state in your pipe, by using delimiters to stash some fields and using a random field as a scratch pad to make inputs for other modules. So you could take the title, append it to the end of the link, and then use the title field to hold anything else you like, perhaps formatting the output of content analysis into some more search-friendly string. Then later you can always restore the title by pulling it back out of the link.
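Here’s a rough Python sketch of that stash-and-restore idea. This is not something I’ve built in Pipes; the feed item is modelled as a plain dict with “title” and “link” keys, and the function names are invented for the example.

```python
import re

DELIM = "--=="

# Splits a stashed link into the real link (group 1) and the saved title (group 2).
STASH_RULE = re.compile(r"^(.*?)--==(.*)$", re.DOTALL)


def stash_title(item):
    """Hypothetical: park the title at the end of the link, freeing the title field."""
    return {**item, "link": item["link"] + DELIM + item["title"]}


def restore_title(item):
    """Hypothetical: pull the saved title back out of the link."""
    match = STASH_RULE.match(item["link"])
    if match is None:
        return dict(item)  # nothing was stashed, leave the item as it is
    return {**item, "link": match.group(1), "title": match.group(2)}


if __name__ == "__main__":
    item = {"title": "Some Fancy Title", "link": "http://www.slashdot.org/some/fancy/url"}
    stashed = stash_title(item)
    stashed["title"] = "scratch pad for some other module"
    print(restore_title(stashed) == item)
```

The obvious catch is that the delimiter must never appear in a real link, which is why a weird string like “--==” is a reasonable choice.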

I can’t think of anything that would require that at this instant, but I suspect it’s because my imagination is bounded. In the end, I think this module does a lot for Yahoo Pipes and I suspect that a lot of interesting stuff will come of it. I hope so, at least!
