
Saturday, June 21, 2008

Why Web Application Firewalls don't block


Jeremiah Grossman describes it much more concisely than I did.

To implement default-deny Web Application Firewalls (WAF) must know everything about a website at all times, even when they change. That’s programmatically documenting every expected request method, URL, parameter name/value pair, cookie, process flow, etc making default-permit deployments the rule rather than the exception.

I smell propaganda


Being in a post-communist (whatever that might mean) country has some advantages. For example, it sensitizes you to propaganda. You can smell it instinctively, and you immediately start to raise questions: How true is this? What are the supporting facts?

Wikipedia defines propaganda as:

Propaganda is a concerted set of messages aimed at influencing the opinions or behaviors of large numbers of people. As opposed to impartially providing information, propaganda in its most basic sense presents information in order to influence its audience.

Why am I writing this? Because I'm watching The Story of Stuff. Now I do believe that there are some (major) problems, but this presentation is still pure propaganda. It uses generalization, figures pulled out of thin air and argument from authority (I've been looking into this for ten years. Trust me, I know what I'm talking about). Pure propaganda.

Incidentally, I remember a presentation from the local Chinese cultural days I went to some years ago and they showed a very nicely done propaganda piece. It was very reminiscent of the pieces we've been fed (the soundtrack was available in 30+ languages!). So if people are ready to buy that, they will surely be swung by this.

YATP - Yet Another Twitter Problem


Twitter isn't the most reliable service out there. Today I signed up to follow a friend who is too lazy to type more than 140 characters at a time ;-), so doesn't blog.

While signing up, the CAPTCHA didn't show up. After several page refreshes I took a look at the source: it uses reCAPTCHA, which is all nice and good, but for some reason my wonky DNS server couldn't resolve the address of the reCAPTCHA API server :-(. So finally I looked it up with OpenDNS:

dig @

And hardcoded it in my hosts file (this will be really nice when they move servers and I forget that it is in the hosts file and start to wonder why it doesn't work again :-)).

The moral here is that when you depend on an external service, you will have problems if that service goes down (and down can be very tricky to define - for example reCAPTCHA wasn't down for people using a sane DNS server, it was down just for me - this means that even if Twitter used a network monitoring solution to check the availability of reCAPTCHA, it still wouldn't have done me any good).

PS. It was asking for my e-mail password to spam my friends. Yeah, like that's going to happen...

Locking a script to a given user with Perl


From a security point of view it is useful to lock sensitive scripts (for example things which download untrusted data from the Internet) to a low-privileged user. However it is also a good idea to make sure programmatically that they are run only as the given user. One possible solution (which I will present today) is to create a module which checks the current user (and dies if it isn't what you expected) and include it in your scripts. The module can be something like the following:

package LimitedUser;

use strict;
use warnings;

BEGIN {
 if (not $^C) {
  my $expected_user = 'limited_user';
  if ($expected_user ne getlogin()) {
   require Term::ANSIScreen;
   Term::ANSIScreen->import('color');
   my $message = qq{This script can only be launched from the account "$expected_user"\n};
   print STDERR color('bold'), color('red'), $message, color('reset');
   exit 1;
  }
 }
}

1;


Now let's look at this module a little. First off, it does its work very early on, in a BEGIN block. This means that if you import this module with use (rather than require), it will run before most other code gets a chance to, preventing the rest of the script from running if the current user is not the one expected.

Second, it checks the $^C variable to see if we are only compiling the script (perl -c is often used to syntax check scripts, although in Perl it does slightly more than a plain syntax check would in other languages, since BEGIN blocks are executed). If this is the case, the user check is not performed, so that development can be done using any user.

The module has been tested on both Windows and Linux and should work without any problems.
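Just to make the idea concrete, here is the same check sketched in Python (the account name is made up, and Python has no direct equivalent of Perl's compile-only $^C check, so this version runs unconditionally):

```python
import os
import pwd
import sys

EXPECTED_USER = "limited_user"  # hypothetical low-privileged account


def enforce_user(expected=EXPECTED_USER):
    """Die unless the script is running under the expected account."""
    current = pwd.getpwuid(os.geteuid()).pw_name  # effective user name
    if current != expected:
        sys.stderr.write(
            'This script can only be launched from the account "%s"\n' % expected
        )
        sys.exit(1)
```

Calling enforce_user() at the top of a script is the moral equivalent of the use LimitedUser; line.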

Wednesday, June 18, 2008

Get the IP of the local computer from Perl


Caveat: this is only documented on Windows and may or may not work on other OSs (it doesn't work on Ubuntu 8.04). Also, if the computer has multiple IP addresses (like a LAN, WLAN and a VLAN IP), there is no telling which IP this will return.

Just a little snippet of code:

print join('.', unpack('C*', gethostbyname(''))), "\n";

As per the documentation:

If the name parameter points to an empty string or name is NULL, the returned string is the same as the string returned by a successful gethostname function call (the standard host name for the local computer).

Also, other interesting snippet from the same page:

Note The gethostbyname function does not check the size of the name parameter before passing the buffer. With an improperly sized name parameter, heap corruption can occur.

I wonder if the implementors of libraries for interpreters (Perl, Python, etc) do this check...
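Coming back to the snippet itself: the same trick can be written portably in Python (just an illustration; it has the same caveats about multi-homed machines, and it still depends on how the local host name resolves):

```python
import socket

# Resolve our own host name, like gethostbyname('') does on Windows.
# On a machine with several addresses there is no telling which one you get.
local_ip = socket.gethostbyname(socket.gethostname())
print(local_ip)
```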

Monday, June 16, 2008

Reboot Windows - the hard way


I was clicking around via an RDP session on a Windows server and managed to kill the LSASS process (note to self: next time pause the view of Process Explorer before killing processes!). The one minute till reboot screen promptly appeared and my first reflex was to stop the countdown (a trick which came in handy back when the LSASS-killing worm was making the rounds). So I typed:

shutdown -a

My next step was to try to restart Windows (having no Local Security Authority process is not the best situation to be in). However I quickly found that:

  • The shutdown command wasn't working
  • Process Explorer could not restart the computer because it was trying to acquire the shutdown privilege dynamically, and given that no LSASS process existed, it failed

I also tried to start the lsass process, but to no avail.

My options were to wait until somebody showed up on site (which could take a couple of hours) or to do something else. Naturally I decided to do something else :-).

I remembered this source code which uses an undocumented API call to change the IOPL level of the process (meaning that you can read/write from/to the ports from user mode) and then uses the keyboard controller to create a hardware reset. Don't do this at home :-)

After compiling it, the first hurdle was to get it on the machine. With LSASS not running, file shares were not working, but fortunately I had access to an FTP server and used the command line FTP client to download the executable.

The second problem was to create a session with SeTcbPrivilege. Doing it as described on the page was not really feasible (without LSASS user management wasn't really working, and even if it were, I'm not sure that I could have logged back in). So I started CMD.EXE under the SYSTEM account, with the method described here, at which point I already had the required privileges.

The final problem was that the program contained code to acquire the privilege and checked the return code. This call of course failed, not because I didn't have the privilege, but because LSASS was not running. So I removed the error checking code, and voila!

The connection went silent, and I waited and waited, wondering if I had done the right thing. After about four minutes the machine came back up. W00t! But as I said earlier: don't try this at home!

Other bugs which are passé


After talking about a problem with older versions of Opera, here is a problem with version 5.8.4 of ActivePerl for Windows (but which isn't present in 5.8.8, so the simple solution is to upgrade):

If you use the POSIX module to print out the day of the week (Friday, Monday, etc) with code like the one below:

use POSIX;
print POSIX::strftime('%A', localtime());

The Perl interpreter will lock up. If you try to use a formatting string which has more placeholders (for example strftime( "%A, %B %d, %Y", 0, 0, 0, 12, 11, 95, 2 );), it will omit the weekday part. I discovered this while investigating why the emails sent by the SVN commit hook had no date, only to find that they were missing the weekday, thus producing an invalid date. Below is a quick patch to work around this issue (the proper solution is of course to upgrade to 5.8.8 or 5.10):

my $mday = (localtime())[3]; # day of the month (the weekday comes from %a below)
my $formatted_date = strftime('%a %b %% %X %Y', localtime());
$formatted_date =~ s/%/$mday/; # fill in the literal "%" left by "%%"

Benchmark with care


I saw a site recently which was constructed as a way to open people's eyes to the fact that not every PHP code snippet will run at the same speed.

This is useless! Or let me reformulate: this is misleading! For one, it uses microtime to do the benchmark, which measures wall-clock time as opposed to CPU time, meaning that as the load varies on the server, the results will change (since they are re-done at each page refresh).
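The distinction is easy to demonstrate; in Python (used here purely for illustration), the two clocks are exposed separately:

```python
import time

wall_start = time.perf_counter()  # wall-clock time, like PHP's microtime()
cpu_start = time.process_time()   # CPU time consumed by this process only

total = sum(i * i for i in range(200_000))  # some busy work to measure

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
# On a loaded machine, wall can be much larger than cpu;
# cpu is the fairer metric for micro-benchmarks.
print("wall=%.4fs cpu=%.4fs" % (wall, cpu))
```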

Also, source code readability is far, far more important than such micro-benchmarks. Usually there is more than one way to do it in a programming language, so that programmers can choose the representation closest to the semantic meaning they had in mind (thus minimizing the effort needed to translate it into code and hence minimizing the error rate).

As I said earlier, the best optimization technique is to know the functionality available to you out-of-the-box as well as possible and to avoid looping through data in PHP.

Things you can get for free


It is amazing what (commercial) software you can get for free today:

  • Delphi
  • Visual Studio
  • IDA
  • if you are a student, you can get lifetime (!) access to all of Microsoft's products, provided you don't use them for commercial purposes

I for one prefer the open-source alternatives, because I know that I can read the source code (the ultimate documentation) and poke around with it, but even if you aren't an open-source enthusiast, you can thank the open-source movement for creating a marketplace where such altruism is expected from companies.

Mixed links


First off, a nifty tool for all of you (us?) people using RDP: Terminal Server Ping Tool. What's even better, it's written in Delphi and the full source code is available.

Opera 9.5 is out, making problems with older versions obsolete. At the same time, via the Think Vitamin website: Opera Dragonfly. This seems to be their response to Firebug. It is good to see browsers giving developers the tools needed to do their job quickly and efficiently.

Also, some quick Perl news: a new module called autodie is in the works to correct some of the deficiencies of use Fatal. Watch the introductory video for more details.

Sunday, June 15, 2008

Subscribing to a members-only SMF forum via RSS


This is one of those bug or feature? cases. I'm a member of an online forum which uses Simple Machines Forum (or SMF for short). This is a members-only forum, meaning that if you are not a member (or not logged in) you see only a small subset of the forum.

Now I would like to subscribe to the posts via RSS (since this is my preferred way of consuming them), but I faced the following problem: unless I accessed the feed from Firefox, I only got the posts which were in the public area (not very interesting). This was true both for desktop-based and web-based readers, and I suspected the cause was that these clients were not logged in when they fetched the feed.

So I did a little hack: I got the value of my PHP session ID (also, my session is set to never expire, which is not very secure, but convenient). You can do this by viewing the cookies associated with the given site and getting the value from the PHPSESSID cookie (it should look something like "d41d8cd98f00b204e9800998ecf8427e" - without the quotes).

Now take the RSS feed URL and append the session id to it like this:;action=.xml;PHPSESSID=d41d8cd98f00b204e9800998ecf8427e

This will fetch the feed with your credentials. Some caveats: if you are using an online reader (like Bloglines), this means effectively trusting them with your session. Also, this may or may not work depending on PHP settings and the given SMF version (I didn't look at the source code for the forum to confirm that it would always work). If your session expires, this method will stop working.
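If you want to test whether the trick works outside the browser, a few lines of Python can fetch the feed with the session attached; here the session id is sent as a cookie instead of being tacked onto the URL (the feed URL and session id below are placeholders):

```python
import urllib.request

FEED_URL = "http://forum.example.com/index.php?action=.xml;type=rss"  # placeholder
SESSION_ID = "d41d8cd98f00b204e9800998ecf8427e"  # placeholder, use your own

req = urllib.request.Request(FEED_URL)
# Present ourselves as the logged-in session when fetching the feed.
req.add_header("Cookie", "PHPSESSID=" + SESSION_ID)
# feed = urllib.request.urlopen(req).read()  # actual network fetch omitted here
```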

Flaws in the Cisco PIX appliances


Via NetworkWorld (emphasis added):

  • Crafted TCP ACK Packet Vulnerability
  • Crafted TLS Packet Vulnerability
  • Instant Messenger Inspection Vulnerability
  • Vulnerability Scan Denial of Service
  • Control-plane Access Control List Vulnerability

The first four vulnerabilities may lead to a denial of service (DoS) condition and the fifth vulnerability may allow an attacker to bypass control-plane access control lists (ACL). Note: These vulnerabilities are independent of each other. A device may be affected by one vulnerability and not affected by another.

I don't know which scares me more: the fact that these types of vulnerabilities exist in devices which should enforce some basic separation between networks, or the fact that they have a feature called Instant Messenger Inspection.

Story cards


This is really nice:

From the Atlassian Developer Blogs

Humane markup language


I'm always searching for methods to make my blog postings better. And by better I mean:

  • Easier to write. This means both speed (because I already spend quite a lot of time with different side-projects) and less formatting-cruft to add (so that I can concentrate on the actual content)
  • Offer useful features to the readers (like syntax highlighting and line numbering for source code)
  • Make the markup clean and semantic

Naturally I was interested when the post Is HTML a Humane Markup Language? showed up in my feed reader, followed by Markdown, Oh the Humanity. However I find myself on the opposite side of the fence, agreeing with the points from the original Why Doesnt Wiki Do Html post. The most important rule is (and I'm paraphrasing here):

A blog's emphasis is on content, not presentation. Simple markup rules make people focus on expressing their ideas, not on making them pretty.

The two main gripes I have with HTML as a markup language are:

  • It needs a lot of typing. Just for this list I had to type out all the list markup by hand. This is both inefficient and makes you get out of the zone, slowing you down. Remember, we can't truly multitask, just switch quickly between different tasks, and the less we need to do that, the more efficient we are (and mind you, I've been using HTML since the days of Netscape 2.0, so it's not that I'm not familiar with it)
  • Sometimes a lot of code is needed to express concepts like syntax highlighted code. This code (generated automatically) does not lend itself to automatic transformations (for example, what if I decide that I want to add/remove line numbering on all of my code samples?)

Currently I see two possible solutions to the problem:

Use a WYSIWYG editor. The problem with these is that they tend to produce horrendous markup, although lately they have started to improve. Currently I'm toying around with Windows Live Writer, but I don't like certain aspects of it and I'm thinking about writing some plugins to make it behave the way I want to - if only I could make it run with Mono... Its main advantage is that, as a native (as opposed to web) application, it can nicely interface with the filesystem and make things like uploading images easier.

As with any problem in computer science, things can be solved by introducing another layer of abstraction :-). That is, some kind of markup language which codifies the concepts important to the user. This solves the problem of brevity (usually) and also makes it simple to change the styling (simply replace/tweak the code which translates it into HTML). The syntax offered by DokuWiki is quite good (although it has some inconsistencies), while the MediaWiki syntax seems cobbled together, a mix of pseudo-HTML and custom tags, and don't even get me started about their table syntax :-|. But in general the approach is workable.

Finally, some words about WYSIWYG editors for wikis: they are really useful for initial adoption, but the usual approach of retrofitting a full editor like TinyMCE suffers from not being extensible (this is especially important for software like DokuWiki where there are a lot of extensions which produce their own custom tags). One interesting approach which I found in the comments was the markItUp! editor. Aside from being based on my favorite JS library (jQuery), it has the advantage that it tries to speed up editing of the markup language, rather than trying to perform the complete HTML -> markup conversion and back. The approach is similar to the one employed by the XStandard editor (which I heard of via the excellent Boagworld podcast), whereby, rather than letting people directly control the style of the text, they mark the type of content it represents, and the style sheet controls the actual way it is displayed.

Defense in depth for programming


Two things you should always do when developing in Perl are use strict and use warnings (with the caveat that warnings should be disabled in production systems or redirected to a log file).

However recently I was reminded that nothing is 100% (and this isn't a compiled-vs-interpreted issue, because there are many errors compilers don't catch). The code in question was:

my $bar = "some $expression";
foo($param1, $param2, , $param4); # note the doubled comma

What happened here was that I wanted to refactor the code and pull the expression out of the function call (because it was becoming long and unwieldy), but after moving the expression out I forgot to put the new variable back in. Perl has a (mis)feature whereby an empty slot between commas is simply skipped, so foo basically got a three-member list instead of a four-member one. And the error message wasn't especially helpful either (it kept complaining about undefined values in the called function, and when I visually inspected the call, all I saw was that $param4 is defined).

Takeaway lessons:

  • Just because strict doesn't complain, it doesn't mean that your code is correct (just as if a compiler doesn't complain...)
  • Have other methods to ensure correctness: unit tests, integration tests, ...

Microsoft and the "Not Invented Here" syndrome


A couple of days ago I was listening to a recent episode of Hanselminutes (a great podcast BTW) about Distributed Caching with Microsoft's "Velocity", and the only thing I could think of was:

How is this different from memcached?

This is why MS should be broken up into smaller divisions: to keep them from reinventing the wheel. Of course their development methodologies / QA process has a lot of benefits, but those must be applied anyway, and it is a waste of resources to write something from scratch and push it through the QA process, compared to taking something already available and doing the same (perhaps adding pieces like VS/IIS integration, for example).

Saturday, June 14, 2008

Resolving the problem with inserting formulas in OpenOffice with Ubuntu 8.04 (Hardy)


While writing some text (under Ubuntu 8.04 Hardy with OpenOffice 2.4) I wanted to insert a formula, and much to my surprise the Insert -> Object -> Formula menu item was grayed (greyed?) out. My first reaction was that this was probably because I was missing Java (or more precisely: OpenOffice didn't recognize my installation of OpenJDK 6). So I tried to configure it, to no avail. I kept getting the error message: no java installation was found in the given directory. Note to developers: it would be really nice to specify exactly what file you are looking for in the directory, so that I can search my HD and locate the correct folder!

I solved the Java issue by installing the package via apt-get, but this still didn't solve my formula problem:

sudo apt-get install

Finally I found some posts on the ubuntu forums which indicated that all you have to do is install the correct package:

sudo apt-get install

This solved the problem, but raises the question: why wasn't this package installed by default? I suppose that someone (Sun? Ubuntu? Debian?) was following the Microsoft Office practice of not installing the Equation Editor by default, but this is the one optional component I always installed when I used Microsoft Office.

Disabling mod_deflate for certain files


I received a question regarding my compressed HTTP post. It goes something like this:

I want to use a PHP script as a kind of transparent proxy (that is, when I request a file, it downloads it from an other URL and serves it up to me), but mod_deflate keeps eating my Content-Length header.

My advice is of course to selectively disable mod_deflate for the given PHP script. This would mean putting something like the following in your .htaccess file:

SetEnvIfNoCase Request_URI get_file\.php$ no-gzip dont-vary

Where get_file.php is the name of the script. Some things to remember here:

  • the dot character needs to be escaped, because it is a meta-character in regular expressions (meaning any character), so you must prefix it with a backslash to get a literal dot
  • Request_URI represents the URL up to the filename, but not including any query parameters. For example if you have the URL, Request_URI will contain However be aware that Apache contains a nice trick which can be used to generate nice URLs but which can affect this: if the given file/directory is not found, it tries to go up the path to find a file. For example, I could have written, and if the script contains the logic to serve up talk.mp3, we can have a nice URL. The side effect is however that the Request_URI is now and the regular expression must be adjusted accordingly (into something like \.mp3$)

A final word of warning: if your host allows you to open files via URLs (with readfile for example), run, run far away, because this is a very insecure configuration for PHP and chances are that the server (especially if it is shared between multiple users) will be pwned quickly.

Automated analysis


Disclaimer: the views expressed here are my own, and unless expressly stated, do not necessarily represent the views of any former or current employer.

Automated security analysis is good for dealing with a large flux of (possibly) malicious files, however information resulting from these types of sources must be clearly marked as such (as opposed to information derived by humans). Example:

In a malware description from TrustedSource we find the following lines (emphasis added):

C:\autorun.inf This is a non malicious text file with the following content:


Clearly this is one of those simplistic infect USB drives types of malware, and the autorun.inf file is a key component of it. While it is not harmful in itself, it should clearly be removed (an analogy might help: let's say that a malware is composed of an executable and a DLL which it loads. The DLL itself is not active unless the executable loads it, but it should still be flagged and removed).

In conclusion: automatically generated information is good, but please do mark it as such. And also: in the name of science, question everything.

The windows kernel, software licenses and other ramblings


Somehow I ended up at an article on CodeProject titled How can I get address of KeServiceDescriptorTableShadow. The first thing that caught my eye is the fact that the contributor claims to be from China and a web developer. This seems to be a common attitude in China (and also in Russia) if you are in IT: you do whatever you have to do during the day to earn your living, but you are not considered l33t unless you do something involving kernel mode programming.

The second thing that struck me about this article, which is echoed by the comments, is the fact that it gives an alternative method for something for which there are official, well documented APIs. This again seems to be a cultural difference: the Chinese tend to value working solutions as opposed to well-architected solutions. An example: the famous QQ messenger includes a kernel-mode protection component. In my opinion, this doesn't solve anything (and it breaks things I consider essential - like being able to run network facing applications from non-administrative accounts), but it seems to solve (some) problems at least temporarily for their user base.

The third thing that struck me is the existence of The Code Project Open License (CPOL). I feel very strongly about this, and not in a good way. There are enough licenses out there already: GPL (v2 and v3), LGPL, BSD, MIT and so on. It is already a big enough headache figuring out what can be used with what (just an example: ZFS cannot be included in the Linux kernel because it uses a custom Sun license instead of a standard one); introducing a new piece into the puzzle will just complicate things and make code less attractive.

Friday, June 13, 2008

Is vulnerability research ethical?


Over at the TaoSecurity blog you can find a good summary of the Bruce Schneier (nice poster btw) vs Marcus Ranum face-off regarding the ethics of vulnerability research (also read the comments, they are worth your time).

I fully agree with Bruce on this and think that Marcus is confusing two things: the act of finding the vulnerability and what you do after it. Just as law and justice are not the same thing (trivia: this is why Justitia, the Roman goddess of justice, is never portrayed with a lawbook in her hands, although many people think she is because they confuse her with the Statue of Liberty), vulnerability research and your disclosure method are not the same thing. Bruce Schneier summarizes nicely why it is important to have people who know how to break things:

When someone shows me a security design by someone I don't know, my first question is, "What has the designer broken?" Anyone can design a security system that he cannot break. So when someone announces, "Here's my security system, and I can't break it," your first reaction should be, "Who are you?" If he's someone who has broken dozens of similar systems, his system is worth looking at. If he's never broken anything, the chance is zero that it will be any good.

What you do with your knowledge (the main thing Marcus focuses on) is a separate matter. As long as you:

  • Try to contact the vendor/author first
  • Try to coordinate with them to make sure that the disclosure comes after the patch is available
  • Wait a reasonable amount of time before going public
  • Not sell/give the information to people whose need for it is not well motivated (an IPS/IDS vendor being an example of a legitimate need)

I consider the action of disclosing a vulnerability (even with proof of concept code) ethical.

Why exercise?


Geeks have a hard time justifying exercise to themselves. Sure, it makes you healthier and you live longer, but you'll have less time for tinkering with your toys. I found the perfect reason. Via the Security4All blog:

It seems that as little as 20 minutes of exercise 3 times a week doubles (!) your problem solving capacity. Now, as we know, 60 percent of all statistics are made up on the spot :-) and the presenter might have an axe to grind (something to sell, for example), but this seems to be in line with the common take a pause from time to time and move around a little advice. For more videos check out his youtube channel.

Saturday, June 07, 2008

Why is security in such a sad state?


Disclaimer: as always, unless expressly stated, the views expressed here are my own and do not necessarily reflect those of my current or former employers.

Because people hide behind titles!

Some examples:

Gary Warner, Director of Research in Computer Forensics, lists on his blog IP addresses associated with the latest run of Storm. I thought that everybody got the memo, but seemingly some didn't: Storm is hosted on a fast-flux network, using compromised home computers! Enumerating those IPs has about the same value as saying fraud is happening on the Internet!

From the ThreatExpert blog: new Rustock, blah, blah, All communication with the server is encrypted with SHA1. I kid you not. There are still some people out there who don't know the difference between encryption and hashing.

Final example: as part of my studies I was attending a presentation held by a telecom equipment manufacturer, who was explaining some communication protocol and said the passwords are encrypted with MD5. I didn't want to embarrass the guy, but I really felt a strong urge to throw a thick security book at him. And these are the people responsible for the security of our communication infrastructure!

So remember, the next time you say encrypted with MD5/SHA1 I might be in the audience and you might get hit by a book!

To end on an optimistic note, here is an article which emphasizes the need to study and learn continuously.

Friday, June 06, 2008

Using hierarchical P2P applications to reduce the bandwidth problem


It never ceases to amaze me what speed a P2P (BitTorrent) client can achieve with almost no server infrastructure! Heck, I can download conference videos as fast as if they were served up by Akamai's carefully tuned architecture! Which brings me to my point:

While the current way of network design (aggregating connections with an n:1 ratio) seems to be the only economically workable method to build networks, it introduces choke-points by design. Multicasting has never really taken off at the IP level, but I think there is great potential for it to succeed at the application level (layer seven, for all the OSI geeks :-)). The key factors in my opinion are:

  • There is an increasing amount of rich media on the internet (videos, podcasts, etc)
  • Those files are mostly static! (An important criterion for making caching - be it centralized or distributed - efficient)

All we need is an attractive enough application built on such a foundation for the technology to take off, because we already have enough expertise to implement it.

Update: the BitTorrent protocol is somewhat vague on the peer selection method. Possibly each client implements a different heuristic, perhaps based on IP distance (taking the XOR of two IPs as the distance between them and preferring peers with lower distance). Two recent articles from George Ou's blog point out that there are many issues which must be correctly implemented for maximum performance. Also, the idea is certainly not new: software has already been created for distributed web caching and tested on a large scale.

Know your transactions and know them well


In the context of databases, transactions are usually thought of as a mechanism to ensure that different batches of work can be executed in parallel, with the result being the same as if they had been executed in series. This is however only the 10,000 foot overview, and as always, the devil is in the details.

First I would like to point to two great resources: the PostgreSQL documentation explains the four standard transaction isolation levels, giving also counterexamples (it is very important to know the limitations of a technology). If you prefer a slightly longer discussion and/or like podcasts, listen to episode 99 of Software Engineering Radio, which deals with transactions.

Now to give a concrete example (with PostgreSQL / plPgSQL): suppose that you wish to implement a 1-to-m structure in the database. You would then typically use three tables:

----------      --------      ---------
| Master |      | Link |      | Child |
----------      --------      ---------
| mId    | <--> | mId  |      | cId   |
| ...... |      | cId  | <--> | ..... |
                --------      ---------

When doing an insert, a naive approach would be:

SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
IF NOT FOUND THEN
  -- supposing that cId is a serial
  INSERT INTO Child (someAttribute) VALUES (someValue);
  SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
END IF;

And this works fine and dandy in development and test; however in production, when it gets some serious parallel pounding, things will start to fail with unique constraint violations (supposing that you have set up the tables correctly and put a unique constraint on someAttribute - otherwise you have data duplication, the very thing you wish to avoid with this structure).

What happened? The default transaction isolation level in PostgreSQL is read committed. Two transactions were trying to insert the same value, and the first one committed immediately after the second one did its select. The second one was thus under the impression that the given value does not exist in the database, while it actually does and is already visible to the second transaction (because it is committed).

How can you guard against this? The first-instinct solution would be to raise the transaction isolation level to serializable; however, this has a big performance impact (although it works). The second solution is to catch the unique constraint violation exception and conclude that it doesn't matter who inserted the value - it already exists, thus we can use it:

SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
IF NOT FOUND THEN
  BEGIN
    -- supposing that cId is a serial
    INSERT INTO Child (someAttribute) VALUES (someValue);
  EXCEPTION WHEN unique_violation THEN
    -- someone else already inserted it, but we're happy with it
  END;
  SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
END IF;

One final remark: this code is optimized for the case when the value already exists most of the time. If the opposite is true (the value needs to be inserted in the majority of cases), you could use the following slight variation, which does not check before trying to insert:

BEGIN
  INSERT INTO Child (someAttribute) VALUES (someValue);
EXCEPTION WHEN unique_violation THEN
  -- already exists, do nothing
END;
SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
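The same insert-then-handle-the-conflict idea also works from application code. Here is a sketch using Python's sqlite3 module as a stand-in for a real PostgreSQL client (table and column names are made up; SQLite signals the duplicate with IntegrityError where PostgreSQL raises unique_violation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE child (cid INTEGER PRIMARY KEY, attr TEXT UNIQUE)")

def get_or_create(conn, value):
    # Try the insert first; a unique constraint violation just means
    # somebody else got there before us, which is fine.
    try:
        conn.execute("INSERT INTO child (attr) VALUES (?)", (value,))
    except sqlite3.IntegrityError:
        pass  # already exists, we are happy with it
    row = conn.execute(
        "SELECT cid FROM child WHERE attr = ?", (value,)).fetchone()
    return row[0]

first = get_or_create(conn, "some value")
second = get_or_create(conn, "some value")  # hits the existing row
print(first == second)  # → True
```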

Living on the edge with Ubuntu


As I said earlier, I'm not very impressed with Ubuntu 8.04 (hardy), but I'll give it another chance. There are some rumors floating around that on July 10 there will be some major updates, and the steady stream of updates seems to be fixing a few issues as well.

Warning! Don't do this if you can't deal with possible breakage and/or your system is critical!

If you wish to keep up with the latest updates to the packages, even if they are not yet released to the public at large, go to System -> Administration -> Software Sources and on the Updates tab check Pre-released updates. This allowed me to get Firefox 3 (instead of b5), for example. However, be ready to drop down to the terminal and do some apt-get update / apt-get upgrade / dpkg-reconfigure.

Using Eclipse with OpenJDK 6 on Ubuntu


Update: There seems to be a simpler way to do this. Take a look at the second comment.

Java 1.6 (also known as Java 6) is now open source, so I installed it on Ubuntu and tried to run Eclipse with it. Unfortunately Eclipse said that no compatible Java VM was found while searching /usr/lib/j2sdk1.4-sun/bin/java. So I dropped to the command line, tried to run it from there, and found out several things:

searching for compatible vm...
  testing /usr/lib/jvm/java-gcj...not found
  testing /usr/lib/kaffe/pthreads...not found
  testing /usr/lib/jvm/java-6-sun...not found
  testing /usr/lib/jvm/java-1.5.0-sun...not found
  testing /usr/lib/j2se/1.5...not found
  testing /usr/lib/j2se/1.4...not found
  testing /usr/lib/j2sdk1.5-ibm...not found
  testing /usr/lib/j2sdk1.4-ibm...not found
  testing /usr/lib/j2sdk1.6-sun...not found
  testing /usr/lib/j2sdk1.5-sun...not found
  testing /usr/lib/j2sdk1.4-sun...not found
Could not create /usr/local/lib/eclipse/.eclipseextension. Please run as root:
    touch /usr/local/lib/eclipse/.eclipseextension
    chmod 2775 /usr/local/lib/eclipse/.eclipseextension
    chown root:staff /usr/local/lib/eclipse/.eclipseextension

First, it was searching in more locations than the dialog box indicated. Second, there seem to be some additional problems related to extensions; fortunately the error message also contained advice on how to fix them. So first of all I created a symlink at the location where Eclipse was expecting to find Java:

sudo mkdir -p /usr/lib/j2sdk1.6-sun/bin
sudo ln -s /usr/bin/java /usr/lib/j2sdk1.6-sun/bin/java

Second, I followed the advice from the error message:

sudo su
touch /usr/local/lib/eclipse/.eclipseextension
chmod 2775 /usr/local/lib/eclipse/.eclipseextension
chown root:staff /usr/local/lib/eclipse/.eclipseextension

And now Eclipse seems to run fine.

Random YouTube videos


Via George Ou's blog

Via Grand Stream Dreams, a series of Mac vs PC vs Linux videos:

Profiling PHP with XDebug


Resolving performance problems is hard (even more so when you have to do it in somebody else's code), and some clear measurements are very welcome. I tried out XDebug some years back and it didn't work very well then, but the latest release seems quite good. Here is how to use its profiler:

  1. First and foremost, make sure that you are not using it on a production server! Good as it may be, it adds a substantial slowdown to the processing, and in some rare cases the server could crash!
  2. Find out the PHP version you have and the location of your php.ini. You can do this by viewing a simple PHP file containing nothing more than <?php phpinfo(); ?>
  3. Get the XDebug module corresponding to your PHP version. This can be as simple as doing sudo apt-get install php-xdebug or going to the XDebug site and downloading it
  4. Register XDebug as a Zend extension (it is very important not to register it as a normal extension, that is with the extension directive, but as a Zend extension with zend_extension / zend_extension_ts, because it needs to hook in at a much deeper level than a normal extension)
  5. Setup the correct configuration in your php.ini. The documentation is a good starting point. Some things you should remember:
    • Make sure that xdebug.profiler_enable is set to 1
    • Make sure that xdebug.profiler_output_dir is set to an existing directory
    • Change the xdebug.profiler_output_name setting, since the default will almost certainly result in data loss (it saves the profile for the same page to the same file every time). I usually use cachegrind.out.%R.%t.%r
    These parameters can also be set on a per-directory basis, if that's what you need, using .htaccess files and php_value / php_flag.
  6. The directory containing the trace files can grow very big very quickly, so make sure that you keep an eye on it and purge it from time to time. This is another reason to use it only on a test server.
  7. At this point you have two options: you can use either your web browser to visit the target URL, or a command-line program like cURL or wget. The advantages/disadvantages are:
    • With a web browser you are using a real client which fetches all the dependencies of the page. You can also fetch pages with complex access methods (like HTTPS with sessions which require login). The disadvantage is that you have no precise measure of time, only impressions ("this feels slow"), which can mislead you.
    • Command-line programs don't fetch all the dependencies and need a little work to set up for authenticated pages (if you need HTTPS, I definitely suggest cURL), but they can be benchmarked much more easily.
  8. Now that you have the trace data, load it into something like KCacheGrind (for Linux) or WCacheGrind (for Windows). Don't let the fact that the latter hasn't been updated in three years scare you; it still works very well.
  9. At this point there are two possibilities: a function taking a long time to execute, or a function being called many, many times. Also, keep in mind that the times displayed in the profiling information are relevant only as orders of magnitude, not as exact values (again, enabling profiling slows the system down considerably, so disabling it will speed things up). There is no general silver bullet for making scripts fast; however, here are two tips to get you started:
    • Know the functions available in PHP and avoid rewriting them, since they are implemented directly in C, which is much more efficient. Be especially wary of constructs which loop over arrays, for example.
    • If a function is called repeatedly but always gives the same result for the same parameters, consider caching the result and trying a lookup in the cache before starting to compute the answer.

Happy profiling! To give you an idea about the benefits: I used this technique on the blog plugin for DokuWiki and obtained a speedup of ~20%!

Possible problem with Opera and setTimeout / clearTimeout


I haven't been able to reproduce this with the new 9.27 release; however, I'm quite sure that it is an issue in 9.25:

When you use setTimeout, you have two options: passing either a function reference or a string which gets eval-ed. In either case setTimeout is supposed to return an integer ID which can be used with clearTimeout. However, I found that with Opera 9.25 the returned ID was 0 for calls which used strings, and that calling clearTimeout with this value had no effect (i.e. the code still got executed). I can't reproduce it with 9.27, so everything is good now.

Two simple things you can do for Perl right now!


Go write a page for the Perl 5 Wiki and help it reach 1000 pages

Rate a CPAN module

Thursday, June 05, 2008

Getting the current interpreter in PHP


Recently a friend of mine asked if there is a way to dynamically find out the path of the PHP executable. The working scenario was: some scripts (run from the command line) spawn other scripts, and rather than hardcoding the path to the PHP interpreter, he wanted to launch the child scripts with the same interpreter that was used to launch the parent script.

PHP removes itself from the $argv variable. I also dug around in the $_SERVER and $_ENV variables to no avail. Along the way I learned a new function: get_defined_constants. What I did not find, however, was an easy solution.

Under Linux you could probably play around with the /proc/self pseudo-filesystem; however, this was under Windows. Finally I came up with the following (somewhat) convoluted solution:

Using the win32ps extension, I list the information about all the processes, then use getmypid to find the PID of the current process and look it up in the list.
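For comparison, this is trivial in Python, which exposes the interpreter path directly; the /proc/self trick mentioned above also works there on Linux:

```python
import os
import sys

# The interpreter path is available out of the box:
print(sys.executable)

# The /proc/self approach (Linux only):
if os.path.exists("/proc/self/exe"):
    print(os.readlink("/proc/self/exe"))
```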

Why isn't my GIN index being used in Postgres after I install the _int contrib package?


While working with some GIN indexes (to speed up the && and <@ / @> operators), I was very surprised to see that after the installation of the intarray (_int) contrib module the query planner stopped using the index. After some poking around, it came to me:

The contrib module defines its own operators for arrays, operators which are different from the ones used by GIN to build the indexes. My quick and dirty solution was to drop the functions which implement those operators with CASCADE, which results in the operators themselves being dropped too:

DROP FUNCTION _int_overlap(integer[], integer[]) CASCADE;
DROP FUNCTION _int_contained(integer[], integer[]) CASCADE;
DROP FUNCTION _int_contains(integer[], integer[]) CASCADE;

This drops the functions which implement the aforementioned three operators, and now everything works again. A word of caution: there are probably less destructive ways to resolve this problem (possibly recreating the index with another operator class or something similar); however, this worked for me.

Postgres Rocks!

It really does!

The hard edges of Python


I've been playing around with Python (mostly because pefile is written in it) and got very annoyed with the whole whitespace-as-control-structure idea. In theory it all sounds great: you write beautiful code and it just works. In practice, however, I find this approach lacking in at least two ways:

  • When moving code around (inside the same file or between files), I often got into a situation where the indentation looked alright visually, but the interpreter complained. (Then again, I guess it could have been worse: the interpreter could have silently accepted the line(s) but attached them to the wrong block.)
  • When I need to step back (decrease the indentation instead of increasing it), it seems much harder to identify the level of indentation needed. My theory is that while you usually increase the indentation level by one step (which is easy to follow), it is not uncommon to decrease it by several levels at once (which is much harder to follow). In C-like languages I find that the following method works reasonably well: when I start a block, I immediately place the ending marker (the closing bracket) at the corresponding position. The bracket, combined with editor support for highlighting bracket pairs, then gives a very good indication of my position.

Some other (mild) annoyances:

Python seems to have this idea of interpreting everything as late as possible. Again, this sounds nice, but when I wait 30 minutes just to find out that I forgot to import some module and a function is missing, it makes me ask: where is my Perl strict mode?

This issue also seems to be related to the dynamism: you can't use a function until you have defined it (which sounds OK, but here is the kicker) even if it is in the same file! OK, Pascal was great and I loved Delphi, but grow up already. The parser went through the file, it knows that the function exists, now let me use it!

The Python debugger doesn't have a command to inspect the contents of a class (something like 'x' in the Perl debugger). You have p (for print) and pp (for pretty-print), but both of those print out something along the lines of class F at 0xblahblah. So here are some things I found useful:

  • Most of the sites have a very weird attitude: they suppose that you would want to add code to your source in order to debug it (!??). If I have to insert code in my file, I'll just do a bunch of print statements and be done with it. Fortunately the documentation mentions (although very briefly) that you can debug a script by running it as follows (substitute your script's name):
    python -m pdb yourscript.py
  • You can find the debugger commands also in the documentation. The one I found very useful was the alias example:
    alias pi for k in %1.__dict__.keys(): print "%1.",k,"=",%1.__dict__[k]
    which creates a new command named pi which you can use to inspect class instances (see my earlier complaint with regards to p and pp)
  • Although the pi command/alias is very useful, it can screw up the terminal badly if the class contains variables with binary data. In that case you are better off printing only the key names.

Update: before I forget - the implementation of pack/unpack was also very annoying. Why reinvent the wheel when there are very good implementations already? And why not include an "arbitrarily many" modifier? Why do I have to use a custom function (which must be declared beforehand)?
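To make the complaint concrete: where Perl's pack("N*", @values) takes arbitrarily many items, with Python's struct module the count has to be spliced into the format string by hand:

```python
import struct

values = [1, 2, 3, 4]
# No "*" modifier: build the format string with the element count in it.
fmt = "!%dI" % len(values)            # "!4I" - network order, four uint32s
packed = struct.pack(fmt, *values)

# Unpacking needs the count recomputed from the byte length:
unpacked = struct.unpack("!%dI" % (len(packed) // 4), packed)
print(list(unpacked))  # → [1, 2, 3, 4]
```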

Monday, June 02, 2008

How can you find out if a site is affected by ThePlanet outage?


If you haven't heard: The Planet, a big, big hosting provider from the USA, had an unfortunate incident at one of their hosting facilities (involving fire, explosions, etc). Fortunately it seems that no actual computing gear was damaged; however, it will be some time until they are back 100% online.

In the meantime, if you are wondering whether the connection timeouts to your favorite site are related to this, head over to DomainTools and type in the domain in question. Scroll down to Server Data and if at IP Location you see Texas - Dallas - Internet Services Inc, you have your answer. This is the situation for example with

If you want to do this manually, do an nslookup on the name and check if the IP is in the - range.
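The manual check is easy to script as well; here is a small sketch with Python's ipaddress module (the CIDR block below is only a placeholder, since the exact range is not given above - substitute the provider's real one):

```python
import ipaddress

# Placeholder network - replace with the hosting provider's actual range.
SUSPECT_RANGE = ipaddress.ip_network("74.52.0.0/16")

def hosted_in_range(ip):
    # True if the resolved IP falls inside the suspect block
    return ipaddress.ip_address(ip) in SUSPECT_RANGE

print(hosted_in_range("74.52.1.1"))  # → True
print(hosted_in_range("8.8.8.8"))    # → False
```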

PS. I'm not affiliated with domaintools, I'm just a happy customer.