vpalos.com // milk, cookies, segfaults…

Bash URI parser using SED

by Valeriu Paloş on November 16, 2009

Warning! This version is now obsolete!
Check out the new and improved version (using only Bash built-ins) here!

Here is a command-line (Bash) script that uses sed to split the segments of a URI into usable variables. It also validates the given URI: a malformed string produces the text “ERROR”, which can be handled accordingly:

# Assembling a sample URI (including an injection attack)
uri_1='http://user:pass@www.example.com:19741/dir1/dir2/file.php'
uri_2='?param=some_value&array[0]=123&param2=`cat /etc/passwd`'
uri_3='#bottom-left'
uri="$uri_1$uri_2$uri_3"

# Parse URI
op=`echo "$uri" | sed -nrf "uri.sed"`

# Handle invalid URI
[[ $op == 'ERROR' ]] && { echo "Invalid URI!"; exit 1; }

# Execute assignments
eval "$op"

# ...work with URI components...
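To see the idea in miniature (sed prints Bash assignments, eval runs them), here is a small sketch. It is not the parser from this post: the single rule and the `demo_*` names are made up for illustration, and it only splits the scheme and the host.

```shell
#!/usr/bin/env bash
# Hypothetical one-rule stand-in for the real parser: scheme and host only.
demo_uri='http://www.example.com/dir1/file.php'

# On a match, sed rewrites the line into Bash assignments and the bare 't'
# jumps to the end of the script; otherwise the last rule emits ERROR.
demo_op=$(printf '%s\n' "$demo_uri" | sed -E \
  -e 's|^([a-z]+)://([a-z0-9.-]+)(/.*)?$|demo_scheme="\1"; demo_host="\2"|' \
  -e 't' \
  -e 's/.*/ERROR/')

if [[ $demo_op == 'ERROR' ]]; then
    echo "Invalid URI!"
else
    eval "$demo_op"   # creates demo_scheme and demo_host
    echo "scheme=$demo_scheme host=$demo_host"
fi
```

Only characters from the two restricted classes ever reach eval in this sketch, which is what keeps it harmless.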

Notice the "uri.sed" file passed to sed?
It does the actual URI parsing: it contains the regular-expression rules that turn the given URI into Bash code which, when executed, creates the final variables to play with:

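# initialize: strip line breaks, percent-encode backticks and double quotes,
# then reset the substitution flag so each later "T error" tests only its own rule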
s/[\r\n]+//g; s/`/%60/g; s/"/%22/g; T begin; :begin

# scheme, address, path, query, fragment
s/^(([a-z]+):\/\/)?(([^:\/]+(:[^@\/]*)?@)?[^:\/?]+(:[0-9]+)?)(\/[^?]*)?(\?[^#]*)?(#.*)?$/\
uri_scheme="\2"; uri_address="\3"; uri_path="\7"; uri_query="\8"; uri_fragment="\9"/i
T error

# user, pass, host, port
s/uri_address="(([a-z0-9_.+=-]+)(:([^@]*))?@)?([a-z0-9.-]*)(:([0-9]*))?"/\0; \
uri_user="\2"; uri_pass="\4"; uri_host="\5"; uri_port="\7"/i; T error

# path parts
h; s/.*uri_path="([^"]+)".*/uri_parts=(); \1/
s/\/+([^/]+)/uri_parts[$[${#uri_parts[*]}]]="\1"; /ig; x; G

# query args
h; s/.*uri_query="([^"]+)".*/uri_args=(); \1/
s/[?&]+([^= ]+)(=([^&]*))?/uri_args[$[${#uri_args[*]}]]="\1"; uri_arg_\1="\3"; /ig
x; G

# print
s/\n\ +//g; s/\n//g; p; q

# failure
:error; c ERROR

After this code runs successfully, the following variables exist in the running shell:

uri_scheme="http"
uri_address="user:pass@www.example.com:19741"
uri_user="user"
uri_pass="pass"
uri_host="www.example.com"
uri_port="19741"

uri_path="/dir1/dir2/file.php"

uri_parts[0]="dir1"
uri_parts[1]="dir2"
uri_parts[2]="file.php"

uri_query="?param=some_value&array[0]=123&param2=%60cat /etc/passwd%60"

uri_args[0]="param"
uri_args[1]="array[0]"
uri_args[2]="param2"

uri_arg_param="some_value"
uri_arg_array[0]="123"
uri_arg_param2="%60cat /etc/passwd%60"

uri_fragment="#bottom-left"
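Once the variables exist, working with them is ordinary Bash. The sketch below walks the path parts and looks up each query argument by name. The sample values are typed in by hand so the snippet runs on its own (in real use they would come from the eval), and "other_value" is a made-up value.

```shell
#!/usr/bin/env bash
# Sample values entered by hand (assumption); normally set by: eval "$op"
uri_parts=("dir1" "dir2" "file.php")
uri_args=("param" "param2")
uri_arg_param="some_value"
uri_arg_param2="other_value"   # hypothetical value for illustration

# Walk the path segments in order.
for i in "${!uri_parts[@]}"; do
    echo "part $i: ${uri_parts[$i]}"
done

# Look up each argument's value through Bash indirect expansion.
for name in "${uri_args[@]}"; do
    var="uri_arg_$name"
    echo "arg $name = ${!var}"
done
```

This kind of lookup only makes sense for argument names that are valid Bash identifiers, since the name becomes part of a variable name.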

Play around with it a bit and tell me if you find any problems. This is only a first effort and can certainly be improved. Cheers!
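One thing worth checking while experimenting: the initialize rule in uri.sed percent-encodes backticks and double quotes, but the generated assignments are double-quoted, so a `$` in the URI would still be expanded by eval, and a trailing backslash could escape a closing quote. A possible pre-filter is sketched below. The function name `uri_neutralize` is hypothetical, and percent-encoding these two extra characters is my assumption about a reasonable fix, not something the script above does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percent-encode every character that stays special
# inside double quotes, before the string reaches sed and eval.
uri_neutralize() {
    local s=$1
    s=${s//\\/%5C}    # backslash (first, so later steps add no new ones)
    s=${s//\$/%24}    # dollar sign: blocks $var and $(cmd) expansion
    s=${s//\`/%60}    # backtick
    s=${s//\"/%22}    # double quote
    printf '%s\n' "$s"
}

safe_uri=$(uri_neutralize 'http://example.com/?a=$(id)&b=`id`')
echo "$safe_uri"
```

The result can then be piped into sed in place of the raw URI.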

8 thoughts on “Bash URI parser using SED”

  1. Dan Fekete says:

    Wow, thanks for the fast response.

    I checked out your newer version and it looks like a much better solution. So I think I’ll just get to work implementing it into my project
    (http://thefekete.net/gitweb/?p=gitWebTools.git;a=blob;f=publish;h=b5c4f6dedda2ebf2180a889a937e22878c833449).

    I’ll leave any other comments on the new post.

  2. valeriup says:

    @all: Dan’s comment reminded me that I made a big improvement on this parser a while back so I hurried and posted a new article about it. Check it out!

  3. Pingback: vpalos.com » URI parsing using Bash built-in features

  4. valeriup says:

    Hi Dan, thank you very much for your feedback, that’s just what I need to improve on this!

    However, I was not able to replicate the error! Please give me the exact URI string that produces the error so I can do some debugging! :)

  5. Dan Fekete says:

    Here’s my uritest script:

    #!/bin/bash
    
    uri=$1
    
    echo $uri
    
    # Parse URI
    op=`echo "$uri" | sed -nrf "uri.sed"`
    
    # Handle invalid URI
    [[ $op == 'ERROR' ]] && { echo "Invalid URI!"; exit 1; }
    
    # Execute assignments
    eval "$op"
    
    echo $uri_scheme
    echo $uri_address
    echo $uri_user
    echo $uri_password
    echo $uri_host
    echo $uri_port
    echo $uri_path
    
  6. Dan Fekete says:

    Hey, thanks so much for posting this. It’s perfect for my application.

    I’m having a problem, though, with the path parts and query args sections of uri.sed. If I comment them out, 'eval $op' works fine, but if I run it normally I get:

    ./uritest: array assign: line 1: unexpected EOF while looking for matching `"'
    ./uritest: array assign: line 15: syntax error: unexpected end of file

    It’s no problem for me, as all I’m looking to do is parse git, ssh and file URIs (without args) and I don’t need the path parts. But I thought I’d let you know, and I was curious whether this is a problem with my setup or something else.

    My setup:
    Ubuntu Server 9.10 amd64
    Linux helpcomputer 2.6.28-17-generic #58-Ubuntu SMP Tue Dec 1 21:27:25 UTC 2009 x86_64 GNU/Linux
    GNU bash, version 3.2.48(1)-release (x86_64-pc-linux-gnu)
    GNU sed version 4.1.5

    Thanks again for this post, totally awesome.

  7. valeriup says:

    [Update] Moved the sed instructions into a separate file for modularization.

  8. valeriup says:

    [Edit] Changed parsing of the query args to permit parsing of arguments that have no value assigned to them (e.g. …?arg_with_no_value&…)
