If you have a file that ends in a known extension like list.txt
or mountain.jpg
then you can guess what’s inside it. But how do you actually know? After all, you could mv list.txt kitten.png
and suddenly your plain text file is pretending to be an image of an adorable little cat.
If the contents of the file are text then the easiest way to look inside is the head
command:
> $ head kitten.png
Go to the store and buy:
* Overpriced avocados
* Organic lawn furniture
* Seven gallons of 1% milk
head
shows you the first 10 (by default) lines of any file. And as long as it’s a text file with newline characters signifying line breaks then this shows you something readable. But what if it’s actually an image file?
> $ head actually_a_kitten.png
âPNG
IæãR
+iCCPICC Profile(œùíw\ì˚æŸÉïÜå∞˜^Dˆô≤D!$1$
≈Öà
T.ê¢à´eHùà(HQk±ä®®4ëœßıˆˆ~û=ê|∏º
⎼▒☃┌⎽ └▒⎽├e⎼ $
Oh lovely. That last line there (ending in a $
) is actually the prompt for your next command. Some of the bytes we just sent to the screen happened to be terminal control codes, and one of them switched your terminal to a different character set. No fun.
A safer way to do this would be with the less
command, which opens a new full-terminal buffer and safely returns you to your (working) prompt when you’re done. But even then you need to parse this binary with your eyes and hope that the âPNG
is, in fact, a sign that this is a valid PNG file.
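If you only want a quick look at the first few bytes without risking your terminal, one option is to run them through od, which prints non-printable bytes as escapes or octal numbers instead of sending them raw to the screen. This is a minimal sketch; the peek function name and the 16-byte default are my own choices for illustration, not anything standard.

```shell
# peek FILE [COUNT]: show the first COUNT bytes (default 16) of FILE.
# od -c prints printable characters as-is and everything else as an
# escape (\r, \n) or a three-digit octal value, so no raw control
# bytes ever reach the terminal.
peek() {
  head -c "${2:-16}" "$1" | od -c
}

# Demo on a throwaway file holding just the 8-byte PNG signature.
demo=$(mktemp)
printf '\211PNG\r\n\032\n' > "$demo"
peek "$demo" 8
rm -f "$demo"
```

The letters P, N and G show up in the dump as ordinary characters, with the surrounding non-text bytes rendered harmlessly around them.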
The file command
Luckily, any Unixy OS has the file
utility installed. This is a program that will try to guess the type of a file based on examining details of the contents.
> $ file actually_a_kitten.png
actually_a_kitten.png: PNG image data, 8 x 8, 8-bit/color RGBA, non-interlaced
Not only do we know for sure that it’s a PNG but we know the resolution and the color depth. Handy.
How does file work?
This utility first tries to work out whether its argument is a file that’s safe to print on the screen (“text”), an executable, or some other kind of data.
The procedure for determining this is to first look at the file on disk and see if it’s special in some way, empty, or if there’s some other significant issue with it that makes examining the contents unnecessary. It asks the filesystem for the same information you get from the stat
command:
> $ stat kitten.png
File: 'kitten.png'
Size: 3084 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 533679 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1193/jackdanger) Gid: ( 1193/jackdanger)
Access: 2015-06-28 07:35:35.864034633 +0000
Modify: 2015-06-28 07:35:35.864034633 +0000
Change: 2015-06-28 07:35:35.864034633 +0000
It’s a file (not a symlink or a directory) and it takes up 8 blocks of space, using a total of 3084 bytes. Alright then, let’s read it and figure out what’s inside.
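To make that first step concrete, here is a rough shell sketch of the same kind of pre-check using the shell's built-in file tests. It is not how file is actually implemented (the real thing is a C program working from the stat system call); the classify function and its messages are made up for illustration.

```shell
# classify PATH: say what kind of filesystem object PATH is, before
# anyone bothers to open it. The -L test has to come first because
# the other tests follow symlinks.
classify() {
  if [ -L "$1" ]; then
    printf '%s: symbolic link to %s\n' "$1" "$(readlink "$1")"
  elif [ ! -e "$1" ]; then
    printf '%s: cannot open (no such file or directory)\n' "$1"
  elif [ -d "$1" ]; then
    printf '%s: directory\n' "$1"
  elif [ -f "$1" ] && [ ! -s "$1" ]; then
    printf '%s: empty\n' "$1"
  elif [ -f "$1" ]; then
    printf '%s: regular file, contents worth inspecting\n' "$1"
  else
    printf '%s: special file\n' "$1"
  fi
}

# Usage: classify kitten.png
```

Only the "regular file" branch needs to go on and read any bytes; every other answer is already final.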
Finding the magic on your computer
When I first learned how this worked the person showing me the ropes asked “Do you want to see where the magic is?” and I was utterly confused. Then he ran man file
and pointed out that the process of determining what’s inside a file from individual heuristics is, in fact, known as magic.
Or, more specifically:
> $ man file | tr ' ' "\n" | grep /magic # this tr will swap spaces for newlines
/etc/magic
/etc/magic,
/usr/share/misc/magic
/usr/share/misc/magic.mgc
/usr/share/misc/magic.mgc,
Before we look inside one of these files by hand let’s practice using file
itself:
> $ file /usr/share/misc/magic.mgc
/usr/share/misc/magic.mgc: symbolic link to `../file/magic.mgc'
Oh, that didn’t even inspect the contents of the file – it stopped at the stat
step. What did it see there?
> $ stat /usr/share/misc/magic.mgc
File: ‘/usr/share/misc/magic.mgc’ -> ‘../file/magic.mgc’
Size: 17 Blocks: 0 IO Block: 4096 symbolic link
Device: ca01h/51713d Inode: 2023 Links: 1
Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2015-06-28 08:12:44.599057183 +0000
Modify: 2014-07-10 17:48:04.000000000 +0000
Change: 2014-12-15 08:35:19.791057183 +0000
Birth: -
Yeah, I see the words ‘symbolic link’ there for sure. Now let’s follow the symlink and see what’s really inside it:
> $ readlink -f /usr/share/misc/magic.mgc
/usr/share/file/magic.mgc
> $ file $(readlink -f !$) # This is identical to "file /usr/share/file/magic.mgc"
/usr/share/file/magic.mgc: magic binary file for file(1) cmd (version 8) (little endian)
This is great. It knows this file so well that it knows it’s its own magic file-detection file. So what’s inside it?
> $ less /usr/share/file/magic.mgc
BINARY DATA STUFF
Yech. How about one of the others?
> $ head -n 15 /usr/share/file/magic
##------------------------------------------------------------------------------
## $File: acorn,v 1.5 2009/09/19 16:28:07 christos Exp $
## acorn: file(1) magic for files found on Acorn systems
##
## RISC OS Chunk File Format
## From RISC OS Programmer's Reference Manual, Appendix D
## We guess the file type from the type of the first chunk.
0 lelong 0xc3cbc6c5 RISC OS Chunk data
>12 string OBJ_ \b, AOF object
>12 string LIB_ \b, ALF library
# RISC OS AIF, contains "SWI OS_Exit" at offset 16.
16 lelong 0xef000011 RISC OS AIF executable
Now we’re talking. This format is a series of specific checks for byte values in a file. The first couple examples are kinda hard to make sense of so I’ll skip down to a simpler one:
0 string \x89PNG\x0d\x0a\x1a\x0a PNG image data
!:mime image/png
>16 belong x \b, %ld x
>20 belong x %ld,
>24 byte x %d-bit
>25 byte 0 grayscale,
>25 byte 2 \b/color RGB,
>25 byte 3 colormap,
>25 byte 4 gray+alpha,
>25 byte 6 \b/color RGBA,
#>26 byte 0 deflate/32K,
>28 byte 0 non-interlaced
>28 byte 1 interlaced
This means “If the bytes starting at position 0 are as follows then it’s ‘PNG image data’”. Each of the lines starting with >
can refine the result, adding more data to it. Apparently it’s part of the PNG format that if the byte at offset 25 is a 4 then this is a grayscale image with an alpha channel. Cool.
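The PNG rule above is simple enough to replay by hand. The sketch below reads the same offsets (0, 16, 20, 24, 25 and 28) with dd and od and glues the pieces together the way the rule's messages suggest. The function names byte_at, belong_at and describe_png are hypothetical helpers of mine, and this only imitates that one rule, not the real magic engine.

```shell
# byte_at FILE OFFSET: print one byte as an unsigned decimal number.
byte_at() {
  dd if="$1" bs=1 skip="$2" count=1 2>/dev/null | od -An -tu1 | tr -d ' \n'
}

# belong_at FILE OFFSET: print a 4-byte big-endian integer ("belong").
belong_at() {
  # Replace the positional parameters with the four byte values.
  set -- $(dd if="$1" bs=1 skip="$2" count=4 2>/dev/null | od -An -tu1)
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# describe_png FILE: apply the PNG rule's checks in order.
describe_png() {
  sig=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "89504e470d0a1a0a" ] || { echo "$1: data"; return 1; }
  case $(byte_at "$1" 25) in
    0) color=" grayscale," ;;
    2) color="/color RGB," ;;
    3) color=" colormap," ;;
    4) color=" gray+alpha," ;;
    6) color="/color RGBA," ;;
    *) color="" ;;
  esac
  case $(byte_at "$1" 28) in
    0) lace=" non-interlaced" ;;
    1) lace=" interlaced" ;;
    *) lace="" ;;
  esac
  echo "$1: PNG image data, $(belong_at "$1" 16) x $(belong_at "$1" 20), $(byte_at "$1" 24)-bit$color$lace"
}

# Usage: describe_png actually_a_kitten.png
```

Notice that nothing past offset 28 is ever read, so a file only needs a plausible first 29 bytes to get the full description.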
Hey, we can detect Photoshop files, too!
0 string 8BPS Adobe Photoshop Image
!:mime image/vnd.adobe.photoshop
>4 beshort 2 (PSB)
>18 belong x \b, %d x
>14 belong x %d,
>24 beshort 0 bitmap
>24 beshort 1 grayscale
>>12 beshort 2 with alpha
>24 beshort 2 indexed
>24 beshort 3 RGB
>>12 beshort 4 \bA
>24 beshort 4 CMYK
>>12 beshort 5 \bA
>24 beshort 7 multichannel
>24 beshort 8 duotone
>24 beshort 9 lab
>12 beshort > 1
>>12 beshort x \b, %dx
>12 beshort 1 \b,
>22 beshort x %d-bit channel
>12 beshort > 1 \bs
There are some 500 known formats in here (egrep -c '^# .* file' /usr/share/misc/magic
) detecting all kinds of things. How correct is this set of heuristics, though? Since it’s just looking for key data and ignoring most other bytes is it possible to fool it?
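Before reaching for random data, it’s worth seeing how little it takes to pass a signature check on purpose. Going by the Photoshop rule above, the top-level test only looks at the first four bytes, so this sketch writes "8BPS" followed by obvious junk and checks the prefix. The starts_with helper is hypothetical, and whether your installed file goes on to call the result a Photoshop image depends on the rules in its own magic database.

```shell
# starts_with FILE HEX: succeed if FILE begins with the bytes written
# as lowercase hex in HEX (two hex digits per byte).
starts_with() {
  n=$(( ${#2} / 2 ))
  [ "$(head -c "$n" "$1" | od -An -tx1 | tr -d ' \n')" = "$2" ]
}

fake=$(mktemp)
# "8BPS" is 38 42 50 53 in hex; everything after it is filler.
printf '8BPS this is definitely not a Photoshop file\n' > "$fake"
if starts_with "$fake" 38425053; then
  echo "passes the 4-byte Photoshop signature check"
fi
rm -f "$fake"
```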
To generate fake data we’re going to read from /dev/urandom
512 bytes at a time and, instead of letting it land on your screen, pipe it straight to file
as a special (-s
) file:
> $ dd if=/dev/urandom bs=512 | file -s -
/dev/stdin: data
Okay, not much. It’s “data”. Great. But if we try that in a bash loop…
> $ for _ in {0..20}; do dd if=/dev/urandom bs=512 | file -s -; done
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: MPEG ADTS, layer III, v1, 40 kbps, 48 kHz, Monaural
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
Wait. What? What’s an MPEG ADTS file? Doesn’t matter, we generated data that fit the profile of it with just a few bytes in the right place.
And if we run this in a loop and ignore the misses we see way more interesting stuff.
> $ while :; do dd if=/dev/urandom bs=512 | file -s - | grep -v ': data'; done
/dev/stdin: Sendmail frozen configuration - version =\236\207\021\033\316h\312d\236i\326\033\236\350C\234\373
/dev/stdin: 8086 relocatable (Microsoft)
/dev/stdin: hp200 (68010) BSD
/dev/stdin: Sendmail frozen configuration - version \332\317\022?\310]\272s+^\021\032\212\234Q\341;\326\320
/dev/stdin: SysEx File - Fujitsu
/dev/stdin: DBase 3 data file
/dev/stdin: DBase 3 data file (899713281 records)
/dev/stdin: MPEG-4 LOAS, single stream
/dev/stdin: DBase 3 data file with memo(s) (1054998150 records)
/dev/stdin: Dyalog APL version 219 .105
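If you’d rather have a summary than an endless scroll, a bounded version of the same experiment can count how often each guess comes up. This is a sketch that assumes the file utility is installed and supports -b (brief output, no "/dev/stdin:" prefix); random_block and tally_guesses are names I made up, and the results will differ on every run and with every version of the magic database.

```shell
# random_block: emit exactly 512 bytes of random data on stdout.
random_block() {
  head -c 512 /dev/urandom
}

# tally_guesses N: feed N random blocks to file, one at a time, and
# print each distinct answer with the number of times it appeared,
# most common first.
tally_guesses() {
  i=0
  while [ "$i" -lt "$1" ]; do
    random_block | file -b -s -
    i=$((i + 1))
  done | sort | uniq -c | sort -rn
}

# Usage: tally_guesses 1000
```

Expect "data" to dominate the list, with the false positives trailing behind it in ones and twos.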
This really is magic, then, in the sense that this isn’t predictable and isn’t super reliable. But it’s still more effective than trying to cat it to your screen and look at the bytes yourself.