Do you have a mystery file? The Linux file command will quickly tell you what type of file it is. If it is a binary file, though, there is even more you can learn about it. file has a number of stablemates that help with the analysis. We’ll show you how to use some of these tools.
Identifying file types
Files usually have characteristics that enable software packages to recognize the type of file and the data it contains. There would be no point in opening a PNG file in an MP3 player, so it is both useful and pragmatic for a file to carry some form of ID with it.
This can be a few signature bytes at the very beginning of the file, which let the file state its format and content explicitly. Sometimes the file type is instead inferred from a distinctive aspect of the internal organization of the data itself, known as the file architecture.
Some operating systems, such as Windows, rely entirely on the extension of a file. You can call it gullible or trusting, but Windows assumes that any file with the DOCX extension really is a DOCX word processing file. Linux is not like that, as you are about to see. It wants evidence, and it looks inside the file to find it.
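To make the idea of signature bytes concrete, here is a minimal Python sketch of content-based detection. The four-entry table and the function names sniff and sniff_file are our own illustration, not how file is implemented; the real tool consults a much larger database of patterns.

```python
# Illustrative signature table (an assumption for this sketch, not file's database).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"\x7fELF", "ELF binary"),
    (b"PK\x03\x04", "ZIP container (ODT and DOCX files are ZIP archives)"),
]


def sniff(data: bytes) -> str:
    """Guess a type from the leading bytes of the content alone."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Classify a file by its first bytes; the file name plays no part."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))
```

Because only the content is inspected, renaming a file has no effect on the result.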
Using the file command
We have a collection of different file types in our current directory. They are a mix of document, source code, executable, and text files.
The ls command shows us what is in the directory, and the -hl (human-readable sizes, long listing) option shows us the size of each file:
ls -hl
Let’s try file on a couple of these and see what we get:
The three file formats are recognized correctly. Where possible, file gives us a little more information. The PDF file is reported to be in the version 1.5 format.
Even if we rename the ODT file to have an extension with the arbitrary value of XYZ, the file is still correctly identified, both within the Files file browser and on the command line with file. The Files file browser displays the correct icon. On the command line, file ignores the extension and looks inside the file to determine its type:
Using file on media, such as image and music files, usually yields information about the format, encoding, resolution, and so on:
Interestingly, even with plain text files, file does not judge the file by its extension. For example, if you have a file with the “.c” extension that contains ordinary plain text rather than source code, file does not mistake it for a genuine C source code file:
file correctly identifies the header file (“.h”) as part of a C source code collection of files and knows that the makefile is a script.
Using file with binary files
Binary files are more of a “black box” than others. Image files can be displayed, sound files can be played and document files can be opened with the corresponding software package. However, binary files are more of a challenge.
For example, the files “hello” and “wd” are binary executables. They are programs. The “wd.o” file is an object file. When source code is compiled by a compiler, one or more object files are created. These contain the machine code that the computer will eventually execute when the finished program runs, along with information for the linker. The linker checks each object file for function calls to libraries and links in any libraries that the program uses. The result of this process is an executable file.
The watch.exe file is a binary executable file that has been cross-compiled to run on Windows:
Taking the last one first, file tells us that watch.exe is a PE32+ executable console program for the x86 family of processors on Microsoft Windows. PE stands for Portable Executable, a format that has 32-bit and 64-bit versions. PE32 is the 32-bit version and PE32+ is the 64-bit version.
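The PE32 versus PE32+ distinction can be read straight from the file. The sketch below is a simplified illustration (the function name pe_flavor is ours): it follows the 4-byte offset stored at position 0x3C to the “PE” signature, skips the 20-byte COFF header, and reads the 2-byte magic of the optional header, where 0x10B marks PE32 and 0x20B marks PE32+.

```python
import struct

PE_MAGIC = {0x10B: "PE32", 0x20B: "PE32+"}


def pe_flavor(data: bytes) -> str:
    """Classify a Windows executable image as PE32 or PE32+ from its headers."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return "not a PE file"
    # The little-endian 32-bit field at 0x3C gives the offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return "not a PE file"
    # 4 signature bytes + 20 COFF header bytes precede the optional header magic.
    magic_offset = pe_offset + 24
    if len(data) < magic_offset + 2:
        return "truncated PE file"
    (magic,) = struct.unpack_from("<H", data, magic_offset)
    return PE_MAGIC.get(magic, "unknown PE variant")
```

Passing it the bytes of a file, for instance pe_flavor(open("watch.exe", "rb").read()), returns one of the labels above.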
The other three files are all identified as Executable and Linkable Format (ELF) files. This is a standard for executable files and for shared object files, such as libraries. We’ll look at the ELF header format shortly.
What you might notice is that the two executables (“wd” and “hello”) are identified as LSB shared objects, and the object file “wd.o” is identified as LSB relocatable. In file output, LSB means least significant byte first, that is, little-endian byte order. The word “executable” is conspicuous by its absence.
Object files are relocatable, which means that the code they contain can be loaded into memory at any location. The executables are listed as shared objects because the linker generated them from the object files in a way that preserves this ability.
This enables the Address Space Layout Randomization (ASLR) system to load the executable files into memory at addresses of its choosing. Standard executable files have a load address encoded in their headers that dictates where they are loaded into memory.
ASLR is a security technique. Loading executable files into memory at predictable addresses makes them vulnerable to attack, because attackers always know their entry points and the locations of their functions. Position-independent executables (PIE) loaded at a random address overcome this vulnerability.
If we compile our program with the gcc compiler and provide the -no-pie option, we generate a conventional executable file. The -o (output file) option lets us specify a name for our executable file:
gcc -o hello -no-pie hello.c
We’ll use file on the new executable and see what has changed:
The size of the executable file is as before (17 KB):
ls -hl hello
The binary is now identified as a standard executable. We are doing this for demonstration purposes only. If you compile applications this way, you lose all of the benefits of ASLR.
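The difference that file reports comes down to a single 16-bit type field stored immediately after the 16 identification bytes at the start of an ELF file. As a rough sketch (the function name elf_type is ours), the following Python reads that field, where 1 means relocatable, 2 means a conventional executable, and 3 means a shared object, which is also how a position-independent executable is marked.

```python
import struct

ELF_TYPES = {
    1: "relocatable (ET_REL)",
    2: "executable (ET_EXEC)",
    3: "shared object (ET_DYN)",
}


def elf_type(data: bytes) -> str:
    """Report the ELF type field from the first bytes of a file."""
    if len(data) < 18 or data[:4] != b"\x7fELF":
        return "not an ELF file"
    # Byte 5 of the identification block: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if data[5] == 1 else ">"
    (e_type,) = struct.unpack_from(byte_order + "H", data, 16)
    return ELF_TYPES.get(e_type, "other")
```

Reading the first 18 bytes of a binary is enough, for example elf_type(open("hello", "rb").read(18)).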
Why is an executable file so big?
The example hello program is 17 KB in size, so it can hardly be called large, but then everything is relative. The source code is 120 bytes:
What is bulking out the binary if all it does is print one string to the terminal window? We know there is an ELF header, but that is only 64 bytes long for a 64-bit binary. Clearly, it must be something else:
ls -hl hello
Let’s scan the binary with the strings command as an easy first step to find out what’s inside it. We’ll pipe it into less:
strings hello | less
There are many strings in the binary besides the “Hello, Geek world!” from our source code. Most of them are labels for regions within the binary, along with the names and linking information of shared objects. These include the libraries, and the functions within those libraries, that the binary depends on.
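The core idea behind strings is simple enough to sketch in a few lines of Python: scan the raw bytes for runs of printable ASCII characters of at least a minimum length. This is an approximation under our own assumptions (the function name ascii_strings and the exact character range are ours), not a reimplementation of the real tool.

```python
import re


def ascii_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII (space through tilde) of min_len or more."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]
```

Calling ascii_strings(open("hello", "rb").read()) on a binary gives a list that can be searched or filtered like any other Python list.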
The ldd command shows us the shared object dependencies of a binary file:
The output contains three entries, two of which contain a directory path (the first does not):
- linux-vdso.so: The Virtual Dynamic Shared Object (VDSO) is a kernel mechanism that lets a user-space binary call a number of kernel-space routines. This avoids the overhead of a context switch from user mode to kernel mode. VDSO shared objects adhere to the Executable and Linkable Format (ELF), so they can be dynamically linked to the binary file at run time. The VDSO is allocated dynamically and takes advantage of ASLR. The VDSO capability is provided by the standard GNU C Library if the kernel supports the ASLR scheme.
- libc.so.6: The GNU C Library shared object.
- /lib64/ld-linux-x86-64.so.2: This is the dynamic linker that the binary wants to use. The dynamic linker interrogates the binary to find out what dependencies it has, and it loads those shared objects into memory. It prepares the binary to run and to find and access the dependencies in memory. Then it launches the program.
The ELF header
We can examine and decode the ELF header using the readelf utility and the -h (file header) option:
readelf -h hello
The header is interpreted for us.
The first byte of all ELF binary files is set to the hexadecimal value 0x7F. The next three bytes are set to 0x45, 0x4C and 0x46. The first byte is a flag that identifies the file as an ELF binary file. To make this crystal clear, the next three bytes spell out “ELF” in ASCII:
- Class: Indicates whether the binary is a 32- or 64-bit executable (1 = 32, 2 = 64).
- Data: Indicates the endianness in use. Endianness defines the way in which multibyte numbers are stored. With big-endian encoding, a number is stored with its most significant byte first. With little-endian encoding, the number is stored with its least significant byte first.
- Version: The version of ELF (currently it is 1).
- OS/ABI: Represents the type of application binary interface in use. This defines the interface between two binary modules, for example a program and a shared library.
- ABI version: The version of the ABI.
- Type: The type of the ELF binary. The common values are ET_REL for a relocatable resource (such as an object file), ET_EXEC for an executable file compiled with the -no-pie option, and ET_DYN for an ASLR-capable executable file.
- Machine: The instruction set architecture. This indicates the target platform for which the binary file was created.
- Version: Always set to 1 for this version of ELF.
- Entry point address: The memory address within the binary file at which execution begins.
The other entries are sizes and numbers of regions and sections within the binary so that their positions can be calculated.
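For readers who want to see where these fields live, here is a hedged Python sketch that decodes the leading part of an ELF header by hand. The function name parse_elf_header and the returned dictionary keys are our own choices, and only the fields described above are handled; readelf decodes far more.

```python
import struct

CLASS_NAMES = {1: "ELF32", 2: "ELF64"}
DATA_NAMES = {1: "little endian", 2: "big endian"}
TYPE_NAMES = {1: "REL", 2: "EXEC", 3: "DYN"}


def parse_elf_header(data: bytes) -> dict:
    """Decode the identification bytes and the first header fields of an ELF file."""
    if data[:4] != b"\x7fELF":
        raise ValueError("missing ELF magic bytes")
    # Bytes 4 to 8: class, data encoding, version, OS/ABI, ABI version.
    ei_class, ei_data, ei_version, ei_osabi, ei_abiversion = data[4:9]
    order = "<" if ei_data == 1 else ">"
    # Two 16-bit fields (type, machine) and one 32-bit version follow at offset 16.
    e_type, e_machine, e_version = struct.unpack_from(order + "HHI", data, 16)
    # The entry point is 8 bytes wide in ELF64 and 4 bytes wide in ELF32.
    entry_format = "Q" if ei_class == 2 else "I"
    (e_entry,) = struct.unpack_from(order + entry_format, data, 24)
    return {
        "class": CLASS_NAMES.get(ei_class, "unknown"),
        "data": DATA_NAMES.get(ei_data, "unknown"),
        "version": ei_version,
        "osabi": ei_osabi,
        "abi_version": ei_abiversion,
        "type": TYPE_NAMES.get(e_type, "other"),
        "machine": e_machine,
        "entry": e_entry,
    }
```

The data encoding byte is read first because it decides how every later multibyte field must be interpreted.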
A quick look at the first eight bytes of the binary file with hexdump reveals the signature byte and the “ELF” string in the first four bytes of the file. The -C (canonical) option gives us the ASCII representation of the bytes alongside their hexadecimal values, and the -n (number) option lets us specify how many bytes we want to see:
hexdump -C -n 8 hello
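The canonical layout is easy to imitate. The short Python sketch below (the function name hexdump_line is ours) formats up to 16 bytes as an offset, the hexadecimal values, and an ASCII column in which non-printable bytes appear as dots. It approximates the look of hexdump -C and is not guaranteed to match its spacing exactly.

```python
def hexdump_line(data: bytes, offset: int = 0) -> str:
    """Format up to 16 bytes as offset, hex values, and an ASCII column."""
    chunk = data[:16]
    hex_part = " ".join(f"{byte:02x}" for byte in chunk)
    # Bytes outside the printable ASCII range are shown as dots.
    text = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
    return f"{offset:08x}  {hex_part:<47}  |{text}|"
```

For the first eight bytes of a 64-bit little-endian ELF file, the hexadecimal column begins with 7f 45 4c 46 and the ASCII column begins with .ELF.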
objdump and the Granular View
If you want to see the fine detail, you can use the objdump command with the -d (disassemble) option:
objdump -d hello | less
This disassembles the executable machine code and displays it as hexadecimal bytes alongside the assembly language equivalent. The address of the first byte in each line is shown on the far left.
This is only useful if you can read assembly language or are curious about what’s going on behind the curtain. There’s a lot of output, so we piped it into less.
Compile and link
There are many ways to compile a binary file. For example, the developer decides whether to include debugging information. The way in which the binary file is linked also affects its content and size. If the binary references shared objects as external dependencies, it is smaller than one to which the dependencies are statically linked.
Most developers are already familiar with the commands discussed here. For others, however, they offer some easy ways to browse around and see what’s in the binary black box.