Jan Kratochvil
Captive: The first free NTFS read/write filesystem for GNU/Linux

 


Implementation Details

Choice of the Emulation Methods

The intent of the project was to get reliable read-write access to NTFS partitions. There are several possible ways to achieve that:

Virtual Machine Running the Original W32 Subsystem

Create a virtual-hardware PC and run the original W32 binaries, including their boot loader. Disk device access would be passed through as a virtual IDE hard disk drive. The file access API would be implemented either by escaping from the virtual machine through some trapped instruction while using the W32 file access API, or by using the standard W32 SMB (Server Message Block) network access through some virtual network card. The latter, network-based solution is close to what is already possible today: running a full-blown, disk-sharing, real Microsoft Windows NT inside a virtual machine emulator such as VMware.

pros: Full compatibility due to fully native codebase.

cons: Hard to debug, missing documentation of NT booting internals, possible problems caused by the virtual PC hardware differing from what NT expects, requirement of a fully installed Microsoft Windows NT product.

"ntoskrnl.exe" Inside Virtual Address Space

This solution was chosen by the project. The binary filesystem driver and also the ntoskrnl.exe binary file are required. Unfortunately ntoskrnl.exe expects native PC hardware, which is missing when it runs inside a regular UNIX user space process; instructions accessing it must therefore be trapped and emulated or ignored case by case.

Also, the initialization code of ntoskrnl.exe is not executed by this project since it expects to get full PC hardware access privileges. Some data structures therefore never get initialized by it, and accesses to them need to be trapped later at runtime. Some of the missing initializations are solved by wrapping API functions.

pros: Lightweight, easier to debug.

cons: Possible incompatible emulation of ntoskrnl.exe parts, missing documentation needed for the implementation.

Filesystem Driver Inside Virtual Address Space

Unlike the previous method, here we do not use even ntoskrnl.exe, as the complete kernel part of W32 is emulated from the project source files. The cdfs.sys driver was successfully run in this manner in former versions of this project, but the possibility to run without ntoskrnl.exe was dropped: it had no licensing gains (you need the original Microsoft Windows NT files at least for the filesystem driver itself), and emulating undocumented parts that could be reused from the ntoskrnl.exe binary was a pain.

pros: Lightweight, easier to debug.

cons: Possible incompatible emulation of the whole ntoskrnl.exe, its missing documentation.

API Function Implementation Choices

At the initial point of the project development all the API functions were defined as unimplemented, of course. Any call of such an unimplemented function is fatal and results in program termination. When we need to implement a required API function, we have multiple choices: direct pass to the original ntoskrnl.exe, wrap of the original ntoskrnl.exe function, native implementation from ReactOS, native implementation from Wine, or a project-specific native implementation.
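The choices above can be pictured as a per-function lookup table in which every function not listed counts as unimplemented. The following is only a minimal sketch of that idea: the enum, the table, the function names in it and both helper functions are hypothetical and are not taken from the project sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* How one kernel API function is provided: the initial "unimplemented"
 * state plus the five implementation choices. */
enum impl_kind {
    IMPL_UNIMPLEMENTED,   /* any call is fatal */
    IMPL_PASS,            /* direct pass to the original ntoskrnl.exe */
    IMPL_WRAP,            /* wrapper around the original function */
    IMPL_NATIVE_REACTOS,  /* native implementation from ReactOS */
    IMPL_NATIVE_WINE,     /* native implementation from Wine */
    IMPL_NATIVE_PROJECT   /* project-specific native implementation */
};

struct api_entry {
    const char *name;
    enum impl_kind kind;
};

/* Hypothetical table; names and assignments are illustrative only. */
static const struct api_entry api_table[] = {
    { "SketchPassedFunction",  IMPL_PASS },
    { "SketchWrappedFunction", IMPL_WRAP },
    { "SketchNativeFunction",  IMPL_NATIVE_PROJECT },
};

/* Return how 'name' is implemented; anything not listed is unimplemented. */
enum impl_kind api_lookup(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof api_table / sizeof api_table[0]; i++)
        if (strcmp(api_table[i].name, name) == 0)
            return api_table[i].kind;
    return IMPL_UNIMPLEMENTED;
}

/* Terminate the program if an unimplemented function is about to be called. */
void api_require_implemented(const char *name)
{
    if (api_lookup(name) == IMPL_UNIMPLEMENTED) {
        fprintf(stderr, "fatal: call of unimplemented function %s\n", name);
        abort();
    }
}
```

Defaulting to the unimplemented state means that a function nobody has looked at yet fails loudly at its first call instead of silently misbehaving.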

Sandboxing of W32 Filesystem

The emulated W32 environment running the original W32 filesystem driver is separated from the rest of UNIX OS. It achieves the following goals:

Sandboxing is provided with the following attributes:

Figure: Project Components Architecture

 

This security is almost the same as provided by emulated virtual machines such as VMware.

Figure: Sandboxing Scheme

 

The project can also be used in non-sandboxed mode with the --no-sandbox option, as it is easier to debug without CORBA/ORBit RPC. In this case the DirectorySlave/FileSlave options are used directly instead of their DirectoryParent/FileParent peers.

"patched" vs. "unpatched" Libraries

A library is called patched if we require loading its original binary code file. The project needs to patch it to be able to trap all the function entry points. The only currently patched library of this project is ntoskrnl.exe.

A library is called unpatched if no original binary code is needed, since all of its functions are completely emulated by the native implementations of this project. The typical unpatched representative is hal.dll: it specializes in hardware-dependent code and therefore must be completely replaced by this project running in the GNU/Linux operating system environment. Early versions of this project also had a full unpatched native implementation of ntoskrnl.exe, but this no longer applies.

Memory Management

The original Microsoft Windows NT architecture uses two address space areas: user space and kernel space. User space is mapped in the range 0x00000000 to 0x7FFFFFFF, kernel space is mapped in the range 0x80000000 (KERNEL_BASE in ReactOS sources) to 0xFFFFFFFF. All these virtual memory ranges represent addresses after their MMU (Memory Management Unit) mapping, of course. More discussion can be found in the description by Microsoft.

This project runs in a single virtual address space used both for the UNIX user space process part and for the W32 kernel space. The project therefore defines that the W32 kernel runs in the whole range 0x00000000 to 0xFFFFFFFF, since no special assumptions can be made about the mapping of the UNIX user space process. No W32 user space exists in this project. This approach also makes any special memory moving operations between W32 kernel space and W32 user space memory areas (such as MmSafeCopyToUser()) unnecessary.
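The difference between the two address space layouts can be sketched in a few lines of C. All function names below are hypothetical and only illustrate the text above; in particular, the plain memcpy() is an assumption about what a kernel-to-user copy degenerates to once no separate W32 user space exists, not a quote of the project's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Start of kernel space on original 32-bit Windows NT
 * (KERNEL_BASE in ReactOS sources). */
#define NT_KERNEL_BASE 0x80000000u

/* Original NT layout: only the upper half is kernel space. */
bool nt_is_kernel_address(uint32_t addr)
{
    return addr >= NT_KERNEL_BASE;
}

/* Flat layout described above: the whole 32-bit range is W32 kernel space. */
bool flat_is_kernel_address(uint32_t addr)
{
    (void)addr;  /* every address qualifies */
    return true;
}

/* With no W32 user space, a "copy to user" needs no special handling
 * (assumption: a plain memory copy is enough for this illustration). */
void flat_copy_to_user(void *dest, const void *src, size_t len)
{
    assert(dest != NULL && src != NULL);
    memcpy(dest, src, len);
}
```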

Unicode Strings and Characters

The W32 platform uses a 16-bit wchar_t type while GNU/Linux uses a 32-bit one. This can be a problem when GCC (GNU C Compiler) compiles a combination of native UNIX C sources (assuming 32-bit GCC with 32-bit wchar_t) and ReactOS C sources (assuming a W32 compiler with 16-bit wchar_t) containing literal wide strings (C source file syntax: L"wstring"). Possibilities to solve this issue:

Supported Binary Formats

The native W32 binary format is identified as PE-32 (Portable Executable 32-bit), such files have all the usual extensions such as .sys, .exe, .dll etc. PE-32 loading support was already implemented by ReactOS, its memory mapping specifics just had to be ported to GNU/Linux environment by this project. This loading support does not (yet) cover importing of debug symbols from W32 .PDB (Program DataBase) files in GNU/Linux ABI (Application Binary Interface) compatible way.

This project also supports transparent loading of UNIX .so (Shared Object file) binary format. If you have W32 source files for some W32 library you can try to compile it by GCC to get the shared library with GNU/Linux ABI compatible debug information (GCC option -ggdb3 recommended). Beware of possible compilation problems as Microsoft C code expects exception handling to be supported by the compiler (definitely not the case of the plain C compiler of GCC) — all the exception catching code should be discarded as any generated exceptions are always fatal when such driver is running in the scope of this project. You can use the following script of this project to compile W32 filesystem source files as UNIX .so: src/w32-mod/ext2fsd.so-build.sh

Be aware of some differences between using a PE-32 binary format file and a .so format file. PE-32 files use the appropriate W32-specific cdecl/stdcall/fastcall call types, while a .so must be completely compiled with the standard UNIX cdecl call type semantics. With a .so file, native function implementations do not need to be explicitly exported by captivesym as they are resolved automatically by the UNIX dynamic system linker. It may be surprising that you will have to fix all such missing symbol exports when you advance during development from the debugging .so file to the production version of the original PE-32 binary file.

At Most One Mounted Filesystem

The project technically supports only one (exactly one...) mounted filesystem device and only one filesystem driver. Supporting multiple disks and multiple loaded filesystem modules would not be complicated, but as they would share the address space it would only bring possible complications during bug reports and bug solving. It was considered saner to support multiple W32 mounted disks by completely separate project instances running in different UNIX processes, communicating from their sandboxes via the CORBA sandbox interface. This sandboxing feature is not yet deployed although its code is already prepared.

The project also does not support any state cleanup that would allow loading filesystem A, cleaning up A and loading a different filesystem B in the same process address space. This is consistent with avoiding the possible debugging complications noted above. Despite this you still must call the function captive_shutdown() to flush all the pending filesystem buffers to the disk. After calling captive_shutdown() the process address space is no longer usable for any further project operations and the process is expected to be terminated in a manner compatible with its driving CORBA sandbox interface control master.

Each sandbox executing the untrusted W32 binary filesystem driver code is connected through its CORBA sandbox interface at the point of upper layer libcaptive-specific filesystem API, at the point of the bottom layer of GIOChannel device access and also for transfers of GLib logging messages/warnings/errors out of the sandbox to the user.

Multithreading and Multiple Processors

The W32 platform architecture is thoroughly parallel. It must lock all its objects to maintain coherence in the presence of multithreading and multiple processors. Since the author of this project considers any parallel execution a serious obstacle to debugging, the whole project architecture was designed to prevent any nondeterministic behaviour. This project therefore always emulates a uniprocessor Microsoft Windows NT kernel (the KeNumberProcessors symbol is always 1), everything runs in the single initial thread/process, and all the filesystem operations are performed as synchronous ("synchronous" by the flags FILE_SYNCHRONOUS_IO_ALERT, FO_SYNCHRONOUS_IO, IRP_SYNCHRONOUS_API, IRP_SYNCHRONOUS_PAGING_IO, a forced TRUE result of IoIsOperationSynchronous() etc.). For several cases needed only by ntfs.sys, asynchronous access (STATUS_PENDING return code) had to be supported; parallel execution is emulated there by GLib g_idle_add_full() with g_main_context_iteration() called during KeWaitForSingleObject().

Since real W32 parallel threading may still be needed in the future, all the code that would be affected by W32 multithreading is marked by a TODO:thread comment.
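The emulation of parallel execution inside a single thread can be illustrated without GLib. In the sketch below a small fixed-size queue stands in for the GLib idle sources, and the wait function runs one queued callback per loop turn until the awaited event is signalled, much as g_main_context_iteration() would be called during a wait. Every name here is a hypothetical stand-in; the real project uses the GLib and W32 functions named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_IDLE 16

typedef void (*idle_fn)(void *data);

struct idle_item {
    idle_fn fn;
    void *data;
};

/* Ring buffer of deferred callbacks (stand-in for GLib idle sources). */
static struct idle_item idle_queue[MAX_IDLE];
static size_t idle_head, idle_count;

/* Queue a callback for later execution; false if the queue is full. */
bool idle_add(idle_fn fn, void *data)
{
    assert(fn != NULL);
    if (idle_count == MAX_IDLE)
        return false;
    idle_queue[(idle_head + idle_count) % MAX_IDLE] =
        (struct idle_item){ fn, data };
    idle_count++;
    return true;
}

/* Run one pending callback; false if nothing was pending. */
bool idle_iterate(void)
{
    struct idle_item item;

    if (idle_count == 0)
        return false;
    item = idle_queue[idle_head];
    idle_head = (idle_head + 1) % MAX_IDLE;
    idle_count--;
    item.fn(item.data);
    return true;
}

struct event {
    bool signalled;
};

/* Single-threaded wait: keep running deferred work until the event is
 * signalled. Returns false if no work is left that could signal it. */
bool wait_for_event(struct event *ev)
{
    assert(ev != NULL);
    while (!ev->signalled)
        if (!idle_iterate())
            return false;
    return true;
}

/* Example deferred work item: completes the pending operation. */
void signal_event(void *data)
{
    ((struct event *)data)->signalled = true;
}
```

Because the deferred work only ever runs from inside the wait loop, the order of execution stays deterministic, which is the property the design is after.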

Multiple processors (SMP) support will never need to be implemented since uniprocessor W32 kernels apparently run the filesystem driver modules fine. As this project implements only the uniprocessor W32 kernel all the processor locking functions and structures such as KSPIN_LOCK etc. can be safely implemented as no-operations.
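A lock that never has to wait can be sketched as follows. The type and function names are simplified, hypothetical stand-ins rather than the real W32 signatures, and the held flag is an addition made here purely as a sanity check; a strict no-operation would drop it.

```c
#include <assert.h>

/* Stand-in for a spin lock on an emulated uniprocessor kernel. */
typedef struct {
    int held;  /* bookkeeping for sanity checks only, never waited on */
} sketch_spin_lock;

void sketch_spin_lock_init(sketch_spin_lock *lock)
{
    assert(lock != 0);
    lock->held = 0;
}

/* With one processor and one thread nobody else can hold the lock,
 * so acquiring never spins. */
void sketch_spin_lock_acquire(sketch_spin_lock *lock)
{
    assert(!lock->held);  /* recursive acquire would be a bug */
    lock->held = 1;
}

void sketch_spin_lock_release(sketch_spin_lock *lock)
{
    assert(lock->held);   /* releasing an unheld lock would be a bug */
    lock->held = 0;
}
```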

Asynchronous callbacks registered for IO_WORKITEMs are passed as GLib idle functions by g_idle_add_full(). Although they will probably never be executed during the project's non-interactive batch executions, it is the responsibility of the W32 driver implementation to complete all the pending tasks before its W32 shutdown. Such a W32 shutdown is done during cleanup of the project's execution by captive_shutdown().

Paranoia Checks

A general approach of software projects development is to implement many internal sanity checks during the development stage but to produce the most optimized final release product without those debugging checks.

Facilities for these practices can be seen in the standard C include files for example as function assert() which gets disabled by the NDEBUG symbol used during the final optimized executable compilation. This project uses Gnome GLib messaging subsystem offering sanity checks discarded by symbols G_DISABLE_ASSERT and G_DISABLE_CHECKS. Microsoft also produces two versions of its products – regular customers use the "free build" (also called "retail") while the programmers should develop their code on the "checked build" product releases.

As this project will always run unknown binary code of proprietary W32 filesystem drivers, the code can never be trusted. Such code even runs in the same unprotected address space as its controlling UNIX code. Since there is not enough documentation for the W32 components of the system, and such documentation is usually misleading, the emulation can never be considered 100% accurate. Even in the final releases all the sanity checks implemented in this project should remain active, as all the project's code always interacts with unknown and untrusted W32 binaries.

Microsoft Windows NT code is written in a foolproof style: it accepts even invalid input values, which it usually corrects. This makes long-term debugging a pain as it hides the sources of problems. "Checked build" releases were probably designed to fix this flaw by strict consistency checks, but they did not reach this goal as such checks are usually missing in the code.

This project has strict consistency checks across all the code to make the debugging phase easy enough. A failed sanity check is not always a bug. Sometimes it just means the real W32 binary code is more benevolent than could be expected according to the documentation, and such a sanity check gets removed for the next version build. In other cases a failed sanity check means the execution path for some unexpected combination of arguments was not yet implemented by this project. It may also mean a bug, of course...

Last but not least: never miss a possible sanity check, as its later removal is an order of magnitude cheaper than an uncaught invalid assumption. A failed assertion is not always a bug although it has to be fixed, of course.
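A check that stays active in release builds can be sketched as a macro that does not depend on NDEBUG. The macro below is a plain C stand-in written for this illustration; it resembles the GLib checking macros in spirit, but its name, the failure counter and the example function are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Number of failed checks so far, useful for reporting. */
unsigned failed_checks;

/* Always-on sanity check: report the failed expression and return 'val'
 * from the calling function. Unlike assert() it ignores NDEBUG. */
#define SKETCH_RETURN_VAL_IF_FAIL(expr, val)                      \
    do {                                                          \
        if (!(expr)) {                                            \
            failed_checks++;                                      \
            fprintf(stderr, "%s:%d: check failed: %s\n",          \
                    __FILE__, __LINE__, #expr);                   \
            return (val);                                         \
        }                                                         \
    } while (0)

/* Example: validate the arguments of a hypothetical read request
 * before touching anything else. */
bool sketch_validate_read(const char *path, long offset, size_t length)
{
    SKETCH_RETURN_VAL_IF_FAIL(path != NULL, false);
    SKETCH_RETURN_VAL_IF_FAIL(path[0] == '/', false);
    SKETCH_RETURN_VAL_IF_FAIL(offset >= 0, false);
    SKETCH_RETURN_VAL_IF_FAIL(length > 0, false);
    return true;
}
```

Returning an error instead of aborting is a design choice of this sketch; a stricter variant could terminate the process on the first failed check.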

STATUS_LOG_FILE_FULL

After approx. 1MB of data has been written to an NTFS test partition, the NTFS driver returns the STATUS_LOG_FILE_FULL error code for any further write request. Apparently this is caused by the fact that this project is single-threaded and ignores the spawning of the parallel journalling thread during ntfs.sys initialization.

Fortunately ntfs.sys will clear its journalling log file during filesystem unmount. This project will therefore remount the volume if STATUS_LOG_FILE_FULL is detected, to work around the missing journalling thread.

Similar behaviour can be seen when writing compressed files: the file gets written uncompressed and its compression proceeds only during the final filesystem unmount.

For these reasons it was mandatory to support transparent volume remounting.
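The remount workaround amounts to a retry around the write path. The sketch below models it with a fake driver whose log fills up after 1MB; the status enum, the structure and all function names are hypothetical, and the real remount of course involves restarting the sandboxed filesystem rather than resetting a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified status codes (stand-ins, not real NTSTATUS values). */
enum sketch_status {
    SKETCH_SUCCESS,
    SKETCH_LOG_FILE_FULL
};

#define LOG_CAPACITY (1024u * 1024u)  /* the "approx. 1MB" from the text */

struct sketch_volume {
    size_t log_used;       /* journal space consumed since last mount */
    size_t total_written;  /* bytes successfully written overall */
    unsigned remounts;     /* how many times the volume was remounted */
};

/* Fake driver: refuses writes once the journal log would overflow. */
enum sketch_status driver_write(struct sketch_volume *vol, size_t len)
{
    assert(vol != NULL);
    if (vol->log_used + len > LOG_CAPACITY)
        return SKETCH_LOG_FILE_FULL;
    vol->log_used += len;
    vol->total_written += len;
    return SKETCH_SUCCESS;
}

/* Unmount clears the journal log; mounting again starts with it empty. */
void volume_remount(struct sketch_volume *vol)
{
    assert(vol != NULL);
    vol->log_used = 0;
    vol->remounts++;
}

/* Write with a transparent remount: retry once after a full log. */
enum sketch_status volume_write(struct sketch_volume *vol, size_t len)
{
    enum sketch_status status = driver_write(vol, len);

    if (status == SKETCH_LOG_FILE_FULL) {
        volume_remount(vol);
        status = driver_write(vol, len);
    }
    return status;
}
```

The caller of volume_write() never sees the full-log condition unless a single request is larger than the whole log, which is what makes the remount transparent.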

ParentConnector Volume Remounter

The sandbox master component of this project controls the restarting of its sandbox slaves containing the W32 filesystem. The goal of the ParentConnector component is to transparently provide a persistent view of files and directories while the sandboxed slaves are being restarted.

In the case of read-only operations it would be simple, as we would only need to save the state of the currently opened filesystem objects with their read file/directory offsets. Write operations can be handled like read-only ones as long as all the operations are successful. In the case of a W32 filesystem crash we lose all the past write operations. If we redid all the write operations we could very easily invoke the same crash. Therefore we write:

Filesystem crash broke dirty object: FILE/PATH/NAME

message to syslog and refuse any further operations with this object.

Figure: Parent Connector

 

HANDLE represents a W32 object open in an existing W32 filesystem. A HANDLE is created on demand according to the saved state of the object (such as its pathname). Even the whole VFS sandbox slave is spawned on demand if some object operation requests it.

A W32 filesystem crash can obviously occur at any moment; it generates the GObject signal abort. A successful filesystem unmount (even as part of a remount operation) must first be preceded by the detach signal to close all existing W32 HANDLEs. After they are closed the filesystem gets the unmount request. Only if all the close operations succeeded, including the final filesystem unmount, can the cease signal be activated to notify all the dirty (written) objects that they are now clean. During this cease signal the project will also flush the sandbox commit buffer to its underlying media.

Objects never written remain in clean state and they can be transparently reopened even if W32 filesystem crash occurs.
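The object life cycle described in this section can be summarized as a small state machine driven by the abort, detach and cease signals. The sketch below uses plain C functions in place of GObject signal handlers and stderr in place of syslog; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum object_state {
    OBJECT_CLEAN,   /* never written, or writes committed by cease */
    OBJECT_DIRTY,   /* written, not yet committed */
    OBJECT_BROKEN   /* was dirty when the filesystem crashed */
};

struct tracked_object {
    const char *path;         /* saved state used to reopen the object */
    enum object_state state;
    bool handle_open;         /* whether a W32 HANDLE currently exists */
};

/* A successful write marks the object dirty until the next cease. */
void object_written(struct tracked_object *obj)
{
    assert(obj != NULL && obj->state != OBJECT_BROKEN);
    obj->state = OBJECT_DIRTY;
}

/* abort: the W32 filesystem crashed; dirty objects cannot be recovered. */
void object_abort(struct tracked_object *obj)
{
    assert(obj != NULL);
    obj->handle_open = false;
    if (obj->state == OBJECT_DIRTY) {
        obj->state = OBJECT_BROKEN;
        fprintf(stderr, "Filesystem crash broke dirty object: %s\n",
                obj->path);
    }
}

/* detach: close the W32 HANDLE before the filesystem is unmounted. */
void object_detach(struct tracked_object *obj)
{
    assert(obj != NULL);
    obj->handle_open = false;
}

/* cease: closes and unmount succeeded, so written data is committed. */
void object_cease(struct tracked_object *obj)
{
    assert(obj != NULL && !obj->handle_open);
    if (obj->state == OBJECT_DIRTY)
        obj->state = OBJECT_CLEAN;
}

/* Broken objects refuse any further operations. */
bool object_usable(const struct tracked_object *obj)
{
    assert(obj != NULL);
    return obj->state != OBJECT_BROKEN;
}
```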

 

 
