Compared with the traditional shared-memory programming model, message passing models for chip multiprocessors CMPs have distinct advantages due to the relative ease of validation and the fact that they are more portable. This paper proposes a design of integrating a message-passing engine into each router of the network-on-chip as well as the programming-friendly message passing interface for these engines. Combined with the DMA mechanism, the proposed design applies the on-chip RAM as intermediary message buffer, and frees the CPU core from message-passing operations to a large extent. The detailed design and implementation, including the register-transfer-level RTL descriptions of the engine, are presented. Evaluations show that: compared with the software-based solution, it can decrease the message passing latency by one or two orders of magnitude. Co-simulation also demonstrates that the proposed designs effectively boost the performance of point-to-point communications on-chip, while the consumptions of power and chip-area are both limited.