Multi-echelon inventory models aim to minimize the system-wide total cost of a multi-stage supply chain by applying a suitable ordering policy at each stage. When backorder costs are incurred at more than one stage, no optimal policy is known, even for a serial system. We apply several deep reinforcement learning algorithms (e.g., DQN, A2C, TD3) and find that their results are comparable to or better than those of the best-known heuristics. We also propose a mechanism that reduces training time by incorporating known heuristics into the exploration process of the deep reinforcement learning algorithms.
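
One way such heuristic-guided exploration could look for a discrete-action agent is sketched below. This is an illustrative assumption, not the mechanism proposed here: the function names, the mixing probability `beta`, and the choice of a base-stock (order-up-to) rule as the known heuristic are all hypothetical. During an exploration step, the agent follows the heuristic's order quantity with probability `beta` and otherwise samples a uniformly random action; outside exploration it acts greedily on its Q-values.

```python
import random
from typing import Sequence


def base_stock_order(inventory_position: int, base_stock_level: int, max_order: int) -> int:
    """Hypothetical heuristic: order up to the base-stock level, capped at max_order."""
    return min(max(base_stock_level - inventory_position, 0), max_order)


def guided_epsilon_greedy(
    q_values: Sequence[float],
    heuristic_action: int,
    epsilon: float,
    beta: float,
    rng: random.Random,
) -> int:
    """Epsilon-greedy selection whose exploration steps are biased toward a heuristic.

    epsilon: probability of exploring at all (assumed parameter).
    beta:    probability that an exploration step copies the heuristic action
             instead of drawing a uniformly random one (assumed parameter).
    """
    if rng.random() < epsilon:
        if rng.random() < beta:
            return heuristic_action          # explore along the known heuristic
        return rng.randrange(len(q_values))  # plain uniform exploration
    # Exploit: pick the action with the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.1, 0.9, 0.3, 0.0, 0.2]  # one Q-value per order quantity 0..4
    hint = base_stock_order(inventory_position=2, base_stock_level=5, max_order=4)
    action = guided_epsilon_greedy(q, hint, epsilon=0.3, beta=0.5, rng=rng)
    print("heuristic order:", hint, "chosen order:", action)
```

Setting `beta = 0` recovers ordinary epsilon-greedy exploration, so the heuristic's influence can be annealed away as training progresses. For actor-critic or continuous-action methods, a different injection point would be needed, which this sketch does not cover.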