The context of the paper is the widening gap between CPU and DRAM speed. The central argument is that caches won't help, because compulsory misses are inevitable, so eventually run-time will be dominated by compulsory misses, even with a perfect cache (no capacity or conflict misses). They then assume a fixed miss probability (which is the main fallacy of the paper) and draw their conclusions from that.
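To make the model being criticized concrete, here is a minimal sketch of it in Python (the cache access times and miss penalty are my own illustrative assumptions; the 0.2% miss rate is the paper's figure discussed below):

    # The paper's model: average memory access time with a fixed miss rate p
    #   t_avg = (1 - p) * t_cache + p * t_dram
    # As t_cache shrinks with faster CPUs while t_dram stays put,
    # t_avg converges to p * t_dram, and that term dominates run-time.

    def t_avg(p_miss, t_cache_ns, t_dram_ns):
        return (1 - p_miss) * t_cache_ns + p_miss * t_dram_ns

    p = 0.002                                # the paper's 0.2% miss rate, held fixed
    t_dram = 100.0                           # miss penalty in ns (illustrative assumption)
    for t_cache in (10.0, 1.0, 0.1, 0.01):   # ever-faster caches/CPUs
        print(t_cache, t_avg(p, t_cache, t_dram))
    # converges to p * t_dram = 0.2 ns per access, however fast the cache gets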
The first flaw in this argument is the claim that compulsory misses are inevitable: We can reduce the number of compulsory misses by, e.g., making the cache lines longer, and indeed this is happening (line sizes: 486: 16B, P5/P6/K6: 32B, K7: 64B, Pentium 4 L2: 128B).
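A back-of-the-envelope illustration of why, assuming one compulsory miss per cache line of data touched (the 64MB footprint is an arbitrary illustrative number):

    # Compulsory misses ~ memory footprint / line size: going from the
    # 486's 16B lines to the Pentium 4 L2's 128B lines cuts them by 8x.
    footprint = 64 * 2**20                   # 64MB of data touched (assumption)
    for line_size in (16, 32, 64, 128):
        print(line_size, footprint // line_size)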
The main fallacy of the paper is the assumption of a fixed miss rate for compulsory misses (of at least 0.2%). Cache miss rates in general are very dependent on the application, and this holds even more for compulsory misses.
In particular, even if we take a single-process view of caching, we see that the number of compulsory misses is determined by the amount of memory accessed and correlates hardly at all with the number of dynamically executed instructions, so assuming a fixed miss rate is absurd. We can easily compute the time spent in compulsory misses by dividing the memory used by the actual memory bandwidth (typically within a factor of 2 of the peak bandwidth). On current high-performance machines this time is limited to a few seconds if the process stays in physical RAM (and once you get into paging, compulsory cache misses are the least of your worries). Memory bandwidth tends to grow with memory size, so this won't change much (the fact that DRAM arrays get more lines over time may eventually limit bandwidth growth to the square root of memory growth, but that is quite far in the future for DRAM).
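As a minimal sketch of that computation (the resident-set size and sustained bandwidth below are illustrative assumptions, not measurements):

    # Every byte of the process's memory crosses the memory bus at most
    # once for compulsory misses, so:
    #   t_compulsory <= memory_used / sustained_memory_bandwidth
    memory_used = 2 * 2**30                  # 2GB resident set (assumption)
    bandwidth   = 1 * 2**30                  # 1GB/s sustained bandwidth (assumption)
    print(memory_used / bandwidth)           # -> 2.0 seconds over the whole run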
Switching to a whole-system view of caching, the concept of compulsory misses becomes meaningless, at least in conjunction with a physically-tagged perfect cache. Compulsory cache misses occur once, at system startup time, if at all (several architectures can clear architectural memory without touching DRAM; subblock validate bits are another alternative to deal with writes). New processes just reuse cache lines for physical memory that is already in the cache, and don't suffer any compulsory misses.
This view ignores DMA-based I/O and multiprocessing, but these can be ignored for a discussion about the memory wall (it is not about the I/O wall; and with an infinitely fast CPU, who needs multiprocessing?).
The techniques I mentioned above (long cache lines, physically-tagged caches, etc.) are not new, and certainly not reactions to the paper. They were already known in 1994 (when the paper was written).
The paper implies that every application will run into the memory wall eventually, and I think that this is wrong. Long-running applications with nice cache characteristics will spend little time waiting for DRAM and will scale well with CPU speed improvements (e.g., our LaTeX benchmark).
The more general idea of a widening gap between memory and CPU speeds is certainly correct. As a consequence, some programs will not scale with CPU speed improvements (memory-bound programs); some other programs may become memory-bound as CPUs get faster; one could say that these programs hit the memory wall.
However, the cache miss rate depends on the cache design (size, line size, prefetching, associativity etc.), and that is being improved, too, so some previously memory-bound programs may become CPU-bound; moreover, many compiler, library, and application programmers are aware of the issue and change the software to cause fewer cache misses (a textbook example is cache blocking in numerical software). So, it is by no means clear that more programs will hit the memory wall in the future.
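For illustration, a minimal sketch of cache blocking for matrix multiplication (plain Python for readability; the block size is an assumption that would be tuned to the actual cache size):

    # Blocked (tiled) matrix multiply: work on block x block tiles that fit
    # in the cache, so each loaded line is reused many times per miss
    # instead of being evicted after a single use.
    # A, B, C are n x n matrices (lists of lists); C must be zero-initialized.
    def matmul_blocked(A, B, C, n, block=64):
        for ii in range(0, n, block):
            for kk in range(0, n, block):
                for jj in range(0, n, block):
                    for i in range(ii, min(ii + block, n)):
                        for k in range(kk, min(kk + block, n)):
                            a = A[i][k]
                            for j in range(jj, min(jj + block, n)):
                                C[i][j] += a * B[k][j]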